TCP router stops working after upstream server crash?

Hi! :wave:

I need some assistance with an issue I am having with Traefik (v2.2.0). I use Traefik as an ingress controller in an AWS EKS cluster.
I have several IngressRoutes and IngressRouteTCPs defined through the "Kubernetes IngressRoute CRD" provider, and in front of my cluster I have an AWS Network Load Balancer that balances traffic across my Traefik pods.

One of my ingresses is a TCP route that I use to expose a gRPC API. The TCP route simply matches against the HostSNI of the request, terminates TLS and forwards the request to a Kubernetes service in front of my gRPC server.
My gRPC clients all keep a long-lived gRPC stream open at all times (listening for messages from the server), and this has been working fine so far.

Over to the problem I'm having.

Last week I experienced a problem with my gRPC server that caused it to crash. It actually crashed a couple of times, and Kubernetes took a little while to bring it back. After Kubernetes brought the pod back, I noticed that a lot of my gRPC clients were no longer connected to the server (no active stream).
In fact, there was no trace of the clients even attempting to connect to my gRPC server. I manually restarted one of the clients, and it was then able to reach my gRPC server without any problem.
I then restarted my Traefik pods, and after that the remaining gRPC clients were able to reach the gRPC server.

From what I can gather it seems that the requests were, for some reason, no longer being proxied by Traefik because the target server crashed? It does not look like the connection was terminated by Traefik either?
Is this something you have experienced before? Or do you have any pointers on how to figure out what's going on?
I have added some snippets of (what I believe to be) relevant configurations below.

Any input on this would be greatly appreciated. I'm a bit at a loss right now.
Thanks in advance for any tips/pointers/help! :slight_smile: :pray:

// Traefik (kubernetes) service definition (relevant parts):
apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: <namespace>
  annotations: nlb
spec:
  externalTrafficPolicy: Local
  type: LoadBalancer
  selector:
    app: traefik
  ports:
    - protocol: TCP
      name: websecure
      port: 443
      targetPort: 443

// Traefik deployment definition (relevant parts):
kind: Deployment
apiVersion: apps/v1
metadata:
  name: traefik
  namespace: ingress
  labels:
    app: traefik
spec:
  selector:
    matchLabels:
      app: traefik
  template:
    metadata:
      labels:
        app: traefik
    spec:
      serviceAccountName: traefik-ingress-controller
      containers:
        - name: traefik
          image: traefik:v2.2.0
          args:
            - --log.level=info
            - --api.dashboard
            - --accesslog
            - --metrics.prometheus
            - --entryPoints.websecure.address=:443
            - --providers.kubernetescrd
          ports:
            - name: websecure
              containerPort: 443
            - name: admin
              containerPort: 8080
          volumeMounts:
            - name: traefik-config
              mountPath: /traefik
            - name: certs
              mountPath: /certs
      volumes:
        - name: traefik-config
          configMap:
            name: traefik-file-provider
        - name: certs
          secret:
            secretName: certs

// IngressRoute for the clients having issues
kind: IngressRouteTCP
metadata:
  name: <service-name>
  namespace: <namespace>
spec:
  entryPoints:
    - websecure
  tls: {}
  routes:
  - match: HostSNI(`<my-host-name>`)
    kind: Rule
    services:
    - name: <my-grpc-service>
      port: 4000

Hello @fh203

Thanks a lot for providing the detailed report.

It looks like the TCP router is not updating the endpoints of the TCP service. Traefik resolves the endpoint addresses only once, when the router is created, so I think that would explain the behavior you have experienced.

It has already been fixed in v2.3: "Improve service name lookup on TCP routers" by ddtmachado (Pull Request #7370, traefik/traefik on GitHub).

The current release is v2.4.6, and I recommend that you consider upgrading Traefik.

According to the specification, gRPC uses HTTP/2 as its transport, so you should use an HTTP router.
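
To illustrate the HTTP-router approach, here is a minimal sketch of an IngressRoute that forwards to a gRPC backend over cleartext HTTP/2 (h2c). The resource name, namespace and host name are placeholders that I made up, and I am assuming the backend serves gRPC without TLS on port 4000; the `scheme` value would need adjusting if that is not the case.

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: grpc-http              # placeholder name
  namespace: default           # placeholder namespace
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`grpc.example.com`)   # placeholder host name
      kind: Rule
      services:
        - name: my-grpc-service         # placeholder for the gRPC Service
          port: 4000
          scheme: h2c                   # assumes cleartext HTTP/2 to the backend
  tls: {}                               # TLS is terminated by Traefik
```

With a route like this Traefik acts as an HTTP/2 proxy, so HTTP/2 PING frames are handled per hop rather than passed end to end.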

Additionally, it is good practice to keep TCP routers on dedicated entry points instead of mixing HTTP and TCP routers on the same one. Please also note that TCP routers always take precedence over HTTP routers on a shared entry point.
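
A dedicated entry point could look roughly like the sketch below. The entry point name `grpctcp`, port 8443 and the resource names are assumptions chosen for illustration, not values from your setup.

```yaml
# Fragment of the Traefik container spec: one entry point per protocol
args:
  - --entryPoints.websecure.address=:443   # HTTP routers only
  - --entryPoints.grpctcp.address=:8443    # hypothetical entry point for TCP routers
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRouteTCP
metadata:
  name: grpc-tcp                 # placeholder name
  namespace: default             # placeholder namespace
spec:
  entryPoints:
    - grpctcp                    # bound only to the dedicated entry point
  routes:
    - match: HostSNI(`grpc.example.com`)   # placeholder host name
      services:
        - name: my-grpc-service            # placeholder for the gRPC Service
          port: 4000
  tls: {}                        # TLS is still terminated by Traefik
```

The extra port would also have to be exposed on the Traefik container, on the Kubernetes Service and on the load balancer in front of it.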

Again, thanks a lot for reporting that and being a Traefik user :wink:

Thank you so much for your swift reply, @jakubhajek!

I have tried the most recent version of Traefik (v2.4.7) in my test environment and I am still able to reproduce my problem. I have been running Traefik with log level DEBUG but I'm not able to spot anything obviously wrong. Do you (or anyone) have anything else I could try?

About using a TCP router for my gRPC traffic: I know that gRPC uses HTTP/2 under the hood and that an HTTP router would, technically, be more correct, but that does not seem to work for my use case. All my clients and servers rely on the "ping/pong" semantics built into gRPC operating between the two ends of the "connection" (gRPC client and gRPC server). In other words, any proxy in between has to either be transparent or, somehow, respect this requirement. Unfortunately, Traefik (and every other proxy with gRPC support that I have tried) intercepts the gRPC pings and responds on behalf of the client/server. I guess this is to be expected from a proxy (they are HTTP/2 pings, after all), but as I said, it does not really work for my situation. To work around this, I use the TCP router to make Traefik proxy the traffic transparently instead.

You are saying that I should not have a mix of HTTP and TCP routers on the same entry point. Do you mean that I should not have TCP router(s) and HTTP router(s) on the same port? E.g., I can't have TCP routers and HTTP routers both on :443? Are there technical drawbacks to this, or is it fine as long as I am aware that TCP takes precedence?

Again, thank you so much for any and all input! :slight_smile: