Hi!
I need some assistance with an issue I am having with Traefik (v2.2.0). I use Traefik as an ingress controller in an AWS EKS cluster.
I have several IngressRoutes and IngressRouteTCPs defined through the "Kubernetes IngressRoute CRD" provider, and in front of my cluster I have an AWS Network Load Balancer that balances traffic across my Traefik pods.
One of my ingresses is a TCP route that I use to expose a gRPC API. The TCP route simply matches on the HostSNI of the request, terminates TLS, and forwards the request to a Kubernetes Service in front of my gRPC server.
My gRPC clients all keep a gRPC stream open at all times (listening for messages from the server); these streams stay open for a very long time, and this has been working fine so far.
Now, on to the problem I'm having.
Last week I experienced a problem with my gRPC server that caused it to crash. It actually crashed a couple of times, and Kubernetes took a little while to bring it back. After Kubernetes brought the pod back, I noticed that a lot of my gRPC clients were no longer connected to the server (no active stream).
In fact, there was no trace of the clients even attempting to connect to my gRPC server. I manually restarted one of the clients, and it was then able to reach my gRPC server without any problem.
I then restarted my Traefik pods, after which the remaining gRPC clients were able to reach the gRPC server again.
From what I can gather, it seems that the clients' connections were, for some reason, no longer being proxied by Traefik after the target server crashed. It does not look like Traefik terminated the client-side connections either, so the clients apparently kept idle connections open without noticing anything was wrong.
Is this something you have experienced before? Or do you have any pointers on how to figure out what's going on?
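For what it's worth, this is roughly how I can check from outside the cluster whether Traefik still accepts a TLS connection for the SNI (a sketch in Go; the NLB endpoint and host name are placeholders):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

// probeSNI opens a TLS connection to addr, presenting serverName as the SNI,
// which is what Traefik's HostSNI(...) rule matches against.
func probeSNI(addr, serverName string) error {
	dialer := &net.Dialer{Timeout: 5 * time.Second}
	conn, err := tls.DialWithDialer(dialer, "tcp", addr, &tls.Config{
		ServerName: serverName,
	})
	if err != nil {
		return err
	}
	defer conn.Close()
	fmt.Println("TLS handshake ok for", serverName)
	return nil
}

func main() {
	// Placeholders: replace with the NLB endpoint and the HostSNI host name.
	if err := probeSNI("<nlb-endpoint>:443", "<my-host-name>"); err != nil {
		fmt.Println("probe failed:", err)
	}
}
```

A successful handshake only shows that the route matches and TLS terminates; it doesn't prove the backend is reachable, so it mainly helps narrow down where the chain breaks.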
I have added some snippets of (what I believe to be) relevant configurations below.
Any input on this would be greatly appreciated. I'm a bit at a loss right now.
Thanks in advance for any tips/pointers/help!
// Traefik (kubernetes) service definition (relevant parts):
---
apiVersion: v1
kind: Service
metadata:
  name: traefik
  namespace: <namespace>
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  externalTrafficPolicy: Local
  type: LoadBalancer
  selector:
    app: traefik
  ports:
    - protocol: TCP
      name: websecure
      port: 443
      targetPort: 443
// Traefik deployment definition (relevant parts):
---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: traefik
  namespace: ingress
  labels:
    app: traefik
spec:
  selector:
    matchLabels:
      app: traefik
  template:
    metadata:
      labels:
        app: traefik
    spec:
      serviceAccountName: traefik-ingress-controller
      containers:
        - name: traefik
          image: traefik:v2.2.0
          args:
            - --log.level=info
            - --api.dashboard
            - --accesslog
            - --metrics.prometheus
            - --entryPoints.websecure.address=:443
            - --providers.file.directory=/traefik
            - --providers.file.watch=true
            - --providers.kubernetescrd
          ports:
            - name: websecure
              containerPort: 443
            - name: admin
              containerPort: 8080
          volumeMounts:
            - name: traefik-config
              mountPath: /traefik
            - name: certs
              mountPath: /certs
      volumes:
        - name: traefik-config
          configMap:
            name: traefik-file-provider
        - name: certs
          secret:
            secretName: certs
// IngressRouteTCP for the clients having issues
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRouteTCP
metadata:
  name: <service-name>
  namespace: <namespace>
spec:
  entryPoints:
    - websecure
  tls: {}
  routes:
    - match: HostSNI(`<my-host-name>`)
      kind: Rule
      services:
        - name: <my-grpc-service>
          port: 4000