Hi all,
I've deployed Traefik v2.1.2 on an AKS Kubernetes cluster, along with our application, and have run JMeter performance benchmarks. Those benchmarks highlighted a bottleneck, and I have been hunting down where the performance drop occurs. After some research, it looks like Traefik is that bottleneck.
Setup
Kubernetes cluster on AKS, 2 nodes, each with 8 vCPU and 64 GB of memory. Traefik sits behind a LoadBalancer Service, DNS is configured to point at the LB's public IP address, and all services can be reached. Traefik is configured with the following args:
- "--entrypoints.websecure.forwardedheaders.insecure=true
- "--api.dashboard=true"
- "--api.insecure=true"
- "--ping=true"
- "--providers.kubernetescrd"
- "--log.level=Error"
- "--accesslog=false"
with 4 pods serving Traefik.
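For context, here is a trimmed sketch of what the Traefik Deployment looks like. The names, labels, and the websecure entrypoint address are assumptions on my part (only the forwardedheaders flag for that entrypoint is listed above); the image tag matches the version I'm running:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: traefik
spec:
  replicas: 4                    # the 4 Traefik pods mentioned above
  selector:
    matchLabels:
      app: traefik
  template:
    metadata:
      labels:
        app: traefik
    spec:
      serviceAccountName: traefik
      containers:
        - name: traefik
          image: traefik:v2.1.2
          args:
            # assumed: the websecure entrypoint has to be defined somewhere,
            # even though only its forwardedheaders flag appears above
            - "--entrypoints.websecure.address=:443"
            - "--entrypoints.websecure.forwardedheaders.insecure=true"
            - "--api.dashboard=true"
            - "--api.insecure=true"
            - "--ping=true"
            - "--providers.kubernetescrd"
            - "--log.level=Error"
            - "--accesslog=false"
          ports:
            - name: websecure
              containerPort: 443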
For this app in particular, we have client -> LoadBalancer -> Traefik -> app (2 replicas)
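Since the kubernetescrd provider is enabled, the app is exposed through an IngressRoute along these lines; the hostname, namespace, service name, and port below are placeholders, not our real values:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: app
  namespace: default
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`app.example.com`)   # placeholder hostname
      kind: Rule
      services:
        - name: app                    # the app Service (2 replicas behind it)
          port: 8080                   # placeholder port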
Testing performed
All testing was performed using JMeter.
- Testing the app without Traefik results in approx. 800 msg/s, i.e. going directly to the Azure LoadBalancer service.
- Testing via Traefik results in approx. 100 msg/s, i.e. Traefik forwards to the load-balanced Azure service.
Parameters changed
I've tried tweaking replica counts up and down for everything, assigning and removing resource requests/limits on the Traefik deployment, and reducing logging in Traefik, but none of those has had a big impact; performance stays around 100 msg/s.
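For reference, this is the kind of resources block I tried adding to (and removing from) the Traefik container spec; the values are just examples of what I tested, not a recommendation:

          resources:
            requests:
              cpu: "500m"      # example request I tried
              memory: 256Mi
            limits:
              cpu: "2"         # example limit I tried
              memory: 1Gi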
Any ideas?
I can't think of anything else to try. Traefik is introducing roughly an 8x drop in throughput (800 msg/s down to 100 msg/s), so something must be incorrectly configured.
Let me know if anything else is needed to help diagnose this!