We are running Traefik 2.5.4 (chart version 10.6.2) as the ingress on our Kubernetes cluster and define our routes with the IngressRoute CRD. We noticed that our docker pushes to a registry (Harbor) behind Traefik were really slow, so we ran a few tests with plain web traffic in different scenarios to isolate the problem. For the tests we used whoami/bench as the server and wrk as the client.
Our traefik configuration has the following overrides:
    deployment:
      initContainers:
        - command:
            - sh
            - -c
            - chmod -Rv 600 /data/*
          image: busybox:1.31.1
          name: volume-permissions
          volumeMounts:
            - mountPath: /data
              name: data
    env:
      - name: CF_DNS_API_TOKEN
        valueFrom:
          secretKeyRef:
            key: token
            name: cloudflare-apitoken
    globalArguments:
      - --certificatesresolvers.le.acme.caserver=https://acme-v02.api.letsencrypt.org/directory
      - --certificatesresolvers.le.acme.dnschallenge=true
      - --certificatesresolvers.le.acme.dnschallenge.provider=cloudflare
      - --certificatesresolvers.le.acme.storage=/data/acme.json
      - --certificatesresolvers.le.acme.dnschallenge.resolvers=18.104.22.168:53,22.214.171.124:53
    ingressRoute:
      dashboard:
        enabled: false
    logs:
      general:
        level: INFO
    persistence:
      enabled: true
      storageClass: longhorn
    ports:
      web-int:
        expose: false
        port: 9080
        protocol: TCP
      websecure-int:
        expose: false
        port: 9443
        protocol: TCP
        tls:
          certResolver: ""
          domains:
          enabled: false
          options: ""
    providers:
      kubernetesCRD:
        allowCrossNamespace: true
    service:
      externalIPs:
        - X.X.X.X
Nothing big except adding Let's Encrypt, although the following tests were performed without TLS to rule that out. We also tested the network between the agents that host the pods, and it was not a bottleneck. For all the tests we made sure that the server and client pods ran on different nodes.
The three resources involved in the tests are:
The wrk pod, which is the client running the tests:
    apiVersion: v1
    kind: Pod
    metadata:
      name: potts
      namespace: test
    spec:
      containers:
        - name: web
          image: skandyla/wrk
          command:
            - sleep
            - "400000000"
The whoami/bench pod, which handles the incoming requests:
    apiVersion: v1
    kind: Pod
    metadata:
      name: whoami
      namespace: test
      labels:
        role: whoami
    spec:
      containers:
        - name: whoami
          image: containous/whoami
          ports:
            - name: web
              containerPort: 80
              protocol: TCP
A service which exposes the whoami pod:
    apiVersion: v1
    kind: Service
    metadata:
      name: my-whoami
      namespace: test
    spec:
      selector:
        role: whoami
      ports:
        - protocol: TCP
          port: 80
          targetPort: 80
To get a baseline we let the two pods talk directly to each other, using neither a service nor Traefik but their cluster-internal pod IPs. This is as fast as it could theoretically get:
    wrk -t20 -c1000 -d60s -H "Host: doesnt.matter" --latency http://(internal pod IP of whoami):80/bench

    Running 1m test @ http://10.12.162.41:80/bench
      20 threads and 1000 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     5.36ms    6.20ms 237.66ms   98.87%
        Req/Sec     9.71k     1.53k   28.32k    82.03%
      Latency Distribution
         50%    4.89ms
         75%    5.50ms
         90%    6.22ms
         99%   12.08ms
      8865524 requests in 1.00m, 1.04GB read
      Socket errors: connect 0, read 2869, write 1715713, timeout 0
    Requests/sec: 147525.20
    Transfer/sec:     17.73MB
We ran the test a few times and got values of around 16-17 MB/s every time.
This is more of an additional test. Here we used the cluster-internal DNS, addressing the whoami pod by its service name. It is the same service that the IngressRoute uses in the Traefik test.
    wrk -t20 -c1000 -d60s -H "Host: doesnt.matter" --latency http://(whoami-service):80/bench

    Running 1m test @ http://my-whoami.test:80/bench
      20 threads and 1000 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency     5.69ms    7.58ms 872.81ms   99.12%
        Req/Sec     9.29k     1.47k   47.03k    88.16%
      Latency Distribution
         50%    5.19ms
         75%    5.71ms
         90%    6.49ms
         99%   12.43ms
      6960258 requests in 1.00m, 836.37MB read
      Socket errors: connect 0, read 3643, write 177908, timeout 0
    Requests/sec: 115845.62
    Transfer/sec:     13.92MB
There was a decline in performance, with values of around 13 MB/s across multiple runs.
We added an IngressRoute that points to that service and used the IP of the control plane so that the traffic would be routed through Traefik.
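For reference, an IngressRoute for this setup would look roughly like the sketch below. The resource name and the `web` entry point (the chart's default HTTP entry point) are assumptions for illustration, not copied from our cluster; the host and service name are taken from the wrk runs in this post.

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: whoami            # hypothetical resource name
  namespace: test
spec:
  entryPoints:
    - web                 # assumption: default HTTP entry point, no TLS
  routes:
    - match: Host(`whoami.test`)   # Host header sent by wrk in the Traefik test
      kind: Rule
      services:
        - name: my-whoami          # service name as it appears in the wrk output
          port: 80
```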
    wrk -t20 -c1000 -d60s -H "Host: whoami.test" --latency http://(IP of control plane):80/bench

    Running 1m test @ http://whoami.test:80/bench
      20 threads and 1000 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency    72.13ms   46.64ms 269.60ms   63.64%
        Req/Sec   709.05     84.82     1.59k    70.17%
      Latency Distribution
         50%   72.93ms
         75%  104.90ms
         90%  136.15ms
         99%  175.82ms
      847338 requests in 1.00m, 82.42MB read
    Requests/sec:  14104.06
    Transfer/sec:      1.37MB
We did expect a drop in performance, but throughput falls by roughly 10x when the traffic goes through Traefik.
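To put a number on that drop, here is a quick calculation from the Requests/sec figures of the three wrk runs above. It is plain Python with nothing cluster-specific; the dictionary labels and the `slowdown` helper are names made up for this sketch.

```python
# Requests/sec copied from the three wrk runs in this post.
results = {
    "pod IP (direct)": 147525.20,
    "service DNS": 115845.62,
    "via Traefik": 14104.06,
}


def slowdown(baseline: float, measured: float) -> float:
    """Return how many times slower `measured` is compared to `baseline`."""
    return baseline / measured


traefik_rps = results["via Traefik"]
for name, rps in results.items():
    if name != "via Traefik":
        print(f"{name} vs Traefik: {slowdown(rps, traefik_rps):.2f}x")
```

By request rate the drop is a bit over 10x against the direct pod path and about 8x against the service path.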
We are not sure at the moment how to tackle this. Is this expected, did we misconfigure something, or do we need additional configuration to make this path fast? Any ideas are welcome.