Hey,
we are running Traefik v2.5.4 (chart version 10.6.2) on our Kubernetes cluster as the ingress controller. We use custom CRDs to define IngressRoutes. We noticed that our docker pushes to a registry (Harbor) behind Traefik were really slow, so we ran a few tests of pure web traffic in different scenarios to isolate the problem. For the tests we used whoami (its /bench endpoint) as the server and wrk as the client.
## Problem description and setup
Our Traefik configuration has the following overrides:
```yaml
deployment:
  initContainers:
    - command:
        - sh
        - -c
        - chmod -Rv 600 /data/*
      image: busybox:1.31.1
      name: volume-permissions
      volumeMounts:
        - mountPath: /data
          name: data
env:
  - name: CF_DNS_API_TOKEN
    valueFrom:
      secretKeyRef:
        key: token
        name: cloudflare-apitoken
globalArguments:
  - --certificatesresolvers.le.acme.caserver=https://acme-v02.api.letsencrypt.org/directory
  - --certificatesresolvers.le.acme.dnschallenge=true
  - --certificatesresolvers.le.acme.dnschallenge.provider=cloudflare
  - --certificatesresolvers.le.acme.storage=/data/acme.json
  - --certificatesresolvers.le.acme.dnschallenge.resolvers=1.1.1.1:53,8.8.8.8:53
ingressRoute:
  dashboard:
    enabled: false
logs:
  general:
    level: INFO
persistence:
  enabled: true
  storageClass: longhorn
ports:
  web-int:
    expose: false
    port: 9080
    protocol: TCP
  websecure-int:
    expose: false
    port: 9443
    protocol: TCP
    tls:
      certResolver: ""
      domains: []
      enabled: false
      options: ""
providers:
  kubernetesCRD:
    allowCrossNamespace: true
service:
  externalIPs:
    - X.X.X.X
```
Nothing big except adding Let's Encrypt, although the following tests were performed without TLS to rule that out. We also tested the network between the agents that host the pods, and it was not a bottleneck. For all the tests we made sure that the server and client pods ran on different nodes.
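A raw inter-node throughput check of this kind can be done, for example, with iperf3 between two throwaway pods pinned to different nodes. A sketch only; the image, pod, and node names are placeholders, not our actual setup:

```sh
# iperf3 server pinned to one node (node names are placeholders)
kubectl -n test run iperf-server --image=networkstatic/iperf3 --restart=Never \
  --overrides='{"spec":{"nodeName":"node-a"}}' -- -s

# iperf3 client pinned to another node, pointed at the server pod's IP
kubectl -n test run iperf-client --image=networkstatic/iperf3 --restart=Never \
  --overrides='{"spec":{"nodeName":"node-b"}}' -- -c <server-pod-IP> -t 30
```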
The three pods involved in the tests are:
The wrk pod, which is the client running the tests:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: potts
  namespace: test
spec:
  containers:
    - name: web
      image: skandyla/wrk
      command:
        - sleep
        - "400000000"
```
The whoami server pod, which handles the incoming requests:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: whoami
  namespace: test
  labels:
    role: whoami
spec:
  containers:
    - name: whoami
      image: containous/whoami
      ports:
        - name: web
          containerPort: 80
          protocol: TCP
```
A Service which exposes the whoami server:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-whoami
  namespace: test
spec:
  selector:
    role: whoami
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
```
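As a quick sanity check that the Service actually selects the whoami pod, the endpoints can be listed (illustrative output shown as comments):

```sh
kubectl -n test get endpoints my-whoami
# NAME        ENDPOINTS         AGE
# my-whoami   10.12.162.41:80   ...
```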
## Tests
### Baseline
To have a baseline we let the two pods talk directly to each other, not going through a Service or Traefik but using the internal pod IP. This is as fast as it could theoretically get:
```sh
wrk -t20 -c1000 -d60s -H "Host: doesnt.matter" --latency http://(internal pod IP of whoami):80/bench
```
```
Running 1m test @ http://10.12.162.41:80/bench
  20 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.36ms    6.20ms  237.66ms   98.87%
    Req/Sec     9.71k     1.53k    28.32k    82.03%
  Latency Distribution
     50%    4.89ms
     75%    5.50ms
     90%    6.22ms
     99%   12.08ms
  8865524 requests in 1.00m, 1.04GB read
  Socket errors: connect 0, read 2869, write 1715713, timeout 0
Requests/sec: 147525.20
Transfer/sec:     17.73MB
```
We ran the test a few times and consistently got around 16-17 MB/s.
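For reference, the internal pod IP used in the URL above can be looked up with:

```sh
kubectl -n test get pod whoami -o wide   # the IP column holds the pod IP
```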
### Using a cluster service
This is more of an additional test. Here we addressed the server over the cluster-internal DNS name of the Service, the same Service that will be used by the IngressRoute in the Traefik test.
```sh
wrk -t20 -c1000 -d60s -H "Host: doesnt.matter" --latency http://(whoami-service):80/bench
```
```
Running 1m test @ http://my-whoami.test:80/bench
  20 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.69ms    7.58ms  872.81ms   99.12%
    Req/Sec     9.29k     1.47k    47.03k    88.16%
  Latency Distribution
     50%    5.19ms
     75%    5.71ms
     90%    6.49ms
     99%   12.43ms
  6960258 requests in 1.00m, 836.37MB read
  Socket errors: connect 0, read 3643, write 177908, timeout 0
Requests/sec: 115845.62
Transfer/sec:     13.92MB
```
There was a decline in performance, with values around 13 MB/s in multiple tests.
### Using Traefik
We added an IngressRoute that points to that Service and sent the requests to the IP of the control plane so the traffic would be routed through Traefik.
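A minimal sketch of such an IngressRoute (the resource name and entry point here are assumptions, chosen to fit the Host header used in the test below):

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: whoami            # assumed name
  namespace: test
spec:
  entryPoints:
    - web                 # plain HTTP entry point, matching the non-TLS test
  routes:
    - match: Host(`whoami.test`)
      kind: Rule
      services:
        - name: my-whoami
          port: 80
```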
```sh
wrk -t20 -c1000 -d60s -H "Host: whoami.test" --latency http://(IP of control plane):80/bench
```
```
Running 1m test @ http://whoami.test:80/bench
  20 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    72.13ms   46.64ms  269.60ms   63.64%
    Req/Sec    709.05     84.82     1.59k    70.17%
  Latency Distribution
     50%   72.93ms
     75%  104.90ms
     90%  136.15ms
     99%  175.82ms
  847338 requests in 1.00m, 82.42MB read
Requests/sec:  14104.06
Transfer/sec:      1.37MB
```
We did expect a drop in performance, but throughput drops roughly 10x when the traffic goes through Traefik.
## Conclusion
So we are not sure at the moment how to tackle this. Is this expected, did we misconfigure something, or would we need additional configuration to make this flow fast? Any ideas are welcome.
Cheers