Huge bandwidth/download performance issue

Hi,

We're using Traefik to host about a hundred websites (and three to four times that across dev, staging, etc.) and it's been great. We just noticed a performance issue that we had attributed to cheap hosting, but upon further investigation it seems to lie with Traefik, or at least with the way we use it.

So, the context:

We use either Docker or Docker Swarm with Traefik 1.7 or 2.0 (same results in all cases), and when downloading a large file from a container behind Traefik, the download is abnormally slow compared to hitting the container directly.

We've tested different kinds of servers, but basically we have:

[traefik] => [nginx - static file]

If we hit Traefik (tested on the local machine, curl to localhost) we get about 1000 MB/s, and when accessing the container directly we get twice that. The gap worsens even more if we enable gzip on the nginx container: we still get about the same performance when fetching the file directly, but it crawls to 30 MB/s through Traefik.

The test machine has 4c/8t, 16 GB of RAM, an SSD, and basically only me on it.

I can understand a slight hit, but not 2-20x less.

Besides Swarm, the configuration is pretty vanilla: simple host rules and a few containers running.

Any clue?

I tested this locally as well, and found that Traefik actually outperformed a port-mapped connection. The results:

➜  ~ curl -O http://localhost/test.img # Traefik
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50.0G  100 50.0G    0     0  1417M      0  0:00:36  0:00:36 --:--:-- 1561M
➜  ~ curl -O http://localhost:8080/test.img # Port-mapped to nginx 8080:80
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50.0G  100 50.0G    0     0  1317M      0  0:00:38  0:00:38 --:--:-- 1398M
# swarm.yaml
version: '3.7'

networks:
  traefik:
    external: true

services:
  proxy:
    image: traefik:v2.1
    command:
      - '--providers.docker=true'
      - '--entryPoints.web.address=:80'
      - '--providers.providersThrottleDuration=2s'
      - '--providers.docker.watch=true'
      - '--providers.docker.swarmMode=true'
      - '--providers.docker.swarmModeRefreshSeconds=15s'
      - '--providers.docker.exposedbydefault=false'
      - '--accessLog.bufferingSize=0'
      - '--ping.entryPoint=web'
    volumes:
      - '/var/run/docker.sock:/var/run/docker.sock:ro'
    ports:
      - '80:80'
    deploy:
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
        window: 120s
      update_config:
        delay: 10s
        order: start-first
        parallelism: 1
      rollback_config:
        parallelism: 0
        order: stop-first
    logging:
      driver: json-file
      options:
        'max-size': '10m'
        'max-file': '5'
    networks:
      - traefik
  nginx:
    image: nginx
    deploy:
      labels:
        - traefik.enable=true
        - traefik.http.services.nginx.loadbalancer.server.port=80
        - traefik.http.routers.nginx.rule=Host(`localhost`)
        - traefik.http.routers.nginx.service=nginx
        - traefik.http.routers.nginx.entrypoints=web
        - traefik.docker.network=traefik
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
        window: 120s
      update_config:
        delay: 10s
        order: start-first
        parallelism: 1
      rollback_config:
        parallelism: 0
        order: stop-first
    logging:
      driver: json-file
      options:
        'max-size': '10m'
        'max-file': '5'
    networks:
      - traefik
    ports:
    - 8080:80
    volumes:
    - /home/kcrawley/test-img:/usr/share/nginx/html

In order to understand your problem better, would you please share your configuration, Docker version, and how to reproduce the results you're experiencing?

Here is a host network download:

➜  ~ docker run -d --network host -v /home/kcrawley/test-img:/usr/share/nginx/html -d nginx
4890ade699db7005fc5b66bade5e5e8b8ebe6123cee07b72bfd2c7f1967f23c0
➜  ~ curl -O http://localhost/test.img
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50.0G  100 50.0G    0     0  1588M      0  0:00:32  0:00:32 --:--:-- 1542M

My setup is very similar. Let me try to simplify and reproduce.


Docker version: 19.03.11 using overlay2 on Kubuntu 20.04 LTS
Stack file:

version: '3.7'

services:
  traefik:
    image: traefik:2.0
    ports:
      - 80:80
      - 443:443
      - 9090:9090
      - 3000:3000
      - 3001:3001
      - 3002:3002
      - 4000:4000
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./traefik:/etc/traefik
      - ./certs:/etc/certs
    command:
      - "--api=true"
      - "--api.dashboard=true"
      - "--api.insecure=true"
      - "--entrypoints.http.address=:80"
      - "--entrypoints.https.address=:443"
      - "--entrypoints.traefik.address=:9090"
      - "--entrypoints.gulp.address=:3000"
      - "--entrypoints.gulp-ui.address=:3001"
      - "--entrypoints.gulp-weinre.address=:3002"
      - "--entrypoints.vue.address=:4000"
      - "--log=true"
      - "--log.filepath=/etc/traefik/logs/traefik.log"
      - "--log.level=DEBUG"
      - "--providers.docker=true"
      - "--providers.docker.network=test_traefik"
      - "--providers.docker.endpoint=unix:///var/run/docker.sock"
      - "--providers.docker.exposedByDefault=false"
      - "--providers.docker.swarmMode=true"
    networks:
      - traefik
    deploy:
      placement:
        constraints:
          - node.role == manager
      labels:
        - "traefik.enable=false"
  site1:
    image: redacted:staging
    ports:
      - 8888:80
    environment:
      - EPIC_ENV=staging
    networks:
      - traefik
    volumes:
      - ./tmp:/var/www
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.covid.rule=Host(`covid.docker`)"
        - "traefik.http.routers.covid.entrypoints=http"
        - "traefik.http.services.covid.loadbalancer.server.port=80"
  site2:
    image: redacted:staging
    ports:
      - 8889:80
    environment:
      - EPIC_ENV=staging
    networks:
      - traefik
    volumes:
      - ./tmp:/var/www
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.covid2.rule=Host(`covid2.docker`)"
        - "traefik.http.routers.covid2.entrypoints=http"
        - "traefik.http.services.covid2.loadbalancer.server.port=80"
networks:
  traefik:
    attachable: true

nginx.conf:

worker_processes 2;
daemon off;
user www-data www-data;

events {
  worker_connections  1024;
}

http {
  include       mime.types;
  default_type  application/octet-stream;

  sendfile        on;

  keepalive_timeout 2;
  client_max_body_size 500m;

  include conf.d/*.conf;
}

site.conf:

server {
  listen 80;

  access_log stdout;
  error_log stderr;

  gzip on;
  gzip_proxied any;
  gzip_types
        image/svg+xml
        text/css
        text/javascript
        text/xml
        text/plain
        application/javascript
        application/x-javascript
        application/json
        application/octet-stream;

  root /var/www/;
  index index.html index.php;
  client_max_body_size 200M;
  add_header  X-Robots-Tag "noindex, nofollow, nosnippet, noarchive";

  error_page 404 /index.html;

  location /robots.txt {
    return 200 "User-agent: *\nDisallow: /\n";
  }

  location / {
    autoindex on;
    try_files $uri $uri/ /index.php?$args;
  }

  location ~ \.php {
    fastcgi_pass  unix:/var/run/php7.0-fpm.sock;
    fastcgi_index index.php;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_param APPLICATION_ENV dev;
    fastcgi_param X-Forwarded-Proto $http_x_forwarded_proto;
    include fastcgi_params;
    fastcgi_read_timeout 3000;
    fastcgi_intercept_errors on;
  }
}

host -> container

curl localhost:8888/test.img > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2000M  100 2000M    0     0  2493M      0 --:--:-- --:--:-- --:--:-- 2490M

host -> traefik -> container

curl covid.docker/test.img > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  144M    0  144M    0     0  32.4M      0 --:--:--  0:00:04 --:--:-- 32.4M^C

container1 -> container2

curl site2/test.img > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2000M  100 2000M    0     0  2444M      0 --:--:-- --:--:-- --:--:-- 2442M

host -> traefik -> container with gzip disabled

curl covid.docker/test.img > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2000M  100 2000M    0     0  1219M      0  0:00:01  0:00:01 --:--:-- 1218M

So with gzip disabled it's better, but definitely not good. On production machines the difference between gzip and no gzip is much smaller, and neither comes close to direct host-to-container speeds.

Hi,

I'm experiencing what I believe is the same type of error as you. I haven't done proper testing as you have above, but I can tell you that I'm currently limited to 3 Mbit/s download/upload speeds when connecting to my swarm virtual IP (keepalived) or when making requests externally that are load-balanced by Traefik. When I download the same file with my URL pointing at the current swarm host, I get the expected speeds.

I'll do more testing this weekend to see if I can produce any readable results.

To be sure, I've tried with the latest nginx image, straight from Docker Hub. I've just mapped the tmp folder so the test file is available. Same results: about half the bandwidth when going through Traefik.

I'm having a similar problem. I have a 1000/50 connection at home and a self-hosted speedtest on my server.

If I do it via port mapping I get my 1000 down.
Through Traefik with TLS I get around 300.

I've observed that enabling the gzip module while using a proxy causes the nginx CPU to peg at 100%, but that doesn't appear to happen when connecting directly. This is a strange anomaly, and may be worth further investigation.

I've observed the speed discrepancies while in swarm mode as seen below (this is with the gzip module off):

# swarm mode - traefik proxy (:80) - gzip module disabled
➜  ~ curl --limit-rate 2G -o /dev/null http://localhost/test.img
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50.0G  100 50.0G    0     0   940M      0  0:00:54  0:00:54 --:--:-- 1301M
# swarm mode - nginx port map (:8080) - gzip module disabled
➜  ~ curl --limit-rate 2G -o /dev/null http://localhost:8080/test.img
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50.0G  100 50.0G    0     0  2040M      0  0:00:25  0:00:25 --:--:-- 2139M

With the gzip module enabled, I've observed the nginx worker using a lot of CPU:

systemd+  9468 87.9  0.0  11412  3108 ?        S    12:38   1:46 nginx: worker process

and constrained bandwidth compared to the non-proxy connection. This is peculiar because these requests don't have gzip enabled (that would require passing Accept-Encoding headers, which I'll do in a moment):

# swarm mode - traefik proxy (:80) - gzip module enabled
➜  ~ curl --limit-rate 2G -o /dev/null http://localhost/test.img 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50.0G    0 50.0G    0     0   498M      0 --:--:--  0:01:42 --:--:--  499M
# swarm mode - nginx port map (:8080) - gzip module enabled
➜  ~ curl --limit-rate 2G -o /dev/null http://localhost:8080/test.img
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50.0G  100 50.0G    0     0  2022M      0  0:00:25  0:00:25 --:--:-- 2108M

Let's take a look at the speeds when accepting gzip encoding:

# swarm mode - traefik proxy (:80) - gzip module enabled + encoding enabled
➜  test-img curl -H 'Accept-encoding: gzip' --limit-rate 2G -o /dev/null http://localhost/test-5g.img
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22.3M    0 22.3M    0     0  2246k      0 --:--:--  0:00:10 --:--:-- 2226k
# swarm mode - nginx port map (:8080) - gzip module enabled + encoding enabled
➜  test-img curl -H 'Accept-encoding: gzip' --limit-rate 2G -o /dev/null http://localhost:8080/test-5g.img
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22.3M    0 22.3M    0     0  2308k      0 --:--:--  0:00:09 --:--:-- 2309k

As you can see here, both downloads are seriously constrained while the nginx server is actually compressing the content. As to the reasons behind the performance degradation when the gzip module is enabled, it'd be interesting to see whether the same thing occurs with other proxy servers, such as HAProxy.

One thing of note, and why I believe this might be an issue directly related to how Swarm handles networking: this issue doesn't occur when both services are running in host mode. There I observe the opposite, where direct downloads from nginx are slower.

# host mode - traefik proxy - gzip module disabled
➜  test-00a curl --limit-rate 2G -o /dev/null http://localhost/test.img
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50.0G  100 50.0G    0     0  1415M      0  0:00:36  0:00:36 --:--:-- 1423M
# host mode - nginx direct - gzip module disabled
➜  test-00a curl --limit-rate 2G -o /dev/null http://localhost:8080/test.img
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50.0G  100 50.0G    0     0  1093M      0  0:00:46  0:00:46 --:--:-- 1235M

I am using a file provider to connect directly to the service exposed via network_mode: host; that configuration is included below along with the other configuration files.

swarm.yaml:

version: '3.7'

networks:
  traefik:
    external: true

services:
  proxy:
    image: traefik:latest
    command:
      - '--providers.docker=true'
      - '--entryPoints.web.address=:80'
      - '--providers.providersThrottleDuration=2s'
      - '--providers.docker.watch=true'
      - '--providers.docker.swarmMode=true'
      - '--providers.docker.swarmModeRefreshSeconds=15s'
      - '--providers.docker.exposedbydefault=false'
      #- '--providers.file.filename=/etc/traefik/rules.toml'
      - '--ping.entryPoint=web'

    ports:
    - 80:80
    networks:
      - traefik
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/rules.toml:/etc/traefik/rules.toml
  nginx:
    image: nginx
    networks:
      - traefik
    ports:
      - 8080:8080    
    deploy:
      labels:
        - traefik.enable=true
        - traefik.http.services.nginx.loadbalancer.server.port=8080
        - traefik.http.routers.nginx.rule=Host(`localhost`)
        - traefik.http.routers.nginx.service=nginx
        - traefik.http.routers.nginx.entrypoints=web
        - traefik.docker.network=traefik
    volumes:
    - ./nginx/nginx.conf:/etc/nginx/nginx.conf
    - ./test-img:/usr/share/nginx/html

rules.toml:

[http.routers]
# Define a connection between requests and services
    [http.routers.speedtest]
        rule = "Host(`localhost`)"
        entrypoints = ["web"]
        service = "speedtest"

[http.services]
# Define how to reach an existing service on our infrastructure
    [http.services.speedtest.loadBalancer]
        [[http.services.speedtest.loadBalancer.servers]]
            url = "http://127.0.0.1:8080"

docker-compose.yaml

version: '3.7'

services:
  proxy:
    image: traefik:latest
    command:
      - '--entryPoints.web.address=:80'
      - '--providers.providersThrottleDuration=2s'
      - '--providers.file.filename=/etc/traefik/rules.toml'
      - '--ping.entryPoint=web'
    network_mode: host
    volumes:
      - ./traefik/rules.toml:/etc/traefik/rules.toml
  nginx:
    image: nginx
    network_mode: host
    volumes:
    - ./nginx/nginx.conf:/etc/nginx/nginx.conf
    - ./test-img:/usr/share/nginx/html

nginx.conf:


user  nginx;
worker_processes  1;

error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;


events {
    worker_connections  1024;
}


http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile        on;
    #tcp_nopush     on;

    keepalive_timeout  65;

    #gzip  on;
    #gzip_types application/octet-stream;

    server {
        listen       8080;
        listen  [::]:8080;
        server_name  localhost;

        location / {
            root   /usr/share/nginx/html;
            index  index.html index.htm;
        }

        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   /usr/share/nginx/html;
        }
    }
}

I don't have an explanation for these observations, but I've passed them on to the developers to see if there is any further action we can take to better explain what is happening here.

I've noticed this behavior with gzip and the CPU too. Today I tried with Caddy 2 for good measure and still got half the bandwidth when going through Traefik (no gzip enabled).

Please note that while I posted in the Traefik v2 category, this is the same with 1.7 and a vanilla Docker setup without swarm mode enabled.

docker-compose will still use overlay networking, which is essentially the same network layer that's available in Swarm; I believe swarm mode just adds a control layer for configuring and mapping across hosts. Performance benchmarks on overlay are not "overly" great. This is an older report, but it suggests one might expect half the performance when running in overlay network mode: https://www.percona.com/blog/2016/08/03/testing-docker-multi-host-network-performance/

The documentation even recommends using host mode networking for maximum performance. https://docs.docker.com/network/host/

I get that and I just did some more testing to be sure. With the same setup as before, I've added a location /prox with a proxy_pass to the second nginx instance (so passing through overlay).

  • host -> traefik -> nginx1: 1Gbps
  • host -> nginx1: 2.5Gbps
  • nginx1 -> nginx2: 2.5Gbps
  • host -> nginx1 -> nginx2: 2.5Gbps
  • host -> nginx2: 2.5Gbps

gzip stays disabled, as the previously mentioned high CPU issue still applies. But I think that one is more on the nginx side.

Would you mind sharing the configuration here? Are you using the VIP assigned to the nginx2 instance for routing?

_kc

I took some time to script this, and something funny is going on. I can't reproduce the host/nginx1/nginx2 results; it's actually worse than behind Traefik.

Here's the full test suite: https://transfer.epic.net/lrEuo/test.tgz; you just need a tmp folder with a large test.img file.

What I gather from this is that overlay networking might be an issue, that gzip between nginx and a proxy definitely is, and that I don't know if there's a way to overcome that. I can't really use host networking, as we are hosting 150+ sites and I don't want to pick ports manually and keep track of them.

Consider using the compression middleware in Traefik rather than compressing on the nginx side. This will change things considerably in terms of load management, so caveats apply.
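For reference, here is a minimal sketch of what enabling Traefik's compress middleware via Swarm labels might look like. The service, router, and middleware names (`mysite`, `mysite-compress`) and the hostname are hypothetical placeholders:

```yaml
# Hypothetical labels on the backend service in a stack file.
# Declares a compress middleware and attaches it to the router,
# so Traefik compresses responses instead of nginx.
deploy:
  labels:
    - traefik.enable=true
    - traefik.http.routers.mysite.rule=Host(`mysite.example`)
    - traefik.http.routers.mysite.entrypoints=web
    - traefik.http.middlewares.mysite-compress.compress=true
    - traefik.http.routers.mysite.middlewares=mysite-compress
    - traefik.http.services.mysite.loadbalancer.server.port=80
```

With something like this in place, gzip can be left off on the nginx side entirely.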

I understand, completely. One suggestion might be to expose only those specific services that demand high throughput at the host layer, and leverage some other discovery provider for them. There's no easy solution here; it may be worth investigating other network drivers for Swarm, but YMMV there as well.
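For the host-layer option, note that Compose file format 3.2+ supports a long-form port syntax that publishes a port in host mode, which bypasses Swarm's ingress routing mesh for that service (a sketch; the service is illustrative):

```yaml
# Publishes the port directly on each node running a task,
# bypassing the ingress routing mesh (mode: host).
version: '3.7'
services:
  nginx:
    image: nginx
    ports:
      - target: 80        # container port
        published: 8080   # port on the node itself
        protocol: tcp
        mode: host        # skip the routing mesh / overlay ingress
```

You'd still have to pick host ports per service, but only for the handful of high-throughput ones.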

For gzip, it'll indeed be easier to enable it on the Traefik side. Our setup being what it is, the load would be pretty much the same, so that should work.

Regarding overlay performance, does this issue also exist in a k8s environment? I didn't want to invest too much time in Kubernetes, as Swarm was easy enough, but given Docker's recent moves, the fact that k8s has pretty much won as far as container orchestration goes, and this last issue with the overlay network, I'm beginning to reconsider.

The primary difference in how pod-to-pod communication occurs in K8s is that it's (by default) achieved through IPVS rather than netfilter/iptables (this wasn't the case until 1.9). In any case, I don't know if there are significant performance advantages with IPVS; it was adopted because it scales effectively past several thousand hosts.

The advantage of using Kubernetes, though, is that you have the option of adopting commercial network drivers that have the full weight of the cloud-native ecosystem driving their development.

This scratches the surface, but if you're looking for maximum throughput and performance, you're probably going to have to go with something other than what comes out-of-the-box.

_kc

Did anyone find a solution for that?

I can confirm it concerns Kubernetes as well. I'm running the k3s distribution with the Traefik ingress, and I'm also seeing very poor performance (HTTPS terminated on Traefik, towards a pod with nginx/php):

  • Transfer directly from pod service (NodePort): ~560 Mbps
  • Through Traefik ingress: ~30 Mbps

As a side note, I'm also experiencing issues with running WebDAV through Traefik. I really wanted to like Traefik, but these problems keep popping up :frowning:

I'm running Traefik on an Odroid HC4 with nginx-based WebDAV and have no problems. I'm getting around 600 Mbit/s (75 MByte/s), which is pretty good for a 1 Gbit connection, with SSL, on a single-board computer.

I'm using Traefik with MetalLB. Maybe it's actually an issue with your service LB? This is my Traefik Helm config:

persistence:
  enabled: true
  
ports:
  web:
    redirectTo: websecure
  websecure:
    tls:
      enabled: true
      certResolver: le
      domains:
        - main: "*.snip.test"
          sans: ["snip.test"]

logs:
  general:
    level: WARN

deployment:
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9000"

resources:
  requests:
    cpu: 100m
    memory: 50Mi
  limits:
    memory: 150Mi

service:
  spec:
    externalTrafficPolicy: Local
    loadBalancerIP: 10.1.2.3

env:
  - name: DO_AUTH_TOKEN
    valueFrom:
      secretKeyRef:
        name: traefik
        key: DO_AUTH_TOKEN

additionalArguments:
  - "--metrics.prometheus=true"
  - "--serversTransport.insecureSkipVerify=true"
  - "--certificatesresolvers.le.acme.email=xxx"
  - "--certificatesresolvers.le.acme.storage=/data/acme.json"
  - "--certificatesresolvers.le.acme.dnsChallenge.provider=digitalocean"
  - "--certificatesresolvers.le.acme.dnsChallenge.resolvers=8.8.8.8:53"

Here is my webdav ingress route:

---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: webdav
  labels:
    app: webdav
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`webdav.snip.test`)
      kind: Rule
      services:
        - name: webdav
          port: 8080

Traefik pod config:

--global.checknewversion
--global.sendanonymoususage
--entryPoints.traefik.address=:9000/tcp
--entryPoints.web.address=:8000/tcp
--entryPoints.websecure.address=:8443/tcp
--api.dashboard=true
--ping=true
--providers.kubernetescrd
--providers.kubernetesingress
--entrypoints.web.http.redirections.entryPoint.to=:443
--entrypoints.web.http.redirections.entryPoint.scheme=https
--entrypoints.websecure.http.tls=true
--entrypoints.websecure.http.tls.certResolver=le
--entrypoints.websecure.http.tls.domains[0].main=*.snip.test
--entrypoints.websecure.http.tls.domains[0].sans=snip.test
--log.level=WARN
--metrics.prometheus=true
--serversTransport.insecureSkipVerify=true
--certificatesresolvers.le.acme.email=x
--certificatesresolvers.le.acme.storage=/data/acme.json
--certificatesresolvers.le.acme.dnsChallenge.provider=digitalocean
--certificatesresolvers.le.acme.dnsChallenge.resolvers=8.8.8.8:53

Have any of these issues been resolved? How much CPU is Traefik using when performing at full speed versus when throughput is cut in half?