High load and timeouts

I'm using Traefik 2.9.9 (I also tried lower versions) as a reverse proxy on a high-load production server.
It acts like a CDN, forwarding traffic to an Apache server behind it.

We're experiencing many timeouts in production, with lots of requests coming from all over the world.
As a test, I also made direct requests to Traefik (not forwarded to the internal Apache), but the result is the same: 5% to 30% of requests time out.

I'm using wrk and autocannon to run HTTP stress tests.
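For reference, these are the kinds of commands I run (hostname and parameters here are just placeholders, not my real values):

# wrk: 4 threads, 1000 open connections, 30 seconds
wrk -t4 -c1000 -d30s http://myaddress.com/
# autocannon: 1000 connections, 30 seconds
autocannon -c 1000 -d 30 http://myaddress.com/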

I would like to understand whether there are default limits in Traefik and how to tune them.
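For context, these are the kinds of settings I suspect might matter; I haven't set any of them yet, and the values below are only guesses (entryPoint respondingTimeouts and the global serversTransport options from the docs):

entryPoints:
  web:
    address: ":80"
    transport:
      respondingTimeouts:
        readTimeout: 60s    # max time to read the whole client request
        writeTimeout: 60s   # max time to write the whole response
        idleTimeout: 180s   # keep-alive idle timeout on client connections

serversTransport:
  maxIdleConnsPerHost: 200  # idle keep-alive connections kept per backend
  forwardingTimeouts:
    dialTimeout: 30s        # max time to establish a connection to the backend
    idleConnTimeout: 90s    # idle backend connections are closed after this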

This is my docker-compose.yml:

networks:
  web:
    external: true

services:
  traefik:
    container_name: traefik
    image: traefik:v2.9.9
    restart: always
    ports:
      - "80:80"
      - "443:443"
    networks:
      - web
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik/traefik.yml:/etc/traefik/traefik.yml:ro
      - ./traefik/acme.json:/acme.json
      - ./traefik/config:/config
      - ./traefik/logs:/logs

and my traefik.yml:

api:
  dashboard: true

providers:
  docker:
    exposedByDefault: false
  file:
    directory: /config
    watch: true

entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"

certificatesResolvers:
  gec:
    acme:
      email: letsencrypt@mydomain.it
      storage: /acme.json
      httpChallenge:
        entryPoint: web

accessLog:
  filePath: /logs/access.log
  bufferingSize: 100

Thanks in advance

What’s the config for your service(s)?

I tried stressing just Traefik, without calling Apache. Same result.

Is it possible to view a production configuration?

Does anyone know the default connection limit of Traefik?

What did you try to stress test? Were you getting a 404?

Here is a simple Traefik example.

There's no difference between getting a 404 or a 200; the problem is maybe at the TCP socket level.

By the way, running the same test directly against the Apache behind it gives the same rate of timeouts.

After an HTTP stress test (1000 connections for 30 seconds), if I try to make another call to the same endpoint the server looks stuck. After waiting 20 or 30 seconds it works again.
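If the TCP socket theory is right, the stuck period should show up as a pile of connections in TIME-WAIT or SYN-RECV on the host; something like this should reveal it (a sketch, assuming a Linux host with iproute2's ss available):

# summary of socket counts by state
ss -s
# rough count of TCP sockets in TIME-WAIT (output includes one header line)
ss -tan state time-wait | wc -l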

I cannot understand this situation...

Can you suggest a useful stress test?

The example you gave me is configured like mine; there's nothing significantly different from my setup.

I tried your config on my production server, skipping HTTPS and using plain HTTP as a test.
The first time I ran autocannon -c 1000 -d 30 "http://myaddress.com/" I got this:

Running 30s test @ http://myaddress.com/
1000 connections


┌─────────┬────────┬────────┬────────┬─────────┬───────────┬───────────┬─────────┐
│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%     │ Avg       │ Stdev     │ Max     │
├─────────┼────────┼────────┼────────┼─────────┼───────────┼───────────┼─────────┤
│ Latency │ 305 ms │ 409 ms │ 930 ms │ 1814 ms │ 465.17 ms │ 234.87 ms │ 4610 ms │
└─────────┴────────┴────────┴────────┴─────────┴───────────┴───────────┴─────────┘
┌───────────┬────────┬────────┬────────┬─────────┬─────────┬────────┬────────┐
│ Stat      │ 1%     │ 2.5%   │ 50%    │ 97.5%   │ Avg     │ Stdev  │ Min    │
├───────────┼────────┼────────┼────────┼─────────┼─────────┼────────┼────────┤
│ Req/Sec   │ 474    │ 474    │ 2071   │ 2297    │ 1943.07 │ 409.64 │ 474    │
├───────────┼────────┼────────┼────────┼─────────┼─────────┼────────┼────────┤
│ Bytes/Sec │ 222 kB │ 222 kB │ 972 kB │ 1.08 MB │ 911 kB  │ 192 kB │ 222 kB │
└───────────┴────────┴────────┴────────┴─────────┴─────────┴────────┴────────┘

Req/Bytes counts sampled once per second.
# of samples: 30

60k requests in 30.18s, 27.3 MB read
264 errors (264 timeouts)

That looks like < 1% timeouts. Acceptable.

If I try again:

Running 30s test @ http://myaddress.com/
1000 connections


┌─────────┬────────┬────────┬────────┬────────┬───────────┬───────────┬─────────┐
│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg       │ Stdev     │ Max     │
├─────────┼────────┼────────┼────────┼────────┼───────────┼───────────┼─────────┤
│ Latency │ 136 ms │ 137 ms │ 172 ms │ 187 ms │ 155.09 ms │ 239.32 ms │ 5226 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴───────────┴─────────┘
┌───────────┬─────┬──────┬─────────┬─────────┬─────────┬─────────┬───────┐
│ Stat      │ 1%  │ 2.5% │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min   │
├───────────┼─────┼──────┼─────────┼─────────┼─────────┼─────────┼───────┤
│ Req/Sec   │ 0   │ 0    │ 28      │ 52      │ 27.94   │ 15.27   │ 1     │
├───────────┼─────┼──────┼─────────┼─────────┼─────────┼─────────┼───────┤
│ Bytes/Sec │ 0 B │ 0 B  │ 13.1 kB │ 24.4 kB │ 13.1 kB │ 7.16 kB │ 469 B │
└───────────┴─────┴──────┴─────────┴─────────┴─────────┴─────────┴───────┘

Req/Bytes counts sampled once per second.
# of samples: 30

5k requests in 30.25s, 393 kB read
3k errors (3k timeouts)

Over 50% timeouts! The behavior is the same as with my own config.

I get the same results on my Mac as well.

I just ran some tests with autocannon; it took some time to prepare. The nodes are all small VMs, each with 2 Intel vCPUs, 4 GB RAM, running basic Debian Linux.

for run in {1..5}; do autocannon -c 1000 -d 30 http://whoami.example.com; done
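For reference, the whoami test service is wired up with the usual Docker labels, roughly like this (a sketch; domain, router name and network are placeholders, not my exact setup):

services:
  whoami:
    image: traefik/whoami
    networks:
      - web
    labels:
      - traefik.enable=true
      - traefik.http.routers.whoami.rule=Host(`whoami.example.com`)
      - traefik.http.routers.whoami.entrypoints=web
      - traefik.http.services.whoami.loadbalancer.server.port=80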

This is aimed at a single Docker node with Traefik and whoami:

26k requests in 30.28s, 11.7 MB read
24k requests in 30.36s, 10.8 MB read
23k requests in 30.42s, 10.4 MB read
27k requests in 30.5s, 12.2 MB read
21k requests in 30.5s, 9.34 MB read

Next try with debug and access log disabled (they were logging to the screen before :laughing:):

110k requests in 30.39s, 51.3 MB read
39k requests in 30.37s, 18 MB read
108k requests in 30.37s, 50.5 MB read
51k requests in 30.33s, 23.8 MB read
81k requests in 30.42s, 37.6 MB read

I agree it's strange to see this happening, but I did not get a single error.
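For reference, "debug and access log" in the first run means roughly these Traefik flags, which I removed for the second run:

--log.level=DEBUG
--accesslog=true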

This is aimed at a single Docker Swarm node that has Traefik and whoami on the same server, using overlay network:

53k requests in 30.4s, 24.5 MB read
31k requests in 30.35s, 13.9 MB read
59k requests in 30.35s, 27.4 MB read
34k requests in 30.28s, 15.4 MB read
59k requests in 30.37s, 27.4 MB read

This is aimed at a managed loadbalancer in front of 2 Docker Swarm nodes with Traefik and whoami, using overlay network:

143k requests in 30.3s, 66.3 MB read
142k requests in 30.33s, 66 MB read
143k requests in 30.43s, 66.3 MB read
144k requests in 30.31s, 67 MB read
143k requests in 30.32s, 66.4 MB read

It's really strange that despite an extra hop (load balancer) and another potential hop (Traefik to whoami on the same or a different node), this gives more than double the requests. No errors. This also tells me the variability is probably not related to autocannon/Node.js/garbage collection.

I cleaned my test bed and tried again, using the same config every time. It turns out the access log is the issue.

With this high amount of traffic, just writing the individual access log lines has a dramatic impact, especially when running on cloud servers that may attach storage over the network.

I tried various buffer sizes, and every single one was far from perfect (the equivalent for a file-based config is sketched after the list):

--accesslog.bufferingsize=1000
--accesslog.bufferingsize=10000
--accesslog.bufferingsize=100000
--accesslog.bufferingsize=1000000
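
In a file-based static config like the one at the top of the thread, the equivalent would be roughly this (the value is just an example, not a recommendation):

accessLog:
  filePath: /logs/access.log
  bufferingSize: 10000   # number of log lines kept in memory before being written

Disabling the access log entirely just means removing the accessLog block (or the --accesslog flag).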

Even writing to a RAM disk showed high variability (between 41k and 111k requests per 30 seconds), despite the file being relatively small at about 50 MB. With a really large buffer I started to see errors.

It's interesting that with wait=100ms (whoami's artificial response delay) I consistently get more requests and lower variability than with wait=10ms.

for run in {1..5}; do autocannon -c 1000 -d 30 http://whoami.example.com/?wait=100ms; done