About 1% of requests return 502

I'm running Traefik behind Cloudflare on a small production server, and about 1% of the requests hitting the machine return 502, as if the service were unavailable. I can't find the root cause of the problem.
For scale: roughly 160 requests returned 502 against about 13k normal responses.

Here's the output of docker-compose logs --tail 200 -f | grep 502 for the Traefik container:

traefik_1  | 172.71.123.178 - - [10/Feb/2024:17:59:59 +0000] "POST /protection/ping HTTP/2.0" 502 11 "-" "-" 3788041 "server-api@docker" "http://172.20.0.9:8000" 0ms
traefik_1  | 162.158.222.19 - - [10/Feb/2024:18:00:03 +0000] "POST /core/ping HTTP/2.0" 502 11 "-" "-" 3788181 "server-api@docker" "http://172.20.0.13:8000" 0ms
traefik_1  | 172.71.122.251 - - [10/Feb/2024:18:00:04 +0000] "POST /core/login HTTP/2.0" 502 11 "-" "-" 3788219 "server-api@docker" "http://172.20.0.9:8000" 2ms
traefik_1  | 172.71.250.96 - - [10/Feb/2024:18:00:05 +0000] "POST /core/ping HTTP/2.0" 502 11 "-" "-" 3788245 "server-api@docker" "http://172.20.0.10:8000" 4ms
traefik_1  | 162.158.111.62 - - [10/Feb/2024:18:00:09 +0000] "POST /core/encrypt HTTP/2.0" 502 11 "-" "-" 3788417 "server-api@docker" "http://172.20.0.9:8000" 1ms
traefik_1  | 162.158.183.35 - - [10/Feb/2024:18:00:12 +0000] "GET /launcher/ping HTTP/2.0" 200 2 "-" "-" 3788502 "server-api@docker" "http://172.20.0.10:8000" 109ms

(the last request is a 200 OK; it only matched the grep because its request counter, 3788502, happens to contain 502)

Keep in mind that hundreds of other requests reach the service in the container and complete normally. Here's the service definition:

  api:
    <<: *container-server
    image: server:${APP__VERSION:-master}
    command:
      - uwsgi
      - --ini=uwsgi.ini
    expose:
      - 8000
    environment:
      PYTHONIOENCODING: "UTF-8"
      LANG: "C.UTF-8"
      LC_ALL: "C.UTF-8"
      VIRTUAL_HOST: $VIRTUAL_HOST
    labels:
      - traefik.enable=true
      - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api.tls=true
      - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api.tls.certresolver=letsencrypt
      - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api.rule=Host(`$VIRTUAL_HOST`)
      - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api.entrypoints=https
      - traefik.http.services.${COMPOSE_PROJECT_NAME}-api.loadbalancer.server.port=8000
      - traefik.http.services.${COMPOSE_PROJECT_NAME}-api.loadbalancer.server.scheme=http
      - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api-http.rule=Host(`$VIRTUAL_HOST`)
      - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api-http.entrypoints=http
      - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api-http.middlewares=default-https-redirect@file
    healthcheck:
      test: curl -fsSL http://localhost:8000/ht/?format=json
      interval: 15s
      retries: 5
      start_period: 5s
      timeout: 10s

uwsgi settings:

[uwsgi]
module = app.wsgi
http = :8000
strict = true
master = true
enable-threads = true
vacuum = true                        ; Delete sockets during shutdown
single-interpreter = true
die-on-term = true                   ; Shutdown when receiving SIGTERM (default is respawn)
need-app = true
log-x-forwarded-for = true
thunder-lock = true

buffer-size = 65535
post-buffering = 8192
gevent = 10

harakiri = 605                       ; forcefully kill workers if they don't respond
harakiri-verbose = true
py-call-osafterfork = true           ; allow workers to trap signals

max-requests = 1500                  ; Restart workers after this many requests
max-worker-lifetime = 6000           ; Restart workers after this many seconds
reload-on-rss = 2048                 ; Restart workers after this much resident memory (MB)
worker-reload-mercy = 600            ; How long to wait before forcefully killing workers

cheaper-algo = busyness
processes = 40                       ; Maximum number of workers allowed
cheaper = 12                         ; Minimum number of workers allowed
cheaper-initial = 16                 ; Workers created at startup
cheaper-overload = 15                ; Length of a cycle in seconds
cheaper-step = 4                     ; How many workers to spawn at a time

cheaper-busyness-multiplier = 5      ; How many cycles to wait before killing workers
cheaper-busyness-min = 20            ; Below this threshold, kill workers (if stable for multiplier cycles)
cheaper-busyness-max = 80            ; Above this threshold, spawn new workers
cheaper-busyness-backlog-alert = 16  ; Spawn emergency workers if more than this many requests are waiting in the queue
cheaper-busyness-backlog-step = 2    ; How many emergency workers to create if there are too many requests in the queue

listen = 200                         ; Increase listen queue size to handle more requests
socket-timeout = 60                  ; Timeout for idle connections

log-4xx = true                       ; Log 4xx errors for better visibility
log-5xx = true                       ; Log 5xx errors for better visibility

procname = "[uwsgi:worker: %n]"
procname-master = "[uwsgi:master: %n]"
route = ^.*ping*$ donotlog:

Here's the traefik.yml itself:

#
#Ansible managed
#
accessLog: {}
api:
  dashboard: true
certificatesResolvers:
  internal-acme:
    acme:
      caServer: https://127.0.0.1:9443/acme/acme/directory
      email: <omitted>
      httpChallenge:
        entryPoint: http
      storage: /etc/traefic-acme/internal-acme.json
      tlsChallenge: {}
  letsencrypt:
    acme:
      email: <omitted>
      httpChallenge:
        entryPoint: http
      storage: /etc/traefic-acme/letsencrypt.json
entryPoints:
  api:
    address: :8080
    http:
      redirections:
        entryPoint:
          permanent: true
          scheme: https
          to: api
  http:
    address: :80
    forwardedHeaders:
      insecure: false
      trustedIPs: []
  https:
    address: :443
    forwardedHeaders:
      insecure: false
      trustedIPs: []
  ping:
    address: 127.0.0.1:8082
global:
  checkNewVersion: true
  sendAnonymousUsage: false
log:
  level: error
metrics:
  prometheus:
    addEntryPointsLabels: true
    addServicesLabels: true
    entryPoint: api
    manualRouting: true
pilot:
  dashboard: false
ping:
  entryPoint: ping
  manualRouting: false
  terminatingStatusCode: 503
providers:
  docker:
    exposedByDefault: false
    watch: true
  file:
    directory: /etc/traefik/dynamic
    watch: true
serversTransport:
  insecureSkipVerify: true
tls:
  routers:
    default-http-router:
      entryPoints:
      - http
      priority: 1
      rule: HostRegexp(`{host:.*}`)
      service: empty-backend@file
    default-https-router:
      entryPoints:
      - https
      priority: 1
      rule: HostRegexp(`{host:.*}`)
      service: empty-backend@file
      tls:
        options: no-strict-sni@file

The load isn't that high and there are no requests sitting in the listen queue. Based on this config I shouldn't be getting any 502s: the healthcheck on the server service never fails and hundreds of other requests complete normally, yet I still get these seemingly random 502 errors. Has anyone run into this? I have no idea where to keep looking for a misconfiguration.
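
For reference, the way I'd double-check the listen queue claim is uWSGI's stats server rather than eyeballing logs. These two options are not in the config above, just a temporary addition (the 127.0.0.1:9191 address is arbitrary):

stats = 127.0.0.1:9191
stats-http = true                    ; serve the stats JSON over plain HTTP so curl can read it

Then curl -s http://127.0.0.1:9191/ from inside the api container and look at listen_queue and listen_queue_errors in the JSON.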

I was thinking it might be something socket-related, but the configuration inside the docker container seems to be OK:

/ # ulimit -n
1048576
/ # sysctl net.core.somaxconn
net.core.somaxconn = 15000
/ # sysctl net.ipv4.tcp_max_syn_backlog
net.ipv4.tcp_max_syn_backlog = 1024
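
Another cheap check from inside the api container is whether the accept queue on port 8000 ever overflows (assuming iproute2 and net-tools are available in the image; they're not part of the stack shown above):

/ # ss -ltn sport = :8000            # Recv-Q = connections waiting to be accepted, Send-Q = backlog limit
/ # netstat -s | grep -i listen      # kernel counters like "times the listen queue of a socket overflowed"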

I enabled debug logging; here's what I see for the failed requests:

532-traefik_1  | time="2024-02-11T11:37:32Z" level=debug msg="'502 Bad Gateway' caused by: EOF"
533:traefik_1  | 172.71.182.165 - - [11/Feb/2024:11:37:32 +0000] "POST /protection/ping HTTP/2.0" 502 11 "-" "-" 724 "server-api@docker" "http://172.20.0.5:8000" 1ms

701-traefik_1  | time="2024-02-11T11:37:39Z" level=debug msg="'502 Bad Gateway' caused by: readfrom tcp 172.20.0.1:59260->172.20.0.5:8000: write tcp 172.20.0.1:59260->172.20.0.5:8000: use of closed network connection"
702:traefik_1  | 172.68.190.147 - - [11/Feb/2024:11:37:39 +0000] "POST /protection/info_character HTTP/2.0" 502 11 "-" "-" 889 "server-api@docker" "http://172.20.0.5:8000" 1ms

846-traefik_1  | time="2024-02-11T11:37:44Z" level=debug msg="'502 Bad Gateway' caused by: read tcp 172.20.0.1:60430->172.20.0.5:8000: read: connection reset by peer"
847:traefik_1  | 162.158.87.221 - - [11/Feb/2024:11:37:44 +0000] "POST /protection/ping HTTP/2.0" 502 11 "-" "-" 1032 "server-api@docker" "http://172.20.0.5:8000" 0ms

Bad gateway errors are mostly caused when using a Docker overlay network whose MTU hasn't been adjusted down to the VLAN/vSwitch/VPN MTU and requests are larger than 1400 bytes.

Most of these requests, like ping, have an empty body and an empty 200 response; they're just I'm-alive checks, so I'm not sure the MTU has any impact there. But I'll check anyway.
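
Checking it is cheap, at least. Something along these lines (the network and container names are placeholders for my compose project, and 1400 is just the value from the comment above, not something I've confirmed for this network):

docker network inspect <project>_default --format '{{json .Options}}'   # shows com.docker.network.driver.mtu if it was set
docker exec <api-container> cat /sys/class/net/eth0/mtu                 # effective MTU inside the container

If it does turn out to be the MTU, it can be pinned per network in docker-compose.yml via driver_opts (com.docker.network.driver.mtu).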

I managed to get the problem into a somewhat acceptable range by using --scale api=10 when deploying. The errors now occur much less frequently, but they still happen.
So instead of using the busyness cheaper algorithm in uWSGI, I'm now just deploying 60 workers (6 per container × 10 containers).
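
Concretely, the workaround is just:

docker-compose up -d --scale api=10

with the cheaper-* options removed from uwsgi.ini and a fixed processes = 6 per container instead.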

I'm thinking I should try changing maxIdleConnsPerHost in Traefik itself (Routing & Load Balancing Overview, Traefik docs), but I can only try that during less busy hours, probably Tuesday.
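
If I understand the docs right, it slots into the existing serversTransport block in traefik.yml; a sketch of what I plan to try (the -1 is a value I've seen suggested to stop Traefik from keeping idle keep-alive connections to backends, not something I've verified yet):

serversTransport:
  insecureSkipVerify: true
  maxIdleConnsPerHost: -1
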
Or I could try switching to gunicorn, but I'm not sure how that would help.
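
If I do try it, the compose command would be roughly this, mirroring the current uwsgi invocation (flag values are untuned placeholders; app.wsgi is the same module uWSGI loads, and gunicorn falls back to the application callable inside it):

    command:
      - gunicorn
      - app.wsgi
      - --bind=0.0.0.0:8000
      - --workers=6
      - --max-requests=1500
      - --max-requests-jitter=150
      - --timeout=600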