I'm running Traefik behind Cloudflare on a small production server, and roughly 1% of the requests hitting the machine come back as 502, as if the upstream service were unavailable. I can't find the root cause of the problem.
To give a sense of scale, that's roughly 160 502 errors against about 13k normal responses.
Here's docker-compose logs --tail 200 -f | grep 502 for the traefik container:
traefik_1 | 172.71.123.178 - - [10/Feb/2024:17:59:59 +0000] "POST /protection/ping HTTP/2.0" 502 11 "-" "-" 3788041 "server-api@docker" "http://172.20.0.9:8000" 0ms
traefik_1 | 162.158.222.19 - - [10/Feb/2024:18:00:03 +0000] "POST /core/ping HTTP/2.0" 502 11 "-" "-" 3788181 "server-api@docker" "http://172.20.0.13:8000" 0ms
traefik_1 | 172.71.122.251 - - [10/Feb/2024:18:00:04 +0000] "POST /core/login HTTP/2.0" 502 11 "-" "-" 3788219 "server-api@docker" "http://172.20.0.9:8000" 2ms
traefik_1 | 172.71.250.96 - - [10/Feb/2024:18:00:05 +0000] "POST /core/ping HTTP/2.0" 502 11 "-" "-" 3788245 "server-api@docker" "http://172.20.0.10:8000" 4ms
traefik_1 | 162.158.111.62 - - [10/Feb/2024:18:00:09 +0000] "POST /core/encrypt HTTP/2.0" 502 11 "-" "-" 3788417 "server-api@docker" "http://172.20.0.9:8000" 1ms
traefik_1 | 162.158.183.35 - - [10/Feb/2024:18:00:12 +0000] "GET /launcher/ping HTTP/2.0" 200 2 "-" "-" 3788502 "server-api@docker" "http://172.20.0.10:8000" 109ms
(the last request is 200 OK)
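Rough counts like the ones above can be pulled straight out of the access log; something like this works because the status code is the field right after the quoted request line:
docker-compose logs --no-color traefik | grep -c '" 502 '
docker-compose logs --no-color traefik | grep -c '" 200 '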
Keep in mind that hundreds of other requests hit the same service in the container and complete normally. Here's the relevant part of the compose file:
api:
  <<: *container-server
  image: server:${APP__VERSION:-master}
  command:
    - uwsgi
    - --ini=uwsgi.ini
  expose:
    - 8000
  environment:
    PYTHONIOENCODING: "UTF-8"
    LANG: "C.UTF-8"
    LC_ALL: "C.UTF-8"
    VIRTUAL_HOST: $VIRTUAL_HOST
  labels:
    - traefik.enable=true
    - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api.tls=true
    - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api.tls.certresolver=letsencrypt
    - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api.rule=Host(`$VIRTUAL_HOST`)
    - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api.entrypoints=https
    - traefik.http.services.${COMPOSE_PROJECT_NAME}-api.loadbalancer.server.port=8000
    - traefik.http.services.${COMPOSE_PROJECT_NAME}-api.loadbalancer.server.scheme=http
    - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api-http.rule=Host(`$VIRTUAL_HOST`)
    - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api-http.entrypoints=http
    - traefik.http.routers.${COMPOSE_PROJECT_NAME}-api-http.middlewares=default-https-redirect@file
  healthcheck:
    test: curl -fsSL http://localhost:8000/ht/?format=json
    interval: 15s
    retries: 5
    start_period: 5s
    timeout: 10s
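That healthcheck stays green the whole time, and it can be reproduced by hand, e.g. (assuming the compose service name api as above):
docker-compose exec api curl -fsSL "http://localhost:8000/ht/?format=json"
docker inspect --format '{{.State.Health.Status}}' $(docker-compose ps -q api)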
uwsgi settings (uwsgi.ini):
[uwsgi]
module = app.wsgi
http = :8000
strict = true
master = true
enable-threads = true
vacuum = true ; Delete sockets during shutdown
single-interpreter = true
die-on-term = true ; Shutdown when receiving SIGTERM (default is respawn)
need-app = true
log-x-forwarded-for = true
thunder-lock = true
buffer-size = 65535
post-buffering = 8192
gevent = 10
harakiri = 605 ; forcefully kill workers if they don't respond
harakiri-verbose = true
py-call-osafterfork = true ; allow workers to trap signals
max-requests = 1500 ; Restart workers after this many requests
max-worker-lifetime = 6000 ; Restart workers after this many seconds
reload-on-rss = 2048 ; Restart workers after this much resident memory
worker-reload-mercy = 600 ; How long to wait before forcefully killing workers
cheaper-algo = busyness
processes = 40 ; Maximum number of workers allowed
cheaper = 12 ; Minimum number of workers allowed
cheaper-initial = 16 ; Workers created at startup
cheaper-overload = 15 ; Length of a cycle in seconds
cheaper-step = 4 ; How many workers to spawn at a time
cheaper-busyness-multiplier = 5 ; How many cycles to wait before killing workers
cheaper-busyness-min = 20 ; Below this threshold, kill workers (if stable for multiplier cycles)
cheaper-busyness-max = 80 ; Above this threshold, spawn new workers
cheaper-busyness-backlog-alert = 16 ; Spawn emergency workers if more than this many requests are waiting in the queue
cheaper-busyness-backlog-step = 2 ; How many emergency workers to create if there are too many requests in the queue
listen = 200 ; Increase listen queue size to handle more requests
socket-timeout = 60 ; Timeout for idle connections
log-4xx = true ; Log 4xx errors for better visibility
log-5xx = true ; Log 5xx errors for better visibility
procname = "[uwsgi:worker: %n]"
procname-master = "[uwsgi:master: %n]"
route = ^.*ping*$ donotlog:
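As a side note on listen = 200 above: the backlog never overflows; uwsgi prints a "listen queue of socket ... full" warning when that happens, and grepping the api logs for it comes back empty:
docker-compose logs --no-color api | grep -i 'listen queue'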
Here's the traefik.yml itself:
#
#Ansible managed
#
accessLog: {}
api:
  dashboard: true
certificatesResolvers:
  internal-acme:
    acme:
      caServer: https://127.0.0.1:9443/acme/acme/directory
      email: <omitted>
      httpChallenge:
        entryPoint: http
      storage: /etc/traefic-acme/internal-acme.json
      tlsChallenge: {}
  letsencrypt:
    acme:
      email: <omitted>
      httpChallenge:
        entryPoint: http
      storage: /etc/traefic-acme/letsencrypt.json
entryPoints:
  api:
    address: :8080
    http:
      redirections:
        entryPoint:
          permanent: true
          scheme: https
          to: api
  http:
    address: :80
    forwardedHeaders:
      insecure: false
      trustedIPs: []
  https:
    address: :443
    forwardedHeaders:
      insecure: false
      trustedIPs: []
  ping:
    address: 127.0.0.1:8082
global:
  checkNewVersion: true
  sendAnonymousUsage: false
log:
  level: error
metrics:
  prometheus:
    addEntryPointsLabels: true
    addServicesLabels: true
    entryPoint: api
    manualRouting: true
pilot:
  dashboard: false
ping:
  entryPoint: ping
  manualRouting: false
  terminatingStatusCode: 503
providers:
  docker:
    exposedByDefault: false
    watch: true
  file:
    directory: /etc/traefik/dynamic
    watch: true
serversTransport:
  insecureSkipVerify: true
tls:
routers:
  default-http-router:
    entryPoints:
      - http
    priority: 1
    rule: HostRegexp(`{host:.*}`)
    service: empty-backend@file
  default-https-router:
    entryPoints:
      - https
    priority: 1
    rule: HostRegexp(`{host:.*}`)
    service: empty-backend@file
    tls:
      options: no-strict-sni@file
The load isn't that high and there are no requests waiting in the listen queue, so based on this config I shouldn't be getting any 502 errors. The healthcheck on the api service never fails, and hundreds of other requests perform normally, yet I'm still getting these seemingly random 502s. Has anyone encountered this issue? I have no idea where to keep looking for a misconfiguration.
I was thinking it might be something socket-related, but the config inside the Docker container seems to be OK:
/ # ulimit -n
1048576
/ # sysctl net.core.somaxconn
net.core.somaxconn = 15000
/ # sysctl net.ipv4.tcp_max_syn_backlog
net.ipv4.tcp_max_syn_backlog = 1024
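One more thing worth checking from inside the container is the accept queue on the uwsgi socket itself (assuming ss from iproute2 is available in the image); for a listening socket, Recv-Q is the number of connections waiting to be accepted and Send-Q is the effective backlog:
ss -ltn 'sport = :8000'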