Deployments fail using Traefik and Docker Swarm

About one in three deployments from my GitHub Actions pipeline to my server fails, and the server then responds with 503 Service Unavailable.

When I enabled debug logging I saw the following warning:

2025-04-15T17:09:44Z WRN github.com/traefik/traefik/v3/pkg/healthcheck/healthcheck.go:125 > Health check failed. error="HTTP request failed: Get \"http://10.0.0.14:80/\": context deadline exceeded" serviceName=frontend@docker targetURL=http://10.0.0.14:80

My application stack is configured to use Docker Swarm and I deploy with the following command:

$ docker stack deploy --with-registry-auth -c /opt/clever-cash/compose.base.yml -c /opt/clever-cash/stack.test.yml my-app

What can I do to get my deployments to become 100% robust?
Thank you for reading my post. I'm looking forward to your insights 🙂
Cheers,

Florestan

PS: I've attached the full debug log at the end of this post.

compose.yml

services:
  frontend:
    container_name: frontend
    restart: on-failure
    ports:
      - 4200:80
    image: frontend
    build:
      context: ../../../
      cache_from:
        - frontend
      dockerfile: ./frontend/Dockerfile
    logging:
      driver: loki
      options:
        loki-url: /run/secrets/jwt_expire_in/loki-url
        loki-retries: 2
        loki-max-backoff: 800ms
        loki-timeout: 10s
        keep-file: 'true'
        mode: non-blocking
    healthcheck:
      test: ['CMD', 'curl', '-f', 'http://localhost:80/']
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 15s

stack.test.yml

services:
  traefik:
    image: traefik
    command:
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --certificatesresolvers.dnsresolver.acme.dnschallenge=true
      - --certificatesresolvers.dnsresolver.acme.dnschallenge.provider=cloudflare
      - --certificatesresolvers.dnsresolver.acme.email=admin@example.com
      - --certificatesresolvers.dnsresolver.acme.dnschallenge.delaybeforecheck=0
      - --certificatesresolvers.dnsresolver.acme.storage=/letsencrypt/acme.json
      - --entryPoints.websecure.address=:443
      - --entryPoints.web.address=:80
      - --entrypoints.web.http.redirections.entrypoint.to=websecure
      - --entrypoints.web.http.redirections.entrypoint.scheme=https
      - --serversTransport.forwardingTimeouts.dialTimeout=30s
      - --log.level=DEBUG
    ports:
      - 80:80
      - 443:443
    environment:
      - CF_DNS_API_TOKEN_FILE=/run/secrets/cloudflare_dns_api_token
    secrets:
      - cloudflare_dns_api_token
    volumes:
      - letsencrypt:/letsencrypt
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      replicas: 1
      update_config:
        parallelism: 1
        delay: 10s
        order: stop-first
        failure_action: rollback
      restart_policy:
        condition: any

  # FRONTEND
  frontend:
    image: frontend:latest
    labels:
      - traefik.enable=true
      - traefik.http.middlewares.frontend-retry.retry.attempts=5
      - traefik.http.middlewares.frontend-retry.retry.initialinterval=100ms
      # Match healthcheck timeout to container startup time
      - traefik.http.services.frontend.loadbalancer.healthcheck.path=/
      - traefik.http.services.frontend.loadbalancer.healthcheck.interval=10s
      - traefik.http.services.frontend.loadbalancer.healthcheck.timeout=8s
      # Link the middleware to the router (?)
      - traefik.http.routers.frontend-router.middlewares=frontend-retry@docker
      - traefik.http.routers.frontend-router.rule=Host(`example.com`)
      - traefik.http.routers.frontend-router.entrypoints=websecure
      - traefik.http.routers.frontend-router.tls.certresolver=dnsresolver
      - traefik.http.routers.frontend-router.service=frontend
    extra_hosts:
      - host.docker.internal:host-gateway
volumes:
  letsencrypt:

Full log output (log level DEBUG)

2025-04-15T17:08:56Z DBG github.com/traefik/traefik/v3/pkg/provider/acme/provider.go:984 > No ACME certificate generation required for domains ACME CA=https://acme-v02.api.letsencrypt.org/directory acmeCA=https://acme-v02.api.letsencrypt.org/directory domains=["example.com"] providerName=dnsresolver.acme routerName=frontend-router@docker rule=Host(`example.com`)
2025-04-15T17:09:07Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:213 > Service selected by WRR: http://10.0.0.14:80
2025-04-15T17:09:08Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:213 > Service selected by WRR: http://10.0.0.14:80
2025-04-15T17:09:08Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:213 > Service selected by WRR: http://10.0.0.14:80
2025-04-15T17:09:08Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:213 > Service selected by WRR: http://10.0.0.14:80
2025-04-15T17:09:13Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:213 > Service selected by WRR: http://10.0.0.14:80
2025-04-15T17:09:14Z WRN github.com/traefik/traefik/v3/pkg/healthcheck/healthcheck.go:125 > Health check failed. error="HTTP request failed: Get \"http://10.0.0.14:80/\": context deadline exceeded" serviceName=frontend@docker targetURL=http://10.0.0.14:80
2025-04-15T17:09:14Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:146 > Setting status of http://10.0.0.14:80 to DOWN serviceName=frontend@docker
2025-04-15T17:09:14Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:168 > Propagating new DOWN status serviceName=frontend@docker
2025-04-15T17:09:15Z DBG github.com/traefik/traefik/v3/pkg/proxy/httputil/proxy.go:121 > 499 Client Closed Request error="context canceled"
2025-04-15T17:09:15Z DBG github.com/traefik/traefik/v3/pkg/middlewares/retry/retry.go:177 > Final retry attempt failed error="context canceled" middlewareName=frontend-retry@docker middlewareType=Retry
2025-04-15T17:09:24Z WRN github.com/traefik/traefik/v3/pkg/healthcheck/healthcheck.go:125 > Health check failed. error="HTTP request failed: Get \"http://10.0.0.14:80/\": context deadline exceeded" serviceName=frontend@docker targetURL=http://10.0.0.14:80
2025-04-15T17:09:24Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:146 > Setting status of http://10.0.0.14:80 to DOWN serviceName=frontend@docker
2025-04-15T17:09:24Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:163 > Still DOWN, no need to propagate serviceName=frontend@docker
2025-04-15T17:09:34Z WRN github.com/traefik/traefik/v3/pkg/healthcheck/healthcheck.go:125 > Health check failed. error="HTTP request failed: Get \"http://10.0.0.14:80/\": context deadline exceeded" serviceName=frontend@docker targetURL=http://10.0.0.14:80
2025-04-15T17:09:34Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:146 > Setting status of http://10.0.0.14:80 to DOWN serviceName=frontend@docker
2025-04-15T17:09:34Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:163 > Still DOWN, no need to propagate serviceName=frontend@docker
2025-04-15T17:09:37Z DBG github.com/traefik/traefik/v3/pkg/proxy/httputil/proxy.go:121 > 504 Gateway Timeout error="dial tcp 10.0.0.14:80: i/o timeout"
2025-04-15T17:09:37Z DBG github.com/traefik/traefik/v3/pkg/middlewares/retry/retry.go:170 > New attempt 2 for request: /api/users/profile middlewareName=frontend-retry@docker middlewareType=Retry
2025-04-15T17:09:38Z DBG github.com/traefik/traefik/v3/pkg/proxy/httputil/proxy.go:121 > 504 Gateway Timeout error="dial tcp 10.0.0.14:80: i/o timeout"
2025-04-15T17:09:38Z DBG github.com/traefik/traefik/v3/pkg/middlewares/retry/retry.go:170 > New attempt 2 for request: /api/users/profile middlewareName=frontend-retry@docker middlewareType=Retry
2025-04-15T17:09:38Z DBG github.com/traefik/traefik/v3/pkg/proxy/httputil/proxy.go:121 > 504 Gateway Timeout error="dial tcp 10.0.0.14:80: i/o timeout"
2025-04-15T17:09:38Z DBG github.com/traefik/traefik/v3/pkg/middlewares/retry/retry.go:170 > New attempt 2 for request: /ngsw-worker.js middlewareName=frontend-retry@docker middlewareType=Retry
2025-04-15T17:09:43Z DBG github.com/traefik/traefik/v3/pkg/proxy/httputil/proxy.go:121 > 504 Gateway Timeout error="dial tcp 10.0.0.14:80: i/o timeout"
2025-04-15T17:09:43Z DBG github.com/traefik/traefik/v3/pkg/middlewares/retry/retry.go:170 > New attempt 2 for request: /ngsw.json?ngsw-cache-bust=0.8821232093868159 middlewareName=frontend-retry@docker middlewareType=Retry
2025-04-15T17:09:44Z WRN github.com/traefik/traefik/v3/pkg/healthcheck/healthcheck.go:125 > Health check failed. error="HTTP request failed: Get \"http://10.0.0.14:80/\": context deadline exceeded" serviceName=frontend@docker targetURL=http://10.0.0.14:80
2025-04-15T17:09:44Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:146 > Setting status of http://10.0.0.14:80 to DOWN serviceName=frontend@docker
2025-04-15T17:09:44Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:163 > Still DOWN, no need to propagate serviceName=frontend@docker

The Docker Swarm provider polls the Swarm metadata every 30 seconds by default. Make sure your stack deploy waits longer than that between replacing containers, or decrease the polling interval.

Otherwise all old containers are gone before the new ones are registered with Traefik.

Thank you for your reply. I'm quite new to Traefik, so what would your advice look like when applied to my context?

Am I right to assume I'd be looking at changing these values of the frontend service?

interval: 10s
timeout: 5s
retries: 3
start_period: 15s

I just checked the docs: the default is 15 seconds, not 30.

Either set the interval to something like 20 seconds, or set providers.swarm.refreshSeconds=5.
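
A minimal sketch of how those two options might look in stack.test.yml, under two assumptions that the thread does not state. First, that Traefik v3's Swarm provider is enabled: the posted config uses --providers.docker, and in v3 the providers.swarm.* options belong to the separate Swarm provider. Second, that "interval" refers to the pause between task replacements during a rolling update (update_config.delay). Only the changed keys are shown and the values are illustrative; pick one option or the other.

```yaml
# Fragment only, not a complete stack file.
services:
  traefik:
    command:
      # Assumption: the Swarm provider is used instead of --providers.docker.
      - --providers.swarm=true
      - --providers.swarm.exposedbydefault=false
      # Option B: poll Swarm more often than the 15 s default.
      - --providers.swarm.refreshSeconds=5

  frontend:
    deploy:
      update_config:
        parallelism: 1
        # Option A (assumed reading of "interval"): wait longer than the
        # provider's refresh interval before replacing the next task.
        delay: 20s
```

With a single frontend replica there is no second task to wait for, so update_config.delay probably has little effect and the shorter refresh interval is likely the more relevant setting.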