Service may need multiple restarts before traefik routing picks it up

joecanto · July 28, 2023, 12:11pm

Hi,
We are increasingly seeing a weird Traefik issue on one of our Docker Swarm setups. We have 3 identical swarms, but the issue only happens on one of them.

Basically, if we restart a Jenkins container or remove/re-deploy a service, Traefik always sees the service (it shows up in the API), but routing to it may not work, even though the service is up and available. This is intermittent, and our configs haven't changed in 12+ months.

Removing the service, waiting 1-2 mins and re-deploying usually fixes it.

Has anyone seen this behaviour before? Could it be load related? We have ~30 services (Jenkins containers) on a multi-node Docker Swarm (19.03), all sharing the same Docker network that Traefik is listening on.

I've added the relevant chunks of config below. There don't seem to be any errors in the logs relating to this, so we're a bit stumped. Any insight would be greatly appreciated.

# traefik.yml
entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entrypoint:
          to: websecure
          scheme: https
          permanent: true
  websecure:
    address: ":443"

accessLog:
  filePath: "/etc/traefik/logs/access.log"
  bufferingSize: 100

log:
  filePath: "/etc/traefik/logs/error.log"
  # format: json
  level: ERROR

providers:
  file:
    filename: /etc/traefik/traefik.yml
    watch: true

  docker:
    endpoint: "unix:///var/run/docker.sock"
    swarmMode: true
    swarmModeRefreshSeconds: 30
    watch: true
    constraints: "Label(`traefik.enable`, `true`)"
    exposedByDefault: false

Traefik service

version: '3.5'

networks:
  traefik:
    external: true

services:
  reverse-proxy:
    image: "traefik:v2.10.1"
    volumes:
      - /traefik/traefik.yml:/etc/traefik/traefik.yml
      - /traefik/ssl:/etc/traefik/ssl
      - /traefik/logs:/etc/traefik/logs
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "80:80"  # The HTTP port (redirected to 443)
      - "443:443"
    networks:
      - traefik
    deploy:
      labels:
        - traefik.enable=true
        - traefik.docker.network=traefik
        - traefik.http.services.dummyservice.loadbalancer.server.port=1111
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
        window: 120s
      placement:
        constraints:
        - node.labels.environment == prod
        - node.labels.node.role == manager

And lastly, the services are labelled as follows:

deploy:
  labels:
    - traefik.enable=true
    - traefik.docker.lbswarm=true
    - traefik.http.services.{{ SERVICE_NAME }}_jenkins_ci.loadbalancer.server.port=8080
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.rule=Host(`region.location.com`) && PathPrefix(`/jenkins/{{ SERVICE_NAME }}`)
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.entrypoints=websecure
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.tls=true
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.service={{ SERVICE_NAME }}_jenkins_ci
    - traefik.docker.network=traefik

bluepuma77 · July 28, 2023, 7:09pm

How long did you wait for the service (not) to be picked up?

Did you inspect that all the variables were substituted correctly?

I am kind of missing a `mode: global` or `replicas: x` in the Docker Swarm deploy section.

What is traefik.docker.lbswarm=true?
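
For context on the deploy-mode question, here is a minimal sketch of where `mode` and `replicas` sit in a stack file. The service name `jenkins_ci` and the image `jenkins/jenkins:lts` are assumptions for illustration, not taken from the original post. When both keys are omitted, Swarm defaults to a replicated service with one replica.

```yaml
services:
  jenkins_ci:                    # hypothetical service name
    image: jenkins/jenkins:lts   # assumed image, not from the original post
    networks:
      - traefik
    deploy:
      mode: replicated           # or "global" for one task per node
      replicas: 1                # only valid with mode: replicated
      labels:
        - traefik.enable=true
```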

joecanto · July 29, 2023, 12:33pm

Hi,
Generally speaking, Traefik picks it up almost right away. When it doesn't, it doesn't seem to pick it up at all. The longest we have left a service without restarting it again was probably 15-20 mins.

Regarding your last question, I set lbswarm in the early days and kind of forgot about it. IIRC, it tells Traefik not to do any load balancing itself and to leave that to the swarm. It seemed like a good idea at the time, but I think I'll rip it out in case it has some weird side effects, maybe even my current issue.
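
A sketch of the service labels with `traefik.docker.lbswarm` removed, assuming everything else stays as posted above. With `lbswarm=true`, Traefik v2 routes to the service's Swarm virtual IP rather than to the individual task IPs; without the label, it falls back to its default of balancing across the task IPs itself. Whether this has anything to do with the routing problem is untested here.

```yaml
deploy:
  labels:
    - traefik.enable=true
    # traefik.docker.lbswarm=true removed: Traefik now targets task IPs directly
    - traefik.http.services.{{ SERVICE_NAME }}_jenkins_ci.loadbalancer.server.port=8080
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.rule=Host(`region.location.com`) && PathPrefix(`/jenkins/{{ SERVICE_NAME }}`)
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.entrypoints=websecure
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.tls=true
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.service={{ SERVICE_NAME }}_jenkins_ci
    - traefik.docker.network=traefik
```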
