Service may need multiple restarts before traefik routing picks it up

Hi,
We are increasingly seeing a weird Traefik issue on one of our Docker Swarm setups. We have 3 identical swarms, but the issue only happens on one of them.

Basically, if we restart a Jenkins container or remove/re-deploy a service, Traefik always sees the service (it is listed in the API), but routing to it may not work, even though the service is up and available. This is intermittent, and our configs haven't changed in 12+ months.

Removing the service, waiting 1-2 mins and re-deploying usually fixes it.

Has anyone seen this behaviour before? Could it be load related? We have ~30 services (Jenkins containers) running on a multi-node Docker Swarm (19.03), all sharing the same Docker network that Traefik is listening on.

I've added chunks of the relevant config below. There don't seem to be any errors in the logs relating to this, so we're a bit stumped. Any insight would be greatly appreciated.

# traefik.yml
entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entrypoint:
          to: websecure
          scheme: https
          permanent: true
  websecure:
    address: ":443"

accessLog:
  filePath: "/etc/traefik/logs/access.log"
  bufferingSize: 100

log:
  filePath: "/etc/traefik/logs/error.log"
  # format: json
  level: ERROR

providers:
  file:
    filename: /etc/traefik/traefik.yml
    watch: true

  docker:
    endpoint: "unix:///var/run/docker.sock"
    swarmMode: true
    swarmModeRefreshSeconds: 30
    watch: true
    constraints: "Label(`traefik.enable`, `true`)"
    exposedByDefault: false
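
Note that with the log level above set to ERROR, provider activity (for example, the configuration Traefik rebuilds when it polls the swarm) would not be written to error.log. A minimal sketch of the same log section with the level raised, assuming the extra log volume is acceptable while reproducing the problem:

```yaml
# Sketch only: same log section as above, with the level raised for debugging.
log:
  filePath: "/etc/traefik/logs/error.log"
  level: DEBUG  # assumption: enabled temporarily; revert to ERROR afterwards
```

This is part of the static configuration, so Traefik has to be restarted for the change to take effect.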

Traefik service

version: '3.5'

networks:
  traefik:
    external: true

services:
  reverse-proxy:
    image: "traefik:v2.10.1"
    volumes:
      - /traefik/traefik.yml:/etc/traefik/traefik.yml
      - /traefik/ssl:/etc/traefik/ssl
      - /traefik/logs:/etc/traefik/logs
      # Docker socket, mounted read-only for the swarm provider
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "80:80"  # The HTTP port (redirected to 443)
      - "443:443"
    networks:
      - traefik
    deploy:
      labels:
        - traefik.enable=true
        - traefik.docker.network=traefik
        - traefik.http.services.dummyservice.loadbalancer.server.port=1111
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
        window: 120s
      placement:
        constraints:
        - node.labels.environment == prod
        - node.labels.node.role == manager

Lastly, the services are labelled as follows:

deploy:
  labels:
    - traefik.enable=true
    - traefik.docker.lbswarm=true
    - traefik.http.services.{{ SERVICE_NAME }}_jenkins_ci.loadbalancer.server.port=8080
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.rule=Host(`region.location.com`) && PathPrefix(`/jenkins/{{ SERVICE_NAME }}`)
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.entrypoints=websecure
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.tls=true
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.service={{ SERVICE_NAME }}_jenkins_ci
    - traefik.docker.network=traefik

How long did you wait for the service (not) to be picked up?

Did you inspect that all the variables were substituted correctly?

I am kind of missing a mode: global or replicas: x in the Docker Swarm deploy section.
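
For illustration, this is roughly where those keys would sit in the service definition. The replica count of 1 is an assumption; Compose falls back to replicated mode with a single replica when neither key is given, so spelling it out mainly makes the intent explicit.

```yaml
# Sketch only: explicit scheduling mode for one Jenkins service.
deploy:
  mode: replicated  # or "global" to run one task per matching node
  replicas: 1       # assumed value; only valid with mode: replicated
  labels:
    - traefik.enable=true
    # ... remaining Traefik labels unchanged
```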

What is traefik.docker.lbswarm=true?

Hi,
Generally speaking, Traefik picks it up almost right away. When it doesn't pick it up right away, it doesn't seem to pick it up at all. The longest we have left a service without restarting it again was probably 15-20 mins.

Regarding lbswarm, I set it in the early days and kind of forgot about it. IIRC, it tells Traefik not to do any load balancing and to leave that to the swarm. It seemed like a good idea at the time, but I think I'll rip it out in case it has some weird side effects, maybe even my current issue.
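
For reference, a sketch of the service label block with lbswarm dropped. The understanding here is that with lbswarm=true Traefik targets the swarm service's virtual IP, and without it Traefik balances across the individual task IPs itself; whether that difference is related to the routing problem in this thread is unconfirmed.

```yaml
# Sketch only: same labels as posted earlier, minus traefik.docker.lbswarm.
deploy:
  labels:
    - traefik.enable=true
    # traefik.docker.lbswarm removed: Traefik load-balances across task IPs
    - traefik.http.services.{{ SERVICE_NAME }}_jenkins_ci.loadbalancer.server.port=8080
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.rule=Host(`region.location.com`) && PathPrefix(`/jenkins/{{ SERVICE_NAME }}`)
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.entrypoints=websecure
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.tls=true
    - traefik.http.routers.{{ SERVICE_NAME }}_jenkins_ci.service={{ SERVICE_NAME }}_jenkins_ci
    - traefik.docker.network=traefik
```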