Services time out when running on a swarm worker node

I've been using Traefik for over a year and it's working great, but I'm having trouble making it run in swarm mode.

My setup

I have a Docker swarm with 1 manager and 1 worker. Both are on an overlay network called "traefik_net".
Traefik runs on the manager node and my web apps on the worker.

The issue

Service discovery works perfectly fine. The only issue is that I cannot access my services when they are deployed on the worker node; if I deploy them on the manager node, I can access them.

If I try to access my web app, it loads for a few seconds and then my browser returns a 400 error.

What I've tried

  • From within the Traefik container I can run wget against a service running on the worker node, so I assume the swarm network is working well.
  • Running my apps on the manager node works well.
  • Looked at the Traefik log in DEBUG mode: nothing is logged when I try to access my service.
  • In the dashboard, my services have the correct IP:PORT in the 'Servers' section.
  • As it works when my services run on the manager, I assume my docker compose is correct.

Docker compose

Traefik
version: '3.8'

services:
  traefik:
    image: traefik:latest
    restart: always
    ports:
      - "80:80"
      - "443:443"
      - "8080:8080"

    environment:
      - OVH_APPLICATION_KEY=x
      - OVH_ENDPOINT=ovh-eu
      - OVH_CONSUMER_KEY=x
      - OVH_APPLICATION_SECRET=x

    command:
      - --api.insecure=true
      - --api.dashboard=true
      - --api.debug=true
      - --log.level=DEBUG
      - --providers.docker=true
      - --providers.docker.swarmMode=true
      - --providers.docker.network=traefik_net
      - --providers.docker.exposedByDefault=false
      - --entrypoints.http.address=:80
      - --entrypoints.https.address=:443
      - --entrypoints.http.http.redirections.entrypoint.to=https
      - --entrypoints.http.http.redirections.entrypoint.scheme=https
      - --providers.docker.network=traefik_net
      - --certificatesresolvers.sslresolver.acme.dnschallenge=true
      - --certificatesresolvers.sslresolver.acme.dnschallenge.provider=ovh
      - --certificatesresolvers.sslresolver.acme.email=x
      - --certificatesresolvers.sslresolver.acme.storage=/letsencrypt/acme.json

    deploy:
      mode: global
      placement:
        constraints:
          - node.role == manager
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.dashboard.entrypoints=https"
        - "traefik.http.routers.dashboard.tls.certresolver=sslresolver"
        - "traefik.http.routers.dashboard.rule=Host(`MY_DOMAIN`)"
        - "traefik.http.services.dumy.loadbalancer.server.port=9999"
        - "traefik.http.routers.dashboard.service=api@internal"
        - "traefik.http.routers.dashboard.middlewares=auth"

    networks:
      - traefik_net

    extra_hosts:
      - "host.docker.internal:host-gateway"

    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - .:/traefik
      - ./letsencrypt:/letsencrypt

networks:
  traefik_net:
    external: true

Web app
version: '3.8'

services:
  foxtv:
    env_file: .env
    image:  "registry.gitlab.com/..."
    command: npm run start
    deploy:
      replicas: 1
      labels:
        - traefik.enable=true
        - traefik.http.routers.foxtv.rule=Host(`MY_DOMAIN`)
        - traefik.http.routers.foxtv.tls.certresolver=sslresolver
        - traefik.http.routers.foxtv.service=foxtv
        - traefik.http.services.foxtv.loadbalancer.server.port=3010
    networks:
      - traefik_net

Thank you in advance for your help!

Are you using the Docker Swarm Overlay Network over a VLAN/VSWITCH? Then make sure that you have the right MTU set.

It’s tricky to detect, as TCP packets (and HTTP requests) below ~1400 bytes usually work, while larger ones fail.
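One way to check this from a node is to ping with the "don't fragment" bit set, since a plain large ping can be fragmented on the way and still succeed. This is only a rough sketch, assuming Linux iputils `ping` and IPv4; the function names and the address `10.0.0.2` are made up for illustration.

```shell
#!/bin/sh
# Largest ICMP payload that fits in a given MTU over IPv4:
# 20 bytes of IPv4 header + 8 bytes of ICMP header = 28 bytes of overhead.
icmp_payload() {
  echo $(( $1 - 28 ))
}

# Ping with the "don't fragment" bit set (Linux iputils syntax).
# $1 = target IP (hypothetical, e.g. the other swarm node), $2 = MTU to test.
probe_mtu() {
  ping -M do -c 3 -s "$(icmp_payload "$2")" "$1"
}

# Example calls (not executed here):
#   probe_mtu 10.0.0.2 1500
#   probe_mtu 10.0.0.2 1400
```

If the larger size fails or reports that the message is too long while the smaller one gets replies, the path MTU between the nodes is probably below the larger value.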

Thanks for the lead, but unfortunately that doesn't seem to be the source of my issue.
I tested pinging between the nodes with a large packet size and it works. I can also access my service if I don't go through Traefik, so I think the network side is OK.

I'm out of ideas on where to look :thinking:

How do you know the service is on the worker? You set replicas: 1, but no constraint.

You use host.docker.internal, which is only supported by Docker Desktop.

You use --api.insecure=true; make sure to remove this when you go to production. It will ignore any middlewares for auth. Also, you assign middlewares=auth, which does not seem to exist.

Note that you can assign the certresolver globally on the entrypoint. Maybe check your config against the simple Traefik example.

Last note: you use MY_DOMAIN for Traefik dashboard and your service, of course they must be different.

Thanks for this feedback. I was checking where my services were running with a docker command, but adding a constraint is a good idea!

MY_DOMAIN is obviously set accordingly in my docker compose, and I can confirm that it works well, because if I force my service onto the manager node everything works.

So with this new compose it does work if I use node.role == manager, and I get an HTTP ERROR 400 from Chrome after ~1 min when using node.role == worker.

New docker compose

Web app
version: '3.8'

services:
  foxtv:
    env_file: .env
    image:  "registry.gitlab.com/..."
    command: npm run start -- --filter=foxtv
    deploy:
      replicas: 1
      labels:
        - traefik.enable=true
        - traefik.http.routers.foxtv.rule=Host(`tv.web02...`)
        - traefik.http.routers.foxtv.service=foxtv
        - traefik.http.services.foxtv.loadbalancer.server.port=3010
        - traefik.http.routers.foxtv.tls.certresolver=sslresolver
      placement:
        constraints:
          - node.role == worker
    networks:
      - traefik_net

So Traefik forwarding works when the target service runs on the manager, but not on the worker?

Is traefik_net a Docker Swarm overlay network?

Have you tried with a simple whoami service instead?

Here is the simple Swarm Traefik example.
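For the whoami test suggested above, a throwaway service pinned to the worker could look roughly like this. It is only a sketch: the hostname `whoami.example.com` is a placeholder, the network and resolver names are taken from the compose files in this thread, and the `DOCKER` variable is just a convenience added here so the command can be printed instead of executed.

```shell
#!/bin/sh
# Deploy traefik/whoami (listens on port 80) on a worker node, attached to
# the same overlay network that Traefik watches.
# Set DOCKER=echo to print the command instead of running it.
deploy_whoami() {
  ${DOCKER:-docker} service create \
    --name whoami \
    --network traefik_net \
    --constraint node.role==worker \
    --label traefik.enable=true \
    --label 'traefik.http.routers.whoami.rule=Host(`whoami.example.com`)' \
    --label traefik.http.routers.whoami.tls.certresolver=sslresolver \
    --label traefik.http.services.whoami.loadbalancer.server.port=80 \
    traefik/whoami
}

# Dry run:  DOCKER=echo deploy_whoami
# Real run: deploy_whoami
```

If whoami behaves the same way as the real app (fine on the manager, timing out on the worker), the problem is more likely in the network path than in the app itself.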

So Traefik forwarding works when the target service runs on the manager, but not on the worker? YES

Is traefik_net a Docker Swarm overlay network? YES
And I can confirm that the network works, because I can run wget from the Traefik container (docker exec) against the service's IP on the swarm network. This IP is also the same as the one shown in the Traefik dashboard (HTTP > Services > Servers), so Traefik detects the correct IP and uses the correct port.

Have you tried with a simple whoami service instead? Not yet

But the service is running OK on the worker node? Did you log in to your private registry on the worker node so the image can be pulled?

Yes, the service is working fine. I can access it if I use the IP of the worker node directly.

How is that possible? You don't have a port exposed in your service/container.

You're right, for that test I had to expose a port. I should have mentioned it in my previous post.

What infra & distribution are you running on?

Try whoami service with /data?size=10000. (Doc)
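The idea behind that test is to request responses of increasing size and see where they start failing. A rough sketch with curl follows; the base URL is a placeholder for wherever whoami is routed, and the helper names are invented for this example.

```shell
#!/bin/sh
# Build the whoami URL that returns a body of the requested size in bytes.
data_url() {
  printf '%s/data?size=%s\n' "$1" "$2"
}

# Fetch one size and print the HTTP status plus the bytes actually received.
fetch_size() {
  curl -sS -o /dev/null --max-time 10 \
    -w '%{http_code} %{size_download}\n' "$(data_url "$1" "$2")"
}

# Try a few sizes on both sides of a typical MTU.
# $1 = base URL (hypothetical, e.g. https://whoami.example.com)
check_sizes() {
  for size in 100 1000 1400 2000 10000; do
    printf '%s bytes: ' "$size"
    fetch_size "$1" "$size" || echo "failed"
  done
}

# Example (not executed here): check_sizes https://whoami.example.com
```

Small responses succeeding while larger ones time out would point toward an MTU problem rather than a routing or label problem.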

It's 2 Ubuntu VMs on one Proxmox server; they have a WireGuard VPN to communicate (the swarm network is using the interface of this VPN).

I will try a simple whoami service.

When using a VPN, the MTU of TCP packets usually needs to be reduced. Make sure your Docker Swarm overlay network has an MTU that fits inside the VPN MTU.

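You can try this (adapted from a ChatGPT suggestion; note that Docker has no docker network update command, so the MTU has to be set when the overlay network is created):

Check current MTU (it is listed under Options if it was set explicitly):

docker network inspect my_overlay_network

Recreate the network with a lower MTU (remove the services attached to it first):

docker network rm my_overlay_network
docker network create --driver overlay --opt com.docker.network.driver.mtu=1400 my_overlay_network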
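When choosing an MTU value for the overlay network, a rough starting point is the MTU of the underlying interface minus the VXLAN encapsulation overhead, since swarm overlay networks use VXLAN. This is a sketch under assumptions: the interface name `wg0` is a guess, the 50-byte figure applies to unencrypted VXLAN over IPv4, and the value that actually works may need to be lower.

```shell
#!/bin/sh
# VXLAN over IPv4 adds 50 bytes per packet:
# outer Ethernet 14 + outer IPv4 20 + UDP 8 + VXLAN header 8.
VXLAN_OVERHEAD=50

# Suggested overlay MTU for a given underlay (VPN interface) MTU.
overlay_mtu_for() {
  echo $(( $1 - VXLAN_OVERHEAD ))
}

# Read the MTU of a network interface on Linux.
# $1 = interface name (hypothetical, e.g. wg0 for the WireGuard tunnel)
iface_mtu() {
  cat "/sys/class/net/$1/mtu"
}

# Example (not executed here): overlay_mtu_for "$(iface_mtu wg0)"
# With a WireGuard interface at 1420 (a common default), this gives 1370.
```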

A lower MTU has fixed my issue :slight_smile: !

Thanks a lot for the time you took to help me!
