Traefik High Availability, Docker Swarm (3 managers, 1 worker)

Greetings,
We need help with Docker networking, specifically with Docker Swarm.

We aim to set up a highly available (HA) Traefik Proxy as the entry point to our system from the public network over ports 80 and 443 (HTTP/HTTPS).
Traefik will be deployed on three manager nodes, with a load balancer in front of our network to distribute traffic. If one manager node goes down, the others should handle incoming requests.

Our current VM setup is a Docker Swarm with 3 manager nodes (Traefik) and 1 worker node (nginx, for now).

On the manager nodes we deployed Traefik as a Docker Swarm service in global deployment mode, and on the worker node we deployed nginx, which will serve our page.
Traefik on all nodes has ports 80 and 443 exposed to the host, so incoming HTTPS requests are passed to Traefik, which then forwards them to the nginx service on the worker node.

The problem we are facing is that only one Traefik instance successfully accepts connections and forwards them to the nginx service.
If we route all HTTPS requests to the Traefik instance on manager node A, everything works as expected. However, if we route requests to the Traefik instance on manager node B, the requests are not forwarded to the nginx service on the worker node, resulting in a "Bad Gateway" error.

I can ping the nginx service from all Traefik containers, but if I try to use telnet to connect to ports 80/443, the connection works only from the Traefik instance on manager node A.

Traefik stack file:

version: '3.3'
services:
  traefik:
    image: traefik:v3.2.0
    ports:
      - mode: host
        protocol: tcp
        published: 443
        target: 443
      - mode: host
        protocol: tcp
        published: 80
        target: 80
      - mode: host
        protocol: tcp
        published: 5432
        target: 5432
    deploy:
      mode: global
      placement:
        constraints:
          - "node.role==manager"
      labels:
        - traefik.enable=true
        - traefik.docker.network=traefik_traefik-public
        - traefik.http.middlewares.https-redirect.redirectscheme.scheme=https
        - traefik.http.middlewares.https-redirect.redirectscheme.permanent=true
        - traefik.http.routers.traefik-public-http.rule=Host(`traefik.dev.si`)
        - traefik.http.routers.traefik-public-http.entrypoints=http
        - traefik.http.routers.traefik-public-http.middlewares=https-redirect
        - traefik.http.routers.traefik-public-https.rule=Host(`traefik.dev.si`)
        - traefik.http.routers.traefik-public-https.entrypoints=https
        - traefik.http.routers.traefik-public-https.tls=true
        - traefik.http.routers.traefik-public-https.service=api@internal
        - traefik.http.services.traefik-public.loadbalancer.server.port=8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - traefik-public

networks:
  traefik-public:
    driver: overlay
    ipam:
      config:
        - subnet: 10.0.10.0/24

Nginx stack file:

version: '3.3'
services:
  nginx:
    image: nginx
    networks:
      - traefik-public
    container_name: nginx
    volumes:
      - app-tmp-public:/var/www/gms/current/public/tmp
    logging:
      driver: json-file
      options:
        max-size: 50m
        max-file: 3
    deploy:
      placement:
        constraints:
          - "node.role!=manager"
      labels:
        traefik.enable: 'true'
        traefik.http.routers.dev-web.rule: 'Host(`test.dev.net`)'
        traefik.http.services.dev-web.loadbalancer.server.port: '80'
        traefik.http.routers.dev-web.entrypoints: 'https'
        traefik.http.routers.dev-web.tls: 'true'
      update_config:
        order: start-first
        failure_action: rollback
        delay: 5s

Docker Swarm Cluster setup
Manager A (Traefik :80 :443)
Manager B (Traefik :80 :443)
Manager C (Traefik :80 :443)
Worker A (nginx)

Overlay network: traefik-public

Works
Public -> Manager A Traefik -> Worker A nginx
Doesn't work
Public -> Manager B Traefik -> Worker A nginx

I would create the Docker overlay network up front, so it is independent of the Traefik stack and the other target services, then reference it with external: true in compose.
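
A minimal sketch of that approach, assuming the network keeps the name traefik-public used in the stacks above and is created once by hand on a manager node (the command in the comment is one way to do it). Both stack files then only point at the existing network instead of defining it:

```yaml
# Created once on a manager node, outside of any stack, for example:
#   docker network create --driver overlay --attachable traefik-public
# Each stack file (Traefik and nginx) then only references the existing network.
networks:
  traefik-public:
    external: true
```

With an external network no stack prefix is added, so the name is the same in every stack that attaches to it.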

Make sure the Docker Overlay network MTU is set correctly. If you have some VLAN/vSwitch setup, the usable MTU may be smaller. This usually results in "Bad Gateway". Test with ping with payload > 1500 bytes.

Make sure the Docker Overlay network uses a /16 subnet. A /24 only gives you about 254 usable IPs in the network, which might create issues when you run many services.
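
As a rough illustration of the MTU and subnet points, this is how both could be set on the overlay network in a stack file. The values 1400 and 10.10.0.0/16 are placeholders, not taken from the setup above; the right MTU depends on what the underlying VM network actually allows.

```yaml
# Sketch only: 1400 and 10.10.0.0/16 are assumed example values.
networks:
  traefik-public:
    driver: overlay
    driver_opts:
      # Overlay traffic is VXLAN-encapsulated (roughly 50 bytes of overhead),
      # so the container MTU has to stay below the MTU of the underlying link.
      com.docker.network.driver.mtu: "1400"
    ipam:
      config:
        # A /16 leaves far more room for tasks and service VIPs than a /24.
        - subnet: 10.10.0.0/16
```

If the network is created up front instead, the same settings should map to the --opt com.docker.network.driver.mtu=1400 and --subnet 10.10.0.0/16 options of docker network create. An existing network has to be removed and recreated for either change to take effect.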

Make sure to set Traefik docker.network in case you use more than one. Note that networks created within compose usually get a project prefix, so use name: in compose.
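
A small sketch of the naming point, assuming a compose file format of 3.5 or later (the name: key on networks is not available in 3.3) and reusing the service and label names from the nginx stack above:

```yaml
services:
  nginx:
    image: nginx
    networks:
      - traefik-public
    deploy:
      labels:
        - traefik.enable=true
        # Must be the real network name as shown by `docker network ls`.
        - traefik.docker.network=traefik-public

networks:
  traefik-public:
    # Fixed name: without this the stack name is prepended,
    # e.g. "traefik_traefik-public" for a stack called "traefik".
    name: traefik-public
    driver: overlay
```

The network value in the Traefik static config should then use the same name. As far as I know, Traefik v3 with the swarm provider also accepts the label as traefik.swarm.network, so check the docs for the version you run.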

Finally, note that Traefik CE does not have clustered LetsEncrypt support. So you can only use LetsEncrypt with dnsChallenge for an individual TLS cert per node. Or you create TLS certs with an external tool like certbot and supply them to all nodes regularly.
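
For the external-cert route, a dynamic config file loaded through Traefik's file provider could look roughly like this. The paths are hypothetical; the cert and key have to be present at those paths inside the Traefik container on every manager node.

```yaml
# Dynamic config (file provider), identical on all nodes.
# Paths are assumed example values.
tls:
  certificates:
    - certFile: /certs/example.crt
      keyFile: /certs/example.key
```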

Final note 2: move dynamic config like redirect and TLS to global entrypoint, saves a lot of labels. Compare to simple Traefik Swarm example:

  whoami:
    image: traefik/whoami:v1.10
    hostname: '{{.Node.Hostname}}'
    networks:
      - proxy
    deploy:
      mode: global
      labels:
        - traefik.enable=true
        - traefik.http.routers.whoami.rule=Host(`whoami.example.com`)
        - traefik.http.services.whoami.loadbalancer.server.port=80
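
For comparison, the matching static config with redirect and TLS declared once on the entrypoints might look like the sketch below. The entrypoint names web and websecure are assumptions for this example; which certificate is served still depends on your cert setup.

```yaml
# Static config sketch: redirect and TLS set per entrypoint,
# so individual routers need no redirect middleware or tls labels.
entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure
          scheme: https
  websecure:
    address: ":443"
    http:
      tls: {}
```

With this in place a service only needs the three labels shown for whoami: enable, router rule, and service port.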

Thank you for your advice. I tried that and it doesn't work.

Adding screenshots: the left manager node works, the right manager node doesn't.
I tried ping (works), telnet (doesn't work) and traceroute (doesn't work).


Inspect output of the network on the worker node where nginx is running.

Share your Traefik static config file.

My Traefik static config

providers:
  swarm:
    exposedbydefault: false
    #swarmmode: true
    network: traefik-proxy
  file:
    directory: "/config/dynamic_configs"
entrypoints:
  http:
    address: ":80"
    forwardedHeaders:
      insecure: true
    http:
      redirections:
        entryPoint:
          to: https
          scheme: https
  https:
    address: ":443"
    forwardedHeaders:
      insecure: true
  postgresql:
    address: ":5432"

serversTransport:
  insecureSkipVerify: true

accesslog:
  format: "json"
  addInternals: true
  filePath: "/log/access.log"
  fields:
    defaultmode: keep
    headers:
      defaultmode: keep

log:
  level: debug

api:
  dashboard: true

Your telnet does work, it’s "connected". It is normal you don’t see anything, you probably need to hit return to get a message.

You pinged with a payload; was that from within the manager B container to the nginx workload on worker A? Try ping -s 2000 nginx from inside the Traefik container on manager B.

It seems you use http-to-https redirect, but I don’t see any LetsEncrypt or custom TLS cert loading.

Telnet works on manager A (connected, left side of the screenshot), but it does not work on manager B (right side of the screenshot), where you can see that it never connects; I cancelled the command.
Traceroute also doesn't work from manager B.

In the first screenshot everything is executed inside the manager containers: left is manager A, right is manager B.

Ping was done from inside of manager A (left) and from manager B to nginx container.

For HTTPS I use a TLS configuration for certificates in the dynamic config.

It seems like a network issue to me. Make sure to get the VMs right and network connected (nodes), and then get Docker Swarm and the Overlay network right (containers). Maybe forums.docker.com can help.