Error 504 with Traefik on Docker Swarm - again

Hello, I'm reaching out for help tracking down a problem I've been having with Traefik for the past few days. I've found several similar topics on this forum, but none of them has solved my problem. It may well be a simple mistake on my side, so an outside perspective would be appreciated.

My Docker Swarm setup consists of 3 manager nodes and 5 worker nodes, with Traefik installed on all three manager nodes.

It is deployed via Ansible like this:

- name: Deploy the traefik service
  docker_swarm_service:
    name: traefik
    mode: global
    placement:
      constraints: 
        - node.role == manager
    publish:
    - { published_port: "80", target_port: "80", protocol: tcp, mode: host }
    mounts:
      - source: /var/run/docker.sock
        target: /var/run/docker.sock
        type: bind
    read_only: yes
    restart_config:
      condition: any
      delay: 30s
      max_attempts: 5
    networks:
      - "traefik_net"
    image: "{{ TRAEFIK_IMAGE }}"
    logging:
      driver: local
      options:
        "max-size": 10k
        "max-file": "3"
    args:
    - "--log.level=INFO"
    - "--log.format=json"
    - "--accesslog=true"
    - "--global.checknewversion=false"
    - "--api.dashboard=true"
    - "--ping=true"
    - "--entryPoints.web.address=:80"
    - "--entryPoints.mqtt.address=:8883"
    - "--ping.entryPoint=web"
    - "--providers.docker=true"
    - "--providers.docker.swarmmode=true"
    - "--metrics.prometheus=true"
    labels:
      traefik.enable: "true"
      traefik.swarmmode: "true"
      traefik.docker.network: "traefik_net"
      traefik.http.routers.traefik-public-http.rule: "Host(`{{ DOMAINE_TRAEFIK }}`)"
      traefik.http.routers.traefik-public-http.entrypoints: "web"
      traefik.http.routers.traefik-public-http.service: "api@internal"
      traefik.http.services.traefik-public.loadbalancer.server.port: "8080"
  when: inventory_hostname in groups['managers_tools'][0]

Beforehand, I created a network:

- name: Create the traefik network
  docker_network:
    name: traefik_net
    driver: overlay
    attachable: true
    internal: false

And finally, I deploy my service:

- name: Deploy the speedtest service
  docker_swarm_service:
    name: speedtest_service
    image: "{{ SPEEDTEST_IMAGE }}"
    placement:
      constraints: 
        - node.role == worker
    networks:
      - "traefik_net"
    env:
      LS_JAVA_OPTS: "-Xmx512m -Xms512m"
      TITLE: "SpeedTest"
    labels:
      traefik.enable: "true"
      traefik.swarmmode: "true"
      traefik.docker.network: "traefik_net"
      traefik.http.routers.mqtt-speedtest.rule: "Host(`{{ DOMAINE_REQUESTER }}`) && PathPrefix(`/speedtest`)" 
      traefik.http.routers.mqtt-speedtest.entrypoints: "web"
      traefik.http.services.mqtt-speedtest-service.loadbalancer.server.port: "80"
    replicas: "{{ SPEEDTEST_REPLICA }}"
  when: inventory_hostname in groups['managers_tools'][0]

From what I've understood, this kind of issue often stems from the Docker network, but mine seems to be correctly configured.

The errors observed in the Traefik logs are:


traefik.0.yefrjzphhdm3@server-manager3    | 192.170.0.26 - - [03/Jan/2024:13:10:10 +0000] "GET /speedtest HTTP/1.1" 504 15 "-" "-" 1537 "speedtest@docker" "http://10.0.4.19:80" 30000ms
traefik.0.mocv9qzoinqu@server-manager1    | 192.170.0.26 - - [03/Jan/2024:13:10:56 +0000] "GET /speedtest HTTP/1.1" 504 15 "-" "-" 1550 "speedtest@docker" "http://10.0.4.19:80" 30000ms

Thank you for any assistance you can provide.

(Sorry for my poor English, I'm using a translator.)

Well, what is Ansible doing and deploying? If you want to use Swarm, you would normally deploy with docker stack deploy. What does docker stack ls tell you?

In regular compose files, a network is declared as external when a previously created network should be used. Here with Ansible I only see the network being assigned to the service. Maybe it creates additional Docker networks that are not connected to each other. Try docker network ls.

With a Swarm compose file, the labels need to go inside the deploy section; I'm not sure how Ansible handles this.
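For comparison, the two compose conventions mentioned here could be sketched roughly as follows. This is a hypothetical docker-compose.yml, not something from the original deployment; the image name, host rule and label values are copied from the docker service inspect output posted later in this thread, and the service key `speedtest` is an illustrative name.

```yaml
# Hypothetical compose file, deployed with:
#   docker stack deploy -c docker-compose.yml speedtest
services:
  speedtest:
    image: registry.local.fr/speedtest:latest
    networks:
      - traefik_net
    deploy:
      replicas: 2
      placement:
        constraints:
          - node.role == worker
      # In Swarm mode the labels belong to the service, so they sit under deploy
      labels:
        traefik.enable: "true"
        traefik.docker.network: "traefik_net"
        traefik.http.routers.mqtt-speedtest.rule: "Host(`request.local.fr`) && PathPrefix(`/speedtest`)"
        traefik.http.routers.mqtt-speedtest.entrypoints: "web"
        traefik.http.services.mqtt-speedtest-service.loadbalancer.server.port: "80"

networks:
  traefik_net:
    # Reuse the overlay network that was created beforehand instead of creating a new one
    external: true
```

Without `external: true`, docker stack deploy would create a separate, stack-prefixed network, which is the kind of mismatch docker network ls would reveal.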

Which Traefik version are you using? In v3, the provider was renamed to providers.swarm.
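As a rough illustration of that rename, the provider arguments of the Ansible task might change along these lines. This is only a sketch based on my understanding of the v3 Swarm provider; the socket path is an assumption matching the bind mount used in the task above, and the flags should be checked against the documentation of the exact Traefik version in use.

```yaml
    args:
      # Traefik v2: Swarm support is a mode of the Docker provider
      # - "--providers.docker=true"
      # - "--providers.docker.swarmmode=true"
      # Traefik v3: Swarm is a separate provider (assumed socket path)
      - "--providers.swarm.endpoint=unix:///var/run/docker.sock"
```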

Ansible is used to deploy Docker Swarm on the various machines and start different services.

Here is the output of the 'docker service ls' command, which lists the services in Docker Swarm.

ID             NAME                MODE         REPLICAS   IMAGE                                                                  PORTS
pz3zxzy6ltwm   speedtest_service   replicated   2/2        registry.local.fr/speedtest:latest   
y6eoatyu1m4b   traefik             global       3/3        registry.local.fr/traefik:v2.9.9  

Here is the output of the 'docker network ls' command.

debian@server-manager1:~$ docker network ls
NETWORK ID     NAME              DRIVER    SCOPE
2013df323244   bridge            bridge    local
41032b92d9c6   docker_gwbridge   bridge    local
241c1c2991f4   host              host      local
pjullo0so1ik   ingress           overlay   swarm
53740ee8afea   none              null      local
pv99qk2ix0jo   traefik_net       overlay   swarm

Here is the information about the 'speedtest' service.

debian@server-manager1:~$ docker service inspect speedtest_service
[
    {
        "ID": "evu6gczq9px4nhpvnfczjkj0s",
        "Version": {
            "Index": 2522
        },
        "CreatedAt": "2024-01-03T18:11:39.601420319Z",
        "UpdatedAt": "2024-01-03T18:11:39.613847677Z",
        "Spec": {
            "Name": "speedtest_service",
            "Labels": {
                "traefik.docker.network": "traefik_net",
                "traefik.enable": "true",
                "traefik.http.routers.mqtt-speedtest.entrypoints": "web",
                "traefik.http.routers.mqtt-speedtest.rule": "Host(`request.local.fr`) \u0026\u0026 PathPrefix(`/speedtest`)",
                "traefik.http.services.mqtt-speedtest-service.loadbalancer.server.port": "80",
                "traefik.swarmmode": "true"
            },
            "TaskTemplate": {
                "ContainerSpec": {
                    "Image": "registry.local.fr/speedtest:latest",
                    "Env": [
                        "LS_JAVA_OPTS=-Xmx512m -Xms512m",
                        "TITLE=SpeedTest"
                    ],
                    "StopGracePeriod": 10000000000,
                    "DNSConfig": {},
                    "Isolation": "default"
                },
                "Resources": {},
                "RestartPolicy": {
                    "Condition": "any",
                    "Delay": 5000000000,
                    "MaxAttempts": 0
                },
                "Placement": {
                    "Constraints": [
                        "node.role == worker"
                    ]
                },
                "Networks": [
                    {
                        "Target": "pv99qk2ix0jomqovqwd7rwu6l"
                    }
                ],
                "ForceUpdate": 0,
                "Runtime": "container"
            },
            "Mode": {
                "Replicated": {
                    "Replicas": 2
                }
            },
            "UpdateConfig": {
                "Parallelism": 1,
                "FailureAction": "pause",
                "Monitor": 5000000000,
                "MaxFailureRatio": 0,
                "Order": "stop-first"
            },
            "RollbackConfig": {
                "Parallelism": 1,
                "FailureAction": "pause",
                "Monitor": 5000000000,
                "MaxFailureRatio": 0,
                "Order": "stop-first"
            }
        },
        "Endpoint": {
            "Spec": {},
            "VirtualIPs": [
                {
                    "NetworkID": "pv99qk2ix0jomqovqwd7rwu6l",
                    "Addr": "10.0.4.23/24"
                }
            ]
        }
    }
]

Here is the information about the 'traefik' service.

debian@server-manager1:~$ docker service inspect traefik
[
    {
        "ID": "y6eoatyu1m4b7ch7hjs4nbbml",
        "Version": {
            "Index": 2456
        },
        "CreatedAt": "2024-01-03T11:31:06.577713773Z",
        "UpdatedAt": "2024-01-03T11:31:06.580306843Z",
        "Spec": {
            "Name": "traefik",
            "Labels": {
                "traefik.docker.network": "traefik_net",
                "traefik.enable": "true",
                "traefik.http.routers.traefik-public-http.entrypoints": "web",
                "traefik.http.routers.traefik-public-http.rule": "Host(`traefik.local.fr`)",
                "traefik.http.routers.traefik-public-http.service": "api@internal",
                "traefik.http.services.traefik-public.loadbalancer.server.port": "8080",
                "traefik.swarmmode": "true"
            },
            "TaskTemplate": {
                "ContainerSpec": {
                    "Image": "registry.local.fr/traefik:v2.9.9",
                    "Args": [
                        "--log.level=INFO",
                        "--log.format=json",
                        "--accesslog=true",
                        "--global.checknewversion=false",
                        "--api.dashboard=true",
                        "--ping=true",
                        "--entryPoints.web.address=:80",
                        "--entryPoints.mqtt.address=:8883",
                        "--ping.entryPoint=web",
                        "--providers.docker=true",
                        "--providers.docker.swarmmode=true",
                        "--metrics.prometheus=true"
                    ],
                    "ReadOnly": true,
                    "Mounts": [
                        {
                            "Type": "bind",
                            "Source": "/var/run/docker.sock",
                            "Target": "/var/run/docker.sock"
                        }
                    ],
                    "StopGracePeriod": 10000000000,
                    "DNSConfig": {},
                    "Isolation": "default"
                },
                "Resources": {},
                "RestartPolicy": {
                    "Condition": "any",
                    "Delay": 30000000000,
                    "MaxAttempts": 5,
                    "Window": 0
                },
                "Placement": {
                    "Constraints": [
                        "node.role == manager"
                    ]
                },
                "Networks": [
                    {
                        "Target": "pv99qk2ix0jomqovqwd7rwu6l"
                    }
                ],
                "LogDriver": {
                    "Name": "local",
                    "Options": {
                        "max-file": "3",
                        "max-size": "10k"
                    }
                },
                "ForceUpdate": 0,
                "Runtime": "container"
            },
            "Mode": {
                "Global": {}
            },
            "UpdateConfig": {
                "Parallelism": 1,
                "FailureAction": "pause",
                "Monitor": 5000000000,
                "MaxFailureRatio": 0,
                "Order": "stop-first"
            },
            "RollbackConfig": {
                "Parallelism": 1,
                "FailureAction": "pause",
                "Monitor": 5000000000,
                "MaxFailureRatio": 0,
                "Order": "stop-first"
            },
            "EndpointSpec": {
                "Mode": "vip",
                "Ports": [
                    {
                        "Protocol": "tcp",
                        "TargetPort": 80,
                        "PublishedPort": 80,
                        "PublishMode": "host"
                    }
                ]
            }
        },
        "Endpoint": {
            "Spec": {
                "Mode": "vip",
                "Ports": [
                    {
                        "Protocol": "tcp",
                        "TargetPort": 80,
                        "PublishedPort": 80,
                        "PublishMode": "host"
                    }
                ]
            },
            "Ports": [
                {
                    "Protocol": "tcp",
                    "TargetPort": 80,
                    "PublishedPort": 80,
                    "PublishMode": "host"
                }
            ],
            "VirtualIPs": [
                {
                    "NetworkID": "pv99qk2ix0jomqovqwd7rwu6l",
                    "Addr": "10.0.4.2/24"
                }
            ]
        }
    }
]

We can see that they both use the same network ID (corresponding to 'traefik_net').

I am using Traefik version 2.9.9.

Thank you for your assistance.

Any VPN/VLAN/VSwitch between the nodes? If yes, did you set the MTU for the overlay network accordingly (usually 1400 or below)?

This is a tricky one to find, as ping will work and even curl to whoami can work; only requests larger than about 1400 bytes will fail.
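If the MTU does turn out to be the cause, the overlay network could be created with a smaller MTU. A minimal sketch, assuming the same docker_network task as in the first post; the value 1400 is only illustrative and should be chosen below the MTU of the underlying link.

```yaml
- name: Create the traefik network with a reduced MTU
  docker_network:
    name: traefik_net
    driver: overlay
    attachable: true
    internal: false
    driver_options:
      # Illustrative value; pick one below the MTU of the VPN/VLAN link
      com.docker.network.driver.mtu: "1400"
```

Docker does not change the options of an existing network in place, so a network that already exists would generally have to be removed and recreated for a different MTU to take effect.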

Has your Ansible deployment ever worked?

Maybe try a simple Traefik Swarm example with docker stack deploy -c docker-compose.yml traefik.

Yes, the Swarm deployment is working (there are other services running on it).

I have tried accessing the services directly via HTTP (using the machine's IP instead of the virtual network), and everything is working correctly.

The issue is specifically with Traefik; everything else is OK.

What do the Traefik dashboard and the Traefik debug log tell you?

On the dashboard, everything is okay, there are no errors, and the routes are correct.
In the Traefik logs, I only see these errors:

traefik.0.tjhdyzqxysdu@server-manager1 | 192.170.0.26 - - [04/Jan/2024:10:14:51 +0000] "GET /speedtest HTTP/1.1" 504 15 "-" "-" 5549 "speedtest@docker" "http://10.0.6.31:80" 30001ms
traefik.0.ldqpdyaued39@server-manager2 | 192.170.0.16 - - [04/Jan/2024:10:15:29 +0000] "GET /ping HTTP/1.0" 200 2 "-" "-" 5556 "ping@internal" "-" 0ms
traefik.0.m8i57h0jk7jt@server-manager3 | 192.170.0.26 - User [04/Jan/2024:10:15:25 +0000] "GET /api/overview HTTP/1.1" 200 507 "-" "-" 5555 "traefik-public-http@docker" "-" 2ms

Use 3 backticks before and after code and logs.

It seems that is just the access log, not the Traefik debug log. Check the debug log to see which services are discovered with which IPs, whether those IPs are actually used by the target containers, and whether they are reachable by Traefik.

You could also switch the access log to JSON format to get more information.
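In terms of the Ansible task from the first post, both suggestions would amount to changing a few entries in the args list. A sketch, assuming Traefik v2 static configuration flags; all other arguments stay as they are.

```yaml
    args:
      # Debug level makes Traefik log the services and server IPs it discovers
      - "--log.level=DEBUG"
      - "--log.format=json"
      - "--accesslog=true"
      # JSON access log entries carry more fields than the default format
      - "--accesslog.format=json"
```

The debug level is verbose, so it is probably worth switching back to INFO once the problem is found, especially with the 10k log size limit configured above.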

I've just found my problem. It wasn't related to Traefik but to ports that were not open between the machines. After opening the following ports, everything is now working.

Port 2377 TCP 
Port 7946 TCP/UDP
Port 4789 UDP 
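According to the Docker Swarm documentation, 2377/tcp is used for cluster management, 7946/tcp and 7946/udp for communication among nodes, and 4789/udp for overlay network (VXLAN) traffic. With 4789/udp blocked, Traefik on a manager presumably could not reach containers on the workers over traefik_net, which would fit the 30-second 504 timeouts. A sketch of how these ports might be opened from the same Ansible playbook, assuming the hosts use ufw and the community.general collection is installed (neither is stated in this thread):

```yaml
- name: Open the ports Docker Swarm needs between nodes
  community.general.ufw:   # assumption: ufw is the firewall on these hosts
    rule: allow
    port: "{{ item.port }}"
    proto: "{{ item.proto }}"
  loop:
    - { port: "2377", proto: tcp }   # cluster management
    - { port: "7946", proto: tcp }   # communication among nodes
    - { port: "7946", proto: udp }   # communication among nodes
    - { port: "4789", proto: udp }   # overlay network (VXLAN) traffic
```

In practice it is probably sensible to restrict these rules to the addresses of the other Swarm nodes rather than allowing them from anywhere.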

Thank you, bluepuma77, for your help.

And I apologize for any inconvenience.
