Noob connectivity issues when connecting to backend services

I have an annoying problem that is perhaps related to something I do not understand.
I have a simple Docker Swarm installation: one manager, two workers.
I want to set up Traefik on our manager to listen on port 443.
I don't use Let's Encrypt (or ACME in general), because I have signed keys and, of course, a CA cert. I don't think that is relevant at the moment, though.
I wanted to set up a simple REST service that will be load-balanced by Traefik.

I prepared a docker-compose file:

version: '3'

services:
  traefik:
    image: traefik:v2.6
#    network_mode: "host"
    command:
      - "--api.insecure=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=true"
      - "--entrypoints.websecure.address=:443"
      - "--providers.docker.swarmMode=true"
      - "--entrypoints.web.http.redirections.entryPoint.to=websecure"
      - "--entrypoints.web.http.redirections.entryPoint.scheme=https"
      - "--log.level=DEBUG"
      - "--providers.file.directory=/config/"
      - "--providers.file.watch=true"
    secrets:
      - ca_cert.crt
      - traefik_cert.key
      - traefik_cert.crt
      - basic-auth
    configs:
      - source: tls_config
        target: /config/traefik.yml
    network_mode: "host"

    ports:
      - "443:443"
      - 80:8080
    networks:
      - traefik-net
    volumes:
      - "traefik-config:/config/"
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "traefik-certs:/run/secrets/"
    deploy:
        placement:
          constraints:
            - node.role == manager

volumes:
  traefik-certs:
  traefik-config:

secrets:
  basic-auth:
    file: ./secrets/htpasswd
  ca_cert.crt:
    file: /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
  traefik_cert.crt:
    file: ./secrets/myorg.net
  traefik_cert.key:
    file: ./secrets/myorg.net

configs:
  tls_config:
    file: $PWD/config/traefik.yml


networks:
  traefik-net:
    driver: overlay
    attachable: true

My idea is to create an internal overlay network that I would attach my services to, so that they would dynamically alter the configuration.
My Traefik configuration looks like this:

api:
  dashboard: true
  insecure: true
tls:
  certificates:
    certFile: /run/secrets/traefik_cert.crt
    keyFile: /run/secrets/traefik_cert.key
    stores:
      - default
tls:
  options:
    default:
      clientCAFiles:
        - /run/secrets/ca_cert.crt
      sniStrict: true

log:
    level: DEBUG

providers:
  docker:
    swarmMode: true
    watch: true
    exposedByDefault: true
#    network: traefik_traefik-net

tls:
  stores:
    default:
      defaultCertificate:
        certFile: /run/secrets/traefik_cert.crt
        keyFile: /run/secrets/traefik_cert.key

certificatesResolvers:
  certres:
    acme: false
    httpChallenge:
      entryPoint: http
    tlsChallenge:
      entryPoint: websecure

#    defaultChallenge: "tls-sni-01"

I create a stack:

docker stack deploy --with-registry-auth --compose-file docker-compose.yml traefik

So I created a service:

  cpl-test:
    image: cye.myorg.net:5000/cpl-backend:1582
    deploy:
      placement:
          constraints:
            - node.role == worker

      replicas: 2
      labels:
        - traefik.http.routers.cpl-test.rule=Host(`test-cpl.myorg.net`)
        - traefik.http.routers.cpl-test.entrypoints=websecure
        - traefik.http.routers.cpl-test.tls=true
        - traefik.http.routers.cpl-test.service=cpl-test
        - traefik.http.services.cpl-test.loadbalancer.server.port=8091
    networks:
        - traefik_traefik-net
    environment:
      - "SPRING_PROFILES_ACTIVE=test"
    ports:
      - 8087:8091
    volumes:
      - /opt/cplPortal/configuration/test:/config
      - /opt/cplPortal/logs:/logs
networks:
  traefik_traefik-net:
    external: true

My services seem to start OK. I can see something like this in the logs:

 time="2023-03-08T15:26:58Z" level=debug msg="Configuration received from provider docker: {\"http\":{\"routers\":{\"cpl-test\":{\"entryPoints\":[\"websecure\"],\"service\":\"cpl-test\",\"rule\":\"Host(`test-cpl.myorg.net`)\",\"tls\":{}}},\"services\":{\"cpl-test\":{\"loadBalancer\":{\"servers\":[{\"url\":\"http://10.0.0.88:8091\"},{\"url\":\"http://10.0.0.89:8091\"}],\"passHostHeader\":true}}}},\"tcp\":{},\"udp\":{}}" providerName=docker

And now we get to the point where things happen that I do not fully understand:
It seems the hosts 10.0.0.88 and 10.0.0.89 are not reachable by Traefik (they do not seem to be reachable on either port, 8091 or 8087). I ran docker exec -it dockerImage sh and then tried to check the hosts with netcat; neither is reachable.
I suppose I'm missing something, but I have not found out what.
By the way, if I ssh to the hosts where my workers run, I can freely connect to localhost:8087 (as that is the port exposed by the service).

Things I already tried:
traefik.http.services.cpl-test.loadbalancer.server.port: I changed that back and forth between 8087 and 8091 (I assume it should be 8091).

My question is, what am I doing wrong?

First, I would simplify the config by naming your Docker network:

networks:
  proxy:
    name: proxy

That way it does not change when using compose (with different services) and it stays an easy name :slight_smile:

Second, you have static config in both the Traefik command and your Traefik static config file (usually traefik.yml with entrypoints, logs, certresolver). You can only use one of them.

Third, you have TLS in your static config. TLS cert files need to be loaded in a dynamic config file, which is loaded via providers.file in the static config.
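
As a rough sketch of the third point, a dynamic config file for the certificates could look like the following. It assumes the secret paths from the original post and that the file sits in the directory passed to providers.file.directory (here /config/); the file name is a hypothetical choice.

```yaml
# /config/tls.yml (hypothetical file name), picked up by the file provider
tls:
  certificates:
    # certificates is a list, so each cert/key pair is a list item
    - certFile: /run/secrets/traefik_cert.crt
      keyFile: /run/secrets/traefik_cert.key
  stores:
    default:
      # served when no other certificate matches the requested host
      defaultCertificate:
        certFile: /run/secrets/traefik_cert.crt
        keyFile: /run/secrets/traefik_cert.key
```

Nothing in this file belongs in the static config, so the api, log, providers and certificatesResolvers sections stay out of it.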

Fourth, you have an entrypoints.web redirect, but the web entrypoint is never declared.
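
One possible way to handle the second and fourth points together, as a sketch only: keep the whole static config in the command flags and declare web before redirecting from it. The flags are the ones already used in this thread; the image tag is an assumption, and the rest of the service definition is left out.

```yaml
services:
  traefik:
    image: traefik:v2.9   # assumed tag, any v2 release should accept these flags
    command:
      # declare both entrypoints before referring to them
      - --entrypoints.web.address=:80
      - --entrypoints.websecure.address=:443
      # redirect plain HTTP on web to HTTPS on websecure
      - --entrypoints.web.http.redirections.entryPoint.to=websecure
      - --entrypoints.web.http.redirections.entryPoint.scheme=https
      - --providers.docker.swarmMode=true
      # dynamic config (for example TLS certificates) is read from this directory
      - --providers.file.directory=/config/
      - --providers.file.watch=true
      - --log.level=DEBUG
```

With everything in the flags, no static traefik.yml is needed, so there is only one source of static config.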

I tried to simplify the configuration and came up with something like this:

version: "3.8"
services:
  traefik:
    image: "traefik:v2.9"
    command:
      - --entrypoints.web.address=:80
      - --providers.docker
      - --providers.docker.swarmMode=true
      - "--log.level=DEBUG"
    ports:
      - "80:80"
      - "8080:8080"
    networks:
      - proxy
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    deploy:
        placement:
          constraints:
            - node.role == manager
  cpl-test:
    image: cye.myorg.net:5000/cpl-backend:1582
    deploy:
      replicas: 2
      labels:
        - traefik.enable=true
        - traefik.http.routers.cpl-test.rule=Host(`test-cpl.myorg.net`)

        - traefik.http.services.cpl-test.loadbalancer.server.port=8091
    networks:
        - proxy
    environment:
      - "SPRING_PROFILES_ACTIVE=test"
    ports:
      - 8087:8091
    volumes:
      - /opt/cplPortal/configuration/test:/config
      - /opt/cplPortal/logs:/logs
networks:
  proxy:
    name: proxy
    driver: overlay
    attachable: true

I see my endpoints being detected:

traefik_traefik.1.wcvduw21a2j4@si0vm08509 | time="2023-03-09T09:58:15Z" level=debug msg="Configuration received: {\"http\":{\"routers\":{\"cpl-test\":{\"service\":\"cpl-test\",\"rule\":\"Host(test-cpl.myorg.net)\"}},\"services\":{\"cpl-test\":{\"loadBalancer\":{\"servers\":[{\"url\":\"http://10.0.0.81:8091\"},{\"url\":\"http://10.0.0.78:8091\"}],\"passHostHeader\":true}}}},\"tcp\":{},\"udp\":{}}" providerName=docker

Now I connect to the Docker container running Traefik and try to check the connection:
/ # nc -w2 -vv 10.0.0.78 8091
nc: 10.0.0.78 (10.0.0.78:8091): Operation timed out
/ # nc -w2 -vv 10.0.0.81 8091
nc: 10.0.0.81 (10.0.0.81:8091): Operation timed out
sent 0, rcvd 0
/ # nc -w2 -vv 10.0.0.81 8087
nc: 10.0.0.81 (10.0.0.81:8087): Operation timed out
sent 0, rcvd 0
I have the firewalls disabled, so they should not get in the way.

Is your backend an HTTP service? If yes, can you try wget? Also try ping.

Are you connecting your nodes over a vSwitch? Then you should look into setting a smaller MTU for your Docker overlay network.
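
A minimal sketch of what a smaller MTU could look like in the compose file, assuming the proxy network from earlier in the thread. The value 1400 is only an illustrative guess; the right number depends on the overhead of the underlying network.

```yaml
networks:
  proxy:
    name: proxy
    driver: overlay
    attachable: true
    driver_opts:
      # assumed value; it should be lower than the MTU of the host interfaces
      com.docker.network.driver.mtu: "1400"
```

Driver options are applied when the network is created, so an existing proxy network would have to be removed and recreated for the change to take effect.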

I tried that. What seems odd is that pings work (I can see them when I run tcpdump on the destination host), but I cannot see the TCP connections. I'm fairly sure it is not the firewall...

Is 8091 the right internal port? :laughing:

I think there is something wrong with my Docker installation. Honestly, I cannot get any service to communicate over an overlay network. I can ping one container from another, but cannot create a TCP connection.

I have the same issue. Did it ever get resolved?

Share your full Traefik static and dynamic config, and docker-compose.yml if used.

Check the simple Traefik Swarm example.

Hi,
yes, the problem was with overlay networks in virtual machines that run on ESX.
It seems there is a bug in the virtual network driver. Check how many dropped packets there are on your system. In my case it helped to disable the checksum verification; after that it started to work as I needed.

The same setup works on AWS, but when I attempt it in my home lab it fails, so I believe I have the same overlay network issue.

How do I disable the checksum verification? Is it done on the machine or on ESXi?

There are no issues with the docker compose file and the service file; they work fine (tested on AWS).

But it looks like an overlay network issue. Is there any way to troubleshoot that?
Attached is my VM network config (ESXi 8).