Connection refused trying to connect to traefik from inside containers

Hey,

I'm using traefik as a reverse proxy in front of two services running in docker containers.
One is mycustomservice and one is keycloak.
It basically works but after some time traefik refuses any connections to it from mycustomservice. Hence mycustomservice loses access to keycloak.

This is what my docker-compose.yml file looks like:

version: "3"
services:
  traefik:
    image: traefik:v2.5.3
    restart: unless-stopped
    command:
      - "--pilot.dashboard=false"
      - "--log.level=DEBUG"
      - "--api.dashboard=true"
      - "--api.insecure=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
    ports:
      - "80:80"
      - "8080:8080" 
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"

  mycustomservice:
    image: alpine
    restart: unless-stopped
    environment:
      OIDC_AUTHORITY: ${SERVER_URL}/auth/realms/master
    command: |
      sh -c "START=$$(date +%s); while true; do wget -O - $${OIDC_AUTHORITY}/protocol/openid-connect/userinfo; echo \"Elapsed: $$(($$(date +%s) - $${START})) seconds\"; sleep 0.1; done"
    labels:
      traefik.enable: "true"
      traefik.http.routers.mycustomservice.rule: PathPrefix(`/`)
      traefik.http.routers.mycustomservice.entrypoints: web
    depends_on:
      - keycloak
    volumes:
      - ./data:/data

  keycloak:
    image: quay.io/keycloak/keycloak:15.0.2
    restart: unless-stopped
    environment:
      DB_USER: keycloak
      DB_PASSWORD: keycloak
      DB_VENDOR: POSTGRES
      DB_ADDR: keycloak-postgres

      KEYCLOAK_USER: ${KEYCLOAK_ADMIN_USER:-admin}
      KEYCLOAK_PASSWORD: ${KEYCLOAK_ADMIN_PASSWORD:-admin}

      PROXY_ADDRESS_FORWARDING: "true"
    labels:
      traefik.enable: "true"
      traefik.http.routers.keycloak.rule: PathPrefix(`/auth`)
      traefik.http.routers.keycloak.entrypoints: web
    depends_on:
      - keycloak-postgres

  keycloak-postgres:
    image: postgres:14.0
    restart: unless-stopped
    environment:
      POSTGRES_DB: keycloak
      POSTGRES_USER: keycloak
      POSTGRES_PASSWORD: keycloak
    volumes:
      - ./data_keycloak_postgres/:/var/lib/postgresql/data

When I docker exec into the mycustomservice container, I can ping google.com or the host ip just fine.
Just connecting to traefik on port 80 of the host ip (for example with telnet) fails with connection refused.
Connecting to traefik on port 80 from outside of the container still works fine.
Restarting the mycustomservice container has no effect; restarting traefik makes it work again ... for some time. Sometimes it works for only a few minutes, sometimes for 10-20 minutes, but usually not longer than that.

When this happens, other containers from the same docker-compose network lose access too. Thus I'm not 100% positive this is a traefik issue at all, it might be a docker issue of some sort?
edit: it's not only the same docker-compose network - it's all containers on the same host.

I'm obviously not using the ratelimiting middleware and cannot see anything that looks even slightly relevant in the logs.
edit: I've confirmed there is no log entry when my connection attempt is refused.

By now I'm totally lost on what to search for or how to further debug this.
All threads I find are about setups that don't work at all, not ones that break over time. It also doesn't help that I haven't found a reliable way to reproduce the behavior "quickly", not even with docker-compose up --scale mycustomservice=50, so it doesn't seem to be a load problem per se.
I'm wondering if the real application (and not my placeholder script) maybe does not close connections correctly, any idea if that could cause issues or how to analyze if that could be related?
But then why doesn't restarting the application container help, while restarting traefik does? None of this makes any sense to me...
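
One possible way to check the "connections are not closed" theory is to count TCP sockets by state inside the application container's network namespace. The following is a minimal sketch in Python (an assumption, since the thread does not state the application's language, and the alpine image does not ship Python). It parses the Linux /proc/net/tcp table, which covers IPv4 only; /proc/net/tcp6 uses the same layout. The names `count_tcp_states` and `TCP_STATES` are illustrative.

```python
from collections import Counter
from typing import Optional

# TCP state codes as they appear in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp: str, remote_port: Optional[int] = None) -> Counter:
    """Count sockets per TCP state, optionally only those talking to remote_port."""
    counts = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        remote, state = fields[2], fields[3]
        # The remote address is "HEXIP:HEXPORT"; only the port is needed here.
        port = int(remote.rsplit(":", 1)[1], 16)
        if remote_port is not None and port != remote_port:
            continue
        counts[TCP_STATES.get(state.upper(), state)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Pass e.g. /proc/net/tcp as an argument; without arguments nothing is read.
    for path in sys.argv[1:]:
        with open(path) as fh:
            print(path, dict(count_tcp_states(fh.read(), remote_port=80)))
```

If the number of ESTABLISHED or CLOSE_WAIT sockets towards port 80 keeps growing while the application runs, that would hint at connections that are never closed. This is only a diagnostic aid, not a confirmed explanation for the behavior described here.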

Hope you can help me :slight_smile:

edit:
The host is a pretty vanilla Ubuntu 20.04 installation with Docker version 20.10.8, build 3967b7d28e

FWIW

/ # telnet ${SERVER_IP} 8080
Connected to ${SERVER_IP}

Connecting to the dashboard still works perfectly fine ... :confused:

edit:
Same behavior when using 172.17.0.1, the docker-internal IP of the host. Connecting to 8080 works fine, 80 doesn't.
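
To make the port 80 versus port 8080 comparison repeatable, a small TCP probe can replace the manual telnet calls. This is a minimal sketch in Python, which is an assumption: the alpine image used above does not include Python, so it would have to run in a container that does. The function name `probe` and its return strings are illustrative.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Attempt a TCP connect and report how it ended."""
    try:
        # create_connection performs the full TCP handshake, like telnet does.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer answered with a reset: nothing accepts connections there.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (e.g. packets silently dropped).
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    import sys

    # Pass one or more hosts as arguments, e.g. 172.17.0.1 or the host IP.
    for host in sys.argv[1:]:
        for port in (80, 8080):
            print(host, port, probe(host, port))
```

Running this in a loop from a second container while the reproducer is active would show when port 80 flips from "open" to "refused" and whether 8080 stays "open" throughout.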

Hello @dschmidt

Thanks for using Traefik and asking the question on our forum.

Once that situation is happening, do you find any relevant information in the log file?

Hey @jakubhajek,

thanks for coming back to me!

As I said, when I make a "blocked" request I don't see anything in the logs. I have found a fairly easy/quick reproducer by now, so let me check whether I can find anything in the log when it flips from working to not-working.

FWIW repeatedly using this function in my app is all I need to do to trigger the behavior:

Looks like a pretty standard HTTP request to me. I cannot trigger the problem with the wget invocation from my docker-compose.yml above.

This is a complete log file of one reproducing run: https://gist.githubusercontent.com/dschmidt/cefd4d47833053b722969fb0571caa39/raw/96fd6a0cfba5f96df59f6b01d021a6d70dea210f/gistfile1.txt

As you can see there is always a request to /api/v1/permissions, followed by a request to /auth/realms/master/protocol/openid-connect/userinfo until it just stops. There are 7 requests to /api/v1/permissions without a followup request to the userinfo endpoint.

server_1             | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": read tcp 172.26.0.5:48838->10.100.22.175:80: read: connection reset by peer"
server_1             | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": dial tcp 10.100.22.175:80: connect: connection refused"
server_1             | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": dial tcp 10.100.22.175:80: connect: connection refused"
server_1             | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": dial tcp 10.100.22.175:80: connect: connection refused"
server_1             | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": dial tcp 10.100.22.175:80: connect: connection refused"
server_1             | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": dial tcp 10.100.22.175:80: connect: connection refused"
server_1             | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": dial tcp 10.100.22.175:80: connect: connection refused"
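
The flip from "connection reset by peer" to "connection refused" in the lines above can be located in a longer log with a small filter. The following Python sketch is only an illustration: the function names `classify` and `summarize` are made up, and the two matched substrings are taken from the error messages shown in this thread.

```python
from collections import Counter

# Substrings taken from the error messages quoted in this thread.
ERROR_KINDS = {
    "connection reset by peer": "reset",
    "connection refused": "refused",
}


def classify(line: str):
    """Return 'reset', 'refused', or None for a single log line."""
    for needle, kind in ERROR_KINDS.items():
        if needle in line:
            return kind
    return None


def summarize(lines):
    """Count each error kind and remember the line number of its first occurrence."""
    counts = Counter()
    first_seen = {}
    for number, line in enumerate(lines, start=1):
        kind = classify(line)
        if kind is None:
            continue
        counts[kind] += 1
        first_seen.setdefault(kind, number)
    return counts, first_seen


if __name__ == "__main__":
    import sys

    # Pass saved log files as arguments; without arguments nothing is read.
    for path in sys.argv[1:]:
        with open(path) as fh:
            counts, first_seen = summarize(fh)
        print(path, dict(counts), first_seen)
```

The first "reset" line marks the last connection that was still accepted and then dropped; everything after the first "refused" line never got a connection at all.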

Slightly off topic: what kind of solution are you trying to build?

PS: Are you aware of the OpenID Connect middleware that is already available for Traefik Enterprise?

It's a self hosted application for managing laboratory equipment - I'm trying to provide an example for easy deployment to get started and play around with it.

So I'm not really bound to keycloak or anything; it's just a placeholder (in fact I've been using Kopano Konnect so far, which is just a bit messier to set up for a simple example because it needs LDAP).
Sooo ... Traefik Enterprise would be interesting to test my application with, but not really relevant for this documentation example I'd say, sorry :slight_smile:

Thanks for explaining that. I asked that question because I noticed that you set up the environment with OpenID and Keycloak so I was just curious.

So if you are just developing the labs, I would try to use much simpler test applications, e.g. the well-known traefik/whoami, instead of configuring more complex stuff.

Maybe that example can be useful: traefik-proxy/basic-docker-compose at master · jakubhajek/traefik-proxy · GitHub

Well, my config works initially. It just stops working after some time, and that's what I can't make any sense of.

Good news for you, bad news for me: I can reproduce this issue with nginx and even without a proxy in between just as well.

So this is 99% not a traefik issue but something with Docker and my application, which is still very confusing. I will still try to update you here when I figure it out, in case someone else stumbles upon the same issue.

I could reproduce this in a VM with a fresh Ubuntu 20.04.

I could not reproduce this on my openSUSE Tumbleweed host system and I could not reproduce this on Ubuntu 21.10.
The Docker versions seem rather unrelated:
Ubuntu 20.04: 20.10.8
Ubuntu 21.10: 20.10.7 (yes, lower than 20.04 ...)
openSUSE Tumbleweed: 20.10.6-ce

Either it breaks with Docker 20.10.8 :joy: or it's more likely the kernel version or something else ...

Ubuntu 20.04: 5.4.0-88-generic
Ubuntu 21.10: 5.13.0-19-generic
openSUSE Tumbleweed: 5.14.6-2-default

Consider this resolved :slight_smile:
