I'm using traefik as a reverse proxy in front of two services running in docker containers.
One is mycustomservice and one is keycloak.
It basically works, but after some time traefik refuses all connections from mycustomservice, so mycustomservice loses access to keycloak.
This is what my docker-compose.yml file looks like:
When I docker exec into the mycustomservice container, I can ping google.com or the host ip just fine.
Only connecting to traefik on port 80 of the host IP (for example with telnet) fails with "connection refused".
Connecting to traefik on port 80 from outside of the container still works fine.
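To catch the moment the connection flips from working to refused, a small probe run inside the container might help. This is only a sketch: `probe` and `watch` are hypothetical helper names, and it assumes Python is available in the container image, which may not be the case here.

```python
import errno
import socket
import time


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Try one TCP connect and report 'open', 'refused', 'timeout' or 'error:<name>'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Any other socket error, e.g. EHOSTUNREACH or ENETUNREACH.
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))


def watch(host: str, port: int, interval: float = 1.0, rounds: int = 10) -> None:
    """Probe repeatedly and print a timestamped line whenever the result changes."""
    last = None
    for _ in range(rounds):
        state = probe(host, port)
        if state != last:
            print(f"{time.strftime('%H:%M:%S')} {host}:{port} -> {state}")
            last = state
        time.sleep(interval)
```

Running `watch` with the host IP and port 80 from inside the container, and at the same time from outside, would give a timestamp for the flip that can be compared against the traefik and kernel logs.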
Restarting the mycustomservice container does not have any effect, restarting traefik makes it work again ... for some time. Sometimes it works only for a few minutes, sometimes it works for 10-20min - but usually not longer than that.
When this happens, other containers from the same docker-compose network lose access too. Thus I'm not 100% positive this is a traefik issue at all, it might be a docker issue of some sort?
edit: it's not only the same docker-compose network - it's all containers on the same host.
I'm obviously not using the ratelimiting middleware and cannot see anything that looks even slightly relevant in the logs.
edit: I've confirmed there is no log entry when my connection attempt is refused.
By now I'm totally lost on what to search for or how to further debug this.
All threads I find are about setups that don't work at all, not ones that break over time. It also doesn't help that I haven't found a reliable way to reproduce the behavior quickly, not even with docker-compose up --scale mycustomservice=50, so it doesn't seem to be a load problem per se.
I'm wondering if the real application (and not my placeholder script) maybe does not close connections correctly, any idea if that could cause issues or how to analyze if that could be related?
But then why doesn't restarting the application container help, while restarting traefik does? None of this makes any sense to me...
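One way to check whether the application leaves connections open is to count its sockets per TCP state. The sketch below parses the text of /proc/net/tcp (Linux only, IPv4 only; IPv6 sockets are listed in /proc/net/tcp6). `count_states` is a hypothetical helper name, and the state codes are the ones the Linux kernel uses in that file.

```python
from collections import Counter

# TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp: str, remote_port: int) -> Counter:
    """Count sockets per TCP state whose remote port equals remote_port."""
    counts = Counter()
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # rem_address looks like "AF16640A:0050": hex IP, colon, hex port.
        rem_port = int(fields[2].rsplit(":", 1)[1], 16)
        if rem_port == remote_port:
            counts[TCP_STATES.get(fields[3].upper(), fields[3])] += 1
    return counts
```

Inside the mycustomservice container this could be called as `count_states(open("/proc/net/tcp").read(), 80)`. A count of ESTABLISHED or CLOSE_WAIT sockets that keeps growing while the reproducer runs would hint at connections not being closed; a flat count would point elsewhere.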
Hope you can help me
edit:
The host is a pretty vanilla Ubuntu 20.04 installation with Docker version 20.10.8, build 3967b7d28e
As I said, when I make a "blocked" request I don't see anything in the logs. I have found a fairly easy and quick reproducer by now, so let me check whether anything shows up in the log when it flips from working to not working.
FWIW repeatedly using this function in my app is all I need to do to trigger the behavior:
Looks like a pretty standard HTTP request to me. I cannot trigger it with the curl invocation from my docker-compose.yml above.
As you can see there is always a request to /api/v1/permissions, followed by a request to /auth/realms/master/protocol/openid-connect/userinfo until it just stops. There are 7 requests to /api/v1/permissions without a followup request to the userinfo endpoint.
server_1 | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": read tcp 172.26.0.5:48838->10.100.22.175:80: read: connection reset by peer"
server_1 | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": dial tcp 10.100.22.175:80: connect: connection refused"
server_1 | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": dial tcp 10.100.22.175:80: connect: connection refused"
server_1 | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": dial tcp 10.100.22.175:80: connect: connection refused"
server_1 | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": dial tcp 10.100.22.175:80: connect: connection refused"
server_1 | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": dial tcp 10.100.22.175:80: connect: connection refused"
server_1 | 1:19PM DBG Could not retrieve UserInfo error="Get \"http://10.100.22.175/auth/realms/master/protocol/openid-connect/userinfo\": dial tcp 10.100.22.175:80: connect: connection refused"
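The order of the failures in this excerpt may matter: one "connection reset by peer" on an existing connection, then only "connection refused" on new dials. A minimal sketch for collapsing a longer log into such runs, assuming the error strings look exactly like the ones above (`classify` and `runs` are hypothetical helper names):

```python
from itertools import groupby


def classify(line: str) -> str:
    """Map one log line to 'reset', 'refused' or 'other' by its error text."""
    if "connection reset by peer" in line:
        return "reset"
    if "connection refused" in line:
        return "refused"
    return "other"


def runs(lines):
    """Collapse consecutive lines of the same kind into (kind, count) pairs."""
    kinds = [classify(line) for line in lines if line.strip()]
    return [(kind, len(list(group))) for kind, group in groupby(kinds)]
```

Feeding it the output of the application container over several failure cycles would show whether every cycle starts with a reset, which would suggest an established connection being torn down first.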
It's a self hosted application for managing laboratory equipment - I'm trying to provide an example for easy deployment to get started and play around with it.
So I'm not really bound to keycloak or anything, it's just a placeholder (in fact I've been using Kopano Konnect so far - it's just a bit messier to setup for a simple example because it needs LDAP).
Sooo ... Traefik Enterprise would be interesting to test my application with, but not really relevant for this documentation example I'd say, sorry
Thanks for explaining that. I asked that question because I noticed that you set up the environment with OpenID and Keycloak so I was just curious.
So if you are just developing the labs I would try to use much simpler test applications. e.g. well known traefik/whoami instead of configuring more complex stuff.
Good news for you, bad news for me: I can reproduce this issue with nginx, and even without a proxy in between, just as well.
So this is 99% not a traefik issue but something with Docker and my application, which is still very confusing. I will update you here when I figure it out, in case someone else stumbles upon the same issue.
I could reproduce this in a VM with a fresh Ubuntu 20.04.
I could not reproduce this on my openSUSE Tumbleweed host system and I could not reproduce this on Ubuntu 21.10.
Docker versions seem rather unrelated:
Ubuntu 20.04: 20.10.8
Ubuntu 21.10: 20.10.7 (yes, lower than 20.04 ...)
openSUSE Tumbleweed: 20.10.6-ce
Either it breaks with Docker 20.10.8 or, more likely, it's the kernel version or something else ...