I am trying to harden my environment and am seeing some strange behavior that I am hoping someone can help me understand.
I am incorporating a Docker socket proxy into my environment. My dev portion is a 3-node Swarm with two Leaders.
I had it working for all services that need the socket (3, I think), but as I progressed I noticed that Traefik would sporadically report that it could not resolve the proxy's address, and changed the definition from the tcp:// protocol to http://. The (non-specific) error is as follows:
Provider connection error error during connect: Get \"http://dockerproxy:2375/v1.24/version\": dial tcp: lookup dockerproxy on 127.0.0.11:53: no such host, retrying in 610.626583ms
Despite the retries, it never starts working again.
Strangely, if I open an interactive shell in the Traefik container, I can ping dockerproxy by name and it resolves. Stranger still, to troubleshoot I created a config containing only Traefik and the socket proxy and tested with that. While the issue is occurring, I can actually see activity over the socket (presumably from Traefik, since it is the only other service running):
time=2025-05-28T13:09:37.430Z level=INFO msg="socket-proxy running and listening..."
time=2025-05-28T13:09:37.430Z level=DEBUG msg="watchdog running"
time=2025-05-28T13:09:37.786Z level=DEBUG msg="allowed request" method=GET URL=/v1.24/version client=10.0.4.4:50798
time=2025-05-28T13:09:37.792Z level=DEBUG msg="allowed request" method=GET URL=/v1.24/services client=10.0.4.4:50798
time=2025-05-28T13:09:37.794Z level=DEBUG msg="allowed request" method=GET URL=/v1.24/version client=10.0.4.4:50798
time=2025-05-28T13:09:37.800Z level=DEBUG msg="allowed request" method=GET URL="/v1.24/networks?filters=%7B%22scope%22%3A%7B%22swarm%22%3Atrue%7D%7D" client=10.0.4.4:50798
time=2025-05-28T13:09:37.802Z level=DEBUG msg="allowed request" method=GET URL="/v1.24/tasks?filters=%7B%22desired-state%22%3A%7B%22running%22%3Atrue%7D%2C%22service%22%3A%7B%22dpz9mw28zcmu882ztc166l0eq%22%3Atrue%7D%7D" client=10.0.4.4:50798
time=2025-05-28T13:09:37.803Z level=DEBUG msg="allowed request" method=GET URL="/v1.24/tasks?filters=%7B%22desired-state%22%3A%7B%22running%22%3Atrue%7D%2C%22service%22%3A%7B%22jbva8z9glfm3ez231xq7fyom9%22%3Atrue%7D%7D" client=10.0.4.4:50798
time=2025-05-28T13:09:52.806Z level=DEBUG msg="allowed request" method=GET URL=/v1.24/services client=10.0.4.4:50798
time=2025-05-28T13:09:52.807Z level=DEBUG msg="allowed request" method=GET URL=/v1.24/version client=10.0.4.4:50798
time=2025-05-28T13:09:52.813Z level=DEBUG msg="allowed request" method=GET URL="/v1.24/networks?filters=%7B%22scope%22%3A%7B%22swarm%22%3Atrue%7D%7D" client=10.0.4.4:50798
time=2025-05-28T13:09:52.815Z level=DEBUG msg="allowed request" method=GET URL="/v1.24/tasks?filters=%7B%22desired-state%22%3A%7B%22running%22%3Atrue%7D%2C%22service%22%3A%7B%22dpz9mw28zcmu882ztc166l0eq%22%3Atrue%7D%7D" client=10.0.4.4:50798
time=2025-05-28T13:09:52.816Z level=DEBUG msg="allowed request" method=GET URL="/v1.24/tasks?filters=%7B%22desired-state%22%3A%7B%22running%22%3Atrue%7D%2C%22service%22%3A%7B%22jbva8z9glfm3ez231xq7fyom9%22%3Atrue%7D%7D" client=10.0.4.4:50798
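The percent-encoded filters in the proxy log above are easier to read once decoded. Below is a minimal sketch in Python, assuming the log format shown above. `decode_request` is a hypothetical helper name used only for illustration; it is not part of Traefik or the socket proxy.

```python
import json
import re
from urllib.parse import parse_qs, urlsplit

# Matches both URL=/path and URL="/path?query" as they appear in the log lines.
URL_RE = re.compile(r'URL="?([^"\s]+)"?')


def decode_request(line):
    """Return (path, filters) for one 'allowed request' log line, or None."""
    match = URL_RE.search(line)
    if match is None:
        return None
    parts = urlsplit(match.group(1))
    query = parse_qs(parts.query)  # parse_qs percent-decodes the values
    filters = json.loads(query["filters"][0]) if "filters" in query else {}
    return parts.path, filters


sample = (
    'time=2025-05-28T13:09:37.800Z level=DEBUG msg="allowed request" method=GET '
    'URL="/v1.24/networks?filters=%7B%22scope%22%3A%7B%22swarm%22%3Atrue%7D%7D" '
    "client=10.0.4.4:50798"
)
print(decode_request(sample))
```

Run against the lines above, this shows the client asking for the version, the service list, swarm-scoped networks, and the running tasks of two service IDs, in two bursts about 15 seconds apart.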
So it is actually resolving and communicating; it just reports that it is not, some of the time. Cycling the stack reproduces the issue sporadically.
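A single ping only samples resolution once, so an intermittent failure could be missed. One way to check this might be to resolve the name repeatedly and record each outcome. This is a rough sketch, assuming a helper container with Python attached to the same overlay network (an assumption on my part). `probe` is a hypothetical function name, and the `resolver` and `sleep` parameters exist only so the loop can be exercised without a network.

```python
import socket
import time


def probe(host, port, attempts=5, delay=1.0,
          resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host several times; return a list of (ok, detail) tuples."""
    results = []
    for i in range(attempts):
        try:
            infos = resolver(host, port, proto=socket.IPPROTO_TCP)
            # Each entry's sockaddr starts with the resolved IP address.
            addrs = sorted({info[4][0] for info in infos})
            results.append((True, addrs))
        except socket.gaierror as exc:
            # Lookup failures such as "no such host" surface as gaierror.
            results.append((False, str(exc)))
        if i < attempts - 1:
            sleep(delay)
    return results


# Example call from inside a container on the same network:
#   for ok, detail in probe("dockerproxy", 2375, attempts=60):
#       print("resolved" if ok else "FAILED", detail)
```

If every attempt resolves while Traefik is still logging the lookup error, that would point away from the embedded DNS server at 127.0.0.11 and toward something in Traefik's own startup timing or state.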
I thought this might be because Traefik and the socket proxy are not on the same node, but experimenting with that theory disproved it. I set the proxy to run in global mode, and although that seemed to reduce the frequency, the issue still occurred.
I am super confused. This reads to me like an issue in how Traefik reports service resolution. Any ideas? TiA