Docker Swarm on Linux with new 5.x kernels has connectivity issues between nodes in the swarm

I'm trying to set up Traefik here, and I have spent a couple of days tracking down a cryptic issue. I think it deserves more visibility in your documentation, as it seems likely to hit a lot of people trying to set up Docker Swarm on updated Linux hosts, and searching for tx-checksum-ip-generic doesn't turn up anything on the forum here. Maybe there should be a notice in your Docker Swarm setup docs?

I spent a couple of days trying to get something working, but your docs in Traefik Docker Routing Documentation - Traefik, and a lot of other examples and docs I found elsewhere, did not work for me. Basically, requests to other nodes in the swarm cluster just got dropped without any error message or notification. So I tried a ton of stuff and made a lot of other errors in the process, as the real issue has, in my opinion, very little visibility.

My root cause, as far as I understand it, is that I got a fresh VM image running Ubuntu with Linux kernel 5.15.0-43-generic, and this has a setting that ends up dropping some packets, even though the firewall settings are just fine.

Basically, I had to run ethtool -K ens160 tx-checksum-ip-generic off on all my docker nodes to get stuff working again. Some links to where I found the core issue:

And a link to the Docker Swarm forum where I posted my traefik.yaml and the rest of my setup, so the issue can be reproduced:
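
For anyone who wants to try the ethtool workaround mentioned above, here is a minimal sketch of it as a small script. The command itself is the one from this post; the IFACE and DRY_RUN variables and the disable_tx_checksum helper are hypothetical names made up for illustration, and ens160 is just the interface name on my VMs, so yours will likely differ. By default the sketch only prints the command instead of running it.

```shell
#!/bin/sh
# Sketch of the workaround from the post above: turn off TX checksum offload.
# Assumptions (not from the post): the IFACE and DRY_RUN variables and the
# disable_tx_checksum helper are hypothetical names used for illustration.

disable_tx_checksum() {
    iface="$1"
    cmd="ethtool -K $iface tx-checksum-ip-generic off"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run (the default): only print the command that would be run.
        echo "$cmd"
    else
        # Real run: needs root and the ethtool package installed.
        $cmd
    fi
}

# ens160 is the interface name from the post; check yours with `ip link`.
disable_tx_checksum "${IFACE:-ens160}"
```

Running it with DRY_RUN=0 as root on each swarm node would apply the change. As far as I know, ethtool settings do not survive a reboot, so the command probably has to be reapplied at boot in whatever way your distro prefers.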

Hi @humbe

Thanks for sharing your issue and your findings about the root cause. Just to clarify, this is most certainly a network issue, either in Docker Swarm (which unfortunately would not be something new) or in the distro/kernel in question, and it is not related to Traefik at all.

It would not be feasible for us to cover every detail or discrepancy in the combinations of OS, kernel, and Docker version, especially since Docker is only one of the supported providers for Traefik. That's why we focus the docs on the configuration aspects of service discovery and not on how to get the networking going for each provider.

Having said that, if you have contributions that would make sense to add to the existing docs, feel free to submit a pull request or an issue with a proposal. Here are some guidelines on how to do so.

Finally, thank you again for the report! I'm pretty sure it will be helpful for others in the same situation who come looking for it on the forums!

Yes. I realized that after calming down a bit. I have used Docker on a single host before, but this was my first experience with Docker Swarm and multiple nodes. Traefik was the first thing I started to add after enabling swarm on the nodes, so the bug made me doubt whether any of the documentation was up to date, and wonder which concept I had configured wrong in Traefik :wink: It woulda been more obvious if I'd had a working setup already and noticed it go wrong at some point. There were enough new concepts for me to understand in Traefik that I figured I coulda misunderstood something, so the issue being rooted in the kernel did not hit me for quite a while :slight_smile:

The issue does seem to be triggered in Docker's overlay network and rooted in a change made in some 5.x kernel version, so it is not a Traefik issue. But an issue in a dependency that gets triggered invisibly underneath what you're trying to set up is kinda hard to track down. I'm not sure how best to increase the visibility of the issue, but I thought posting here was at least a start.