Impact of setting `maxIdleConnsPerHost` to 0 or -1

Hello,

We use Traefik in front of our production k8s cluster, and it is serving around 5000 RPS.
Around 0.1% - 0.5% of the requests started getting 502 responses. Looking into it, these requests never reach the services in the k8s cluster, but they are visible in the Traefik logs.

Interestingly, this only happened with endpoints that make more upstream requests.

One thing that seems to help is setting maxIdleConnsPerHost to -1 (unlimited) or 0 (disabled).
The active connections metric spikes to around 40-50 active connections at a time, so I also tried setting maxIdleConnsPerHost to 100, but this didn't help.

Using unlimited idle connections in production, or disabling idle connections altogether, seems like it could hurt the production cluster's performance in the long run.

Has anyone run into the same problem, and is simply disabling idle connections a good strategy?

I was going through similar troubleshooting recently and maybe I can help.

One thing that seems to help is setting maxIdleConnsPerHost to -1 (unlimited) or 0 (disabled).

-1 means disabled; 0 means the default, which I think is a maximum of 2 idle connections per host.

When maxIdleConnsPerHost is disabled, Traefik opens a new connection to the backend for each request and closes it when it’s no longer needed. This affects the 502s you see when pods get rolled or terminated for whatever reason: if idle connections are enabled, they stick around, and you can get 502s between the time a pod enters the Terminating state and the time Traefik updates its routing table to remove the pod.
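
In case it helps, here is a minimal sketch of where this setting lives, assuming you use a file-based static configuration (the same option can also be passed as `--serversTransport.maxIdleConnsPerHost` on the command line). The comments reflect my understanding of the values described above, so please double-check them against the docs for your Traefik version.

```yaml
# Static configuration (e.g. traefik.yml). This is the default transport
# Traefik uses for connections to all backends.
serversTransport:
  # -1: keep no idle connections (a new backend connection per request)
  #  0: fall back to Go's default (I believe 2 idle connections per host)
  # >0: keep at most this many idle connections per backend host
  maxIdleConnsPerHost: -1
```

If you only want this for some services, I believe a per-service ServersTransport can carry the same field, but I have not tested that setup myself.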

If you don’t have problems with idle connections disabled, it’s probably fine. Nginx doesn’t have them by default at all, although enabling them is recommended for performance reasons.

One thing that might help with 502s and 504s when pods are rolled is setting a preStop hook, which gives Traefik time to update its routing table before the pod shuts down.

This GH issue was informative: "`nativeLB: true` breaks round robin load balancing" (Issue #10303, traefik/traefik).

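            # preStop runs before the container receives SIGTERM, so the pod keeps
            # serving for 20s while Traefik removes it from its routing table.
            lifecycle:
              preStop:
                exec:
                  command:
                  - /bin/sh
                  - -c
                  - sleep 20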
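
For context, here is a rough sketch of where that block sits in a Deployment. The names, image, replica count, and the 40s grace period are placeholders I made up for illustration, not values from our setup. The one thing to watch is that the preStop sleep counts against `terminationGracePeriodSeconds` (default 30s), so the grace period should be longer than the sleep plus however long your app needs to shut down.

```yaml
# Hypothetical Deployment; all names and values below are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      # Must exceed the preStop sleep, otherwise the container is killed
      # before it has had a chance to shut down cleanly.
      terminationGracePeriodSeconds: 40
      containers:
        - name: app
          image: example/my-service:latest  # needs /bin/sh for the hook below
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 20"]
```

The sleep only works if the image ships a shell; for minimal or distroless images you would need some other way to delay shutdown.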