Hi everyone,
I have Traefik 2.5.1 running in a GKE Cluster and Google Load Balancer in front of it.
We are getting 502 responses under some conditions, and would like some guidance on how to proceed.
The Google LB has a backend keepalive timeout of 600s, and this can't be changed:
https://cloud.google.com/load-balancing/docs/https#timeout-keepalive-backends
So they suggest setting the timeout a bit higher on the proxy side:
- "--serversTransport.forwardingTimeouts.idleConnTimeout=650s"
- "--entryPoints.web-ext.transport.respondingTimeouts.idleTimeout=650"
- "--entryPoints.websec-ext.transport.respondingTimeouts.idleTimeout=650"
That is how I configured it, but we are still getting the 502s. I understand there can be many causes, such as deployments or the HPA scaling down, but that's not the case here: we have added a few mechanisms to cover those.
I found this comment, suggesting
MaxIdleConnsPerHost = -1
I tried it, but it didn't help in our case:
https://github.com/traefik/traefik/issues/3237#issuecomment-514178590
Then someone also mentions a Node.js timeout:
https://github.com/traefik/traefik/issues/3237#issuecomment-585062945
And I wonder: if we disable keep-alive on Traefik, the settings of the downstream services shouldn't really matter, right? Or am I missing something here?
What I'd like to achieve is a way to stop the 502s caused by this race condition between different idle timeouts, so that no matter what the applications have configured on their side, Traefik is the one in the "lead", so to speak.
A nice article for reference: "Reverse Proxy, HTTP Keep-Alive Timeout, and sporadic HTTP 502s"
Is anyone able to shed some light here?
Thanks in advance.