Troubleshoot 502 errors

Hi everyone,

I have Traefik 2.5.1 running in a GKE Cluster and Google Load Balancer in front of it.

We are getting 502 responses under some conditions, and would like some guidance on how to proceed.

The Google LB has a fixed keepalive timeout of 600s, which can't be changed:
https://cloud.google.com/load-balancing/docs/https#timeout-keepalive-backends

So they suggest setting the timeout a bit higher on the proxy side:

- "--serversTransport.forwardingTimeouts.idleConnTimeout=650s"
- "--entryPoints.web-ext.transport.respondingTimeouts.idleTimeout=650"
- "--entryPoints.websec-ext.transport.respondingTimeouts.idleTimeout=650"

This is how I set it up, but we are still getting 502s. I understand there can be many causes, such as deployments or the HPA scaling down, but that's not the case here; we have added a few mechanisms to cover those.

I found this comment, suggesting

MaxIdleConnsPerHost = -1

I tried it, but it didn't help in our case:

https://github.com/traefik/traefik/issues/3237#issuecomment-514178590

Someone else also mentions a Node.js timeout:
https://github.com/traefik/traefik/issues/3237#issuecomment-585062945

I wonder: if we disable keepAlive on Traefik, the settings of the backend services shouldn't really matter, right? Or am I missing something here?

What I'd like to achieve is a way to stop the 502s caused by this race condition between different idle timeouts, so that no matter what applications have configured on their side, Traefik is the one in the "lead", so to speak.
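To make the race concrete: on each hop, the side that reuses idle connections (the client role) should give up on them before the side that accepts them (the server role) closes them. Below is a minimal sketch of that check in Python. The `Hop` type and `risky_hops` helper are hypothetical names, and the 5s backend value is only an assumed example (it matches the Node.js default `keepAliveTimeout`), not something measured in this cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    name: str
    client_idle_s: float  # how long the client side keeps an idle connection for reuse
    server_idle_s: float  # how long the server side waits before closing an idle connection


def risky_hops(hops):
    """Return the names of hops where the server may close an idle
    connection that the client still considers reusable."""
    return [h.name for h in hops if h.server_idle_s <= h.client_idle_s]


# Values from this thread, plus an ASSUMED backend timeout for illustration.
hops = [
    Hop("Google LB -> Traefik entrypoint", client_idle_s=600, server_idle_s=650),
    Hop("Traefik -> backend (assumed 5s keep-alive)", client_idle_s=650, server_idle_s=5),
]

print(risky_hops(hops))
```

With these numbers only the second hop is flagged, which would fit the Node.js comment linked above: raising the entrypoint timeout protects the LB side, but Traefik's own connection pool can still hold connections longer than a backend keeps them open.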

A nice article for reference: "Reverse Proxy, HTTP Keep-Alive Timeout, and sporadic HTTP 502s".

Is anyone able to shed some light here?

Thanks in advance.

502 is "Bad Gateway".

This usually happens when Traefik in Docker tries to forward a request to a target service via a Docker network that Traefik itself is not attached to. In other words, it occurs when a connection to the target service cannot be established.

Maybe try enabling Traefik access log in JSON format to be able to differentiate between OriginStatus and DownstreamStatus. And enable Traefik debug log.
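Once the JSON access log is on, something like the following could help separate the 502s by what the backend reported. It is only a sketch: `tally_502s` is a hypothetical helper, the sample lines are made up, and it assumes each log line is one JSON object carrying the `DownstreamStatus`, `OriginStatus` and `ServiceName` fields.

```python
import json
from collections import Counter


def tally_502s(lines):
    """Count 502 responses per (ServiceName, OriginStatus) pair.

    Lines that are not JSON objects are skipped, since log streams
    often contain other output mixed in.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        if entry.get("DownstreamStatus") != 502:
            continue
        counts[(entry.get("ServiceName"), entry.get("OriginStatus"))] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    '{"DownstreamStatus": 200, "OriginStatus": 200, "ServiceName": "svc-a"}',
    '{"DownstreamStatus": 502, "OriginStatus": 502, "ServiceName": "svc-a"}',
    '{"DownstreamStatus": 502, "OriginStatus": 0, "ServiceName": "svc-b"}',
    "not json",
]

for key, count in tally_502s(sample).items():
    print(key, count)
```

If the 502s cluster on a few services, that points at those backends' keep-alive settings rather than at the load balancer side.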

Furthermore you should upgrade from a 2-3 year old Traefik version.

Hey,

Thanks for your reply. The thing is, the 502s are intermittent: among thousands of 200 responses, we get a few 502s.

Is there anything in particular I should be looking for in the debug logs? I already tried to analyse them but didn't find much at the time; I could give it another shot.

Regarding the upgrade, yes, I know, but our team doesn't have the capacity to do this work right now.

You are probably right.

Continue to use Traefik v2.5.1 from Aug 20, 2021. Ignore 4104 commits that happened in the meantime, for bug fixes and potential security improvements. Ignore the 8 CVE entries (short for Common Vulnerabilities and Exposures) for Traefik since that release.

Let your team stay busy with their work, and instead waste your own and forum members' time trying to debug something that might have been resolved already.

/irony off