Intermittent 502s downstream with Nomad, caused by: EOF

Hello,

We have been debugging an issue for the past few weeks and seem to have tried everything we could find online to resolve our intermittent 502 errors.

Overview:
We are running Nomad with Consul and Traefik. All our Traefik endpoints are behind an AWS ALB. Here is a small diagram of the setup.

Here is also an example of one of the 502s, taken from the Traefik access log:
{
  "ClientAddr": "<client_ip>:26694",
  "ClientHost": "<client_host_ip>",
  "ClientPort": "26694",
  "ClientUsername": "-",
  "DownstreamContentSize": 185,
  "DownstreamStatus": 502,
  "Duration": 48968850,
  "OriginContentSize": 185,
  "OriginDuration": 48930581,
  "OriginStatus": 502,
  "Overhead": 38269,
  "RequestAddr": "app.domain.io",
  "RequestContentSize": 0,
  "RequestCount": 838372,
  "RequestHost": "app.domain.io",
  "RequestMethod": "GET",
  "RequestPath": "/",
  "RequestPort": "-",
  "RequestProtocol": "HTTP/1.1",
  "RequestScheme": "http",
  "RetryAttempts": 0,
  "RouterName": "app@consulcatalog",
  "ServiceAddr": "<service_ip>:27120",
  "ServiceName": "app@consulcatalog",
  "ServiceURL": {
    "Scheme": "https",
    "Opaque": "",
    "User": null,
    "Host": ":27120",
    "Path": "",
    "RawPath": "",
    "OmitHost": false,
    "ForceQuery": false,
    "RawQuery": "",
    "Fragment": "",
    "RawFragment": ""
  },
  "StartLocal": "2024-10-10T17:27:49.241367787Z",
  "StartUTC": "2024-10-10T17:27:49.241367787Z",
  "entryPointName": "lb",
  "level": "info",
  "msg": "",
  "time": "2024-10-10T17:27:49Z"
}

For the most part our applications run fine with no 502s, but intermittently we get them. We had auto scaling in place and thought the scaling up and down of our API endpoints was not being handled properly, but even after disabling all auto scaling we are still seeing 502s, and even after increasing the number of workers to an absurd amount (20 tasks, when normally all traffic can be served with 2-3).

Then we looked into the keep-alive settings from the ALB, to Traefik, to our app, and got those all aligned, but we are still getting 502s. The last piece I can think of is something with Traefik or the Envoy proxy sidecar, but the settings there seem fine and are provided in the screenshot above.
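
For reference, here is roughly what "getting them aligned" means to us, sketched as the Traefik task in the Nomad job. The entrypoint name matches the "lb" entrypoint from the log above, but the image and the 180s/30s values are illustrative assumptions rather than our exact config. The rule we followed is that whoever accepts the connection should hold it open longer than whoever reuses it: Traefik's entrypoint idle timeout stays above the ALB's idle timeout (60s by default), and Traefik's idle timeout toward the backend stays below the app's own keep-alive timeout.

# Sketch only: names and values are assumptions, not our running config.
job "traefik" {
  group "traefik" {
    task "traefik" {
      driver = "docker"

      config {
        image = "traefik:v2.11"
        args = [
          # Hold idle client connections longer than the ALB's idle
          # timeout (60s by default), so the ALB never reuses a
          # connection Traefik has already torn down.
          "--entryPoints.lb.transport.respondingTimeouts.idleTimeout=180s",

          # Drop idle backend connections sooner than the app's
          # keep-alive timeout, so Traefik never reuses a connection
          # the app (or the Envoy sidecar) has already closed, which is
          # a classic source of "502 ... caused by: EOF".
          "--serversTransport.forwardingTimeouts.idleConnTimeout=30s",
        ]
      }
    }
  }
}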

I would be interested if anyone has follow-up suggestions for debugging this, or ideas about what else I could look at to pin down the issue.

Retries are the next thing on our list to implement, but I am curious about finding the root cause.
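
If we do add them, the plan would look something like the following in the service stanza of the app's Nomad job. The router and middleware names and the retry values are illustrative assumptions, not our current tags; the retry middleware re-issues the request when the forwarded attempt fails with a network error, which is what the EOF looks like from Traefik's side.

# Sketch only: names and values are assumptions.
service {
  name = "app"
  port = "http"

  tags = [
    "traefik.enable=true",
    "traefik.http.routers.app.rule=Host(`app.domain.io`)",
    "traefik.http.routers.app.middlewares=app-retry",

    # Make up to 3 attempts total, with a short backoff between them,
    # when the forwarded request fails with a network error (e.g. the
    # EOF behind these 502s).
    "traefik.http.middlewares.app-retry.retry.attempts=3",
    "traefik.http.middlewares.app-retry.retry.initialInterval=100ms",
  ]
}

That would likely paper over the EOFs rather than explain them, though, which is why I would still like to understand the root cause.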