Intermittent 502s downstream with Nomad, caused by: EOF

Hello,

We have been debugging an issue for the past few weeks and seem to have tried everything we could find online to resolve our intermittent 502 errors.

Overview:
We are running Nomad with Consul and Traefik. All our Traefik endpoints are behind an AWS ALB. Here is a small diagram of the setup.

Here is also an example of one of the 502s, taken from the Traefik access log:
{
  "ClientAddr": "<client_ip>:26694",
  "ClientHost": "<client_host_ip>",
  "ClientPort": "26694",
  "ClientUsername": "-",
  "DownstreamContentSize": 185,
  "DownstreamStatus": 502,
  "Duration": 48968850,
  "OriginContentSize": 185,
  "OriginDuration": 48930581,
  "OriginStatus": 502,
  "Overhead": 38269,
  "RequestAddr": "app.domain.io",
  "RequestContentSize": 0,
  "RequestCount": 838372,
  "RequestHost": "app.domain.io",
  "RequestMethod": "GET",
  "RequestPath": "/",
  "RequestPort": "-",
  "RequestProtocol": "HTTP/1.1",
  "RequestScheme": "http",
  "RetryAttempts": 0,
  "RouterName": "app@consulcatalog",
  "ServiceAddr": "<service_ip>:27120",
  "ServiceName": "app@consulcatalog",
  "ServiceURL": {
    "Scheme": "https",
    "Opaque": "",
    "User": null,
    "Host": ":27120",
    "Path": "",
    "RawPath": "",
    "OmitHost": false,
    "ForceQuery": false,
    "RawQuery": "",
    "Fragment": "",
    "RawFragment": ""
  },
  "StartLocal": "2024-10-10T17:27:49.241367787Z",
  "StartUTC": "2024-10-10T17:27:49.241367787Z",
  "entryPointName": "lb",
  "level": "info",
  "msg": "",
  "time": "2024-10-10T17:27:49Z"
}

For the most part our applications run fine with no 502s, but intermittently we get them. We had auto scaling in place and thought the scaling up and down of our API endpoints was not being handled properly, but even after disabling all auto scaling we are still seeing 502s, and even after increasing the number of workers to an absurd amount (20 tasks, when normally all traffic can be served with 2-3).

Then we looked into the keep-alive settings from the ALB, to Traefik, to our app, and got those all aligned, but we are still getting 502s. The last piece I can think of is something with Traefik or the Envoy proxy sidecar, but the settings there seem fine and are provided in the screenshot above.
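
For reference, here is roughly what "getting them aligned" means to us, sketched as the Traefik task in the Nomad job. The entrypoint name matches the "lb" entrypoint from the log above, but the image and the 180s/30s values are illustrative assumptions rather than our exact config. The rule we followed is that whoever accepts the connection should hold it open longer than whoever reuses it: Traefik's entrypoint idle timeout stays above the ALB's idle timeout (60s by default), and Traefik's idle timeout toward the backend stays below the app's own keep-alive timeout.

# Sketch only: names and values are assumptions, not our running config.
job "traefik" {
  group "traefik" {
    task "traefik" {
      driver = "docker"

      config {
        image = "traefik:v2.11"
        args = [
          # Hold idle client connections longer than the ALB's idle
          # timeout (60s by default), so the ALB never reuses a
          # connection Traefik has already torn down.
          "--entryPoints.lb.transport.respondingTimeouts.idleTimeout=180s",

          # Drop idle backend connections sooner than the app's
          # keep-alive timeout, so Traefik never reuses a connection
          # the app (or the Envoy sidecar) has already closed, which is
          # a classic source of "502 ... caused by: EOF".
          "--serversTransport.forwardingTimeouts.idleConnTimeout=30s",
        ]
      }
    }
  }
}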

I would be interested if anyone has follow-up suggestions for debugging this, or ideas about what else I could look at to pin down the issue.

Retries are the next thing on our list to implement, but I am curious about finding the root cause.
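
If we do add them, the plan would look something like the following in the service stanza of the app's Nomad job. The router and middleware names and the retry values are illustrative assumptions, not our current tags; the retry middleware re-issues the request when the forwarded attempt fails with a network error, which is what the EOF looks like from Traefik's side.

# Sketch only: names and values are assumptions.
service {
  name = "app"
  port = "http"

  tags = [
    "traefik.enable=true",
    "traefik.http.routers.app.rule=Host(`app.domain.io`)",
    "traefik.http.routers.app.middlewares=app-retry",

    # Make up to 3 attempts total, with a short backoff between them,
    # when the forwarded request fails with a network error (e.g. the
    # EOF behind these 502s).
    "traefik.http.middlewares.app-retry.retry.attempts=3",
    "traefik.http.middlewares.app-retry.retry.initialInterval=100ms",
  ]
}

That would likely paper over the EOFs rather than explain them, though, which is why I would still like to understand the root cause.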