Observing 5xx requests with Content-Length header > 0 but RequestContentSize = 0

We are sporadically seeing 5xx replies on some routes entering our k8s cluster via a replicated Traefik ingress controller. What all of these (POST) requests seem to have in common is that in Traefik's debug logs a Content-Length header with a value > 0 is present, but RequestContentSize is logged as 0.

Many requests on the same route succeed, and in those cases Content-Length and RequestContentSize match. The failing requests originate from the same clients, so it's pretty unlikely that a malicious client is spoofing headers.

Does anyone have a clue what could be causing this situation?

Hello @languitar

Have you noticed anything unusual in your log files? Would it be possible to enable debug log level?

Thank you,

The difference between Content-Length and RequestContentSize is the only unusual thing I can find in those logs. Things are already running with DEBUG logging. Here is one of the log entries showing the issue (slightly redacted):

{
    "ClientAddr": "10.1.118.85:37844",
    "ClientHost": "79.250.204.175",
    "ClientPort": "37844",
    "ClientUsername": "-",
    "DownstreamContentSize": 21,
    "DownstreamStatus": 500,
    "Duration": 1532776,
    "OriginContentSize": 21,
    "OriginDuration": 1412976,
    "OriginStatus": 500,
    "Overhead": 119800,
    "RequestAddr": "content.internal.production.xxxxxxx.company",
    "RequestContentSize": 0,
    "RequestCount": 586503,
    "RequestHost": "content.internal.production.xxxxxxx.company",
    "RequestMethod": "POST",
    "RequestPath": "/v1/companies/views/list?locale=de_DE\u0026limit=6\u0026offset=0\u0026restore=false\u0026basedOnRestore=false\u0026fromPaginate=false",
    "RequestPort": "-",
    "RequestProtocol": "HTTP/1.1",
    "RequestScheme": "https",
    "RetryAttempts": 0,
    "RouterName": "api-content-v1-team-architect-content-internal-production-xxxxxxx-company-v1@kubernetes",
    "ServiceAddr": "10.1.124.190:80",
    "ServiceName": "team-architect-api-content-v1-http@kubernetes",
    "ServiceURL": {
        "Scheme": "http",
        "Opaque": "",
        "User": null,
        "Host": "10.1.124.190:80",
        "Path": "",
        "RawPath": "",
        "ForceQuery": false,
        "RawQuery": "",
        "Fragment": "",
        "RawFragment": ""
    },
    "StartLocal": "2021-05-12T04:25:12.008575305Z",
    "StartUTC": "2021-05-12T04:25:12.008575305Z",
    "entryPointName": "websecure",
    "level": "info",
    "msg": "",
    "request_Accept": "application/json, text/plain, */*",
    "request_Accept-Encoding": "gzip, deflate, br",
    "request_Accept-Language": "de-de",
    "request_Authorization": "REDACTED",
    "request_Cloudfront-Forwarded-Proto": "https",
    "request_Cloudfront-Is-Desktop-Viewer": "true",
    "request_Cloudfront-Is-Mobile-Viewer": "false",
    "request_Cloudfront-Is-Smarttv-Viewer": "false",
    "request_Cloudfront-Is-Tablet-Viewer": "false",
    "request_Cloudfront-Viewer-Country": "DE",
    "request_Connection": "Keep-Alive",
    "request_Content-Length": "96",
    "request_Content-Type": "application/json",
    "request_Origin": "https://xxxx.xxx",
    "request_Referer": "https://xxxx.xxx/app/products",
    "request_User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15",
    "request_Via": "1.1 f3df422597c4acc1ee367b77cc862a7c.cloudfront.net (CloudFront)",
    "request_X-Amz-Cf-Id": "Nd9kPdbmCI8ZNcZnarbH3OUcTOs3j8WaPDFzQp53RiXjc_7a7r9rEQ==",
    "request_X-Forwarded-For": "79.250.204.175",
    "request_X-Forwarded-Host": "content.internal.production.xxxxxxx.company",
    "request_X-Forwarded-Port": "443",
    "request_X-Forwarded-Prefix": "/v1",
    "request_X-Forwarded-Proto": "https",
    "request_X-Forwarded-Server": "traefik-6d8bcb4b94-ww97s",
    "request_X-Real-Ip": "10.1.118.85",
    "time": "2021-05-12T04:25:12Z"
}
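
For what it's worth, entries like this can be fished out of the JSON access log with a filter along these lines (just a sketch, assuming jq and a JSON-lines access log with the field names shown above; the file name is a placeholder):

    # print access-log entries where the client announced a request body
    # (Content-Length > 0) but Traefik recorded an empty one
    jq -c 'select(.RequestContentSize == 0
                  and ((.["request_Content-Length"] // "0") | tonumber) > 0)' access.log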

Does that help?

Some additional information: the backend being called is a uwsgi server. We cannot see any logging or errors on the backend side for requests failing with these symptoms.

@jakubhajek any idea how to proceed here? This keeps haunting us :confused:

Hey @languitar

Is it possible to create a reproducible scenario for this case? I am not able to find anything unusual here, so a reproducible use case would be very helpful for providing a solution to such a sporadic issue.

We tried hard to find something that reproduces the case but were unable to come up with anything. For most services this happens only sporadically. There is one specific case that happens more often: in front of Traefik (as the k8s ingress) sits a CloudFront distribution that redirects bots to prerender.io. The internal bot used by prerender.io triggers this issue reliably on one specific REST request. However, even in this case we were unable to recreate the situation manually, for instance with curl. So I have nothing to show, and if there is anything we can do to further debug this in the production system, that would help a lot.

Thanks for clarifying that. I understand that it is difficult to have a reproducible use case, especially if it is so rare and random.

I will keep an eye out for any similar issues that might be reported; perhaps they will give us more details to find the root cause and provide a solution.

Thank you, Jakub

Hi again @jakubhajek

I was able to acquire a packet capture on the Traefik pod that includes one of the affected requests. Traefik logs one of these requests without RequestContentSize and with a 500 reply at:

    "StartLocal": "2021-05-17T15:36:57.809707904Z",
    "StartUTC": "2021-05-17T15:36:57.809707904Z",

At the same time I can find a TCP connection in Wireshark from Traefik (10.1.127.51) to the backend server (10.1.106.31) in which Traefik opens the connection and immediately closes it again without ever sending any data.

Moreover, I cannot find any request to the backend carrying the CloudFront trace ID of the failing request that results in status code 500. I would therefore assume that this is the TCP connection that should have carried the respective HTTP request, but somehow never does.
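
For reference, the capture itself was taken roughly like this (a sketch on my side: the nicolaka/netshoot image, the ephemeral debug container feature, and the --target container name are assumptions and may need adjusting for your cluster):

    # attach an ephemeral debug container to the Traefik pod (name is a
    # placeholder) and stream a pcap of the traffic towards the backend
    # into a local file that Wireshark can open
    TRAEFIK_POD=traefik-xxxxxxxxxx-xxxxx
    kubectl debug -i "$TRAEFIK_POD" --image=nicolaka/netshoot --target=traefik -- \
      tcpdump -i any -U -w - host 10.1.106.31 > traefik-backend.pcap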

Any idea what is going on here?

Hi again, @languitar
Thanks a lot for investigating the issue. If we suspect that it might be a bug - which is obviously possible - we can open an issue on our GitHub repo. However, we need a reproducible use case: once the issue is fixed, we have to be able to verify that it no longer occurs by running the test scenario. I will keep supporting you in finding the root cause of the issue.

Hi @jakubhajek

Should we try to narrow this down further here, or should I open a new issue for that?

My biggest problem with reproducing this right now is that the incoming request arrives via HTTPS. I suspect that some property of the request coming from prerender.io makes this issue at least more likely to appear, but I can't inspect the HTTP traffic because of the encryption. Any ideas on how to narrow down the case?

Alright, I was able to narrow this down considerably.

In the end, the services affected most by this issue are based on uwsgi. I could replicate the issue in a docker-compose project with one of our services and a bunch of parallel curl loops (see the sketch below). The problem again seems to stem from diverging ideas between Traefik and the backend about how HTTP keep-alive should work. I've seen some issues on that topic in the GitHub issue tracker over the last few years, and it seems we ended up in a comparable case.
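
The load generation was nothing fancy; roughly the following kind of loop, run in a few parallel shells against the compose setup, was enough to trigger occasional 500s (a sketch: URL, port, and payload are placeholders, not the real service):

    # POST against the service through Traefik and count the resulting status
    # codes; run several of these loops in parallel shells to provoke the issue
    for i in $(seq 1 500); do
      curl -s -o /dev/null -w '%{http_code}\n' \
        -H 'Content-Type: application/json' \
        -d '{"limit": 6, "offset": 0}' \
        http://localhost:8000/v1/companies/views/list
    done | sort | uniq -c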

For uwsgi, changing the server invocation from

uwsgi --ini /uwsgi.ini --protocol http --socket 0.0.0.0:80 --show-config

to

uwsgi --ini /uwsgi.ini --http 0.0.0.0:80 --http-keepalive --show-config

resolves the issue.

However, other services based on other frameworks (Node services, Keycloak) are also sporadically affected. How should we fix this there? And does Traefik always assume keep-alive support, even if the response headers do not indicate it?
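
As a side note for anyone debugging similar cases: a quick way to see whether a backend actually keeps connections open between requests, independent of Traefik, is to send two requests in a single curl invocation and check whether the connection gets reused (a sketch; the URL is a placeholder):

    # curl reuses the connection for multiple URLs given in one invocation; with
    # a keep-alive-capable backend the verbose output shows "Re-using existing
    # connection", otherwise a second "Connected to ..." line appears
    curl -sv -o /dev/null -o /dev/null \
      http://localhost:8000/health http://localhost:8000/health 2>&1 \
      | grep -iE 'connected to|re-using|closing connection|left intact'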

Johannes


Here is a reproduction repository where I tried to reproduce the problem with the smallest possible setup. Strangely, the uwsgi configuration that now works in our production system still produces 500 and 502 replies here with the included check.sh file. Logs such as the following can be found:

time="2021-05-18T15:38:03Z" level=debug msg="'500 Internal Server Error' caused by: http: server closed idle connection"

Does that help you in understanding what is wrong here?

Hello Johannes,

Thanks a lot for spending time investigating the issue and creating the reproducible scenario. I appreciate the work you have contributed to make that happen. Let me discuss this internally and I will get back to you shortly.

Have a great day,
Jakub

Hello Johannes,

Thank you once again for contributing to this issue.
We discussed it internally and, technically speaking, we are on it; you can track the progress in the GitHub issue.

Best,
Jakub


Hi @jakubhajek

Thanks for the feedback and for taking on this issue. We'd be very happy to see a solution for this, and if there is any further need for assistance, I or one of my colleagues can help out.

Cheers
Johannes