Traefik 2.11.2 crashing due to OOM

Hi all,

We have been using traefik for 3-4 years now without issue. However, over the weekend we saw some weird behaviour, which we are hoping someone has experienced before and can help fill in some of the blanks for us.

On our EKS cluster we have traefik running as a daemonset with memory requests and limits both set to 1Gb.

Using Datadog we can clearly see that the traefik pods were running fine at around 200Mi. We then had a product team's pod go into CrashLoopBackOff (because the product pod did not have enough memory to handle its ingress traffic).

Once that pod started crashing, we saw traefik crash with OOM errors about 1 minute later.

Has anyone seen this before? Any recommendations?

Is this because all the inbound traffic is still going through traefik, but with the downstream pods failing it has nowhere to send the requests? Even so, we would not have expected traefik to crash :frowning:

Thanks

Try upgrading to the latest v2.11.11 (or v3).

If it still crashes, you could create a GitHub issue. It would help if you could provide an example that reproduces the error.

After some extensive testing we have been able to reproduce the issue, but we are looking for some guidance on a fix.

When the number of open connections for a deployment spirals out of control, the pods for that deployment crash. Because traefik keeps holding onto the continuously growing number of open connections, it too crashes due to OOM after a period of time.
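To make the mechanism above a bit more concrete, here is a rough Python sketch of the general idea of capping in-flight requests and shedding the excess instead of holding every connection open. This is not Traefik code and not a Traefik setting: `InFlightLimiter`, `simulate`, the 503 status, and all the numbers are made up for illustration, and the slow backend is faked with a sleep. Traefik does ship an InFlightReq middleware that limits simultaneous requests, but whether it would help in this particular case is an assumption that has not been tested here.

```python
import asyncio


class InFlightLimiter:
    """Hypothetical limiter: caps concurrent requests and rejects the rest."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0   # requests currently waiting on the backend
        self.peak = 0        # highest concurrency observed
        self.rejected = 0    # requests shed because the cap was reached

    async def handle(self, backend_delay: float) -> int:
        # Shed load immediately when the cap is reached, rather than
        # holding the connection (and its memory) while the backend stalls.
        if self.in_flight >= self.max_in_flight:
            self.rejected += 1
            return 503
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        try:
            # Stand-in for waiting on a slow or crashing backend pod.
            await asyncio.sleep(backend_delay)
            return 200
        finally:
            self.in_flight -= 1


async def simulate(requests: int, max_in_flight: int, backend_delay: float = 0.05):
    """Fire `requests` concurrent calls at one limiter and collect the statuses."""
    limiter = InFlightLimiter(max_in_flight)
    statuses = await asyncio.gather(
        *(limiter.handle(backend_delay) for _ in range(requests))
    )
    return limiter, statuses


if __name__ == "__main__":
    limiter, statuses = asyncio.run(simulate(requests=1000, max_in_flight=100))
    print("peak in-flight:", limiter.peak)
    print("rejected:", limiter.rejected)
```

With the cap in place, the number of requests held at once never exceeds `max_in_flight`, so memory use is bounded by the cap rather than by how much traffic arrives while the backend is down. Without the cap, all 1000 requests in this toy run would be held open at the same time.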

Our concern is obviously that when the traefik pods fail, it's a SPOF for the entire cluster, so we are looking at ways to protect traefik from this.

We would rather not use rate limits on individual ingresses, as this could lead to performance issues, and in the worst case, if every ingress route hit its limit at the same time, the issue could occur again (see the Traefik RateLimit documentation).

Instead, we are wondering whether there is any setting we could use so that traefik can protect itself against this? :slight_smile:

Thank you

You could create an issue on the Traefik GitHub. But you should make sure the devs are able to reproduce the issue by themselves.

Thanks, I'll open an issue :slight_smile: