High load and disk usage crashed the ingress controller

radicaled42 · May 19, 2023, 9:00pm

We have a single Traefik pod running with Let's Encrypt to generate the certificates. It has limits of 6 GB of RAM and 3000m of CPU, and it's running on an isolated node (nothing else runs on that node). The node has a 50 GB disk.
We were running v20.8.0 at the time of the incidents.
In that cluster we have three important services: InfluxDB, Artifactory, and Sentry.

For the previous 76 days the Traefik pod had been running without issues.
Yesterday we had an incident where Traefik filled the pod's RAM, then filled the node's disk, and then crashed.
At that moment it wasn't doing anything out of the ordinary. Artifactory and Sentry were running as usual, and InfluxDB had a live sync stream and a batch job.
When we stopped the batch job, Traefik started working normally again.
We then relaunched the job; it ran without problems for 4 hours, after which Traefik crashed again.

As a first measure, we upgraded to the latest Traefik, v23.0.1.
We are trying to move to an HA configuration with cert-manager, but at the same time we are exploring other options.

My question is: could modifying memResponseBodyBytes and memRequestBodyBytes help with this issue?

memResponseBodyBytes - Traefik Buffering Documentation - Traefik
memRequestBodyBytes - Traefik Buffering Documentation - Traefik
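
For context, here is a minimal sketch of what a buffering middleware with these options might look like as a Kubernetes CRD. The middleware name, namespace, and every byte value are illustrative assumptions, not recommendations, and the apiVersion depends on the Traefik release in use. As far as the buffering documentation describes it, the mem* options set the threshold above which a body is buffered to disk instead of memory, while the max* options cap the total body size; these settings only apply to routes that reference the middleware.

```yaml
# Hypothetical example: name, namespace and all sizes are assumptions.
apiVersion: traefik.containo.us/v1alpha1  # newer releases also accept traefik.io/v1alpha1
kind: Middleware
metadata:
  name: limit-buffering        # hypothetical name
  namespace: default           # hypothetical namespace
spec:
  buffering:
    # Upper bound on the request body; larger requests are rejected.
    maxRequestBodyBytes: 104857600   # 100 MiB (illustrative)
    # Request bodies above this size are buffered on disk instead of in memory.
    memRequestBodyBytes: 1048576     # 1 MiB (illustrative)
    # Upper bound on the response body.
    maxResponseBodyBytes: 104857600  # 100 MiB (illustrative)
    # Response bodies above this size are buffered on disk instead of in memory.
    memResponseBodyBytes: 1048576    # 1 MiB (illustrative)
```

Note that lowering the mem* thresholds moves buffering from RAM to disk, so on a node whose disk is already filling up, the max* limits are probably the more relevant knobs to test first.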

Thanks
