We have a single Traefik pod running with Let's Encrypt to generate the certificates. It has limits of 6 GB of RAM and 3000m of CPU, and it's running on an isolated node (nothing else runs on the same node). The node has a 50 GB disk.
We were running v20.8.0 at the time of the incidents.
In that cluster we have three important services: InfluxDB, Artifactory and Sentry.
For the previous 76 days the Traefik pod had been running without issues.
Yesterday we had an incident where Traefik filled the RAM of the pod, then filled the disk of the node, and then crashed.
At that moment nothing out of the ordinary was happening. Artifactory and Sentry were running as usual, and InfluxDB had a live sync stream and a batch job running.
When we stopped the batch job Traefik started to work normally again.
We then re-launched the job. It ran without problems for 4 hours, and then Traefik crashed again.
As a first measure, we upgraded to the latest Traefik, v23.0.1.
We are trying to move to an HA configuration with cert-manager, but at the same time we are exploring other options.
My question is: could modifying memResponseBodyBytes and memRequestBodyBytes help with this issue?
- memResponseBodyBytes - Traefik Buffering Documentation - Traefik
- memRequestBodyBytes - Traefik Buffering Documentation - Traefik
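
For context, both options belong to the buffering middleware. Below is a minimal sketch of the kind of Middleware we would be tuning. It assumes the Kubernetes CRD provider and the traefik.io/v1alpha1 API group (older releases use traefik.containo.us/v1alpha1). The name and all byte values are illustrative, not our current settings.

```yaml
# Hypothetical example: the name and all values are illustrative only.
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: limit-buffering
spec:
  buffering:
    # Request body bytes kept in memory; anything beyond this is buffered on disk.
    memRequestBodyBytes: 1048576
    # Response body bytes kept in memory; anything beyond this is buffered on disk.
    memResponseBodyBytes: 1048576
    # Hard caps on total body size: oversized requests get a 413,
    # oversized responses are replaced with a 500.
    maxRequestBodyBytes: 104857600
    maxResponseBodyBytes: 104857600
```

As far as we understand the documentation, the mem* values only set the threshold between memory and disk, while the max* values bound the total body size, so we are unsure whether changing the mem* values alone would stop the disk from filling up.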
Thanks