Traefik OOM-killed handling 20k Ingresses

Hello,

We run a SaaS platform hosting around 20k domains. Each day we onboard approximately 600 new customers, while around 300 leave. Traffic is handled by a Traefik (version 2.9.1) load balancer running in a Kubernetes (K8s) cluster with 13 replicas, each with 20 GB of memory and 8 CPU cores. Despite the large number of tenants, the request rate is relatively low, averaging around 1K requests per second (rps) across all tenants combined.

However, we have observed that when new domains are added to the cluster, Traefik's memory usage spikes, often leading to out-of-memory (OOM) kills. We use the K8s Ingress provider. While reviewing the Traefik source, I found the following comment in the Ingress provider:

// Note that event is the *first* event that came in during this
// throttling interval -- if we're hitting our throttle, we may have
// dropped events. This is fine, because we don't treat different
// event types differently. But if we do in the future, we'll need to
// track more information about the dropped events.
conf := p.loadConfigurationFromIngresses(ctxLog, k8sClient)

Am I understanding this correctly? Does it mean that every change to a single Ingress causes Traefik to reload and rebuild the entire configuration from K8s? Also, since loadConfigurationFromIngresses iterates over every Ingress, each change costs O(n) work in the number of Ingresses, which seems inefficient and may be what is causing the issues we observe in our cluster.
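
To make the pattern concrete, here is a rough sketch in Go of what I believe is happening. This is not Traefik's actual implementation, just the shape of it as I read the provider code; loadConfigurationFromAllIngresses and throttleEvents below are simplified stand-ins I wrote for illustration:

package main

import (
	"fmt"
	"time"
)

// loadConfigurationFromAllIngresses stands in for Traefik's
// loadConfigurationFromIngresses: it walks every Ingress in the cluster,
// so the cost of one reload grows with the total number of Ingresses,
// not with the size of the change that triggered it.
func loadConfigurationFromAllIngresses(ingressCount int) string {
	return fmt.Sprintf("configuration rebuilt from %d ingresses", ingressCount)
}

// throttleEvents forwards the first event of each throttle window and drops
// the rest, which only works because every event leads to the same full
// rebuild anyway (as the comment in the provider code says).
func throttleEvents(in <-chan string, throttle time.Duration) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for ev := range in {
			out <- ev
			deadline := time.After(throttle)
		drain:
			for {
				select {
				case _, ok := <-in:
					if !ok {
						return // producer is done
					}
					// Event dropped: it arrived inside the throttle window.
				case <-deadline:
					break drain
				}
			}
		}
	}()
	return out
}

func main() {
	const ingressCount = 20000 // roughly our tenant count
	events := make(chan string)

	// Simulate a burst of new customer domains being added.
	go func() {
		for i := 0; i < 10; i++ {
			events <- fmt.Sprintf("ingress customer-%d added", i)
			time.Sleep(100 * time.Millisecond)
		}
		close(events)
	}()

	// Every event that survives throttling causes a full O(n) rebuild,
	// even though only a single Ingress changed.
	for ev := range throttleEvents(events, 500*time.Millisecond) {
		fmt.Printf("%s -> %s\n", ev, loadConfigurationFromAllIngresses(ingressCount))
	}
}

If that reading is correct, then with roughly 900 Ingress changes per day (600 added, 300 removed), each one triggering a pass over ~20k Ingresses, a burst of onboarding would cause a lot of allocation in a short window.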

P.S.: I am aware that increasing the providersThrottleDuration configuration can reduce the number of reloads, but this would delay how quickly new customer domains go live, which negatively impacts our users.
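
For reference, this is the setting I mean, in static-configuration YAML form (the 10s value here is just an illustration; 2s is the default as far as I know):

providers:
  providersThrottleDuration: 10s   # default is 2s; larger values coalesce more events per reload
  kubernetesIngress: {}

Raising it to something like 10s would mean fewer full rebuilds during an onboarding burst, but a new domain could take that much longer to become routable, which is the trade-off I would like to avoid.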


Have you tested with the latest v2.11 or v3.0?

No, I have not. However, I could not find anything in the changelog related to the K8s Ingress provider.