Traefik OOM-killed handling 20k Ingresses

Hello,

We run a SaaS platform hosting around 20k domains. Each day we onboard approximately 600 new customers, while around 300 leave. Traffic is handled by a Traefik (version 2.9.1) load balancer running in a Kubernetes (K8s) cluster with 13 replicas, each with 20 GB of memory and 8 CPU cores. Despite the large number of tenants, the request rate is relatively low, averaging around 1K requests per second (rps) across all tenants combined.

However, we have observed that when new domains are added to the cluster, Traefik's memory usage spikes, often leading to out-of-memory (OOM) kills. We use the K8s Ingress provider. While reviewing the Traefik source, I found the following comment in the Ingress provider:

// Note that event is the *first* event that came in during this
// throttling interval -- if we're hitting our throttle, we may have
// dropped events. This is fine, because we don't treat different
// event types differently. But if we do in the future, we'll need to
// track more information about the dropped events.
conf := p.loadConfigurationFromIngresses(ctxLog, k8sClient)

Am I understanding this correctly? Does it mean that every change to a single Ingress causes Traefik to reload and rebuild the entire configuration from K8s? Also, since loadConfigurationFromIngresses iterates over every Ingress, each change costs O(n) work in the number of Ingresses, which seems inefficient and may be what is causing the issues we observe in our cluster.
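
To make the pattern concrete, here is a rough sketch in Go of what I believe is happening. This is not Traefik's actual implementation, just the shape of it as I read the provider code; loadConfigurationFromAllIngresses and throttleEvents below are simplified stand-ins I wrote for illustration:

package main

import (
	"fmt"
	"time"
)

// loadConfigurationFromAllIngresses stands in for Traefik's
// loadConfigurationFromIngresses: it walks every Ingress in the cluster,
// so the cost of one reload grows with the total number of Ingresses,
// not with the size of the change that triggered it.
func loadConfigurationFromAllIngresses(ingressCount int) string {
	return fmt.Sprintf("configuration rebuilt from %d ingresses", ingressCount)
}

// throttleEvents forwards the first event of each throttle window and drops
// the rest, which only works because every event leads to the same full
// rebuild anyway (as the comment in the provider code says).
func throttleEvents(in <-chan string, throttle time.Duration) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for ev := range in {
			out <- ev
			deadline := time.After(throttle)
		drain:
			for {
				select {
				case _, ok := <-in:
					if !ok {
						return // producer is done
					}
					// Event dropped: it arrived inside the throttle window.
				case <-deadline:
					break drain
				}
			}
		}
	}()
	return out
}

func main() {
	const ingressCount = 20000 // roughly our tenant count
	events := make(chan string)

	// Simulate a burst of new customer domains being added.
	go func() {
		for i := 0; i < 10; i++ {
			events <- fmt.Sprintf("ingress customer-%d added", i)
			time.Sleep(100 * time.Millisecond)
		}
		close(events)
	}()

	// Every event that survives throttling causes a full O(n) rebuild,
	// even though only a single Ingress changed.
	for ev := range throttleEvents(events, 500*time.Millisecond) {
		fmt.Printf("%s -> %s\n", ev, loadConfigurationFromAllIngresses(ingressCount))
	}
}

If that reading is correct, then with roughly 900 Ingress changes per day (600 added, 300 removed), each one triggering a pass over ~20k Ingresses, a burst of onboarding would cause a lot of allocation in a short window.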

P.S.: I am aware that increasing the providersThrottleDuration configuration can reduce the number of reloads, but this would delay how quickly new customer domains go live, which negatively impacts our users.
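
For reference, this is the setting I mean, in static-configuration YAML form (the 10s value here is just an illustration; 2s is the default as far as I know):

providers:
  providersThrottleDuration: 10s   # default is 2s; larger values coalesce more events per reload
  kubernetesIngress: {}

Raising it to something like 10s would mean fewer full rebuilds during an onboarding burst, but a new domain could take that much longer to become routable, which is the trade-off I would like to avoid.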


Have you tested with the latest v2.11 or v3.0?

No, I have not. However, I could not find anything in the changelog related to the K8s Ingress provider.