Huge Consul KV update ends in resource spikes

Last week we had to migrate about 1000 customers in our DB, which led to over 3000 new resources being created in Consul KV that were then picked up by Traefik.
This led to enormous resource spikes that eventually ended in Traefik being forcefully stopped by our EKS cluster.

Now we are back to normal, but I am still wondering what caused these spikes.
I went through thousands of log lines, but the best I can do is guess.
However, I want to understand it and prevent future events like this.

Can anyone point me in a direction?
I searched GitHub for possibly related memory leaks or similar issues in combination with KV, but found nothing relevant.
Were there simply too many updates at once for Traefik to handle? Or could something like an error in a router cause spikes like this?
I ask because the way we fixed it was by rolling back our changes and removing all the newly created routes.
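One thing I am also wondering about is batching: Traefik watches the whole KV prefix, so many individual writes mean many watch wakeups, whereas Consul's transaction endpoint (`/v1/txn`) can apply up to 64 operations atomically per request. A rough sketch of how the writes could be grouped into transaction payloads (a hypothetical helper, not our actual migration code):

```python
import base64

def txn_batches(pairs, max_ops=64):
    """Split (key, value) pairs into payloads for Consul's /v1/txn endpoint,
    which accepts at most 64 operations per transaction.
    Values must be base64-encoded in the transaction body."""
    batches = []
    for start in range(0, len(pairs), max_ops):
        batches.append([
            {"KV": {"Verb": "set", "Key": key,
                    "Value": base64.b64encode(value.encode()).decode()}}
            for key, value in pairs[start:start + max_ops]
        ])
    return batches
```

Each batch would then be sent as a JSON body in a single PUT to the agent's `/v1/txn` endpoint, so 3000+ keys would arrive in a few dozen requests instead of thousands.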

The environment is:

  • EKS version 1.32
  • Traefik 3.6.7

Thanks in advance for any hint!
Leo

In the meantime, I created some Python scripts to simulate the behaviour in an isolated environment. I even tested the churn approach from otlp memory leak · Issue #12232 · traefik/traefik · GitHub.
The result: while there is a significant CPU spike, the memory spike is very low, and even the CPU spike is nowhere near what we saw in production.
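For anyone who wants to reproduce this, a minimal sketch of that kind of churn looks roughly like the following (the key layout under `traefik/http` and the host names are illustrative; the `put` callable is injected so the same loop can run against a stub or against a real local agent):

```python
import urllib.request

def route_entries(i, prefix="traefik/http"):
    """KV entries for one hypothetical customer router/service pair,
    loosely mirroring Traefik's KV provider key layout."""
    return {
        f"{prefix}/routers/customer-{i}/rule": f"Host(`customer-{i}.example.com`)",
        f"{prefix}/routers/customer-{i}/service": f"customer-{i}",
        f"{prefix}/services/customer-{i}/loadBalancer/servers/0/url":
            f"http://10.0.0.{i % 250 + 1}:8080",
    }

def churn(put, n=3000):
    """Write n customers' worth of KV entries one key at a time
    and return the total number of writes."""
    written = 0
    for i in range(n):
        for key, value in route_entries(i).items():
            put(key, value)
            written += 1
    return written

def consul_put(key, value, consul="http://localhost:8500"):
    """Single PUT against a Consul agent (address is an assumption)."""
    req = urllib.request.Request(
        f"{consul}/v1/kv/{key}", data=value.encode(), method="PUT")
    urllib.request.urlopen(req)

if __name__ == "__main__":
    churn(consul_put)  # requires a reachable Consul agent
```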
Any ideas, or is this something for a GitHub issue?

You can reach the devs via the Traefik GitHub for bugs, but you should provide a reproducible example.

Really useful breakdown. I like how the discussion covers both the theory and the practical side of things, which is not always the case in these threads.