Long connections + large config + many config reloads = high memory usage?

Recently I experienced a large RAM spike in Traefik and I'm trying to figure out why it happened and how to prevent it.

  • Traefik version: 2.10.1

  • Provider: Kubernetes CRD

  • Requests per second: <10

  • Number of IngressRoutes: 800+, each with 6 rules for ~5000 total rules.

  • TLS: Enabled, with a wildcard cert assigned to each IngressRoute.

  • Middleware used: forward-auth

  • Metrics:

  • pprof:

(pprof) top
Showing nodes accounting for 84.70MB, 51.06% of 165.86MB total
Dropped 131 nodes (cum <= 0.83MB)
Showing top 10 nodes out of 205
      flat  flat%   sum%        cum   cum%
   14.50MB  8.74%  8.74%    14.50MB  8.74%  github.com/gorilla/mux.mapFromPairsToString
   12.68MB  7.65% 16.39%    12.68MB  7.65%  reflect.unsafe_NewArray
      10MB  6.03% 22.42%       10MB  6.03%  github.com/traefik/traefik/v2/pkg/config/runtime.(*ServiceInfo).UpdateServerStatus
      10MB  6.03% 28.45%       10MB  6.03%  github.com/traefik/traefik/v2/pkg/server/provider.MakeQualifiedName
    7.50MB  4.52% 32.97%     7.50MB  4.52%  github.com/gorilla/mux.(*Router).NewRoute
    6.50MB  3.92% 36.89%     6.50MB  3.92%  github.com/traefik/traefik/v2/pkg/server/service.buildProxy
    6.50MB  3.92% 40.81%     6.50MB  3.92%  sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).literalStore
       6MB  3.62% 44.43%        6MB  3.62%  github.com/traefik/traefik/v2/pkg/middlewares/metrics.NewServiceMiddleware
    5.50MB  3.32% 47.75%     5.50MB  3.32%  reflect.mapassign_faststr
    5.50MB  3.32% 51.06%        6MB  3.62%  github.com/traefik/traefik/v2/pkg/middlewares/auth.NewForward
(pprof) top 10 -cum
Showing nodes accounting for 0, 0% of 165.86MB total
Dropped 131 nodes (cum <= 0.83MB)
Showing top 10 nodes out of 205
      flat  flat%   sum%        cum   cum%
         0     0%     0%   115.81MB 69.83%  github.com/traefik/traefik/v2/pkg/safe.GoWithRecover.func1
         0     0%     0%   115.31MB 69.52%  github.com/traefik/traefik/v2/pkg/safe.(*Pool).GoCtx.func1
         0     0%     0%   103.33MB 62.30%  github.com/traefik/traefik/v2/pkg/server.(*ConfigurationWatcher).applyConfigurations
         0     0%     0%    88.75MB 53.51%  main.switchRouter.func1
         0     0%     0%    84.01MB 50.65%  github.com/traefik/traefik/v2/pkg/server.(*RouterFactory).CreateRouters
         0     0%     0%    81.51MB 49.14%  github.com/traefik/traefik/v2/pkg/server/router.(*Manager).BuildHandlers
         0     0%     0%    81.51MB 49.14%  github.com/traefik/traefik/v2/pkg/server/router.(*Manager).buildEntryPointHandler
         0     0%     0%    55.01MB 33.16%  github.com/traefik/traefik/v2/pkg/server/router.(*Manager).buildRouterHandler
         0     0%     0%    52.01MB 31.36%  github.com/traefik/traefik/v2/pkg/server/router.(*Manager).buildHTTPHandler
         0     0%     0%    41.51MB 25.02%  github.com/traefik/traefik/v2/pkg/server/service.(*InternalHandlers).BuildHTTP

What happened was that an unrelated client-side bug leaked connections (top graph) that stayed open for several minutes. I would expect Traefik to handle many more than 200 open connections, but each connection seemed to use nearly 3MB of RAM.

Since I have long connections + a large config + many config reloads, this issue seems like the most likely explanation so far: Possible memory leak, regexp instructions consuming a lot of memory · Issue #8044 · traefik/traefik · GitHub

Is it possible that old connections can hold on to old versions of the dynamic config? And is it also possible that the dynamic config struct can get very large, like maybe storing one copy of the TLS cert per IngressRoute?

If so, is there any way to reduce the per-connection RAM impact?

Also, Traefik doesn't do any buffering by default, right? So the request body size shouldn't matter? The request byte rate was a consistent 2.5MiB/s before and after the spike.
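For what it's worth, my understanding is that body buffering is opt-in via the `buffering` middleware, so nothing is buffered unless a Middleware like this (name is just an example) is attached to a route:

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: limit-body   # example name
spec:
  buffering:
    maxRequestBodyBytes: 2000000   # reject bodies larger than ~2MB
    memRequestBodyBytes: 1048576   # spill to disk above 1MiB
```

Since I don't use this middleware anywhere, request bodies should stream straight through.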

Thanks in advance!

> And is it also possible that the dynamic config struct can get very large, like maybe storing one copy of the TLS cert per IngressRoute?

Looks like if several IngressRoutes all refer to the same TLS cert, only one copy is stored, because the certs are kept in a map whose key is the namespace + name of the Secret: traefik/pkg/provider/kubernetes/crd/kubernetes_http.go at 0ee377bc9f036124b063e7abc3f0958d51ace5fb · traefik/traefik · GitHub

So I don't think that's the cause of the problem.

If you think this is an issue/bug, you should copy it to the Traefik GitHub (link).