Long connections + large config + many config reloads = high memory usage?

Recently I experienced a large RAM spike in Traefik and I'm trying to figure out why it happened and how to prevent it.

  • Traefik version: 2.10.1

  • Provider: Kubernetes CRD

  • Requests per second: <10

  • Number of IngressRoutes: 800+, each with 6 rules for ~5000 total rules.

  • TLS: Enabled, with a wildcard cert assigned to each IngressRoute.

  • Middleware used: forward-auth

  • Metrics:

  • pprof:

(pprof) top
Showing nodes accounting for 84.70MB, 51.06% of 165.86MB total
Dropped 131 nodes (cum <= 0.83MB)
Showing top 10 nodes out of 205
      flat  flat%   sum%        cum   cum%
   14.50MB  8.74%  8.74%    14.50MB  8.74%  github.com/gorilla/mux.mapFromPairsToString
   12.68MB  7.65% 16.39%    12.68MB  7.65%  reflect.unsafe_NewArray
      10MB  6.03% 22.42%       10MB  6.03%  github.com/traefik/traefik/v2/pkg/config/runtime.(*ServiceInfo).UpdateServerStatus
      10MB  6.03% 28.45%       10MB  6.03%  github.com/traefik/traefik/v2/pkg/server/provider.MakeQualifiedName
    7.50MB  4.52% 32.97%     7.50MB  4.52%  github.com/gorilla/mux.(*Router).NewRoute
    6.50MB  3.92% 36.89%     6.50MB  3.92%  github.com/traefik/traefik/v2/pkg/server/service.buildProxy
    6.50MB  3.92% 40.81%     6.50MB  3.92%  sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).literalStore
       6MB  3.62% 44.43%        6MB  3.62%  github.com/traefik/traefik/v2/pkg/middlewares/metrics.NewServiceMiddleware
    5.50MB  3.32% 47.75%     5.50MB  3.32%  reflect.mapassign_faststr
    5.50MB  3.32% 51.06%        6MB  3.62%  github.com/traefik/traefik/v2/pkg/middlewares/auth.NewForward
(pprof) top 10 -cum
Showing nodes accounting for 0, 0% of 165.86MB total
Dropped 131 nodes (cum <= 0.83MB)
Showing top 10 nodes out of 205
      flat  flat%   sum%        cum   cum%
         0     0%     0%   115.81MB 69.83%  github.com/traefik/traefik/v2/pkg/safe.GoWithRecover.func1
         0     0%     0%   115.31MB 69.52%  github.com/traefik/traefik/v2/pkg/safe.(*Pool).GoCtx.func1
         0     0%     0%   103.33MB 62.30%  github.com/traefik/traefik/v2/pkg/server.(*ConfigurationWatcher).applyConfigurations
         0     0%     0%    88.75MB 53.51%  main.switchRouter.func1
         0     0%     0%    84.01MB 50.65%  github.com/traefik/traefik/v2/pkg/server.(*RouterFactory).CreateRouters
         0     0%     0%    81.51MB 49.14%  github.com/traefik/traefik/v2/pkg/server/router.(*Manager).BuildHandlers
         0     0%     0%    81.51MB 49.14%  github.com/traefik/traefik/v2/pkg/server/router.(*Manager).buildEntryPointHandler
         0     0%     0%    55.01MB 33.16%  github.com/traefik/traefik/v2/pkg/server/router.(*Manager).buildRouterHandler
         0     0%     0%    52.01MB 31.36%  github.com/traefik/traefik/v2/pkg/server/router.(*Manager).buildHTTPHandler
         0     0%     0%    41.51MB 25.02%  github.com/traefik/traefik/v2/pkg/server/service.(*InternalHandlers).BuildHTTP

What happened was that an unrelated client-side bug leaked connections (top graph) that stayed open for several minutes. I would expect Traefik to handle many more than 200 open connections, but each connection seemed to use nearly 3MB of RAM.

Since I have long connections + a large config + many config reloads, this issue seems like the most likely explanation so far: Possible memory leak, regexp instructions consuming a lot of memory · Issue #8044 · traefik/traefik · GitHub

Is it possible that old connections can hold on to old versions of the dynamic config? And is it also possible that the dynamic config struct can get very large, like maybe storing one copy of the TLS cert per IngressRoute?

If so, is there any way to reduce the per-connection RAM impact?

Also, Traefik doesn't do any buffering by default, right? So the request body size shouldn't matter? The request byte rate was a consistent 2.5MiB/s before and after the spike.
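For what it's worth, my understanding is that body buffering is opt-in via the `buffering` middleware, so nothing is buffered unless a Middleware like this (name is just an example) is attached to a route:

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: limit-body   # example name
spec:
  buffering:
    maxRequestBodyBytes: 2000000   # reject bodies larger than ~2MB
    memRequestBodyBytes: 1048576   # spill to disk above 1MiB
```

Since I don't use this middleware anywhere, request bodies should stream straight through.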

Thanks in advance!

> And is it also possible that the dynamic config struct can get very large, like maybe storing one copy of the TLS cert per IngressRoute?

Looks like if several IngressRoutes all refer to the same TLS cert, only one copy is stored, because the certs are kept in a map whose key is the namespace + name of the Secret: traefik/pkg/provider/kubernetes/crd/kubernetes_http.go at 0ee377bc9f036124b063e7abc3f0958d51ace5fb · traefik/traefik · GitHub

So I don't think that's the cause of the problem.

If you think this is an issue/bug, you should copy it to the Traefik GitHub (link).