Recently I experienced a large RAM spike in Traefik, and I'm trying to figure out why it happened and how to prevent it.
- Traefik version: 2.10.1
- Provider: Kubernetes CRD
- Requests per second: <10
- Number of IngressRoutes: 800+, each with 6 rules, for ~5,000 total rules
- TLS: enabled, with a wildcard cert assigned to each IngressRoute
- Middleware used: forward-auth
- Metrics:
- pprof:
```
(pprof) top
Showing nodes accounting for 84.70MB, 51.06% of 165.86MB total
Dropped 131 nodes (cum <= 0.83MB)
Showing top 10 nodes out of 205
      flat  flat%   sum%        cum   cum%
   14.50MB  8.74%  8.74%    14.50MB  8.74%  github.com/gorilla/mux.mapFromPairsToString
   12.68MB  7.65% 16.39%    12.68MB  7.65%  reflect.unsafe_NewArray
      10MB  6.03% 22.42%       10MB  6.03%  github.com/traefik/traefik/v2/pkg/config/runtime.(*ServiceInfo).UpdateServerStatus
      10MB  6.03% 28.45%       10MB  6.03%  github.com/traefik/traefik/v2/pkg/server/provider.MakeQualifiedName
    7.50MB  4.52% 32.97%     7.50MB  4.52%  github.com/gorilla/mux.(*Router).NewRoute
    6.50MB  3.92% 36.89%     6.50MB  3.92%  github.com/traefik/traefik/v2/pkg/server/service.buildProxy
    6.50MB  3.92% 40.81%     6.50MB  3.92%  sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).literalStore
       6MB  3.62% 44.43%        6MB  3.62%  github.com/traefik/traefik/v2/pkg/middlewares/metrics.NewServiceMiddleware
    5.50MB  3.32% 47.75%     5.50MB  3.32%  reflect.mapassign_faststr
    5.50MB  3.32% 51.06%        6MB  3.62%  github.com/traefik/traefik/v2/pkg/middlewares/auth.NewForward
(pprof) top 10 -cum
Showing nodes accounting for 0, 0% of 165.86MB total
Dropped 131 nodes (cum <= 0.83MB)
Showing top 10 nodes out of 205
      flat  flat%   sum%        cum   cum%
         0     0%     0%   115.81MB 69.83%  github.com/traefik/traefik/v2/pkg/safe.GoWithRecover.func1
         0     0%     0%   115.31MB 69.52%  github.com/traefik/traefik/v2/pkg/safe.(*Pool).GoCtx.func1
         0     0%     0%   103.33MB 62.30%  github.com/traefik/traefik/v2/pkg/server.(*ConfigurationWatcher).applyConfigurations
         0     0%     0%    88.75MB 53.51%  main.switchRouter.func1
         0     0%     0%    84.01MB 50.65%  github.com/traefik/traefik/v2/pkg/server.(*RouterFactory).CreateRouters
         0     0%     0%    81.51MB 49.14%  github.com/traefik/traefik/v2/pkg/server/router.(*Manager).BuildHandlers
         0     0%     0%    81.51MB 49.14%  github.com/traefik/traefik/v2/pkg/server/router.(*Manager).buildEntryPointHandler
         0     0%     0%    55.01MB 33.16%  github.com/traefik/traefik/v2/pkg/server/router.(*Manager).buildRouterHandler
         0     0%     0%    52.01MB 31.36%  github.com/traefik/traefik/v2/pkg/server/router.(*Manager).buildHTTPHandler
         0     0%     0%    41.51MB 25.02%  github.com/traefik/traefik/v2/pkg/server/service.(*InternalHandlers).BuildHTTP
```
What happened was that an unrelated client-side bug leaked connections (top graph) that stayed open for several minutes. I would expect Traefik to handle far more than 200 open connections, but each connection seemed to consume nearly 3 MB of RAM.
Since I have long-lived connections, a large config, and frequent config reloads, this theory seems the most likely to me so far: Possible memory leak, regexp instructions consuming a lot of memory · Issue #8044 · traefik/traefik · GitHub
Is it possible for old connections to hold references to old versions of the dynamic config? And is it also possible for the dynamic config struct to get very large, for example by storing one copy of the TLS cert per IngressRoute?
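To illustrate what I mean, here's a minimal Go sketch (not Traefik's actual code; the `router` type and the rule count are made up) of how an in-flight connection could pin an old router after a hot swap of the kind `switchRouter` appears to do:

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// router stands in for a built HTTP router; the rules slice stands in
// for thousands of compiled matchers. Hypothetical type for illustration.
type router struct {
	rules []string
}

// heldAfterSwap simulates a config reload happening while one
// long-lived connection still holds a reference to the old router.
func heldAfterSwap() int {
	var current atomic.Value

	// Initial config build.
	current.Store(&router{rules: make([]string, 5000)})

	// A long-lived connection captures the router that handled it.
	held := current.Load().(*router)

	// Config reload: publish a new router; the old one is no longer current.
	current.Store(&router{rules: make([]string, 5000)})

	runtime.GC()
	// `held` keeps the entire old router reachable, so its memory cannot
	// be reclaimed until the connection finishes and drops the reference.
	return len(held.rules)
}

func main() {
	fmt.Println(heldAfterSwap())
}
```

If this is roughly what happens, then every leaked connection that spans a reload would retain one full generation of the routing table.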
If so, is there any way to reduce the per-connection RAM impact?
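One mitigation I'm considering, assuming the leaked connections are mostly idle, is tightening the entry point transport timeouts so Traefik closes them sooner. A static-config sketch (the entry point name and the 60s/180s values are just examples):

```yaml
entryPoints:
  websecure:
    address: ":443"
    transport:
      respondingTimeouts:
        readTimeout: 60s   # max time to read the full request, 0 = unlimited
        idleTimeout: 180s  # max keep-alive idle time before the connection is closed
```

Would that actually bound the lifetime of connections like these, or do half-open connections evade these timeouts?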
Also, Traefik doesn't do any request buffering by default, right? So request body size shouldn't matter? The request byte rate was a consistent 2.5 MiB/s before and after the spike.
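My understanding (please correct me if I'm wrong) is that bodies are streamed by default and buffering is opt-in via the buffering middleware, something like this (the middleware name and limits are just examples):

```yaml
http:
  middlewares:
    limit-body:
      buffering:
        maxRequestBodyBytes: 2000000  # reject request bodies over ~2 MB
        memRequestBodyBytes: 1048576  # buffer up to 1 MiB in memory before spooling to disk
```

Since I don't use this middleware anywhere, I'm assuming request bodies aren't contributing to the RAM spike.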
Thanks in advance!