High memory usage during DoS

During a DDoS attack, Traefik (this happens with 2.6.x, 2.8.x, and now even with 2.9.5) starts consuming memory and is eventually OOMKilled.
About 14k req/s ends up consuming about 16 GB of RAM.

How can I check why it is consuming so much memory?


You can enable debug mode; there is an API to get the details.
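For reference, a minimal sketch of what enabling the debug API looks like in a Traefik v2 static configuration (YAML; `insecure: true` exposes the API unauthenticated on port 8080 and is for test environments only):

```yaml
# traefik.yml (static configuration) -- sketch, test environment only
api:
  insecure: true  # expose the API on the "traefik" entrypoint (port 8080)
  debug: true     # enable the /debug/pprof/* endpoints
```

With that in place, something like `curl http://localhost:8080/debug/pprof/heap > heap.out` should save a heap profile that can be inspected with `go tool pprof`.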

I think it's a case of the following (Pyroscope pprof of Traefik):

accesslog ~50%
tcp entrypoint ~10%

Hello @kong62,

Thanks for sharing this.
Can you please share the pprof sample? It would be easier for anyone to help with the sample than the screenshot.
If I'm not mistaken, the screenshot you shared shows the all-time memory consumption per function, but I think it would be more helpful to see the current memory consumption.

I would gladly help, but during an attack Traefik becomes unresponsive and there is no way to obtain data from the debug API while everything is falling apart.
The only dump I have is from /debug/pprof/goroutine?debug=2, taken just before there was no response at all. Four minutes later the Docker container was killed for exceeding its cgroup RAM limit.
If you think it can help, I'll upload the file somewhere to share/debug.
The version these logs come from is v2.9.10.
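As an aside, a `?debug=2` goroutine dump is plain text, so it can be triaged with standard tools even without `go tool pprof`. A sketch, assuming the dump was saved as `goroutines.txt` (hypothetical filename):

```shell
# Triage a saved goroutine dump (from /debug/pprof/goroutine?debug=2).
# "goroutines.txt" is a hypothetical filename for the saved dump.

# Total number of goroutines:
grep -c '^goroutine ' goroutines.txt

# Goroutines grouped by state, most common first
# (header lines look like: "goroutine 18 [IO wait, 2 minutes]:"):
grep '^goroutine ' goroutines.txt \
  | sed 's/.*\[\(.*\)\]:/\1/' | cut -d, -f1 \
  | sort | uniq -c | sort -rn
```

A very large number of goroutines stuck in the same state (e.g. `IO wait` or `chan receive`) usually points at where requests are piling up.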

After many days spent debugging and fixing different problems, like exhausting the connection 4-tuples between the Traefik instance and the container serving the web page, I finally hit an internal wall inside Traefik.
Even rate limiting and a circuit breaker do not help when a single Traefik instance is hit continuously from as few as 200 different IPs. Responding with 429 takes more and more time during an attack, as requests seem to arrive and queue faster than Traefik can answer them with a 429 status. This maxes out memory (even 64 GB) without any problem in about a minute, and all regular user requests are also queued and processed with delay. Then the OOM killer does its job, and until the attackers are banned at the firewall this keeps happening.
This makes Traefik useless here, as there is no way to mitigate the issue at the firewall (requests are not connections), and it takes only a few bucks a month to rent a proxy network with a few thousand IPs, attacking from a fresh pack of 200-300 addresses after the previous ones have been banned.
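For context, the rate limiting described above is presumably Traefik's `rateLimit` middleware; a sketch in dynamic file configuration, with purely illustrative numbers:

```yaml
# dynamic configuration -- illustrative values, not a recommendation
http:
  middlewares:
    attack-limit:
      rateLimit:
        average: 100   # allowed requests per second per source
        burst: 50
        sourceCriterion:
          ipStrategy:
            depth: 1   # which X-Forwarded-For entry identifies the client
```

As noted, this limits requests on connections that have already been accepted, which is exactly why it does not stop the memory growth being described.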

Interesting findings.

A previous post showed that writing the access log in JSON can take a lot of time. Can you try disabling or reducing logging to check whether your infrastructure is simply slow on disk writes?
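If logging itself turns out to be the bottleneck, Traefik's static configuration also has a buffering option for the access log; a sketch (the buffer size is an arbitrary example value):

```yaml
# static configuration -- sketch
accessLog:
  format: common      # plain text is cheaper to produce than JSON
  bufferingSize: 100  # buffer this many lines in memory before writing
```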

Did you set up a test scenario with the 200 clients that can be replicated, or was it an uncontrolled event?

PS: I read an interesting comment about FreeBSD in a recent HN thread:

And if you're using FreeBSD with a fast NIC, look into the pfilctl command and use a "head" input hook into your NIC driver's ethernet input handling routine. Doing this allows the firewall input checks to happen early, before the packet has been received into the network stack, while it's still in DMA-mapped, driver-owned buffers. If the firewall decides a packet is to be dropped, the NIC can recycle the buffer with minimal overhead (skips DMA-unmap, buffer free, ethernet input into the IP stack, and the replacement buffer alloc, DMA map).

This allows for very fast dropping of DOS flood attacks. I've tested this using ipfw up to screening and dropping several 10s of millions of packets on 2014 era Xeons with minimal impact to traffic serving.

The access log has been disabled and it didn't help. Memory usage still grows, as does Traefik's response time, to the point that waiting over 30 s for a request makes the services unusable.

Unfortunately I don't have a test setup with 200 clients; this is what I observe during attacks (or what was observed while the access log was enabled). During today's attack the access log was disabled and it made no change to what's happening.

Thank you for the info about firewall tweaking. I might consider limiting the incoming packet rate for established connections as well. Wish me luck :slight_smile:
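For the record, one way to express that kind of per-source packet rate limit on Linux is the iptables `hashlimit` match. A hypothetical sketch (the threshold is illustrative, and the quoted comment above was about FreeBSD's pfil/ipfw, which works differently):

```shell
# Hypothetical sketch: drop packets above ~500 pkt/s per source IP on
# established HTTPS connections (requires root; threshold is illustrative).
iptables -A INPUT -p tcp --dport 443 \
  -m conntrack --ctstate ESTABLISHED \
  -m hashlimit --hashlimit-name https-pps \
  --hashlimit-mode srcip --hashlimit-above 500/second \
  -j DROP
```

This drops excess packets before they reach Traefik, though it cannot distinguish attack requests from legitimate ones on the same connection.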

Maybe it’s this new Rapid Reset attack? There is a question post about it in the forum here.

Nice catch! Thank you for pointing this out :slight_smile:
I've upgraded to v2.10.5 and I will observe whether that's it or something else.