High memory usage during DoS

During a DDoS attack, Traefik (this happens with 2.6.x, 2.8.x, and now even with 2.9.5) starts consuming memory and is eventually OOMKilled.
About 14k req/s ends up consuming about 16 GB of RAM.

How can I check why it is consuming so much memory?


You can enable debug mode; there is an API to get the details.
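For reference, a minimal sketch of what enabling the debug API looks like in a Traefik v2 static configuration (YAML; `insecure: true` exposes the API unauthenticated on port 8080 and is for test environments only):

```yaml
# traefik.yml (static configuration) -- sketch, test environment only
api:
  insecure: true  # expose the API on the "traefik" entrypoint (port 8080)
  debug: true     # enable the /debug/pprof/* endpoints
```

With that in place, something like `curl http://localhost:8080/debug/pprof/heap > heap.out` should save a heap profile that can be inspected with `go tool pprof`.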

I think it's a case of the following (Pyroscope pprof of Traefik):

accesslog ~50%
tcp entrypoint ~10%

Hello @kong62,

Thanks for sharing this.
Can you please share the pprof sample? It would be easier for anyone to help with the sample than the screenshot.
If I'm not mistaken, the screenshot you shared shows the all-time memory consumption per function, but I think it would be more helpful to see the current memory consumption.

I would gladly help, but during an attack Traefik becomes unresponsive and there is no way to obtain data from the debug API while everything is falling apart.
The only dump I have is from /debug/pprof/goroutine?debug=2, taken just before there was no response at all. Four minutes later the Docker container was killed for exceeding its cgroup RAM limit.
If you think it can help, I'll upload the file somewhere to share/debug.
The version these logs come from is v2.9.10.
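As an aside, a `?debug=2` goroutine dump is plain text, so it can be triaged with standard tools even without `go tool pprof`. A sketch, assuming the dump was saved as `goroutines.txt` (hypothetical filename):

```shell
# Triage a saved goroutine dump (from /debug/pprof/goroutine?debug=2).
# "goroutines.txt" is a hypothetical filename for the saved dump.

# Total number of goroutines:
grep -c '^goroutine ' goroutines.txt

# Goroutines grouped by state, most common first
# (header lines look like: "goroutine 18 [IO wait, 2 minutes]:"):
grep '^goroutine ' goroutines.txt \
  | sed 's/.*\[\(.*\)\]:/\1/' | cut -d, -f1 \
  | sort | uniq -c | sort -rn
```

A very large number of goroutines stuck in the same state (e.g. `IO wait` or `chan receive`) usually points at where requests are piling up.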

After many days spent debugging and fixing different problems, like exhausting the connection 4-tuples between the Traefik instance and the container serving the web page, I finally hit an internal wall inside Traefik.
Even rate limiting and a circuit breaker do not help when a single Traefik instance is hit continuously from as few as 200 different IPs. Responding with 429 takes more and more time during an attack, as requests seem to arrive and queue faster than Traefik can answer them with a 429 status. This maxes out memory (even 64 GB) without any problem in about a minute, and all regular user requests are also queued and processed with delay. Then the OOM killer does its job, and until the attackers are banned at the firewall this keeps happening.
This makes Traefik useless here, as there is no way to mitigate the issue at the firewall (requests are not connections), and it takes only a few bucks a month to rent a proxy network with a few thousand IPs, attacking from a fresh pack of 200-300 addresses after the previous ones have been banned.
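For context, the rate limiting described above is presumably Traefik's `rateLimit` middleware; a sketch in dynamic file configuration, with purely illustrative numbers:

```yaml
# dynamic configuration -- illustrative values, not a recommendation
http:
  middlewares:
    attack-limit:
      rateLimit:
        average: 100   # allowed requests per second per source
        burst: 50
        sourceCriterion:
          ipStrategy:
            depth: 1   # which X-Forwarded-For entry identifies the client
```

As noted, this limits requests on connections that have already been accepted, which is exactly why it does not stop the memory growth being described.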

Interesting findings.

A previous post showed that writing the access log in JSON can take a lot of time. Can you try disabling or reducing logging to check whether your infrastructure is simply slow on disk writes?
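If logging itself turns out to be the bottleneck, Traefik's static configuration also has a buffering option for the access log; a sketch (the buffer size is an arbitrary example value):

```yaml
# static configuration -- sketch
accessLog:
  format: common      # plain text is cheaper to produce than JSON
  bufferingSize: 100  # buffer this many lines in memory before writing
```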

Did you set up a test scenario with the 200 clients that can be replicated, or was it an uncontrolled event?

PS: I read an interesting comment about FreeBSD in a recent HN thread:

And if you're using FreeBSD with a fast NIC, look into the pfilctl command and use a "head" input hook into your NIC driver's ethernet input handling routine. Doing this allows the firewall input checks to happen early, before the packet has been received into the network stack, while it's still in DMA-mapped, driver-owned buffers. If the firewall decides a packet is to be dropped, the NIC can recycle the buffer with minimal overhead (skips DMA-unmap, buffer free, ethernet input into the IP stack, and the replacement buffer alloc, DMA map).

This allows for very fast dropping of DOS flood attacks. I've tested this using ipfw up to screening and dropping several 10s of millions of packets on 2014 era Xeons with minimal impact to traffic serving.

The access log has been disabled and it didn't help. Memory usage still grows, as does Traefik's response time, to the point that waiting over 30 s for a request makes the services unusable.

Unfortunately I don't have a test setup with 200 clients; this is what I observe during attacks (or what was observed while the access log was enabled). During today's attack the access log was disabled and it made no change to what's happening.

Thank you for the info about firewall tweaking. I might consider limiting the incoming packet rate for established connections as well. Wish me luck :slight_smile:
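For the record, one way to express that kind of per-source packet rate limit on Linux is the iptables `hashlimit` match. A hypothetical sketch (the threshold is illustrative, and the quoted comment above was about FreeBSD's pfil/ipfw, which works differently):

```shell
# Hypothetical sketch: drop packets above ~500 pkt/s per source IP on
# established HTTPS connections (requires root; threshold is illustrative).
iptables -A INPUT -p tcp --dport 443 \
  -m conntrack --ctstate ESTABLISHED \
  -m hashlimit --hashlimit-name https-pps \
  --hashlimit-mode srcip --hashlimit-above 500/second \
  -j DROP
```

This drops excess packets before they reach Traefik, though it cannot distinguish attack requests from legitimate ones on the same connection.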

Maybe it’s this new Rapid Reset attack? There is a question post about it in the forum here.

Nice catch! Thank you for pointing this out :slight_smile:
I've upgraded to v2.10.5 and I will observe whether that's it or something else.