Burst of requests hangs Traefik

Let me try to explain my problem to see if you have any suggestions.
I'm stress testing an internal Docker registry.
I'm using Traefik 2.9 as my ingress controller. I installed it with the Helm chart.
I have set the resource limits to 3000m CPU and 8Gi memory.
At the moment I'm using letsencrypt-acme, so I'm not able to run multiple replicas.
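
For reference, a minimal sketch of what these settings might look like in the chart's values file. The key names assume the official traefik/traefik Helm chart layout; the actual values file in use here may differ.

```yaml
# values.yaml (sketch; assumes the official traefik/traefik chart's key names)
deployment:
  replicas: 1          # single instance, because of the ACME limitation described above
resources:
  limits:
    cpu: "3000m"       # CPU limit from the post
    memory: "8Gi"      # memory limit from the post
```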

I'm using a Docker image of 1.15Gb for the stress test.
For the test itself, I have created a Job that pulls the image and sleeps for 10s. I'm running it with a parallelism of 50, waiting for 150 completions.
I'm using AWS Fargate to ensure that every time a pod gets created, it has to pull the image.
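
A minimal sketch of such a Job, to make the setup concrete. The Job name, registry host, and image name are placeholders, and it assumes the image contains a `sleep` binary; the real manifest may differ.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: registry-pull-test            # hypothetical name
spec:
  parallelism: 50                     # up to 50 pods run at the same time
  completions: 150                    # the Job is done after 150 successful pods
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pull-test
          # placeholder image reference; the real registry sits behind Traefik
          image: registry.example.internal/stress/big-image:latest
          command: ["sleep", "10"]    # assumes the image ships a sleep binary
```

Each pod has to pull the image through Traefik before `sleep` can start, so the first wave produces 50 simultaneous pulls.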

The problem starts with the first burst of pods created by the Job. Of the 50, at least 40 get an ImagePullBackOff. After a few minutes, the situation starts to stabilize and the ImagePullBackOff errors start to clear up.
Sometimes the Traefik pod dies and gets recreated, and at other times the Traefik pod doesn't release its RAM (it continuously uses 3.5Gb).

I have some questions:

  • Is there any way to tune the Ingress buffer?
  • Is there any way to use replicas and Let's Encrypt at the same time?

Any suggestions would be much appreciated.

Just for my understanding: you are running a private Docker registry behind Traefik as a reverse proxy, and the job is pulling the 1.15Gb image through Traefik?

Let's Encrypt with multiple Traefik instances should work in Kubernetes; tutorials are available.

You got the idea. I should have added a TL;DR.
I've seen similar tutorials, but they force you to create a Certificate, and in that case you need to create the DNS records manually. We are using external DNS for that.
I will try to test it on Monday.

Thanks