We are periodically experiencing a major service disruption because Traefik does not appear to be resilient to the failure of a downstream application server. We are currently running Traefik version 1.6 on our container platform in AWS. Whenever an AWS EC2 instance that hosts backend containers Traefik forwards to (not an instance hosting Traefik itself) fails, Traefik stops forwarding requests to all backend instances, even though only 1 of 200+ servers has failed.
Here is what we observe when this happens:
- The only error message Traefik logs when this happens is the following:
2021/02/02 18:10:18 reverseproxy.go:321: httputil: ReverseProxy read error during body copy: http2: server sent GOAWAY and closed the connection; LastStreamID=1999, ErrCode=NO_ERROR, debug=""
- Traefik "ping" health checks still return an HTTP 200 status code, so our container platform doesn't automatically restart the Traefik service.
- We have a service that rate limits backend member connections (Traefik returns an HTTP 429), and Traefik starts returning HTTP 429s for those requests during the event, which confirms the requests are at least reaching the Traefik frontend.
- All other requests just hang: no response code is returned, and the connection isn't refused or reset.
- Memory is eventually exhausted as the number of hung connections builds.
- Restarting Traefik resolves the issue.
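
Since the "ping" endpoint stays green while routed requests hang, one way to shorten detection time might be a health check that also sends a request through Traefik to a backend and treats a timeout as a failure. The following is only a minimal sketch of that idea in Python, not something we run today. The function names `classify_probe` and `traefik_is_routing`, the 5 second timeout, and both URLs are hypothetical placeholders.

```python
import socket
import urllib.error
import urllib.request


def classify_probe(url, timeout_s=5.0):
    """Send one GET and return 'ok', 'http_<code>', 'hung', or 'unreachable'.

    Note: urlopen's timeout applies to each blocking socket operation,
    not to the request as a whole.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return "ok" if resp.status == 200 else f"http_{resp.status}"
    except urllib.error.HTTPError as exc:
        # Non-2xx responses such as 429 still prove that something answered.
        return f"http_{exc.code}"
    except urllib.error.URLError as exc:
        # Connect-phase failures are wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "hung"
        return "unreachable"
    except (socket.timeout, TimeoutError):
        # Timeouts while waiting for the response are raised unwrapped.
        return "hung"
    except OSError:
        # Resets or disconnects while reading the response.
        return "unreachable"


def traefik_is_routing(ping_url, routed_url, timeout_s=5.0):
    """Healthy only if both the ping endpoint and a routed request return 200.

    A 429 on the routed URL is deliberately not counted as healthy, because
    Traefik itself generates that response during the event.
    """
    return (
        classify_probe(ping_url, timeout_s) == "ok"
        and classify_probe(routed_url, timeout_s) == "ok"
    )
```

A platform health check could call `traefik_is_routing` with the ping URL and a route to a lightweight backend endpoint that is not rate limited, and restart Traefik after several consecutive failures. This only reduces the time of impact; it does not explain why the hang occurs.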
We would appreciate any help figuring out why Traefik hangs and stops routing requests to all the other available, healthy instances. Any pointers on how to remediate this or reduce the duration of impact would be much appreciated. Thanks!