Bad Gateway during Helm Upgrade/Rollout

Hello,

When using the Kubernetes rollout strategy, either with Helm or by itself, I get a small "blip" of 502 Bad Gateway responses. We really need zero-downtime deployments. We have tried the following:

  • Lengthening the "pauseTime", on the theory that Traefik needs more time to recognize the newly rolled-out pods
  • Using a higher maxSurge

Is there something we are missing here? Our test rig sends a request every 0.5 seconds while we do a helm upgrade/rollout. That's how we're seeing the blip of 502s. I have some other thoughts:

  • Maybe a pod is being taken down in the middle of a request?
  • Traefik looks at the service, so perhaps there's something at the Kubernetes level, such as needing to deploy a new service, because the current service may not know those pods are being taken out of rotation?

Thanks.

We think we located the issue. We are running Nginx, and when SIGTERM is sent, our version of Nginx does a fast shutdown instead of a graceful shutdown, so in-flight requests get dropped. We updated the Docker configs for our containers to set STOPSIGNAL to SIGQUIT, and that seems to have fixed the issue.
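
For anyone hitting the same thing, here is a minimal sketch of that change. It assumes the container is built from the official nginx image; the base image tag is only illustrative, and your Dockerfile will have more in it. The only line that matters is STOPSIGNAL.

```dockerfile
# Assumption: illustrative base image; substitute whatever your container is built from.
FROM nginx:stable

# nginx handles SIGTERM as a fast shutdown and SIGQUIT as a graceful shutdown
# (workers finish in-flight requests before exiting), so ask the container
# runtime to send SIGQUIT when the container is stopped.
STOPSIGNAL SIGQUIT
```

After rebuilding the image and rolling it out, rerunning the same 0.5-second request loop during a helm upgrade is a reasonable way to check whether the 502s are gone in your setup.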

So it wasn't Traefik, it was our containers/nginx.

Thanks for the update @Souvent22 :slight_smile: