Slight downtime using weighted load balancing

Background
I am trying to set up a canary deployment where two containers are load balanced, so I can compare error rates between the old and the new container.
I have it mostly working, except that when the new container is starting there is a very slight outage that is visible externally.
At the time of the outage all the traffic should be going to the old container, as I have set up the weighted load balancing like this in the dynamic config.

[http.services]
  [http.services.servicename]
    [[http.services.servicename.weighted.services]]
      name = "servicename@docker"
      weight = 100

At this point there is no other service defined.
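For reference, once the canary container is up, the intention is for the weighted section to list both services, something like this (the second service name and the weights here are illustrative, not my actual config):

```toml
[http.services]
  [http.services.servicename]
    # Old container still takes most of the traffic
    [[http.services.servicename.weighted.services]]
      name = "servicename@docker"
      weight = 90
    # New (canary) container gets a small share
    [[http.services.servicename.weighted.services]]
      name = "newservice@docker"
      weight = 10
```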

When the new container is starting I see the following appear in the Traefik debug logs, which makes sense, as 192.168.128.18 is the IP of the new container and it takes a couple of seconds to start.

2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/provider/acme/provider.go:895 > No ACME certificate generation required for domains ACME CA=https://acme-staging-v02.api.letsencrypt.org/directory acmeCA=https://acme-staging-v02.api.letsencrypt.org/directory domains=["<redacted>","<redacted>"] providerName=myresolver.acme routerName=websecure-servicename_relite@docker rule="Host(`<redacted>`) || Host(`<redacted>`)"
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/provider/acme/provider.go:895 > No ACME certificate generation required for domains ACME CA=https://acme-staging-v02.api.letsencrypt.org/directory acmeCA=https://acme-staging-v02.api.letsencrypt.org/directory domains=["<redacted>"] providerName=myresolver.acme routerName=websecure-www2@docker rule=Host(`<redacted>`)
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 8d45b2ebf5c5a4b2
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/proxy.go:100 > 502 Bad Gateway error="dial tcp 192.168.128.18:80: connect: connection refused"
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 103c0f26dfd67eba
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 8d45b2ebf5c5a4b2
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/proxy.go:100 > 502 Bad Gateway error="dial tcp 192.168.128.18:80: connect: connection refused"
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 103c0f26dfd67eba
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 8d45b2ebf5c5a4b2
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/proxy.go:100 > 502 Bad Gateway error="dial tcp 192.168.128.18:80: connect: connection refused"
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 103c0f26dfd67eba
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 8d45b2ebf5c5a4b2
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/proxy.go:100 > 502 Bad Gateway error="dial tcp 192.168.128.18:80: connect: connection refused"

and it's at this point that I see the outage. I have set up the service labels like this on the new container:

  servicename:
    restart: always
    ports:
      - 9423
      - 24231
    labels:
      - traefik.enable=true
      - traefik.http.routers.newservice_relite.rule=Host(`<redacted>`) || Host(`www.<redacted>`)
      - traefik.http.routers.newservice_relite.priority=100
      - traefik.http.services.newservice.loadbalancer.server.port=80
      - traefik.docker.network=web
    networks:
      - web
    environment:
      - LIVE=1

Is there anything I can do to stop users seeing the outage?

BTW the canary setup is mostly based on this article

I have found that adding health checks to all my services seems to make adding a new service seamless; removing a service is still showing errors, however.

When adding new services, you need to make sure they only accept requests when they are ready. For example, if you start the application but it is not ready yet because it needs to load some config, you need to use health checks so that Traefik only enables it once it is ready.
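
With Docker, one way to do this is a container-level health check; Traefik's Docker provider should then only route to the container once Docker reports it healthy. A minimal Compose sketch (assuming the app has curl available and exposes a /health endpoint, both of which are assumptions here):

```yaml
  servicename:
    # Docker marks the container "healthy" only once this check passes;
    # until then Traefik's Docker provider should not send it traffic.
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 2s
      timeout: 2s
      retries: 3
      start_period: 5s
```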

A similar setup is important when you remove a service. If a request is still ongoing and you (hard) kill the service, Traefik will return an error. To avoid this, you need to ensure that your service can receive and process a SIGTERM: close the port so no new connections are accepted and the health check fails, but still finish the ongoing connections. You also need to ensure that the time between SIGTERM and SIGKILL is longer than the typical duration of a request.
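
In Compose, the window between SIGTERM and SIGKILL can be extended with stop_grace_period, for example:

```yaml
  servicename:
    # Allow up to 30s for in-flight requests to drain after SIGTERM
    # before Docker falls back to SIGKILL (the default is 10s).
    stop_grace_period: 30s
```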


I had not thought of implementing a controlled shutdown; I had presumed that docker compose stop would have done that. Thanks for the advice.

Docker Compose does the regular Docker shutdown, but you need to ensure your app is handling it.

If, for example, you run a script inside the container, it might not pass the signals on to the commands it runs.

If you run a Node application, you need to ensure you are closing the server port to further connections. Or set a flag that is picked up by the health check.

The health check needs to run often enough not to let through new requests that would take longer than the standard 10-second shutdown delay. Alternatively, you can extend the time Docker allows for shutdown, but then the whole deploy process takes longer.
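
Traefik can also run its own health check against the backend, which removes a failing instance from rotation quickly if the interval is short. A sketch using service labels (the /health endpoint is an assumption, and the service name here matches the one from the Compose file above):

```yaml
    labels:
      # Traefik probes this path itself and stops routing to the
      # backend as soon as the check fails; /health is an assumed endpoint.
      - traefik.http.services.newservice.loadbalancer.healthcheck.path=/health
      - traefik.http.services.newservice.loadbalancer.healthcheck.interval=2s
      - traefik.http.services.newservice.loadbalancer.healthcheck.timeout=1s
```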


Perfect, that's what it is. Thanks!

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.