Background
I am trying to set up a canary deployment in which two containers are load balanced, so that I can compare error rates between the old and the new container.
I have it mostly working, except that while the new container is starting there is a very brief outage that can be seen externally.
At the time of the outage all traffic should be going to the old container, because I have set up weighted load balancing in the dynamic config like this:
[http.services]
  [http.services.servicename]
    [[http.services.servicename.weighted.services]]
      name = "servicename@docker"
      weight = 100
At this point there is no other service defined.
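With a single weighted child of weight 100, every request should go to that child. The Python sketch below models this with smooth weighted round robin. It is a simplified model under my own assumptions, not Traefik's actual wrr.go implementation, and `canary@docker` is a hypothetical name for a second child that is not part of the config above.

```python
from collections import Counter


def smooth_wrr(weights: dict[str, int], picks: int) -> list[str]:
    """Simplified smooth weighted round robin (illustrative model only)."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    order = []
    for _ in range(picks):
        # Every child gains its weight; the child with the highest score wins.
        for name, weight in weights.items():
            current[name] += weight
        chosen = max(current, key=current.get)
        # The winner pays back the total weight so the others catch up.
        current[chosen] -= total
        order.append(chosen)
    return order


if __name__ == "__main__":
    # The config above: one weighted child with weight 100.
    print(Counter(smooth_wrr({"servicename@docker": 100}, 10)))
    # Hypothetical later canary stage with a second child at weight 10.
    print(Counter(smooth_wrr({"servicename@docker": 90, "canary@docker": 10}, 100)))
```

Note that this model only chooses among the weighted children; it says nothing about how requests are spread across the servers inside `servicename@docker` itself.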
While the new container is starting, I see the following in the Traefik debug logs. This makes sense, as 192.168.128.18 is the IP of the new container and it takes a couple of seconds to start:
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/provider/acme/provider.go:895 > No ACME certificate generation required for domains ACME CA=https://acme-staging-v02.api.letsencrypt.org/directory acmeCA=https://acme-staging-v02.api.letsencrypt.org/directory domains=["<redacted>","<redacted>"] providerName=myresolver.acme routerName=websecure-servicename_relite@docker rule="Host(`<redacted>`) || Host(`<redacted>`)"
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/provider/acme/provider.go:895 > No ACME certificate generation required for domains ACME CA=https://acme-staging-v02.api.letsencrypt.org/directory acmeCA=https://acme-staging-v02.api.letsencrypt.org/directory domains=["<redacted>"] providerName=myresolver.acme routerName=websecure-www2@docker rule=Host(`<redacted>`)
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 8d45b2ebf5c5a4b2
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/proxy.go:100 > 502 Bad Gateway error="dial tcp 192.168.128.18:80: connect: connection refused"
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 103c0f26dfd67eba
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 8d45b2ebf5c5a4b2
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/proxy.go:100 > 502 Bad Gateway error="dial tcp 192.168.128.18:80: connect: connection refused"
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 103c0f26dfd67eba
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 8d45b2ebf5c5a4b2
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/proxy.go:100 > 502 Bad Gateway error="dial tcp 192.168.128.18:80: connect: connection refused"
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 103c0f26dfd67eba
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/loadbalancer/wrr/wrr.go:196 > Service selected by WRR: 8d45b2ebf5c5a4b2
2024-09-05T12:28:45Z DBG github.com/traefik/traefik/v3/pkg/server/service/proxy.go:100 > 502 Bad Gateway error="dial tcp 192.168.128.18:80: connect: connection refused"
It is at this point that I see the outage. I have set up the service labels on the new container like this:
servicename:
  restart: always
  ports:
    - 9423
    - 24231
  labels:
    - traefik.enable=true
    - traefik.http.routers.newservice_relite.rule=Host(`<redacted>`) || Host(`www.<redacted>`)
    - traefik.http.routers.newservice_relite.priority=100
    - traefik.http.services.newservice.loadbalancer.server.port=80
    - traefik.docker.network=web
  networks:
    - web
  environment:
    - LIVE=1
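Docker labels are dotted keys that Traefik reads as a configuration tree. The sketch below rebuilds that tree from the labels above, which is a quick way to see which router and service names a container actually declares, for example to compare them with the names used in the weighted config and in the logs. `labels_to_tree` is a hypothetical helper for illustration; Traefik's own label parser is more involved.

```python
def labels_to_tree(labels):
    """Turn 'a.b.c=value' label strings into a nested dict (illustration only)."""
    tree = {}
    for label in labels:
        # Split on the first '=' only; the value may contain other characters.
        key, _, value = label.partition("=")
        *parents, leaf = key.split(".")
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree


LABELS = [
    "traefik.enable=true",
    "traefik.http.routers.newservice_relite.rule=Host(`<redacted>`) || Host(`www.<redacted>`)",
    "traefik.http.routers.newservice_relite.priority=100",
    "traefik.http.services.newservice.loadbalancer.server.port=80",
    "traefik.docker.network=web",
]

if __name__ == "__main__":
    tree = labels_to_tree(LABELS)
    print("routers:", sorted(tree["traefik"]["http"]["routers"]))
    print("services:", sorted(tree["traefik"]["http"]["services"]))
```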
Is there anything I can do to stop users seeing the outage?
BTW the canary setup is mostly based on this article
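For anyone reproducing this, the alternating pattern in the debug log excerpt earlier in the post can be tallied with a short script. This is a rough sketch: it assumes each 502 line belongs to the most recent "Service selected by WRR" line, which only holds when requests are not interleaved, and the sample is the excerpt with timestamps and source paths trimmed.

```python
import re
from collections import Counter

# Patterns for the two Traefik debug messages seen in the excerpt.
SELECTED = re.compile(r"Service selected by WRR: (\w+)")
BAD_GATEWAY = re.compile(r'502 Bad Gateway error="dial tcp ([\d.]+:\d+)')

# Excerpt from the debug log, with timestamps and source paths trimmed.
SAMPLE = """\
Service selected by WRR: 8d45b2ebf5c5a4b2
502 Bad Gateway error="dial tcp 192.168.128.18:80: connect: connection refused"
Service selected by WRR: 103c0f26dfd67eba
Service selected by WRR: 8d45b2ebf5c5a4b2
502 Bad Gateway error="dial tcp 192.168.128.18:80: connect: connection refused"
Service selected by WRR: 103c0f26dfd67eba
Service selected by WRR: 8d45b2ebf5c5a4b2
502 Bad Gateway error="dial tcp 192.168.128.18:80: connect: connection refused"
Service selected by WRR: 103c0f26dfd67eba
Service selected by WRR: 8d45b2ebf5c5a4b2
502 Bad Gateway error="dial tcp 192.168.128.18:80: connect: connection refused"
"""


def tally(lines):
    """Count WRR picks per ID, 502s following each ID, and failing addresses."""
    picks, failures, addresses = Counter(), Counter(), Counter()
    last_id = None
    for line in lines:
        if m := SELECTED.search(line):
            last_id = m.group(1)
            picks[last_id] += 1
        elif (m := BAD_GATEWAY.search(line)) and last_id is not None:
            # Heuristic: attribute the 502 to the most recent pick.
            failures[last_id] += 1
            addresses[m.group(1)] += 1
    return picks, failures, addresses


if __name__ == "__main__":
    picks, failures, addresses = tally(SAMPLE.splitlines())
    print("picks:", dict(picks))
    print("502s after pick:", dict(failures))
    print("failing addresses:", dict(addresses))
```

On the excerpt, all four 502s follow picks of 8d45b2ebf5c5a4b2 and point at 192.168.128.18:80, while 103c0f26dfd67eba is picked three times without an error.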