I’m on a continuing journey to achieve truly seamless rolling updates and failover using traefik in docker swarm. Some background can be found in a previous issue, which we have now overcome: Seamless failover in docker swarm
Now that we have gotten traefik to use the docker swarm routing mesh, and it is only ever routing to healthy downstream containers, the next hurdle we have hit is the traefik daemon itself.
When I update my traefik service on a node (say, to pick up new TLS files), or when a swarm node comes back into service and recreates the traefik service, traefik comes up as “healthy” and is routed to before it has actually received any configuration updates from the provider (docker). This means that inbound requests may, for a brief period, be routed to a traefik instance that has nowhere downstream to send them, resulting in 404s and similar.
We need a healthcheck on the traefik container itself that says “yes, I’m not just running, I have also received and processed an initial configuration”. I note the internal healthcheck/ping command, but this does not check that any configuration has been received; it just confirms that traefik has read the static config and that the ping works.
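In the absence of a real healthcheck that knows whether a configuration has been received, I can force it to wait an arbitrary 30 seconds by adding the snippet below. This works because docker only runs the first check after one full interval, so the container stays in the "starting" state for 30 seconds before it is reported healthy:

    healthcheck:
      test: "true"
      interval: 30s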
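
As a rough illustration of what a configuration-aware check could look like, here is a minimal sketch that asks the traefik API whether any routers supplied by the docker provider exist yet. It rests on several assumptions that are not from the setup described above: that the traefik version in use exposes `/api/http/routers` (v2 and later; older versions would need a different endpoint), that the API is enabled and reachable at `127.0.0.1:8080` from inside the container, and that at least one docker-labelled service always exists. The names `API_URL`, `has_provider_config` and `check` are hypothetical. Python is used only for illustration; the traefik image is unlikely to ship an interpreter, so the logic would probably need porting to whatever is available there.

```python
import json
import urllib.request

# Assumption: the traefik API is enabled and listening here inside the container.
API_URL = "http://127.0.0.1:8080/api/http/routers"


def has_provider_config(routers, provider="docker"):
    """Return True if at least one router was supplied by the given provider."""
    if not isinstance(routers, list):
        return False
    return any(
        isinstance(r, dict) and r.get("provider") == provider for r in routers
    )


def check(url=API_URL, timeout=2.0):
    """Return 0 (healthy) once provider-supplied routers exist, else 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            routers = json.load(resp)
    except (OSError, ValueError):
        # API unreachable, HTTP error, or body is not valid JSON: not ready yet.
        return 1
    return 0 if has_provider_config(routers) else 1
```

A small wrapper that calls `sys.exit(check())` could then serve as the healthcheck `test` command in place of `"true"`. Filtering on the provider matters because traefik's own internal routers may be present before any docker configuration has arrived. One caveat of this approach: if no docker service defines a router at all, the instance would never be reported healthy.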