I have three servers with Traefik v2 deployed to them. On each server there are multiple Docker containers deployed (using plain Docker; containers carry Traefik config as Docker labels), which are exposed via Traefik. Some services are deployed as a pair to two servers with the DNS A record pointing to both of them. This works as long as the service is operational on both servers. When one of them goes down, e.g., during a rolling deployment, 50% of all requests end up with an error.
Is there a way to configure the Traefik load-balancer for a service so that requests to a service are forwarded to the other server if the local service is down?
I tried adding an extra server to the load balancer using file-based config, but that seems to be silently ignored (I'm not sure whether it is even possible to mix file-based and Docker-based config).
P.S.: I know that this won't help if the whole server goes down, but I have a solution in place for that scenario that alters the DNS.
Yes and no. If your service does not have a health check, requests will keep being forwarded to the service that is down. However, even with a health check it's not all rosy. Health checks run on an interval, so a request that arrives after a node goes down but before the next periodic health check runs will still fail.
A better solution would be to remove the node from the load balancer pool before bringing it down.
You can do this with dynamic configuration. If you are using the file provider in watch mode, updating the dynamic configuration file should accomplish what you want. You can define both the router and the service in that file, or you can define the router in the Docker labels and the service in the configuration file; both should work.
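As a rough illustration of running the Docker and file providers side by side, here is a minimal sketch of a Traefik v2 static configuration in TOML. The entry point name `web` and the directory `/etc/traefik/dyn` are assumptions made for this sketch; the same options can also be passed as CLI flags instead of a file.

```toml
# traefik.toml (static configuration): a sketch, adjust to your setup

[entryPoints]
  [entryPoints.web]
    address = ":80"            # assumed entry point name and port, no TLS

[api]
  insecure = true              # dashboard on :8080; disable for production

[providers]
  [providers.docker]
    exposedByDefault = false   # only containers labelled traefik.enable=true

  [providers.file]
    directory = "/etc/traefik/dyn"   # assumed path; mount the directory, not a single file
    watch = true                     # reload dynamic config when files change
```

With `watch = true`, edits to files in that directory are picked up without restarting Traefik, which is what makes manually removing a node from the pool practical.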
Here is an example:
# This configuration file will work on all nodes without changes
# We are not using TLS for this simple example
# Here we can watch dashboard
# in order to see dashboard on 8080. Can be disabled for prod, or routed via api@internal without `insecure`
# do not try to expose containers when there is no "traefik.enable=true" label
# Even if we have only one dyn.toml file, we need to mount the directory, or watch (updating config on changes) won't work
# Just a test service to expose
# This is so that other nodes can call it
# For the sake of this example take everything, but of course any valid rule would do
# The service definition will come from the file provider
# Health check is optional see explanation in previous post
path = "/"
interval = "10s"
timeout = "3s"
# This is the local backend; remove these two lines while it is down
url = "http://whoami"
# This is another node (specify the correct url below)
# remove these two lines while that other node is down
url = "http://another-node:8081"
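For reference, a dynamic configuration file along these lines might look like the sketch below. The service name `whoami` is an assumption made for illustration; the health-check values and the two URLs are the ones shown above.

```toml
# dyn.toml (dynamic configuration, loaded by the file provider): a sketch

[http.services.whoami.loadBalancer]          # "whoami" is an assumed service name

  # Optional health check
  [http.services.whoami.loadBalancer.healthCheck]
    path = "/"
    interval = "10s"
    timeout = "3s"

  # Local backend
  [[http.services.whoami.loadBalancer.servers]]
    url = "http://whoami"

  # The other node (specify the correct URL)
  [[http.services.whoami.loadBalancer.servers]]
    url = "http://another-node:8081"
```

If the router is defined through Docker labels, it has to reference this service with the provider suffix, i.e. `whoami@file` under the naming assumed here, because the service lives in the file provider's namespace rather than the Docker one.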
So this is the basic working framework. This setup will round-robin between all the nodes and take a node out of rotation if its periodic health check fails (though, as explained above, this does not guarantee that no requests are missed). You can also take a node out of the balancer manually by removing or commenting out the corresponding lines in the configuration.
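To make the manual step concrete, this is roughly what the dynamic file could look like while the local container is being redeployed. It is only a sketch and assumes a file-provider service named `whoami` (a hypothetical name) with the two backends from the example.

```toml
# dyn.toml while the local node is being redeployed: a sketch

[http.services.whoami.loadBalancer]          # "whoami" is an assumed service name
  [http.services.whoami.loadBalancer.healthCheck]
    path = "/"
    interval = "10s"
    timeout = "3s"

  # Local backend, commented out before stopping the container:
  # [[http.services.whoami.loadBalancer.servers]]
  #   url = "http://whoami"

  # The other node keeps serving all traffic in the meantime
  [[http.services.whoami.loadBalancer.servers]]
    url = "http://another-node:8081"
```

The other server's Traefik would need the matching change (dropping this node's URL from its own file) for the duration of the deployment; once the container is back up, restore the lines on both sides.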
One last note: you might consider using Kubernetes for this. Rolling deployments work exceptionally well there without the need to take nodes offline manually, and this load-balancing aspect is taken care of out of the box.