```yaml
version: '3.7'

services:

  traefik:
    # Use the Traefik v2 image
    image: traefik:v2.4
    networks:
      - web-public
    ports:
      # Listen on port 80, default for HTTP, necessary to redirect to HTTPS
      - target: 80
        published: 80
        mode: host
      # Listen on port 443, default for HTTPS
      - target: 443
        published: 443
        mode: host
    deploy:
      mode: global
      update_config:
        parallelism: 1
        delay: 5s
        order: start-first
      placement:
        constraints:
          # Make the Traefik service run only on a manager node,
          # as that node has the volume for the certificates
          - node.role==manager
    volumes:
      # Mount the Docker socket, so that Traefik can read the labels of other services
      - /var/run/docker.sock:/var/run/docker.sock:ro
      # Mount the volume to store the certificates
      - certificates:/certificates
    configs:
      - source: traefik
        target: /etc/traefik/traefik.yml

  app:
    image: myrepo/alpha-app
    networks:
      - web-public
    deploy:
      replicas: 1
      update_config:
        parallelism: 1
        delay: 5s
        order: start-first
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.alpha.rule=Host(`alpha.dev.test`)"
        - "traefik.http.routers.alpha.entrypoints=websecure"
        - "traefik.http.routers.alpha.tls=true"
        - "traefik.http.routers.alpha.tls.certresolver=letsencryptresolver"
        - "traefik.http.services.alpha.loadbalancer.server.port=443"
        - "traefik.http.services.alpha.loadbalancer.server.scheme=https"

volumes:
  # Create a volume to store the certificates; there is a constraint to make sure
  # Traefik is always deployed to the same Docker node with the same volume containing
  # the HTTPS certificates
  certificates:

configs:
  traefik:
    name: "traefik.yml"
    file: ./traefik.yml

networks:
  # Use the previously created public network "web-public", shared with other
  # services that need to be publicly available via this Traefik
  web-public:
    external: true
```
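The traefik.yml static configuration mounted via configs is not included in the post. As a rough sketch only, assuming the entry point and resolver names referenced by the labels above (web, websecure, letsencryptresolver) and a placeholder e-mail address, it could look something like this:

```yaml
# traefik.yml (static configuration) -- hypothetical sketch matching the labels above
entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"

providers:
  docker:
    # Read services from the Swarm API instead of plain Docker
    swarmMode: true
    exposedByDefault: false
    # How often Traefik polls Swarm for service changes (default 15s)
    swarmModeRefreshSeconds: 15

certificatesResolvers:
  letsencryptresolver:
    acme:
      email: admin@example.com          # placeholder
      storage: /certificates/acme.json  # lives on the mounted volume
      httpChallenge:
        entryPoint: web
```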
Probably the biggest problem is that you have 1 replica. Even with start-first, as soon as the replacement is running the old task will be stopped. And yes, your container healthcheck will have to be returning healthy before Traefik will balance traffic to it.
With replicas > 1 and --update-parallelism set to a sensible number, you should always have replicas still available to handle requests.
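For example (a sketch, not the original poster's configuration), something like this keeps at least one healthy task serving requests while another is being replaced:

```yaml
# Fragment of the app service definition
app:
  image: myrepo/alpha-app
  deploy:
    replicas: 2
    update_config:
      parallelism: 1       # replace one task at a time
      delay: 10s           # wait between task replacements
      order: start-first   # start the new task before stopping the old one
```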
Hello Guys!
I've also been using start-first with Swarm. See the example. Additionally, I've been adding a health check at the service level, as shown in this example. The example code is a little bit out of date, but you should be able to refer to it and adapt it accordingly.
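For reference, a container-level health check in the stack file looks roughly like this (the /health endpoint, the timings, and the availability of curl inside the image are assumptions):

```yaml
app:
  image: myrepo/alpha-app
  healthcheck:
    # Mark the task healthy only once the app answers on /health
    # (requires curl to be installed in the image)
    test: ["CMD", "curl", "-fsS", "http://localhost/health"]
    interval: 5s
    timeout: 3s
    retries: 3
    start_period: 15s
```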
I was just doing some testing with a Laravel PHP app. I was able to get zero-downtime deployments with deploy.replicas: 1 and order: start-first.
I only noticed the issue with our Node containers, so I think it might be healthcheck related. Makes sense since Node probably takes longer to spin up than PHP.
Healthcheck probes are designed to determine whether an app is live and ready to accept incoming requests. Kubernetes has two kinds, liveness and readiness, which let us assume that our application is healthy and ready to accept requests.
Docker also has health checks, but at the Traefik level you have to check yourself whether your backend is ready. Here is a link to the official Healthcheck documentation.
I see a problem with the examples given by @jakubhajek.
After a successful update, when Docker Swarm starts to shut down the old container, Traefik will only see that on one of its next health checks.
If a request comes in just before that "unhealthy" health check, it may be load-balanced to the shutting-down container, which may no longer be able to process it. That will result in an error for the client.
Or is Traefik notified that the old container is shutting down before it runs its health check?
Is there any resolution for this? Currently I do not see a proper way to implement zero-downtime rolling updates in Docker Swarm, because Traefik does not know about the services that are shutting down...
I tried to find some relevant changes in the release notes, but there is nothing in this regard.
I posted this in the "#swarm" Discord channel on "devops.fan" and this is what Martin Braun had to say:
For proper zero downtime you have to tell traefik to use the virtual ip. Look for lbswarm in the settings. Healthchecks are mandatory for this to work properly.
But with traefik you have to use lbswarm in my experience as that works well with docker healthchecks. Otherwise you have to set up healthchecks somehow in traefik as well.
lbswarm (Traefik configuration)
This was a very helpful direction and I will test this out soon!
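For anyone else reading along, lbswarm is a per-service label of the Traefik Docker provider; a sketch of how it could be applied to the alpha service from the example above (the port 80 here is a placeholder for whatever the app actually listens on):

```yaml
app:
  image: myrepo/alpha-app
  # A container HEALTHCHECK (as in the earlier example) is still needed,
  # since Swarm's virtual IP only routes to tasks that report healthy.
  deploy:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.alpha.rule=Host(`alpha.dev.test`)"
      # Tell Traefik to send requests to Swarm's virtual IP for this service
      # instead of load-balancing the individual task IPs itself
      - "traefik.docker.lbswarm=true"
      - "traefik.http.services.alpha.loadbalancer.server.port=80"
```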
The first question is about request duration. For short-lived requests, it is usually not a problem.
We use a Node.js application that stops accepting new connections when the corresponding signal is sent; ongoing requests are finished. Docker usually waits 10 seconds for a container to exit, then kills it.
Use parameters like --update-order start-first, --update-delay and --update-parallelism to always have another running container.
Be aware that the Traefik Docker Swarm provider has a default poll interval of 15 seconds, so new containers are only picked up then, and your updates therefore have to be "slower". (Doc)
Check the Docker service update doc; similar parameters are available in the Docker stack compose file.
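As a sketch, the stack-file equivalents of those flags, with the delay deliberately chosen to be longer than the provider's default 15-second poll interval:

```yaml
deploy:
  replicas: 2
  update_config:
    order: start-first
    parallelism: 1
    # Longer than Traefik's default 15s Swarm poll interval, so the new
    # task can be discovered before the next one is replaced
    delay: 20s
```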
By the way, I found a solution with no changes to Traefik at all.
It works like this:
In the container I have the usual health check route (/health), and I added a 'special' health check route just for Traefik (/health/lb).
The application has graceful shutdown logic. When a termination signal arrives, BEFORE we switch the Express server off (for Node.js apps), I make this special health check route start returning 500 codes for the next 11 seconds. ONLY AFTER these 11 seconds do I start the graceful shutdown procedure. So within these 11 seconds the app is still functioning as normal.
In the service definition I added something like this:
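The snippet itself did not survive in this copy of the thread; based on the description (Traefik probing the dedicated /health/lb route every 5 seconds), it was presumably something along these lines, with alpha standing in for the real service name:

```yaml
deploy:
  labels:
    # Dedicated health check route, probed by Traefik itself
    - "traefik.http.services.alpha.loadbalancer.healthcheck.path=/health/lb"
    - "traefik.http.services.alpha.loadbalancer.healthcheck.interval=5s"
    - "traefik.http.services.alpha.loadbalancer.healthcheck.timeout=3s"
```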
So, as you can see, Traefik always checks the app's health via this 'special' route every 5 seconds.
I also increased the grace period, because by default it is 10 seconds and in our case we need more than 11 seconds to exit gracefully:
stop_grace_period: 20s
So when Docker starts shutting the container down, we first inform Traefik that it is unhealthy. Traefik stops balancing traffic to this container, and then we safely allow the container to die. Of course, the update order should start the new container first, like this:
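Presumably something like this in the service definition:

```yaml
deploy:
  update_config:
    order: start-first
    parallelism: 1
```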
That's it! It does not matter how many replicas we have. I tested this approach with JMeter on one replica by sending thousands of API calls per second while continuously updating the app. This is an absolutely ZERO-downtime approach, with only one drawback: the time to shut down is more than 11 seconds. But that is not a problem at all!
How are you going to use the start-first order for Traefik itself, which publishes ports?
When you publish ports in host mode, you cannot run two Traefik containers on the same machine because they both need the same port (i.e. the second container will stay pending with a message like no suitable node ... host-mode port already in use on 1 node). It means order: start-first just does not make any sense.
At least this is the behaviour I get. Could somebody maybe explain what I am doing wrong?
It depends on how you use the ports. When a Docker Swarm service (like Traefik) just declares a port, the Docker ingress network is used: the port is available on all nodes and connections are forwarded by Docker to an available container, probably round-robin.
If you open the port in host mode (see the simple Traefik Swarm example), then the local container uses the port exclusively and you need a stop-first policy. We use this, with an externally managed load balancer in front of the Traefik nodes, to ensure HA even during a Traefik upgrade.
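To illustrate the difference between the two port modes (a sketch, not taken from anyone's configuration in this thread):

```yaml
# Ingress mode (the default): Docker's routing mesh owns the published port on
# every node and forwards connections to a healthy task, so start-first works.
ports:
  - target: 80
    published: 80

# Host mode: the container binds the node's port directly, so a second Traefik
# task cannot start on the same node and stop-first is effectively required.
# ports:
#   - target: 80
#     published: 80
#     mode: host
```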
The example docker-compose file (in the first post of this thread) contains mode: host ports, and based on that example people here are discussing the order: start-first approach. So I am confused now.
It should work for Traefik and the target apps when ingress mode is used, since in that case the port is not really bound by a container; Docker creates its own ingress network on the port and distributes requests internally.
This should enable zero-downtime Traefik deployments. But note that this is not HA, as the node can still fail.