My understanding is that it's desirable to have only one instance of Traefik fetching TLS certs. In my three-node swarm cluster, all three nodes are managers, so they all get an instance of Traefik because of the constraint set in the config. On the backend I have GlusterFS running, and that is where I mount a lot of the volumes for containers. I've read arguments back and forth that this isn't really the right way to do it for Traefik, and that instead there should be either just a single instance of the container running on the swarm, OR only one of the nodes should be updating the certs. I'm a little unclear on what the correct position is. Can someone provide some expertise on this? And, if needed, show some configuration for how it's done?
Traefik CE (Community Edition) does not support clustered Let's Encrypt. That means you can’t really have Traefik generate a TLS cert that is shared between instances.
Two potential work-arounds:
Use dnsChallenge so every instance creates its own individual TLS cert. Important: Let's Encrypt limits you to 5 duplicate certs per week for the same set of domain names, so if you don’t correctly persist the data, one of your three instances might be unable to get its cert the next time it starts.
Use an external tool (like certbot) and supply the certs to the instances via files, either individually or via a shared folder. Note that you need to touch the config files for the cert files to be reloaded. Alternatively, inline the certs into a dynamic config file, probably via a little script.
Or go crazy and load the TLS certs directly via providers.http without any file, little experiment here.
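For the second work-around, a minimal sketch of the dynamic config file might look like the following. The file path, the cert paths and the domain are placeholders, and it assumes the file provider is enabled in the static config; adjust to your own layout.

```yaml
# /dynamic/certs.yml (hypothetical path), loaded by Traefik's file provider.
# Assumes the static config enables the provider, e.g. with the flags
#   --providers.file.directory=/dynamic
#   --providers.file.watch=true
tls:
  certificates:
    # Placeholder paths: wherever certbot (or a similar tool) writes
    # the cert and key, as seen from inside the Traefik container.
    - certFile: /certs/example.com/fullchain.pem
      keyFile: /certs/example.com/privkey.pem
```

Traefik watches the config file, not the PEM files it points to, which is why the config file needs to be touched after a renewal.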
In the end it’s also a question of how you build your HA setup. We run a LB in front of Swarm, with 3 Traefik instances using host ports on 3 managers. With a Docker socket proxy you could also place Traefik on worker nodes.
Thanks @bluepuma77 I saw those comments in your example docs as well. Is there a way to tell Swarm to just limit the traefik instance to just 1 instead of global? I've already got a nice tidy keepalived configuration setup on this particular swarm cluster so the traffic is really only hitting one node anyway.
It may work with a single instance if you share the acme.json via a shared folder. If the node or Traefik dies, Swarm should create a new replica, which reads the existing config file and certs.
But keepalived only monitors the node, right? Then you probably need the ingress network, so that the ports on all nodes are forwarded to the single service. In that case, don’t use Swarm's host mode for the ports.
Otherwise the HA challenge might be that re-creating the TLS certs may take some time, depending on the number of domains.
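A single-instance setup along these lines could be sketched roughly as below. This is an untested sketch, not a definitive config: the image tag, resolver name (le), email address, GlusterFS mount path and the use of tlsChallenge are all assumptions, and the flags follow Traefik v3 naming.

```yaml
version: "3.8"

services:
  traefik:
    image: traefik:v3.0            # assumed tag; v2 uses --providers.docker.swarmMode=true instead
    command:
      - --providers.swarm.endpoint=unix:///var/run/docker.sock
      - --providers.swarm.exposedByDefault=false
      - --entrypoints.web.address=:80
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.le.acme.email=admin@example.com   # placeholder
      - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
      - --certificatesresolvers.le.acme.tlschallenge=true
    ports:
      # Host mode binds the ports only on the node running the replica.
      # Switch to "mode: ingress" if every node should forward to it.
      - target: 80
        published: 80
        mode: host
      - target: 443
        published: 443
        mode: host
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      # Hypothetical shared mount, present at the same path on every node.
      # acme.json must be readable only by its owner (mode 600).
      - /mnt/gluster/traefik/letsencrypt:/letsencrypt
    networks:
      - proxy
    deploy:
      mode: replicated             # instead of "global"
      replicas: 1                  # Swarm reschedules this one task if it dies
      placement:
        constraints:
          - node.role == manager

networks:
  proxy:
    external: true
```

The key part for limiting Traefik to one instance is deploy.mode: replicated with replicas: 1; everything else is just there to make the example complete.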
I think in my mind I was envisioning a three node cluster. GlusterFS across all three (what I have now), and then the proxy overlay network where Traefik is attached and the container still using
That one instance of Traefik will have its certs volume (and others) mounted in the GlusterFS path, so the cert info will always be available on all nodes. If the container dies, Traefik is re-spawned somewhere else, maybe even on the same swarm node. But for the sake of this, let's say it gets instantiated on another node. It's going to get the volume exactly as it was left, right?
Keepalived can easily track where Traefik is in the cluster, because that one node will be the only one with the host-based TCP bindings. We can test for that in the keepalived configuration, so the VRRP IP will always follow the Traefik container.