We recently started migrating some of our production websites from Docker + manual haproxy to Docker Swarm + Traefik on bare metal servers, using purchased SSL wildcard certificates. I like being able to scale the containers easily when more traffic is expected and to have them register automatically with the reverse proxy under their domain name.
We have seen occasional SSL cert errors in the browser. When refreshing repeatedly, every third request would fail. I checked the Traefik logs and one node/instance did not seem to receive the request, so I assumed it was faulty and removed it from the load balancer configuration.
Today we saw in the logs that an external Python job received SSL errors for a full hour - TWICE!
HTTP connection error: HTTPSConnectionPool(host='domain.tld', port=443): Max retries exceeded with url: /receiver/
(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1091)')))
Our Traefik SSL configuration includes one private certificate:
This morning we had continuous SSL errors for one hour straight, and that happened twice. Docker Swarm and the services were not changed during that time. What could cause Traefik to suddenly respond only with the private certificate when the request is for wildcard1, and then work normally again after an hour?
Running traefik:v2.4.8 in Docker version 20.10.6, build 370c289 on Debian 10.9 minimal.
One further observation: it seemed that when we had the rare SSL cert error it was mostly Firefox showing "invalid SSL", not Chrome. Not sure if Chrome behaves differently and retries by itself.
Today the external Python script sent about 4000 small messages via HTTPS to 3 traefik:v2.4.8 CE Docker instances on 3 Docker Swarm masters. Those distribute the requests to 6 containers on 6 worker nodes. Overall it's a very small setup on powerful bare metal servers with Debian: 4 services with 24 containers running, no changes during the day. I still see about 200 SSL errors spread throughout the day:
2021-06-23 16:51:26,112 [ERROR] HTTP connection error: HTTPSConnectionPool(host='domain.tld', port=443): Max retries exceeded with url: /receiver (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1091)')))
[I removed the private cert with the same CN/domain, but SSL errors still showed up occasionally.]
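To see whether errors like the one above cluster in certain hours or are spread evenly, the client log can be bucketed by hour. This is only a minimal sketch: it assumes every log line starts with the timestamp format shown above, and the function name `ssl_errors_per_hour` is made up for illustration.

```python
import re
from collections import Counter

# Matches the timestamp prefix of the client log lines shown above,
# e.g. "2021-06-23 16:51:26,112 [ERROR] ..." (assumed format).
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2},\d{3} \[ERROR\]")


def ssl_errors_per_hour(lines):
    """Count CERTIFICATE_VERIFY_FAILED errors per "date hour:00" bucket."""
    counts = Counter()
    for line in lines:
        # Ignore errors that are not certificate verification failures.
        if "CERTIFICATE_VERIFY_FAILED" not in line:
            continue
        match = LINE_RE.match(line)
        if match:
            date, hour = match.groups()
            counts[f"{date} {hour}:00"] += 1
    return counts
```

Feeding it the lines of the log file (for example `ssl_errors_per_hour(open(path))`, with `path` pointing at your own log) gives a per-hour count, which makes it easier to tell a one-hour outage from a constant low error rate.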
Finally found the secondary issue: the other Traefik servers had the wrong SSL configuration and still knew about the self-signed certificate with the same name. I know I need to provide the SSL certs to all servers manually, but I forgot that re-launching the Docker Swarm Traefik service is not enough.
docker service rm traefik_traefik
docker stack deploy --compose-file traefik_ssl_dashboard.yml traefik
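After redeploying, one way to check that every Traefik node serves a certificate that verifies is to connect to each node's IP directly while sending the real domain as SNI. The following is a rough sketch using only the Python standard library; the helper names `probe_node` and `failing_nodes` and the IPs in the usage comment are placeholders, not part of the original setup.

```python
import socket
import ssl


def probe_node(ip, hostname, port=443, timeout=5.0):
    """Do one verified TLS handshake against a specific node.

    Connects to `ip` but sends `hostname` as SNI, so the node has to pick
    the certificate for that domain. Returns "ok" or an error description.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((ip, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=hostname):
                return "ok"
    except ssl.SSLCertVerificationError as exc:
        # e.g. "self signed certificate" when the wrong cert is served
        return f"verify failed: {exc.verify_message}"
    except OSError as exc:
        return f"connection error: {exc}"


def failing_nodes(results):
    """Return the node IPs whose probe result is not "ok", sorted."""
    return sorted(ip for ip, status in results.items() if status != "ok")


# Example usage (placeholder IPs for the three Swarm masters):
# results = {ip: probe_node(ip, "domain.tld")
#            for ip in ["192.0.2.10", "192.0.2.11", "192.0.2.12"]}
# print(failing_nodes(results))
```

Probing each node separately avoids the load balancer hiding a single misconfigured instance behind two healthy ones.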
Because Traefik needs configuration in two files, you can't just launch Traefik as a Docker Swarm service; you always need to supply at least one config file for SSL to all servers.
IMHO there may be good reasons why Traefik has static and dynamic configuration, but from a user perspective it is just extra complexity to have the configuration split in two parts and to have to supply a separate config file for SSL every time.