SSL outage for 1 hour with Traefik and Docker Swarm

We just started to migrate some of our production websites from Docker + manual haproxy to Docker Swarm + Traefik on bare metal servers using purchased SSL wildcard certificates. I like the idea of easily scaling the containers when more traffic is expected and having them automatically register their domain name with the reverse proxy.

We have seen occasional SSL cert errors in the browser. When refreshing repeatedly, every third request would fail. I checked the logs of Traefik and one node/instance seemed to not receive the request, so I assumed it was faulty and just removed it from the load balancer configuration.

Today we saw in the logs that one external Python job received SSL errors for a full hour - TWICE!

HTTP connection error: HTTPSConnectionPool(host='domain.tld', port=443): Max retries exceeded with url: /receiver/ 
(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1091)')))

Our Traefik SSL configuration includes one private certificate:

tls:
  certificates:
    - certFile: /data/traefik/certs/private1.crt 
      keyFile: /data/traefik/certs/private1.key
    - certFile: /data/traefik/certs/wildcard1.crt 
      keyFile: /data/traefik/certs/wildcard1.key
    - certFile: /data/traefik/certs/wildcard2.crt
      keyFile: /data/traefik/certs/wildcard2.key
    - certFile: /data/traefik/certs/wildcard3.crt 
      keyFile: /data/traefik/certs/wildcard3.key
  stores:
    default:
      defaultCertificate: 
        certFile: /data/traefik/certs/wildcard1.crt 
        keyFile: /data/traefik/certs/wildcard1.key

More info about the setup: Reddit post

This morning we had continuous SSL errors for 1 hour straight, and that happened twice. Docker Swarm and the services were not changed during that time. What could be the reason that Traefik suddenly only responds with the private certificate when the request is for wildcard1? And then works normally again after 1 hour?

Running traefik:v2.4.8 in Docker version 20.10.6, build 370c289 on Debian 10.9 minimal.

One further observation: it seemed that when we had the rare SSL cert error it was mostly Firefox showing "invalid SSL", not Chrome. Not sure if Chrome behaves differently and retries by itself.

Thanks
bluepuma

Yeah, it just happened again, I got a security warning in Firefox.

What is happening with Traefik that it sometimes just uses the first certificate instead of matching with the additional wildcard certificates from the config file?

Hello @bluepuma77,

Do any of the domains listed in the private certificate match the requests you are making?

Your code here says that domain.tld is requested... Does private1.crt have a domain/SAN with domain.tld?

Hi @daniel.tomcej, that's a really interesting question.

Yes, the CN is identical in the private and first wildcard certificate ("*.domainA.tld"). I have used this before with haproxy and it seemed to work with Traefik - at least most of the time :)

I thought Traefik as a reverse proxy would automatically find the matching cert for a CN. I did not expect it to suddenly prefer the wrong cert for an hour. So is using the same CN in two certificates not recommended?

(I used to deploy two certificates with the same CN during the switch-over period, when a purchased certificate neared end of life.)

What log level setting do I need to discover cert problems in the Traefik log, is INFO enough?

Hi @daniel.tomcej,

I removed the self signed certificate yesterday, so only purchased and valid wildcard certificates with different domains are left in the config:

tls:
  certificates:
    # - certFile: /data/traefik/certs/private1.crt 
    #   keyFile: /data/traefik/certs/private1.key
    - certFile: /data/traefik/certs/wildcard1.crt 
      keyFile: /data/traefik/certs/wildcard1.key
    - certFile: /data/traefik/certs/wildcard2.crt
      keyFile: /data/traefik/certs/wildcard2.key
    - certFile: /data/traefik/certs/wildcard3.crt 
      keyFile: /data/traefik/certs/wildcard3.key
  stores:
    default:
      defaultCertificate: 
        certFile: /data/traefik/certs/wildcard1.crt 
        keyFile: /data/traefik/certs/wildcard1.key

Today the external python script sent about 4000 small messages via https to 3 traefik:v2.4.8 CE Docker instances on 3 Docker Swarm Masters. Those distribute the request to 6 containers on 6 worker nodes. Overall it's a very small setup on powerful bare metal servers with Debian. 4 services with 24 containers running, no changes during the day. I still see about 200 SSL errors spread throughout the day:

2021-06-23 16:51:26,112 [ERROR] HTTP connection error: HTTPSConnectionPool(host='domain.tld', port=443): Max retries exceeded with url: /receiver (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1091)')))

I enabled the traefik logs:

          - --log.level=INFO
          - --log.filepath=/data/traefik/logs/traefik.log

but the only thing I see is boring info messages on all three Traefik instances:

time="2021-06-23T16:51:04+02:00" level=info msg="Skipping same configuration" providerName=docker
time="2021-06-23T16:51:19+02:00" level=info msg="Skipping same configuration" providerName=docker
time="2021-06-23T16:51:34+02:00" level=info msg="Skipping same configuration" providerName=docker
time="2021-06-23T16:51:49+02:00" level=info msg="Skipping same configuration" providerName=docker

Also the access log only shows a few requests but no errors.

Where would I see HTTP requests with invalid certificate?
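
A failed certificate check happens on the client during the TLS handshake, before any HTTP request is sent, so the access log has nothing to record. One way to narrow it down from the outside is to connect to each Traefik node directly and compare the certificate it serves. The following is a minimal sketch in Python, not something from the original setup: the node addresses are placeholders, the function names are made up for illustration, and it assumes each Traefik instance is reachable individually on port 443 (for example with host-mode published ports rather than the Swarm routing mesh).

```python
import hashlib
import socket
import ssl
from collections import Counter


def fetch_cert_der(node_ip, server_name, port=443, timeout=5.0):
    """Return the DER-encoded leaf certificate node_ip serves for server_name."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # has to be disabled before CERT_NONE is set
    ctx.verify_mode = ssl.CERT_NONE  # we want to see the cert even if it is invalid
    with socket.create_connection((node_ip, port), timeout=timeout) as sock:
        # server_hostname sets the SNI value Traefik uses to pick a certificate
        with ctx.wrap_socket(sock, server_hostname=server_name) as tls:
            return tls.getpeercert(binary_form=True)


def tally_fingerprints(nodes, server_name, attempts=10, fetch=fetch_cert_der):
    """Map each node to a Counter of SHA-256 fingerprints seen over `attempts` handshakes."""
    result = {}
    for node in nodes:
        seen = Counter()
        for _ in range(attempts):
            der = fetch(node, server_name)
            seen[hashlib.sha256(der).hexdigest()] += 1
        result[node] = seen
    return result


if __name__ == "__main__":
    # Placeholder addresses (assumption): replace with the nodes running Traefik.
    nodes = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]
    for node, seen in tally_fingerprints(nodes, "domain.tld").items():
        for fingerprint, count in seen.items():
            print(node, count, fingerprint[:16])
```

If one node reports a different fingerprint than the others, that instance is serving a different certificate for the same name, which would explain errors that only hit a fraction of the requests.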

[I removed the private cert with the same CN/domain but still SSL errors showed up occasionally]

Finally found the secondary issue: the other Traefik servers had the wrong SSL configuration and still knew about the self signed certificate with the same name. I know I need to provide the SSL certs to all servers manually, but I just forgot that re-launching the Docker Swarm Traefik service is not enough.

docker service rm traefik_traefik
docker stack deploy --compose-file traefik_ssl_dashboard.yml traefik

Because Traefik needs configuration in two files you can't just launch Traefik as a Docker Swarm service but always need to supply at least one config file for SSL to all servers.

IMHO there may be reasons why Traefik has static and dynamic configuration, but from a user perspective it is just extra complexity to have the configuration split in two parts and to have to separately supply a config file for SSL all the time.

Sure you can, docker config and docker secrets make these files available on all nodes.
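
As a rough sketch of what that could look like in a stack file: the original traefik_ssl_dashboard.yml is not shown in this thread, so the config and secret names, the local file paths and the mount target below are assumptions, and unrelated settings (entrypoints, ports, Docker socket) are left out.

```yaml
# Sketch only: names and paths are illustrative, not taken from the original stack file.
version: "3.8"

services:
  traefik:
    image: traefik:v2.4.8
    command:
      - --providers.docker.swarmMode=true
      # Dynamic configuration (the tls: section) is read from the mounted config
      - --providers.file.filename=/etc/traefik/dynamic/tls.yml
    configs:
      - source: traefik_tls_v1
        target: /etc/traefik/dynamic/tls.yml
    secrets:
      # Mounted by default at /run/secrets/<name> inside the container
      - wildcard1_crt
      - wildcard1_key
    deploy:
      placement:
        constraints:
          - node.role == manager

configs:
  traefik_tls_v1:
    file: ./tls.yml

secrets:
  wildcard1_crt:
    file: ./certs/wildcard1.crt
  wildcard1_key:
    file: ./certs/wildcard1.key
```

With this layout, certFile and keyFile in tls.yml would have to point at /run/secrets/wildcard1_crt and /run/secrets/wildcard1_key. Swarm configs and secrets are immutable, so a changed file needs a new name (hence the _v1 suffix), and Swarm then rolls it out to every node that runs the service.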
