Traefik stopped renewing certificates (During secondary validation Invalid Response 404)

Hi all,

I have a strange issue with renewing certificates. Creating certificates for new domains works, but renewing existing certificates has stopped working and I don't know why. I'm running Traefik 2.10 on a Docker Swarm with 3 manager nodes, with one Traefik instance on each; acme.json is shared across all 3 nodes via GlusterFS. Most of the domains are behind Cloudflare, some are not, but that makes no difference. The one exception is a container (domain behind Cloudflare) with this additional configuration, whose certificate is renewed regularly:

- "traefik.http.routers.node2.tls.domains[0].main=Redacted.TLD"
- "traefik.http.routers.node2.tls.domains[0].sans=*.Redacted.TLD" 

I don't know which specific change led to this issue. The only bigger change we made was updating Docker from 24 to 25, but some certs were still renewed after that.

I now get the following error:

Error renewing certificate from LE: {Redacted [Redacted]}" error="error: one or more domains had a problem:\n[Redacted] acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: During secondary validation: 2606:4700:3032::6815:3bb1: Invalid response from http://Redacted/.well-known/acme-challenge/5Rt5Xm024_p3n9zh9oPoHlQhucm1pcqGgMXwXasYGOE: 404\n" providerName=leresolver.acme ACME CA="https://acme-v02.api.letsencrypt.org/directory"

and lots of these lines:

time="2024-04-26T17:54:58Z" level=error msg="Cannot retrieve the ACME challenge for Redacted (token \"MsmFFcds4QXY3KveynCGr-Y6JfDgkQGWNLHpvRivwRc\")" providerName=acme

traefik.yaml

version: "3.3"

services:

  traefik:
    image: "traefik:v2.10"
    command:
      - "--log.level=DEBUG"
      - --api.dashboard=false
      - --entrypoints.websecure.http3
      - --experimental.http3=true
      - --certificatesresolvers.leresolver.acme.caserver=https://acme-v02.api.letsencrypt.org/directory
      - --certificatesresolvers.leresolver.acme.email=Redacted
      - --certificatesresolvers.leresolver.acme.storage=/le/acme.json
      - --certificatesresolvers.leresolver.acme.httpchallenge=true
      - --certificatesresolvers.leresolver.acme.httpchallenge.entrypoint=web
      - --entrypoints.web.address=:80
      - --entrypoints.websecure.address=:443
      - --providers.docker
      - --providers.docker.exposedbydefault=false
      - --providers.docker.swarmmode=true
      - --providers.docker.network=public
      - --providers.docker.watch
      - --entrypoints.web.proxyProtocol.trustedIPs=Redacted
      - --entrypoints.web.forwardedHeaders.trustedIPs=Redacted
      - --entrypoints.websecure.proxyProtocol.trustedIPs=Redacted
      - --entrypoints.websecure.forwardedHeaders.trustedIPs=Redacted
    ports:
      - "80:80"
      - "443:443/tcp"
      - "443:443/udp"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "/opt/mnt/traefik/acme.json:/le/acme.json"
    networks:
      - public
    deploy:
      mode: global
      placement:
        constraints: [node.role == manager]
      labels:
        - "traefik.enable=true"
networks:
  public:
    external: true

I have now changed the configuration so that Traefik runs on a single node only, but I still get the same error.

Sample config from one container:

deploy:
      labels:
        - 'traefik.enable=true'        
        - 'traefik.http.routers.node1.rule=Host(`${DOMAIN1}`) || Host(`${DOMAIN2}`)'
        - "traefik.http.routers.node1.service=node1"
        - "traefik.http.services.node1.loadbalancer.server.port=3000"
        - 'traefik.http.routers.node1.entrypoints=websecure'
        - "traefik.http.middlewares.node1.forwardauth.trustForwardHeader=true"
        - "traefik.http.routers.node1.tls=true"
        - 'traefik.http.routers.node1.tls.certresolver=leresolver'
      placement:
        constraints: [node.role == worker]
      replicas: 4
      mode: replicated
      update_config:
        parallelism: 2
        delay: 10s
        failure_action: rollback
        order: start-first
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s

acme.json is writable and Traefik is writing new certificates to the file. To make it more complicated, restarting Traefik sometimes helps to renew a certificate that had errors in the log and wasn't renewed before the restart.

I hope I have added all the necessary information.

Update:
I've investigated further and checked the certificates in acme.json: it contains new, updated certificates, but Traefik is not using them (even though Traefik itself wrote them).

Now I have an additional error in the log:

time="2024-04-27T08:10:02Z" level=debug msg="legolog: [INFO] [Redacted.TLD] acme: authorization already valid; skipping challenge"
time="2024-04-27T08:10:02Z" level=debug msg="legolog: [INFO] [*.Redacted.TLD] acme: Could not find solver for: dns-01"
time="2024-04-27T08:10:02Z" level=debug msg="legolog: [INFO] Skipping deactivating of valid auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/redacted"
time="2024-04-27T08:10:02Z" level=debug msg="legolog: [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/redacted"
time="2024-04-27T08:10:02Z" level=error msg="Unable to obtain ACME certificate for domains \"Redacted.TLD,*.Redacted.TLD\"" rule="Host(`sub.Redacted.TLD`)" ACME CA="https://acme-v02.api.letsencrypt.org/directory" providerName=leresolver.acme routerName=redacted@docker error="unable to generate a certificate for the domains [Redacted.TLD *.Redacted.TLD]: error: one or more domains had a problem:\n[*.Redacted.TLD] [*.Redacted.TLD] acme: could not determine solvers\n"

Help is highly appreciated.

Thanks

Traefik v2 does not support clustered/distributed LetsEncrypt, so you can’t have multiple instances run in parallel using LE. It’s a paid feature in Traefik EE v2/v3.

Even though you are using a shared acme.json, I don’t think this works.

Either reduce (temporarily) to a single Traefik instance, or change to dnsChallenge without a shared file, so that every instance gets its own certificate (up to 5 servers), or use a workaround by creating the certs externally (go-acme, certbot). This has been discussed here before.
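
For the first option, a minimal sketch of the deploy section, assuming the stack file from your post. The hostname manager1 is a placeholder for whichever node should run Traefik, not something taken from your setup:

```yaml
services:
  traefik:
    image: "traefik:v2.10"
    # ... command, ports, volumes and networks as in the original stack file ...
    deploy:
      # replicated with replicas: 1 instead of mode: global, so that only one
      # Traefik instance creates and answers ACME HTTP challenges
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.role == manager
          # hypothetical hostname; pins the instance to one specific node
          - node.hostname == manager1
      labels:
        - "traefik.enable=true"
```

With the ports published in the default ingress mode, the swarm routing mesh should still forward 80/443 from every node to that one instance, so DNS records pointing at the other managers can stay as they are.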

Or switch to k8s, where a component named cert-manager will handle LE.


Trying to explain the technical background:

Upon renewal, Traefik will create a token and send an HTTP request to the domain to see whether the domain points to Traefik correctly. I'm not sure whether the token is already persisted at that point, or whether that happens fast enough.

When using multiple Traefik instances, requests are usually distributed round-robin. So this request is highly likely to be received by an instance that is not aware of the token; I don’t think acme.json is watched. So you get the error.

Even if this retrieval is successful, the LE server is then notified to do the same. The chance that it hits the right instance again is like a lottery.

If only the second request fails, you even risk hitting the Let's Encrypt rate limit for failed validations (5 failures per hostname, per account, per hour).

Thanks a lot for the explanation. Unfortunately the DNS challenge is not usable for me: I don't have access to all of the DNS zones involved, as some of the domains just point to our servers. It's really sad that no HA is available with Traefik. Maybe I'll have to use some hacks with certbot to make it possible.

Just out of interest: how do you handle redeployments of Traefik or OS updates?

We run Traefik in Docker Swarm, behind a managed load balancer. Initially with paid TLS certs, now with externally created LE certs. You could run a single certbot instance to generate certs every 2 months.
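
A rough sketch of how externally created certificates can be loaded through Traefik's file provider. This is not our exact setup: example.com, the directory /etc/traefik/dynamic and the file name certs.yml are placeholders, and it assumes the certbot output directory is mounted into every Traefik container.

```yaml
# Dynamic configuration, e.g. /etc/traefik/dynamic/certs.yml (hypothetical path).
# Loaded by adding these flags to the Traefik command:
#   --providers.file.directory=/etc/traefik/dynamic
#   --providers.file.watch=true
tls:
  certificates:
    # example.com is a placeholder; the paths follow certbot's default "live" layout
    - certFile: /etc/letsencrypt/live/example.com/fullchain.pem
      keyFile: /etc/letsencrypt/live/example.com/privkey.pem
```

The routers then keep tls=true but drop the tls.certresolver label, and Traefik selects the matching certificate by SNI.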