Traefik stopped renewing certificates (During secondary validation Invalid Response 404)

doridina · April 26, 2024, 6:14pm

Hi all,

I have a strange issue with renewing certificates. Creating new certificates for new domains works, but renewing existing certificates has stopped working and I don't know why. I'm running Traefik 2.10 on a Docker Swarm; I had 3 manager nodes, each running one Traefik instance, with acme.json shared across all 3 nodes via GlusterFS. Most of the domains are behind Cloudflare, some are not, but that makes no difference. The one exception is a container (domain behind Cloudflare) with this additional configuration, whose certificate is renewed regularly:

- "traefik.http.routers.node2.tls.domains[0].main=Redacted.TLD"
- "traefik.http.routers.node2.tls.domains[0].sans=*.Redacted.TLD"

I don't know which specific change led to this issue. The only bigger change we had was the Docker update from 24 to 25, but some certs were still renewed after that.

I now get the following error:

Error renewing certificate from LE: {Redacted [Redacted]}" error="error: one or more domains had a problem:\n[Redacted] acme: error: 403 :: urn:ietf:params:acme:error:unauthorized :: During secondary validation: 2606:4700:3032::6815:3bb1: Invalid response from http://Redacted/.well-known/acme-challenge/5Rt5Xm024_p3n9zh9oPoHlQhucm1pcqGgMXwXasYGOE: 404\n" providerName=leresolver.acme ACME CA="https://acme-v02.api.letsencrypt.org/directory"

and lots of these lines:

time="2024-04-26T17:54:58Z" level=error msg="Cannot retrieve the ACME challenge for Redacted (token \"MsmFFcds4QXY3KveynCGr-Y6JfDgkQGWNLHpvRivwRc\")" providerName=acme

traefik.yaml

version: "3.3"

services:

  traefik:
    image: "traefik:v2.10"
    command:
      - "--log.level=DEBUG"
      - --api.dashboard=false
      - --entrypoints.websecure.http3
      - --experimental.http3=true
      - --certificatesresolvers.leresolver.acme.caserver=https://acme-v02.api.letsencrypt.org/directory
      - --certificatesresolvers.leresolver.acme.email=Redacted
      - --certificatesresolvers.leresolver.acme.storage=/le/acme.json
      - --certificatesresolvers.leresolver.acme.httpchallenge=true
      - --certificatesresolvers.leresolver.acme.httpchallenge.entrypoint=web
      - --entrypoints.web.address=:80
      - --entrypoints.websecure.address=:443
      - --providers.docker
      - --providers.docker.exposedbydefault=false
      - --providers.docker.swarmmode=true
      - --providers.docker.network=public
      - --providers.docker.watch
      - --entrypoints.web.proxyProtocol.trustedIPs=Redacted
      - --entrypoints.web.forwardedHeaders.trustedIPs=Redacted
      - --entrypoints.websecure.proxyProtocol.trustedIPs=Redacted
      - --entrypoints.websecure.forwardedHeaders.trustedIPs=Redacted
    ports:
      - "80:80"
      - "443:443/tcp"
      - "443:443/udp"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "/opt/mnt/traefik/acme.json:/le/acme.json"
    networks:
      - public
    deploy:
      mode: global
      placement:
        constraints: [node.role == manager]
      labels:
        - "traefik.enable=true"
networks:
  public:
    external: true

I have now changed the configuration so that traefik is running on one single node only, but still the same error.

Sample config from one container:

deploy:
      labels:
        - 'traefik.enable=true'
        - 'traefik.http.routers.node1.rule=Host(`${DOMAIN1}`) || Host(`${DOMAIN2}`)'
        - "traefik.http.routers.node1.service=node1"
        - "traefik.http.services.node1.loadbalancer.server.port=3000"
        - 'traefik.http.routers.node1.entrypoints=websecure'
        - "traefik.http.middlewares.node1.forwardauth.trustForwardHeader=true"
        - "traefik.http.routers.node1.tls=true"
        - 'traefik.http.routers.node1.tls.certresolver=leresolver'
      placement:
        constraints: [node.role == worker]
      replicas: 4
      mode: replicated
      update_config:
        parallelism: 2
        delay: 10s
        failure_action: rollback
        order: start-first
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s

acme.json is writable and Traefik is writing new certificates to the file. To make it more complicated, restarting Traefik sometimes helps to renew a certificate that had logged errors and had not been renewed before the restart.

I hope I have added all the necessary information.

Update:
I've investigated further and checked the certificates in acme.json: the file contains new, updated certificates, but Traefik is not using them (even though Traefik wrote them itself).

Now I have an additional error in the log:

time="2024-04-27T08:10:02Z" level=debug msg="legolog: [INFO] [Redacted.TLD] acme: authorization already valid; skipping challenge"
time="2024-04-27T08:10:02Z" level=debug msg="legolog: [INFO] [*.Redacted.TLD] acme: Could not find solver for: dns-01"
time="2024-04-27T08:10:02Z" level=debug msg="legolog: [INFO] Skipping deactivating of valid auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/redacted"
time="2024-04-27T08:10:02Z" level=debug msg="legolog: [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/redacted"
time="2024-04-27T08:10:02Z" level=error msg="Unable to obtain ACME certificate for domains \"Redacted.TLD,*.Redacted.TLD\"" rule="Host(`sub.Redacted.TLD`)" ACME CA="https://acme-v02.api.letsencrypt.org/directory" providerName=leresolver.acme routerName=redacted@docker error="unable to generate a certificate for the domains [Redacted.TLD *.Redacted.TLD]: error: one or more domains had a problem:\n[*.Redacted.TLD] [*.Redacted.TLD] acme: could not determine solvers\n"

Help is highly appreciated.

Thanks

bluepuma77 · April 28, 2024, 7:15am

Traefik v2 does not support clustered/distributed LetsEncrypt, so you can’t have multiple instances run in parallel using LE. It’s a paid feature in Traefik EE v2/v3.

Even though you are using a shared acme.json, I don’t think this works.

Either reduce (temporarily) to a single Traefik instance, or change to dnsChallenge without a shared file, so that every instance gets its own certificates (up to 5 servers), or use a workaround by creating the certs externally (go-acme, certbot). This has been discussed here before.
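A minimal sketch of the dnsChallenge option, based on the compose file posted above. The Cloudflare provider, the token placeholder, and the node-local storage path are assumptions for illustration; any DNS provider supported by Traefik's ACME client would follow the same pattern. DNS-01 is also the only challenge type Let's Encrypt accepts for wildcard names, which would explain the "Could not find solver for: dns-01" message for *.Redacted.TLD.

```yaml
services:
  traefik:
    image: "traefik:v2.10"
    environment:
      # Assumption: DNS is hosted at Cloudflare; placeholder token with DNS edit rights
      - CF_DNS_API_TOKEN=<your-api-token>
    command:
      - --certificatesresolvers.leresolver.acme.email=Redacted
      - --certificatesresolvers.leresolver.acme.storage=/le/acme.json
      # Replace the httpchallenge flags with the DNS challenge
      - --certificatesresolvers.leresolver.acme.dnschallenge=true
      - --certificatesresolvers.leresolver.acme.dnschallenge.provider=cloudflare
    volumes:
      # Hypothetical node-local path: each instance keeps its own acme.json
      - "/opt/traefik/acme.json:/le/acme.json"
```

With this setup no instance has to answer an HTTP request from Let's Encrypt, so it no longer matters which instance the load balancer picks.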

Or switch to k8s, where there is a component named cert-manager, which will handle LE.

Trying to explain the technical background:

Upon renewal, Traefik will create a token and send an HTTP request to the domain to check that the domain points to Traefik correctly. I'm not sure whether the token is persisted at that point, or whether that would even be fast enough.

When using multiple Traefik instances, requests are usually distributed round-robin. So this request is highly likely to be received by an instance that is not aware of the token; I don't think acme.json is watched. So you get the error.

Even if this retrieval succeeds, the LE server is then notified to do the same. The chance that it hits the right server again is like a lottery.

If only the second request fails, you even run the risk of hitting the limit of 5 (failed) requests per domain per week.
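To put a rough number on that lottery: the "During secondary validation" wording in the error reflects that Let's Encrypt checks the challenge from several network vantage points, not just one. As a back-of-the-envelope sketch, assume (this is a simplification, not something measured here) that only one of n instances knows the token and that each of k validation requests lands on a uniformly random instance, independently of the others:

```latex
P(\text{all } k \text{ requests succeed}) = \left(\frac{1}{n}\right)^{k},
\qquad
n = 3,\ k = 4:\quad \left(\frac{1}{3}\right)^{4} = \frac{1}{81} \approx 1.2\%
```

The values n = 3 (the three manager nodes) and k = 4 are illustrative only, and real round-robin balancing is not independent per request, but the estimate shows why a renewal would succeed only occasionally, for example right after a restart.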

doridina · April 29, 2024, 7:16pm

Thanks a lot for the explanation. The DNS challenge is unfortunately not usable: I don't have access to all of the DNS zones involved, since some of the domains just point to our servers. It's really sad that no HA is available with Traefik. Maybe I'll have to use some hacks with certbot to make it possible.

Just out of interest: how do you handle redeployments of Traefik or OS updates?

bluepuma77 · April 29, 2024, 8:40pm

We run Traefik in Docker Swarm, behind a managed load balancer. Initially with paid TLS certs, now with externally created LE certs. You could run a single certbot instance to generate certs every 2 months.
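A minimal sketch of how externally created certs can be handed to Traefik through the file provider. The directory names and example.com are hypothetical, and it assumes certbot's /etc/letsencrypt directory is mounted into every Traefik instance as /certs (the whole directory, because the files under live/ are symlinks):

```yaml
# Static configuration, added to the Traefik service command:
#   - --providers.file.directory=/dynamic
#   - --providers.file.watch=true
#
# Dynamic configuration file, e.g. /dynamic/certs.yml (hypothetical path):
tls:
  certificates:
    # One entry per certificate; Traefik selects the match by SNI
    - certFile: /certs/live/example.com/fullchain.pem
      keyFile: /certs/live/example.com/privkey.pem
```

The routers then keep tls=true but drop the tls.certresolver label, so no Traefik instance talks to Let's Encrypt itself and all instances can serve the same certificates.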
