Fetching certificates with DNS-01 is suddenly failing

Sometime in the past few weeks or months, I started getting errors like NS donovan.ns.cloudflare.com. returned SERVFAIL for _acme-challenge.dondo.xyz. It may well have started after something I changed: a new traefik image version (I hadn't been pinning them), migrating my containers to a new machine, or changes on my LAN. I couldn't say for sure which.

While acme is waiting for the DNS record to propagate, I can see the TXT record on my Cloudflare dashboard, I can resolve it from inside the container with dig, and all of the servers on https://dnschecker.org resolve it too. Once the propagation wait times out, however, acme reports that the Cloudflare nameserver returned an error.

I thought it could be an issue with the default DNS server, so I added well-known public DNS servers (1.1.1.1 and 8.8.8.8) to the docker service and to the acme DNS resolver list, but that didn't help.

In traefik's startup log:

2024-06-16T16:42:03-05:00 INF github.com/traefik/traefik/v3/pkg/provider/acme/provider.go:817 > Renewing certificate from LE : {Main:dondo.xyz SANs:[]} acmeCA=https://acme-v02.api.letsencrypt.org/directory providerName=letsencrypt.acme
2024-06-16T16:42:03-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [dondo.xyz] acme: Trying renewal with -2788 hours remaining lib=lego
2024-06-16T16:42:03-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [dondo.xyz] acme: Obtaining bundled SAN certificate lib=lego
2024-06-16T16:42:03-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [dondo.xyz] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/364851915477 lib=lego
2024-06-16T16:42:03-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [dondo.xyz] acme: Could not find solver for: tls-alpn-01 lib=lego
2024-06-16T16:42:03-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [dondo.xyz] acme: Could not find solver for: http-01 lib=lego
2024-06-16T16:42:03-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [dondo.xyz] acme: use dns-01 solver lib=lego
2024-06-16T16:42:03-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [dondo.xyz] acme: Preparing to solve DNS-01 lib=lego
2024-06-16T16:42:04-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] cloudflare: new record for dondo.xyz, ID 8f565c5fad14ab94028fd66647151c28 lib=lego
2024-06-16T16:42:04-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [dondo.xyz] acme: Trying to solve DNS-01 lib=lego
2024-06-16T16:42:04-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [dondo.xyz] acme: Checking DNS record propagation. [nameservers=1.1.1.1:53,8.8.8.8:53] lib=lego
2024-06-16T16:42:06-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] Wait for propagation [timeout: 2m0s, interval: 2s] lib=lego
2024-06-16T16:42:06-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [dondo.xyz] acme: Waiting for DNS record propagation. lib=lego
... many of these ...
2024-06-16T16:44:04-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [dondo.xyz] acme: Waiting for DNS record propagation. lib=lego
2024-06-16T16:44:06-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [dondo.xyz] acme: Cleaning DNS-01 challenge lib=lego
2024-06-16T16:44:07-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/364851915477 lib=lego
2024-06-16T16:44:07-05:00 ERR github.com/traefik/traefik/v3/pkg/provider/acme/provider.go:832 > Error renewing certificate from LE: {dondo.xyz []} error="error: one or more domains had a problem:\n[dondo.xyz] propagation: time limit exceeded: last error: NS donovan.ns.cloudflare.com. returned SERVFAIL for _acme-challenge.dondo.xyz.\n" acmeCA=https://acme-v02.api.letsencrypt.org/directory providerName=letsencrypt.acme

In docker-compose.yml:

  traefik:
    container_name: traefik
    image: traefik:3.0
    <<: [*logging, *restart]
    depends_on:
      - ddclient
      - authelia
    networks:
      - traefik-proxy
      - ldap
    dns:
      - 8.8.8.8
      - 1.1.1.1
    ports:
      - 80:80
      - 443:443
    environment:
      - TZ=America/Chicago
      - CF_API_EMAIL=${CF_API_EMAIL}
      - CF_DNS_API_TOKEN=${CF_DNS_API_TOKEN}
      - CF_ZONE_API_TOKEN=${CF_ZONE_API_TOKEN}
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./config/traefik/:/config
    command:
      - "--log.level=DEBUG"
      - "--api=true"
      - "--api.dashboard=true"
      - "--providers.docker.exposedByDefault=false"
      - "--providers.docker.network=root_traefik-proxy"
      - "--providers.file.directory=/config"
      - "--providers.file.watch=true"
      - "--entryPoints.web.address=:80"
      - "--entryPoints.websecure.address=:443"
      - "--certificatesResolvers.letsencrypt.acme.email=redacted"
      - "--certificatesresolvers.letsencrypt.acme.storage=/config/acme.json"
      - "--certificatesresolvers.letsencrypt.acme.dnschallenge=true"
      - "--certificatesresolvers.letsencrypt.acme.dnschallenge.resolvers=1.1.1.1:53,8.8.8.8:53"
      - "--certificatesresolvers.letsencrypt.acme.dnschallenge.provider=cloudflare"
      - "--certificatesresolvers.letsencrypt.acme.dnschallenge.delaybeforecheck=0"
      - "--serverstransport.insecureskipverify=true"
    labels:
      <<: *traefik
      traefik.http.routers.http-catchall.rule: hostregexp(`{subdomain:.*}.dondo.xyz`)
      traefik.http.routers.http-catchall.entrypoints: web
      traefik.http.routers.http-catchall.middlewares: secure-redirect@file
      traefik.http.routers.http-catchall.priority: 0
      traefik.http.routers.api.rule: Host(`traefik.dondo.xyz`)
      traefik.http.routers.api.entrypoints: websecure
      traefik.http.routers.api.tls: true
      traefik.http.routers.api.tls.certResolver: letsencrypt
      traefik.http.routers.api.service: api@internal
      traefik.http.routers.api.middlewares: chain-authelia@file

What should I try next? Thanks.

Set delaybeforecheck to a higher number and remove 8.8.8.8, which belongs to a different provider (Google, not Cloudflare).

Thanks, I updated my container per your suggestion and unfortunately I'm still getting the same error:

2024-06-18T19:10:32-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] cloudflare: new record for vivid.fish, ID 88e87240e055a734f1f238977506d9f8 lib=lego
2024-06-18T19:10:32-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [vivid.fish] acme: Trying to solve DNS-01 lib=lego
2024-06-18T19:10:32-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [vivid.fish] acme: Checking DNS record propagation. [nameservers=1.1.1.1:53] lib=lego
2024-06-18T19:10:34-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] Wait for propagation [timeout: 2m0s, interval: 2s] lib=lego
2024-06-18T19:10:34-05:00 DBG github.com/traefik/traefik/v3/pkg/provider/acme/provider.go:305 > Delaying 120000000000 rather than validating DNS propagation now. providerName=letsencrypt.acme
2024-06-18T19:12:34-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [vivid.fish] acme: Waiting for DNS record propagation. lib=lego
2024-06-18T19:12:36-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] [vivid.fish] acme: Cleaning DNS-01 challenge lib=lego
2024-06-18T19:12:37-05:00 DBG github.com/go-acme/lego/v4@v4.17.3/log/logger.go:48 > [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/365762166797 lib=lego
2024-06-18T19:12:37-05:00 ERR github.com/traefik/traefik/v3/pkg/provider/acme/provider.go:832 > Error renewing certificate from LE: {vivid.fish []} error="error: one or more domains had a problem:\n[vivid.fish] propagation: time limit exceeded: last error: NS donovan.ns.cloudflare.com. returned SERVFAIL for _acme-challenge.vivid.fish.\n" acmeCA=https://acme-v02.api.letsencrypt.org/directory providerName=letsencrypt.acme

In docker-compose.yml:

  x-logging: &logging
    logging:
      driver: "json-file"
      options:
        max-file: "5"
        max-size: "10m"
  
  x-restart: &restart
    restart: always

  x-traefik: &traefik
    traefik.enable: "true"

  traefik:
    container_name: traefik
    image: traefik:3.0
    <<: [*logging, *restart]
    depends_on:
      - ddclient
      - authelia
    networks:
      - traefik-proxy
      - ldap
    dns:
      - 1.1.1.1
    ports:
      - 80:80
      - 443:443
    environment:
      - TZ=America/Chicago
      - CF_API_EMAIL=${CF_API_EMAIL}
      - CF_DNS_API_TOKEN=${CF_DNS_API_TOKEN}
      - CF_ZONE_API_TOKEN=${CF_ZONE_API_TOKEN}
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./config/traefik/:/config
    command:
      - "--log.level=DEBUG"
      - "--api=true"
      - "--api.dashboard=true"
      - "--providers.docker.exposedByDefault=false"
      - "--providers.docker.network=root_traefik-proxy"
      - "--providers.file.directory=/config"
      - "--providers.file.watch=true"
      - "--entryPoints.web.address=:80"
      - "--entryPoints.websecure.address=:443"
      - "--certificatesResolvers.letsencrypt.acme.email=redacted"
      - "--certificatesresolvers.letsencrypt.acme.storage=/config/acme.json"
      - "--certificatesresolvers.letsencrypt.acme.dnschallenge=true"
      - "--certificatesresolvers.letsencrypt.acme.dnschallenge.resolvers=1.1.1.1:53"
      - "--certificatesresolvers.letsencrypt.acme.dnschallenge.provider=cloudflare"
      - "--certificatesresolvers.letsencrypt.acme.dnschallenge.delaybeforecheck=120"
      - "--serverstransport.insecureskipverify=true"
    labels:
      <<: *traefik
      traefik.http.routers.http-catchall.rule: hostregexp(`{subdomain:.*}.dondo.xyz`)
      traefik.http.routers.http-catchall.entrypoints: web
      traefik.http.routers.http-catchall.middlewares: secure-redirect@file
      traefik.http.routers.http-catchall.priority: 0
      traefik.http.routers.api.rule: Host(`traefik.dondo.xyz`)
      traefik.http.routers.api.entrypoints: websecure
      traefik.http.routers.api.tls: true
      traefik.http.routers.api.tls.certResolver: letsencrypt
      traefik.http.routers.api.service: api@internal
      traefik.http.routers.api.middlewares: chain-authelia@file

Do you really need the dnsChallenge?

Probably 80% of TLS issues are about dnsChallenge (mostly with Cloudflare). Maybe just try the simpler tlsChallenge.

Check the simple Traefik example.
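
For reference, here is a minimal sketch of what switching to tlsChallenge might look like in the command list posted above. It assumes the resolver keeps the name letsencrypt and that port 443 is reachable from the internet, which the TLS-ALPN-01 challenge needs. The comments are illustrative and not part of the original config.

```yaml
    command:
      # ... other flags unchanged ...
      - "--certificatesresolvers.letsencrypt.acme.email=redacted"
      - "--certificatesresolvers.letsencrypt.acme.storage=/config/acme.json"
      # Assumption: the dnschallenge.* flags are removed, so that
      # TLS-ALPN-01 is the only challenge configured on this resolver.
      - "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
```

Keep in mind that Let's Encrypt only issues wildcard certificates via DNS-01, so with tlsChallenge each hostname gets its own certificate.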

Note: you seem to inject further config with <<, but you never show what it is.

Thanks, I tried tlsChallenge and it worked. A wildcard certificate would be nice, but I probably don't need one. More than anything, after spending so much time investigating, I'd still like to know the root cause of the dnsChallenge failure.