Traefik cannot resolve wildcard cert for subdomain due to failure to find zone. Can resolve certificate with lego directly

Hi folks, I've been working on getting traefik v2 stood up. In doing so, I'm trying to configure traefik to automatically resolve certificates for my domain. I have been able to successfully resolve a wildcard certificate for my primary domain, but I have not been able to get traefik to resolve a wildcard for a subdomain. Below are the logs showing the error.

time="2024-02-09T22:36:31Z" level=debug msg="Building ACME client..." providerName=cloudflare.acme
time="2024-02-09T22:36:31Z" level=debug msg="https://acme-v02.api.letsencrypt.org/directory" providerName=cloudflare.acme
time="2024-02-09T22:36:31Z" level=debug msg="Using DNS Challenge provider: cloudflare" providerName=cloudflare.acme
time="2024-02-09T22:36:31Z" level=debug msg="legolog: [INFO] [local.michaelcook.dev, *.local.michaeldook.dev] acme: Obtaining bundled SAN certificate"
time="2024-02-09T22:36:32Z" level=debug msg="legolog: [INFO] [*.local.michaeldook.dev] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/313211891847"
time="2024-02-09T22:36:32Z" level=debug msg="legolog: [INFO] [local.michaelcook.dev] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/313476094357"
time="2024-02-09T22:36:32Z" level=debug msg="legolog: [INFO] [local.michaelcook.dev] acme: authorization already valid; skipping challenge"
time="2024-02-09T22:36:32Z" level=debug msg="legolog: [INFO] [*.local.michaeldook.dev] acme: use dns-01 solver"
time="2024-02-09T22:36:32Z" level=debug msg="legolog: [INFO] [*.local.michaeldook.dev] acme: Preparing to solve DNS-01"
time="2024-02-09T22:36:33Z" level=debug msg="legolog: [INFO] [*.local.michaeldook.dev] acme: Cleaning DNS-01 challenge"
time="2024-02-09T22:36:33Z" level=debug msg="legolog: [WARN] [*.local.michaeldook.dev] acme: cleaning up failed: cloudflare: failed to find zone dev.: zone could not be found "
time="2024-02-09T22:36:33Z" level=debug msg="legolog: [INFO] Skipping deactivating of valid auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/313211891847"
time="2024-02-09T22:36:33Z" level=debug msg="legolog: [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/313476094357"
time="2024-02-09T22:36:33Z" level=error msg="Unable to obtain ACME certificate for domains \"local.michaelcook.dev,*.local.michaeldook.dev\"" providerName=cloudflare.acme error="unable to generate a certificate for the domains [*.local.michaeldook.dev]: error: one or more domains had a problem:\n[*.local.michaeldook.dev] [*.local.michaeldook.dev] acme: error presenting token: cloudflare: failed to find zone dev.: zone could not be found\n" routerName=traefik-secure@docker rule="Host(`traefik.local.michaelcook.dev`)" ACME CA="https://acme-v02.api.letsencrypt.org/directory"

In trying to debug, I decided to try resolving the same wildcard certificate using the lego ACME client directly, and that was successful! Below are the lego logs, including the command I executed.

lego --dns cloudflare --domains "local.michaelcook.dev,*.local.michaelcook.dev" --email mcook4728@gmail.com --dns.resolvers="1.1.1.1:53,1.0.0.1:53" run

2024/02/09 23:55:13 [INFO] [local.michaelcook.dev, *.local.michaelcook.dev] acme: Obtaining bundled SAN certificate
2024/02/09 23:55:14 [INFO] [*.local.michaelcook.dev] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/313496906427
2024/02/09 23:55:14 [INFO] [local.michaelcook.dev] AuthURL: https://acme-v02.api.letsencrypt.org/acme/authz-v3/313496906437
2024/02/09 23:55:14 [INFO] [*.local.michaelcook.dev] acme: use dns-01 solver
2024/02/09 23:55:14 [INFO] [local.michaelcook.dev] acme: Could not find solver for: tls-alpn-01
2024/02/09 23:55:14 [INFO] [local.michaelcook.dev] acme: Could not find solver for: http-01
2024/02/09 23:55:14 [INFO] [local.michaelcook.dev] acme: use dns-01 solver
2024/02/09 23:55:14 [INFO] [*.local.michaelcook.dev] acme: Preparing to solve DNS-01
2024/02/09 23:55:14 [INFO] cloudflare: new record for local.michaelcook.dev, ID ff3b4c973c1c8ab45344a8582bf97832
2024/02/09 23:55:14 [INFO] [local.michaelcook.dev] acme: Preparing to solve DNS-01
2024/02/09 23:55:14 [INFO] cloudflare: new record for local.michaelcook.dev, ID db6eb0d570ddf96b6d3a485f77209a99
2024/02/09 23:55:14 [INFO] [*.local.michaelcook.dev] acme: Trying to solve DNS-01
2024/02/09 23:55:14 [INFO] [*.local.michaelcook.dev] acme: Checking DNS record propagation using [1.1.1.1:53 1.0.0.1:53]
2024/02/09 23:55:16 [INFO] Wait for propagation [timeout: 2m0s, interval: 2s]
2024/02/09 23:55:21 [INFO] [*.local.michaelcook.dev] The server validated our request
2024/02/09 23:55:21 [INFO] [local.michaelcook.dev] acme: Trying to solve DNS-01
2024/02/09 23:55:21 [INFO] [local.michaelcook.dev] acme: Checking DNS record propagation using [1.1.1.1:53 1.0.0.1:53]
2024/02/09 23:55:23 [INFO] Wait for propagation [timeout: 2m0s, interval: 2s]
2024/02/09 23:55:23 [INFO] [local.michaelcook.dev] acme: Waiting for DNS record propagation.
2024/02/09 23:55:25 [INFO] [local.michaelcook.dev] acme: Waiting for DNS record propagation.
2024/02/09 23:55:27 [INFO] [local.michaelcook.dev] acme: Waiting for DNS record propagation.
2024/02/09 23:55:29 [INFO] [local.michaelcook.dev] acme: Waiting for DNS record propagation.
2024/02/09 23:55:31 [INFO] [local.michaelcook.dev] acme: Waiting for DNS record propagation.
2024/02/09 23:55:33 [INFO] [local.michaelcook.dev] acme: Waiting for DNS record propagation.
2024/02/09 23:55:41 [INFO] [local.michaelcook.dev] The server validated our request
2024/02/09 23:55:41 [INFO] [*.local.michaelcook.dev] acme: Cleaning DNS-01 challenge
2024/02/09 23:55:41 [INFO] [local.michaelcook.dev] acme: Cleaning DNS-01 challenge
2024/02/09 23:55:42 [INFO] [local.michaelcook.dev, *.local.michaelcook.dev] acme: Validations succeeded; requesting certificates
2024/02/09 23:55:43 [INFO] [local.michaelcook.dev] Server responded with a certificate.

Ignore that the lego logs contain logs for resolving the cert for primary domain.

My traefik config and docker compose can be found here.

I've included a screenshot of my cloudflare DNS configuration:

I've tried a lot of things, but the major contention point is obviously that lego works directly but it doesn't work through traefik. The fact that I'm able to resolve certificates for the primary domain via traefik also asserts that my cloudflare API token is correct (and yes, it has access to all zones).

Unfortunately, traefik/lego just don't have enough logging for me to know exactly where there's a difference. I would really appreciate if someone can point me in the right direction.

Hello,

I've tried a lot of things, but the major contention point is obviously that lego works directly but it doesn't work through traefik.

I think the difference is not traefik vs lego (because Traefik just uses lego) but container vs local.

The zone is detected by iterative SOA calls:

  1. check SOA on local.michaelcook.dev.
  2. if the answer is NXDOMAIN, check SOA on michaelcook.dev.
  3. if the answer is NXDOMAIN, check SOA on dev.
  4. if the answer is SUCCESS, use dev. as the zone
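
The steps above can be sketched roughly as follows. This is only an illustration, not lego's actual code (lego is written in Go and also handles CNAMEs, caching, and error cases). The names `find_zone`, `SoaLookup`, and `observed_lookup` are hypothetical, and the lookup is a stub fed with the rcodes and answer counts from the drill output posted further down in this thread instead of real DNS queries. Skipping a NOERROR reply that has no SOA record in its answer section is an assumption based on that same output.

```python
from typing import Callable, Tuple

# Hypothetical lookup signature: fqdn -> (rcode, number of SOA records in the ANSWER section).
SoaLookup = Callable[[str], Tuple[str, int]]


def find_zone(fqdn: str, lookup: SoaLookup) -> str:
    """Walk up the labels of fqdn until an SOA query is answered with an SOA record."""
    name = fqdn if fqdn.endswith(".") else fqdn + "."
    labels = name.rstrip(".").split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:]) + "."
        rcode, soa_answers = lookup(candidate)
        if rcode == "NOERROR" and soa_answers > 0:
            return candidate  # first name that really is a zone apex
        if rcode not in ("NOERROR", "NXDOMAIN"):
            raise RuntimeError(f"unexpected rcode {rcode} for {candidate}")
        # NXDOMAIN (or NOERROR without an SOA answer): try the parent name.
    raise LookupError(f"no zone found for {fqdn}")


# Results copied from the drill output in this thread (queries against 1.1.1.1).
OBSERVED = {
    "local.michaelcook.dev.": ("NOERROR", 0),
    "michaelcook.dev.": ("NOERROR", 1),
    "local.michaeldook.dev.": ("NXDOMAIN", 0),
    "michaeldook.dev.": ("NXDOMAIN", 0),
    "dev.": ("NOERROR", 1),
}


def observed_lookup(name: str) -> Tuple[str, int]:
    # Names that were not queried in the thread are treated as NXDOMAIN (assumption).
    return OBSERVED.get(name, ("NXDOMAIN", 0))


# The DNS-01 record for *.local.<domain> lives at _acme-challenge.local.<domain>.
print(find_zone("_acme-challenge.local.michaelcook.dev", observed_lookup))  # michaelcook.dev.
print(find_zone("_acme-challenge.local.michaeldook.dev", observed_lookup))  # dev.
```

With these inputs, a name that does not exist anywhere below the TLD falls all the way through to dev., which is the zone named in the Cloudflare error.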

It feels like there is something related to DNS when you are using a container or maybe it was just a timing problem related to the DNS propagation of the creation of your domains.
So you have to check your network, and maybe try traefik as a binary, outside a container.

I figured a more apt test would be to install lego 4.15.0 (identical version as used in traefik:latest image) cli inside of the docker container that traefik is running that I reported the logs from. Lego was able to successfully resolve wildcards for both the domain and subdomain. The commands used were identical to the one posted above, the only difference being the domains in the command used to resolve certs for the primary domain.

I can continue to try using traefik outside of a container, but it's certainly very unclear to me why Lego is successful and traefik isn't when they're both running inside the same container, and thus the networking should be the same.

lego 4.15.0 (identical version as used in traefik:latest image)

traefik:latest is currently traefik:v2.10.7 and this version uses lego v4.14.0.

But this is not a problem because the 2 versions are very close.

I recommend never using latest because latest is a floating reference, that can change to a new major version, and major versions are breaking.

Otherwise, I recommend not disabling the propagation check (disablePropagationCheck).

FYI I'm the main maintainer of Lego, and one of the core maintainers of Traefik.

The only way that can produce this error cloudflare: failed to find zone dev.: zone could not be found is that the result of zone detection is dev. (based on the algo I already explained).

So the problem is 100% related to DNS.
It can be the configuration of the resolvers, but your configuration seems right. As I don't have the related logs, I cannot know whether this configuration is the one you are actually using.
Another possibility is something that answers the DNS calls instead of the expected resolvers. It can be a local DNS or a firewall, but if you are running lego inside the same container as Traefik, I think the problem is related to resolvers.
Can you try lego without the --dns.resolvers?

this is just a homelab, I'll pin to a version once I actually get things working.

All of my recent attempts were without disablePropagationCheck enabled. That was the one config change that hadn't been pushed to my remote.

I've disabled my internal DNS (pihole) for the further tests I've done. Lego was successful without DNS resolvers being explicitly provided in the command. Note that I was now running against the Let's Encrypt staging instance. Lego did have a number of timeouts as well as DNS challenge failures due to supposedly incorrect TXT records. This isn't anything I haven't seen before, but it did seem to be happening more. I decided to try traefik without the DNS resolvers specified as well, and I see the identical behavior as before.

If you're confident that it's DNS, I'm probably going to try and get some wireshark traces to see what's going on and where lego and traefik differ. One quick question, if you don't mind: in cloudflare, I have my local subdomain configured as a CNAME record pointing to my primary domain. Could this be abnormal and part of the issue? I suspect the subdomain record being a CNAME results in it having its own SOA record. It wouldn't explain the difference between lego and traefik, but I'm curious regardless. Alternatively, do I even need a record for the subdomain? Is proof of ownership of the primary domain sufficient for Let's Encrypt to grant a certificate for any subdomain?

Really appreciate your support btw.

CNAME has an effect but lego and Traefik use the same approach (because the behavior is defined by lego), so it's not a part of this issue.

You can disable CNAME support with an env var: LEGO_DISABLE_CNAME_SUPPORT=true

Yes, you need a record for the subdomain because, for example, a hosting provider can own the apex but let users handle the subdomains.
So being the owner of the apex is not enough.

I got some wireshark traces for the DNS queries that traefik is issuing both in and outside of the container. You were certainly correct, something weird is going on with the DNS.

While outside of the container, the replies to DNS queries issued to 1.1.1.1 and 1.0.0.1 indicate that the primary authoritative name server is one of cloudflare's nameservers (gemma and kyrie). I've attached a screenshot of this capture below.

When running inside of the container, traefik is still issuing queries to 1.1.1.1 and 1.0.0.1; however, the authoritative name server on the initial replies is Google's registry (Charleston road registry). It does start eventually getting replies from the correct authority, but not till after traefik has issued an SOA query for the .dev tld. I assume at that point it's beyond recovery. Again, a screenshot is attached.

I could genuinely not have less of an idea of what's causing this. I checked the /etc/resolv.conf on the host and in the container to ensure it wasn't configured with google's name servers. I know the DNS on my proxmox server is configured to point to my gateway.

It's surprising, maybe it's related to a kind of DNS cache or a timing problem. :thinking:

I don't know if it's related to Docker, Proxmox, or something else.
But based on your frames analysis, Traefik does the right calls but something interacts with the call.
It's like there is a kind of fallback on the first calls.
I don't understand why specifically the Charleston Road registry is used: neither Traefik nor lego uses it (even by default), and the Docker DNS default is on 8.8.8.8.
There is something inside your network stack that does that but I don't know which element.

I think I found where ns-tld1.charlestonroadregistry.com. comes from.
I need to investigate more.

This is just the authoritative name server of the TLD .dev:

$ drill SOA exmaple.dev
;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 30793
;; flags: qr rd ra ; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0 
;; QUESTION SECTION:
;; exmaple.dev. IN      SOA

;; ANSWER SECTION:

;; AUTHORITY SECTION:
dev.    300     IN      SOA     ns-tld1.charlestonroadregistry.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300

;; ADDITIONAL SECTION:

;; Query time: 17 msec
;; SERVER: 192.168.1.1
;; WHEN: Thu Feb 15 15:26:00 2024
;; MSG SIZE  rcvd: 127
$ drill @1.1.1.1 SOA local.michaeldook.dev
;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 51282
;; flags: qr rd ra ; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;; local.michaeldook.dev.       IN      SOA

;; ANSWER SECTION:

;; AUTHORITY SECTION:
dev.    300     IN      SOA     ns-tld1.charlestonroadregistry.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300

;; ADDITIONAL SECTION:

;; Query time: 21 msec
;; SERVER: 1.1.1.1
;; WHEN: Thu Feb 15 15:28:55 2024
;; MSG SIZE  rcvd: 137
$ drill @1.1.1.1 SOA michaeldook.dev
;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 52702
;; flags: qr rd ra ; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;; michaeldook.dev.     IN      SOA

;; ANSWER SECTION:

;; AUTHORITY SECTION:
dev.    300     IN      SOA     ns-tld1.charlestonroadregistry.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300

;; ADDITIONAL SECTION:

;; Query time: 22 msec
;; SERVER: 1.1.1.1
;; WHEN: Thu Feb 15 15:29:06 2024
;; MSG SIZE  rcvd: 131

Based on this:

$ drill @1.1.1.1 SOA local.michaeldook.dev
;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 24686
;; flags: qr rd ra ; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0 
;; QUESTION SECTION:
;; local.michaeldook.dev.       IN      SOA

;; ANSWER SECTION:

;; AUTHORITY SECTION:
dev.    300     IN      SOA     ns-tld1.charlestonroadregistry.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300

;; ADDITIONAL SECTION:

;; Query time: 16 msec
;; SERVER: 1.1.1.1
;; WHEN: Thu Feb 15 15:33:19 2024
;; MSG SIZE  rcvd: 137
$ drill @1.1.1.1 SOA michaeldook.dev
;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 52702
;; flags: qr rd ra ; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;; michaeldook.dev.     IN      SOA

;; ANSWER SECTION:

;; AUTHORITY SECTION:
dev.    300     IN      SOA     ns-tld1.charlestonroadregistry.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300

;; ADDITIONAL SECTION:

;; Query time: 22 msec
;; SERVER: 1.1.1.1
;; WHEN: Thu Feb 15 15:29:06 2024
;; MSG SIZE  rcvd: 131
$ drill @1.1.1.1 SOA dev.      
;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 42998
;; flags: qr rd ra ; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0 
;; QUESTION SECTION:
;; dev. IN      SOA

;; ANSWER SECTION:
dev.    21600   IN      SOA     ns-tld1.charlestonroadregistry.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300

;; AUTHORITY SECTION:

;; ADDITIONAL SECTION:

;; Query time: 15 msec
;; SERVER: 1.1.1.1
;; WHEN: Thu Feb 15 15:32:32 2024
;; MSG SIZE  rcvd: 119

The answers are exactly what I explained in my first answer:

  • local.michaeldook.dev. -> NXDOMAIN
  • michaeldook.dev. -> NXDOMAIN
  • dev. -> NOERROR

So the problem is not related to cache or something that interacts with your DNS but with the creation/existence of your domain.

How do you create your domain/zone inside Cloudflare?

Note: NXDOMAIN = Non eXistent DOMAIN
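
For anyone reading along, here is a small sketch of how the three relevant fields can be pulled out of a single drill response: the rcode, the number of answers, and the owner name of the SOA record that came back. The function name `summarize_drill` is hypothetical, the regular expressions are only written against the output format pasted above, and the sample is copied from one of those responses.

```python
import re
from typing import Optional, Tuple


def summarize_drill(output: str) -> Tuple[str, int, Optional[str]]:
    """Return (rcode, answer_count, soa_owner) for one drill response."""
    rcode = re.search(r"rcode: (\w+)", output).group(1)
    answers = int(re.search(r"ANSWER: (\d+)", output).group(1))
    # A record line looks like "<owner> <ttl> IN SOA ..."; question lines start with ";;".
    soa = re.search(r"^(\S+)\s+\d+\s+IN\s+SOA\s", output, re.MULTILINE)
    return rcode, answers, soa.group(1) if soa else None


SAMPLE = """\
;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 51282
;; flags: qr rd ra ; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;; local.michaeldook.dev.       IN      SOA

;; ANSWER SECTION:

;; AUTHORITY SECTION:
dev.    300     IN      SOA     ns-tld1.charlestonroadregistry.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300
"""

print(summarize_drill(SAMPLE))  # ('NXDOMAIN', 0, 'dev.')
```

An NXDOMAIN reply whose SOA owner is the bare TLD is a strong hint that nothing is registered under the queried name at all.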

:thinking: I used one of your Wireshark screenshots to get the domain, so I used michaelDook, but it seems that the real thing is michaelCook.

I think there is a typo somewhere...

michaelCook
$ drill @1.1.1.1 SOA local.michaelcook.dev
;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 26828
;; flags: qr rd ra ; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0 
;; QUESTION SECTION:
;; local.michaelcook.dev.       IN      SOA

;; ANSWER SECTION:

;; AUTHORITY SECTION:
michaelcook.dev.        1800    IN      SOA     gemma.ns.cloudflare.com. dns.cloudflare.com. 2333453539 10000 2400 604800 1800

;; ADDITIONAL SECTION:

;; Query time: 25 msec
;; SERVER: 1.1.1.1
;; WHEN: Thu Feb 15 15:41:00 2024
;; MSG SIZE  rcvd: 102

$ drill @1.1.1.1 SOA michaelcook.dev 
;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 41298
;; flags: qr rd ra ; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0 
;; QUESTION SECTION:
;; michaelcook.dev.     IN      SOA

;; ANSWER SECTION:
michaelcook.dev.        1800    IN      SOA     gemma.ns.cloudflare.com. dns.cloudflare.com. 2333453539 10000 2400 604800 1800

;; AUTHORITY SECTION:

;; ADDITIONAL SECTION:

;; Query time: 18 msec
;; SERVER: 1.1.1.1
;; WHEN: Thu Feb 15 15:41:08 2024
;; MSG SIZE  rcvd: 96
michaelDook
$ drill @1.1.1.1 SOA local.michaeldook.dev
;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 25764
;; flags: qr rd ra ; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0 
;; QUESTION SECTION:
;; local.michaeldook.dev.       IN      SOA

;; ANSWER SECTION:

;; AUTHORITY SECTION:
dev.    300     IN      SOA     ns-tld1.charlestonroadregistry.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300

;; ADDITIONAL SECTION:

;; Query time: 18 msec
;; SERVER: 1.1.1.1
;; WHEN: Thu Feb 15 15:42:42 2024
;; MSG SIZE  rcvd: 137

$ drill @1.1.1.1 SOA michaeldook.dev 
;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 61911
;; flags: qr rd ra ; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0 
;; QUESTION SECTION:
;; michaeldook.dev.     IN      SOA

;; ANSWER SECTION:

;; AUTHORITY SECTION:
dev.    300     IN      SOA     ns-tld1.charlestonroadregistry.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300

;; ADDITIONAL SECTION:

;; Query time: 16 msec
;; SERVER: 1.1.1.1
;; WHEN: Thu Feb 15 15:42:48 2024
;; MSG SIZE  rcvd: 131

$ drill @1.1.1.1 SOA dev.      
;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 11398
;; flags: qr rd ra ; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0 
;; QUESTION SECTION:
;; dev. IN      SOA

;; ANSWER SECTION:
dev.    21600   IN      SOA     ns-tld1.charlestonroadregistry.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300

;; AUTHORITY SECTION:

;; ADDITIONAL SECTION:

;; Query time: 16 msec
;; SERVER: 1.1.1.1
;; WHEN: Thu Feb 15 15:42:55 2024
;; MSG SIZE  rcvd: 119

The typo is here: homelab/pve1/infrastructure/traefik/docker-compose.yaml at ec739ec2856903a697a468e4768a2785067e6af9 · MikeCook9994/homelab · GitHub

I missed that but it was inside your logs:

I was right, the problem was related to a DNS thing, a failure of the zone detection: the domain did not exist :smile_cat:

I created a PR to fix that: fix: domain typo by ldez · Pull Request #1 · MikeCook9994/homelab · GitHub

I want to crawl into a hole....I've spent hours looking at these logs and I never once noticed that I had a typo in the labels. Of course it worked everywhere else I tried it because I spelled the domain correctly.
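
For future readers who hit the same error, a check along these lines might catch this kind of typo before any DNS debugging starts. It is only a sketch under my own assumptions: `domains_outside_zone` is a hypothetical helper, not something traefik or lego provides, and it simply tests whether every requested name sits under the zone you expect.

```python
from typing import Iterable, List


def domains_outside_zone(domains: Iterable[str], zone: str) -> List[str]:
    """Return the requested domains that are not equal to, or below, the given zone."""
    zone = zone.lower().rstrip(".")
    outside = []
    for domain in domains:
        name = domain.strip().lower().rstrip(".")
        if name.startswith("*."):
            name = name[2:]  # a wildcard belongs to the zone of its parent name
        if name != zone and not name.endswith("." + zone):
            outside.append(domain.strip())
    return outside


# The domain list exactly as it appears in the traefik error message above.
requested = "local.michaelcook.dev,*.local.michaeldook.dev".split(",")
print(domains_outside_zone(requested, "michaelcook.dev"))  # ['*.local.michaeldook.dev']
```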

I merged the PR and sent a tip your way mostly to cover my shame but also for your time.

Thank you for the tip :heart:

Missing a subtle one-letter typo can happen, no shame for that.

Together we carried out an "extreme" and "deep" analysis just to find ... a typo :smile:
