Hi everyone.
I've recently changed some of my path-prefix routes to subdomains to keep cookies separate for security.
Right after the change I noticed that one of the new domains didn't receive an LE certificate and was instead returning the self-signed default certificate.
After looking at it I couldn't really find a configuration issue nor could I see any suspicious logs.
Since this is a testing environment I simply went ahead and deleted traefik's certificate storage to see if it only affects that one domain or if it affects all of them equally. After that none of the domains received certificates except one. After a bunch of tests I eventually ended up with no certificates being resolved at all. No useful logs hinting at any concrete problem, even when setting DEBUG as the log level.
A private key is generated and stored in the json file and I see in the debug logs that traefik receives the correct configuration from the docker (swarm mode) provider with the proper domains and certResolvers.
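For anyone who wants to check what actually ended up in the storage file, here is a minimal sketch that lists the domains with a stored certificate. It assumes the Traefik v2 storage layout as I understand it (a top-level key per resolver, holding `Account` and a `Certificates` list with `domain.main` and `domain.sans`); the function name `stored_domains` is made up for this example.

```python
import json


def stored_domains(storage_text, resolver):
    """Return the domains that have a certificate entry in the ACME storage JSON."""
    data = json.loads(storage_text)
    # Assumed layout: {resolver: {"Account": {...}, "Certificates": [...]}}.
    # "Certificates" may be missing or null when nothing was issued yet.
    certs = (data.get(resolver) or {}).get("Certificates") or []
    domains = []
    for cert in certs:
        domain = cert.get("domain") or {}
        if domain.get("main"):
            domains.append(domain["main"])
        domains.extend(domain.get("sans") or [])
    return domains
```

Called with the contents of the storage file and the resolver name, an empty list means only the account key was written and no certificate was ever stored.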
It feels like I've hit some form of LE limits, but I had that in the past and it usually resulted in some logs indicating the problem. I'm running traefik 2.8.7 and I'm slowly reaching the point where I have no more ideas of what I could try.
And just as I posted this I realized the problem: I also recently swapped out the self-signed default certificate for one that has a valid SAN for all the domains I was trying to get LE certificates for, except for that one domain that initially worked. I assume that one really is rate-limited by LE due to all the tests I did to get here.
Just as I fixed this mistake by generating a new default cert with an unrelated commonName, I remembered that I had this issue before. This is really unfortunate. Traefik should log which certificate it selects for which router, maybe even show it in the dashboard.
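To make the pitfall concrete, here is a rough sketch of the check that would have saved me the time: given the names on the default certificate and the hosts used in router rules, it shows which hosts are already covered (and so, going by what I observed, never get an ACME request). This is my own illustration, not Traefik's code; `san_matches` and `split_by_coverage` are hypothetical names and the hostnames are placeholders.

```python
def san_matches(pattern, host):
    """Match a hostname against one certificate name, allowing a leading '*.' wildcard."""
    pattern, host = pattern.lower().rstrip("."), host.lower().rstrip(".")
    if pattern.startswith("*."):
        # A wildcard covers exactly one extra label: *.example.com matches
        # a.example.com but neither example.com nor a.b.example.com.
        head, _, rest = host.partition(".")
        return bool(head) and rest == pattern[2:]
    return pattern == host


def split_by_coverage(cert_names, router_hosts):
    """Split router hosts into (covered, uncovered) relative to the certificate names."""
    covered, uncovered = [], []
    for host in router_hosts:
        is_covered = any(san_matches(name, host) for name in cert_names)
        (covered if is_covered else uncovered).append(host)
    return covered, uncovered


if __name__ == "__main__":
    names = ["app.example.com", "*.internal.example.com"]  # placeholder SANs
    hosts = ["app.example.com", "api.internal.example.com", "other.example.com"]
    print(split_by_coverage(names, hosts))
```

Any host in the first list is one where the default certificate would be served instead of an LE certificate.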
While traefik fetched certificates for all domains that previously didn't work, now the one domain that previously worked (because the default-cert was not valid for it) doesn't get a certificate. Again the configuration from the docker provider looks correct, but no errors or other obvious issues. I also switched the domain to a slightly different one to circumvent LE's duplicate-certificate rate limit of 5 per week, still nothing.
Giving support without seeing your configuration is hard. Make sure you enable debug logs; they usually include some information about ACME and routes.
Note that LE has various limits, even if you change the sub-domain you could still hit other limits like 'register' for your email. Have you tried to use LE staging?
How do you handle the LE challenge? If you use http, also enable access logs and check for failed /.well-known/acme-challenge entries.
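If the access log is in JSON format, the check for failed challenge requests can be scripted. The following is only a sketch: it assumes the field names `RequestPath`, `DownstreamStatus` and `RequestHost` from Traefik's JSON access log format, and `failed_challenge_requests` is a name invented for this example.

```python
import json

ACME_PREFIX = "/.well-known/acme-challenge/"


def failed_challenge_requests(log_lines):
    """Yield (host, path, status) for challenge requests that did not return 2xx."""
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        path = str(entry.get("RequestPath", ""))
        status = entry.get("DownstreamStatus", 0)  # a missing status counts as failed
        if path.startswith(ACME_PREFIX) and not 200 <= status < 300:
            yield entry.get("RequestHost"), path, status
```

Feeding it the lines of the access log gives one tuple per failed challenge request. This only applies to the HTTP challenge; with the TLS challenge there are no such requests to look for.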
When using Docker Swarm, do you only run a single Traefik instance?
Yeah, it's always tricky to decide what to share without exposing too much of your infrastructure.
TRAEFIK_ACCESSLOG=true
TRAEFIK_ACCESSLOG_FORMAT=json
TRAEFIK_API=true
TRAEFIK_API_DASHBOARD=true
TRAEFIK_API_DEBUG=false
TRAEFIK_CERTIFICATESRESOLVERS_tls=true
TRAEFIK_CERTIFICATESRESOLVERS_tls_ACME_CASERVER=https://acme-v02.api.letsencrypt.org/directory
TRAEFIK_CERTIFICATESRESOLVERS_tls_ACME_EMAIL=...
TRAEFIK_CERTIFICATESRESOLVERS_tls_ACME_STORAGE=/acme/tls.json
TRAEFIK_CERTIFICATESRESOLVERS_tls_ACME_TLSCHALLENGE=true
TRAEFIK_ENTRYPOINTS_traefik=true
TRAEFIK_ENTRYPOINTS_traefik_ADDRESS=:8080
TRAEFIK_ENTRYPOINTS_websecure=true
TRAEFIK_ENTRYPOINTS_websecure_ADDRESS=:443
TRAEFIK_GLOBAL_CHECKNEWVERSION=false
TRAEFIK_GLOBAL_SENDANONYMOUSUSAGE=false
TRAEFIK_LOG_FORMAT=json
TRAEFIK_LOG_LEVEL=DEBUG
TRAEFIK_METRICS_PROMETHEUS=true
TRAEFIK_PROVIDERS_DOCKER=true
TRAEFIK_PROVIDERS_DOCKER_EXPOSEDBYDEFAULT=false
TRAEFIK_PROVIDERS_DOCKER_SWARMMODE=true
TRAEFIK_PROVIDERS_DOCKER_WATCH=true
TRAEFIK_PROVIDERS_FILE_FILENAME=/etc/traefik/providers/file.yml
TRAEFIK_PROVIDERS_FILE_WATCH=false
I had debug enabled for days now. The only useful log I ever got out of it was when it logs that a resolver couldn't be found, which was me mistyping the resolver name. (I also think that
Yeah, I checked the limits. I don't think I was actually hitting any of them; I'd hope that traefik logs something when that happens.
I use TLS-ALPN-01, I guess that's part of the reason why logs are a little thin.
One traefik instance on a manager node.