Let's Encrypt DNS challenge via Traefik with Docker Compose (problem only on Arch)

Hello,
I am trying to get Let's Encrypt certificates via the DNS challenge, using Traefik with Docker Compose. I started with the official snippet:

I am using Cloudflare, so I swapped the env variables, but other than that I have confirmed this script works 100% on a fresh Ubuntu Server install. On a fresh Arch install I get this error, which doesn't help (example.com is just a placeholder):

level=error msg="Unable to obtain ACME certificate for domains \"example.com\": unable to generate a certificate for the domains [example.com]: error: one or more domains had a problem:\n[example.com] [example.com] acme: error presenting token: cloudflare: unexpected response code 'SERVFAIL' for _acme-challenge.example.com.\n" ACME CA="https://acme-staging-v02.api.letsencrypt.org/directory" routerName=whoami@docker rule="Host(`example.com`)" providerName=myresolver.acme

This problem should be easy to reproduce: you just need a domain and a fresh Arch install.
Things I have tried/checked:

  • the Docker container has internet connectivity and resolves DNS names just fine
  • the Cloudflare API key and email are 100% correct
  • I can generate a cert via the DNS challenge using certbot on the host just fine, with the same credentials
  • I tried overriding Docker's DNS via an /etc/docker/daemon.json entry -> no change
  • I added the recommended kernel tweaks via sysctl (net.ipv4.ip_forward = 1 and net.ipv4.ip_nonlocal_bind = 1) -> no change
  • iptables is disabled
  • like I said, this works on a clean Ubuntu Server install where I just installed Docker and Docker Compose, with no kernel tweaks and no messing with daemon.json

I don't really know where to take it from here. I suspected a DNS issue, but everything seems to work, and both Arch and Ubuntu use systemd-resolved. Any ideas?

This is my docker-compose.yml. On Ubuntu Server it just works. Both Host(`example.com`) and the Let's Encrypt email are swapped for real values in my actual config:

version: "3.3"

services:

  traefik:
    image: "traefik:v2.10"
    container_name: "traefik"
    command:
      #- "--log.level=DEBUG"
      - "--api.insecure=true"
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--certificatesresolvers.myresolver.acme.dnschallenge=true"
      - "--certificatesresolvers.myresolver.acme.dnschallenge.provider=cloudflare"
      - "--certificatesresolvers.myresolver.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory"
      - "--certificatesresolvers.myresolver.acme.email=email@gmail.com"
      - "--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json"
    ports:
      - "80:80"
      - "443:443"
      - "8080:8080"
    environment:
      - CLOUDFLARE_EMAIL=${cloudflare_email}
      - CLOUDFLARE_DNS_API_TOKEN=${cloudflare_dns_api_token}
    volumes:
      - "./letsencrypt:/letsencrypt"
      - "/var/run/docker.sock:/var/run/docker.sock:ro"

  whoami:
    image: "traefik/whoami"
    container_name: "simple-service"
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.whoami.rule=Host(`example.com`)"
      - "traefik.http.routers.whoami.entrypoints=websecure"
      - "traefik.http.routers.whoami.tls.certresolver=myresolver"

Maybe Arch is more restrictive and does not open ports as easily as Ubuntu. Have you tried accessing ports 80 and 443 on your server from outside? Are they reachable?

Those ports are needed for the HTTP challenge, while I am using the DNS challenge. With the DNS challenge the host can be behind NAT, and there is no need to open port 80 or 443. For the record, I have tried this both on a cloud VPS with a public IP and on local hardware with no public IP. Same result.

The SERVFAIL error comes from a DNS query, specifically the SOA query used to find the "zone".

This means that something is intercepting the DNS call locally.

It's not related to ports 80 and 443.

FYI, I'm a Manjaro user (I know Manjaro is not Arch but it's based on Arch) and I don't have any problem.
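One way to look for a local interceptor is to compare the resolvers that the host and the container actually use. The following is only a minimal sketch: `list_nameservers` is a hypothetical helper name, and it assumes a resolv.conf-style file with standard `nameserver` lines.

```shell
#!/bin/sh
# Print the nameserver addresses listed in a resolv.conf-style file.
# Defaults to /etc/resolv.conf when no path is given.
# (list_nameservers is a hypothetical helper name, not part of any tool.)
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

On the host, `list_nameservers /etc/resolv.conf` shows what local programs use; with systemd-resolved that is usually the 127.0.0.53 stub, and `list_nameservers /run/systemd/resolve/resolv.conf` lists the real upstream servers. For the container, `docker exec traefik cat /etc/resolv.conf` shows its view (assuming the container name from the compose file). On a Compose network this is typically 127.0.0.11, Docker's embedded DNS, which forwards to upstream resolvers taken from the host, so a difference between the Arch and Ubuntu hosts here would be worth a closer look.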

Hello, thank you for at least confirming my initial suspicions. Nevertheless, I am still stuck. Like I said in my initial post, I checked everything DNS-related I could come up with. Do you use iptables? I stopped the service because I thought that would make debugging easier, but maybe I should review this.

You can try an SOA query with drill/dig to check whether you can reproduce it.
Run it both inside the Traefik container and outside the container.

I don't have any specific iptables rules locally.
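The SOA check suggested above can be sketched as a small script. As far as I understand, the ACME library inside Traefik (lego) finds the zone by asking for the SOA record of the challenge name and then of each parent name, and it stops at the first unexpected response code, which matches the error message in the log. The function names `candidate_zones` and `soa_walk` and the file name `soa-walk.sh` are made up for this example; it assumes `dig` is installed.

```shell
#!/bin/sh
# Print every parent name of an FQDN, from the full name up to the TLD.
# These are the names whose SOA record is queried, one label at a time.
candidate_zones() {
    name="${1%.}"                     # drop a trailing dot, if any
    while [ -n "$name" ]; do
        printf '%s.\n' "$name"
        case "$name" in
            *.*) name="${name#*.}" ;; # strip the leftmost label
            *)   name="" ;;           # last label reached
        esac
    done
}

# Query the SOA record of each candidate and print the response status
# (NOERROR, NXDOMAIN, SERVFAIL, ...).
# Optional second argument: a resolver to ask directly, e.g. 1.1.1.1
# (assumption: any public resolver you trust).
soa_walk() {
    for zone in $(candidate_zones "_acme-challenge.$1"); do
        status=$(dig +noall +comments ${2:+@"$2"} "$zone" SOA |
            sed -n 's/.*status: \([A-Z]*\).*/\1/p')
        printf '%-40s %s\n' "$zone" "${status:-no response}"
    done
}

# Run only when a domain is given, e.g.: sh soa-walk.sh example.com
if [ "$#" -gt 0 ]; then
    soa_walk "$@"
fi
```

Running `sh soa-walk.sh example.com` uses the system resolver, while `sh soa-walk.sh example.com 1.1.1.1` asks a specific resolver directly. A SERVFAIL in the first case but not in the second would point at the local resolver chain. The Traefik image may not ship dig, so for the inside-the-container check a throwaway container with dig installed, attached to the same Docker network, is one option.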

Your last answer was very helpful and I am getting closer to the solution. I've been struggling with this for 2 days already. That said, my recent findings have left me more confused than ever. Please bear with me:

  • I tried 2 different .com domains bought and hosted on CF. The script just works for them, and I can get the cert!
  • I encounter this problem with my 2 .ovh domains. These domains were bought at OVH, with NS records pointing to CF. The script doesn't work for them even though it is exactly the same.

Alright, we are getting somewhere, but how come this exact script also works on said .ovh domains when I just change my system to Ubuntu Server? I checked both the Docker and Docker Compose versions, and they are virtually the same, differing only in the minor release number.

I do not know how to make any sense of this.