Traefik v2 on GKE fails all ACME challenges

Hi! First poster here. I apologize in advance if this has been seen before (I've checked but not all the posts) / if this is the wrong place, feel free to point me elsewhere if so.

I recently upgraded my Kubernetes cluster from Traefik 1.7 to 2+. I installed Traefik 2+ using the containous Helm chart with the default values (https://docs.traefik.io/getting-started/install-traefik/#use-the-helm-chart). From my understanding, this installation method creates CRDs for k8s that better map to Traefik concepts. Fine with me: I migrated my routes to the new IngressRoute definitions and they worked, and I was able to reach my services just fine through Traefik.

Now I've been trying to set up HTTPS, but for some reason Traefik always fails the TLS and HTTP challenges (the DNS challenge is not an option right now, and I don't need it). I know traffic gets to the destination service because I can curl / visit the https route by trusting the dummy cert that Traefik generates, but for some reason Let's Encrypt can't verify the domain?

Here's the ingress route def:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: traefik-v2-dashboard
  namespace: kube-system
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`traefik.mydomain.com`)
      kind: Rule
      services:
        - kind: TraefikService
          name: api@internal
  tls:
    certResolver: myresolver
---

and the traefik deployment:

spec:
  containers:
  - args:
    - --global.checknewversion
    - --global.sendanonymoususage
    - --entryPoints.traefik.address=:9000
    - --entryPoints.web.address=:80
    - --entryPoints.websecure.address=:443
    - --api.dashboard=true
    - --ping=true
    - --providers.kubernetescrd
    - --certificatesresolvers.myresolver.acme.email=engineering@mydomain.com
    - --certificatesresolvers.myresolver.acme.storage=acme.json
    - --certificatesresolvers.myresolver.acme.tlschallenge
    - --certificatesresolvers.myresolver.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory
    image: traefik:2.2.0
    name: traefik-v2
    ports:
    - containerPort: 9000
      name: traefik
      protocol: TCP
    - containerPort: 80
      name: web
      protocol: TCP
    - containerPort: 443
      name: websecure
      protocol: TCP

and the errors I've been getting for the HTTP challenge (the TLS challenge errors are similar):

time="2020-04-20T14:15:04Z" level=error msg="Unable to obtain ACME certificate for domains \"traefik.mydomain.com\": unable to generate a certificate for the domains [traefik.mydomain.com]: acme: Error -> One or more domains had a problem:\n[traefik.mydomain.com] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://traefik.mydomain.com/.well-known/acme-challenge/mjJQBITgUNfDxRR5am798R-p3DhqdTVJ97VWjGlHzUM: Connection refused, url: \n" rule="Host(`traefik.mydomain.com`)" providerName=myresolver.acme routerName=kube-system-traefik-v2-dashboard-541f1f6c7cdf8c56d30d@kubernetescrd

For context, I got the domain from google domains and have the DNS managed by netlify, where I forward the relevant subdomains to the GKE cluster. It's a roundabout configuration but it works. Also, I'm running in a non-HA configuration - one traefik container on one node is receiving the requests and dispatching.

Does anyone know what my problem may be or how I get more visibility into that?

Could it be some DNS delay? I don't spot an immediate issue with your configuration.

@SantoDE thanks for chiming in! I've had the DNS records in place for a week or so now so I think everything should be ok? Unless you're talking about some other DNS delay

Yeah, that's what I meant. Hmm. Is there another load balancer (or CloudFront?) in front that could prevent Let's Encrypt from reaching the actual Traefik instance?

no cloudfront though I was thinking of honestly sticking it on top to terminate ssl if I couldn't figure this out haha. There is another LB in front of traefik - the google LB to the nodes that GKE creates for traefik. I've double checked the details and it forwards 80 & 443 properly, and no firewalls on the instances from what I can see

Good news is I got it to work! Bad (?) news is I'm not entirely sure why it was failing in the first place. I changed the log level to DEBUG to follow along with the acme challenge procedure more carefully and found that after acme errors on traefik startup, it would successfully pass the challenge if I edited the ingress route crd to additionally listen to the web endpoint. I'm using the TLS challenge so I don't think the router listening on web matters - best guess is it's more of a cold-start problem for traefik?
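For anyone following along, the edited route would look roughly like the sketch below. It is a reconstruction from the description above, not a copy of the manifest that was applied: the only change from the original IngressRoute is the extra `web` entry under `entryPoints`. The log level can be raised with Traefik's `--log.level=DEBUG` argument; the thread doesn't show exactly how it was set, so treat that flag as an assumption about this setup.

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: traefik-v2-dashboard
  namespace: kube-system
spec:
  entryPoints:
    - web        # added: the router now also listens on the :80 entry point
    - websecure  # unchanged: the :443 entry point from the original route
  routes:
    - match: Host(`traefik.mydomain.com`)
      kind: Rule
      services:
        - kind: TraefikService
          name: api@internal
  tls:
    certResolver: myresolver
```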

When I restart the deployment I notice the cert dies as well - does traefik currently support putting the cert into a k8s secret?

Traefik currently supports persisting the cert in a volume. We're considering other storage options as well, but for now it's a PVC only.
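A minimal sketch of what that could look like, assuming a PVC named `traefik-acme`, a volume named `acme-data`, a `/data` mount path, and a 128Mi request (all four are placeholders picked for illustration, none come from this thread). The second document is only a fragment of the Deployment's pod spec showing the lines that would change, not a complete manifest:

```yaml
# Claim that will hold acme.json across pod restarts (name and size are hypothetical)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: traefik-acme
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce   # fine for a single, non-HA Traefik pod
  resources:
    requests:
      storage: 128Mi
---
# Fragment of the Traefik Deployment pod spec (not a standalone manifest)
spec:
  containers:
  - name: traefik-v2
    args:
    # point the resolver's storage at the mounted volume instead of the container filesystem
    - --certificatesresolvers.myresolver.acme.storage=/data/acme.json
    volumeMounts:
    - name: acme-data
      mountPath: /data
  volumes:
  - name: acme-data
    persistentVolumeClaim:
      claimName: traefik-acme
```

With the storage path on the mounted volume, the certificate should no longer be lost when the deployment restarts, since `acme.json` no longer lives on the container's ephemeral filesystem.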

Glad it's working now though :slight_smile:
