Aloha, I had a production (fortunately in-house) service lose its Traefik setup, and unfortunately it looks like somebody didn't include the Traefik manifests in our gitops, so we... lost them. I've been up for about 30 hours now trying to get the server (ra.ceresimaging.net, which uses OAuth2, so HTTPS is a must) back before Monday morning.
- I'm using the Traefik Helm chart as a sub-chart of our existing 'ra' service, with no cert-manager.
- I'm using the Let's Encrypt ACME HTTP challenge.
- The Traefik service has bound the correct (specified) GKE `loadBalancerIP` on ports :80 and :443, which matches the DNS records for ra.ceresimaging.net (`loadBalancerIP=104.196.247.22`).
- I can see the ACME HTTP requests from Let's Encrypt in the Traefik k8s service logs, but they're 404ing. I also can't get a valid response when I try to `GET /.well-known/acme-challenge` from my local network:
ra-traefik-5fbd5b8b47-cggbl ra-traefik time="2022-02-07T04:13:47Z" level=error msg="Cannot retrieve the ACME challenge for token wQaUqKWr8WowLwbhuixpvmM7L41krfOO9JUsVJNNdR8: cannot find challenge for token wQaUqKWr8WowLwbhuixpvmM7L41krfOO9JUsVJNNdR8" providerName=acme
ra-traefik-5fbd5b8b47-cggbl ra-traefik 127.0.0.1 - - [07/Feb/2022:04:13:02 +0000] "GET /.well-known/acme-challenge/wQaUqKWr8WowLwbhuixpvmM7L41krfOO9JUsVJNNdR8 HTTP/1.1" 404 0 "-" "-" 256 "acme-http@internal" "-" 45005ms
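For anyone who wants to repeat the local check, here is a minimal sketch of that GET in Python using only the standard library. The helper names `challenge_url` and `probe` are made up for illustration, and the token you pass is whatever you want to test with. It assumes the HTTP-01 challenge is served over plain HTTP on port 80, which is where Let's Encrypt sends its validation request.

```python
import urllib.error
import urllib.request
from urllib.parse import quote


def challenge_url(host: str, token: str) -> str:
    """Build the HTTP-01 challenge URL for a host and token (plain HTTP, port 80)."""
    return f"http://{host}/.well-known/acme-challenge/{quote(token, safe='')}"


def probe(host: str, token: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server gives for the challenge URL.

    Connection-level failures (DNS, refused, timeout) raise urllib.error.URLError.
    """
    request = urllib.request.Request(challenge_url(host, token), method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # A non-2xx answer still proves something answered on port 80.
        return err.code
```

Calling `probe("ra.ceresimaging.net", "some-token")` with a made-up token would presumably return 404 even on a healthy setup, so the status code mostly tells you whether port 80 reaches Traefik at all, not whether a real challenge would pass.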
Here's my Traefik-related config (note: I'm using Traefik as a subchart, hence the top-level `traefik:` key in values.yaml):
# values.yaml for traefik:
traefik:
  additionalArguments:
    - "--certificatesresolvers.letsencrypt.acme.email=seth@ceresimaging.net"
    - "--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json"
    #- "--certificatesresolvers.letsencrypt.acme.caserver=https://acme-v02.api.letsencrypt.org/directory"
    - "--certificatesresolvers.letsencrypt.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory"
    - "--certificatesResolvers.letsencrypt.acme.httpchallenge=true"
    - "--certificatesResolvers.letsencrypt.acme.httpchallenge.entrypoint=web"
    - "--api.insecure=true"
    - "--accesslog=true"
    - "--ping=false"
    - "--log.level=DEBUG"
  persistence:
    enabled: true
    path: /data
  service:
    spec:
      loadBalancerIP: 104.196.247.22 # gcp static reserved external ip: traefik is happily binding it, all seems good here
  deployment:
    initContainers:
      - name: volume-permissions
        image: busybox:1.31.1
        command: ["sh", "-c", "chmod -Rv 600 /data/*"]
        volumeMounts:
          - name: data
            mountPath: /data
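The flags in `additionalArguments` mix `certificatesresolvers` and `certificatesResolvers`. One cheap sanity check is to group the flags by lower-cased name and look for any name that ends up with two different values. The sketch below does that; it assumes Traefik matches flag names case-insensitively (which the mixed-case flags above rely on), and the helper names `group_flags` and `conflicting_flags` are invented for illustration.

```python
from collections import defaultdict

# ACME-related flags copied from additionalArguments above.
ARGS = [
    "--certificatesresolvers.letsencrypt.acme.email=seth@ceresimaging.net",
    "--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json",
    "--certificatesresolvers.letsencrypt.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory",
    "--certificatesResolvers.letsencrypt.acme.httpchallenge=true",
    "--certificatesResolvers.letsencrypt.acme.httpchallenge.entrypoint=web",
]


def group_flags(args):
    """Map each lower-cased flag name to the list of values given for it."""
    grouped = defaultdict(list)
    for arg in args:
        name, _, value = arg.lstrip("-").partition("=")
        grouped[name.lower()].append(value)
    return dict(grouped)


def conflicting_flags(args):
    """Return only the flags that were given more than one distinct value."""
    return {
        name: values
        for name, values in group_flags(args).items()
        if len(set(values)) > 1
    }


print(conflicting_flags(ARGS))  # prints {}
```

With the flags as posted nothing conflicts. Un-commenting the production `caserver` line while leaving the staging one in place would show up here as a single name with two values.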
# And here's my `IngressRoute`:
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: ra-web
spec:
  entryPoints:
    - web
    - websecure
  routes:
    - match: Host(`{{ .Values.ingress.hostname }}`) # => "ra.ceresimaging.net"
      kind: Rule
      services:
        - name: ra-web
          port: 80
  tls:
    enabled: true
    certResolver: letsencrypt
    domains:
      - main: {{ .Values.ingress.hostname }} # => "ra.ceresimaging.net"
Any suggestions on where to debug next would be appreciated. I'm getting pretty tired, and I feel like this is SO CLOSE to not being a disaster.
aloha,
-seth