`traefik` service returning 404 to Let's Encrypt ACME HTTP challenges (prod down)

Aloha, I had a production (fortunately in-house) service lose its traefik setup. Unfortunately, it looks like the traefik manifests were never included in our GitOps setup, so we lost them. I've been up for about 30 hours now trying to get the server (ra.ceresimaging.net, which uses OAuth2, so HTTPS is a must) back before Monday morning.

  1. I'm using the traefik Helm chart as a sub-chart of our existing 'ra' service, without cert-manager.
  2. I'm using the Let's Encrypt ACME HTTP challenge.
  3. The traefik service has bound the correct (specified) GKE loadBalancerIP on ports :80 and :443, and it matches the DNS records for ra.ceresimaging.net (loadBalancerIP=104.196.247.22).
  4. I can see the ACME HTTP requests from Let's Encrypt in the traefik k8s service logs, but traefik answers them with a 404. I also can't get a valid response when I try to GET /.well-known/acme-challenge from my local network:
ra-traefik-5fbd5b8b47-cggbl ra-traefik time="2022-02-07T04:13:47Z" level=error msg="Cannot retrieve the ACME challenge for token wQaUqKWr8WowLwbhuixpvmM7L41krfOO9JUsVJNNdR8: cannot find challenge for token wQaUqKWr8WowLwbhuixpvmM7L41krfOO9JUsVJNNdR8" providerName=acme
ra-traefik-5fbd5b8b47-cggbl ra-traefik 127.0.0.1 - - [07/Feb/2022:04:13:02 +0000] "GET /.well-known/acme-challenge/wQaUqKWr8WowLwbhuixpvmM7L41krfOO9JUsVJNNdR8 HTTP/1.1" 404 0 "-" "-" 256 "acme-http@internal" "-" 45005ms
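
For anyone who wants to reproduce the manual check from item 4, here is a minimal Python sketch. It assumes only the standard library and that port 80 on the host is reachable from wherever it runs. The helper names `challenge_url` and `probe` are made up for illustration, and the token is the one from the error log above, which Let's Encrypt replaces on every new attempt.

```python
import urllib.error
import urllib.request

HOST = "ra.ceresimaging.net"  # hostname from this post


def challenge_url(host: str, token: str) -> str:
    """Build the plain-HTTP URL an ACME HTTP-01 validator requests."""
    return f"http://{host}/.well-known/acme-challenge/{token}"


def probe(url: str, timeout: float = 60.0) -> int:
    """Return the HTTP status code for a GET on url, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we are after.
        return err.code


if __name__ == "__main__":
    # Token copied from the traefik error log above (illustrative only).
    token = "wQaUqKWr8WowLwbhuixpvmM7L41krfOO9JUsVJNNdR8"
    url = challenge_url(HOST, token)
    print(url, "->", probe(url))
```

The timeout is generous on purpose, since the access log line above shows the request taking about 45 seconds before the 404 came back.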

Here's my traefik-related config (note: I'm using traefik as a subchart, hence the top-level traefik: key in values.yaml):

# values.yaml for traefik:
traefik:
  additionalArguments:
    - "--certificatesresolvers.letsencrypt.acme.email=seth@ceresimaging.net"
    - "--certificatesresolvers.letsencrypt.acme.storage=/data/acme.json"
    #- "--certificatesresolvers.letsencrypt.acme.caserver=https://acme-v02.api.letsencrypt.org/directory"
    - "--certificatesresolvers.letsencrypt.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory"
    - "--certificatesResolvers.letsencrypt.acme.httpchallenge=true"
    - "--certificatesResolvers.letsencrypt.acme.httpchallenge.entrypoint=web"
    - "--api.insecure=true"
    - "--accesslog=true"
    - "--ping=false"
    - "--log.level=DEBUG"
  persistence:
    enabled: true
    path: /data
  service:
    spec:
      loadBalancerIP: 104.196.247.22 # gcp static reserved external ip: traefik is happily binding it, all seems good here
  deployment:
    initContainers:
      - name: volume-permissions
        image: busybox:1.31.1
        command: ["sh", "-c", "chmod -Rv 600 /data/*"]
        volumeMounts:
          - name: data
            mountPath: /data

# And here's my `IngressRoute`:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: ra-web
spec:
  entryPoints:
    - web
    - websecure
  routes:
    - match: Host(`{{ .Values.ingress.hostname }}`) # => "ra.ceresimaging.net"
      kind: Rule
      services:
        - name: ra-web
          port: 80
  
  tls:
    enabled: true
    certResolver: letsencrypt
    domains:
      - main: {{ .Values.ingress.hostname }} # => "ra.ceresimaging.net"

Any suggestions on where to debug next would be appreciated. I'm getting pretty tired, and I feel like this is SO CLOSE to not being a disaster :fire:

aloha,
-seth