Let's Encrypt issues a cert when using STAGING, but fails with a "400 Timeout error" when the resolver points to the PROD CA server

Hello there!

I'm using Traefik as a Kubernetes Ingress.

Until today I was using dnschallenge=true + wildcard. Worked great (that's a euphemism, it really rocks ;).

I wanted to serve a second domain with the same config (dnschallenge), but it's currently NOT possible if the two domains are hosted on the same provider (OVH in my case). Too bad but that's ok.

So I decided to go the manual way + tlschallenge=true.

However, Let's Encrypt fails to validate the TLS challenge. It fails every time with the error message "Timeout during read (your server may be slow or overloaded)" (very similar to this issue btw).

Unless I point the resolver to the Let's Encrypt STAGING server! WEIRD!
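
A minimal sketch of that switch, assuming it is done through the caServer flag that appears commented out in the values further down (the resolver name letls is taken from that config; leaving the flag out entirely makes Traefik default to the production directory):

```yaml
additionalArguments:
  # Staging CA: the TLS challenge validates and a (browser-untrusted) cert is issued
  - --certificatesresolvers.letls.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory
  # Production CA: same setup otherwise, but validation fails with the timeout error
  # - --certificatesresolvers.letls.acme.caserver=https://acme-v02.api.letsencrypt.org/directory
```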

Also, when accessing the service frontend in the browser (with the default Traefik cert being served), sometimes the request times out and nothing appears (Traefik logs "remote error: tls: unknown certificate"). And sometimes the request succeeds and the frontend page appears (with ofc an "INSECURE warning" from the browser saying that the cert isn't valid). It seems a bit random.

Any help/pointer would be much appreciated! Thanks a lot.

Here is my Traefik helm-values config:

additionalArguments:
  - --entryPoints.metrics.address=:8082

  - --entrypoints.web.http.redirections.entryPoint.to=websecure
  - --entrypoints.web.http.redirections.entryPoint.scheme=https

  - --certificatesresolvers.letls.acme.email=myemail
  - --certificatesresolvers.letls.acme.storage=/data/acme.json
  - --certificatesresolvers.letls.acme.tlschallenge=true
  # - --certificatesresolvers.letls.acme.caServer=https://acme-v02.api.letsencrypt.org/directory
  # - --certificatesresolvers.letls.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory
  - --certificatesresolvers.le.acme.email=myemail
  - --certificatesresolvers.le.acme.storage=/data/acme.json
  - --certificatesresolvers.le.acme.tlschallenge=false
  - --certificatesresolvers.le.acme.dnsChallenge=true
  - --certificatesresolvers.le.acme.dnsChallenge.provider=ovh
  - --certificatesresolvers.le.acme.dnsChallenge.delayBeforeCheck=10
  - --metrics.prometheus=true
  - --metrics.prometheus.entryPoint=metrics
  - --metrics.prometheus.addEntryPointsLabels=true
  - --metrics.prometheus.addServicesLabels=true
  - --log.level=DEBUG

ingressRoute:
  dashboard:
    enabled: false # take care of the tls IngressRoute in kubectl-traefik-config.yml file

podSecurityContext:
  fsGroup: null

persistence:
  enabled: true
  path: /data
  size: 1Gi

deployment:
  # Additional deployment annotations (e.g. for jaeger-operator sidecar injection)
  annotations: {}
  # Additional pod annotations (e.g. for mesh injection or prometheus scraping)
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"

env:
  - name: OVH_ENDPOINT
    valueFrom:
      secretKeyRef:
        name: ovh-credentials
        key: OVH_ENDPOINT
  - name: OVH_APPLICATION_KEY
    valueFrom:
      secretKeyRef:
        name: ovh-credentials
        key: OVH_APPLICATION_KEY
  - name: OVH_APPLICATION_SECRET
    valueFrom:
      secretKeyRef:
        name: ovh-credentials
        key: OVH_APPLICATION_SECRET
  - name: OVH_CONSUMER_KEY
    valueFrom:
      secretKeyRef:
        name: ovh-credentials
        key: OVH_CONSUMER_KEY

service:
  spec:
    loadBalancerIP: "the_IP_address"

Also, here is the service definition file:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: drone-app-tls
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`drone.domain.com`)
      services:
        - name: drone
          port: 80
  tls:
    certResolver: letls