Multiple Traefik instances and ACME

Hello all!

I have Traefik (v3.1.2) deployed in a multi-node HashiCorp Nomad (v1.8.2) environment, along with an internal ACME server, step-ca (version 2024-07-20).
I run three instances of Traefik, one on each of three nodes, all using the TLS challenge against the ACME server.
I am still new to Traefik and ACME, but HTTPS periodically breaks: Traefik serves its default certificate instead of the trusted one from the ACME server.
The logs for all three Traefik instances look similar to this:

2024-08-12T09:43:22.428148264-04:00 stdout F 2024-08-12T13:43:22Z INF Testing certificate renew... acmeCA=https://ca.example.com/acme/acme/directory providerName=internal.acme
2024-08-12T09:43:22.429374845-04:00 stdout F 2024-08-12T13:43:22Z INF Renewing certificate from LE : {Main:traefik.example.com SANs:[]} acmeCA=https://ca.example.com/acme/acme/directory providerName=internal.acme
2024-08-12T10:43:22.427921365-04:00 stdout F 2024-08-12T14:43:22Z INF Testing certificate renew... acmeCA=https://ca.example.com/acme/acme/directory providerName=internal.acme
2024-08-12T10:43:22.428923226-04:00 stdout F 2024-08-12T14:43:22Z INF Renewing certificate from LE : {Main:traefik.example.com SANs:[]} acmeCA=https://ca.example.com/acme/acme/directory providerName=internal.acme
2024-08-12T10:51:23.212670375-04:00 stdout F 2024-08-12T14:51:23Z ERR Error renewing certificate from LE: {traefik.example.com []} error="error: one or more domains had a problem:\n[traefik.example.com] the server didn't respond to our request\n" acmeCA=https://ca.example.com/acme/acme/directory providerName=internal.acme

My interpretation of the log above is that the certificate check at 09:43 passed, but the check started at 10:43 failed and Traefik gave up at 10:51. Is this interpretation correct?
If so, two of the three Traefik instances passed their certificate checks in this timeframe.
I am trying to determine whether I have a malformed Traefik ACME configuration or if the problem is with step-ca.

Here is the Nomad job specification for my Traefik deployment. Note that each Traefik instance is assigned a unique acme.email address.

job "traefik" {
  datacenters = ["homelab"]
  type        = "system"

  group "traefik" {
    network {
      port "http" {
        static = 80
      }
      port "https" {
        static = 443
      }
    }

    service {
      name     = "traefik"
      port     = "https"
      tags = [
        "traefik.enable=true",
        "traefik.http.routers.dashboard.rule=Host(`traefik.example.com`)",
        "traefik.http.routers.dashboard.service=api@internal",
        "traefik.http.routers.dashboard.entrypoints=web,websecure",
        "traefik.http.routers.dashboard.tls.certresolver=internal",
        "traefik.http.routers.dashboard.tls=true",
      ] 

      check {
        name     = "alive"
        type     = "tcp"
        port     = "http"
        interval = "10s"
        timeout  = "2s"
      }
    }

    task "traefik" {
      driver = "podman"
      config {
        image = "docker.io/library/traefik:v3.1.2"
        ports = [
          "http", 
          "https", 
        ]
        
        args = [
          "--api.dashboard=true",
          "--log.level=INFO",
          "--accesslog=true",

          # Consul integration
          "--providers.consulcatalog=true",
          "--providers.consulcatalog.exposedByDefault=false",
          "--providers.consulcatalog.prefix=traefik",
          "--providers.consulcatalog.endpoint.address=${NOMAD_IP_http}:8500",

          # HTTP entrypoints
          "--entrypoints.web.address=:${NOMAD_PORT_http}",
          "--entrypoints.websecure.address=:${NOMAD_PORT_https}",
          # Internal ACME/PKI
          "--certificatesresolvers.internal.acme.caserver=https://ca.example.com/acme/acme/directory",
          "--certificatesresolvers.internal.acme.email=${NOMAD_ALLOC_ID}@example.com",
          "--certificatesresolvers.internal.acme.storage=/local/internal.acme.json",
          "--certificatesresolvers.internal.acme.tlsChallenge=true",
          "--certificatesresolvers.internal.acme.certificatesduration=720",
        ]
      }
      
      artifact {
        source = "https://ca.example.com/roots.pem"
        mode   = "file"
      }

      env {
        LEGO_CA_CERTIFICATES = "/local/roots.pem"
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}

Thanks

Traefik CE does not support clustered Let's Encrypt (ACME), so you cannot really use tlsChallenge: the validation request that one instance triggers might be sent by the ACME server to a different instance.

Since you have your own cert server, try dnsChallenge instead. Each of the 3 instances then gets its own independent cert without the risk of validation requests crossing instances.
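
As a rough sketch of that suggestion, the resolver part of the job spec above might change as follows. This is not the configuration the original poster ended up with (the thread does not show it). The rfc2136 provider, the nameserver address, and the TSIG key values are assumptions for illustration only; any DNS provider supported by Traefik's ACME client (lego) could be used, and step-ca must be able to resolve the TXT records that Traefik creates.

```hcl
# Fragment of the "traefik" task only; settings not shown stay as they were.
task "traefik" {
  driver = "podman"

  config {
    image = "docker.io/library/traefik:v3.1.2"

    args = [
      # Internal ACME/PKI, validated over DNS instead of the TLS challenge
      "--certificatesresolvers.internal.acme.caserver=https://ca.example.com/acme/acme/directory",
      "--certificatesresolvers.internal.acme.email=${NOMAD_ALLOC_ID}@example.com",
      "--certificatesresolvers.internal.acme.storage=/local/internal.acme.json",
      "--certificatesresolvers.internal.acme.dnschallenge=true",
      # Assumption: the internal DNS server accepts RFC 2136 dynamic updates
      "--certificatesresolvers.internal.acme.dnschallenge.provider=rfc2136",
    ]
  }

  env {
    LEGO_CA_CERTIFICATES = "/local/roots.pem"

    # Hypothetical values for the rfc2136 provider; replace with your own
    RFC2136_NAMESERVER     = "ns1.example.com:53"
    RFC2136_TSIG_KEY       = "traefik-acme."
    RFC2136_TSIG_ALGORITHM = "hmac-sha256."
    RFC2136_TSIG_SECRET    = "REPLACE_ME"
  }
}
```

With the DNS challenge, the ACME server validates a TXT record rather than connecting back to port 443, so it no longer matters which Traefik instance receives incoming traffic for traefik.example.com. In a real deployment the TSIG secret would more likely come from a Nomad template or variable than from a literal in the job file.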

Thanks for the feedback. I'll work on this and report my findings.

I just wanted to follow up and say that using dnsChallenge appears to have fixed my issue.

Thanks!
