Hello all!
I have Traefik (v3.1.2) deployed in a multi-node HashiCorp Nomad (v1.8.2) environment, along with an internal ACME server running step-ca (version 2024-07-20).
I have three Traefik instances deployed across three nodes, each using the TLS challenge (TLS-ALPN-01) against the ACME server.
I am still new to Traefik and ACME. HTTPS periodically breaks: Traefik falls back to serving its default certificate instead of the trusted one issued by the ACME server.
The logs for all three Traefik instances look similar to this:
2024-08-12T09:43:22.428148264-04:00 stdout F 2024-08-12T13:43:22Z INF Testing certificate renew... acmeCA=https://ca.example.com/acme/acme/directory providerName=internal.acme
2024-08-12T09:43:22.429374845-04:00 stdout F 2024-08-12T13:43:22Z INF Renewing certificate from LE : {Main:traefik.example.com SANs:[]} acmeCA=https://ca.example.com/acme/acme/directory providerName=internal.acme
2024-08-12T10:43:22.427921365-04:00 stdout F 2024-08-12T14:43:22Z INF Testing certificate renew... acmeCA=https://ca.example.com/acme/acme/directory providerName=internal.acme
2024-08-12T10:43:22.428923226-04:00 stdout F 2024-08-12T14:43:22Z INF Renewing certificate from LE : {Main:traefik.example.com SANs:[]} acmeCA=https://ca.example.com/acme/acme/directory providerName=internal.acme
2024-08-12T10:51:23.212670375-04:00 stdout F 2024-08-12T14:51:23Z ERR Error renewing certificate from LE: {traefik.example.com []} error="error: one or more domains had a problem:\n[traefik.example.com] the server didn't respond to our request\n" acmeCA=https://ca.example.com/acme/acme/directory providerName=internal.acme
My interpretation of the above log is that the renewal started at 09:43 succeeded, while the renewal started at 10:43 failed, with Traefik giving up at 10:51. Is this assumption correct?
If my logic is correct, two out of the three Traefik instances passed their certificate testing in this timeframe.
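To double-check the timing, here is a minimal Python sketch that measures how long the failed attempt ran before the error was logged. It is my own helper, not part of Traefik; the names `parse_line` and `seconds_between` are made up for illustration, and it assumes every line has the same layout as the log above (runtime timestamp, stream, flag, then Traefik's own UTC timestamp and level).

```python
from datetime import datetime

# Two lines from the log above, with the trailing key=value fields trimmed.
RENEW_LINE = (
    "2024-08-12T10:43:22.428923226-04:00 stdout F 2024-08-12T14:43:22Z INF "
    "Renewing certificate from LE : {Main:traefik.example.com SANs:[]}"
)
ERROR_LINE = (
    "2024-08-12T10:51:23.212670375-04:00 stdout F 2024-08-12T14:51:23Z ERR "
    "Error renewing certificate from LE: {traefik.example.com []}"
)


def parse_line(line):
    """Return (traefik_timestamp, level, message) for one container log line."""
    # Fields: runtime timestamp, stream, flag, Traefik timestamp, level, message.
    _runtime_ts, _stream, _flag, traefik_ts, level, message = line.split(" ", 5)
    return datetime.strptime(traefik_ts, "%Y-%m-%dT%H:%M:%SZ"), level, message


def seconds_between(start_line, end_line):
    """Seconds elapsed between the Traefik timestamps of two log lines."""
    start, _, _ = parse_line(start_line)
    end, _, _ = parse_line(end_line)
    return (end - start).total_seconds()


elapsed = seconds_between(RENEW_LINE, ERROR_LINE)
print(f"{elapsed:.0f} s between the renewal attempt and the error")
```

For the two lines above this comes out to 481 seconds, so roughly eight minutes passed between the 10:43 attempt and the error.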
I am trying to determine whether I have a malformed Traefik ACME configuration or if the problem is with step-ca.
Here is the Nomad job specification that defines my Traefik deployment. Note that each Traefik instance is assigned a unique acme.email address.
job "traefik" {
  datacenters = ["homelab"]
  type        = "system"

  group "traefik" {
    network {
      port "http" {
        static = 80
      }
      port "https" {
        static = 443
      }
    }

    service {
      name = "traefik"
      port = "https"
      tags = [
        "traefik.enable=true",
        "traefik.http.routers.dashboard.rule=Host(`traefik.example.com`)",
        "traefik.http.routers.dashboard.service=api@internal",
        "traefik.http.routers.dashboard.entrypoints=web,websecure",
        "traefik.http.routers.dashboard.tls.certresolver=internal",
        "traefik.http.routers.dashboard.tls=true",
      ]

      check {
        name     = "alive"
        type     = "tcp"
        port     = "http"
        interval = "10s"
        timeout  = "2s"
      }
    }

    task "traefik" {
      driver = "podman"

      config {
        image = "docker.io/library/traefik:v3.1.2"
        ports = [
          "http",
          "https",
        ]
        args = [
          "--api.dashboard=true",
          "--log.level=INFO",
          "--accesslog=true",
          # Consul integration
          "--providers.consulcatalog=true",
          "--providers.consulcatalog.exposedByDefault=false",
          "--providers.consulcatalog.prefix=traefik",
          "--providers.consulcatalog.endpoint.address=${NOMAD_IP_http}:8500",
          # HTTP entrypoints
          "--entrypoints.web.address=:${NOMAD_PORT_http}",
          "--entrypoints.websecure.address=:${NOMAD_PORT_https}",
          # Internal ACME/PKI
          "--certificatesresolvers.internal.acme.caserver=https://ca.example.com/acme/acme/directory",
          "--certificatesresolvers.internal.acme.email=${NOMAD_ALLOC_ID}@example.com",
          "--certificatesresolvers.internal.acme.storage=/local/internal.acme.json",
          "--certificatesresolvers.internal.acme.tlsChallenge=true",
          "--certificatesresolvers.internal.acme.certificatesduration=720",
        ]
      }

      artifact {
        source = "https://ca.example.com/roots.pem"
        mode   = "file"
      }

      env {
        LEGO_CA_CERTIFICATES = "/local/roots.pem"
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}
Thanks