We have a strange problem with Traefik (v3.3.5) and Consul (v1.21.4) running on a Nomad (v1.10.5) cluster.
I’m not sure if the problem is with the Traefik ConsulCatalog integration or with Consul itself.
Either way, we have two Traefik instances (Nomad jobs) in the cluster — one for the staging environment and another for production. They both have almost the same configuration.
Staging:

```yaml
experimental:
  plugins:
    tokenauth:
      moduleName: "github.com/Clasyc/tokenauth"
      version: "v0.1.0"
entryPoints:
  http:
    address: ":80"
    transport:
      respondingTimeouts:
        readTimeout: 3600s
        writeTimeout: 3600s
        idleTimeout: 180s
    http:
      redirections:
        entryPoint:
          to: https
          scheme: https
          permanent: true
  https:
    address: ":443"
    transport:
      respondingTimeouts:
        readTimeout: 3600s
        writeTimeout: 3600s
        idleTimeout: 180s
    http:
      middlewares:
        - hsts@file
      tls: {}
  traefik:
    address: ":8081"
  ping:
    address: ":8082"
tls:
  options:
    default:
      sniStrict: true
      minVersion: VersionTLS12
api:
  dashboard: true
  insecure: true
pilot:
  dashboard: false
providers:
  file:
    directory: /local/rules
    watch: true
  consulCatalog:
    cache: true
    constraints: "!Tag(`env.production`)"
    refreshInterval: 5s
    connectAware: true
    prefix: traefik
    exposedByDefault: false
    endpoint:
      address: consul.service.consul:8501
      scheme: https
      tls:
        ca: /etc/nomad-volume/ca/consul-agent-ca.pem
ping:
  entryPoint: ping
log:
  format: json
  level: info
metrics:
  prometheus:
    addEntryPointsLabels: false
    addServicesLabels: true
    addRoutersLabels: false
    buckets:
      - 0.04
      - 0.06
      - 0.08
      - 0.1
      - 0.2
      - 0.3
      - 0.5
      - 1.0
```
Production:

```yaml
experimental:
  plugins:
    tokenauth:
      moduleName: "github.com/Clasyc/tokenauth"
      version: "v0.1.0"
entryPoints:
  http:
    address: "<REDACTED>:80"
    transport:
      respondingTimeouts:
        readTimeout: 3600s
        writeTimeout: 3600s
        idleTimeout: 180s
    http:
      redirections:
        entryPoint:
          to: https
          scheme: https
          permanent: true
  https:
    address: "<REDACTED>:443"
    transport:
      respondingTimeouts:
        readTimeout: 3600s
        writeTimeout: 3600s
        idleTimeout: 180s
    http:
      middlewares:
        - hsts@file
      tls: {}
  traefik:
    address: ":8081"
  ping:
    address: ":8082"
tls:
  options:
    default:
      sniStrict: true
      minVersion: VersionTLS12
api:
  dashboard: true
  insecure: true
pilot:
  dashboard: false
providers:
  file:
    directory: /local/rules
    watch: true
  consulCatalog:
    cache: true
    constraints: "Tag(`env.production`)"
    refreshInterval: 5s
    connectAware: true
    watch: false
    prefix: traefik
    exposedByDefault: false
    endpoint:
      address: consul.service.consul:8501
      scheme: https
      tls:
        ca: /etc/nomad-volume/ca/consul-agent-ca.pem
ping:
  entryPoint: ping
log:
  format: json
  level: info
metrics:
  prometheus:
    addEntryPointsLabels: false
    addServicesLabels: true
    addRoutersLabels: false
    buckets:
      - 0.04
      - 0.06
      - 0.08
      - 0.1
      - 0.2
      - 0.3
      - 0.5
      - 1.0
```
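To spell out "almost the same": as far as I can tell, the two files differ only in these spots (everything else is identical):

```yaml
# Staging only:
providers:
  consulCatalog:
    constraints: "!Tag(`env.production`)"
entryPoints:
  http:
    address: ":80"              # all interfaces (":443" likewise)

# Production only:
providers:
  consulCatalog:
    constraints: "Tag(`env.production`)"
    watch: false                # this key is absent from staging
entryPoints:
  http:
    address: "<REDACTED>:80"    # bound to one specific public IP
```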
The problem is that, without any changes or new deployments, our production Traefik instance suddenly starts returning a 502 Bad Gateway response for all applications that are dynamically added using tags through the Consul Catalog integration.
Here’s an example of the application tags:
```hcl
tags = [
  "env.{{ namespace }}",
  "traefik.enable=true",
  "traefik.http.routers.search-manager--{{ namespace }}.rule=Host(`{{ config.traefik_host }}`)",
  "traefik.http.routers.search-manager--{{ namespace }}.tls=true",
  "traefik.consulcatalog.connect=true",
  "traefik.http.middlewares.add-forwarded-headers.headers.customrequestheaders.X-Forwarded-Proto=https",
  "traefik.http.routers.search-manager--{{ namespace }}.middlewares=add-forwarded-headers",
  "logs.promtail=true"
]
```
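For context on how the constraints select these services: with `namespace = production`, the first tag renders to `env.production`, which matches the production provider's `Tag(` + "`env.production`" + `)` constraint and is excluded by staging's negated one. The rendered tags look roughly like this (the real hostname is redacted):

```hcl
tags = [
  "env.production",
  "traefik.enable=true",
  "traefik.http.routers.search-manager--production.rule=Host(`<REDACTED>`)",
  "traefik.http.routers.search-manager--production.tls=true",
  "traefik.consulcatalog.connect=true",
  "traefik.http.middlewares.add-forwarded-headers.headers.customrequestheaders.X-Forwarded-Proto=https",
  "traefik.http.routers.search-manager--production.middlewares=add-forwarded-headers",
  "logs.promtail=true"
]
```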
To be clear: this only happens to applications added through the Consul Catalog, and only on the production Traefik instance, even though its configuration is almost identical to staging's. When it happens, every Consul Catalog route is affected at once. Consul reports all service health checks as passing, and I also connected to the Docker containers directly; the applications were all healthy and responding correctly. Traefik logs show no errors at all. The issue appears randomly (sometimes after a week, sometimes after a few hours), and redeploying Traefik fixes it temporarily, bringing everything back online.
I can’t properly reproduce this issue since it happens randomly. Once it starts throwing 502 Bad Gateway errors, it never recovers on its own until Traefik is redeployed.
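For reference, this is roughly what I run when it happens, to compare Traefik's own view of the backends with Consul's. The API address is an assumption based on the static config above (`api.insecure: true` serves the API on the `traefik` entrypoint, `:8081`), and `search-manager` is just a placeholder service name:

```shell
# Diagnostic sketch -- adjust TRAEFIK_API if the API listens elsewhere.
TRAEFIK_API="${TRAEFIK_API:-http://localhost:8081}"

# Dump the backend URLs Traefik has resolved from the Consul catalog.
# A stale or empty server list here, while Consul reports healthy
# instances, would point at the provider rather than at the apps.
if curl -sf --max-time 2 "$TRAEFIK_API/api/http/services" -o /tmp/traefik-services.json; then
  jq -r '.[] | "\(.name)\t\(.loadBalancer.servers // [] | map(.url) | join(", "))"' \
    /tmp/traefik-services.json
else
  echo "Traefik API not reachable at $TRAEFIK_API" >&2
fi

# Cross-check against what Consul itself advertises (placeholder name):
# consul catalog services -tags
# curl -s "http://127.0.0.1:8500/v1/health/service/search-manager?passing"
```

Comparing the two lists side by side at least tells me whether Traefik's view of the catalog has drifted from Consul's.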
The staging environment has been running with the same setup for over a year on the same cluster without any issues, so I’m really puzzled.
The only differences I see between the staging and production servers where Traefik is deployed are that the production server has two public IPs on the same network interface, and the production Traefik configuration is bound to a specific IP like this:
```yaml
entryPoints:
  http:
    address: "<REDACTED>:80"
```
But I have a feeling this has nothing to do with the issue.
I’m lost.