Consul Catalog services suddenly return 502 Bad Gateway

We have a strange problem with Traefik (v3.3.5) and Consul (v1.21.4) running on a Nomad (v1.10.5) cluster.

I’m not sure if the problem is with the Traefik ConsulCatalog integration or with Consul itself.

Either way, we have two Traefik instances (Nomad jobs) in the cluster — one for the staging environment and another for production. They both have almost the same configuration.

Staging
experimental:
  plugins:
    tokenauth:
      moduleName: "github.com/Clasyc/tokenauth"
      version: "v0.1.0"

entryPoints:
  http:
    address: :80
    transport:
      respondingTimeouts:
        readTimeout: 3600s
        writeTimeout: 3600s
        idleTimeout: 180s
    http:
      redirections:
        entryPoint:
          to: https
          scheme: https
          permanent: true
  https:
    address: :443
    transport:
      respondingTimeouts:
        readTimeout: 3600s
        writeTimeout: 3600s
        idleTimeout: 180s
    http:
      middlewares:
        - hsts@file
      tls: {}
  traefik:
    address: :8081
  ping:
    address: :8082
tls:
  options:
    default:
      sniStrict: true
      minVersion: VersionTLS12
api:
  dashboard: true
  insecure: true
pilot:
  dashboard: false
providers:
  file:
    directory: /local/rules
    watch: true
  consulCatalog:
    cache: true
    constraints: "!Tag(`env.production`)"
    refreshInterval: 5s
    connectAware: true
    prefix: traefik
    exposedByDefault: false
    endpoint:
      address: consul.service.consul:8501
      scheme: https
      tls:
        ca: /etc/nomad-volume/ca/consul-agent-ca.pem
ping:
  entryPoint: ping
log:
  format: json
  level: info
metrics:
  prometheus:
    addEntryPointsLabels: false
    addServicesLabels: true
    addRoutersLabels: false
    buckets:
      - 0.04
      - 0.06
      - 0.08
      - 0.1
      - 0.2
      - 0.3
      - 0.5
      - 1.0
Production
experimental:
  plugins:
    tokenauth:
      moduleName: "github.com/Clasyc/tokenauth"
      version: "v0.1.0"

entryPoints:
  http:
    address: <REDACTED>:80
    transport:
      respondingTimeouts:
        readTimeout: 3600s
        writeTimeout: 3600s
        idleTimeout: 180s
    http:
      redirections:
        entryPoint:
          to: https
          scheme: https
          permanent: true
  https:
    address: <REDACTED>:443
    transport:
      respondingTimeouts:
        readTimeout: 3600s
        writeTimeout: 3600s
        idleTimeout: 180s
    http:
      middlewares:
        - hsts@file
      tls: {}
  traefik:
    address: :8081
  ping:
    address: :8082
tls:
  options:
    default:
      sniStrict: true
      minVersion: VersionTLS12
api:
  dashboard: true
  insecure: true
pilot:
  dashboard: false
providers:
  file:
    directory: /local/rules
    watch: true
  consulCatalog:
    cache: true
    constraints: "Tag(`env.production`)"
    refreshInterval: 5s
    connectAware: true
    watch: false
    prefix: traefik
    exposedByDefault: false
    endpoint:
      address: consul.service.consul:8501
      scheme: https
      tls:
        ca: /etc/nomad-volume/ca/consul-agent-ca.pem
ping:
  entryPoint: ping
log:
  format: json
  level: info
metrics:
  prometheus:
    addEntryPointsLabels: false
    addServicesLabels: true
    addRoutersLabels: false
    buckets:
      - 0.04
      - 0.06
      - 0.08
      - 0.1
      - 0.2
      - 0.3
      - 0.5
      - 1.0

The problem is that, without any changes or new deployments, our production Traefik instance suddenly starts returning a 502 Bad Gateway response for all applications that are dynamically added using tags through the Consul Catalog integration.

Here’s an example of the application tags:

tags = [
  "env.{{ namespace  }}",
  "traefik.enable=true",
  "traefik.http.routers.search-manager--{{ namespace }}.rule=Host(`{{ config.traefik_host }}`)",
  "traefik.http.routers.search-manager--{{ namespace }}.tls=true",
  "traefik.consulcatalog.connect=true",
  "traefik.http.middlewares.add-forwarded-headers.headers.customrequestheaders.X-Forwarded-Proto=https",
  "traefik.http.routers.search-manager--{{ namespace }}.middlewares=add-forwarded-headers",
  "logs.promtail=true"
]

To begin with, this only happens to applications added through the Consul Catalog, and only on the production Traefik instance, even though the configuration is almost the same as on staging. It affects all routes added by the Consul Catalog.

I checked the Consul service health checks and connected to the Docker containers to see if the applications were responding; they were all healthy and responding correctly. Consul itself also reports that all health checks pass. Traefik logs show no errors at all.

The issue appears randomly, sometimes after a week, sometimes after a few hours, and redeploying Traefik temporarily fixes it, bringing everything back online.

I can’t properly reproduce this issue since it happens randomly. Once it starts throwing 502 Bad Gateway errors, it never recovers on its own until Traefik is redeployed.

The staging environment has been running with the same setup for over a year on the same cluster without any issues, so I’m really puzzled.

The only differences I see between the staging and production servers where Traefik is deployed are that the production server has two public IPs on the same network interface, and the production Traefik configuration is bound to a specific IP like this:

entryPoints:
  http:
    address: <REDACTED>:80

But I have a feeling this has nothing to do with the issue.

I’m lost.

OK, I found the issue. Our staging Traefik had the Consul service name traefik, while our production Traefik used traefik-production.

The problem was that providers.consulCatalog.serviceName defaults to traefik (according to the documentation). This caused production to silently use the wrong service name, which sometimes worked initially because our staging traefik service was already registered in Consul.

The fix: We explicitly set providers.consulCatalog.serviceName to match our actual Consul service name (traefik-production), and the problem disappeared.
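
A minimal sketch of that change, assuming the rest of the production consulCatalog block stays exactly as posted above (only the serviceName line is new, and the other keys are omitted here for brevity):

```yaml
providers:
  consulCatalog:
    connectAware: true
    # Must match the name this Traefik instance is registered under in Consul.
    # If omitted, Traefik falls back to the default, "traefik".
    serviceName: traefik-production
```

With this set, the Connect leaf certificate should be requested for traefik-production rather than traefik, which is the service the production token is actually allowed to write to.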

The confusing part: There seems to be a timing quirk with how Nomad Workload Identity tokens interact with Consul. During deployment, when the short-lived workload identity token is first issued, Traefik can sometimes authenticate successfully even with the wrong service name (despite strict Consul binding rules). However, once that initial token expires and Traefik tries to refresh the TLS certificate, it fails because it's still using the incorrect service name:

[ERROR] agent.http: Request error: method=GET url=/v1/agent/connect/ca/leaf/traefik?index=59833053
from=10.180.238.25:46044 error="rpc error making call: Permission denied: token with AccessorID
'c89fa400-b9d0-5683-63b9-9a01e1d49f14' lacks permission 'service:write' on "traefik""
