Traefik (consulcatalog), Nomad and Consul - 502 errors during deployment

Traefik: 2.10.1
Nomad: v1.5.6+ent
Consul: v1.15.2+ent

We are experiencing 502 errors during deployments and have tried a few configurations for the Nomad jobs and Traefik.

Enabling "watch: true" seems to solve the issue, but it causes a massive spike in CPU usage on the Consul leader server.

Setting "refreshInterval: 1" greatly reduces the problem without causing issues on Consul, but some 502s still occur.

I got some input from another forum thread before opening my own: "Traefik V2 and ConsulCatalog".

I would appreciate your input/help.

An example of the configuration in use for Traefik:

api:
  dashboard: true
  insecure: true
entrypoints:
  https:
    address: "0.0.0.0:<PORT_NUMBER>"
  metrics:
    address: "0.0.0.0:<PORT_NUMBER>"
  traefik:
    address: "0.0.0.0:<PORT_NUMBER>"
log:
  level: DEBUG
  format: json
accessLog:
  format: json
  filePath: "/dev/stdout"
  fields:
    defaultMode: keep
    headers:
      defaultMode: keep
metrics:
  prometheus:
    addEntryPointsLabels: true
    addServicesLabels: true
    entryPoint: metrics
providers:
  file:
    filename: "/secrets/<FILE_1>.yaml"
  consulCatalog:
    prefix: <PREFIX>
    exposedByDefault: false
    cache: false
    constraints: Tag(`tier=<TIER_NAME>`)
    endpoint:
      address: 172.17.0.1:8501
      scheme: https
      tls:
        ca: "/secrets/<FILE_2>.pem"
        cert: "/secrets/<FILE_3>.pem"
        key: "/secrets/<FILE_4>.key"
        insecureSkipVerify: true
    
serversTransport:
  insecureSkipVerify: true
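
For reference, the two variants mentioned above change only the consulCatalog block. A rough sketch, with the rest of the configuration left as shown (I'm assuming the bare value 1 is read as one second, so it is written here with an explicit unit):

```yaml
providers:
  consulCatalog:
    # Variant 1: seems to fix the 502s, but spikes CPU on the Consul leader.
    # watch: true

    # Variant 2: far fewer 502s and no visible impact on Consul,
    # though a few 502s still slip through.
    refreshInterval: 1s
```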

I have the same problem here, but I'm using only Nomad. I will try setting the refresh interval to 1 to see if it helps.

We use Docker Swarm, and I guess you need to balance how your orchestration tool deploys new containers against how often Traefik refreshes its list of targets.

In the end we set a Docker Swarm container update cycle of 30 seconds and a poll interval of 15 seconds. So one container is restarted every 30 seconds, and Traefik always has a few still working in its list of targets to forward to.
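
Roughly, that setup could look like the sketch below. It assumes Traefik v2 syntax for the Docker provider in Swarm mode (newer versions name this option differently), and the service name, image, and replica count are placeholders rather than our real values:

```yaml
# Traefik static configuration (v2 syntax assumed)
providers:
  docker:
    swarmMode: true
    swarmModeRefreshSeconds: 15   # poll Swarm for changes every 15 seconds
---
# Stack file for the application (hypothetical service name and image)
services:
  app:
    image: example/app:latest
    deploy:
      replicas: 3                 # placeholder: enough that some stay up
      update_config:
        parallelism: 1            # update one container at a time
        delay: 30s                # wait 30 seconds between containers
```

With the update delay at twice the poll interval, Traefik should get at least one refresh between consecutive container restarts.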

An update for others with a similar issue.

We had two types of 502s:

  1. the ones happening during deployments
    This was solved with either of two options:
    a) on Traefik: refreshInterval: 3s

       on Nomad: kill_timeout="5s"
                 shutdown_delay="10s"

    b) on Traefik: no particular settings

       on Nomad: kill_timeout="2m"
                 shutdown_delay="10s"
                 kill_signal="SIGUSR1"

  2. intermittent/sporadic ones
    These are not caused by Traefik (we've performed lots of tests to be sure). The errors are generated by the upstream, and Traefik is "just logging" them as received.

Issue 2) caused confusion while we were testing fixes for the deployment issue, because the symptom is the same (i.e. 502s appearing in the Traefik logs).
