Traefik (consulcatalog), Nomad and Consul - 502 errors during deployment

marco.chessa · August 23, 2023, 3:46pm

Traefik: 2.10.1
Nomad: v1.5.6+ent
Consul: v1.15.2+ent

We are experiencing 502 errors during deployments, and we've tried a few configurations for the Nomad jobs and Traefik.

Enabling "watch: true" seems to solve the issue; however, when it is enabled it causes a massive spike in CPU usage on the Consul leader server.

Setting "refreshInterval: 1" instead seems to greatly reduce the errors without causing problems on Consul, but some 502s still occur.
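
For reference, both settings sit under the consulCatalog provider in Traefik's static configuration. A minimal sketch of the two variants described above, assuming the value 1 is meant as one second; the endpoint values are copied from the configuration further down and all other options are omitted:

```yaml
providers:
  consulCatalog:
    endpoint:
      address: 172.17.0.1:8501
      scheme: https
    # Variant 1: use Consul watches instead of periodic polling.
    # Reported above to remove the 502s but to spike CPU on the Consul leader.
    # watch: true
    # Variant 2: keep polling, but with a short interval (assumed to be 1 second).
    refreshInterval: 1s
```

Only one of the two variants would normally be active at a time, which is why the watch line is left commented out here.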

[I got some input from an earlier forum entry before opening my own: Traefik V2 and ConsulCatalog]

I would appreciate your input/help.

An example of the Traefik configuration in use:

api:
  dashboard: true
  insecure: true
entrypoints:
  https:
    address: "0.0.0.0:<PORT_NUMBER>"
  metrics:
    address: "0.0.0.0:<PORT_NUMBER>"
  traefik:
    address: "0.0.0.0:<PORT_NUMBER>"
log:
  level: DEBUG
  format: json
accessLog:
  format: json
  filePath: "/dev/stdout"
  fields:
    defaultMode: keep
    headers:
      defaultMode: keep
metrics:
  prometheus:
    addEntryPointsLabels: true
    addServicesLabels: true
    entryPoint: metrics
providers:
  file:
    filename: "/secrets/<FILE_1>.yaml"
  consulCatalog:
    prefix: <PREFIX>
    exposedByDefault: false
    cache: false
    constraints: Tag(`tier=<TIER_NAME>`)
    endpoint:
      address: 172.17.0.1:8501
      scheme: https
      tls:
        ca: "/secrets/<FILE_2>.pem"
        cert: "/secrets/<FILE_3>.pem"
        key: "/secrets/<FILE_4>.key"
        insecureSkipVerify: true
    
serversTransport:
  insecureSkipVerify: true

helioascorreia · September 18, 2023, 4:36pm

I have the same problem here, but I'm using only Nomad. I will try setting the refresh interval to 1 to see if it helps.

bluepuma77 · September 19, 2023, 6:16am

We use Docker Swarm, and I guess you need to balance how quickly your orchestration tool deploys new containers against how often Traefik refreshes its list of targets.

In the end we set a Docker Swarm container update cycle of 30 sec and a poll interval of 15 sec. So one container is restarted every 30 sec, and Traefik always has a few working containers in its list of targets to forward to.
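
The actual files were not posted, so the following is only a sketch of how such a setup might look. The service name, image, and replica count are placeholders; update_config is the Compose stack option for rolling updates, and swarmModeRefreshSeconds is the Traefik v2 Docker provider's poll interval in Swarm mode (assuming Traefik v2 is in use here as well):

```yaml
# Stack file (deployed with docker stack deploy); names are hypothetical
version: "3.8"
services:
  app:
    image: example/app:latest
    deploy:
      replicas: 3
      update_config:
        parallelism: 1   # replace one container at a time
        delay: 30s       # wait 30 s before moving on to the next container
---
# Traefik static configuration (separate file)
providers:
  docker:
    swarmMode: true
    swarmModeRefreshSeconds: 15   # poll the Swarm API every 15 s
```

With the update delay at twice the poll interval, Traefik refreshes its target list at least once between two consecutive container replacements.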

marco.chessa · September 25, 2023, 8:41am

An update for others with a similar issue.

We had two types of 502s:

1) The ones happening during deployments
This was solved with one of two options:

a) on Traefik: refreshInterval: 3s

   on Nomad: kill_timeout="5s"
             shutdown_delay="10s"

b) on Traefik: no particular settings

   on Nomad: kill_timeout="2m"
             shutdown_delay="10s"
             kill_signal="SIGUSR1"

2) Intermittent/sporadic ones
These are not caused by Traefik (we've performed lots of tests to be sure); the errors are generated by the upstream, and Traefik is "just logging" them as received.

Issue 2) caused confusion during our tests to eliminate the 502s happening during deployments, because the error is the same (i.e. 502s appearing in the Traefik logs).
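
One way to read option a): Nomad's shutdown_delay is the wait between removing the task's service registration and sending the task its kill signal, so a refreshInterval shorter than that delay gives Traefik time to drop the instance while it is still serving. A sketch of the Traefik side only, with the Nomad values repeated in comments for comparison:

```yaml
providers:
  consulCatalog:
    # Poll Consul every 3 s. With shutdown_delay="10s" on the Nomad task,
    # Traefik polls at least three times between the service being
    # deregistered and the task receiving its kill signal.
    refreshInterval: 3s
```

This is an interpretation of why the combination works, not something confirmed in the thread; the remaining consulCatalog options stay as in the original configuration.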

