Traefik stops responding under load

Hello! We run Traefik in our Kubernetes cluster as the ingress for Artifactory, and under load we've been seeing a lot of timeouts.

We're seeing the following logs in debug mode:

{"level":"debug","msg":"http: TLS handshake error from 10.240.2.33:33892: EOF","time":"2022-01-06T02:50:38Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.2.199:36642: EOF","time":"2022-01-06T02:50:39Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.0.105:35048: EOF","time":"2022-01-06T02:51:03Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.2.33:34746: EOF","time":"2022-01-06T02:51:04Z"}
{"level":"debug","msg":"Send instance info to pilot: 2022-01-06 02:51:17.281474505 +0000 UTC m=+304.997529756","time":"2022-01-06T02:51:17Z"}
{"level":"error","msg":"failed to create UUID: failed call Pilot: Post \"https://instance-info.pilot.traefik.io/public/\": dial tcp 18.158.161.205:443: i/o timeout (Client.Timeout exceeded while awaiting headers)","time":"2022-01-06T02:51:22Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.3.44:18813: EOF","time":"2022-01-06T02:51:54Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.2.212:46822: EOF","time":"2022-01-06T02:52:17Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.3.153:33670: EOF","time":"2022-01-06T02:52:17Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.1.51:57164: EOF","time":"2022-01-06T02:52:21Z"}

When these errors start popping up, we see timeouts awaiting headers, timeouts during TLS handshakes, and timeouts on the non-HTTPS /ping endpoint used by our health checks. Note that we have manually relaxed the health checks to a 10s timeout and 10 consecutive failures before restart, to rule out the failing checks (and the resulting pod restarts) as the cause.
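
For reference, the probe overrides on the Traefik Deployment look roughly like this (a minimal sketch; the port assumes /ping is served on the default traefik entrypoint, so adjust it to your setup):

# Traefik Deployment - probe overrides (sketch)
livenessProbe:
  httpGet:
    path: /ping
    port: 9000          # assumes the default "traefik" entrypoint port
  timeoutSeconds: 10    # 10s timeout, as described above
  failureThreshold: 10  # 10 consecutive failures before restart
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ping
    port: 9000
  timeoutSeconds: 10
  failureThreshold: 10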

Since we're using the built-in Let's Encrypt (LE) resolver, we are not running HA (only one pod). We are actively looking at migrating to cert-manager so that we can run multiple pods, which might mitigate the issue.
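
In case it helps anyone going down the same path: with cert-manager, the certificate lives in a Secret rather than in per-pod ACME storage, so multiple replicas can share it. The setup is roughly a Certificate that writes a TLS secret, which the IngressRoute then references via tls.secretName instead of a certResolver. A minimal sketch (the names, hostname, and issuer below are illustrative):

# cert-manager Certificate (sketch - names, hostname, and issuer are illustrative)
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: artifactory-tls
spec:
  secretName: artifactory-tls   # TLS secret the IngressRoute will reference
  dnsNames:
  - artifactory.example.com
  issuerRef:
    name: letsencrypt
    kind: ClusterIssuer

# IngressRoute tls section, pointing at the secret instead of a resolver
tls:
  secretName: artifactory-tls
  options:
    name: minimum-tls-1-2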

In terms of configuration, it's fairly basic:

# Ingressroute spec
spec:
  entryPoints:
  - web
  - websecure
  routes:
  - kind: Rule
    match: Host(`<redacted>`)
    services:
    - name: artifactory
      port: 8082
  - kind: Rule
    match: Host(`artifactory.octopushq.com`)
    services:
    - name: artifactory
      port: 8082
  - kind: Rule
    match: Host(`chocolatey.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-chocolatey-v2-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`deb.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-deb-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`docker.<redacted>`)
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`helm.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-helm-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`msi.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-msi-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`npm.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-npm-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`nuget.<redacted>`) && Path(`/v3/index.json`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-nuget-v3-index-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`nuget.<redacted>`) && PathPrefix(`/v3`) && !Path(`/v3/index.json`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-nuget-v3-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`nuget.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-nuget-v2-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`rpm.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-rpm-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`zip.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-zip-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`maven.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-maven-path-middleware
    services:
    - name: artifactory
      port: 8081
  tls:
    certResolver: le
    options:
      name: minimum-tls-1-2
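
The minimum-tls-1-2 referenced above is a TLSOption resource; a minimal sketch of that kind of option (assuming it only pins the minimum TLS version) looks like this:

# TLSOption - minimum TLS version (sketch; cipher suites omitted)
apiVersion: traefik.containo.us/v1alpha1
kind: TLSOption
metadata:
  name: minimum-tls-1-2
spec:
  minVersion: VersionTLS12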

# Middleware - header
spec:
  headers:
    customRequestHeaders:
      X-JFrog-Override-Base-Url: https://<redacted>/
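
(This header tells Artifactory which external base URL to use when generating links and redirects behind the reverse proxy.)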

# Middleware - path (example from npm; the others follow the same pattern)
spec:
  replacePathRegex:
    regex: ^/(.*)$
    replacement: /artifactory/npm/$1
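
So a request to the npm host for /lodash/-/lodash-4.17.21.tgz (an arbitrary example path) gets forwarded to Artifactory as /artifactory/npm/lodash/-/lodash-4.17.21.tgz.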

We aren't seeing any other pods experiencing networking issues on the same node, and I'm not sure where we should look next! Any help would be much appreciated.

For anyone else who finds this thread: it turned out that CPU on the node was spiking, which caused the dropped connections and timeouts.
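
If you're chasing something similar, checking node-level CPU (for example with kubectl top nodes) and giving the Traefik pod explicit CPU requests are worth doing early. A rough sketch of the Deployment resources block (values are illustrative, not a recommendation):

# Traefik Deployment - resources (sketch; values are illustrative)
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 512Mi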

Hello @liam-mackie

Thanks for posting the issue and for following up with your investigation.
Glad to hear you found the root cause.

If you run into any other issues, feel free to reach out to the Traefik community.

Thanks,
Jakub