Traefik stops responding under load

Hello! We run Traefik in our Kubernetes cluster as the ingress for Artifactory, and under load we've been seeing a lot of timeouts.

We're seeing the following logs in debug mode:

{"level":"debug","msg":"http: TLS handshake error from 10.240.2.33:33892: EOF","time":"2022-01-06T02:50:38Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.2.199:36642: EOF","time":"2022-01-06T02:50:39Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.0.105:35048: EOF","time":"2022-01-06T02:51:03Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.2.33:34746: EOF","time":"2022-01-06T02:51:04Z"}
{"level":"debug","msg":"Send instance info to pilot: 2022-01-06 02:51:17.281474505 +0000 UTC m=+304.997529756","time":"2022-01-06T02:51:17Z"}
{"level":"error","msg":"failed to create UUID: failed call Pilot: Post \"https://instance-info.pilot.traefik.io/public/\": dial tcp 18.158.161.205:443: i/o timeout (Client.Timeout exceeded while awaiting headers)","time":"2022-01-06T02:51:22Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.3.44:18813: EOF","time":"2022-01-06T02:51:54Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.2.212:46822: EOF","time":"2022-01-06T02:52:17Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.3.153:33670: EOF","time":"2022-01-06T02:52:17Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.240.1.51:57164: EOF","time":"2022-01-06T02:52:21Z"}

When these errors start popping up, we see timeouts awaiting headers, timeouts during TLS handshakes, and timeouts on the non-HTTPS /ping endpoint used by our health checks. Note that we have manually relaxed the health checks to a 10s timeout and 10 consecutive failures before restart, to rule out the failing checks (and the resulting pod restarts) as the cause.
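
For reference, the probe overrides on the Traefik Deployment look roughly like this (a minimal sketch; the port assumes /ping is served on the default traefik entrypoint, so adjust it to your setup):

# Traefik Deployment - probe overrides (sketch)
livenessProbe:
  httpGet:
    path: /ping
    port: 9000          # assumes the default "traefik" entrypoint port
  timeoutSeconds: 10    # 10s timeout, as described above
  failureThreshold: 10  # 10 consecutive failures before restart
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ping
    port: 9000
  timeoutSeconds: 10
  failureThreshold: 10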

Since we're using the built-in Let's Encrypt (LE) resolver, we are not running HA (only one pod). We are actively looking at migrating to cert-manager so that we can run multiple pods, which might mitigate the issue.
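
In case it helps anyone going down the same path: with cert-manager, the certificate lives in a Secret rather than in per-pod ACME storage, so multiple replicas can share it. The setup is roughly a Certificate that writes a TLS secret, which the IngressRoute then references via tls.secretName instead of a certResolver. A minimal sketch (the names, hostname, and issuer below are illustrative):

# cert-manager Certificate (sketch - names, hostname, and issuer are illustrative)
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: artifactory-tls
spec:
  secretName: artifactory-tls   # TLS secret the IngressRoute will reference
  dnsNames:
  - artifactory.example.com
  issuerRef:
    name: letsencrypt
    kind: ClusterIssuer

# IngressRoute tls section, pointing at the secret instead of a resolver
tls:
  secretName: artifactory-tls
  options:
    name: minimum-tls-1-2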

In terms of configuration, it's fairly basic:

# Ingressroute spec
spec:
  entryPoints:
  - web
  - websecure
  routes:
  - kind: Rule
    match: Host(`<redacted>`)
    services:
    - name: artifactory
      port: 8082
  - kind: Rule
    match: Host(`artifactory.octopushq.com`)
    services:
    - name: artifactory
      port: 8082
  - kind: Rule
    match: Host(`chocolatey.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-chocolatey-v2-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`deb.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-deb-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`docker.<redacted>`)
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`helm.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-helm-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`msi.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-msi-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`npm.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-npm-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`nuget.<redacted>`) && Path(`/v3/index.json`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-nuget-v3-index-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`nuget.<redacted>`) && PathPrefix(`/v3`) && !Path(`/v3/index.json`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-nuget-v3-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`nuget.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-nuget-v2-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`rpm.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-rpm-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`zip.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-zip-path-middleware
    services:
    - name: artifactory
      port: 8081
  - kind: Rule
    match: Host(`maven.<redacted>`)
    middlewares:
    - name: artifactory-artifactory-header-middleware
    - name: artifactory-maven-path-middleware
    services:
    - name: artifactory
      port: 8081
  tls:
    certResolver: le
    options:
      name: minimum-tls-1-2
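
The minimum-tls-1-2 referenced above is a TLSOption resource; a minimal sketch of that kind of option (assuming it only pins the minimum TLS version) looks like this:

# TLSOption - minimum TLS version (sketch; cipher suites omitted)
apiVersion: traefik.containo.us/v1alpha1
kind: TLSOption
metadata:
  name: minimum-tls-1-2
spec:
  minVersion: VersionTLS12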

# Middleware - header
spec:
  headers:
    customRequestHeaders:
      X-JFrog-Override-Base-Url: https://<redacted>/
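
(This header tells Artifactory which external base URL to use when generating links and redirects behind the reverse proxy.)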

# Middleware - path (example from npm; the others follow the same pattern)
spec:
  replacePathRegex:
    regex: ^/(.*)$
    replacement: /artifactory/npm/$1
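
So a request to the npm host for /lodash/-/lodash-4.17.21.tgz (an arbitrary example path) gets forwarded to Artifactory as /artifactory/npm/lodash/-/lodash-4.17.21.tgz.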

We aren't seeing any other pods experiencing networking issues on the same node, and I'm not sure where we should look next! Any help would be much appreciated.

For anyone else who finds this thread: it turned out that CPU on the node was spiking, which caused the dropped connections and timeouts.
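
If you're chasing something similar, checking node-level CPU (for example with kubectl top nodes) and giving the Traefik pod explicit CPU requests are worth doing early. A rough sketch of the Deployment resources block (values are illustrative, not a recommendation):

# Traefik Deployment - resources (sketch; values are illustrative)
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 512Mi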

Hello @liam-mackie

Thanks for posting the issue and for following up with your investigation.
Glad to hear you found the root cause.

If you run into any other issues, feel free to reach out to the Traefik community.

Thanks,
Jakub