Failed plugin download caused all services to go down

Hello,
I am using k3s cluster with traefik as the ingress. I had an interesting situation this morning when I was notified that all my services are down with 404 error. I checked the logs and I saw this:

<serving traffic as usual>

{"level":"info","msg":"Traefik version 2.9.4 built on 2022-10-27T18:44:34Z","time":"2023-02-28T06:59:36Z"}

{"level":"info","msg":"\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://doc.traefik.io/traefik/contributing/data-collection/\n","time":"2023-02-28T06:59:36Z"}

{"error":"failed to download plugin github.com/soulbalz/traefik-real-ip: failed to call service: Get \"https://plugins.traefik.io/public/download/github.com/soulbalz/traefik-real-ip/v1.0.3\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","level":"error","msg":"Plugins are disabled because an error has occurred.","time":"2023-02-28T06:59:41Z"}

{"entryPointName":"websecure","level":"error","msg":"invalid middleware \"kube-system-traefik-real-ip@kubernetescrd\" configuration: invalid middleware type or middleware does not exist","routerName":"service-name-ce29b9a8df15f4c7cd88@kubernetescrd","time":"2023-02-28T06:59:42Z"}

I was able to fix this by simply killing the pod, but in the meantime all services were down, because every one of them uses one plugin or another.

I have a couple of questions regarding this issue:

  1. Why was the config suddenly reloaded? The pod had been up for 28 days and this has never happened before; there is nothing in the logs to suggest a problem, and the pod did not die.

  2. Is there a way to make the healthcheck fail when a plugin download fails, so that Kubernetes keeps restarting the pod until the download succeeds?

My config:

    - --entrypoints.metrics.address=:9101/tcp
    - --entrypoints.traefik.address=:9000/tcp
    - --entrypoints.web.address=:8000/tcp
    - --entrypoints.websecure.address=:8443/tcp
    - --api.dashboard=true
    - --ping=true
    - --metrics.prometheus=true
    - --metrics.prometheus.entrypoint=metrics
    - --providers.kubernetescrd
    - --providers.kubernetesingress
    - --providers.kubernetesingress.ingressendpoint.publishedservice=kube-system/traefik
    - --entrypoints.web.http.redirections.entryPoint.to=:443
    - --entrypoints.web.http.redirections.entryPoint.scheme=https
    - --entrypoints.websecure.http.tls=true
    - --log.format=json
    - --log.level=INFO
    - --accesslog=true
    - --accesslog.format=json
    - --accesslog.fields.defaultmode=drop
    - --accesslog.fields.names.ClientHost=keep
    - --accesslog.fields.names.RequestHost=keep
    - --accesslog.fields.names.RequestMethod=keep
    - --accesslog.fields.names.RequestPath=keep
    - --accesslog.fields.headers.defaultmode=drop
    - --accesslog.fields.headers.names.Cf-Ipcountry=keep
    - --accesslog.fields.headers.names.User-Agent=keep
    - --accesslog.fields.headers.names.X-Forwarded-User=keep
    - --log.level=INFO
    - --providers.kubernetescrd.allowCrossNamespace=true
    - --providers.kubernetescrd
    - --experimental.plugins.traefik-real-ip.modulename=github.com/soulbalz/traefik-real-ip
    - --experimental.plugins.traefik-real-ip.version=v1.0.3
    - --entryPoints.web.forwardedHeaders.insecure
    - --entryPoints.websecure.forwardedHeaders.insecure
    image: rancher/mirrored-library-traefik:2.9.4

Thanks!


Hi @maxiu,
Thanks for your interest in Traefik!

The first thing you can try is upgrading to a newer Traefik version.
Starting with v2.9.6, Traefik increased the timeout for plugin downloads.

You could also try Traefik v3, which includes new retry functionality.

Plugins are parsed and loaded exclusively during startup, which allows Traefik to check the integrity of the code and catch errors early on.
If an error occurs during loading, the plugin is disabled.
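If the network dependency at startup is the concern, another option might be to ship the plugin source with the image (or a mounted volume) and load it as a local plugin, so no download happens at all. A rough sketch, assuming the plugin source has been vendored under `plugins-local/src/github.com/soulbalz/traefik-real-ip` relative to Traefik's working directory (paths and the flag layout should be checked against your Traefik version's docs):

```yaml
    # Sketch: load the plugin from a local directory instead of downloading it.
    # Traefik expects the source under ./plugins-local/src/<moduleName>,
    # which you would bake into the image or mount as a volume.
    - --experimental.localplugins.traefik-real-ip.modulename=github.com/soulbalz/traefik-real-ip
```

The middleware would then reference the plugin by the same name (`traefik-real-ip`) as before, but startup no longer depends on plugins.traefik.io being reachable.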

Hello!
Thanks for the suggestions.
Are there any alternative ways to make this more robust, such as failing the healthcheck, or making the middleware optional when its plugin was not loaded?

> Plugins are parsed and loaded exclusively during startup

This is what I don't understand: it happened during normal operation, with no restart triggered by us. It looks like the Traefik process inside the container died without any message and was restarted.

Sorry to revive this thread, but I have exactly the same problem.

On Kubernetes, during a rollout of a Traefik deployment, if Traefik cannot download a plugin it still becomes ready and then returns 404 for everything.
Wouldn't it be simpler to SIGTERM the process, so that a Traefik instance (which isn't healthy, after all) never becomes ready?
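A deployment-level workaround (not something built into Traefik) could be to point the readiness probe at a route that actually passes through the routers, instead of at the `/ping` endpoint, so an instance that 404s everything never becomes ready. A minimal sketch, where `healthcheck.example.com` is an assumed IngressRoute served by this Traefik instance and the port matches the `websecure` entrypoint from the config above:

```yaml
        # Sketch: probe a real routed host rather than /ping.
        # If plugins failed to load and the routers are gone, this
        # request returns 404 and the pod never becomes ready.
        readinessProbe:
          httpGet:
            path: /
            port: 8443
            scheme: HTTPS
            httpHeaders:
              - name: Host
                value: healthcheck.example.com
          periodSeconds: 10
          failureThreshold: 3
```

The trade-off is that readiness now also depends on that one backend route being healthy, so you would probably want it to point at something trivial that Traefik itself serves.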