Failed plugin download caused all services to go down

Hello,
I am running a k3s cluster with Traefik as the ingress controller. I ran into an interesting situation this morning when I was notified that all my services were down with a 404 error. I checked the logs and saw this:

<serving traffic as usual>

{"level":"info","msg":"Traefik version 2.9.4 built on 2022-10-27T18:44:34Z","time":"2023-02-28T06:59:36Z"}

{"level":"info","msg":"\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://doc.traefik.io/traefik/contributing/data-collection/\n","time":"2023-02-28T06:59:36Z"}

{"error":"failed to download plugin github.com/soulbalz/traefik-real-ip: failed to call service: Get \"https://plugins.traefik.io/public/download/github.com/soulbalz/traefik-real-ip/v1.0.3\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","level":"error","msg":"Plugins are disabled because an error has occurred.","time":"2023-02-28T06:59:41Z"}

{"entryPointName":"websecure","level":"error","msg":"invalid middleware \"kube-system-traefik-real-ip@kubernetescrd\" configuration: invalid middleware type or middleware does not exist","routerName":"service-name-ce29b9a8df15f4c7cd88@kubernetescrd","time":"2023-02-28T06:59:42Z"}

I was able to fix this by simply killing the pod, but in the meantime all services were down, because every one of them uses one plugin or another.

I have a couple of questions regarding this issue:

  1. Why was the config suddenly reloaded? The pod had been up for 28 days and this had never happened before; there is nothing in the logs to suggest a problem, and the pod did not die.

  2. Is there a way to make the health check fail when a plugin download fails, so that Kubernetes keeps restarting the pod until the download succeeds? The current probe setup is sketched below for reference.
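
Right now the pod just uses the standard ping-based probes from the chart, roughly like this (exact values assumed from the default k3s/Helm layout; port 9000 matches the traefik entrypoint in the config below):

    # liveness/readiness probes on the Traefik container (values assumed from chart defaults)
    livenessProbe:
      httpGet:
        path: /ping
        port: 9000
        scheme: HTTP
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ping
        port: 9000
        scheme: HTTP
      periodSeconds: 10
      failureThreshold: 1

The probe never failed during the incident, so Kubernetes never restarted the pod on its own.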

My config:

    - --entrypoints.metrics.address=:9101/tcp
    - --entrypoints.traefik.address=:9000/tcp
    - --entrypoints.web.address=:8000/tcp
    - --entrypoints.websecure.address=:8443/tcp
    - --api.dashboard=true
    - --ping=true
    - --metrics.prometheus=true
    - --metrics.prometheus.entrypoint=metrics
    - --providers.kubernetescrd
    - --providers.kubernetesingress
    - --providers.kubernetesingress.ingressendpoint.publishedservice=kube-system/traefik
    - --entrypoints.web.http.redirections.entryPoint.to=:443
    - --entrypoints.web.http.redirections.entryPoint.scheme=https
    - --entrypoints.websecure.http.tls=true
    - --log.format=json
    - --log.level=INFO
    - --accesslog=true
    - --accesslog.format=json
    - --accesslog.fields.defaultmode=drop
    - --accesslog.fields.names.ClientHost=keep
    - --accesslog.fields.names.RequestHost=keep
    - --accesslog.fields.names.RequestMethod=keep
    - --accesslog.fields.names.RequestPath=keep
    - --accesslog.fields.headers.defaultmode=drop
    - --accesslog.fields.headers.names.Cf-Ipcountry=keep
    - --accesslog.fields.headers.names.User-Agent=keep
    - --accesslog.fields.headers.names.X-Forwarded-User=keep
    - --log.level=INFO
    - --providers.kubernetescrd.allowCrossNamespace=true
    - --providers.kubernetescrd
    - --experimental.plugins.traefik-real-ip.modulename=github.com/soulbalz/traefik-real-ip
    - --experimental.plugins.traefik-real-ip.version=v1.0.3
    - --entryPoints.web.forwardedHeaders.insecure
    - --entryPoints.websecure.forwardedHeaders.insecure
    image: rancher/mirrored-library-traefik:2.9.4
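
For completeness, the middleware referenced in the error message is defined roughly like this (namespace and name match the log; the plugin options shown are only illustrative):

    apiVersion: traefik.containo.us/v1alpha1
    kind: Middleware
    metadata:
      name: traefik-real-ip
      namespace: kube-system
    spec:
      plugin:
        traefik-real-ip:
          # plugin-specific options go here; exact keys depend on the plugin version
          excludednets:
            - 1.1.1.1/24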

Thanks!


Hi @maxiu,
Thanks for your interest in Traefik!

The first thing you can try is upgrading to a newer Traefik version.
From v2.9.6 on, Traefik uses a longer timeout for plugin downloads.

You could also try Traefik v3, which includes new retry functionality.
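
If you are running the Traefik bundled with k3s, a minimal sketch for pinning a newer image would be a HelmChartConfig override along these lines (tag and values structure are assumptions, adjust to your setup):

    apiVersion: helm.cattle.io/v1
    kind: HelmChartConfig
    metadata:
      name: traefik
      namespace: kube-system
    spec:
      valuesContent: |-
        image:
          name: traefik
          tag: v2.9.6   # any 2.9.6+ tag picks up the longer plugin download timeout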

Plugins are parsed and loaded exclusively during startup, which allows Traefik to check the integrity of the code and catch errors early on.
If an error occurs during loading, the plugin is disabled.

Hello!
Thanks for the suggestions.
Are there any alternative ways to make this more robust, such as failing the health check or making the middleware optional when its plugin could not be loaded?

Plugins are parsed and loaded exclusively during startup

This is what I don't understand: it happened during normal operation and there was no pod restart. It looks like the Traefik process inside the container died without any message and was restarted.
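
For reference, the container restart count and Last State should show whether the process inside the container was killed and restarted even though the pod object stayed the same (label selector assumed from the default chart):

    # list the Traefik pod and its restart count
    kubectl -n kube-system get pods -l app.kubernetes.io/name=traefik
    # inspect Last State / Exit Code of the traefik container
    kubectl -n kube-system describe pod <traefik-pod-name>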