Issues with Prometheus/Traefik Metrics - "Context Deadline Exceeded" Error

Hello Traefik community,

I'm currently running Traefik in a Kubernetes cluster, and I'm encountering an issue with Prometheus scraping the metrics from Traefik. Here's a detailed overview of my setup and the problem I'm facing:

Cluster Details:

  • Kubernetes Environment: Managed Kubernetes on a cloud provider
  • Traefik: Deployed with the official Traefik Helm chart
  • Namespace: traefik
  • Prometheus: Deployed with the Prometheus Helm chart

Relevant excerpt from my Traefik Helm values:
globalArguments:
  - "--global.sendanonymoususage=false"
  - "--global.checknewversion=false"

additionalArguments:
  - "--serversTransport.insecureSkipVerify=true"
  - "--log.level=DEBUG"
  - "--accessLog=true"
  - "--accessLog.fields.headers.defaultMode=keep"
  - "--accessLog.fields.headers.names.Authorization=drop"
  - "--accessLog.fields.headers.names.Cookie=drop"
  - "--entrypoints.web.address=:8000"
  - "--entrypoints.websecure.address=:8443"
  - "--metrics.prometheus=true"
  - "--metrics.prometheus.entryPoint=metrics"
  - "--entryPoints.metrics.address=:9100"

ports:
  web:
    redirectTo:
      port: websecure
      priority: 10
  websecure:
    http3:
      enabled: true
    advertisedPort: 443 
    tls:
      enabled: true
  cockroachdb:
    port: 26257
  metrics:
    port: 9100
    expose:
      enabled: true
      port: 9100
      protocol: TCP
      entryPoint: metrics
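
Based on these values, my expectation is that the chart exposes the metrics port on the traefik Service roughly like this (a hand-written sketch of what I expect it to render, not actual output; the service name and labels are the chart defaults as I understand them, so please correct me if the chart does something different):

apiVersion: v1
kind: Service
metadata:
  name: traefik        # assumption: chart's default service name
  namespace: traefik
spec:
  selector:
    app.kubernetes.io/name: traefik  # assumption: chart's default labels
  ports:
    - name: metrics
      port: 9100
      targetPort: metrics
      protocol: TCP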

The Problem:

In the Prometheus UI, the Traefik metrics endpoint appears as DOWN with the following error message:

http://5.5.176.214:9100/metrics  DOWN  
instance="5.5.176.214:9100" job="traefik"
Get "http://5.5.176.214:9100/metrics": context deadline exceeded

Troubleshooting Steps Taken:

  1. Verified the Traefik Metrics Endpoint:
  • Attempted to curl the metrics endpoint from both inside and outside the cluster; the request times out in both cases (the in-cluster test is sketched after this list).
  2. Checked Traefik Logs:
  • No errors related to metrics exposure were found in the logs.
  3. Adjusted the Prometheus Scrape Timeout:
  • Increased the scrape timeout to 30 seconds in the Prometheus config, but the issue persists.
  4. Checked Network Policies and Firewall Settings:
  • Ensured that no network policies or firewall rules are blocking port 9100 (a NetworkPolicy sketch of what I'd expect to need is after this list).
  5. Checked Resource Usage:
  • Verified that the Traefik pod isn't resource-constrained (normal CPU and memory usage).
  6. Restarted Traefik:
  • Restarted the Traefik pod, but the problem remains.
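
For the in-cluster curl test in step 1, I used a throwaway Pod along these lines (a sketch; curlimages/curl is just a convenient image choice, and the target is the pod IP and port from the error above):

apiVersion: v1
kind: Pod
metadata:
  name: traefik-metrics-probe
  namespace: traefik
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: curlimages/curl:latest  # any image with curl would do
      # fail fast instead of hanging: give up after 5 seconds
      args: ["-v", "--max-time", "5", "http://5.5.176.214:9100/metrics"]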
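
On step 4: if a NetworkPolicy were the culprit, my understanding is that something like the following would be needed to admit the scrapes (a sketch; it assumes Prometheus runs in a namespace named monitoring and that the Traefik pods carry the chart's default app.kubernetes.io/name: traefik label):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-to-traefik-metrics
  namespace: traefik
spec:
  # select the Traefik pods (assumes the chart's default labels)
  podSelector:
    matchLabels:
      app.kubernetes.io/name: traefik
  policyTypes:
    - Ingress
  ingress:
    - from:
        # assumes Prometheus lives in the "monitoring" namespace
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 9100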

Request for Help:

I'm seeking advice on how to further troubleshoot and resolve this issue. Has anyone encountered similar problems with Prometheus scraping Traefik metrics? Are there any specific configurations or logs I should look into?

Thank you in advance for your help!


I have the same problem with Traefik Helm chart version 31.0.0.