Hello Traefik community,
I'm currently running Traefik in a Kubernetes cluster, and I'm encountering an issue with Prometheus scraping the metrics from Traefik. Here's a detailed overview of my setup and the problem I'm facing:
Cluster Details:
- Kubernetes Environment: Running on a cloud provider
- Traefik Version: Using the official Traefik Helm chart
- Namespace: traefik
- Prometheus Version: Deployed using the Prometheus Helm chart
Traefik Helm Values:
globalArguments:
  - "--global.sendanonymoususage=false"
  - "--global.checknewversion=false"
additionalArguments:
  - "--serversTransport.insecureSkipVerify=true"
  - "--log.level=DEBUG"
  - "--accessLog=true"
  - "--accessLog.fields.headers.defaultMode=keep"
  - "--accessLog.fields.headers.names.Authorization=drop"
  - "--accessLog.fields.headers.names.Cookie=drop"
  - "--entrypoints.web.address=:8000"
  - "--entrypoints.websecure.address=:8443"
  - "--metrics.prometheus=true"
  - "--metrics.prometheus.entryPoint=metrics"
  - "--entryPoints.metrics.address=:9100"
ports:
  web:
    redirectTo:
      port: websecure
      priority: 10
  websecure:
    http3:
      enabled: true
      advertisedPort: 443
    tls:
      enabled: true
  cockroachdb:
    port: 26257
  metrics:
    port: 9100
    expose:
      enabled: true
      port: 9100
    protocol: TCP
    entryPoint: metrics
The Problem:
In the Prometheus UI, the Traefik metrics endpoint appears as DOWN with the following error message:
http://5.5.176.214:9100/metrics DOWN
instance="5.5.176.214:9100" job="traefik"
Get "http://5.5.176.214:9100/metrics": context deadline exceeded
Troubleshooting Steps Taken:
- Verified the Traefik Metrics Endpoint:
  - Attempted to curl the metrics endpoint from both inside and outside the cluster, but the request times out.
- Checked Traefik Logs:
  - No errors related to metrics exposure were found in the logs.
- Adjusted Prometheus Scrape Timeout:
  - Increased the scrape timeout to 30 seconds in the Prometheus config, but the issue persists.
- Checked Network Policies and Firewall Settings:
  - Ensured that no network policies or firewall rules are blocking port 9100.
- Checked Resource Usage:
  - Verified that the Traefik pod isn't resource-constrained (normal CPU and memory usage).
- Restarted Traefik:
  - Tried restarting the Traefik pod, but the problem remains.
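For anyone who wants to reproduce the reachability check from the first step, here is a minimal sketch in Python. It assumes Python 3 is available wherever it is run (inside or outside the cluster). The helper names `probe_tcp` and `fetch_metrics` are hypothetical, and the address is the scrape target shown in the Prometheus UI above. The idea is to separate "nothing answers at all" (timeout) from "host reachable but nothing listening" (refused), which a plain curl timeout does not make obvious.

```python
import socket
import urllib.error
import urllib.request


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a plain TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer within the timeout: packets are likely dropped or unroutable.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on this port.
        return "refused"
    except OSError as exc:
        return f"error: {exc}"


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Return the status and first line of the response, or the failure reason."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            first_line = resp.readline().decode("utf-8", "replace").strip()
            return f"HTTP {resp.status}: {first_line}"
    except urllib.error.HTTPError as exc:
        return f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        return f"failed: {exc}"


if __name__ == "__main__":
    # Address taken from the Prometheus target list in this post.
    host, port = "5.5.176.214", 9100
    print("TCP:", probe_tcp(host, port))
    print("HTTP:", fetch_metrics(f"http://{host}:{port}/metrics"))
```

Running it from a pod in the same namespace and again from outside the cluster should show whether the two paths fail in the same way.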
Request for Help:
I'm seeking advice on how to further troubleshoot and resolve this issue. Has anyone encountered similar problems with Prometheus scraping Traefik metrics? Are there any specific configurations or logs I should look into?
Thank you in advance for your help!