Hello, I am trying to get a better understanding of what is happening:
I got a Sentry notification for a 503. I confirmed it by checking the Stackdriver logs, and I do see a couple of 5XX errors within the last hour or so. However, when I query traefik_service_requests_total in Grafana, I get no results back. Here is the query I used:
traefik_service_requests_total{service=~"production.*", code=~"503"}
I read somewhere that this could be because counters are reset when services are scaled down and back up, and because querying the metric directly only returns counts for services that are currently live. To account for this, we should use rate/increase, which try to adjust for counter resets automatically (docs).
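To check my understanding of the reset handling, here is a minimal Python sketch of how I think an increase-style calculation treats a counter that drops. It is a simplification, not Prometheus's actual implementation (as far as I know the real functions also extrapolate to the window boundaries), and the function name and sample values are made up for illustration.

```python
def reset_aware_increase(samples):
    """Total increase across counter samples, treating any drop as a reset.

    Hypothetical helper for illustration only; it does no extrapolation.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            # Normal case: the counter moved forward (or stayed flat).
            total += curr - prev
        else:
            # The counter went down, so assume it restarted from zero
            # and count everything accumulated since the restart.
            total += curr
    return total


# Made-up samples: the counter climbs to 5, resets, then climbs to 3.
samples = [2, 4, 5, 1, 3]
print(reset_aware_increase(samples))
```

Note that in this sketch a window containing only a single sample yields an increase of 0, because there is no earlier sample to compare against.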
I then changed my query to use rate, to see whether it could detect the 503 error that I see in both Sentry and the logs. I tried 3h, 7h, and 12h ranges in the rate query, and it still returned a value of 0. Should the metrics accurately reflect the logs? Why is there a discrepancy between the 5XX errors we see in the logs and what appears in the metrics? I am new to time series and analytics, so I may be misunderstanding how this works.
Thank you for your time!