Due to a P1 incident, we are improving our monitoring on the Traefik side, but we lack some knowledge of how the metrics are collected and how we can make use of them.
On the one hand, we'd like to track the error rate, but it's really important that we isolate the errors generated by Traefik itself and exclude those coming from the underlying services. We were considering the entrypoint metrics, but we are not sure how well this approach would work.
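To make the idea concrete, here is a minimal sketch of the calculation we have in mind. It assumes the request counters carry a `code` label with the HTTP status, and that we already have the per-code increase over some time window. The function name and the sample numbers are hypothetical. Note that this only gives the overall 5xx share seen at an entrypoint; whether it separates Traefik's own errors from the services' errors is exactly what we are unsure about.

```python
from typing import Mapping


def error_ratio(deltas_by_code: Mapping[str, float]) -> float:
    """Return the share of 5xx responses among all responses in a window.

    `deltas_by_code` maps an HTTP status code (the assumed `code` label)
    to the increase of the request counter over the window.
    """
    total = sum(deltas_by_code.values())
    if total == 0:
        # No traffic in the window: report no errors instead of dividing by zero.
        return 0.0
    errors = sum(
        value for code, value in deltas_by_code.items() if code.startswith("5")
    )
    return errors / total


# Hypothetical counter increases over five minutes for one entrypoint.
entrypoint_window = {"200": 940.0, "404": 20.0, "502": 30.0, "503": 10.0}
print(error_ratio(entrypoint_window))
```

The same function could be fed the per-service counters, which would let us compare both views side by side.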
On the other hand, we need to measure latency. For that, we were considering using the request duration from the entrypoints and the services, but we need to understand what exactly this metric measures: is it the time an entrypoint/service takes to process a request and forward it to the next hop, or is it the round-trip time (RTT) of the request, measured from the end it comes in on?
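For context, this is roughly how we would turn the duration metric into a latency figure, assuming it is exposed as a Prometheus-style histogram with cumulative buckets. The sketch estimates a quantile by linear interpolation inside the bucket that contains the target rank, similar in spirit to what Prometheus' `histogram_quantile` does. The function name, bucket bounds, and counts are hypothetical, and the sketch says nothing about what the duration actually covers, which is the open question.

```python
from typing import Sequence, Tuple


def estimate_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile `q` from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound, with the last bound being +Inf.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("at least one bucket is required")

    total = buckets[-1][1]
    if total == 0:
        return float("nan")  # no observations in the window

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf") or count == prev_count:
                # Cannot interpolate into the open-ended or an empty bucket.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts for one entrypoint.
sample_buckets = [
    (0.1, 50.0),
    (0.3, 80.0),
    (1.2, 95.0),
    (5.0, 100.0),
    (float("inf"), 100.0),
]
print(estimate_quantile(0.75, sample_buckets))
```

Running the same calculation on the entrypoint buckets and on the service buckets would show how far apart the two durations are, but we still need to know what each one includes to interpret the gap.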