Traefik v3.3.2 Preventing Loki Cluster Ring Formation

Hello Traefik Labs Community,

I am seeking assistance with an issue where integrating Traefik v3.3.2 into our environment has led to our previously stable Loki cluster becoming unhealthy, specifically failing to form its ring. Prior to this integration, the Loki cluster operated seamlessly behind an Nginx reverse proxy.

Background:

Our Loki cluster was functioning correctly with Nginx, handling log aggregation and querying without any issues. We decided to transition to Traefik v3.3.2 to leverage its dynamic routing and load-balancing capabilities. However, post-migration, the Loki cluster has been unable to establish its ring, leading to instability and health check failures.

Troubleshooting Steps Undertaken:

  1. Network Configuration Review:
  • Ensured that all Loki services and Traefik are on the same Docker network to facilitate seamless communication.
  • Verified that there are no firewall rules or network policies obstructing traffic between Loki components.
  2. Traefik EntryPoints Configuration:
  • Configured Traefik with specific entry points for Loki's HTTP and gRPC services:
entryPoints:
  http:
    address: ":80"
  https:
    address: ":443"
  ringUDP:
    address: ":7946/udp"
  gRPC:
    address: ":9095"
  • Ensured that Traefik is not inadvertently intercepting or interfering with Loki's internal communication ports, such as 7946, which is used for the gossip protocol. (I tried both with and without the ringUDP entry point.)
  3. Service Labels and Routing Rules:
  • Applied the following labels to the Loki services to define Traefik routing:
labels:
  - "traefik.enable=true"
  - "traefik.http.routers.loki.entrypoints=http"
  - "traefik.http.routers.loki.rule=Host(`loki.example.com`)"
  - "traefik.http.services.loki.loadbalancer.server.port=3100"
  - "traefik.tcp.routers.loki-grpc.entrypoints=gRPC"
  - "traefik.tcp.routers.loki-grpc.service=loki-grpc"
  - "traefik.tcp.services.loki-grpc.loadbalancer.server.port=9095"
  - "traefik.udp.routers.loki-ringUDP.entrypoints=ringUDP"
  - "traefik.udp.routers.loki-ringUDP.service=loki-ringUDP"
  - "traefik.udp.services.loki-ringUDP.loadbalancer.server.port=7946"

With or without the TCP and UDP routers, the Loki cluster cannot form the ring.

  • Confirmed that these labels are correctly applied and correspond to the appropriate Traefik entry points and Loki service ports.
  4. Protocol Handling:
  • Recognized that Loki's ring formation relies on the gossip protocol over UDP on port 7946.
  • Noted that Traefik's support for UDP routing is limited and may not fully accommodate the requirements of Loki's gossip protocol.
  5. Isolation of Traefik:
  • Tested the Loki cluster without Traefik in the path, reverting to direct communication or using Nginx as the reverse proxy.
  • Observed that the Loki cluster successfully formed its ring and returned to a healthy state in the absence of Traefik.
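
For reference, ring membership is normally configured on the Loki side rather than in Traefik. The fragment below is a minimal sketch of the relevant Loki settings, not the configuration from this setup. It assumes a Swarm service named `loki` (a hypothetical name; `tasks.loki` is the Swarm DNS name that resolves to every task of that service) and that the instances can reach each other directly on port 7946. It may also be worth checking whether this Loki version gossips over TCP rather than UDP on that port, since that would change which Traefik router type, if any, is relevant.

```yaml
# Loki config fragment (sketch, not taken from the original setup).
common:
  ring:
    kvstore:
      store: memberlist      # ring state is shared through memberlist gossip
memberlist:
  bind_port: 7946            # gossip port mentioned in the post
  join_members:
    # Assumption: the Swarm service is called "loki". tasks.<service> returns
    # the IPs of all its tasks, so each instance contacts its real peers
    # instead of a load-balanced address.
    - tasks.loki:7946
```

With peers listed this way, each instance dials the others directly, so no proxy has to sit on the gossip path.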

Conclusion:

Based on the troubleshooting steps undertaken, it appears that Traefik v3.3.2 may be interfering with Loki's internal communication, particularly affecting the gossip protocol necessary for ring formation. This interference could be due to Traefik's handling (or lack thereof) of UDP traffic on port 7946, which is essential for Loki's cluster operations.

Request for Assistance:

I am reaching out to the community for guidance on configuring Traefik to coexist with Loki's clustering requirements. Specifically, I am interested in:

  • Best practices for setting up Traefik to handle or bypass Loki's internal UDP traffic.
  • Any known limitations or considerations when using Traefik as a reverse proxy for a Loki cluster.
  • Alternative approaches or configurations that have proven successful in similar scenarios.

Any insights or recommendations would be greatly appreciated as we aim to integrate Traefik into our logging infrastructure without compromising the stability of our Loki cluster.

Thank you very much in advance for your assistance.

Do you route the Loki internal protocol through Traefik? I would expect that not to work, as Traefik would round-robin and targets will change all the time.

Hi @bluepuma77
Thank you for your prompt response. To answer your question: yes. Once I noticed that the Loki cluster was not forming the ring, I tried routing Loki's internal communication through Traefik, specifically the gossip protocol on port 7946. I understand that this approach might introduce complexities due to Traefik's load-balancing behavior, which could disrupt the consistent peer connections required for Loki's ring formation.

  • To do this, I configured Traefik to handle UDP traffic for Loki's internal communication by adding the following labels to the Loki service:
  - "traefik.udp.routers.loki-ringUDP.entrypoints=ringUDP"
  - "traefik.udp.routers.loki-ringUDP.service=loki-ringUDP"
  - "traefik.udp.services.loki-ringUDP.loadbalancer.server.port=7946"

I thought that might solve it, but it did not. So I wonder what is preventing the Loki instances from joining the ring.

  • For network segmentation, to ensure isolated communication channels, I defined separate Docker networks for Loki and Traefik:
networks:
  loki:
    external: true
    name: loki
  traefik:
    external: true
    name: traefik
  • I then assigned the networks accordingly:
services:
  loki:
    networks:
      - loki

  traefik:
    networks:
      - loki
      - traefik
  • This setup was intended to facilitate direct communication between Loki instances while allowing Traefik to manage external traffic.
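
One way to express that intent is to give Traefik only the HTTP router and leave the gossip and gRPC ports off Traefik entirely. The fragment below is a sketch under that assumption, reusing the service names, network names, and HTTP labels from this thread. It has not been tested against this cluster, and in Swarm mode the labels would be expected under `deploy.labels` rather than at the service level.

```yaml
# docker-compose fragment (sketch). Service and network names follow the post.
services:
  loki:
    networks:
      - loki                 # gossip (7946) and gRPC (9095) stay on this network
    labels:
      # Assumption: in Swarm mode these would sit under deploy.labels instead.
      - "traefik.enable=true"
      - "traefik.http.routers.loki.entrypoints=http"
      - "traefik.http.routers.loki.rule=Host(`loki.example.com`)"
      - "traefik.http.services.loki.loadbalancer.server.port=3100"
      # No traefik.tcp.* or traefik.udp.* labels: peers talk to each other directly.

  traefik:
    networks:
      - loki                 # lets Traefik reach Loki's HTTP port 3100
      - traefik              # external-facing network

networks:
  loki:
    external: true
    name: loki
  traefik:
    external: true
    name: traefik
```

In this layout the `ringUDP` and `gRPC` entry points are not needed, because nothing is routed to them.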

Isolating Traefik:

  • For further troubleshooting, I temporarily removed Traefik from the Docker Swarm to observe the behavior of the Loki cluster. I then ran Traefik as a standalone docker-compose deployment, but the behavior was the same: the Loki cluster still does not form the ring.

I have surely missed something that I haven't tried yet...

I appreciate your insights and would welcome any further recommendations or best practices to ensure a seamless integration of Traefik with the Loki cluster.