Hi,
I'm running a gRPC service in Nomad (on Docker) with Consul for discovery. I'm using Traefik to proxy requests to the service over h2c (Traefik itself is behind an AWS NLB that performs TLS termination, so it's unencrypted HTTP2 coming into Traefik).
When running just one instance of the service, everything works great -- no issues with unary RPCs, streaming, etc. However, as soon as there are multiple instances (say, n=3), clients see only 1/n of requests succeed -- the rest eventually time out and get a 504 from Traefik, as if the upstream connection timed out.
From what I can tell, the successful requests are going to the same service instance every time. If I restart the gRPC client (causing it to re-connect to Traefik), it'll be a different instance that receives the 1/3 successful requests.
It seems like the gRPC clients are opening an HTTP2 "session" of some sort to one of the 3 service instances, then future RPC requests are getting round-robin'd by Traefik such that the other two instances never receive/respond to the request. Without knowing much about HTTP2, I'd still expect that a client with a persistent HTTP2 connection that is being kept alive will continue to talk to the same backend until its HTTP2 connection closes.
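To make that guess concrete, here is a toy Python model of what I think is happening. It is only a sketch of my hypothesis, not how Traefik works internally; `simulate` and `pinned_backend` are names I made up for illustration.

```python
from itertools import cycle


def simulate(num_backends, num_requests, pinned_backend=0):
    """Toy model of the hypothesis (not Traefik's real behaviour).

    Each request is handed to the next backend in round-robin order, but
    only the backend assumed to hold the client's HTTP2 session
    (pinned_backend) is able to answer; every other request is counted
    as a 504.
    """
    backends = cycle(range(num_backends))
    results = []
    for _ in range(num_requests):
        target = next(backends)
        results.append("ok" if target == pinned_backend else "504")
    return results


if __name__ == "__main__":
    results = simulate(num_backends=3, num_requests=9)
    print(results)
    print(results.count("ok"), "of", len(results), "requests succeeded")
```

With three backends, this model reproduces the 1/3 success rate I'm seeing, and with a single backend every request succeeds, which matches the one-instance case.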
Any workarounds here?