We are running Traefik v1.7.9 as a DaemonSet on our 7-node K8s cluster. The traffic rate is roughly 14K-15K requests per minute. We are currently facing an issue where Traefik returns about 100 '502' responses every hour, and we cannot work out why.
The relevant lines from the Traefik debug log are shown below:
time="2020-01-09T17:16:45Z" level=debug msg="'502 Bad Gateway' caused by: read tcp 10.1.1.2:43068->10.1.1.1:80: read: connection reset by peer"
time="2020-01-09T17:16:45Z" level=debug msg="vulcand/oxy/forward/http: Round trip: http://10.1.1.1:80, code: 502, Length: 11, duration: 271.066707ms"
After doing some research, we found a thread on GitHub that suggests setting MaxIdleConnsPerHost = -1, and this seems to resolve our problem.
The thing is that we do not really understand how this change will affect us, or whether it could cause us to lose samples sent to our APIs. I would be glad to understand what side effects this change might have and, more generally, what the best ways are to analyze and debug such problems going forward.
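
To make the question more concrete, here is a minimal Go sketch of what we believe the setting does. It assumes that Traefik passes MaxIdleConnsPerHost straight through to Go's net/http Transport, which we have not verified in Traefik's source. As far as we can tell from the Go standard library, a negative value stops the transport from keeping idle connections, so every request opens a new TCP connection; this behaviour is not part of the documented API, so please treat it as our reading rather than a guarantee. The helper name countConns and the local test server are ours, purely for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// countConns (hypothetical helper) sends n sequential GET requests to a local
// test server through a transport configured with the given
// MaxIdleConnsPerHost, and returns how many TCP connections the server saw.
func countConns(maxIdlePerHost, n int) int {
	var opened int64

	// The handler writes no body, so the response carries Content-Length: 0.
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {},
	))
	// Count each newly accepted connection on the server side.
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&opened, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	client := &http.Client{
		Transport: &http.Transport{MaxIdleConnsPerHost: maxIdlePerHost},
	}
	for i := 0; i < n; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		// Drain and close the body so the connection is eligible for reuse.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return int(atomic.LoadInt64(&opened))
}

func main() {
	// 0 means "use the Go default" (idle connections are kept and reused).
	fmt.Println("connections with default setting:", countConns(0, 5))
	// -1 is the value suggested in the GitHub thread.
	fmt.Println("connections with -1:", countConns(-1, 5))
}
```

If this reading is right, the -1 setting would avoid reusing a pooled connection that the backend has already closed, at the cost of a new TCP handshake per request. Whether that fully explains the "connection reset by peer" errors above is exactly what we would like to confirm.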