Hello all,
I am dealing with a networking issue in GKE where, every now and then, my synthetic tests against the /healthcheck
endpoint of my service return a 503 or a 504 (similar to Traefik 1.7.12 ingress requests ending with an HTTP 504 gateway timeout). Following the suggestion of @daniel.tomcej, I enabled debug logging in Traefik and waited for the issue to happen again.
To add some context, I am running Traefik installed via Helm (chart version traefik-1.63.1, Traefik version 1.7.9), and it is the main ingress controller for my Kubernetes cluster on GKE.
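For reference, I enabled debug logging roughly like this (I am assuming the stable/traefik chart's debug.enabled value here; the release name traefik-live-1 matches the labels in the logs below):
➜ ~ helm upgrade traefik-live-1 stable/traefik --version 1.63.1 --reuse-values --set debug.enabled=true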
These are the logs that I believe are related to my issue (correlated by timestamp and deployment ID):
{
  "msg": "Endpoints not available for default/myAwesomeService",
  "kube_labels_heritage": "Tiller",
  "log": {
    "msg": "Endpoints not available for default/myAwesomeService",
    "level": "warning",
    "time": "2019-07-09T09:18:54Z"
  },
  "level": "warning",
  "lifted_annotations": {
    "kubernetes_io/limit-ranger": "LimitRanger plugin set: cpu request for container traefik-live-1",
    "checksum/config": "0f81f7b30bdd55207aba4ab87c9eac4ec5b6acea03a16e46027b4f88419b16ac"
  },
  "kube_labels_release": "traefik-live-1",
  "lifted_pod_name": "traefik-live-1-6f555d46c5-ktctm",
  "lifted_container_name": "traefik-live-1",
  "lifted_docker_id": "b0fb4c928464a218dc0eafef299a1911f5b0258fa036f446cb9cf06c10f68437",
  "@timestamp": "2019-07-09T09:18:54.063Z",
  "lifted_pod_id": "89493634-a221-11e9-b4e1-42010a1e000a",
  "stream": "stdout",
  "lifted_namespace_name": "default",
  "kube_labels_pod-template-hash": "6f555d46c5",
  "time": "2019-07-09T09:18:54Z",
  "kube_labels_chart": "traefik-1.63.1",
  "kube_labels_app": "traefik"
}
{
  "msg": "Endpoints not available for default/myAwesomeService",
  "kube_labels_heritage": "Tiller",
  "log": {
    "msg": "Endpoints not available for default/myAwesomeService",
    "level": "warning",
    "time": "2019-07-09T09:18:54Z"
  },
  "level": "warning",
  "lifted_annotations": {
    "kubernetes_io/limit-ranger": "LimitRanger plugin set: cpu request for container traefik-live-1",
    "checksum/config": "0f81f7b30bdd55207aba4ab87c9eac4ec5b6acea03a16e46027b4f88419b16ac"
  },
  "kube_labels_release": "traefik-live-1",
  "@timestamp": "2019-07-09T09:18:54.066Z",
  "lifted_pod_id": "8f36d3cd-a221-11e9-b4e1-42010a1e000a",
  "stream": "stdout",
  "lifted_namespace_name": "default",
  "kube_labels_pod-template-hash": "6f555d46c5",
  "time": "2019-07-09T09:18:54Z",
  "kube_labels_chart": "traefik-1.63.1",
  "kube_labels_app": "traefik"
}
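If I understand it correctly, Traefik's Kubernetes provider logs "Endpoints not available" when the Endpoints object for the service has no ready addresses at that moment, e.g. mid-rollout when the old pods are terminating and the new pods have not yet passed their readiness probe. To check this, the Endpoints object can be inspected directly; ready pods appear under subsets[].addresses and not-yet-ready pods under subsets[].notReadyAddresses:
➜ ~ kubectl get endpoints myAwesomeService --namespace=default -o yaml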
In Kubernetes I can see the following warning appearing for my deployment, but I am not sure why it is happening:
➜ ~ kubectl get events --namespace=default --field-selector reason=FailedToUpdateEndpoint
LAST SEEN TYPE REASON KIND MESSAGE
39m Warning FailedToUpdateEndpoint Endpoints Failed to update endpoint default/myAwesomeService: Operation cannot be fulfilled on endpoints "myAwesomeService": the object has been modified; please apply your changes to the latest version and try again
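From what I have read, this event is an optimistic-concurrency conflict: the endpoints controller tried to write the Endpoints object using a stale resourceVersion and has to retry, so the event by itself should be transient. To see whether the endpoints actually go empty while a deployment rolls out, the object can be watched during a redeploy:
➜ ~ kubectl get endpoints myAwesomeService --namespace=default --watch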
So far it doesn't look like the GKE load balancer is failing or that the Traefik pods are causing the issue, so I suspect it is a problem with my deployments. I am not sure how to investigate this further, however, so I would appreciate any help!