Traefik Cluster Unhealthy

Hi folks,

I'm a Traefik novice. We are seeing cluster degradation on the dashboard (as shown below). We are running Traefik EE on Docker Swarm, deployed with an Ansible playbook. The logs from TEE are being picked up by Graylog.

Q1: Can this be a false positive?

When searching our Graylog logs, I couldn't find any errors in container_name: traefikee_con* that relate to the timeframe in which we were seeing 504 errors.
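One way to cross-check the dashboard against the swarm itself is to compare running and desired replicas per service. This is a minimal sketch, not a definitive health check: the helper name `degraded_services` is made up for illustration, and the filter assumes the services carry the `traefikee` stack prefix used in the reset steps further down.

```shell
# Reads "NAME RUNNING/DESIRED" lines on stdin and prints the name of every
# service whose running replica count differs from its desired count.
degraded_services() {
  awk '{
    split($2, r, "/")
    if (r[1] + 0 != r[2] + 0) print $1
  }'
}

# Assumed usage on a swarm manager (the "traefikee" prefix is an assumption):
# docker service ls --filter name=traefikee --format '{{.Name}} {{.Replicas}}' | degraded_services
```

Empty output would suggest that Swarm considers all replicas to be running, which would point towards the dashboard warning being about cluster state rather than crashed containers.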

Q2: If this is not a false positive, how can we recover the cluster?

From my understanding of Traefik (and from reading the docs), there are two possible solutions:

  1. Create a new HA cluster with the same configuration and swap over to the new cluster.
  2. Redeploy the full TEE stack from the ansible playbook.

I don't have instructions for option #1. Could someone help me with that?

For option #2, here is the documentation I have, but I've never run through these steps before. I'm worried that they may make our services non-operational. Could someone also help me figure out whether these steps are okay?

Steps to Reset Traefik EE Stack

Remove the stack

docker stack rm traefikee

Remove the config

Two config items will be in the swarm, each prefixed with the environment name:
{{env}}-controller {{env}}-proxy

Example for test:

docker config rm test-controller test-proxy
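Before removing anything, it may help to confirm that the names derived from the environment match what the swarm actually holds. The sketch below only builds the two expected names from the naming rule above; the helper name `tee_config_names` is hypothetical, and the `docker config ls` line is left as a comment because it has to run on a swarm manager.

```shell
# Prints the two config names the steps above expect for a given environment.
tee_config_names() {
  env_name="$1"
  printf '%s-controller\n%s-proxy\n' "$env_name" "$env_name"
}

# Compare the expected names with the configs present in the swarm:
# tee_config_names test
# docker config ls --filter name=test- --format '{{.Name}}'
```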

Remove or clean volume

docker run --rm -it --volume vol0:/data bash rm -fr /data/controller-0

Redeploy controller using ansible

ansible-playbook -i inventory-test/ stack-traefikee.yaml

Wait until the Let's Encrypt challenge is completed and the controller is up before going on to the next step.
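The waiting step could be scripted roughly as follows. This is a sketch under assumptions: `wait_until` and `controller_ready` are illustrative helpers, the service name `traefikee_controller` is a guess at how the stack names its controller, and a 1/1 replica count only shows that the container is running, not that the certificate challenge has finished, so the logs would still need to be checked for that.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; returns 1 once the timeout has elapsed.
wait_until() {
  timeout_s="$1"; interval_s="$2"; shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}

# Hypothetical readiness check: a controller service reports 1/1 replicas.
# The service name is an assumption; adjust it to the real one.
controller_ready() {
  docker service ls --filter name=traefikee_controller --format '{{.Replicas}}' | grep -q '^1/1'
}

# Assumed usage: wait up to 10 minutes, polling every 10 seconds.
# wait_until 600 10 controller_ready && echo "controller is up"
```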

Redeploy proxies using ansible

ansible-playbook -i inventory-test/ stack-traefikee-proxies.yaml