Seamless failover in docker swarm

We’re trying to understand why traefik doesn’t handle simple high availability scenarios in docker swarm. My HA examples are things like draining a node in a cluster for maintenance: the services should start on another node first, and traefik should route traffic to the new instances seamlessly.

The expected HA behaviour works out-of-the-box with docker swarm if I bypass traefik and simply talk to a service through the routing mesh. The services in question obviously have to be configured with an update order of “start-first”. I can then cause the service to update and deploy a new instance. When I do this, there are briefly multiple instances running at once, and then the old one goes down. All the while, inbound traffic is seamlessly routed by the routing mesh to a working instance.
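For reference, start-first ordering can also be set on an already-running service with the standard docker CLI (the service name here is an assumption):

```shell
# Start the replacement task before stopping the old one during updates,
# instead of the default stop-first behaviour:
docker service update --update-order start-first whoami
```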

If I use traefik, however, failover isn’t seamless. I assume this is because traefik talks directly to each task on its own individual IP address, rather than going through the routing mesh. When a new instance comes up and the original goes down, the original is unreachable for a period before traefik polls and notices; in the meantime, traefik appears to keep sending requests to an address with nothing listening.
We would like to understand:

  1. If the service isn’t reachable at the address traefik has recorded for it, why doesn’t traefik fail over to one of the other instances that have already come up on different addresses?
  2. Why doesn’t traefik offer a mode to talk to the services using the routing mesh via the service name? This would seem like an easy win to utilise the native high availability mechanisms of swarm mode.
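The distinction behind question 2 is visible in swarm’s DNS: the bare service name resolves to a single virtual IP behind the mesh, while the special tasks.<name> entry returns the individual task IPs that traefik uses. A rough illustration, run from inside any container attached to the same overlay network (service name taken from the compose file below):

```shell
# One answer: the service's virtual IP (VIP); connections to it are
# load-balanced by swarm itself.
nslookup whoami

# One answer per running task: the individual container IPs, which go
# stale the moment a task is replaced.
nslookup tasks.whoami
```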

As I understand it, things would probably become more robust if we used k8s instead. We don’t want to though: we like the simplicity of swarm mode and the dynamism it affords us, without having to explicitly and repetitively specify additional configuration. If we did move to k8s, I have a feeling it would probably make traefik redundant anyway.

https://success.docker.com/article/traefik-swarm-load-balancing seems to indicate that the desired behaviour is available in TraefikEE. Is it the case that to make traefik work as desired, we simply have to pay for the EE version? I wouldn’t say my example is exactly enterprise-level, and seamless failover should be part and parcel of the base product, especially as docker is doing the heavy lifting.

Examples:
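The examples below use a small watch.py helper, which isn’t shown in this post. A minimal sketch, assuming it just re-runs a shell command in a loop and logs stdout on success or stderr on failure (the interval and log format are assumptions):

```python
#!/usr/bin/env python3
"""Sketch of a watch.py-style helper: repeatedly run a shell command,
logging its output. On failure (non-zero exit), log stderr instead,
which is where curl writes its "Operation timed out" errors."""
import logging
import subprocess
import time

logging.basicConfig(format="%(asctime)s:%(levelname)s: %(message)s",
                    level=logging.INFO)


def watch(command, interval=0.125, iterations=None):
    """Run `command` repeatedly; return the collected output lines."""
    results = []
    count = 0
    while iterations is None or count < iterations:
        proc = subprocess.run(command, shell=True,
                              capture_output=True, text=True)
        if proc.returncode == 0:
            line = proc.stdout.strip()
            logging.info(line)
        else:
            line = proc.stderr.strip()
            logging.error(line)
        results.append(line)
        time.sleep(interval)
        count += 1
    return results


if __name__ == "__main__":
    import sys
    watch(sys.argv[1])
```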

In the following three examples I am simply causing the service to redeploy to another node once it is running.
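The redeploy can be triggered with a no-op forced update (service name assumed):

```shell
# Force swarm to replace the task even though nothing has changed;
# with update order "start-first" the new task starts before the old stops.
docker service update --force whoami
```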

A) Talking to a service using an exposed port (and the default ingress network mode).

$ watch.py "curl -m 0.4 -sS http://localhost:8000/ | grep Hostname"
2020-07-06 13:26:48,649:INFO: Hostname: 11080fd0f35b
2020-07-06 13:26:48,768:INFO: Hostname: 11080fd0f35b
2020-07-06 13:26:48,893:INFO: Hostname: 11080fd0f35b
2020-07-06 13:26:49,021:INFO: Hostname: d14947ae2c6e
2020-07-06 13:26:49,140:INFO: Hostname: 11080fd0f35b
2020-07-06 13:26:49,263:INFO: Hostname: d14947ae2c6e
2020-07-06 13:26:49,394:INFO: Hostname: 11080fd0f35b
2020-07-06 13:26:49,528:INFO: Hostname: d14947ae2c6e
2020-07-06 13:26:49,662:INFO: Hostname: 11080fd0f35b
2020-07-06 13:26:49,796:INFO: Hostname: d14947ae2c6e
2020-07-06 13:26:49,927:INFO: Hostname: 11080fd0f35b
2020-07-06 13:26:50,062:INFO: Hostname: d14947ae2c6e
2020-07-06 13:26:50,196:INFO: Hostname: d14947ae2c6e
2020-07-06 13:26:50,324:INFO: Hostname: d14947ae2c6e

B) Using traefik, things do not work seamlessly. There is a period where operations time out.

$ watch.py "curl -m 0.4 -sS --insecure https://localhost/ | grep Hostname"
2020-07-06 13:27:11,531:INFO: Hostname: d14947ae2c6e
2020-07-06 13:27:11,713:INFO: Hostname: d14947ae2c6e
2020-07-06 13:27:11,895:INFO: Hostname: d14947ae2c6e
2020-07-06 13:27:12,602:ERROR: curl: (28) Operation timed out after 405 milliseconds with 0 bytes received
2020-07-06 13:27:13,127:ERROR: curl: (28) Operation timed out after 403 milliseconds with 0 bytes received
2020-07-06 13:27:13,661:ERROR: curl: (28) Operation timed out after 401 milliseconds with 0 bytes received
2020-07-06 13:27:14,192:ERROR: curl: (28) Operation timed out after 403 milliseconds with 0 bytes received
2020-07-06 13:27:14,722:ERROR: curl: (28) Operation timed out after 404 milliseconds with 0 bytes received
2020-07-06 13:27:15,259:ERROR: curl: (28) Operation timed out after 401 milliseconds with 0 bytes received
2020-07-06 13:27:15,793:ERROR: curl: (28) Operation timed out after 402 milliseconds with 0 bytes received
2020-07-06 13:27:16,327:ERROR: curl: (28) Operation timed out after 405 milliseconds with 0 bytes received
2020-07-06 13:27:16,855:ERROR: curl: (28) Operation timed out after 402 milliseconds with 0 bytes received
2020-07-06 13:27:17,379:ERROR: curl: (28) Operation timed out after 403 milliseconds with 0 bytes received
2020-07-06 13:27:17,560:INFO: Hostname: 4e7a6b255acb
2020-07-06 13:27:17,729:INFO: Hostname: 4e7a6b255acb
2020-07-06 13:27:17,902:INFO: Hostname: 4e7a6b255acb

C) Without any ports exposed, using the service name from inside another container also works seamlessly.

root@953d7cb91744:/# ./watch.py "curl -m 0.4 -sS http://whoami/ | grep Hostname"
2020-07-06 13:25:03,129:INFO: Hostname: e635eb537b43
2020-07-06 13:25:03,252:INFO: Hostname: e635eb537b43
2020-07-06 13:25:03,380:INFO: Hostname: e635eb537b43
2020-07-06 13:25:03,512:INFO: Hostname: 82b612eae5e4
2020-07-06 13:25:03,650:INFO: Hostname: e635eb537b43
2020-07-06 13:25:03,777:INFO: Hostname: 82b612eae5e4
2020-07-06 13:25:03,905:INFO: Hostname: e635eb537b43
2020-07-06 13:25:04,025:INFO: Hostname: 82b612eae5e4
2020-07-06 13:25:04,152:INFO: Hostname: e635eb537b43
2020-07-06 13:25:04,288:INFO: Hostname: 82b612eae5e4
2020-07-06 13:25:04,415:INFO: Hostname: e635eb537b43
2020-07-06 13:25:04,536:INFO: Hostname: 82b612eae5e4
2020-07-06 13:25:04,676:INFO: Hostname: 82b612eae5e4
2020-07-06 13:25:04,807:INFO: Hostname: 82b612eae5e4

My service config:

version: "3.5"

services:
  whoami:
    image: containous/whoami
    deploy:
      update_config:
        order: start-first
      labels:
        - "traefik.enable=true"
        - "traefik.docker.network=core"
        - "traefik.http.routers.whoami.rule=Path(`/`)"
        - "traefik.http.routers.whoami.entrypoints=http"
        - "traefik.http.routers.whoami.service=whoami"
        - "traefik.http.services.whoami.loadbalancer.server.port=80"
    ports:
      - 8000:80
    networks:
      - core

networks:
  core:
    external: true
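For completeness, the traefik side of the setup isn’t shown above. A minimal sketch of a matching v2 deployment (image tag, entrypoint names, and the poll interval are assumptions; the refresh interval bounds how long traefik can keep routing to a dead task address in example B):

```yaml
version: "3.5"

services:
  traefik:
    image: traefik:v2.2
    command:
      - "--providers.docker=true"
      - "--providers.docker.swarmMode=true"
      # How often traefik polls swarm for changes (default 15s);
      # the failover gap in example B is bounded by this interval.
      - "--providers.docker.swarmModeRefreshSeconds=5"
      - "--entrypoints.http.address=:80"
      - "--entrypoints.https.address=:443"
    ports:
      - 80:80
      - 443:443
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - core
    deploy:
      placement:
        constraints:
          - node.role == manager

networks:
  core:
    external: true
```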

After trialling TraefikEE, it appears that the documentation is only referring to taking advantage of the routing mesh for external ingress to reach traefik. TraefikEE, like TraefikCE, doesn't appear to use the routing mesh to talk to the downstream service. Why on earth not?

I would guess that certain features require traefik to balance across the individual task endpoints itself; stickiness is one that comes to mind.

You do have the option to use the swarm ingress via the traefik.docker.lbswarm label.
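If I understand the docs correctly, that label makes traefik register the service's single swarm VIP as its only backend instead of one backend per task, so swarm's routing mesh handles the failover (at the cost of traefik-side per-task features like stickiness). A sketch of how it would slot into the compose file above:

```yaml
    deploy:
      labels:
        # Register swarm's virtual IP as the single backend, letting the
        # routing mesh balance across tasks, instead of one backend per task:
        - "traefik.docker.lbswarm=true"
```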