Fallback error response while container is restarting

I have Traefik set up in front of several services including GitLab:

services:
  traefik:
    image: traefik:latest
    container_name: traefik
    restart: always
    networks:
      - web
    ports:
      - "22:22"
      - "80:80"
      - "443:443"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
      - "./traefik/traefik.toml:/traefik.toml"
      - "./traefik/acme.json:/acme.json"
      - "./traefik/conf:/conf"
    labels:
      - traefik.enable=true
      - traefik.http.routers.traefik.entrypoints=https
      - traefik.http.routers.traefik.rule=Host(`server.example.com`) && (PathPrefix(`/traefik`) || PathPrefix(`/api`))
      - traefik.http.routers.traefik.tls=true
      - traefik.http.routers.traefik.tls.certresolver=letsencrypt
      - traefik.http.routers.traefik.service=api@internal
      - "traefik.http.routers.traefik.middlewares=strip-tr@docker, traefik-auth@file"
      - traefik.http.middlewares.strip-tr.stripprefix.prefixes=/traefik

  gitlab:
    image: 'gitlab/gitlab-ce:latest'
    container_name: gitlab
    restart: always
    hostname: 'git.example.com'
    networks:
      - web
    volumes:
      - './gitlab/config:/etc/gitlab'
      - './gitlab/logs:/var/log/gitlab'
      - './gitlab/data:/var/opt/gitlab'
    labels:
      - traefik.enable=true

      - traefik.http.routers.gitlab.entrypoints=https
      - "traefik.http.routers.gitlab.rule= Host(`git.example.com`) || Host(`registry.example.com`)"
      - traefik.http.routers.gitlab.tls=true
      - traefik.http.routers.gitlab.tls.certresolver=letsencrypt
      - traefik.http.services.gitlab.loadbalancer.server.port=80

      - traefik.tcp.routers.gitlab-ssh.rule=HostSNI(`*`)
      - traefik.tcp.routers.gitlab-ssh.entrypoints=ssh
      - traefik.tcp.routers.gitlab-ssh.service=gitlab-ssh
      - traefik.tcp.services.gitlab-ssh.loadbalancer.server.port=22

When updating the GitLab container (using docker-compose pull gitlab && docker-compose up -d), it takes up to 10 minutes for GitLab to restart. During the first and last few seconds of that time, I get GitLab's 502 error page; during the rest of the time, I get Traefik's default 404 response.

When I look in the Traefik dashboard, I see that the GitLab router and service don't show up in the dashboard until GitLab has finished restarting, which explains the 404.

What I want is a way to tell Traefik to respond with a 502 instead of a 404, specifically for the hostnames git.example.com and registry.example.com that would normally be served by GitLab.

Effectively, is there a way to have a "fallback" service (that responds with only a 502) if GitLab is not responding?

Hello @kohenkatz

If I understood your issue correctly, you are looking for a rolling-update strategy for your service. This depends heavily on the cluster orchestration tool you are using and its deployment-strategy capabilities.

A rolling update is also available in Docker: it gradually replaces instances with the newer version, ensuring there are enough running replicas to keep the service available before the old version is terminated.

It is well documented in the Kubernetes ecosystem, and there is some documentation for Docker as well. Here is one of the example configurations I personally used some time ago. This is the section you should be interested in:

update_config:
  order: start-first

I hope that helps.

@jakubhajek Thanks. I'm not looking for anything that complicated, and I'm not using a cluster/swarm - all of this is running in Docker on a single machine. I don't need zero-downtime deployment or rolling updates, and I don't have enough resources on the machine to be running two copies of GitLab at the same time so the new one can start up before the old one shuts down.

I just want something simple: a way to have Traefik serve a 502 page instead of a 404 page while the GitLab container is updating.

Here is what happens right now:

  1. Request comes in to https://git.example.com
  2. If the GitLab container is up, send the request to it.
  3. If the GitLab container is being updated, return 404, since the container's labels don't seem to be visible to Traefik.

Here is what I want to happen:

  1. Request comes in to https://git.example.com
  2. If the GitLab container is up, send the request to it.
  3. If the GitLab container is not yet responding to requests, have some configuration that still knows there is supposed to be a service at that hostname, so it returns 502 (something like the sketch below).
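
What I have in mind is something like this hypothetical dynamic configuration for Traefik's file provider (I already mount ./traefik/conf; this assumes traefik.toml points the file provider at that directory, and the dead-backend trick is just a sketch):

http:
  routers:
    gitlab-fallback:
      entryPoints:
        - https
      # Same hosts as the gitlab@docker router, but priority 1 so the real
      # router (default priority = rule length) wins whenever it exists.
      rule: "Host(`git.example.com`) || Host(`registry.example.com`)"
      priority: 1
      tls:
        certResolver: letsencrypt
      service: gitlab-fallback
  services:
    gitlab-fallback:
      loadBalancer:
        servers:
          # Nothing listens on this port, so Traefik itself answers 502 Bad Gateway.
          - url: "http://127.0.0.1:1"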

Come to think of it, this is really a deeper question: why are the labels of the GitLab container not seen by Traefik while the GitLab container is restarting? I can see (using docker-compose ps) that the GitLab container is running, but I can also see (in the Traefik Dashboard) that the gitlab@docker Router and gitlab@docker Service are not listed.

I understand your point. You can still run a single-node Swarm and take advantage of its more advanced features.

I haven't seen health probes in your configuration, so I would try adding health checks at the service level to make sure the application is ready to accept incoming traffic. Here is the example that I have been using for some time; you can also look at the health-check capabilities provided by Docker.

Would you please try configuring it?
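
For GitLab specifically, a sketch might look like this (assuming curl is available in the image and using GitLab's /-/readiness endpoint; tune the timings to your hardware):

services:
  gitlab:
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost/-/readiness"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 10m   # GitLab can take several minutes to boot

With a health check in place, the orchestrator can distinguish "running" from "ready to serve traffic".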

It looks like Pull Request #8690 ("Don't ignore labels from unhealthy containers in docker provider" by jvasseur, traefik/traefik on GitHub) is going to do what I'm asking for here.

Hi @kohenkatz and @jakubhajek.
I have a question, and while searching I came across this forum thread. We are currently using Express Gateway on a cloud system that was designed four years ago; that library is no longer used or updated. Given the issues we have, we are looking for an alternative.
Our cluster uses NGINX as a reverse proxy, which gives clients access to the Express API gateway; the gateway then forwards users' calls to different containers. But if a container goes down while sending back a response, we receive an error. What we expect is that the gateway catches the error and handles it, i.e. re-sends the request to a healthy container, but this does not happen.
We have tried every possible fix within our current configuration, but it seems hopeless.
So my question is: can Traefik support our needs and avoid this issue?

Thanks

So if you receive a specific error, you want the reverse proxy to try with a different target instance?

Sorry for the late reply. Yes, another healthy target (container). Currently we have 3 replicas of the authentication API, and if one container goes down in the middle of a response, we receive an error. We want to avoid this failure and build a fault-tolerant system.

I don't think that is possible. AFAIK Traefik checks whether a container is available and then forwards the request, but it won't re-send the request to another container if processing was already underway and the first container died mid-request.

I understand. Do we have to add another layer as an API gateway, or can Traefik be enough?
It was suggested that I look into using Traefik and Kong together.

Not sure you will find a solution for this.

If a request is in flight, especially a POST, it is modifying data on the server side. How should the reverse proxy decide whether it can safely repeat the request against a new instance? It might mess things up, because the data would be updated twice.

Furthermore, when POST-ing a file, a lot of data may be transferred. Traefik will not buffer it, but streams it directly to the target, so if the target dies, it can't simply repeat the request against a new target.

Despite those challenges, maybe have a look at the Traefik retry middleware (link)
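
Configured via Docker labels, a sketch would be (the auth-retry middleware and auth router names are made up):

labels:
  # Re-issue a request up to 3 times when the backend cannot be reached
  - traefik.http.middlewares.auth-retry.retry.attempts=3
  - traefik.http.middlewares.auth-retry.retry.initialinterval=100ms
  - traefik.http.routers.auth.middlewares=auth-retry

Keep in mind that the retry middleware re-issues a request only while no response has been received; it won't help once a response is already streaming back.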

Cluster orchestration note: Traefik will work fine when you update the target services by restarting them one after another, with some time in between. Usually containers have a grace period of 10 seconds: they receive a SIGTERM and can finish ongoing requests, but should not accept new ones (either close the port or go unhealthy). After 10 seconds they get killed.
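
In Compose terms, that window can be extended when a service needs longer to drain (a sketch; the service name and value are illustrative):

services:
  auth-api:
    stop_grace_period: 30s   # time between SIGTERM and SIGKILL (default is 10s)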

Thanks for your response. We expect the API gateway to inspect the response coming from the server and, if there is an error, not pass it back to the NGINX layer in front, but instead issue a new request.
I thought that might not work, so I tried it with the proxy library that Express Gateway uses for this, and it worked in a real test scenario; in our project, however, it does not work.

We have a cluster that uses NGINX as the reverse proxy and Express Gateway as the API gateway, which is based on an old library that is no longer updated. It is too complicated: it chooses containers behind the scenes, and I cannot find anything in the code that shows how the containers are chosen for each request. The issue we face is that, once in a while, a healthy and working container goes down in the middle of a response and comes back up after 5 minutes; the behavior we expect is that Express Gateway does not return the error and, right after the container fails, sends the request to another healthy container. I played with it a lot but got nowhere, so now we are considering alternatives: Traefik as the reverse proxy and Kong as the API gateway. Do you know if these are good enough to fulfill our expectations?