Traefik 3 connections closed

Hi,

I upgraded to Traefik 3.0 a few days ago. It works perfectly for some hours, then begins to treat some connections as closed and stops writing to the log (so I can't tell what's happening). Restarting the container makes Traefik work again, but only for a few hours.

It seems to close only some connections, not all, and the failures are not tied to specific routers/services (I have one TCP router serving many domains/services, and only some of them seem to fail).

Also, docker logs <traefik container> returns nothing; there are only some error lines in traefik.log.

I attach the log (how can I attach a very long log file?) and here's the docker conf:

services:
  # Traefik 2 - Reverse Proxy
  traefik:
    container_name: traefik
    image: traefik:latest
    pull_policy: always
    security_opt:
      - no-new-privileges:true
    restart: unless-stopped
    # profiles: ["core", "all"]
    networks:
#      - dnet
      - traefik
      - pihole
#    network_mode: bridge
    command: # CLI arguments
      - --global.checkNewVersion=true
      - --global.sendAnonymousUsage=true
      - --entrypoints.web.address=:80
      - --entrypoints.websecure.address=:443
      - --entrypoints.traefik.address=:8080
      - --entryPoints.metrics.address=:8082
      - --entrypoints.websecure.http.tls=true
      - --entrypoints.web.http.redirections.entrypoint.to=websecure
      - --entrypoints.web.http.redirections.entrypoint.scheme=https
      - --entrypoints.web.http.redirections.entrypoint.permanent=true
      - --api=true
      - --api.dashboard=true
      # - --api.insecure=true
      - --serversTransport.insecureSkipVerify=true
      # Metrics
      - --metrics.prometheus=true
      - --metrics.prometheus.buckets=0.1,0.3,1.2,5.0
      - --metrics.prometheus.addrouterslabels=true
      - --metrics.prometheus.entryPoint=metrics
      # Allow these IPs to set the X-Forwarded-* headers - Cloudflare IPs: https://www.cloudflare.com/ips/
      - --entrypoints.websecure.forwardedHeaders.trustedIPs=$CLOUDFLARE_IPS,$LOCAL_IPS
      - --log=true
      - --log.filePath=/logs/traefik.log
      - --log.level=DEBUG # (Default: error) DEBUG, INFO, WARN, ERROR, FATAL, PANIC
      - --accessLog=true
      - --accessLog.filePath=/logs/access.log
      - --accessLog.bufferingSize=100 # Configuring a buffer of 100 lines
      - --accessLog.filters.statusCodes=204-299,400-499,500-599
      - --providers.docker=true
      - --providers.docker.endpoint=unix:///var/run/docker.sock # Disable for Socket Proxy. Enable otherwise.
      # - --providers.docker.endpoint=tcp://socket-proxy:2375 # Enable for Socket Proxy. Disable otherwise.
      - --providers.docker.exposedByDefault=false
      - --providers.docker.network=traefik
      # - --providers.docker.swarmMode=false
      - --entrypoints.websecure.http.tls.options=tls-opts@file
      # Add dns-cloudflare as default certresolver for all services. Also enables TLS and no need to specify on individual services
      - --entrypoints.websecure.http.tls.certresolver=dns-cloudflare
      - --entrypoints.websecure.http.tls.domains[0].main=$DOMAINNAME_1
      - --entrypoints.websecure.http.tls.domains[0].sans=*.$DOMAINNAME_1
      - --entrypoints.websecure.forwardedHeaders.trustedIPs=173.245.48.0/20, 103.21.244.0/22, 103.22.200.0/22, 103.31.4.0/22, 141.101.64.0/18, 108.162.192.0/18, 190.93.240.0/20, 188.114.96.0/20, 197.234.240.0/22, 198.41.128.0/17, 162.158.0.0/15, 104.16.0.0/13, 104.24.0.0/14, 172.64.0.0/13, 131.0.72.0/22
      # - --entrypoints.websecure.http.tls.domains[1].main=$DOMAINNAME_2 # Pulls main cert for second domain
      # - --entrypoints.websecure.http.tls.domains[1].sans=*.$DOMAINNAME_2 # Pulls wildcard cert for second domain
      - --providers.file.directory=/rules # Load dynamic configuration from one or more .toml or .yml files in a directory
      - --providers.file.watch=true # Only works on top level files in the rules folder
      # - --certificatesResolvers.dns-cloudflare.acme.caServer=https://acme-staging-v02.api.letsencrypt.org/directory # LetsEncrypt Staging Server - uncomment when testing
      - --certificatesresolvers.dns-cloudflare.acme.storage=/acme.json
      - --certificatesresolvers.dns-cloudflare.acme.dnsChallenge.provider=cloudflare
      - --certificatesresolvers.dns-cloudflare.acme.dnsChallenge.resolvers=8.8.8.8:53,1.1.1.1:53,1.0.0.1:53
      - --certificatesresolvers.dns-cloudflare.acme.dnsChallenge.delayBeforeCheck=90 # To delay DNS check and reduce LE hitrate
    ports:
      - "80:80"
      - "443:443"
      - "8082:8082"
    dns:
      - 192.168.4.100
      - 127.0.0.11
    volumes:
      - $CONFIGDIR/traefik/rules:/rules # Dynamic File Provider directory
      - /var/run/docker.sock:/var/run/docker.sock:ro # Enable if not using Socket Proxy
      - $CONFIGDIR/traefik/acme/acme.json:/acme.json # Certs File
      - $LOGDIR/traefik:/logs # Traefik logs
    environment:
      - TZ=${TZ}
      - CF_DNS_API_TOKEN_FILE=/run/secrets/cf_dns_api_token
      - DOMAINNAME_1 # Passing the domain name to traefik container to be able to use the variable in rules.
    secrets:
      - cf_dns_api_token
    labels:
      - "org.label-schema.group=monitoring"
      - "traefik.enable=true"
      # HTTP Routers
      - "traefik.http.routers.traefik-rtr.entrypoints=websecure"
      - "traefik.http.routers.traefik-rtr.rule=Host(`traefik.$DOMAINNAME_1`)"
      # Services - API
      - "traefik.http.routers.traefik-rtr.service=api@internal"
      # Middlewares
      # - "traefik.http.routers.traefik-rtr.middlewares=middlewares-basic-auth@file" # For Basic HTTP Authentication
      - "traefik.http.routers.traefik-rtr.middlewares=middlewares-chain-no-auth@file"

The Traefik dashboard is one of the failing routers/services.

Here is the TCP router and its services (from the file provider), in which only some of the domains/services seem to fail, not all.

I have a second Traefik instance behind the TCP router. It works perfectly when I connect to it directly; only some of the services on this second instance fail, and only when connecting through the first Traefik while it is in the broken state.

tcp: #http:
  routers:
    homelab-rtr-https:
      rule: "HostSNIRegexp(`domain1.com`) || HostSNIRegexp(`domain2.com`) || HostSNIRegexp(`domain3.com`) || HostSNIRegexp(`{subdomain:[a-z0-9]+}.domain1.com`) || HostSNIRegexp(`{subdomain:[a-z0-9]+}.domain2.com`) || HostSNIRegexp(`{subdomain:[a-z0-9]+}.domain3.com`) " #"Host(`*.domain1.com`)"
      entryPoints:
        - websecure
#      middlewares:
#        - middlewares-cloudflare-headers
      service: homelab-svc-https
      tls:
        passthrough: true
        certResolver: dns-cloudflare
        domains:
          - main: "domain1.com"
            sans:
              - "*.domain1.com"
          - main: "domain2.com"
            sans:
              - "*.domain2.com"
          - main: "domain3.com"
            sans:
              - "*.domain3.com"
        options: tls-opts@file
    homelab-rtr-http:
      rule: "HostSNIRegexp(`domain1.com`) || HostSNIRegexp(`domain2.com`) || HostSNIRegexp(`domain3.com`) || HostSNIRegexp(`{subdomain:[a-z0-9]+}.domain1.com`) || HostSNIRegexp(`{subdomain:[a-z0-9]+}.domain2.com`) || HostSNIRegexp(`{subdomain:[a-z0-9]+}.domain3.com`) "  #"Host(`*.urb
      entryPoints:
        - web
      #middlewares:
      #  - middlewares-chain-no-auth
      service: homelab-svc-http
      #tls:
      #  passthrough: true
      #  certResolver: dns-cloudflare
      #  options: tls-opts@file
  services:
    homelab-svc-https:
      loadBalancer:
        servers:
          - address: "192.168.1.103:443"
#        proxyProtocol:
#          version: 2
    homelab-svc-http:
      loadBalancer:
        servers:
          - address: "192.168.1.103:80"
#        proxyProtocol:
#          version: 2

I think Traefik v3 added some connection timeouts due to some fixed CVEs. Maybe check release notes and the last couple of security announcements.
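For reference, the per-entry-point timeouts can be set explicitly through the respondingTimeouts options. Below is a minimal sketch in the same CLI-argument style as the compose file above. The values are illustrative assumptions, not recommendations, and the defaults for your exact version should be checked against the release notes:

```yaml
services:
  traefik:
    command:
      # Illustrative values only (assumption). A value of 0 means no timeout is set.
      - --entrypoints.websecure.transport.respondingTimeouts.readTimeout=60s
      - --entrypoints.websecure.transport.respondingTimeouts.writeTimeout=0s
      - --entrypoints.websecure.transport.respondingTimeouts.idleTimeout=180s
```

Setting them explicitly at least rules out a changed default as the cause when comparing behavior before and after the upgrade.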

Hi,

The strange thing is that it stops logging anything, to both traefik.log and access.log, so I really can't see what is actually happening and causing the connections to be closed.

I had some similar problems right after migrating to 3.0, but those got logged, so I had the means to debug and fix them. Now I seem to be totally blind to the actual problem (and so to any way of solving it).

I can share the log if anyone can help me understand what's happening (I can see many closed connections, even while it still serves all of the routers, and at some point some EOF errors during the TLS handshake).

Still, the logging stopping is what seems strangest to me.

I can only find many TLS errors, but they don't seem to stop Traefik from serving routers/services, as they are spread widely throughout the log:

2024-05-06T09:14:54+02:00 DBG log/log.go:245 > http: TLS handshake error from 192.168.4.1:59660: tls: no cipher suite supported by both client and server
2024-05-06T01:05:38+02:00 DBG log/log.go:245 > http: TLS handshake error from 192.168.4.1:32938: tls: client requested unsupported application protocols
2024-05-05T23:56:53+02:00 DBG log/log.go:245 > http: TLS handshake error from 192.168.4.1:59990: EOF
2024-05-05T23:57:24+02:00 DBG log/log.go:245 > http: TLS handshake error from 192.168.4.1:60008: remote error: tls: bad certificate
2024-05-06T01:06:44+02:00 DBG log/log.go:245 > http: TLS handshake error from 192.168.4.1:32952: tls: client offered only unsupported versions: [302 301]
2024-05-06T03:30:25+02:00 DBG log/log.go:245 > http: TLS handshake error from 192.168.4.1:35900: tls: unsupported SSLv2 handshake received
2024-05-06T03:30:25+02:00 DBG log/log.go:245 > http: TLS handshake error from 192.168.4.1:35900: tls: unsupported SSLv2 handshake received

The TLS errors are debug-level only. They usually happen when a browser/client connects to Traefik and is served the default generated TLS cert, which is not trusted. That is typically the case when TLS certs are not loaded, or not generated correctly with LE.
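Some of these messages can also come from clients that simply don't meet what the TLS options require. For example, [302 301] are the hex codes for TLS 1.1 and TLS 1.0, so that client offered nothing newer and was refused. The tls-opts file referenced in the config is not posted in this thread, so the following is only a hypothetical sketch of options that would lead to rejections like the ones in the log:

```yaml
tls:
  options:
    # Hypothetical content; the real tls-opts file is not shown in this thread.
    tls-opts:
      # Clients offering only TLS 1.0/1.1 fail the handshake.
      minVersion: VersionTLS12
      # Clients without any of these suites get "no cipher suite supported".
      cipherSuites:
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
```

Rejections of this kind are expected with scanners and very old clients, and are unlikely to explain a router that stops working after a few hours.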

The only actual error, spread throughout the log, is:

2024-05-06T08:07:58+02:00 ERR github.com/traefik/traefik/v3/pkg/tcp/proxy.go:75 > Error while handling TCP connection error="readfrom tcp 192.168.4.8:48740->192.168.1.103:443: read tcp 192.168.4.8:443->192.168.4.1:41856: read: connection reset by peer"

I'm trying to find out whether these connections get stopped by a firewall between the two internal IPs.

Could this be the error causing my situation, where nothing more is logged from a certain point on and only a few services are still served?

I'm beginning to think it is.

It starts to spam the log just before the problem arises.

I think I understand this part: readfrom tcp 192.168.4.8:48740->192.168.1.103:443, as I see matching logs on the firewall in between. What I don't know is why Traefik's internal Docker network IP (192.168.4.8) contacts the backend directly instead of using the host's IP (in the 192.168.1.x network).

I do not understand the second part: read tcp 192.168.4.8:443->192.168.4.1:41856: read: connection reset by peer. It seems to be caused by the firewall blocking the connection, yet the addresses involved are the internal Docker network IPs of Traefik and the gateway.

"Connection reset by peer" seems to indicate that the browser/client closed the connection.

Firewalls usually do not allow connections in the first place, so I think it’s rather a TLS or potentially a network MTU size issue.
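If MTU is a suspect, one way to test it is to lower the MTU of the Docker bridge network that Traefik is attached to. A minimal compose sketch, assuming the traefik network is defined in this compose file with the bridge driver; the value is a hypothetical example, not a measured one:

```yaml
networks:
  traefik:
    driver: bridge
    driver_opts:
      # Hypothetical value; match it to the smallest MTU on the path.
      com.docker.network.driver.mtu: "1400"
```

The network has to be recreated for the option to take effect. If the resets stop afterwards, that points at fragmentation somewhere between the two hosts.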

I solved it by changing the Traefik TCP routers into HTTP routers.
I couldn't debug the TCP settings/problems any further than I already had.

Now all of the routers/services are working and stable:
Internet->ISP Router->Traefik1-HTTP->Traefik2-HTTP->Services

Instead of:
Internet->ISP Router->Traefik1-TCP->Traefik2-HTTP->Services
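A sketch of what such a change could look like in the file provider, for anyone landing here later. This is not the exact config used: the router and service names are carried over from the TCP definition above, the rule is written in v3 HostRegexp syntax (a plain regular expression), and the backend URL assumes the second Traefik still listens on 192.168.1.103:443.

```yaml
http:
  routers:
    homelab-rtr-https:
      # Matches each apex domain and its single-label subdomains.
      rule: 'HostRegexp(`^([a-z0-9]+\.)?domain1\.com$`) || HostRegexp(`^([a-z0-9]+\.)?domain2\.com$`) || HostRegexp(`^([a-z0-9]+\.)?domain3\.com$`)'
      entryPoints:
        - websecure
      service: homelab-svc-https
      tls:
        certResolver: dns-cloudflare
        options: tls-opts@file
  services:
    homelab-svc-https:
      loadBalancer:
        servers:
          # No passthrough: TLS is terminated on the first Traefik and a new
          # TLS connection is opened to the second one.
          - url: "https://192.168.1.103:443"
```

Because the first instance now terminates TLS itself, it needs valid certificates for all three domains. The --serversTransport.insecureSkipVerify=true flag from the compose file means the certificate of the second instance is not verified.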
