Kubernetes and Let's Encrypt: Timeout during read (your server may be slow or overloaded)

darkpixel · October 13, 2020, 9:12pm

I'm running into the same error mentioned here:

When I have:

spec:                                                                                                                   
  replicas: 1                                                                                                           
  selector:                                                                                                             
    matchLabels:                                                                                                        
      app: traefik                                                                                                      
  template:                                                                                                             
    metadata:                                                                                                           
      labels:                                                                                                           
        app: traefik                                                                                                    
    spec:                                                                                                               
      serviceAccountName: traefik-ingress-controller                                                                    
      containers:                                                                                                       
        - name: traefik                                                                                                 
          image: traefik:v2.3                                                                                           
          args:                                                                                                         
            - --api.insecure                                                                                            
            - --accesslog                                                                                               
            - --entrypoints.web.Address=:80                                                                             
            - --entrypoints.websecure.Address=:443                                                                      
            - --providers.kubernetescrd                                                                                 
            - --certificatesresolvers.myresolver.acme.tlschallenge                                                      
            - --certificatesresolvers.myresolver.acme.email=aaron@ctrl-alt-it.com                                       
            - --certificatesresolvers.myresolver.acme.storage=acme.json                                                 
            # Please note that this is the staging Let's Encrypt server.
            # Once you get things working, you should remove that whole line altogether.
            - --certificatesresolvers.default.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory
          ports:                                                                                                        
            - name: web                                                                                                 
              containerPort: 80                                                                                         
            - name: websecure                                                                                           
              containerPort: 443                                                                                        
            - name: admin                                                                                               
              containerPort: 8080

It issues the test certificate perfectly.
When I remove:

            # Please note that this is the staging Let's Encrypt server.
            # Once you get things working, you should remove that whole line altogether.
            - --certificatesresolvers.default.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory

...and reapply the config, the container restarts and a few seconds later I get:

2020-10-13T20:35:44.414405078Z time="2020-10-13T20:35:44Z" level=error msg="Unable to obtain ACME certificate for domains \"redacted.com\": unable to generate a certificate for the domains [redacted.com]: error: one or more domains had a problem:\n[redacted.com] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Timeout during read (your server may be slow or overloaded), url: \n" providerName=myresolver.acme routerName=default-ingressroutetls-6d083b779d844aaf84a5@kubernetescrd rule="Host(`redacted.com`)"

If I put the staging resolver back in, it works.
I finally 'tricked' it into working by removing the staging resolver, letting it error out, then changing my IngressRoute to:

apiVersion: traefik.containo.us/v1alpha1                                                                                
kind: IngressRoute                                                                                                      
metadata:                                                                                                               
  name: ingressroutetls                                                                                                 
  namespace: default                                                                                                    
spec:                                                                                                                   
  entryPoints:                                                                                                          
    - websecure                                                                                                         
  routes:                                                                                                               
  - match: HostSNI(`redacted.com`)                                                                               
    kind: Rule                                                                                                          
    services:                                                                                                           
    - name: portal                                                                                                      
      port: 80                                                                                                          
  tls:                                                                                                                  
    certResolver: myresolver

I didn't like the HostSNI directive being there, so I changed it back to Host. The Traefik container DID NOT restart, but it picked up the new config and updated.

There's nothing special about my k8s cluster. It's on DigitalOcean. It has the RBAC config installed, Prometheus for metrics, and a pretty simple Django application.

For the life of me, I can't figure out why Traefik times out when I apply the config for my app. Shouldn't the Traefik container retry getting the cert after a few minutes if it fails?

ldez · October 13, 2020, 9:55pm

Hello,

[redacted.com] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Timeout during read (your server may be slow or overloaded)

The error comes from Let's Encrypt itself, because LE was not able to get a response from https://redacted.com

darkpixel:

            - --certificatesresolvers.myresolver.acme.tlschallenge                                                      
            - --certificatesresolvers.myresolver.acme.email=aaron@ctrl-alt-it.com                                       
            - --certificatesresolvers.myresolver.acme.storage=acme.json                                                 
            # Please note that this is the staging Let's Encrypt server.
            # Once you get things working, you should remove that whole line altogether.
            - --certificatesresolvers.default.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory

With this configuration, you created 2 certificate resolvers named myresolver and default.
You have to change default to myresolver.

But I don't think that is related to your issue.
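For reference, a minimal sketch of the ACME arguments with the staging `caserver` attached to the same resolver as the other options. It only renames `default` to `myresolver` on the last flag; everything else is copied from the config quoted above.

```yaml
args:
  - --certificatesresolvers.myresolver.acme.tlschallenge
  - --certificatesresolvers.myresolver.acme.email=aaron@ctrl-alt-it.com
  - --certificatesresolvers.myresolver.acme.storage=acme.json
  # Staging CA, now set on "myresolver" instead of a second resolver named "default".
  # Remove this line to use the production Let's Encrypt server.
  - --certificatesresolvers.myresolver.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory
```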

Also, the match has to be Host(`redacted.com`)

I think it's a networking issue. You have to check whether redacted.com is really accessible over the internet.

darkpixel · October 13, 2020, 10:22pm

Apologies. The certificatesresolvers.default was a copy/paste error. I originally had it in my config and then removed it. I grabbed it from one of the Traefik pages and didn't check.

The website definitely is reachable from the public internet. If I disable HSTS and purge the setting from Chrome, I can pull up the site.

I'll try tearing the cluster down and re-creating it in a few hours and use a new DNS name to test.

darkpixel · October 14, 2020, 12:08am

I just tried destroying and re-creating a bunch of times.
Literally, the moment the container starts up it tries to set up LE and fails.
But if the container starts up and within a few seconds of that error popping up I make a change that triggers it to load a new cert, it's successful.

This is on a stock DigitalOcean cluster with RBAC, Prometheus metrics, Traefik, and a simple test app.

Is there an easy way to tell Traefik to delay trying to grab the cert by ~15 seconds?

zespri · October 15, 2020, 8:08am

Is Traefik behind a DO LB or something? Maybe the LB cannot come up quickly enough? Can you separate LB creation and Traefik creation?
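One way to separate the two, as a rough sketch: keep the LoadBalancer Service in its own manifest, apply it first, wait until it has an external IP, and only then apply the Traefik Deployment. The Service name `traefik` is an assumption, since the thread does not show the Service; the selector and port names are taken from the Deployment above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik            # hypothetical name, not shown in the thread
  namespace: default
spec:
  type: LoadBalancer       # the cloud provider creates the external LB for this Service
  selector:
    app: traefik           # matches the pod label in the Deployment
  ports:
    - name: web
      port: 80
      targetPort: web        # container port 80
    - name: websecure
      port: 443
      targetPort: websecure  # container port 443
```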

darkpixel · October 17, 2020, 11:29pm

Yeah, it's behind a DigitalOcean load balancer, but I've tried bringing everything up before Traefik and it still times out.

As a workaround, I just had DO issue the Let's Encrypt cert via their API and handle the HTTPS termination on the LB.
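A hedged sketch of what that kind of setup can look like on the Kubernetes side, using the Service annotations from the DigitalOcean cloud controller manager. The Service name and the certificate ID are placeholders, and the thread does not show the actual manifest, so treat this as an illustration rather than the config that was used.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik            # hypothetical name
  annotations:
    # The LB speaks plain HTTP to the backend and terminates TLS itself.
    service.beta.kubernetes.io/do-loadbalancer-protocol: "http"
    service.beta.kubernetes.io/do-loadbalancer-tls-ports: "443"
    # Placeholder: ID of the certificate issued through DigitalOcean.
    service.beta.kubernetes.io/do-loadbalancer-certificate-id: "your-certificate-id"
spec:
  type: LoadBalancer
  selector:
    app: traefik
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 80
    - name: https
      protocol: TCP
      port: 443
      targetPort: 80       # TLS ends at the LB, so 443 forwards to the plain HTTP port
```

With TLS terminated at the LB, Traefik no longer needs the `certResolver` on the route, since it only sees plain HTTP traffic.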

zespri · October 17, 2020, 11:33pm

If you are happy with your workaround we can leave it there, but otherwise you will need to show exactly what you are doing. One test you could try is to write a startup script for the Traefik container with a bit of a delay in it, and see if it helps.
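A minimal sketch of such a test, done by overriding the container command instead of building a custom image. It assumes the `traefik:v2.3` image ships a shell at `/bin/sh` and has the `traefik` binary on the PATH; the 15 seconds come from the delay asked about earlier in the thread, and the flags are the ones from the original Deployment.

```yaml
containers:
  - name: traefik
    image: traefik:v2.3
    # Assumption: the image provides /bin/sh. This replaces the image entrypoint.
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Wait before starting Traefik, then replace the shell with the Traefik process.
        sleep 15
        exec traefik \
          --api.insecure \
          --accesslog \
          --entrypoints.web.Address=:80 \
          --entrypoints.websecure.Address=:443 \
          --providers.kubernetescrd \
          --certificatesresolvers.myresolver.acme.tlschallenge \
          --certificatesresolvers.myresolver.acme.email=aaron@ctrl-alt-it.com \
          --certificatesresolvers.myresolver.acme.storage=acme.json
```

This only delays the whole process, not the certificate request specifically, so it is a diagnostic to see whether timing matters rather than a fix.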

darkpixel · October 19, 2020, 5:46am

I'm busy writing code at the moment, so I'm happy for now.
Some time in the next few weeks I'll be ready to set up a production cluster and I'll give it another shot. If it fails again, I'll see if I can find a simple way to reproduce it.

