Kubernetes and Let's Encrypt: Timeout during read (your server may be slow or overloaded)

I'm running into the same error mentioned here:

When I have:

spec:                                                                                                                   
  replicas: 1                                                                                                           
  selector:                                                                                                             
    matchLabels:                                                                                                        
      app: traefik                                                                                                      
  template:                                                                                                             
    metadata:                                                                                                           
      labels:                                                                                                           
        app: traefik                                                                                                    
    spec:                                                                                                               
      serviceAccountName: traefik-ingress-controller                                                                    
      containers:                                                                                                       
        - name: traefik                                                                                                 
          image: traefik:v2.3                                                                                           
          args:                                                                                                         
            - --api.insecure                                                                                            
            - --accesslog                                                                                               
            - --entrypoints.web.Address=:80                                                                             
            - --entrypoints.websecure.Address=:443                                                                      
            - --providers.kubernetescrd                                                                                 
            - --certificatesresolvers.myresolver.acme.tlschallenge                                                      
            - --certificatesresolvers.myresolver.acme.email=aaron@ctrl-alt-it.com                                       
            - --certificatesresolvers.myresolver.acme.storage=acme.json                                                 
            # Please note that this is the staging Let's Encrypt server.
            # Once you get things working, you should remove that whole line altogether.
            - --certificatesresolvers.default.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory
          ports:                                                                                                        
            - name: web                                                                                                 
              containerPort: 80                                                                                         
            - name: websecure                                                                                           
              containerPort: 443                                                                                        
            - name: admin                                                                                               
              containerPort: 8080

It issues the test certificate perfectly.
When I remove:

            # Please note that this is the staging Let's Encrypt server.
            # Once you get things working, you should remove that whole line altogether.
            - --certificatesresolvers.default.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory

...and reapply the config, the container restarts and a few seconds later I get:

2020-10-13T20:35:44.414405078Z time="2020-10-13T20:35:44Z" level=error msg="Unable to obtain ACME certificate for domains \"redacted.com\": unable to generate a certificate for the domains [redacted.com]: error: one or more domains had a problem:\n[redacted.com] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Timeout during read (your server may be slow or overloaded), url: \n" providerName=myresolver.acme routerName=default-ingressroutetls-6d083b779d844aaf84a5@kubernetescrd rule="Host(`redacted.com`)"

If I put the staging resolver back in, it works.
I finally 'tricked' it into working by removing the staging resolver, letting it error out, then changing my IngressRoute to:

apiVersion: traefik.containo.us/v1alpha1                                                                                
kind: IngressRoute                                                                                                      
metadata:                                                                                                               
  name: ingressroutetls                                                                                                 
  namespace: default                                                                                                    
spec:                                                                                                                   
  entryPoints:                                                                                                          
    - websecure                                                                                                         
  routes:                                                                                                               
  - match: HostSNI(`redacted.com`)                                                                               
    kind: Rule                                                                                                          
    services:                                                                                                           
    - name: portal                                                                                                      
      port: 80                                                                                                          
  tls:                                                                                                                  
    certResolver: myresolver   

I didn't like the HostSNI directive being there, so I changed it back to Host. The traefik container DID NOT restart, but it picked up the new config and updated.

There's nothing special about my k8s cluster. It's on DigitalOcean. It has the RBAC config installed, Prometheus for metrics, and a pretty simple Django application.

For the life of me, I can't figure out why Traefik times out when I apply the config for my app. Shouldn't the Traefik container retry getting the cert after a few minutes if it fails?

Hello,

[redacted.com] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Timeout during read (your server may be slow or overloaded)

The error comes from Let's Encrypt itself, because LE was not able to get a response from https://redacted.com


With this configuration, you created 2 certificate resolvers named myresolver and default.
You have to change default to myresolver.

But I don't think that is related to your issue.
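For reference, a minimal sketch of the resolver-name fix, assuming the staging server is still wanted while testing. Only the resolver name in the last argument changes from `default` to `myresolver`; the other lines are copied from the question.

```yaml
args:
  - --certificatesresolvers.myresolver.acme.tlschallenge
  - --certificatesresolvers.myresolver.acme.email=aaron@ctrl-alt-it.com
  - --certificatesresolvers.myresolver.acme.storage=acme.json
  # Staging CA, now attached to the same resolver the IngressRoute references.
  # Remove this line to switch to the production Let's Encrypt server.
  - --certificatesresolvers.myresolver.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory
```

With the name left as `default`, the staging URL applies to a second, unused resolver, and `myresolver` talks to the production server either way.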


Also, the match has to be Host(`redacted.com`)
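As a sketch, this is the IngressRoute from the question with only the matcher changed; everything else is copied from the original post. `HostSNI` is the matcher used by TCP routers (`IngressRouteTCP`), while an HTTP `IngressRoute` matches on `Host`.

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: ingressroutetls
  namespace: default
spec:
  entryPoints:
    - websecure
  routes:
    # HTTP router rule: match on the Host header, not on HostSNI.
    - match: Host(`redacted.com`)
      kind: Rule
      services:
        - name: portal
          port: 80
  tls:
    certResolver: myresolver
```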


I think it's a networking issue. You have to check whether redacted.com is really accessible over the internet.

Apologies. The certificateresolvers.default was a copy/paste error. I originally had it in my config and then removed it. I grabbed it from one of the traefik pages and didn't check.

The website definitely is reachable from the public internet. If I disable HSTS and purge the setting from Chrome, I can pull up the site.

I'll try tearing the cluster down and re-creating it in a few hours and use a new DNS name to test.

I just tried destroying and re-creating a bunch of times.
Literally, the moment the container starts up it tries to set up LE and fails.
But if the container starts up and within a few seconds of that error popping up I make a change that triggers it to load a new cert, it's successful.

This is on a stock DigitalOcean cluster with RBAC, Prometheus metrics, Traefik, and a simple test app.

Is there an easy way to tell Traefik to delay trying to grab the cert by ~15 seconds?

Is traefik behind a DO LB or something? Maybe the LB cannot come up quickly enough? Can you separate LB creation and traefik creation?

Yeah, it's behind a Digital Ocean load balancer, but I've tried bringing everything up before Traefik and it still times out.

As a workaround, I just had DO issue the Let's Encrypt cert via their API and handle the HTTPS termination on the LB.

If you are happy with your workaround we can leave it there, but otherwise you will need to show what exactly you are doing. One test that you may try is to write a startup script for the traefik container with a bit of a delay in it, and see if it helps.
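One way to run that test without building a custom image is sketched below. It is a fragment for `spec.template.spec` of the Deployment and rests on two assumptions: that the `traefik:v2.3` image includes `/bin/sh` (the official v2 images are Alpine-based), and that 15 seconds is a useful delay, which is simply the figure mentioned earlier in the thread. The flags are the ones from the question, without the staging line.

```yaml
containers:
  - name: traefik
    image: traefik:v2.3
    # Assumption: the image provides /bin/sh. Setting `command` replaces
    # the image's entrypoint with a shell for this container.
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Wait before starting, so the first ACME attempt happens later.
        sleep 15
        # exec makes traefik the main process so it receives signals directly.
        exec traefik \
          --api.insecure \
          --accesslog \
          --entrypoints.web.Address=:80 \
          --entrypoints.websecure.Address=:443 \
          --providers.kubernetescrd \
          --certificatesresolvers.myresolver.acme.tlschallenge \
          --certificatesresolvers.myresolver.acme.email=aaron@ctrl-alt-it.com \
          --certificatesresolvers.myresolver.acme.storage=acme.json
```

Note that Traefik is not listening during the sleep, so if the load balancer needs a healthy backend before it forwards traffic, this delay alone may not change the outcome. It is only a quick way to check whether timing is the factor.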


I'm busy writing code at the moment, so I'm happy for now. :)
Some time in the next few weeks I'll be ready to set up a production cluster and I'll give it another shot. If it fails again, I'll see if I can find a simple way to reproduce it.
