I've previously asked this question on SO, so far without luck.
I've got Traefik/Docker Swarm/Let's Encrypt/Consul set up, and it's been working fine. It managed to successfully get certificates for the domains admin.domain.tld, registry.domain.tld and matomo.domain.tld, but others like domain.tld and staging.domain.tld aren't getting any certificates (the browser warns of a self-signed certificate because it's the default Traefik certificate). On some tries, though (after deleting the Consul data and starting over from scratch), the set of working and non-working domains is different...
My Traefik configuration (that's being uploaded to Consul):
debug = false
logLevel = "DEBUG"
insecureSkipVerify = true
defaultEntryPoints = ["https", "http"]
[entryPoints]
[entryPoints.ping]
address = ":8082"
[entryPoints.http]
address = ":80"
[entryPoints.http.redirect]
entryPoint = "https"
[entryPoints.https]
address = ":443"
[entryPoints.https.tls]
[traefikLog]
filePath = '/var/log/traefik/traefik.log'
format = 'json'
[accessLog]
filePath = '/var/log/traefik/access.log'
format = 'json'
[accessLog.fields]
defaultMode = 'keep'
[accessLog.fields.headers]
defaultMode = 'keep'
[accessLog.fields.headers.names]
"Authorization" = "drop"
[retry]
[api]
entryPoint = "traefik"
dashboard = true
debug = false
[ping]
entryPoint = "ping"
[metrics]
[metrics.influxdb]
address = "http://influxdb:8086"
protocol = "http"
pushinterval = "10s"
database = "metrics"
[docker]
endpoint = "unix:///var/run/docker.sock"
domain = "domain.tld"
watch = true
exposedByDefault = false
network = "net_web"
swarmMode = true
[acme]
email = "jan@maildomain.tld"
storage = "traefik/acme/account"
entryPoint = "https"
onHostRule = true
acmeLogging = true
[acme.httpChallenge]
entryPoint = "http"
The full log of startup and of trying to load one page that's not working is available here (it's been passed through grep -i -A3 -B3 acme). The last two lines are probably the issue:
{"level":"debug","msg":"http: TLS handshake error from 10.255.0.2:60638: remote error: tls: unknown certificate","time":"2019-07-11T17:19:01Z"}
{"level":"debug","msg":"http: TLS handshake error from 10.255.0.2:60637: remote error: tls: unknown certificate","time":"2019-07-11T17:19:01Z"}
Along the way, while trying to fix this, I have seen various other errors:
{"level":"error","msg":"Error getting ACME certificates [matomo.domain.tld] : cannot obtain certificates: acme: Error -\u003e One or more domains had a problem:\n[matomo.domain.tld] acme: error: 400 :: urn:ietf:paramsacme:error:connection :: Fetching http://matomo.domain.tld/.well-known/acme-challenge/WJZOZ9UC1aJl9ishmL2ACKFbKoGOe_xQoSbD34v8mSk: Timeout after connect (your server may be slow or overloaded), url: \n","time":"2019-07-09T16:27:43Z"}
I hope someone is able to shed some light on this. I'm growing ever closer to sticking nginx in front of Traefik to handle HTTPS termination, but that sounds like a really bad hack.
In order to rule out the Docker part I tried specifying my domains using [[acme.domains]], which resulted in a bunch of these errors on startup:
{"level":"error","msg":"Error getting ACME certificate for domain [\"domain.xyz\" \"www.domain.xyz\" \"staging.domain.xyz\" \"pagefle.domain.xyz\" \"gris.domain.xyz\" \"gefleteknologerna.domain.xyz\" \"staging.pagefle.domain.xyz\" \"staging.gris.domain.xyz\" \"staging.gefleteknologerna.domain.xyz\" \"admin.domain.xyz\" \"matomo.domain.xyz\" \"portainer.admin.domain.xyz\" \"registry.domain.xyz\"]: cannot obtain certificates: acme: Error -\u003e One or more domains had a problem:\n[domain.xyz] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://domain.xyz/.well-known/acme-challenge/NynlPWanf5_76iKQTIzQbA2GBpK182oaNmxfSk2x1qw: Connection refused, url: \n[gefleteknologerna.domain.xyz] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://gefleteknologerna.domain.xyz/.well-known/acme-challenge/GG8JdOyzwKCXRZvE99UzKcrMjFKtEy7XdWVNlmMV1zQ: Connection refused, url: \n[gris.domain.xyz] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://gris.domain.xyz/.well-known/acme-challenge/3FSpxOYL8U67oKOF5Mvdej9w-DsSD0f-b_h72MFuQro: Connection refused, url: \n[staging.domain.xyz] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.domain.xyz/.well-known/acme-challenge/4qyFTx8xJYfvzystUc37U3Zk0VSHC1vjjp4BaCYSfbw: Connection refused, url: \n[staging.gefleteknologerna.domain.xyz] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.gefleteknologerna.domain.xyz/.well-known/acme-challenge/Oy0AS9i-Hce-iWY93KRrAC9nJDzocuj6ooFa8naJhro: Connection refused, url: \n","time":"2019-07-12T18:41:45Z"}
Which is kinda weird, since I have no issues reaching those domains from my browser (just not the ACME path, of course, and only with an invalid certificate).
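Since the combined error above is hard to read, here is a minimal Python sketch that pulls the failing domains and their reasons out of one JSON log line. It assumes the message layout seen in the two error lines quoted in this thread ("[domain] acme: error: 400 :: urn :: Fetching URL: reason, url:"); the function name failing_domains is just a label picked for this example, not anything Traefik provides.

```python
import json
import re

# One per-domain entry inside the ACME error message, for example:
# "[domain.xyz] acme: error: 400 :: urn:... :: Fetching http://...: Connection refused, url:"
# The layout is assumed from the log lines quoted in this thread.
_ENTRY = re.compile(
    r"^\[(?P<domain>[^\]]+)\] acme: error: (?P<status>\d+) :: \S+ :: "
    r"Fetching \S+: (?P<reason>.*?), url:",
    re.MULTILINE,
)


def failing_domains(log_line):
    """Return {domain: (status, reason)} for one JSON-formatted log line.

    Lines that are not at level "error", or that contain no per-domain
    entries, give an empty dict.
    """
    entry = json.loads(log_line)
    if entry.get("level") != "error":
        return {}
    return {
        m["domain"]: (int(m["status"]), m["reason"])
        for m in _ENTRY.finditer(entry.get("msg", ""))
    }
```

Feeding it each line of traefik.log and printing the non-empty results gives a quick per-domain list; any domain from [[acme.domains]] that never shows up in it is one the error did not mention.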
Hi!
An attempt without Consul and with tlsChallenge sadly gives exactly the same results: https://pastebin.com/ZwLcQkFF (400, Connection refused).
For reproduction (same error even without Docker: https://pastebin.com/Pzt4WLn2):
debug = true
logLevel = "DEBUG"
insecureSkipVerify = true
defaultEntryPoints = ["https", "http"]
[entryPoints]
[entryPoints.ping]
address = ":8082"
[entryPoints.http]
address = ":80"
[entryPoints.http.redirect]
entryPoint = "https"
[entryPoints.https]
address = ":443"
[entryPoints.https.tls]
[traefikLog]
filePath = '/var/log/traefik/traefik.log'
format = 'json'
[accessLog]
filePath = '/var/log/traefik/access.log'
format = 'json'
[accessLog.fields]
defaultMode = 'keep'
[accessLog.fields.headers]
defaultMode = 'keep'
[accessLog.fields.headers.names]
"Authorization" = "drop"
[retry]
[api]
entryPoint = "traefik"
dashboard = true
debug = false
[ping]
entryPoint = "ping"
[acme]
email = "jan@dalheimer.de"
storage = "/acme.json"
entryPoint = "https"
onHostRule = true
acmeLogging = true
[acme.tlsChallenge]
[[acme.domains]]
main = "domain.xyz"
sans = [
"www.domain.xyz",
"staging.domain.xyz",
"pagefle.domain.xyz",
"gris.domain.xyz",
"gefleteknologerna.domain.xyz",
"staging.pagefle.domain.xyz",
"staging.gris.domain.xyz",
"staging.gefleteknologerna.domain.xyz",
"admin.domain.xyz",
"matomo.domain.xyz",
"portainer.admin.domain.xyz",
"registry.domain.xyz"
]
[[acme.domains]]
main = "staging.otherdomain.se"
Docker Swarm configuration:
version: '3.7'
services:
  traefik:
    image: traefik:alpine
    networks:
      - public
      - backbone
      - web
    ports:
      - 80:80
      - 443:443
      - 8080:8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /home/docker/test/acme.json:/acme.json
      - /home/docker/test/traefik.toml:/etc/traefik/traefik.toml
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
          - node.role == manager
      update_config:
        parallelism: 1
        delay: 10s
      labels:
        traefik.enable: 'true'
        traefik.port: 8080
        traefik.frontend.rule: 'Host: admin.domain.xyz; PathPrefixStrip: /traefik'
    healthcheck:
      test: 'printf "GET /ping HTTP/1.1\r\nHost: 127.0.0.1\r\nAccept: */*\r\n\r\n" | nc localhost 8082'
networks:
  web:
    driver: overlay
    internal: true
  public:
    driver: overlay
  backbone:
    driver: overlay
    internal: true
Instructions: Place both files as traefik.toml and stack.yaml in the same directory, and also touch and chmod 600 acme.json. Start with docker stack deploy -c stack.yaml test, wait until the container shows up in docker ps, then run docker exec {container id} tail /var/log/traefik/traefik.log to view the log.
Ok, I'm starting an environment to reproduce. The HTTP 400 "Connection refused" from Let's Encrypt really looks like a network issue (as per all the Let's Encrypt community posts on this topic, such as https://community.letsencrypt.org/t/error-400-connection-refused/89267/6 ).
Could you check, from a machine outside your infrastructure and on another internet provider, that you can curl (over both IPv4 and IPv6) the IPs mentioned in the A and AAAA records for the faulty domains?
Is there any firewall that could refuse external connections?
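If curl is awkward to run per address, a rough Python equivalent of that check is sketched below. It only tests whether a TCP connection can be opened on each address the name resolves to (port 80 for the HTTP challenge, port 443 for the TLS challenge); it does not speak ACME. The function name probe and the domain names in the usage comment are placeholders chosen for this example.

```python
import socket


def probe(host, port=80, family=socket.AF_INET, timeout=5.0):
    """Try a TCP connect to every address `host` resolves to in `family`.

    Use socket.AF_INET for A records and socket.AF_INET6 for AAAA records.
    Returns a list of (address, error) tuples; error is None on success.
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # No record of this family, or the resolver is unreachable.
        return [(None, f"DNS lookup failed: {exc}")]

    results = []
    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results.append((sockaddr[0], None))
        except OSError as exc:
            # Covers "connection refused", timeouts and unreachable networks.
            results.append((sockaddr[0], str(exc)))
    return results


# Example usage (placeholder names, run from outside the infrastructure):
# for name in ["domain.xyz", "staging.domain.xyz"]:
#     for fam in (socket.AF_INET, socket.AF_INET6):
#         print(name, probe(name, 80, fam))
```

A "connection refused" here for a name that works in the browser would suggest the name resolves to a different address from the outside than expected, for example through a proxying DNS provider.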
Based on some post I found early on (that I can't find right now) I tried removing the AAAA record, so everything's been going through IPv4. The firewall has likewise been disabled since my initial troubleshooting attempts (that's the reason I've removed the real domain name in all logs etc.).
My Cloudflare configuration:
(and so on for all subdomains)
otherdomain.se is on a different DNS provider but with a similar configuration.
Also, even in the test case, the fact that some subdomains seemingly succeed (or at least aren't included in the error) still confuses me. Could it be some form of rate limit? If so, which one?
Rate limits for Let's Encrypt are well documented.
And when you hit a rate limit, the error returned explicitly states which rate limit applies.
There is no guesswork involved for rate limiting!
I guess I can rule that out then. Good because one less possible issue, bad since it was the only explanation I could come up with...
@dduportal Have you had any luck reproducing?
Hi @02JanDal, alas no, sorry. I tried with AWS Route53 and Digital Ocean DNS, with 4 domains I own and 4 subdomains per domain. I'm unable to reproduce, except when I block Let's Encrypt's access (either with a security group rule in AWS, a firewall rule in ufw or a bad port mapping).
What is the result of the command openssl s_client -connect <failing-domain>:443 from a machine inside your platform? And from an external machine (for example through a 4G access point, or from another Internet provider)?
Got pretty much exactly the same result both from my own local computer and the server itself.
CONNECTED(00000005)
depth=0 CN = TRAEFIK DEFAULT CERT
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = TRAEFIK DEFAULT CERT
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
0 s:/CN=TRAEFIK DEFAULT CERT
i:/CN=TRAEFIK DEFAULT CERT
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIDRTCCAi2gAwIBAgIPBDjiwdJnkJpTTMm9qRDtMA0GCSqGSIb3DQEBCwUAMB8x
HTAbBgNVBAMTFFRSQUVGSUsgREVGQVVMVCBDRVJUMB4XDTE5MDcxNjE2NTEyM1oX
DTIwMDcxNTE2NTEyM1owHzEdMBsGA1UEAxMUVFJBRUZJSyBERUZBVUxUIENFUlQw
ggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQC21fGv2lZREwKWDr/RUkPu
mINdCBKkrGV3wTyAZKrTI4jQbeHnyRWHMbXS1puI5DPzXSjmaqRfF7oFE6PJTTlC
ohK4+h3M7HB3AGNuZcTg3mAA3lvSU24GCDyVI/hyXNvE++KLU154Q3rV1AGZONCl
RIHBr1dzmWkG2iZ4sL9HTrYkvORdjjuNCCT1E2fkZByzYuUz5JAssNg20Etojett
AGHQXoW0UaWdqh/FBz3XHXNJXCR4KqSEp9BzUbM3F85SvrAlV8VF6Ls5pbfx3ZJh
Tbj0b5AzHiwJq984o9u8mhjyWlsv4EGPRIiFfDOF1JDGrBw015wYUOMBma5KLm3T
AgMBAAGjfjB8MA4GA1UdDwEB/wQEAwIFIDAMBgNVHRMBAf8EAjAAMFwGA1UdEQRV
MFOCUTZhNTRkZDJiMjhlYTA3MmJhN2UyZTBmMDM3YTBkNDE5LjA3OGU5NTg1ZTE0
YzBhMWVkOGQzYjFiZWFlOGZlYjZmLnRyYWVmaWsuZGVmYXVsdDANBgkqhkiG9w0B
AQsFAAOCAQEAdhMeaYlWrwI4E3Ufd/FmVOvcz8C+ccs/k4RGs5n9UMvhOjFWax9E
ZKx+r2brvNhkSl8j9TBNe3M7OaoWRU8UI0Gry/eUhuCXjsltJPsL8HZIae/LG12/
jYlEqLYd7ojzzEyRvDFaVaRn9+kh2OFhgt8zOFZiN0L8BGm91KF2ZR+bWYucoRq6
H96myievTLyIwn6+r3Giqw5l7IHbQ+keqgxsCxseWvpgUsxOFSBCZI2AvZajdphm
EVWkB5N97WE6TGRAPZPjgaq3lzrCLJiPlLy5A5ksXp0RwhWZLTB9VffMcKmFSYDx
DoOQ6Ac1gdxIVDeqldf1fneU240vl9Q80w==
-----END CERTIFICATE-----
subject=/CN=TRAEFIK DEFAULT CERT
issuer=/CN=TRAEFIK DEFAULT CERT
---
No client certificate CA names sent
Server Temp Key: ECDH, X25519, 253 bits
---
SSL handshake has read 1414 bytes and written 293 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES128-GCM-SHA256
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
Protocol : TLSv1.2
Cipher : ECDHE-RSA-AES128-GCM-SHA256
Session-ID: 7CC416EF8E9DA069909A26D8FC291B9FB08175B422ABF6939CE91624519E0A20
Session-ID-ctx:
Master-Key: 6650A9E449D1ED2E561AC6F0FC5FAC34B7C648D4DF324846E176EA687ACE28231B3A49E05ECAA3D6C690C16197D052ED
TLS session ticket:
0000 - fc ee 6e e0 a0 15 bc 3a-eb 44 7a 1e a5 7a 86 66 ..n....:.Dz..z.f
0010 - 3f c4 81 96 bb 8a 83 63-a2 a7 a2 c9 ab 89 4f 06 ?......c......O.
0020 - b8 54 f1 f4 b4 b5 79 07-62 24 44 5d da 3f 76 0d .T....y.b$D].?v.
0030 - ab 9a 0d ad c8 1a 1e 3f-41 a1 21 1b 46 aa e5 cf .......?A.!.F...
0040 - 64 d8 34 2e 91 9a 97 16-44 74 d5 9f 92 50 79 0b d.4.....Dt...Py.
0050 - 68 89 b9 ce 8e 95 f7 4c-dc 89 99 d1 7e dd f6 9c h......L....~...
0060 - 14 ef f5 bc 9b 62 79 ca-cd 48 93 36 da 74 11 d0 .....by..H.6.t..
0070 - 7f 25 49 3f a1 ec 93 4e- .%I?...N
Start Time: 1563985377
Timeout : 7200 (sec)
Verify return code: 21 (unable to verify the first certificate)
---
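To repeat the openssl check above across many domains, something like the following Python sketch could help. It fetches the certificate a host presents (with SNI, without verification) and reports whether it looks like Traefik's fallback certificate. The detection is a crude heuristic that assumes the CN string "TRAEFIK DEFAULT CERT" appears as plain bytes in the certificate, as it does in the output above; both function names are made up for this example.

```python
import socket
import ssl

# CN of Traefik's fallback certificate, as seen in the openssl output above.
DEFAULT_CN = b"TRAEFIK DEFAULT CERT"


def is_traefik_default_cert(pem):
    """Heuristic: True if the PEM certificate's raw bytes contain DEFAULT_CN.

    This is a plain substring search on the DER encoding, not a real
    X.509 parse.
    """
    der = ssl.PEM_cert_to_DER_cert(pem)
    return DEFAULT_CN in der


def fetch_cert_pem(host, port=443, timeout=5.0):
    """Return the certificate presented for `host` as a PEM string.

    Verification is switched off on purpose so that self-signed
    certificates can still be retrieved and inspected.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname sends SNI, which the server uses to pick a certificate.
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return ssl.DER_cert_to_PEM_cert(tls.getpeercert(binary_form=True))


# Example usage (placeholder domain):
# print(is_traefik_default_cert(fetch_cert_pem("staging.domain.xyz")))
```

Running it over the full domain list after each restart would show which names got a real certificate on that attempt and which fell back to the default.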
Since some domains succeed, could it be that Traefik is trying to do the challenges before Docker and Traefik are set up and the ports are ready?