Hi everyone,
I'm reaching out for help with a persistent issue I've been troubleshooting for days. I'm running a local K3s cluster and trying to deploy a private Docker Registry (the `registry:2` image) for my home network using a non-TLS setup.
The Problem
`docker login registry.local` works perfectly every time. However, `docker push registry.local/my/image` fails sporadically (about 9 out of 10 times) with the error `unknown: 404 page not found`. During these failures, the Traefik Ingress logs show a corresponding `subset not found for ic-docker-registry/docker-registry-without-tls` error. Occasionally, a push of a very small image succeeds.
Notably, an identical setup on two public x86 servers using TLS certificates from Let's Encrypt works flawlessly. This issue only occurs in my local, non-TLS environment.
My Environment
- Kubernetes: K3s v1.33.3+k3s1
- Cluster Setup: An HA cluster with 3 master nodes and 2 worker nodes.
- Infrastructure: All nodes are virtual machines (Ubuntu 24.04) running under UTM/QEMU on an ARM64 (aarch64) host.
- Ingress Controller: Traefik (deployed via a custom installer script).
- Docker Registry: Official `registry:2` image, deployed to the `ic-docker-registry` namespace (a minimal sketch of the Deployment follows this list).
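For context, here is a minimal sketch of the kind of Deployment in play; the names and labels below are placeholders rather than my exact manifest:

```yaml
# Minimal sketch of the registry Deployment; names/labels are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: docker-registry
  namespace: ic-docker-registry
spec:
  replicas: 1
  selector:
    matchLabels:
      app: docker-registry
  template:
    metadata:
      labels:
        app: docker-registry
    spec:
      containers:
        - name: registry
          image: registry:2
          ports:
            - containerPort: 5000   # default registry listen port
```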
What I've Already Tried (The Troubleshooting Journey)
I've systematically debugged this from the application layer down to the infrastructure. Here’s what I’ve done:
1. Verified Ingress Configuration (`IngressRoute`)
I initially suspected my Traefik rule was too restrictive.
- Original Rule: ``match: Host(`registry.local`) && PathPrefix(`/v2/`)``
- Action: Simplified the rule to ``match: Host(`registry.local`)`` to ensure all traffic for the host is forwarded (a sketch of the simplified IngressRoute follows this list).
- Result: The error persisted.
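For reference, a minimal sketch of the simplified IngressRoute, assuming an HTTP-only entry point named `web` and the Service name that appears in the Traefik error; treat the exact names and the apiVersion (which depends on the Traefik version) as assumptions:

```yaml
# Sketch of the simplified IngressRoute; entry point name and apiVersion are assumptions.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: docker-registry-without-tls
  namespace: ic-docker-registry
spec:
  entryPoints:
    - web                      # plain HTTP, no tls section
  routes:
    - match: Host(`registry.local`)
      kind: Rule
      services:
        - name: docker-registry-without-tls
          port: 5000
```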
2. Verified Service and Endpoints
I confirmed the Kubernetes Service was correctly pointing to the running pod: `kubectl describe service ...` showed the Service had a valid endpoint pointing to the pod's IP address.
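A rough sketch of the Service that was checked; the selector and port are assumptions carried over from the Deployment sketch above, while the name matches the one Traefik complains about:

```yaml
# Sketch of the registry Service; selector and port values are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: docker-registry-without-tls
  namespace: ic-docker-registry
spec:
  selector:
    app: docker-registry      # must match the pod labels, or the endpoint list stays empty
  ports:
    - name: http
      port: 5000
      targetPort: 5000
```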
3. Isolated Storage as the Cause (NFS -> emptyDir)
My initial setup used a `PersistentVolumeClaim` backed by an NFS share. I suspected slow network storage was the issue.
- System logs (`journalctl`) on the worker node did, in fact, show `NFS: server not responding` errors during pushes.
- Action: I rewrote the deployment to use a temporary `emptyDir` volume instead, completely removing the network storage dependency (see the fragment after this list).
- Result: The 404 error still occurred.
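The volume swap, roughly; this is a fragment of the pod spec, and the mount path assumes the `registry:2` default storage directory (`/var/lib/registry`):

```yaml
# Pod spec fragment: emptyDir instead of the NFS-backed PVC.
# /var/lib/registry is the registry:2 default storage path (assumption).
containers:
  - name: registry
    image: registry:2
    volumeMounts:
      - name: registry-data
        mountPath: /var/lib/registry
volumes:
  - name: registry-data
    emptyDir: {}              # previously: persistentVolumeClaim: { claimName: ... }
```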
4. Added and Tuned Resource Limits & Health Probes
My next theory was that the pod was crashing under load (e.g., an OOM kill).
- Action: I added explicit `resources` (requests and limits) to the container, progressively increasing the memory limit up to 2Gi. I also configured a `livenessProbe` and `readinessProbe` with very tolerant settings (longer timeouts, higher failure thresholds) to prevent Kubernetes from marking the pod as `NotReady` too quickly (see the sketch after this list).
- Result: No improvement. `kubectl describe pod` showed a stable, running pod with 0 restarts, but the push still failed.
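Roughly what the container tuning looked like; the 2Gi limit and the `/v2/` probe target are real, but the request values and probe timings below are illustrative:

```yaml
# Container spec fragment; request values and probe timings are illustrative.
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    memory: 2Gi
readinessProbe:
  httpGet:
    path: /v2/
    port: 5000
  periodSeconds: 10
  timeoutSeconds: 10        # generous timeout
  failureThreshold: 6       # tolerate several slow responses before NotReady
livenessProbe:
  httpGet:
    path: /v2/
    port: 5000
  periodSeconds: 20
  timeoutSeconds: 10
  failureThreshold: 6
```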
5. Ruled out Inter-Node Networking (`nodeSelector`)
To eliminate potential issues with the CNI network (Flannel) between nodes, I used a `nodeSelector` to pin the registry pod to a specific worker (`k3s-worker1`); the fragment is shown below.
- Result: This did not change the behavior.
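The pinning itself was just a `nodeSelector` on the pod template, relying on the standard `kubernetes.io/hostname` label:

```yaml
# Pod template fragment pinning the registry pod to k3s-worker1.
nodeSelector:
  kubernetes.io/hostname: k3s-worker1
```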
The Final Diagnosis: Kubelet Logs Reveal the Truth
After exhausting all application-level configurations, I monitored the live `kubelet` logs on the worker node (`k3s-worker1`) during a failed push attempt. This provided the definitive evidence:
`Readiness probe failed: Get "http://<POD-IP>:5000/v2/": context deadline exceeded`
This confirms the exact failure chain:
- The `docker push` starts, creating high I/O and CPU load on the virtual machine.
- The VM, running in UTM/QEMU on an ARM64 architecture, is unable to handle this load efficiently. The registry process becomes so slow that it fails to respond to Kubernetes' health checks in time.
- Kubernetes sees the failed probe, marks the pod as `NotReady`, and immediately removes it from the Service's endpoints.
- Traefik, attempting to proxy the next request from the `docker push`, finds no available backend pods for the Service and correctly returns a `404 page not found`. The actual push requests never even reach the pod's logs because Traefik stops them first.
The problem is not the Kubernetes configuration, but the performance of the underlying virtualized infrastructure. Even after increasing the VM's RAM to 8GB, the issue persisted, pointing directly to an I/O performance bottleneck.
My question to the community:
Has anyone else experienced similar I/O performance bottlenecks with intensive workloads (like a Docker Registry) in a K3s environment running inside UTM/QEMU on ARM64? Are there known performance issues or specific virtual disk configurations (caching, drivers, etc.) in UTM that can improve I/O throughput for this kind of server workload?
Thanks for any insights you can provide!