Sporadic "404 page not found" on docker push to local K3s Registry on ARM64 VM (UTM/QEMU) - I/O Bottleneck?

Hi everyone,

I'm reaching out for help with a persistent issue I've been troubleshooting for days. I'm running a local K3s cluster and trying to deploy a private Docker Registry (registry:2 image) for my home network using a non-TLS setup.

The Problem

docker login registry.local works perfectly every time. However, a docker push registry.local/my/image fails intermittently (roughly 9 out of 10 attempts) with the error unknown: 404 page not found. During these failures, the Traefik Ingress logs show a corresponding error: subset not found for ic-docker-registry/docker-registry-without-tls. Occasionally, a push of a very small image succeeds.

Notably, an identical setup on two public x86 servers using TLS certificates from Let's Encrypt works flawlessly. This issue only occurs in my local, non-TLS environment.

My Environment

  • Kubernetes: K3s v1.33.3+k3s1
  • Cluster Setup: An HA cluster with 3 master nodes and 2 worker nodes.
  • Infrastructure: All nodes are virtual machines (Ubuntu 24.04) running under UTM/QEMU on an ARM64 (aarch64) host.
  • Ingress Controller: Traefik (deployed via a custom installer script).
  • Docker Registry: Official registry:2 image, deployed to the ic-docker-registry namespace.

What I've Already Tried (The Troubleshooting Journey)

I've systematically debugged this from the application layer down to the infrastructure. Here’s what I’ve done:

1. Verified Ingress Configuration (IngressRoute)
I initially suspected my Traefik rule was too restrictive.

  • Original Rule: match: Host(`registry.local`) && PathPrefix(`/v2/`)
  • Action: Simplified the rule to match: Host(`registry.local`) so that all traffic for the host is forwarded (see the sketch below).
  • Result: The error persisted.
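
For reference, here's a stripped-down sketch of the simplified IngressRoute. The Service name and port are assumptions based on my setup (the namespace and route name come from the error message above), and the apiVersion depends on the Traefik version:

```yaml
# Hypothetical IngressRoute sketch; Service name/port are assumptions.
apiVersion: traefik.io/v1alpha1        # traefik.containo.us/v1alpha1 on older Traefik v2 installs
kind: IngressRoute
metadata:
  name: docker-registry-without-tls
  namespace: ic-docker-registry
spec:
  entryPoints:
    - web                              # plain HTTP entry point (non-TLS setup)
  routes:
    - match: Host(`registry.local`)    # simplified; previously && PathPrefix(`/v2/`)
      kind: Rule
      services:
        - name: docker-registry        # assumed Service name
          port: 5000                   # registry:2's default listen port
```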

2. Verified Service and Endpoints
I confirmed the Kubernetes service was correctly pointing to the running pod.

  • kubectl describe service ... showed the service had a valid endpoint pointing to the pod's IP address.
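
For context, the wiring being verified looks roughly like this (name and labels are assumptions); kubectl get endpoints -n ic-docker-registry should list the pod IP if the selector matches:

```yaml
# Hypothetical Service sketch; the selector must match the registry pod's labels,
# otherwise the Endpoints object stays empty and Traefik has nothing to route to.
apiVersion: v1
kind: Service
metadata:
  name: docker-registry              # assumed name
  namespace: ic-docker-registry
spec:
  selector:
    app: docker-registry             # assumed pod label
  ports:
    - name: http
      port: 5000
      targetPort: 5000               # registry:2 listens on 5000
```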

3. Isolated Storage as the Cause (NFS -> emptyDir)
My initial setup used a PersistentVolumeClaim backed by an NFS share. I suspected slow network storage was the issue.

  • System logs (journalctl) on the worker node did, in fact, show NFS: server not responding errors during pushes.
  • Action: I rewrote the deployment to use a temporary emptyDir volume instead (see the sketch below), completely removing the network storage dependency.
  • Result: The 404 error still occurred.
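
The emptyDir variant boiled down to something like this (a minimal sketch; names and labels are assumptions):

```yaml
# Hypothetical minimal Deployment showing the PVC -> emptyDir swap.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: docker-registry
  namespace: ic-docker-registry
spec:
  replicas: 1
  selector:
    matchLabels:
      app: docker-registry
  template:
    metadata:
      labels:
        app: docker-registry
    spec:
      containers:
        - name: registry
          image: registry:2
          ports:
            - containerPort: 5000
          volumeMounts:
            - name: registry-data
              mountPath: /var/lib/registry   # registry:2's default storage root
      volumes:
        - name: registry-data
          emptyDir: {}                       # was: persistentVolumeClaim backed by NFS
```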

4. Added and Tuned Resource Limits & Health Probes
My next theory was that the pod was crashing under load (e.g., an OOM kill).

  • Action: I added explicit resources (requests and limits) to the container, progressively increasing the memory limit up to 2Gi. I also configured livenessProbe and readinessProbe with very tolerant settings (longer timeouts, higher failure thresholds) to prevent Kubernetes from marking the pod as NotReady too quickly (see the fragment below).
  • Result: No improvement. describe pod showed a stable, running pod with 0 restarts, but the push still failed.
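
The container section ended up looking roughly like this (the concrete probe numbers are illustrative rather than my exact values; it slots into the Deployment sketch from step 3):

```yaml
# Hypothetical container spec with resource limits and deliberately tolerant probes.
containers:
  - name: registry
    image: registry:2
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 2Gi                # raised step by step up to 2Gi
    readinessProbe:
      httpGet:
        path: /v2/                 # the registry's version-check endpoint
        port: 5000
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 10           # deliberately generous timeout
      failureThreshold: 6          # tolerate several slow responses before NotReady
    livenessProbe:
      httpGet:
        path: /v2/
        port: 5000
      initialDelaySeconds: 30
      periodSeconds: 20
      timeoutSeconds: 10
      failureThreshold: 6
```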

5. Ruled out Inter-Node Networking (nodeSelector)
To eliminate potential issues with the CNI network (Flannel) between nodes, I used a nodeSelector to pin the registry pod to a specific worker (k3s-worker1); see the fragment below.

  • Result: This did not change the behavior.
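
The pinning itself is just a nodeSelector on the pod spec, using the node's well-known hostname label:

```yaml
# Pod spec fragment: pin the registry to one worker to take inter-node CNI traffic out of the picture.
spec:
  nodeSelector:
    kubernetes.io/hostname: k3s-worker1
```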

The Final Diagnosis: Kubelet Logs Reveal the Truth

After exhausting all application-level configurations, I monitored the live kubelet logs on the worker node (k3s-worker1) during a failed push attempt. This provided the definitive evidence:

Readiness probe failed: Get "http://<POD-IP>:5000/v2/": context deadline exceeded

This confirms the exact failure chain:

  1. The docker push starts, creating a high I/O and CPU load on the virtual machine.
  2. The VM, running in UTM/QEMU on an ARM64 architecture, is unable to handle this load efficiently. The registry process becomes so slow that it fails to respond to Kubernetes' health checks in time.
  3. Kubernetes sees the failed probe, marks the pod as NotReady, and immediately removes it from the service's endpoints.
  4. Traefik, attempting to route the next request from the docker push, can no longer resolve any ready endpoints for the service (this is the subset not found error in its logs), so the request falls through to Traefik's default handler, which returns 404 page not found. The actual push requests never even show up in the pod's logs because they are stopped at Traefik.

The problem is not the Kubernetes configuration, but the performance of the underlying virtualized infrastructure. Even after increasing the VM's RAM to 8GB, the issue persisted, pointing directly to an I/O performance bottleneck.

My question to the community:
Has anyone else experienced similar I/O performance bottlenecks with intensive workloads (like a Docker Registry) in a K3s environment running inside UTM/QEMU on ARM64? Are there known performance issues or specific virtual disk configurations (caching, drivers, etc.) in UTM that can improve I/O throughput for this kind of server workload?

Thanks for any insights you can provide!