Potential race condition when migrating ingress-nginx to traefik

After reading "Use ingress-nginx Resources in Traefik | Traefik Labs", I tried this method with version 3.6.7 (bundled with RKE2). It works fine so far if the ingressClassName is set to nginx. Both Ingress controllers use the same Ingress at the same time.

I then compared this with the "official" migration documentation from SUSE (sadly not publicly available). They suggest a different way, because there could be race conditions between the two ingress controllers trying to update the same Ingress. To avoid that, they set a different (non-empty) ingressClassName for the compatibility mode in their Helm values. In particular, they set providers.kubernetesIngressNginx.ingressClass="rke2-ingress-nginx-migration". This also works, but it involves downtime to switch the traffic over to the new load balancer.

I like the other method better, because it would allow a truly seamless migration: you only need to switch the ports of the existing load balancer service. But I can't really judge the danger of such a configuration. In my tests I didn't see a race condition in the logs, but that doesn't mean there is none. Maybe I just don't see it at the current log level? Does anybody here have some insight into this?

Hi @megabreit, thanks for raising this. The race can be real, and SUSE is right to flag it.
Here's what's actually going on under the hood.

Where the race lives

Both controllers reconcile against the same set of Ingresses (same ingressClassName: nginx) and both write back to status.loadBalancer.ingress[] to publish the LoadBalancer IP they front. Concretely on Traefik's side, the write happens in pkg/provider/kubernetes/ingress-nginx/kubernetes.go (updateIngressStatus) and is triggered whenever providers.kubernetesIngressNginx.publishService or publishStatusAddress is set. The provider does skip the update when the observed status already equals its target (isLoadBalancerIngressEquals), but as long as ingress-nginx is also publishing its own (different) IP, the two controllers keep overwriting each other in a tight loop.

It usually doesn't show as an error in the logs, just repeated Updated ingress status info lines on both sides. So your "I didn't see it in the logs" matches what we'd expect.
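To make the ping-pong concrete, here is a toy model in Go. It is not Traefik's or ingress-nginx's code: the types and functions (ingressStatus, controller, reconcile, simulate) and the IP addresses are hypothetical, and real controllers react to watch events rather than running in fixed rounds. It only illustrates why a "skip if already equal" check does not help when two writers want different values.

```go
package main

import "fmt"

// ingressStatus stands in for status.loadBalancer.ingress[] on one Ingress.
type ingressStatus struct{ ips []string }

// controller is a hypothetical model of one controller's status publisher.
type controller struct {
	name   string
	target []string // the address this controller wants to publish
}

// equalIPs is a simplified stand-in for an "is the status already what I want?" check.
func equalIPs(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// reconcile writes the controller's target into the status unless it already
// matches, and reports whether a write happened.
func (c *controller) reconcile(s *ingressStatus) bool {
	if equalIPs(s.ips, c.target) {
		return false // nothing to do, skip the update
	}
	s.ips = append([]string(nil), c.target...)
	return true
}

// simulate lets every controller reconcile once per round against one shared
// Ingress status and returns the total number of status writes.
func simulate(controllers []*controller, rounds int) int {
	status := &ingressStatus{}
	writes := 0
	for i := 0; i < rounds; i++ {
		for _, c := range controllers {
			if c.reconcile(status) {
				writes++
			}
		}
	}
	return writes
}

func main() {
	nginx := &controller{name: "ingress-nginx", target: []string{"203.0.113.10"}}
	traefik := &controller{name: "traefik", target: []string{"203.0.113.20"}}

	// Two writers with different targets: every reconcile overwrites the other.
	fmt.Println("two writers:", simulate([]*controller{nginx, traefik}, 5)) // prints "two writers: 10"

	// A single writer converges after one write and then skips.
	fmt.Println("one writer:", simulate([]*controller{nginx}, 5)) // prints "one writer: 1"
}
```

With one writer the equality check makes the status converge after a single update. With two writers that disagree on the target, the check never matches for either side, so each reconcile produces another write.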

Why it matters beyond ExternalDNS

The flapping status.loadBalancer.ingress[] affects anything that watches the Ingress status: ExternalDNS (the most obvious one, and what our migration guide already calls out), kube-state-metrics, ArgoCD/Flux (which will keep showing the resource as out-of-sync), dashboards, and custom operators. Routing itself is unaffected, since both controllers happily route traffic in parallel during the window, but the observability side gets noisy.

Two clean ways to avoid it

  1. Disable status publishing on Traefik during coexistence (the path our migration guide recommends for ExternalDNS users): set

    providers:
      kubernetesIngressNginx:
        publishService:
          enabled: false
    

    Traefik will keep serving the Ingresses normally; it just stops writing the status. ingress-nginx remains the sole writer. If you have also set publishStatusAddress, leave it unset during coexistence too, since either setting triggers the status write. Re-enable publishService on Traefik after you've uninstalled ingress-nginx. Full walkthrough: see the "ExternalDNS Users" block in the migration guide.

  2. Use a transitional IngressClass (SUSE's approach on RKE2): point Traefik's ingress-nginx provider at a distinct class such as rke2-ingress-nginx-migration (the providers.kubernetesIngressNginx.ingressClass value you quoted), so ingress-nginx and Traefik never own the same resource at the same time. This is cleaner operationally and is what we'd recommend specifically on RKE2, where SUSE already wires this up for you.

Both approaches work; pick the one that fits your cluster ops better.

On the docs

The status race was mentioned in our migration guide, but it was tucked inside a collapsed "ExternalDNS Users" note, which undersold it. The issue is broader than just ExternalDNS, and the keyword "race condition" did not appear, which made it hard to find when you went looking. I've opened traefik/traefik#13205 to fix that: the note is promoted to a visible warning, the affected field is named explicitly, the impact list is broadened (kube-state-metrics, ArgoCD/Flux, dashboards, custom operators), and the SUSE transitional-class method is added as a second mitigation option. Thanks for the nudge.

Hope this clears it up. Happy to dig further if you have a repro showing different behavior than the above.