I'm looking for advice or accurate directions regarding the configuration of my socket.io servers running behind a Traefik load balancer. My infrastructure is hosted on AWS, and one EC2 instance currently runs as the manager node hosting the Traefik service. The other EC2 instances are worker nodes, and the socket.io Node.js app runs on these instances.
Currently all network traffic is routed through the manager node (which hosts the Traefik service), so all socket.io connections are routed through it as well (to the socket.io service cluster on the worker nodes, i.e. the other EC2 instances). The Traefik service runs only on this single manager node. My concerns are:
If I expect a lot of websocket users/connections (millions of connections), would this configuration be scalable and okay?
I'm not exactly sure, but could having only a single manager node become a network throughput bottleneck for the websocket connections and make the websocket service unstable?
In general you can scale Traefik and your app horizontally, meaning you can just add more servers. Docker Swarm is happy with 3+ manager nodes (fault tolerance) and can handle 1000+ worker nodes.
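To get a feeling for what "just add more servers" means for millions of connections, a rough back-of-the-envelope calculation can help. The following is only a sketch: the function name `nodesNeeded`, the per-node capacity and the headroom percentage are all assumptions for illustration, and the real per-node limit has to come from load testing and monitoring your own app.

```javascript
// Hypothetical helper: estimate how many worker nodes are needed.
// totalConnections   - expected concurrent websocket connections
// connectionsPerNode - measured capacity of ONE node (an assumption here)
// headroomPercent    - share of capacity kept free for spikes and failover
function nodesNeeded(totalConnections, connectionsPerNode, headroomPercent = 30) {
  if (connectionsPerNode <= 0) {
    throw new RangeError('connectionsPerNode must be positive');
  }
  if (headroomPercent < 0 || headroomPercent >= 100) {
    throw new RangeError('headroomPercent must be in [0, 100)');
  }
  // Capacity per node that we actually plan to use.
  const usable = Math.floor((connectionsPerNode * (100 - headroomPercent)) / 100);
  if (usable < 1) {
    throw new RangeError('no usable capacity left after headroom');
  }
  return Math.ceil(totalConnections / usable);
}

// Illustrative numbers only: 1,000,000 connections, 50,000 per node, 20% headroom.
console.log(nodesNeeded(1000000, 50000, 20)); // prints 25
```

The same arithmetic applies to the proxy layer: if one Traefik node cannot hold all connections, that layer needs more than one node as well.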
The real challenge from my point of view is how you handle your single point of failure, that is, the single node that owns the public IP address. What happens if that node, Docker or Traefik crashes?
One solution: we use Docker Swarm, Traefik and 100+ services with different sub-domains. We use an extra load-balancer in front of the whole setup, assuming that our provider can handle the first node better than we can. I am sure AWS provides similar services.
There are many more things to consider. Open connections per node is one, but also be aware that a Node.js app usually runs its JavaScript on a single thread. If you have multiple CPUs or threads available on your server, your app cannot necessarily use them. Either you enable your app to use multiple processes or threads, or you create as many app containers as you have CPUs/threads available on the node. But of course it all depends on CPU, RAM and network usage. It's all about the balance, and about monitoring, too, to understand what's happening and where the bottlenecks are.