Ideas and stories around Linux-HA (heartbeat) and Linux general.

19.5.08

How many nodes does a cluster tolerate?

Hi,

I was giving a seminar about Linux-HA version 2. At the end of the week we had some time left, and the group decided to test how many nodes can be put into a cluster.

For the seminar we had 32 virtual VMware machines, all running on a single physical host. During the seminar 16 of the machines were used for setting up a Linux-HA and LVS cluster. The other 16 served as the backend machines in the LVS setup.
One participant of the seminar wrote a ha.cf and everybody copied it to their Linux-HA machines. That was done just before lunch, and we then started heartbeat on all nodes at the same time, which was not a good idea. Communication in the cluster was quite slow, machines did not see each other, and the system load rose steadily. When we left for lunch, uptime showed a load of 3; after lunch all 16 virtual machines had a load of 9. The result was that I had to apologize to the administrator of the host on which the virtual machines were running.
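For reference, a minimal ha.cf of the kind we used might look like this (node names, interface, and timing values here are illustrative, not the exact values from the seminar):

```
# /etc/ha.d/ha.cf -- minimal Linux-HA v2 configuration (sketch)
crm yes                  # enable the v2 cluster resource manager (CIB/CRM)
udpport 694              # UDP port for heartbeat traffic
bcast eth0               # send heartbeats as broadcasts on eth0
keepalive 2              # seconds between heartbeats
deadtime 30              # seconds of silence before a node is declared dead
initdead 60              # extra grace period while nodes are booting
node node01 node02 node03 node04   # ...one entry per cluster member
```

The same file has to be identical on every node, which is why one copy was distributed to all machines.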

OK. We stopped heartbeat everywhere, erased the configuration again with cibadmin -E, and deleted the contents of /var/lib/heartbeat/crm as well as the file hostcache.
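Sketched as shell commands, the reset looks roughly like this (run on every node; the init script path may differ on your distribution, and cibadmin needs a running CRM, so the erase has to happen before heartbeat is stopped):

```
cibadmin -E                         # erase the whole CIB while the cluster is still up
/etc/init.d/heartbeat stop          # then stop heartbeat on every node
rm -f /var/lib/heartbeat/crm/*      # remove the on-disk copies of the CIB
rm -f /var/lib/heartbeat/hostcache  # forget the cached cluster membership
```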

After that first experience we started the cluster slowly, one node after the other. Up to 9 cluster members, with no resources defined anywhere, we noticed no performance problems. Once the cluster had more than 12 members, performance was really bad and all reactions of the hosts were quite slow. crm_mon did update the status, but watching the logfiles we saw that information propagated through the cluster quite slowly. Membership updates for a newly joined node sometimes took a minute to show up on all members.

But finally we reached 16 nodes and managed to define a SysInfo clone resource on every node. Please find the screenshot below:
I hope you can read the screenshot. It shows the output of crm_mon for a cluster of 16 nodes with a SysInfo clone resource (clone_max="16" and clone_node_max="1") defined.
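The clone definition itself would look roughly like this in the v2 CIB XML (the ids and the monitor interval are made up for illustration; such a snippet would be loaded with something like cibadmin -C -o resources -x sysinfo.xml):

```xml
<clone id="clone_sysinfo">
  <instance_attributes id="clone_sysinfo_ia">
    <attributes>
      <nvpair id="clone_sysinfo_max" name="clone_max" value="16"/>
      <nvpair id="clone_sysinfo_node_max" name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <!-- one SysInfo instance per node, 16 instances in total -->
  <primitive id="sysinfo" class="ocf" provider="heartbeat" type="SysInfo">
    <operations>
      <op id="sysinfo_monitor" name="monitor" interval="60s"/>
    </operations>
  </primitive>
</clone>
```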

But working with such a cluster is definitely no fun. It is really slow! The cluster also ran into real problems when we defined a group consisting of an IP address and a webserver.
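Such a group might be defined roughly like this in the v2 CIB (the service address and all ids here are hypothetical):

```xml
<group id="grp_web">
  <primitive id="ip_web" class="ocf" provider="heartbeat" type="IPaddr">
    <instance_attributes id="ip_web_ia">
      <attributes>
        <!-- 192.168.1.100 is a placeholder service address -->
        <nvpair id="ip_web_addr" name="ip" value="192.168.1.100"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <primitive id="srv_web" class="ocf" provider="heartbeat" type="apache"/>
</group>
```

The group starts the IP address first and the webserver afterwards, and keeps both on the same node.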

So from our experience I can tell that a cluster of up to 9 nodes makes sense. And why would you need more nodes in a Linux-HA cluster anyway? Still, it was great fun testing it.

1 comment:

theclusterguy said...

This should be significantly improved in the next 0.7 release. I spent the last two weeks looking at performance and managed to reduce the load on a newly elected DC by 70-80%.