Ideas and stories around Linux-HA (heartbeat) and Linux general.

19.5.08

How many nodes does a cluster tolerate?

Hi,

I was giving a seminar about Linux-HA version 2. At the end of the week we had some time left, and the group decided to test how many nodes can be put into a cluster.

For the seminar we had 32 virtual VMware machines, all located on a single physical host. During the seminar 16 of the machines were used for setting up a Linux-HA and LVS cluster. The other 16 served as the backend machines in an LVS setup.
One participant of the seminar created a ha.cf and everybody copied it to their Linux-HA machines. That was done just before lunch, and we then all started heartbeat at the same time, which was not a good idea. Communication in the cluster was quite slow, machines did not see each other, and the system load rose steadily. When we left for lunch, uptime displayed a system load of 3, and after lunch all 16 virtual machines had a load of 9. The result was that I had to apologize to the administrator of the host on which the virtual machines were running.

OK. We stopped heartbeat everywhere, erased the configuration again with cibadmin -E, deleted all contents of /var/lib/heartbeat/crm and removed the file hostcache.
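For reference, the cleanup might look roughly like this on each node (a sketch, not an exact transcript of our session; note that cibadmin -E only works while the cluster is still running, and paths and flags can differ between heartbeat versions):

```shell
# Erase the CIB while the cluster is still up, then stop heartbeat
# and remove the state it left on disk (heartbeat 2 default paths).
cibadmin -E
/etc/init.d/heartbeat stop
rm -f /var/lib/heartbeat/crm/*
rm -f /var/lib/heartbeat/hostcache
```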

After this first experience we started the cluster slowly, one node after the other. With up to 9 cluster members and no resources defined anywhere, we noticed no performance problems. When the cluster had more than 12 members, performance was really bad and all reactions of the hosts were quite slow. crm_mon did update the status, but watching the logfiles we saw that the propagation of information through the cluster was quite slow. Cluster membership updates for a new node sometimes showed up on all members only after a minute.

But finally we reached 16 nodes and managed to define a SysInfo clone resource on every node. Please find the screenshot below:
I hope you can read the screenshot. It shows the output of crm_mon for a cluster of 16 nodes and a SysInfo clone resource (clone_max="16" and clone_node_max="1") defined.
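In heartbeat 2 CIB XML, such a clone looks roughly like this (a sketch only; the ids are made up, and the exact placement of clone_max / clone_node_max differs slightly between 2.0 and 2.1 releases):

```xml
<clone id="clone_sysinfo">
  <instance_attributes id="clone_sysinfo_attrs">
    <attributes>
      <nvpair id="clone_sysinfo_max" name="clone_max" value="16"/>
      <nvpair id="clone_sysinfo_node_max" name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>
  <primitive id="resSysInfo" class="ocf" provider="heartbeat" type="SysInfo"/>
</clone>
```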

But working with such a cluster is definitely no fun. It is really slow! The cluster also ran into real problems when we defined a group consisting of an IP address and a web server.

So from our experience I can tell that a 9-node cluster makes sense. But why would you need more nodes in a Linux-HA cluster anyway? In any case, it was great fun testing it.

1.5.08

pacemaker on debian etch

Hi,

pacemaker is a replacement for the CRM of heartbeat and is actively developed by some programmers at Novell/SuSE. For more information see their website, where the latest software is published. The SuSE build service also provides packages for a large number of distributions.
So it is quite easy to install the latest version of heartbeat / pacemaker on your system. On debian etch you just have to add the SuSE build server to the repository servers in /etc/apt/sources.list:

deb http://download.opensuse.org/repositories/server:/ha-clustering/Debian_Etch/ ./

Tell apt about the changes:

# apt-get update

and get the latest version of heartbeat / pacemaker:

# apt-get install pacemaker


If you also want to use the latest GUI:

# apt-get install pacemaker-pygui python-xml python-gtk2 xbase-clients python-glade2

pacemaker does not enable the use of the GUI by default. If you want to use it (and why would you have installed it otherwise?), please see this entry in the FAQ.
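The usual FAQ answer boils down to two steps (sketched here from memory, so treat the exact lines as assumptions that may differ between versions): let heartbeat respawn the management daemon in /etc/ha.d/ha.cf, and give the hacluster user a password for the GUI login:

```
# /etc/ha.d/ha.cf
apiauth mgmtd uid=root
respawn root /usr/lib/heartbeat/mgmtd -v
```

Afterwards set a password with passwd hacluster and restart heartbeat; the GUI then connects as that user.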

Have fun with the latest version of heartbeat!

17.4.08

Did the spammer give up?

Today I want to analyze the statistics of my mail server, because there have been interesting new developments since mid-March 2008. This is a little off-topic for Linux-HA, but this blog should not be exclusively about heartbeat.

In mid-March 2008 the hit rate of some blacklist providers (Spamhaus and others) dropped by 50%. There was a story about that topic on the German heise news portal on March 31st. In the statistics of our mailserver we can also see the drop in rejected mail for two different blacklists:

Fig 1: Hit rates of our mailserver for two different blacklists. Please note that the scale is logarithmic.

My blacklist hits from Spamhaus rose from 100/h in Nov/Dec 2007 to 300/h in Jan/Feb 2008. But in mid-March they suddenly dropped back to the level of November. Did the spammers really give up? Was that the first victory? Or is this just the preparation for a better attack?

If you analyze the statistics of our mailserver further and take a closer look at the mails that passed the blacklisting, you find that the rate of mails stopped by SpamAssassin rose by the same amount that blacklisting lost. The next figure shows the percentage of mail stopped by blacklisting, greylisting and SpamAssassin, and finally the mails accepted.


The percentage of accepted mails has remained basically constant since mid-December, between 5% and 10%, mainly depending on the day of the week. The fraction of mails rejected by blacklisting dropped from 70% for both blacklists in January to 50% now. In the same period the fraction rejected by SpamAssassin rose from 10% to 20%, while the greylisting rate remained constant.

Given all that data I would conclude that the spammers did not give up, but enhanced their capabilities and now react faster to blacklisting. They recognize where it is in effect and stop wasting resources there. They concentrate even more on those domains / servers / secondary servers where blacklisting is not in place. From these observations two conclusions can be drawn:

1) Spammers are clever and constantly working on their tools.

2) Blacklisting seems to work against spammers. It really seems to be a good option in the fight to regain control over our inboxes.

16.4.08

Reload vs. Restart a Resource

On the mailing list there was a nice discussion today: "Is it possible to reload a resource instead of restarting it every time?"

If you change an attribute of a resource, heartbeat normally restarts the resource. You can also do this manually by stopping and starting the resource again. But sometimes it is desirable only to reload the resource, i.e. to tell the daemon to reread the latest configuration file. Restarting might take some time, while reloading is much faster. So is it possible to have heartbeat reload a resource?

Yes! The basics are already included in the Dummy RA. If you want to add this feature to your own RA, follow these hints:
1) Add an attribute to the resource agent:


<parameter name="reload_trigger" unique="0">
<longdesc lang="en">
Random parameter that, when changed, causes a reload instead of a restart
</longdesc>
<shortdesc lang="en">Reload trigger</shortdesc>
<content type="string" default="" />
</parameter>



Beware: The parameter has to have the attribute unique="0"!

2) Add a reload action to the RA:

<action name="reload" timeout="90"/>

Now restart the node. Andrew wrote on the list that this is necessary to clear the cached meta_attributes.

Now every time you change the value of this attribute, the resource is reloaded instead of being restarted completely. Nice?
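To sketch what the agent side looks like: the action dispatch of an OCF resource agent simply gains a reload case next to start/stop/monitor (modelled on the Dummy RA; the function names and messages below are placeholders for illustration, not part of any heartbeat API):

```shell
# Minimal sketch of the action dispatch in a resource agent that
# supports reload. A real agent would signal its daemon in
# ra_reload, e.g. kill -HUP "$(cat "$PIDFILE")".

ra_reload() {
    # Reread the configuration without a stop/start cycle.
    echo "resource reloaded"
}

ra_action() {
    case "$1" in
        reload)  ra_reload ;;
        start)   echo "resource started" ;;
        stop)    echo "resource stopped" ;;
        monitor) echo "resource running" ;;
        *)       return 3 ;;   # OCF_ERR_UNIMPLEMENTED
    esac
}

ra_action reload
```

When heartbeat decides a changed attribute only needs a reload, it calls the agent with this action instead of stop and start.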

15.4.08

Starting the blog

Hi,

after a talk by SUN about Web 2.0 I decided to start my own blog. This blog will discuss topics around the Linux OS, but especially building clusters with Linux-HA version 2. It should act as a kind of fast supplement to the book "Clusterbau mit Linux-HA Version 2" I wrote for O'Reilly. Sorry, but at the moment this book is only available in German. If enough people ask O'Reilly, perhaps they will decide to translate it some day.

I hope you will find interesting articles in this blog which inspire you to do your own experiments with clusters, offer your services in a highly available way, or just play with Linux. Please feel free to post comments.

Michael.