Pacemaker: Moving from IPaddr to IPaddr2

For Linux users only, moving from ocf:heartbeat:IPaddr to ocf:heartbeat:IPaddr2 can be a good idea, especially since IPaddr had bug(s) in it (see my other Pacemaker post[1]). I say “Linux users” because ocf:heartbeat:IPaddr2 uses /sbin/ip whereas IPaddr uses /sbin/ifconfig.

One of the main differences between /sbin/ifconfig and /sbin/ip is that the former creates and manages aliases such as eth0.3:0 while the latter creates and manages addresses that are only visible with ip addr show. Read link[2] for a comparison. /sbin/ifconfig can’t handle addresses added with /sbin/ip but, fortunately, the opposite isn’t true: /sbin/ip does see /sbin/ifconfig aliases. This will help after the migration, as you’ll understand later (#1).

Let’s go! You can move from IPaddr to IPaddr2 without reloading or restarting your resources, i.e. you can do it live (but don’t blame me please :p)

The process is quite simple:

  • Enter maintenance-mode: your resources aren’t managed anymore but keep their current state ;
  • Edit your resources and replace IPaddr with IPaddr2 ;
  • At this stage, Pacemaker replaces your old IPaddr resources with new IPaddr2 resources. It doesn’t create orphans. But Pacemaker has to know the state of your new resources before committing; this is very important. Hence the reprobe ;
  • Once you have reprobed, Pacemaker is aware that your new IPaddr2 resources are up because the monitor operation found the previous IPaddr aliases up. That’s because /sbin/ip is aware of /sbin/ifconfig aliases (#1) ;
  • Finally, after validating that your resources won't move with ptest, you can leave maintenance-mode and commit your changes.
crm
options editor vim
configure property maintenance-mode=true
configure edit
:%s/IPaddr/&2/gc
:wq
resource reprobe
configure ptest scores
configure property maintenance-mode=false
exit
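
The substitution done in the editor above can be previewed outside the cluster first. Here is a minimal sketch, assuming GNU or BSD sed with -E; the function name to_ipaddr2 and the sample resource definitions are hypothetical and only there for illustration. Unlike the plain vim substitution, it skips entries that are already IPaddr2, so running it twice doesn’t produce IPaddr22.

```shell
#!/bin/sh
# Rewrite IPaddr to IPaddr2 on stdin, leaving existing IPaddr2 entries alone.
# "IPaddr" followed by anything but "2" (or by end of line) gets the suffix.
to_ipaddr2() {
    sed -E 's/IPaddr([^2]|$)/IPaddr2\1/g'
}

# Hypothetical resource definitions, for illustration only.
sample='primitive vip1 ocf:heartbeat:IPaddr params ip="192.0.2.10"
primitive vip2 ocf:heartbeat:IPaddr2 params ip="192.0.2.11"'

# Print the rewritten definitions; nothing is sent to the cluster.
printf '%s\n' "$sample" | to_ipaddr2
```

On a real cluster, the input would presumably come from something like crm configure show, and the output would only be used to review what the edit is going to change.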

At this point, your cluster handles IPaddr2 resources, but the previous aliases are still in use and visible in /sbin/ifconfig output. That’s because, as I said before (#1), /sbin/ip handles aliases fine. Do not bring them down, they are in use! These aliases will be removed and recreated with the new syntax as soon as the resource restarts.

[1] http://floriancrouzat.net/2011/09/pacemaker-tips/
[2] http://www.tty1.net/blog/2010-04-21-ifconfig-ip-comparison_en.html

Find -exec, actions, fork and speed

Assuming you understand a command such as find . -type f -exec ls -l {} \; this post will explain certain subtleties about find, fork, and speed. First, remember not to pipe find into the very bad and dangerous xargs, as explained in this must-read “find guide”[1].

So, if you can’t use xargs, you’ll use -exec. What I wanted to talk about is the difference(s) between \; and \+ at the end of a find -exec command.

find /tmp/find -type f -exec ls -artl {} \;

In this find expression (using \;), each time find finds a matching filename, ls -artl is fired. Meaning it forks a lot (actually, it clones) and the files aren’t sorted by modification time since ls -artl is run file by file…

 #  find /tmp/find -type f -exec ls -artl {} \;
-rw-r--r--  1  root  root  0     Nov  23  11:49  /tmp/find/find_new/quux
-rw-r--r--  1  root  root  1024  Nov  16  10:22  /tmp/find/find_veryold/foo
-rw-r--r--  1  root  root  520   Oct  8   2010   /tmp/find/find_veryold/bar
-rw-r--r--  1  root  root  961   Nov  22  14:39  /tmp/find/find_old/baz

As you can see, files are not sorted, and strace shows a lot of PIDs because find clones for each file to execute ls.

# strace  find /tmp/find -type f -exec ls -artl {} \; &>/dev/stdout | grep clone
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7705728) = 7947
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7705728) = 7948
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7705728) = 7949
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7705728) = 7950

find /tmp/find -type f -exec ls -artl {} \+

In this find expression (using \+), find builds a list of filenames before passing it to -exec. It means ls -artl is fired only once, on the whole file list: there is a single fork and a single execution of ls, which is much faster and actually sorts files by modification time. (find only splits the list across several invocations when it would exceed the system’s maximum command-line length.)

#  find /tmp/find -type f -exec ls -artl {} \+
-rw-r--r-- 1 root root  520 Oct  8  2010 /tmp/find/find_veryold/bar
-rw-r--r-- 1 root root 1024 Nov 16 10:22 /tmp/find/find_veryold/foo
-rw-r--r-- 1 root root  961 Nov 22 14:39 /tmp/find/find_old/baz
-rw-r--r-- 1 root root    0 Nov 23 11:49 /tmp/find/find_new/quux

As you can see, files are sorted, and strace shows only a single PID.

# strace  find /tmp/find -type f -exec ls -artl {} \+ &>/dev/stdout | grep clone
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7581728) = 7956

To make it short, -exec rm {} \+ on 1000 files will do 1xrm(1000files) where \; would have done 1000xrm(1file).
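
The difference can be reproduced without strace. The sketch below assumes a POSIX shell with mktemp available; the scratch directory and the three file names are made up for the demonstration. It counts how many times the command is launched with \; and with +, then checks that a single batched ls can sort across all files.

```shell
#!/bin/sh
# Scratch directory with three files of different ages (hypothetical names).
workdir=$(mktemp -d)
touch -t 201001010000 "$workdir/oldest"
touch -t 201501010000 "$workdir/middle"
touch -t 202001010000 "$workdir/newest"

# Each launch of the helper prints one line, so counting lines counts launches.
per_file_runs=$(find "$workdir" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')
batched_runs=$(find "$workdir" -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

# With +, one ls receives every file and can sort them oldest first (-rt).
batched_order=$(find "$workdir" -type f -exec ls -rt {} + | sed 's|.*/||' | paste -s -d ' ' -)

echo "launches with \\; : $per_file_runs"
echo "launches with +  : $batched_runs"
echo "ls -rt with +    : $batched_order"

rm -rf "$workdir"
```

With three files, the first count is 3 and the second is 1, and the batched ls lists oldest, middle, newest in that order.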

[1] http://mywiki.wooledge.org/UsingFind#Actions_in_bulk:_xargs.2C_-print0_and_-exec_.2B-

CentOS 6: Pacemaker “tips”

There are two guides out there that help understand the concepts and syntax: “Cluster from scratch” [1] and “Cluster configuration explained” [2]. Still, there are certain subtleties I had a hard time finding and/or understanding, which is why I decided to share my modest experience. IRC Freenode #linux-ha is a good place to ask for help too.

  1. About the ocf:pacemaker:ping resource: in order to monitor the real score associated with each node by the ping resource, you can use:
    • cibadmin -Q | grep pingd | grep value
    • crm_mon -frontA1 | grep ping
  2. To prevent moving resources on loss of a common ping node, you might want to have
    dampen >= 2*ping.op_monitor-interval. Read doc[2] for dampen explanations.

  3. Location constraints based on connectivity have to use the ocf:pacemaker:ping resource’s name, not the primitive id. Most of the howtos out there that create a ping resource don’t fill in the name parameter but only the primitive’s id (reminder: primitive id class:provider:type params name=foo host_list=...). With an empty name, you have to use the default name for an ocf:pacemaker:ping resource, which is pingd.

    location IPHA-on-connected-node IPHA \
        rule $id="IPHA-on-connected-node-rule" pingd: defined pingd
    

    This constraint (with a score of pingd: instead of +/-INF:) is explained in a good blog entry that summarizes ping scoring behavior, syntax and formula. To understand ping scoring, you must read link[3].

  4. If you want to receive SNMP traps whenever a resource changes state, you should create an ocf:heartbeat:ClusterMon resource:

    primitive SNMPMonitor ocf:heartbeat:ClusterMon \
        params pidfile="/var/run/crm_mon.pid" extra_options="-S 192.168.1.2 -C public" \
        op monitor on-fail="restart" interval="10s"
    
  5. The <op> tag is used to define parameters for operations performed by the cluster, such as starting or stopping a resource. E.g., you can tell pacemaker that one of your resources takes a long time to start using <op start timeout="3min" ...> (same goes for stop, of course). If you don’t, pacemaker will decide your resource has failed because of the default built-in timeout for the start operation! (See point number 6 below for a concrete example.) Finally, the interval parameter is only used for recurring operations, the only one right now being monitor:

    primitive firewall lsb:my-complex-firewall-initscript \
        op monitor on-fail="restart" interval="10s" \
        op start interval="0" timeout="3min" \
        op stop interval="0" timeout="1min" \
        meta target-role="Started"
    

  6. Prior to CentOS 6.2 (I haven’t been able to find the BZ#id in the release notes…), there is an unneeded and buggy check in the shell code of the ip_stop() function in ocf:heartbeat:IPaddr (/usr/lib/ocf/resource.d/heartbeat/IPaddr).
    When trying to stop such a resource, before deleting the alias, the command if route | grep $IP ; then ... will screw your cluster in two cases: your node has a really, really big local routing table (BGP?) or you don’t have any reachable DNS resolver.
    The failure happens because route takes more than 20 seconds, which is the default timeout for a stop action. The resource gets an INFINITY failcount and goes unmanaged. If it’s part of a bigger shutdown process, the process breaks here and the other node(s) won’t be able to pick up the resources: EPIC FAILURE.
    In the process of fixing this issue, route has first been replaced with route -n which is indeed way faster but can also require more than 20 seconds to be browsed (for example a BGP router can have up to 350K lines), then, it’s been totally removed because it’s totally useless: problem solved. So, you can either: update to 6.2, patch your IPaddr shell script, patch from GitHub, move to IPaddr2. [4]

  7. I had a hard time finding the correct crm shell syntax for colocations and ordered sets. Add the fact that the documentation is wrong and only has XML examples, and you’ll get bad headaches. So here is a sample crm shell syntax for five ocf:pacemaker:Dummy resources:

    • About colocation foo inf: ( E D C ) B A ; the documentation says:

      The only thing that matters is that in order for any member of set N to be active,
      all the members of set N+1 must also be active (and naturally on the same node),
      and that if a set has sequential="true", then in order for member M to be active,
      member M+1 must also be active.

      Then, this should be read as, for B (M) to start, A (M+1) has to be active. C,D,E (N) can start in any order (sequential=false) once A and B are active (N+1).

      Sadly, this is wrong, here’s the real behavior (you’d have to switch A and B in the shell syntax to match the above statement.)

      [...] in order for member M to be active, member M-1 must also be active.
  8. By default, when a failed node comes back online, it claims back its old resources, meaning they are moved again. You can avoid this by setting a non-zero resource-stickiness.
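
Tips 2 and 3 above can be combined in a single ping definition. What follows is only a sketch: the helper ping_primitive, the primitive id ping-gw and the host addresses are hypothetical, while name, host_list and dampen are parameters of the ocf:pacemaker:ping agent. The helper prints a crm shell definition whose dampen is twice the monitor interval and whose name is set explicitly to pingd, so that a location rule using pingd matches it.

```shell
#!/bin/sh
# Print a crm shell definition for a ping resource.
# $1: monitor interval in seconds
# $2: space-separated list of hosts to ping (hypothetical values below)
ping_primitive() {
    interval=$1
    hosts=$2
    # Tip 2: dampen >= 2 * monitor interval
    dampen=$(( 2 * interval ))
    cat <<EOF
primitive ping-gw ocf:pacemaker:ping \\
    params name="pingd" host_list="$hosts" dampen="${dampen}s" \\
    op monitor interval="${interval}s"
EOF
}

# Only prints the definition; nothing is sent to the cluster.
ping_primitive 10 "192.0.2.1 192.0.2.2"
```

The printed text could then be reviewed and pasted into crm configure edit, rather than being applied blindly.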

[1] http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Clusters_from_Scratch/
[2] http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/
[3] http://www.woodwose.net/thatremindsme/2011/04/the-pacemaker-ping-resource-agent/
[4] http://floriancrouzat.net/2012/01/pacemaker-moving-from-ipaddr-to-ipaddr2/