Linux-HA: moving from one Corosync ring and no fencing to redundant rings and fencing

I finally got time to move a two-node Linux-HA cluster running Pacemaker on top of Corosync from a single Corosync ring and no fencing to redundant rings and IPMI fencing. The next step will be redundant hardware-level fencing (OS-level fencing is worthless).

Of course, every cluster should have at least two rings and two fencing devices. Say the IPMI BMC fails because power has been removed from the machine and it has no backup battery: the cluster will then hang waiting for fencing to succeed unless you have another fencing mechanism (typically a switched PDU).

Well, for the moment I can only have one fencing mechanism, so let's move on to the configuration part; I'll update this post once I have a second fencing device (a PDU) available.

1. Corosync: moving from one ring to redundant rings on a running cluster

# Put your cluster in maintenance (resources are now unmanaged)
crm configure property maintenance-mode=true
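# Optionally double-check with a one-shot crm_mon that every resource is now flagged unmanaged
crm_mon -1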

# Shut down the Linux-HA stack (on both nodes: the new ring configuration must be deployed cluster-wide)
service pacemaker stop
service corosync stop

# Edit /etc/corosync/corosync.conf and add the second ring
# I use unicast, authenticated rings (secauth) and active redundant ring protocol.
# Note: this is a corosync 1.x configuration file, so:
#     * There is no mention of quorum stuff
#     * Pacemaker service is enabled with /etc/corosync/service.d/pcmk

cat /etc/corosync/corosync.conf

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
        version: 2
        secauth: on
        threads: 0
        # RRP can have three modes (rrp_mode): if set to active, Corosync uses both
        # interfaces actively. If set to passive, Corosync uses the second interface
        # only if the first ring fails. If rrp_mode is set to none, RRP is disabled.
        rrp_mode: active
        interface {
                member {
                        memberaddr: 192.168.12.1
                }
                member {
                        memberaddr: 192.168.12.2
                }
                ringnumber: 0
                bindnetaddr: 192.168.12.0
                mcastport: 5405
                ttl: 1
        }
        interface {
                member {
                        memberaddr: 192.168.1.11
                }
                member {
                        memberaddr: 192.168.1.13
                }
                ringnumber: 1
                bindnetaddr: 192.168.1.0
                mcastport: 5405
                ttl: 1
        }
        transport: udpu
}

logging {
        fileline: off
        to_logfile: no
        to_syslog: yes
        to_stderr: no
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}
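
Since secauth is on, both nodes must share the same /etc/corosync/authkey. If you don't already have one, a quick sketch (the target node name is just an example):

corosync-keygen # needs entropy; writes /etc/corosync/authkey
scp /etc/corosync/authkey node2:/etc/corosync/authkey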

# Restart only corosync and validate rings, see doc [1]
service corosync start
corosync-cfgtool -s
corosync-objctl | fgrep member
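
With both rings up and healthy, corosync-cfgtool -s should print something like this on node1 (addresses taken from the configuration above):

Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.12.1
        status  = ring 0 active with no faults
RING ID 1
        id      = 192.168.1.11
        status  = ring 1 active with no faults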

# Now you can start Pacemaker and start managing your resources again
service pacemaker start
crm configure property maintenance-mode=false

2. Activate, configure and start using IPMI (requires reboot)

First, you have to activate IPMI on your BMC; as far as I know this can't be done without rebooting, since it is toggled from the firmware setup during boot. The first two commands assume you are running RHEL/CentOS >= 6.2.

yum install ipmitool OpenIPMI
yum install fence-agents # provides the fence_* agents, including fence_ipmilan
reboot # Activate IPMI in the firmware setup during the boot process
service ipmi start # load the OpenIPMI kernel drivers so ipmitool can talk to the BMC locally
# Now you can start configuring IPMI from the CLI, see doc [2]
ipmitool lan print 1 # find your channel ID: start at 1 and increment until it succeeds.
ipmitool lan set 1 ipsrc static
ipmitool lan set 1 ipaddr 192.168.206.224
ipmitool lan set 1 netmask 255.255.255.0
# Eventually configure a gateway, vlan etc, see doc [2]
ipmitool lan set 1 auth ADMIN PASSWORD # Activate the PASSWORD auth type for the ADMIN level.
ipmitool lan print 1

ipmitool user list
ipmitool user enable 3 # Enable a new user ID (here, the last existing ID was 2)
ipmitool user set name 3 foo
ipmitool user set password 3 bar
# ipmitool user priv <uid> <priv> <channel>
ipmitool user priv 3 4 1 # ADMINISTRATOR
ipmitool user list

# test IPMI through the local kernel driver
ipmitool chassis status
# test IPMI over the LAN with the new credentials, first locally then from the second node
ipmitool -H 192.168.206.224 -U foo -P bar -I lan chassis status
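
Before wiring this into Pacemaker, you can also check that the fence agent itself reaches the BMC. A sketch, run from node2 against node1's BMC (option names as in the RHEL 6 fence_ipmilan; check fence_ipmilan -h on your version):

fence_ipmilan -a 192.168.206.224 -l foo -p bar -o status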

Now, we must enable IPMI fencing and create the associated STONITH resources, one per node. See docs [2] and [3] for further explanations. We must also make sure that the fencing resource for node1 only runs on node2 and vice versa (on this specific two-node cluster), and that stonith is enabled in the cluster properties.

primitive ipmi_node1 stonith:fence_ipmilan \
        params auth="password" login="foo" passwd="bar" ipaddr="192.168.206.224" verbose="true" timeout="20" power_wait="10" pcmk_host_check="static-list" pcmk_host_list="node1"
primitive ipmi_node2 stonith:fence_ipmilan \
        params auth="password" login="foo" passwd="bar" ipaddr="192.168.206.225" verbose="true" timeout="20" power_wait="10" pcmk_host_check="static-list" pcmk_host_list="node2"

[...]

location ipmi_node1-on-node2 ipmi_node1 \
        rule $id="ipmi_node1-on-node2-rule" -inf: #uname eq node1
location ipmi_node2-on-node1 ipmi_node2 \
        rule $id="ipmi_node2-on-node1-rule" -inf: #uname eq node2
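
And since stonith must be enabled in the properties, as noted above:

crm configure property stonith-enabled=true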

Now you can test fencing by killing Corosync on a node (killall -KILL corosync); the surviving node should detect the failure and fence it.
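
If you prefer a more controlled test than killing Corosync, you can also ask Pacemaker to fence a node explicitly; a sketch using Pacemaker 1.1's stonith_admin, run on the surviving node:

stonith_admin --reboot node1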

[1] http://www.hastexo.com/resources/hints-and-kinks/checking-corosync-cluster-membership
[2] https://alteeve.com/w/IPMI
[3] http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch09.html
[4] http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_configuration_recap.html
