Monitor a Pacemaker Cluster with ocf:pacemaker:ClusterMon and/or external-agent

If you want to monitor your Pacemaker cluster status and get alerted in real time on any cluster transition, you must define an ocf:pacemaker:ClusterMon resource[1].

This resource must be cloned and will run on all nodes of the cluster. It works by running crm_mon[2] in the background, a binary that provides a summary of the cluster's current state. crm_mon has a couple of options to send email (SMTP) or traps (SNMP) to a chosen recipient on any transition. You can pass these options from ocf:pacemaker:ClusterMon to the underlying crm_mon via the extra_options parameter, see [1].

Here is a sample configuration to receive SNMP traps:

primitive ClusterMon ocf:pacemaker:ClusterMon \
        params user="root" update="30" extra_options="--snmp-traps snmphost.example.com --snmp-community public" \
        op monitor on-fail="restart" interval="60"
[...]
clone ClusterMon-clone ClusterMon \
        meta target-role="Started"

For further information and XML examples, see Chapter 7 of Pacemaker Explained[5].

Sadly, since RHEL 6.x and pacemaker-cli-1.1.6-1, SNMP and ESMTP support are no longer compiled into crm_mon, as noted in the changelog:

$ sudo rpm -q --changelog pacemaker-cli
* Thu Oct 06 2011 Andrew Beekhof  - 1.1.6-1
[...]
- Do not build in support for snmp, esmtp by default

This means the previous ocf:pacemaker:ClusterMon example can no longer be used, as the --snmp-traps option has been removed. The same goes for the SMTP-related options.

Alright, then how do you automatically monitor a Pacemaker cluster's transitions and get alerted when they occur? Of course, paying someone to watch the output of crm_mon -Arf on every cluster is a solution. Not my favorite choice though…

Thankfully, crm_mon[2] is still shipped with the external-agent capability:

  • -E, --external-agent=value
        A program to run when resource operations take place.
  • -e, --external-recipient=value
        A recipient for your program (assuming you want the program to send something to someone).
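
To get a feel for these options before wrapping them in a resource, you can also run crm_mon by hand; a quick sketch (the agent path and the recipient below are placeholders of mine):

crm_mon --daemonize --external-agent /usr/local/bin/my-agent.sh --external-recipient snmphost.example.com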

When triggered, the external agent (-E) is fed with dynamically filled environment variables that tell you which transition happened, so you can react accordingly in your external-agent code. By making clever use of this capability, you can implement whatever reaction you want and reproduce the built-in SNMP support, which I did with a little bash script (pcmk_snmp_helper.sh) that is now included in the extra folder of the Pacemaker sources[3].

This notification mechanism (external-agent) and all its environment variables are now documented in chapter 7 of Pacemaker Explained[6].

This “helper” script has been designed to match *my* needs: receive an SNMP trap on each failed monitor operation, and on any other event (even successful) that is not a monitor operation (start, stop, …).
It is compliant with the Pacemaker MIB[4] and sends SNMP v2c traps (it only requires the snmptrap binary, which can be found in net-snmp-utils).

But remember: you can script whatever you want and it will be run on any cluster transition, e.g. insert the event into a database, deliver email notifications through SMTP, HTTP-POST something somewhere…
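
To illustrate, here is a minimal external-agent sketch of my own (not the helper from [3]) that applies the same filtering as described above but merely logs the event to syslog; the CRM_notify_* variables are the ones crm_mon exports to the agent, and the logger call is just a placeholder for whatever reaction you need:

#!/bin/bash
# Minimal external-agent sketch: called by crm_mon on every notification,
# with the operation described in CRM_notify_* environment variables.

# Skip successful monitor operations; notify on failed monitors and on
# any non-monitor operation (start, stop, ...), whatever its result.
if [ "${CRM_notify_task}" = "monitor" ] && [ "${CRM_notify_rc}" = "0" ]; then
        exit 0
fi

# Placeholder reaction: replace with snmptrap, sendmail, curl, an SQL INSERT...
logger -t pacemaker-notify \
        "node=${CRM_notify_node} rsc=${CRM_notify_rsc} task=${CRM_notify_task}" \
        "rc=${CRM_notify_rc} desc=${CRM_notify_desc}"

exit 0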

In the end, you will end up with the configuration below.
Just adapt the code of the helper to your needs.

primitive ClusterMon ocf:pacemaker:ClusterMon \
        params user="root" update="30" extra_options="-E /usr/local/bin/pcmk_snmp_helper.sh -e snmphost.example.com" \
        op monitor on-fail="restart" interval="60"
[...]
clone ClusterMon-clone ClusterMon \
        meta target-role="Started"

[1] – http://linux.die.net/man/7/ocf_heartbeat_clustermon
[2] – http://linux.die.net/man/8/crm_mon
[3] – https://github.com/ClusterLabs/pacemaker/blob/master/extra/pcmk_snmp_helper.sh
[4] – https://github.com/ClusterLabs/pacemaker/blob/master/extra/PCMK-MIB.txt
[5] – http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/ch07.html
[6] – http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-notification-external.html

MySQL: SHOW GRANTS for all users, from CLI, with a one-liner

If you ever have to double-check all your MySQL users' permissions, database per database and server by server, it's a pain in the ass: since SHOW GRANTS only takes one user at a time, you cannot export everything in one go. And let's not discuss PhpMyAdmin: you just can't, or it becomes a full-time job once you have many servers with many databases.
Thanks to a combination of MySQL and bash magic, you can still achieve this with a bash one-liner. You'll only have to input your MySQL password (if any) twice.

mysql -u root -p -P 3306 -s <<< 'SELECT CONCAT("SHOW GRANTS FOR ",user,"@",host,";") FROM mysql.user WHERE host="host.example.com";' | sed -e "s/FOR /&'/" -e "s/@/'&'/" -e "s/;/'&/" | mysql -u root -p -P 3306 -s

This sample one-liner limits the output to users allowed to connect from a specific host, but if you remove the WHERE clause, you get the grants of all known users.

Note: you have to input your password twice. The second time, it might be displayed on-screen, but you can be sure it will never be in your bash history.
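
If typing the password twice annoys you, here is a hedged variant that reads it once and hands it to both mysql invocations through the MYSQL_PWD environment variable (supported by the mysql client, though officially discouraged since the password becomes visible in the process environment):

# Read the password once (never echoed, never in the history), reuse it twice.
read -r -s -p "MySQL password: " MYSQL_PWD; echo
export MYSQL_PWD
mysql -u root -P 3306 -s <<< 'SELECT CONCAT("SHOW GRANTS FOR ",user,"@",host,";") FROM mysql.user;' \
    | sed -e "s/FOR /&'/" -e "s/@/'&'/" -e "s/;/'&/" \
    | mysql -u root -P 3306 -s
unset MYSQL_PWD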

Linux-HA: moving from one corosync ring and no fencing to redundant rings and fencing(s)

I finally got time to move a two-node Linux-HA cluster running Pacemaker on top of Corosync from a single corosync ring and no fencing to redundant rings and IPMI fencing. The next step will be redundant hardware-level fencing (OS-level fencing is worthless).

Of course, every cluster should have at least two rings and two fencing devices. Say the IPMI BMC fails because power has been removed from the machine and it has no backup battery: the cluster will then hang if you don't have another fencing mechanism (typically a PDU).

Well, for the moment I can only have one fencing mechanism, so let's move on to the configuration part; I'll update this post once I get a second fencing mechanism (PDU) available.

1. Corosync: moving from one ring to redundant rings on a running cluster

# Put your cluster in maintenance (resources are now unmanaged)
crm configure property maintenance-mode=true

# Shutdown the linux-ha stack
service pacemaker stop
service corosync stop

# Edit /etc/corosync/corosync.conf and add the second ring
# I use unicast, authenticated rings (secauth) and active redundant ring protocol.
# Note: this is a corosync 1.x configuration file, so:
#     * There is no mention of quorum stuff
#     * Pacemaker service is enabled with /etc/corosync/service.d/pcmk

cat /etc/corosync/corosync.conf

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
        version: 2
        secauth: on
        threads: 0
        # RRP can have three modes (rrp_mode): if set to active, Corosync uses both
        # interfaces actively. If set to passive, Corosync uses the second interface
        # only if the first ring fails. If rrp_mode is set to none, RRP is disabled.
        rrp_mode: active
        interface {
                member {
                        memberaddr: 192.168.12.1
                }
                member {
                        memberaddr: 192.168.12.2
                }
                ringnumber: 0
                bindnetaddr: 192.168.12.0
                mcastport: 5405
                ttl: 1
        }
        interface {
                member {
                        memberaddr: 192.168.1.11
                }
                member {
                        memberaddr: 192.168.1.13
                }
                ringnumber: 1
                bindnetaddr: 192.168.1.0
                mcastport: 5405
                ttl: 1
        }
        transport: udpu
}

logging {
        fileline: off
        to_logfile: no
        to_syslog: yes
        to_stderr: no
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}
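
For reference, the /etc/corosync/service.d/pcmk file mentioned in the comments above is just a small service stanza; a minimal sketch (ver: 1 matches the fact that the Pacemaker daemons are started by their own init script, as done below):

# /etc/corosync/service.d/pcmk
service {
        # Corosync 1.x plugin declaration for Pacemaker; ver: 1 means the
        # Pacemaker daemons are started separately (service pacemaker start).
        name: pacemaker
        ver: 1
}

Also remember that secauth: on expects the same /etc/corosync/authkey (generated with corosync-keygen) to be present on both nodes.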

# Restart only corosync and validate rings, see doc [1]
service corosync start
corosync-cfgtool -s
corosync-objctl | fgrep member

# Now you can start Pacemaker and start managing your resources again
service pacemaker start
crm configure property maintenance-mode=false

2. Activate, configure and start using IPMI (requires reboot)

First, you have to activate IPMI on your BMC. As far as I know, this can't be done without rebooting. The first two commands assume you are running RHEL/CentOS >= 6.2.

yum install ipmitool OpenIPMI
yum install fence-agents # provides fence_* scripts including IPMI
reboot # Activate IPMI during the boot process
# Now you can start configuring IPMI from the CLI, see doc [2]
ipmitool lan print 1 # find your channel ID by incrementing 1 if it fails.
ipmitool lan set 1 ipsrc static
ipmitool lan set 1 ipaddr 192.168.206.224
ipmitool lan set 1 netmask 255.255.255.0
# Eventually configure a gateway, vlan etc, see doc [2]
ipmitool lan set 1 auth ADMIN PASSWORD # Activate the PASSWORD auth type for the ADMIN level.
ipmitool lan print 1

ipmitool user list
ipmitool user enable 3 # Enable a new user (last id was 2)
ipmitool user set name 3 foo
ipmitool user set password 3 bar
# ipmitool user priv <uid> <priv> <channel>
ipmitool user priv 3 4 1 # ADMINISTRATOR
ipmitool user list

# test local IPMI (in-band, through the OpenIPMI kernel driver; no credentials needed locally)
ipmitool -I open chassis status
# test remote IPMI (from second node)
ipmitool -H 192.168.206.224 -U foo -P bar -I lan chassis status

Now, we must enable IPMI fencing and create the associated stonith resources, one per node. See docs [2] and [3] for further explanations. On this specific two-node cluster, we must also make sure that the fencing resource for node1 only runs on node2 and vice versa, and that stonith is enabled in the cluster properties (see the property commands after the constraints below).

primitive ipmi_node1 stonith:fence_ipmilan \
        params auth="password" login="foo" passwd="bar" ipaddr="192.168.206.224" verbose="true" timeout="20" power_wait="10" pcmk_host_check="static-list" pcmk_host_list="node1"
primitive ipmi_node2 stonith:fence_ipmilan \
        params auth="password" login="foo" passwd="bar" ipaddr="192.168.206.225" verbose="true" timeout="20" power_wait="10" pcmk_host_check="static-list" pcmk_host_list="node2"

[...]

location ipmi_node1-on-node2 ipmi_node1 \
        rule $id="ipmi_node1-on-node2-rule" -inf: #uname eq node1
location ipmi_node2-on-node1 ipmi_node2 \
        rule $id="ipmi_node2-on-node1-rule" -inf: #uname eq node2

Now you can trigger fencing by killing corosync on a node (killall -KILL corosync).
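
If you prefer a more controlled test that does not involve killing the membership layer, stonith_admin can ask the cluster to fence a node directly, for example:

stonith_admin --reboot node2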

[1] http://www.hastexo.com/resources/hints-and-kinks/checking-corosync-cluster-membership
[2] https://alteeve.com/w/IPMI
[3] http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch09.html
[4] http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_configuration_recap.html