Monitor a Pacemaker Cluster with ocf:pacemaker:ClusterMon and/or external-agent

If you want to monitor your Pacemaker cluster status and get alerted in real time on any cluster transition, you must define an ocf:pacemaker:ClusterMon resource[1].

This resource must be cloned and will run on all nodes of the cluster. It works by running crm_mon[2] in the background, a binary that provides a summary of the cluster’s current state. This binary has a couple of options to send email (SMTP) or traps (SNMP) to a chosen recipient on any transition. You can pass these options from ocf:pacemaker:ClusterMon to the underlying crm_mon via the extra_options parameter (see [1]).

Here is a sample configuration to receive SNMP traps:

primitive ClusterMon ocf:pacemaker:ClusterMon \
        params user="root" update="30" extra_options="--snmp-traps snmphost.example.com --snmp-community public" \
        op monitor on-fail="restart" interval="60"
[...]
clone ClusterMon-clone ClusterMon \
        meta target-role="Started"

For further information and XML examples, see Chapter 7 of Pacemaker Explained[5].

Sadly, since RHEL 6.x and pacemaker-cli-1.1.6-1, neither SNMP nor ESMTP support is compiled into crm_mon anymore, as noted in the changelog:

$ sudo rpm -q --changelog pacemaker-cli
* Thu Oct 06 2011 Andrew Beekhof  - 1.1.6-1
[...]
- Do not build in support for snmp, esmtp by default

This means the previous ocf:pacemaker:ClusterMon definition can no longer be used, because the --snmp-traps parameter no longer exists. The same goes for the SMTP-related parameters.

Alright, so how do you automatically monitor a Pacemaker cluster’s transitions and get alerted when they occur? Of course, paying someone to watch the output of crm_mon -Arf on every cluster is a solution. Not my favorite choice though…

Thankfully, crm_mon[2] is still shipped with the external-agent capability:

  • -E, --external-agent=value
        A program to run when resource operations take place.
  • -e, --external-recipient=value
        A recipient for your program (assuming you want the program to send something to someone).

When triggered, the external agent (-E) is fed with dynamically filled environment variables that tell you which transition happened, so you can react accordingly in your external-agent code. By making clever use of this capability, you can develop whatever reaction you want and reproduce the built-in SNMP support, which I did with a little bash script (pcmk_snmp_helper.sh) that is now included in the extra folder of the Pacemaker sources[3].

This notification mechanism (external-agent) and all its environment variables are now documented in chapter 7 of Pacemaker Explained[6].

This “helper” script has been designed to match *my* needs: receive an SNMP trap on each failed monitor operation and on any other event (even a successful one) that is not a monitor operation (start, stop, …).
It is compliant with the Pacemaker MIB[4] and sends SNMP v2c traps (it only requires the snmptrap binary, which can be found in net-snmp-utils).
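
To make that filtering rule concrete, here is a simplified sketch of the decision only. It is not the actual pcmk_snmp_helper.sh: the function name should_notify is made up for this example, and the sketch only assumes that crm_mon exports CRM_notify_task and CRM_notify_rc to the agent, as described in [6].

```shell
#!/bin/bash
# Simplified sketch of the filtering rule, NOT the real pcmk_snmp_helper.sh.
# should_notify is a hypothetical helper name used for illustration only.

should_notify() {
    local task="$1" rc="$2"
    # Any operation other than monitor (start, stop, ...) is always reported.
    if [ "$task" != "monitor" ]; then
        return 0
    fi
    # Monitor operations are reported only when they failed (non-zero rc).
    [ "$rc" != "0" ]
}

# Only act when crm_mon actually handed us an event.
if [ -n "${CRM_notify_task:-}" ] && should_notify "$CRM_notify_task" "${CRM_notify_rc:-0}"; then
    # Placeholder reaction; a real agent would send a trap, a mail, etc.
    echo "would notify: task=${CRM_notify_task} rc=${CRM_notify_rc:-0}"
fi
```

With this rule, a successful recurring monitor stays silent, while a failed monitor or any start/stop triggers the reaction.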

But remember: you can script whatever you want and it will run on any cluster transition, e.g. insert the event into a database, deliver email notifications through SMTP, HTTP-POST something somewhere…
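
As an example of such a custom reaction, the sketch below simply appends one line per event to a log file. Everything specific in it is an assumption made for illustration: the log location, the PCMK_EVENT_LOG override variable and the function names format_event and log_event are not part of Pacemaker. Only the CRM_notify_* variables come from crm_mon (see [6]).

```shell
#!/bin/bash
# Hypothetical external agent: append each cluster event to a log file.
# LOG_FILE, PCMK_EVENT_LOG, format_event and log_event are illustrative names.
LOG_FILE="${PCMK_EVENT_LOG:-/tmp/pcmk_events.log}"

# Build a single key=value line from the variables exported by crm_mon.
format_event() {
    printf 'node=%s rsc=%s task=%s rc=%s desc=%s\n' \
        "${CRM_notify_node:-unknown}" \
        "${CRM_notify_rsc:-unknown}" \
        "${CRM_notify_task:-unknown}" \
        "${CRM_notify_rc:-unknown}" \
        "${CRM_notify_desc:-}"
}

# Prefix the event line with a timestamp and append it to the log.
log_event() {
    echo "$(date '+%Y-%m-%dT%H:%M:%S') $(format_event)" >> "$LOG_FILE"
}

# Only log when crm_mon actually provided an event.
if [ -n "${CRM_notify_task:-}" ]; then
    log_event
fi
```

The same skeleton can be adapted to a database insert or an HTTP call by replacing the body of log_event.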

In the end, you will end up with the configuration below.
Just adapt the code of the helper to your needs.

primitive ClusterMon ocf:pacemaker:ClusterMon \
        params user="root" update="30" extra_options="-E /usr/local/bin/pcmk_snmp_helper.sh -e snmphost.example.com" \
        op monitor on-fail="restart" interval="60"
[...]
clone ClusterMon-clone ClusterMon \
        meta target-role="Started"
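
Before wiring the helper into the cluster, it can be handy to run it by hand with a fake event. The sketch below is one way to do that, under assumptions: simulate_event is a made-up helper name, and all the values (resource name, description, return codes) are invented; real events from crm_mon will carry different values. The variable names are the ones documented in [6].

```shell
#!/bin/bash
# Hypothetical dry-run helper: run an agent with a fake set of CRM_notify_*
# variables, roughly the way crm_mon would. All values below are invented.
# Usage: simulate_event <task> <rc> <agent> [agent args...]

simulate_event() {
    local task="$1" rc="$2"
    shift 2
    env \
        CRM_notify_recipient="snmphost.example.com" \
        CRM_notify_node="$(uname -n)" \
        CRM_notify_rsc="dummy-resource" \
        CRM_notify_task="$task" \
        CRM_notify_desc="simulated event" \
        CRM_notify_rc="$rc" \
        CRM_notify_target_rc="0" \
        CRM_notify_status="0" \
        "$@"
}

# Example (not executed here): pretend a monitor operation failed with rc=7.
# simulate_event monitor 7 /usr/local/bin/pcmk_snmp_helper.sh
```

If the agent behaves as expected with a simulated failed monitor, any remaining problem is more likely in the ClusterMon resource definition than in the script itself.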

[1] – http://linux.die.net/man/7/ocf_heartbeat_clustermon
[2] – http://linux.die.net/man/8/crm_mon
[3] – https://github.com/ClusterLabs/pacemaker/blob/master/extra/pcmk_snmp_helper.sh
[4] – https://github.com/ClusterLabs/pacemaker/blob/master/extra/PCMK-MIB.txt
[5] – http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/ch07.html
[6] – http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-notification-external.html

18 comments so far.

  1. I want to send failover notifications using an external agent, so I am using the https://github.com/ClusterLabs/pacemaker/blob/master/extra/pcmk_snmp_helper.sh file.
    But where does that file’s output go?
    What is PCMK-MIB.txt and what is it used for?

    Please reply…

    Thank You,
    Ranjan

    • Hey Ranjan, everything is in the doc and in the script. If you want notifications but don’t know what a MIB is, I suggest you do some reading first. As for where the “output” goes, it depends on how you configure it; again, that is explained in the script itself and, in more detail, in my blog post.

  2. I looked at PCMK-MIB.txt and saw that a number of objects defined there are used in the .sh file. -e is the option for the external recipient. I put my local network IP there, but on the machine with that IP, where should I check for the notifications/output of the .sh file?
    Please explain

    Thank You,
    Ranjan

  3. I looked at these two locations, but how can I find the snmpd log file?
    vim /etc/sysconfig/snmpd
    vim /usr/sbin/snmpd

    It’s completely new for me

  4. Hello,

    Thank you! This is really helpful, only one question.

    I configured ClusterMon using the example you give here and your script (sh), but only for SMTP mails, and I only receive mails in case of fence events… not when one of the cluster’s services or resources goes down.

    This is how I create my ClusterMon resource:

    primitive ClusterMon ocf:pacemaker:ClusterMon \
    params user="root" update="30" extra_options="-rfocAW -h /var/tmp/crm_mon.html -E /usr/local/bin/pcmk_snmp.sh" \
    op monitor on-fail="restart" interval="60"
    [...]
    clone ClusterMon-clone ClusterMon \
    meta target-role="Started"

    I am doing something wrong, thank you so much!

    • Hey Daniel, I’m not familiar with all the options you are passing to crm_mon in the extra_options field. Maybe remove all of them except -E (obviously) and see if it works for resources going down. If it does (it should) then you can re-enable options one-by-one, probably starting with -W, reading the doc I’m not sure you need the others.

  5. Can the same method be applied to an Ubuntu environment (I’m currently using version 14.04)? Despite successfully creating an ocf:pacemaker:ClusterMon resource, the script is not getting triggered upon event changes. I tried triggering an external script using 'crm_mon -d -E " "' as well, but neither seems to be working.

  6. Hello guys,

    First of all, a big thumbs up for your work, Florian!

    I am experiencing strange behaviour. I can add the ClusterMon resource agent using your code, but it cannot be cloned (nothing happens), it does not execute my script, and the agent entry disappears after a reboot (??) [only this agent, not the usual ones]. By “disappears” I mean it shows up in ‘configure show’, but only until I reboot. When adding the agent, it says “Warning … timeout shorter than…” so I guess it works.

    Do I have to use a compliant .sh? But why does it also disappear…
    Can someone tell me what I am doing wrong? Thank you very much!

    Details
    ——-

    How I add the agent:

    -> crm
    -> configure

    primitive ClusterMon ocf:pacemaker:ClusterMon \
    params user="root" update="30" extra_options="-E /usr/lib/ocf/resource.d/sebastian/pcmkFailureNotifier.sh" \
    op monitor on-fail="restart" interval="30"

    clone ClusterMon-clone ClusterMon \
    meta target-role="Started"

    pcmkFailureNotifier.sh:

    if [[ ${CRM_notify_rc} != 0 && ${CRM_notify_task} == "monitor" ]] || [[ ${CRM_notify_task} != "monitor" ]] ; then
    # Implement the actions you need
    # Like nagios info, email or else
    # here it is a simple echo output

    # If DC is Node A or B
    ssh *user@*Ip 'echo "Node A: An pacemaker monitoring operation failed" >> /home/*name/pcmkFailureReceiver'
    # If DC is Node C
    echo "Node A: An pacemaker monitoring operation failed" >> /home/*name/pcmkFailureReceiver && exit 0 || exit 1
    fi

    exit 0

  7. Hello,
    What I would like to have is for pacemaker to call my script when a cluster is created; not when there is a transition. My script then configures BDR and glusterfs based on the cluster configuration in pacemaker.

    Can someone suggest a way to have a script called when the command ‘pcs cluster create…’ is invoked?

    Thanks

    • Huy, if I understand correctly what you want to achieve, there is a chicken-and-egg situation here: you want your cluster to alert you that the cluster is being created…
      I see a couple of solutions: maybe you could forbid root access and track sudo usage of the “pcs cluster create” command, or use the Linux audit capability?

  8. Hi,

    I am trying to use crm_mon to send some notifications to a different server. Somehow multiple crm_mon instances are getting started. I have used the pidfile option, but it’s not helping.
