Tuesday, January 5, 2010

Nagios, Passive Service and Host Checks

I recently had to set up a remote Nagios installation that would report back to a central server via NSCA. I followed the distributed monitoring section of the standard Nagios documentation. The passive service checks worked fine.

The problem was that I wanted passive checks for hosts too, not just services, because the central server cannot ping the remote hosts directly when their data goes stale. On the central server I set every check command to the service-is-stale check, so any staleness results in an alert: "the check data is stale." Unfortunately, the Nagios documentation is a little vague about the ochp_command option (it only comes up in the final paragraph of the distributed monitoring page). I couldn't find a real answer on passive host checks on the web, either.
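For reference, a passive-only host on the central server might look roughly like the fragment below. This is a sketch, not a copy of a working config: the host name, address, the generic-host template and the 600-second threshold are all placeholders, and service-is-stale is assumed to be defined as a command the way the distributed monitoring documentation describes.

```
# Central server: accept passive results only, and fall back to the
# stale check if nothing arrives within freshness_threshold seconds.
# host_name, address, template and threshold are placeholders.
define host{
    use                     generic-host
    host_name               remote-web01
    address                 192.0.2.10
    active_checks_enabled   0
    passive_checks_enabled  1
    check_freshness         1
    freshness_threshold     600
    check_command           service-is-stale
}
```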

Here's what I did on the remote side (not on the central collector server). I copied the submit_check_result script from the documentation, modified it, and saved it as /etc/nagios/bin/submit_host_result. The final version:

#!/bin/sh
# Arguments:
# $1 = host_name (short name of the host being reported)
# $2 = host state ID (0 = UP, 1 = DOWN, 2 = UNREACHABLE)
# $3 = plugin_output (a text string that should be used
# as the plugin output for the host check)
#
# $central_server stands for the address of the central Nagios
# server; substitute your own.

/usr/bin/printf "%s\t%s\t%s\n" "$1" "$2" "$3" | /usr/sbin/send_nsca -H "$central_server" -c /etc/nagios/send_nsca.cfg
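If you want to check the line format before involving send_nsca at all, something like the following dry run can help. It is only a sketch: format_host_result and show_fields are helper names I made up for illustration, and the host name and output text are sample values. It builds the same tab-separated line as the script above and prints it with the tabs made visible.

```shell
#!/bin/sh
# Dry-run sketch: build the line that would be piped to send_nsca,
# but print it instead of sending it.
# format_host_result and show_fields are hypothetical helpers,
# not part of Nagios or NSCA.

format_host_result() {
    # $1 = host_name, $2 = host state ID (0, 1 or 2), $3 = plugin output
    printf '%s\t%s\t%s\n' "$1" "$2" "$3"
}

show_fields() {
    # Replace each literal tab with the text <TAB> so the three
    # fields are easy to check by eye.
    sed "s/$(printf '\t')/<TAB>/g"
}

# Sample values only.
format_host_result "web01" 1 "CRITICAL - Host Unreachable" | show_fields
# prints: web01<TAB>1<TAB>CRITICAL - Host Unreachable
```

A host result has three fields (host name, state, output), which is how the central server tells it apart from a service result, which carries a fourth field for the service description.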

I then added the following entry to the command definition file:

define command{
    command_name submit_host_result
    command_line /etc/nagios/bin/submit_host_result $HOSTNAME$ $HOSTSTATEID$ '$HOSTOUTPUT$'
}

I then modified the nagios.cfg file like so:

obsess_over_hosts=1

# This is the command that is run for every host check that is
# processed by Nagios. This command is executed only if the
# obsess_over_hosts option (above) is set to 1. The command
# argument is the short name of a command definition that you
# define in your host configuration file. Read the HTML docs for
# more information on implementing distributed monitoring.
ochp_command=submit_host_result

Of course, it took a bit of work to figure that out. The end result is that both service and host checks are passive on the central server. You might want to make the remote Nagios server the parent of all the other remote hosts: if it is down or inaccessible, there is no way you'll receive check data for the other hosts, and without the parent relationship you'll probably get some unnecessary alerts. I'm sure I'll run into more issues, and will likely post about this again.
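The parent relationship mentioned above could be expressed on the central server roughly as follows. Again, this is only a sketch: remote-nagios and remote-web01 are placeholder host names, the address is a documentation-range example, and generic-host is assumed to be a template you already have.

```
# Central server: mark the remote Nagios box as the parent, so the other
# remote hosts become UNREACHABLE rather than DOWN when it goes away.
# All names and the address are placeholders.
define host{
    use         generic-host
    host_name   remote-web01
    address     192.0.2.10
    parents     remote-nagios
}
```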