Monitoring Windows Reboots through SNMP traps with Nagios
by Frank4DD, @2008

One fine day it happened: Nagios failed to alert us about a server going
down. One of the Windows servers (what else) rebooted for an unknown
reason (what else). Only it happened so darn fast that it almost slipped
through in between the five-minute intervals at which Nagios sends its 'ping'
checks to verify the system is up. It is quite a rare case: only a
single Nagios 'ping' check failed. Because the 'ping' check is set
to re-test two more times, one minute apart, to avoid sending
false alerts, Nagios recorded just one failure and never sent the
necessary notification.
Clearly, interval-based 'ping' monitoring is not perfect, so a better way to monitor these pesky 'secret' Windows reboots is to make them send SNMP traps. Now, at least we will know for sure when they come back up. ;-)

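To see why the reboot slipped through, it helps to put numbers on the check schedule. The following is a rough sketch only, assuming the values described above (a 5-minute check interval, a 1-minute retry interval and 3 attempts in total) and ignoring check timeouts and scheduling latency. The function names are made up for this illustration.

```shell
#!/bin/sh
# How long the host must stay down, counted from the first failed check,
# so that all retries fail as well and Nagios sends a notification.
min_outage_after_first_fail() {
    retry_interval=$1   # seconds between re-checks
    max_attempts=$2     # total attempts, including the first failure
    echo $(( (max_attempts - 1) * retry_interval ))
}

# Worst case: the outage begins right after a successful check, so it must
# also outlast the wait until the next regular check.
worst_case_outage() {
    check_interval=$1
    retry_interval=$2
    max_attempts=$3
    echo $(( check_interval + (max_attempts - 1) * retry_interval ))
}

echo "down time needed after the first failed check: $(min_outage_after_first_fail 60 3)s"
echo "outage length that always triggers an alert:   $(worst_case_outage 300 60 3)s"
```

With these assumed values, a reboot that completes in less than two minutes after the first failed 'ping' never reaches the notification stage.
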
The following examples have been developed and verified under Nagios 3.0.2 running on
SuSE Linux Enterprise Server 10 SP2, receiving traps from Windows 2003 Server and
Windows XP clients. Nagios is installed in /home/app/nagios. This path is
used in all examples below; please adjust it to your [nagioshome].

On the Windows server, we need to have the SNMP service installed. It is available in
the normal Windows package (Add/Remove Windows Components) under Management and
Monitoring Tools. Once installed, we go to
"Start->Settings->Control Panel->Administrative Tools->Services->
SNMP Service->Properties".
I assume SNMP read access is already set up, so for now we are only interested in
SNMP traps. First we go to the "Traps" tab. Following good practice, we configure a
dedicated trap community (different from public) and add the IP address of the SNMP
trap destination server there. Now we can start sending our first test traps. Stopping and
starting the Windows SNMP service will generate some. Let's check which traps are
sent and whether they are received on our trap sink server, using tcpdump:

# tcpdump -s 0 -X udp port 162
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
10:31:11.189693 IP 192.168.203.140.capioverlan > ml08460.frank4dd.com.snmptrap: C=SECtrap
Trap(31) E:311.1.1.3.1.1 192.168.203.140 coldStart 0
0x0000: 4500 004b 07f1 0000 7f11 eb08 0afd cb8c E..K............
0x0010: 0afd 6722 047b 00a2 0037 f289 302d 0201 ..g".{...7..0-..
0x0020: 0004 074e 424e 7472 6170 a41f 060c 2b06 ...SECtrap....+.
0x0030: 0104 0182 3701 0103 0101 4004 0afd cb8c ....7.....@.....
0x0040: 0201 0002 0100 4301 0030 00 ......C..0.
10:31:26.227627 IP 192.168.203.140.capioverlan > ml08460.frank4dd.com.snmptrap: C=SECtrap
Trap(49) E:311.1.1.3.1.1 192.168.203.140 linkUp 1532 interfaces.ifTable.ifEntry.ifIndex.1
=1
0x0000: 4500 005d 07f2 0000 7f11 eaf5 0afd cb8c E..]............
0x0010: 0afd 6722 047b 00a2 0049 503d 303f 0201 ..g".{...IP=0?..
0x0020: 0004 074e 424e 7472 6170 a431 060c 2b06 ...SECtrap.1..+.
0x0030: 0104 0182 3701 0103 0101 4004 0afd cb8c ....7.....@.....
0x0040: 0201 0302 0100 4302 05fc 3011 300f 060a ......C...0.0...
0x0050: 2b06 0102 0102 0201 0101 0201 01 +............
10:31:26.229296 IP 192.168.203.140.capioverlan > ml08460.frank4dd.com.snmptrap: C=SECtrap
Trap(49) E:311.1.1.3.1.1 192.168.203.140 linkUp 1538 interfaces.ifTable.ifEntry.ifIndex.2
=2
0x0000: 4500 005d 07f3 0000 7f11 eaf4 0afd cb8c E..]............
0x0010: 0afd 6722 047b 00a2 0049 4f36 303f 0201 ..g".{...IO60?..
0x0020: 0004 074e 424e 7472 6170 a431 060c 2b06 ...SECtrap.1..+.
0x0030: 0104 0182 3701 0103 0101 4004 0afd cb8c ....7.....@.....
0x0040: 0201 0302 0100 4302 0602 3011 300f 060a ......C...0.0...
0x0050: 2b06 0102 0102 0201 0102 0201 02 +............
10:31:26.229692 IP 192.168.203.140.capioverlan > ml08460.frank4dd.com.snmptrap: C=SECtrap
Trap(49) E:311.1.1.3.1.1 192.168.203.140 linkUp 1538 interfaces.ifTable.ifEntry.ifIndex.3
=3
0x0000: 4500 005d 07f4 0000 7f11 eaf3 0afd cb8c E..]............
0x0010: 0afd 6722 047b 00a2 0049 4e35 303f 0201 ..g".{...IN50?..
0x0020: 0004 074e 424e 7472 6170 a431 060c 2b06 ...SECtrap.1..+.
0x0030: 0104 0182 3701 0103 0101 4004 0afd cb8c ....7.....@.....
0x0040: 0201 0302 0100 4302 0602 3011 300f 060a ......C...0.0...
0x0050: 2b06 0102 0102 0201 0103 0201 03 +............
4 packets captured

We can see that 4 traps were sent when the Windows SNMP service was started. The first
trap packet is a 'coldStart' notification; the following 3 report the "link up" status
of each available network interface (including the loopback interface 127.0.0.1).

For our purpose of testing and receiving traps from Windows systems, we add 2 MIB
files to the library in /usr/share/snmp/mibs. The file MSFT.txt describes the Windows OID
tree, while TRAP-TEST-MIB.txt will help us to generate a test trap later.

# vi /usr/share/snmp/mibs/MSFT.txt
MSFT-MIB DEFINITIONS ::= BEGIN
IMPORTS
    enterprises
        FROM RFC1155-SMI;
microsoft    OBJECT IDENTIFIER ::= { enterprises 311 }
software     OBJECT IDENTIFIER ::= { microsoft 1 }
systems      OBJECT IDENTIFIER ::= { software 1 }
os           OBJECT IDENTIFIER ::= { systems 3 }
windowsNT    OBJECT IDENTIFIER ::= { os 1 }
windows      OBJECT IDENTIFIER ::= { os 2 }
workstation  OBJECT IDENTIFIER ::= { windowsNT 1 }
server       OBJECT IDENTIFIER ::= { windowsNT 2 }
dc           OBJECT IDENTIFIER ::= { windowsNT 3 }
END

# vi /usr/share/snmp/mibs/TRAP-TEST-MIB.txt
TRAP-TEST-MIB DEFINITIONS ::= BEGIN
    IMPORTS ucdExperimental FROM UCD-SNMP-MIB;
demotraps OBJECT IDENTIFIER ::= { ucdExperimental 990 }
demo-trap TRAP-TYPE
    STATUS current
    ENTERPRISE demotraps
    VARIABLES { sysLocation }
    DESCRIPTION "This is just a demo"
    ::= 17
END

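The enterprise value that tcpdump printed for the Windows traps ("E:311.1.1.3.1.1") is a path through exactly this MSFT tree. As a small illustration, the lookup can be written out by hand in shell. This is only a sketch that hard-codes the entries of MSFT.txt; the function name msft_oid_name is invented here, and in practice the MIB library does this translation.

```shell
#!/bin/sh
# Map a numeric Microsoft enterprise OID (as printed by tcpdump, e.g.
# "311.1.1.3.1.1") to the symbolic name defined in MSFT.txt.
# 1.3.6.1.4.1 is the standard "enterprises" subtree.
msft_oid_name() {
    oid=${1#.}                  # tolerate a leading dot
    oid=${oid#1.3.6.1.4.1.}     # tolerate the full enterprises prefix
    case "$oid" in
        311)            echo "MSFT-MIB::microsoft" ;;
        311.1)          echo "MSFT-MIB::software" ;;
        311.1.1)        echo "MSFT-MIB::systems" ;;
        311.1.1.3)      echo "MSFT-MIB::os" ;;
        311.1.1.3.1)    echo "MSFT-MIB::windowsNT" ;;
        311.1.1.3.2)    echo "MSFT-MIB::windows" ;;
        311.1.1.3.1.1)  echo "MSFT-MIB::workstation" ;;
        311.1.1.3.1.2)  echo "MSFT-MIB::server" ;;
        311.1.1.3.1.3)  echo "MSFT-MIB::dc" ;;
        *)              echo "unknown"; return 1 ;;
    esac
}

# The enterprise OID seen in the tcpdump capture
msft_oid_name 311.1.1.3.1.1
```
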
Next, we configure the 'snmptrapd' daemon. Although the daemon comes with the SNMP
daemon package and is installed in /usr/sbin, no startup script has been put into
/etc/init.d. Fortunately, there is a template in /usr/share/doc/packages/net-snmp.

# cp /usr/share/doc/packages/net-snmp/rc.snmptrapd /etc/init.d/snmptrapd
# vi /etc/init.d/snmptrapd
OPTIONS="-On -p /var/run/snmptrapd.pid -M /usr/share/snmp/mibs -m ALL"
change:
startproc $SNMPTRAPD $OPTIONS -c /etc/snmptrapd.conf -Lf /var/log/net-snmpd.log
to:
startproc $SNMPTRAPD $OPTIONS -c /etc/snmp/snmptrapd.conf -Lf /var/log/net-snmpd.log

Now we create the configuration file for the 'snmptrapd' daemon. We define the trap
community for simple access control and add a 'default' trap handler that passes all
traps to a test script we are going to create. Then we enable and start 'snmptrapd'
through yast->system->system services (runlevel)->enable snmptrapd for
runlevels 2, 3 and 5.

# vi /etc/snmp/snmptrapd.conf
# --------------------------------------------------------------------------- #
# snmptrapd.conf:                                                             #
# configuration file for configuring the ucd-snmp snmptrapd agent.            #
# --------------------------------------------------------------------------- #
# first, we define the access control
authCommunity log,execute,net SECtrap
# next, the trap handlers
traphandle default /tmp/snmptraptest.sh
# END of snmptrapd.conf ----------------------------------------------------- #

Let's create a simple test script snmptraptest.sh that
writes all received SNMP traps into a log file.

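# vi /tmp/snmptraptest.sh
#!/bin/sh
# Append every received SNMP trap to a log file, one line per trap.
TESTLOG=/tmp/traptest.log
vars=
read host                 # line 1: host name of the sender
read ip                   # line 2: transport address of the sender
while read oid val; do    # remaining lines: one "OID value" pair each
  if [ "$vars" = "" ]; then
    vars="$oid = $val"
  else
    vars="$vars, $oid = $val"
  fi
done
# create the log file if it does not exist yet
if [ ! -e $TESTLOG ]; then
  touch $TESTLOG
fi
echo trap: $1 $host $ip $vars >> $TESTLOG
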
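The parsing loop can be tried out without a running 'snmptrapd' by piping a hand-made trap into it, since 'snmptrapd' hands the trap to the handler on standard input: first the host name, then the transport address, then one "OID value" pair per line. The sketch below wraps the same loop in a function (format_trap is a name chosen for this example) that prints to standard output instead of appending to the log file.

```shell
#!/bin/sh
# Same parsing loop as snmptraptest.sh, wrapped in a function that prints
# the result instead of appending it to a log file.
format_trap() {
    vars=
    read -r host                # line 1: host name of the sender
    read -r ip                  # line 2: transport address
    while read -r oid val; do   # remaining lines: "OID value"
        if [ -z "$vars" ]; then
            vars="$oid = $val"
        else
            vars="$vars, $oid = $val"
        fi
    done
    echo "trap: $host $ip $vars"
}

# Feed a hand-made trap in the layout snmptrapd uses on stdin.
printf '%s\n' \
    'localhost' \
    'UDP: [127.0.0.1]:42706' \
    'SNMPv2-MIB::snmpTrapOID.0 TRAP-TEST-MIB::demo-trap' \
    'SNMPv2-MIB::sysLocation.0 here' | format_trap
```
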
We are ready for our first test from the local system, using the 'snmptrap' command
to verify that traps are received and processed by our test script. Also notice the
use of the TRAP-TEST-MIB we created.

# snmptrap -v 2c -c SECtrap 127.0.0.1 "" TRAP-TEST-MIB::demo-trap SNMPv2-MIB::sysLocation.0 s "here"
# cat /tmp/traptest.log
trap: localhost UDP: [127.0.0.1]:42706 DISMAN-EVENT-MIB::sysUpTimeInstance = 6:4:53:38.72, SNMPv2-MIB::snmpTrapOID.0 = TRAP-TEST-MIB::demo-trap, SNMPv2-MIB::sysLocation.0 = here
trap: localhost UDP: [127.0.0.1]:42706 DISMAN-EVENT-MIB::sysUpTimeInstance = 6:4:53:38.72, SNMPv2-MIB::snmpTrapOID.0 = TRAP-TEST-MIB::demo-trap, SNMPv2-MIB::sysLocation.0 = here

Well, we really are receiving traps, but why are we getting them twice? Let's check whether
our 'snmptraptest.sh' script is called twice. We change the last line, which writes the
output, to include a random string and give it another try.

# vi /tmp/snmptraptest.sh
change:
echo trap: $1 $host $ip $vars >> $TESTLOG
to:
echo `/usr/bin/openssl rand -base64 20` trap: $1 $host $ip $vars >> $TESTLOG
# snmptrap -v 2c -c SECtrap 127.0.0.1 "" TRAP-TEST-MIB::demo-trap SNMPv2-MIB::sysLocation.0 s "here"
# cat /tmp/traptest.log
vRgoIkp7Y/66EyxK6fETsR7lqhY= trap: localhost UDP: [127.0.0.1]:58476 DISMAN-EVENT-MIB::sysUpTimeInstance = 6:20:16:06.35, SNMPv2-MIB::snmpTrapOID.0 = TRAP-TEST-MIB::demo-trap, SNMPv2-MIB::sysLocation.0 = here
aRsf084ZC/fcJqeOCjFRH/SCNdI= trap: localhost UDP: [127.0.0.1]:58476 DISMAN-EVENT-MIB::sysUpTimeInstance = 6:20:16:06.35, SNMPv2-MIB::snmpTrapOID.0 = TRAP-TEST-MIB::demo-trap, SNMPv2-MIB::sysLocation.0 = here

Yep, the random string is different, so the script is indeed being called twice! Further
investigation shows that 'snmptrapd' is compiled with '/etc/snmp/snmptrapd.conf' already
set as its default configuration file path. Setting it explicitly again with the '-c'
option in '/etc/init.d/snmptrapd' causes the file to be read twice, so the trap handler
is executed twice. Feature or bug? No matter, we need to remove the '-c' option from
'/etc/init.d/snmptrapd'. Re-test, check, problem solved.

# vi /etc/init.d/snmptrapd
change:
startproc $SNMPTRAPD $OPTIONS -c /etc/snmp/snmptrapd.conf -Lf /var/log/net-snmpd.log
to:
startproc $SNMPTRAPD $OPTIONS -Lf /var/log/net-snmpd.log

Now that we are able to reliably receive SNMP traps, it's time to be selective about them.
This is achieved by defining an explicit snmpTrapOID value match in
'/etc/snmp/snmptrapd.conf'. Let's say we only care about the Windows 'coldStart' traps:
our match is then the trap carrying the OID=value pair 'SNMPv2-MIB::snmpTrapOID.0 =
SNMPv2-MIB::coldStart'. Then we restart the Windows SNMP service once more and verify
that the trap data is received. This time we recorded just a single trap in '/tmp/traptest.log'.

# vi /etc/snmp/snmptrapd.conf
change:
traphandle default /tmp/snmptraptest.sh
to:
traphandle SNMPv2-MIB::coldStart /tmp/snmptraptest.sh
# /etc/init.d/snmptrapd restart
# cat /tmp/traptest.log
trap: 192.168.203.140 UDP: [192.168.203.140]:1074 DISMAN-EVENT-MIB::sysUpTimeInstance = 0:0:00:00.00, SNMPv2-MIB::snmpTrapOID.0 = SNMPv2-MIB::coldStart, SNMP-COMMUNITY-MIB::snmpTrapAddress.0 = 192.168.203.140, SNMP-COMMUNITY-MIB::snmpTrapCommunity.0 = "SECtrap", SNMPv2-MIB::snmpTrapEnterprise.0 = MSFT-MIB::workstation

Nagios can be set to receive and process data sent from external programs. Let's
verify that the related directives are enabled and set in the Nagios configuration file:

# egrep 'check_external_commands|command_check_interval|command_file' /home/app/nagios/etc/nagios.cfg

The Nagios data-receiving part is the named pipe '/home/app/nagios/var/rw/nagios.cmd'.
The format of the Nagios event to send to '/home/app/nagios/var/rw/nagios.cmd' is:

[Unix Timestamp] Message Descriptor;host name;service-name;severity-code;text data

Example:
[1141163054] PROCESS_SERVICE_CHECK_RESULT;ml08460;check_trap_ml08460;1;Trap test data

We can now send a test event to Nagios to see if it is received properly:

# echo "`date +[%s]` PROCESS_SERVICE_CHECK_RESULT;testserver;check_trap_test;1;test" > /home/app/nagios/var/rw/nagios.cmd
# tail /home/app/nagios/var/nagios.log | grep trap
[1224133947] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;testserver;check_trap_test;1;test
[1224133947] Warning: Passive check result was received for service 'check_trap_test' on host 'testserver', but the host could not be found!

The warning is expected at this point: Nagios received the event, but 'testserver' is not defined as a host.

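The event format is easy to get wrong by hand, so it can help to build the line in one place. The following is a minimal sketch, not part of the original setup: format_check_result and submit_check_result are hypothetical helper names, and the timestamp is passed in as a parameter so that the example output is reproducible.

```shell
#!/bin/sh
# Build one passive service check result in the external command format:
# [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;severity;text
format_check_result() {
    ts=$1; host=$2; service=$3; code=$4; text=$5
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$ts" "$host" "$service" "$code" "$text"
}

# Write the event to the Nagios command file, but only if it really is a
# named pipe; otherwise complain instead of creating a plain file.
submit_check_result() {
    cmdfile=$1; shift
    if [ -p "$cmdfile" ]; then
        format_check_result "$(date +%s)" "$@" > "$cmdfile"
    else
        echo "command file $cmdfile is not a named pipe" >&2
        return 1
    fi
}

# Reproduces the example event shown above
format_check_result 1141163054 ml08460 check_trap_ml08460 1 "Trap test data"
```

A live submission would then look like `submit_check_result /home/app/nagios/var/rw/nagios.cmd testserver check_trap_test 1 test`.
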
It is time to think of a program that translates our SNMP trap into a Nagios event and
sends it to Nagios through its command file. We want this trap service for Windows
reboots associated with each Nagios host, so that a separate notification can go
to the appropriate host support team. We also want the severity code set to warning, but
without having to confirm the event by hand. Instead, we want the event to be cleared
quickly to the OK state, and no notification should go out for this auto-confirmation.

The association with a Nagios host requires us to derive the correct host name from the trap IP. The auto-confirmation is made by a second, slightly delayed event submission with severity code '0'. The notification for OK is disabled in the service template.

I programmed and named this program send_trap_data.pl, then put it into my [nagioshome]/libexec directory. If the DEBUG option is set to 1, the program writes some parameters and the submitted Nagios events into a temp file. Let's enable 'send_trap_data.pl' to start processing incoming SNMP traps for Nagios:

# vi /etc/snmp/snmptrapd.conf
change:
traphandle SNMPv2-MIB::coldStart /tmp/snmptraptest.sh
to:
# traphandle SNMPv2-MIB::coldStart /tmp/snmptraptest.sh
traphandle SNMPv2-MIB::coldStart /home/app/nagios/libexec/send_trap_data.pl
# /etc/init.d/snmptrapd restart
# cat /tmp/test3
trapline >proxyjp02.frank4dd.com
UDP: [192.168.100.184]:12380
DISMAN-EVENT-MIB::sysUpTimeInstance 0:0:00:00.00
SNMPv2-MIB::snmpTrapOID.0 SNMPv2-MIB::coldStart
SNMP-COMMUNITY-MIB::snmpTrapAddress.0 192.168.100.184
SNMP-COMMUNITY-MIB::snmpTrapCommunity.0 "SECtrap"
SNMPv2-MIB::snmpTrapEnterprise.0 MSFT-MIB::server
<
traphost >proxyjp02.frank4dd.com<
snmpname >SNMPv2-MIB::sysName.0 = STRING: JPNHOMG035<
hostname >jpnhomg035<
eventstr >[1224478743] PROCESS_SERVICE_CHECK_RESULT;jpnhomg035;check_trap_coldstart;1;System *reboot* or SNMP service restarted.<
Wrote eventstr to /home/app/nagios/var/rw/nagios.cmd
eventstr >[1224478743] PROCESS_SERVICE_CHECK_RESULT;jpnhomg035;check_trap_coldstart;0;System *reboot* or SNMP service restarted. auto-OK<
Wrote eventstr to /home/app/nagios/var/rw/nagios.cmd
End of send_trap_data.pl.

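The source of send_trap_data.pl is not reproduced here. As a rough illustration of the steps visible in the debug output, the following shell sketch derives the Nagios host name from a sysName answer and builds the warning event plus the auto-OK event. It is not the author's program: the function names, the 5-second default delay, and the rule "sysName value without domain, in lowercase" are assumptions made for this example.

```shell
#!/bin/sh
# Turn an SNMP answer such as
#   "SNMPv2-MIB::sysName.0 = STRING: JPNHOMG035"
# into a Nagios host name: take the value, drop any domain part, lowercase it.
nagios_host_from_sysname() {
    value=${1##*STRING: }
    value=${value%%.*}
    printf '%s\n' "$value" | tr '[:upper:]' '[:lower:]'
}

# Print the WARNING event and the auto-OK event for one coldStart trap.
coldstart_events() {
    ts=$1; host=$2
    msg="System *reboot* or SNMP service restarted."
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;check_trap_coldstart;1;%s\n' \
        "$ts" "$host" "$msg"
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;check_trap_coldstart;0;%s auto-OK\n' \
        "$ts" "$host" "$msg"
}

# Submit both events to the command file, the second one after a short
# delay (assumed default: 5 seconds).
submit_coldstart() {
    cmdfile=$1; host=$2; delay=${3:-5}
    coldstart_events "$(date +%s)" "$host" | {
        read -r warning
        read -r ok
        printf '%s\n' "$warning" >> "$cmdfile"
        sleep "$delay"
        printf '%s\n' "$ok" >> "$cmdfile"
    }
}

host=$(nagios_host_from_sysname "SNMPv2-MIB::sysName.0 = STRING: JPNHOMG035")
coldstart_events 1224478743 "$host"
```

Because the OK event follows the warning so quickly, the service template has to suppress notifications for the OK state, as described above.
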
Finally, on the Nagios side, we define a service template and the services that use it.
Each host needs its own trap service, and services that must notify different contacts
have to be defined separately.

# vi /home/app/nagios/etc/nagios.cfg
# passive service check for SNMP traps
cfg_file=/home/app/nagios/etc/objects/trap-services-template.cfg
cfg_file=/home/app/nagios/etc/objects/trap-services.cfg

# vi /home/app/nagios/etc/objects/trap-services-template.cfg
###############################################################################
# Define a servicegroup for SNMP trap service checks
# All SNMP trap service checks will be members of this group
###############################################################################
define servicegroup{
        servicegroup_name               snmptrap-checks    ; The name of the servicegroup
        alias                           SNMP Trap Services ; Long name of the group
        }
###############################################################################
# Define the SNMP trap check template service
###############################################################################
define service{
        name                            generic-trap
        active_checks_enabled           0       ; traps are only passive checks
        passive_checks_enabled          1       ; yes, check passive
        parallelize_check               1       ; yes, please
        obsess_over_service             0       ; we don't run extra commands
        check_freshness                 0       ; don't check for freshness
        notifications_enabled           1       ; send notifications
        event_handler_enabled           1       ; yes, but we have none
        flap_detection_enabled          0       ; with auto-OK, we don't
        failure_prediction_enabled      1       ; dependency checks
        process_perf_data               0       ; don't send this to perfdata
        retain_status_information       1       ; yes, once auto-OK'ed, keep it
        retain_nonstatus_information    1
        is_volatile                     1       ; enable for passive checks
        check_period                    24x7    ; always check for submissions
        max_check_attempts              1       ; one trap is enough
        normal_check_interval           1
        retry_check_interval            1
        contact_groups                  frankonly
        notification_options            w       ; notify for warnings only
        notification_interval           120     ; notify every 2 hrs
        notification_period             24x7    ; always notify
        register                        0       ; template, don't register
        servicegroups                   snmptrap-checks
        check_command                   check_none ; we do not run any checks
        }
# vi /home/app/nagios/etc/objects/trap-services.cfg
###############################################################################
# Receive SNMP traps for Windows boot events via the trap handler script
###############################################################################
define service {
        use                             generic-trap
        host_name                       jpnhomg035
        name                            check_trap_coldstart
        service_description             check_trap_coldstart
        }
###############################################################################
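
Since every monitored Windows host needs its own copy of this service definition, the entries for trap-services.cfg can be generated rather than typed. The following is a small sketch under that assumption; trap_service_defs is a name invented for this example, and the host list is only illustrative.

```shell
#!/bin/sh
# Print one coldStart trap service definition per host name given as
# argument, based on the generic-trap template.
trap_service_defs() {
    for host in "$@"; do
        cat <<EOF
define service {
        use                             generic-trap
        host_name                       $host
        service_description             check_trap_coldstart
        }

EOF
    done
}

# Example: definitions for two hosts, printed to stdout
trap_service_defs jpnhomg035 testserver
```

The output can be redirected into the trap-services.cfg file; the host names must match hosts already defined in Nagios, otherwise the configuration check will fail.
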


