Monitoring Windows Reboots through SNMP traps with Nagios

by Frank4DD, @2008

One fine day it happened: Nagios failed to alert us that a server had gone down. One of the Windows servers (what else) rebooted for an unknown reason (what else). It happened so fast that the outage fell almost entirely between the five-minute intervals at which Nagios sends its 'ping' checks to verify the system is up. It is quite a rare case: only a single Nagios 'ping' check failed. Because the 'ping' check is set to re-test two more times at one-minute intervals to avoid false alerts, Nagios recorded just one failure and never sent a notification.
Clearly, periodic 'ping' polling is not perfect, so a better way to catch these pesky 'secret' Windows reboots is to make the servers send SNMP traps. Now, at least we will know for sure when they come back up. ;-)
The following examples were developed and verified under Nagios 3.0.2 running on SuSE Linux Enterprise Server 10 SP2, receiving traps from Windows 2003 Server and Windows XP clients. Nagios was installed into /home/app/nagios. This path is used in all examples below; please adjust it to your [nagioshome].

  1. The 'Sending' part: Generating SNMP traps from Windows
    On the Windows server, we need to have the SNMP service installed. It is available in the normal Windows package (Add/Remove Windows Components) under Management and Monitoring Tools. Once installed, we go to "Start->Settings->Control Panel->Administrative Tools->Services->SNMP Service->Properties". I assume SNMP read access is already set up, so for now we are only interested in SNMP traps. First we go to the "Traps" tab. Following good practice, we configure a dedicated trap community (different from public) and add the IP address of the SNMP trap destination server there. Now we can start sending our first test traps. Stopping and starting the Windows SNMP service will generate some. Let's check which traps were sent and whether they are received on our trap sink server, using tcpdump:
    # tcpdump -s 0 -X udp port 162
    listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
    
    10:31:11.189693 IP 192.168.203.140.capioverlan > ml08460.frank4dd.com.snmptrap: C=SECtrap
    Trap(31) E:311.1.1.3.1.1 192.168.203.140 coldStart 0
            0x0000:  4500 004b 07f1 0000 7f11 eb08 0afd cb8c  E..K............
            0x0010:  0afd 6722 047b 00a2 0037 f289 302d 0201  ..g".{...7..0-..
            0x0020:  0004 074e 424e 7472 6170 a41f 060c 2b06  ...SECtrap....+.
            0x0030:  0104 0182 3701 0103 0101 4004 0afd cb8c  ....7.....@.....
            0x0040:  0201 0002 0100 4301 0030 00              ......C..0.
    
    10:31:26.227627 IP 192.168.203.140.capioverlan > ml08460.frank4dd.com.snmptrap: C=SECtrap
    Trap(49) E:311.1.1.3.1.1 192.168.203.140 linkUp 1532 interfaces.ifTable.ifEntry.ifIndex.1
    =1
            0x0000:  4500 005d 07f2 0000 7f11 eaf5 0afd cb8c  E..]............
            0x0010:  0afd 6722 047b 00a2 0049 503d 303f 0201  ..g".{...IP=0?..
            0x0020:  0004 074e 424e 7472 6170 a431 060c 2b06  ...SECtrap.1..+.
            0x0030:  0104 0182 3701 0103 0101 4004 0afd cb8c  ....7.....@.....
            0x0040:  0201 0302 0100 4302 05fc 3011 300f 060a  ......C...0.0...
            0x0050:  2b06 0102 0102 0201 0101 0201 01         +............
    
    10:31:26.229296 IP 192.168.203.140.capioverlan > ml08460.frank4dd.com.snmptrap: C=SECtrap
    Trap(49) E:311.1.1.3.1.1 192.168.203.140 linkUp 1538 interfaces.ifTable.ifEntry.ifIndex.2
    =2
            0x0000:  4500 005d 07f3 0000 7f11 eaf4 0afd cb8c  E..]............
            0x0010:  0afd 6722 047b 00a2 0049 4f36 303f 0201  ..g".{...IO60?..
            0x0020:  0004 074e 424e 7472 6170 a431 060c 2b06  ...SECtrap.1..+.
            0x0030:  0104 0182 3701 0103 0101 4004 0afd cb8c  ....7.....@.....
            0x0040:  0201 0302 0100 4302 0602 3011 300f 060a  ......C...0.0...
            0x0050:  2b06 0102 0102 0201 0102 0201 02         +............
    
    10:31:26.229692 IP 192.168.203.140.capioverlan > ml08460.frank4dd.com.snmptrap: C=SECtrap
    Trap(49) E:311.1.1.3.1.1 192.168.203.140 linkUp 1538 interfaces.ifTable.ifEntry.ifIndex.3
    =3
            0x0000:  4500 005d 07f4 0000 7f11 eaf3 0afd cb8c  E..]............
            0x0010:  0afd 6722 047b 00a2 0049 4e35 303f 0201  ..g".{...IN50?..
            0x0020:  0004 074e 424e 7472 6170 a431 060c 2b06  ...SECtrap.1..+.
            0x0030:  0104 0182 3701 0103 0101 4004 0afd cb8c  ....7.....@.....
            0x0040:  0201 0302 0100 4302 0602 3011 300f 060a  ......C...0.0...
            0x0050:  2b06 0102 0102 0201 0103 0201 03         +............
    
    4 packets captured
    We can see that four traps were sent when the Windows SNMP service started. The first trap packet is a 'coldStart' notification; the following three report the "link up" status of each available network interface (including the loopback interface, 127.0.0.1).

  2. The 'Receiving' part: Picking up the SNMP traps using the 'snmptrapd' daemon
    For testing and receiving traps from Windows systems, we add two MIB files to the library in /usr/share/snmp/mibs. The file MSFT.txt describes the Windows OID tree, while TRAP-TEST-MIB.txt will help us generate a test trap later.
    # vi /usr/share/snmp/mibs/MSFT.txt
    MSFT-MIB DEFINITIONS ::= BEGIN
    
    
    IMPORTS
        enterprises
            FROM RFC1155-SMI;
    
    microsoft       OBJECT IDENTIFIER ::= { enterprises 311 }
    software        OBJECT IDENTIFIER ::= { microsoft 1 }
    systems         OBJECT IDENTIFIER ::= { software 1 }
    os              OBJECT IDENTIFIER ::= { systems 3 }
    windowsNT       OBJECT IDENTIFIER ::= { os 1 }
    windows         OBJECT IDENTIFIER ::= { os 2 }
    workstation     OBJECT IDENTIFIER ::= { windowsNT 1 }
    server          OBJECT IDENTIFIER ::= { windowsNT 2 }
    dc              OBJECT IDENTIFIER ::= { windowsNT 3 }
    
    END
    
    # vi /usr/share/snmp/mibs/TRAP-TEST-MIB.txt
    TRAP-TEST-MIB DEFINITIONS ::= BEGIN
            IMPORTS ucdExperimental FROM UCD-SNMP-MIB;
    
    demotraps OBJECT IDENTIFIER ::= { ucdExperimental 990 }
    
    demo-trap TRAP-TYPE
            STATUS current
            ENTERPRISE demotraps
            VARIABLES { sysLocation }
            DESCRIPTION "This is just a demo"
            ::= 17
    
    END
    Next, we configure the 'snmptrapd' daemon. Although the daemon comes with the SNMP daemon package and is installed in /usr/sbin, no startup script has been put into /etc/init.d. Fortunately, there is a template in /usr/share/doc/packages/net-snmp.
    # cp /usr/share/doc/packages/net-snmp/rc.snmptrapd /etc/init.d/snmptrapd
    
    # vi /etc/init.d/snmptrapd
    
    OPTIONS="-On -p /var/run/snmptrapd.pid -M /usr/share/snmp/mibs -m ALL"
    
    change:
    startproc $SNMPTRAPD $OPTIONS -c /etc/snmptrapd.conf -Lf /var/log/net-snmpd.log
    to:
    startproc $SNMPTRAPD $OPTIONS -c /etc/snmp/snmptrapd.conf -Lf /var/log/net-snmpd.log
    Now we create the configuration file for the 'snmptrapd' daemon. We define the trap community for simple access control and we add a trap handler 'default' to handle all traps by a test script we are going to create. Then we enable and start 'snmptrapd' through yast->system->system services (runlevel)-> enable snmptrapd for runlevel 2 3 5.
    # vi /etc/snmp/snmptrapd.conf
    
    # --------------------------------------------------------------------------- #
    # snmptrapd.conf:                                                             #
    #    configuration file for configuring the ucd-snmp snmptrapd agent.         #
    # ----------------------------------------------------------------------------#
    
    # first, we define the access control
    authCommunity log,execute,net SECtrap
    
    # next , the trap handlers
    traphandle      default                                 /tmp/snmptraptest.sh
    # END of snmptrapd.conf ---------------------------------------------------- #

  3. The 'Testing' part: Learning to send, receive and filter SNMP traps
    Let's create a simple test script snmptraptest.sh that writes all received SNMP traps into a log file.
    # vi /tmp/snmptraptest.sh
    
    #!/bin/sh
    
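    TESTLOG=/tmp/traptest.log
    vars=
    
    # snmptrapd passes the trap on stdin: the sender's host name, then its
    # transport address, then one "OID value" pair per line
    read host
    read ip
    
    while read oid val; do
      if [ "$vars" = "" ]; then
        vars="$oid = $val"
      else
        vars="$vars, $oid = $val"
      fi
    done
    
    # create the log file if it does not exist yet
    if [ ! -f "$TESTLOG" ]; then
      touch "$TESTLOG"
    fi
    
    echo trap: $1 $host $ip $vars >> "$TESTLOG"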
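    The read loop can also be exercised by hand before snmptrapd is involved, because a trap handler simply receives text on standard input. The following is a minimal sketch under that assumption; the function names format_trap and sample_trap are made up for illustration, and the sample lines are modelled on the kind of input the script above expects.

```shell
#!/bin/sh
# format_trap: hypothetical helper that mirrors the read loop of
# snmptraptest.sh, but prints the result instead of appending to a log.
format_trap() {
  vars=
  read -r host          # first line: host name of the sender
  read -r ip            # second line: transport address
  while read -r oid val; do
    if [ -z "$vars" ]; then
      vars="$oid = $val"
    else
      vars="$vars, $oid = $val"
    fi
  done
  echo "trap: $host $ip $vars"
}

# sample_trap: hypothetical stand-in for the input a handler receives.
sample_trap() {
  printf '%s\n' \
    'localhost' \
    'UDP: [127.0.0.1]:42706' \
    'SNMPv2-MIB::snmpTrapOID.0 TRAP-TEST-MIB::demo-trap' \
    'SNMPv2-MIB::sysLocation.0 here'
}

sample_trap | format_trap
```

    Run this way, the sketch prints one "trap:" line with the varbinds joined by commas, which is the same shape the log entries take.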
    We are ready for our first test from the local system, using the 'snmptrap' command, verifying the traps are received and processed by our test script. Also notice the use of the TRAP-TEST-MIB we generated.
    # snmptrap -v 2c -c SECtrap 127.0.0.1 "" TRAP-TEST-MIB::demo-trap SNMPv2-MIB::sysLocation.0 s "here"
    
    # cat /tmp/traptest.log
    trap: localhost UDP: [127.0.0.1]:42706 DISMAN-EVENT-MIB::sysUpTimeInstance = 6:4:53:38.72,
     SNMPv2-MIB::snmpTrapOID.0 = TRAP-TEST-MIB::demo-trap, SNMPv2-MIB::sysLocation.0 = here
    trap: localhost UDP: [127.0.0.1]:42706 DISMAN-EVENT-MIB::sysUpTimeInstance = 6:4:53:38.72,
     SNMPv2-MIB::snmpTrapOID.0 = TRAP-TEST-MIB::demo-trap, SNMPv2-MIB::sysLocation.0 = here
    Well, we really are receiving traps, but why are we getting them twice? Let's check if our 'snmptraptest.sh' script is called twice. We can change the last line writing the output to include a random string and give it another try.
    # vi /tmp/snmptraptest.sh
    
    change:
    echo trap: $1 $host $ip $vars >> $TESTLOG
    to:
    echo `/usr/bin/openssl rand -base64 20` trap: $1 $host $ip $vars >> $TESTLOG
    
    # snmptrap -v 2c -c SECtrap 127.0.0.1 "" TRAP-TEST-MIB::demo-trap SNMPv2-MIB::sysLocation.0 s "here"
    
    # cat /tmp/traptest.log
    vRgoIkp7Y/66EyxK6fETsR7lqhY= trap: localhost UDP: [127.0.0.1]:58476 DISMAN-EVENT-MIB::sys
    UpTimeInstance = 6:20:16:06.35, SNMPv2-MIB::snmpTrapOID.0 = TRAP-TEST-MIB::demo-trap, SNM
    Pv2-MIB::sysLocation.0 = here
    aRsf084ZC/fcJqeOCjFRH/SCNdI= trap: localhost UDP: [127.0.0.1]:58476 DISMAN-EVENT-MIB::sys
    UpTimeInstance = 6:20:16:06.35, SNMPv2-MIB::snmpTrapOID.0 = TRAP-TEST-MIB::demo-trap, SNM
    Pv2-MIB::sysLocation.0 = here
    Yep, the random string is different, so the script is indeed being called twice! Further investigation shows that 'snmptrapd' is compiled with the default configuration file path already set to '/etc/snmp/snmptrapd.conf'. Setting it explicitly with the '-c' option in '/etc/init.d/snmptrapd' causes the file to be read twice, which registers the trap handler twice. Feature or bug? No matter, we need to remove the '-c' option from '/etc/init.d/snmptrapd'. Re-test, check, problem solved.
    # vi /etc/init.d/snmptrapd
    
    change:
    startproc $SNMPTRAPD $OPTIONS -c /etc/snmp/snmptrapd.conf -Lf /var/log/net-snmpd.log
    to:
    startproc $SNMPTRAPD $OPTIONS -Lf /var/log/net-snmpd.log
    Now that we can reliably receive SNMP traps, it's time to be selective about them. This is achieved by defining an explicit snmpTrapOID value match in '/etc/snmp/snmptrapd.conf'. Let's say we only care about the Windows 'coldStart' traps: our match is the trap carrying the oid=value pair 'SNMPv2-MIB::snmpTrapOID.0 = SNMPv2-MIB::coldStart'. Then we restart the Windows SNMP service once more and verify that the trap data is received. This time we recorded just a single trap in '/tmp/traptest.log'.
    # vi /etc/snmp/snmptrapd.conf
    
    change:
    traphandle      default                                 /tmp/snmptraptest.sh
    to:
    traphandle      SNMPv2-MIB::coldStart                   /tmp/snmptraptest.sh
    
    # /etc/init.d/snmptrapd restart
    
    # cat /tmp/traptest.log
    trap: 192.168.203.140 UDP: [192.168.203.140]:1074 DISMAN-EVENT-MIB::sysUpTimeInstance = 0
    :0:00:00.00, SNMPv2-MIB::snmpTrapOID.0 = SNMPv2-MIB::coldStart, SNMP-COMMUNITY-MIB::snmpT
    rapAddress.0= 192.168.203.140, SNMP-COMMUNITY-MIB::snmpTrapCommunity.0 = "SECtrap", SNMPv
    2-MIB::snmpTrapEnterprise.0 = MSFT-MIB::workstation
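    With the traphandle match, snmptrapd does the filtering. Where a single handler should see every trap and decide for itself, the same selection could be made inside the script. The following is only a sketch of that alternative: trap_oid and sample_coldstart are hypothetical names, and the numeric pattern .1.3.6.1.6.3.1.1.4.1.0 (snmpTrapOID.0) is included on the assumption that OIDs may arrive in numeric form.

```shell
#!/bin/sh
# trap_oid: hypothetical helper. Reads handler-style input on stdin
# (host line, address line, then "OID value" lines) and prints the value
# of the snmpTrapOID.0 varbind. Returns 1 if no such varbind is present.
trap_oid() {
  read -r _host
  read -r _addr
  while read -r oid val; do
    case "$oid" in
      *snmpTrapOID.0|.1.3.6.1.6.3.1.1.4.1.0)
        printf '%s\n' "$val"
        return 0
        ;;
    esac
  done
  return 1
}

# sample_coldstart: sample input modelled on the coldStart trap logged above.
sample_coldstart() {
  printf '%s\n' \
    '192.168.203.140' \
    'UDP: [192.168.203.140]:1074' \
    'DISMAN-EVENT-MIB::sysUpTimeInstance 0:0:00:00.00' \
    'SNMPv2-MIB::snmpTrapOID.0 SNMPv2-MIB::coldStart'
}

# Decide what to do based on the trap type.
case "$(sample_coldstart | trap_oid)" in
  *coldStart) echo "coldStart trap: handle it" ;;
  *)          echo "other trap: ignore" ;;
esac
```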

  4. The 'Translating' part: Converting the SNMP traps into Nagios format and sending them to Nagios
    Nagios can be set to receive and process data sent from external programs. Let's verify that the related directives are enabled and set in the Nagios configuration file:
    # egrep 'check_external_commands|command_check_interval|command_file' /home/app/nagios/etc/nagios.cfg
    check_external_commands=1
    #command_check_interval=-1
    command_check_interval=5s
    command_file=/home/app/nagios/var/rw/nagios.cmd
    
    # grep accept_passive /home/app/nagios/etc/nagios.cfg
    accept_passive_service_checks=1
    accept_passive_host_checks=1
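    The same check can be wrapped into a small script so it is easy to repeat after configuration changes. This is a rough sketch, not part of the original setup: directive_is and check_trap_prereqs are hypothetical helper names, and the sketch assumes each directive sits on its own line as NAME=VALUE.

```shell
#!/bin/sh
# directive_is FILE NAME VALUE: hypothetical helper. Succeeds if FILE
# contains an uncommented line "NAME=VALUE".
directive_is() {
  grep -q "^[[:space:]]*$2=$3[[:space:]]*\$" "$1"
}

# check_trap_prereqs FILE: report on the three directives that passive
# trap submission relies on; returns non-zero if any is not set to 1.
check_trap_prereqs() {
  rc=0
  for d in check_external_commands \
           accept_passive_service_checks \
           accept_passive_host_checks; do
    if directive_is "$1" "$d" 1; then
      echo "ok: $d=1"
    else
      echo "not set: $d=1"
      rc=1
    fi
  done
  return "$rc"
}

# Example call, using the path from this article:
#   check_trap_prereqs /home/app/nagios/etc/nagios.cfg
```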
    
    The Nagios data-receiving part is the named pipe '/home/app/nagios/var/rw/nagios.cmd'. The format of the Nagios event to send to '/home/app/nagios/var/rw/nagios.cmd' is:
    [Unix Timestamp] Message Descriptor;host name;service-name;severity-code;text data

    Example: [1141163054] PROCESS_SERVICE_CHECK_RESULT;ml08460;check_trap_ml08460;1;Trap test data
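
    To make the field order concrete, here is a small sketch that assembles such a line from its parts. The helper name build_event is hypothetical; the optional fifth argument exists only so the timestamp can be fixed for a reproducible example.

```shell
#!/bin/sh
# build_event HOST SERVICE CODE TEXT [TIMESTAMP]: hypothetical helper that
# prints one PROCESS_SERVICE_CHECK_RESULT line in the format shown above.
# CODE is the severity code: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
build_event() {
  ts=${5:-$(date +%s)}   # default to the current Unix time
  printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
    "$ts" "$1" "$2" "$3" "$4"
}

# Reproduces the example line above:
build_event ml08460 check_trap_ml08460 1 "Trap test data" 1141163054
```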

    We can now send a test event to Nagios to see if it is received properly:
    # echo "`date +[%s]` PROCESS_SERVICE_CHECK_RESULT;testserver;check_trap_test;1;test" > /home/app/nagios/var/rw/nagios.cmd
    
    # tail /home/app/nagios/var/nagios.log | grep trap
    
    [1224133947] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;testserver;check_trap_test;1;
    test
    [1224133947] Warning:  Passive check result was received for service 'check_trap_test' on
     host 'testserver', but the host could not be found!
    It is time to think of a program that translates our SNMP trap into a Nagios event and sends it to Nagios through its command file. We want this trap service for Windows reboots associated with each Nagios host in order to allow for a separate notification to the appropriate host support team. We also want the severity code set to warning, but avoid confirmation by hand. Instead we want the event to be cleared quickly to OK state, and no notification should go out for this auto-confirmation.
    The association with a Nagios host requires us to derive the correct host name from the trap IP. The auto-confirmation is made by a second, slightly delayed event submission with severity code '0'. The notification for OK is disabled in the service template. I wrote a program for this, named it send_trap_data.pl, and put it into my [nagioshome]/libexec directory. If the DEBUG option is set to 1, the program writes some parameters and the submitted Nagios events into a temp file. Let's enable 'send_trap_data.pl' to start processing incoming SNMP traps for Nagios:
    # vi /etc/snmp/snmptrapd.conf
    
    change:
    traphandle      SNMPv2-MIB::coldStart          /tmp/snmptraptest.sh
    to:
    # traphandle    SNMPv2-MIB::coldStart          /tmp/snmptraptest.sh
    traphandle      SNMPv2-MIB::coldStart          /home/app/nagios/libexec/send_trap_data.pl
    
    # /etc/init.d/snmptrapd restart
    
    # cat /tmp/test3
    trapline >proxyjp02.frank4dd.com
     UDP: [192.168.100.184]:12380
     DISMAN-EVENT-MIB::sysUpTimeInstance 0:0:00:00.00
     SNMPv2-MIB::snmpTrapOID.0 SNMPv2-MIB::coldStart
     SNMP-COMMUNITY-MIB::snmpTrapAddress.0 192.168.100.184
     SNMP-COMMUNITY-MIB::snmpTrapCommunity.0 "SECtrap"
     SNMPv2-MIB::snmpTrapEnterprise.0 MSFT-MIB::server
    <
    traphost >proxyjp02.frank4dd.com<
    snmpname >SNMPv2-MIB::sysName.0 = STRING: JPNHOMG035<
    hostname >jpnhomg035<
    eventstr >[1224478743] PROCESS_SERVICE_CHECK_RESULT;jpnhomg035;check_trap_coldstart;1;Syst
    em *reboot* or SNMP service restarted.<
    Wrote eventstr to /home/app/nagios/var/rw/nagios.cmd
    eventstr >[1224478743] PROCESS_SERVICE_CHECK_RESULT;jpnhomg035;check_trap_coldstart;0;Syst
    em *reboot* or SNMP service restarted. auto-OK<
    Wrote eventstr to /home/app/nagios/var/rw/nagios.cmd
    End of send_trap_data.pl.
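    The debug output hints at the flow inside send_trap_data.pl: derive a Nagios host name, submit a warning, then submit a delayed OK. The following is not that Perl program, only a minimal shell sketch of the same idea under stated assumptions: nagios_hostname and submit_reboot_event are hypothetical names, the host name is passed in rather than looked up via SNMP, and the default delay of 5 seconds is an arbitrary choice.

```shell
#!/bin/sh
# nagios_hostname NAME: hypothetical helper. Strips any domain part and
# lowercases the rest, e.g. JPNHOMG035 -> jpnhomg035.
nagios_hostname() {
  printf '%s\n' "${1%%.*}" | tr '[:upper:]' '[:lower:]'
}

# submit_reboot_event CMDFILE HOST [DELAY]: hypothetical helper. Appends a
# WARNING (1) result to CMDFILE, waits DELAY seconds, then appends the
# auto-OK (0) result for the same service.
submit_reboot_event() {
  cmdfile=$1
  host=$2
  delay=${3:-5}
  msg='System *reboot* or SNMP service restarted.'
  printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;check_trap_coldstart;1;%s\n' \
    "$(date +%s)" "$host" "$msg" >> "$cmdfile"
  sleep "$delay"
  printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;check_trap_coldstart;0;%s auto-OK\n' \
    "$(date +%s)" "$host" "$msg" >> "$cmdfile"
}

# Example call, using the command file path from this article:
#   submit_reboot_event /home/app/nagios/var/rw/nagios.cmd \
#     "$(nagios_hostname JPNHOMG035)"
```

    Passing the command file as an argument keeps the sketch testable against an ordinary file before it is pointed at the real Nagios pipe.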
    

  5. The 'Processing' part: Displaying and notifying SNMP trap generated events with Nagios
    Here, we define a service template and add services to it. Depending on how many different notifications we need to generate, we need to separate the actual services.
    # vi /home/app/nagios/etc/nagios.cfg
    
    # passive service check for SNMP traps
    cfg_file=/home/app/nagios/etc/objects/trap-services-template.cfg
    cfg_file=/home/app/nagios/etc/objects/trap-services.cfg
    
    # vi /home/app/nagios/etc/objects/trap-services-template.cfg
    
    ###############################################################################
    # Define a servicegroup for SNMP trap service checks
    # All SNMP trap service checks will be members of this group
    ###############################################################################
    define servicegroup{
      servicegroup_name        snmptrap-checks     ; The name of the servicegroup
      alias                    SNMP Trap Services  ; Long name of the group
    }
    ###############################################################################
    # Define the SNMP trap check template service
    ###############################################################################
    define service{
      name                          generic-trap
      active_checks_enabled         0		; traps are only passive checks
      passive_checks_enabled        1               ; yes, check passive
      parallelize_check             1		; yes, please
      obsess_over_service           0		; we don't run extra commands
      check_freshness               0               ; don't check for freshness
      notifications_enabled         1		; send notifications
      event_handler_enabled         1		; yes, but we have none
      flap_detection_enabled        0		; with auto-OK, we don't
      failure_prediction_enabled    1		; dependency checks
      process_perf_data             0		; don't send this to perfdata
      retain_status_information     1		; yes, once auto-OK'ed, keep it
      retain_nonstatus_information  1
      is_volatile                   1               ; enable for passive checks
      check_period                  24x7		; always check for submissions
      max_check_attempts            1		; one trap is enough
      normal_check_interval         1		
      retry_check_interval          1
      contact_groups                frankonly
      notification_options          w               ; notify for warnings only
      notification_interval         120             ; notify every 2 hrs
      notification_period           24x7		; always notify
      register                      0		; template, don't register
      servicegroups                 snmptrap-checks
      check_command                 check_none	; we do not run any checks
    }
    
    # vi /home/app/nagios/etc/objects/trap-services.cfg
    
    ###############################################################################
    # Receive SNMP traps for windows boot events via eventhandler scripts
    ###############################################################################
    define service {
      use                           generic-trap
      host_name                     jpnhomg035
      name                          check_trap_coldstart
      service_description           check_trap_coldstart
    }
    ###############################################################################
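    Since each monitored Windows host needs its own service entry, writing them by hand gets tedious. As a rough sketch, the entries could be generated from a list of host names; gen_trap_services is a hypothetical helper, the second host name is only an example, and the output would be appended to trap-services.cfg after review.

```shell
#!/bin/sh
# gen_trap_services HOST...: hypothetical helper that prints one
# check_trap_coldstart service definition per host name given.
gen_trap_services() {
  for h in "$@"; do
    printf 'define service {\n'
    printf '  use                           generic-trap\n'
    printf '  host_name                     %s\n' "$h"
    printf '  service_description           check_trap_coldstart\n'
    printf '}\n'
  done
}

gen_trap_services jpnhomg035 ml08460
```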
    
    

  6. We are done, enjoy the Nagios SNMP trap monitoring
    [Screenshots: Nagios SNMP trap service detail 1 to 4]

  7. ... and the resulting Nagios notification
    [Screenshot: body of the Nagios e-mail notification]

  8. Credits, copyrights, original scripts etc.