Monitoring Windows and SQL Server with Nagios

05:45 reading time


Our web systems live on UNIX-y hosts, and we’ve got a robust Nagios implementation to monitor and alert us for all those systems. However, our BI platform is Windows and SQL Server based, and we didn’t want to have a separate monitoring system for those servers and databases. We came up with some tricks that have worked well for us to integrate Nagios into our Windows ecosystem.

Install Nsclient++ on Windows boxes

In order to get monitoring stats into Nagios, we’ve installed the Nsclient++ application on all our Windows machines. This is a very lightweight, handy client that enables all sorts of monitoring data. For our purposes, we’ve got it configured to pass Nagios checks to Windows Performance Monitor counters via the check_nt protocol.

Nsclient++ utilizes a simple ini file to set the basic configuration. Our Nagios server was added to the Allowed Hosts section, and “NSClientserver” was set =1 to allow the check_nt command to flow through.

Create objects in Nagios

On our nagios server, we’ve created a command for check_nt in the commands directory:

file check_nt.cfg:

define command {
    command_name  check_nt
    command_line  $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v $ARG1$ $ARG2$
}

And of course, Nagios has to know about our server. File sql_server_hostname.cfg:

define host  {
    host_name        [sql_server_hostname]
    alias            [sql_server_hostname].leapfrogonline.com
    use              windows-server
    address          192.168.1.2
    hostgroups       windows-servers,mssql-servers
    contact_groups   admins,BI
    contacts         BI-Nagios
    parents          WVMHost
}

This allows us to create Nagios checks easily using any Windows Performance Counter we’d like.

Creating check strings via Windows Performance Counters

An easy way to generate performance counters (and see the counters that are available to query) is to open perfmon.exe and temporarily create a new user-defined data collector set. If you go to create a manual counter, there are an enormous variety of available counters for both the OS and the database.

Back to Nagios we go!

Once you have a command, you just need to make a few small tweaks to it in order to add it to your Nagios check. The output from the Windows command is: \SQLServer:Databases(_Total)\Data File(s) Size (KB)

So then we write a check in our Nagios repo:

define service  {
    use                  generic-service
    host_name            [sql_server_hostname]
    service_description  SQL Data File Size
    check_command        check_nt!COUNTER! -l "\\SQLServer:Databases(_Total)\\Data File(s) Size (KB)","%.f KB"
    check_interval       5
    contacts             [somecontactlist]
}

You’ll notice a extra back slashes were needed as escape characters; otherwise Nagios strips the back slashes out when parsing the check into a command.

This give us a nice check that runs every five minutes in Nagios. This particular check doesn’t have an alert associated, but the following example will.

We found that we needed to add the “%.f” units on the string in order to get Nagios to graph our checks.

Going deeper into SQL checks

In order to get alerts on a couple specific checks that weren’t readily available, we wrote some custom counters. Windows provides ten “User Settable” SQL counters per server that can be incremented via a SQL query. For example, we wanted to be alerted if there was a SQL query blocked for more than 30 minutes, and to be alerted if a backup job hadn’t run for more than two days.

The way this is implemented is by passing an integer to increment a user counter via the stored procedure sp_user_counter[1-10]. You can read more about it here.

Here’s the string produced from perfmon.exe for user counter 1:
\SQLServer:User Settable(User counter 1)\Query

Our Nagios check, with alerts for warning at 30 minutes and critical at 40 minutes:

define service  {
    use                  generic-service,srv-pnp
    host_name            [sql_server_hostname]
    service_description  SQL Longest block
    check_command        check_nt!COUNTER! -l "\\SQLServer:User Settable(User counter 1)\\Query","%.f min" -w 30 -c 40
    check_interval       5
    contacts             [somecontactlist]
}

We then created a SQL Agent job that runs on a schedule every 2 minutes and increments the performance counter, which then gets passed to Nagios via the check above.

T-SQL:

declare @waittime int;

set @waittime = (select cast(isnull(MAX(waittime)/60000,0) as int) as wait_mins from [master].[sys].[sysprocesses]
where blocked > 0)

exec sp_user_counter1 @waittime;

The SQL Agent job just runs this SQL, and now Nagios sends an email alert any time a query is blocked for more than 30 minutes.

Useful Checks

Here are some checks we’ve found useful for our sql servers:

-CPU Load
-Avg disk queue length
-Disk reads per sec
-Disk writes per sec
-Disk used space (with alerting for 85% full)
-ISCI bytes received for our SAN NICs
-Memory usage
-Ping
-Page file usage
-SQL Data size, both (all) and per database
-SQL Log file size - all
-SQL Longest block (home-made)
-SQL longest query time (home-made)
-SQL days since last backup (home-made)
-SQL page splits per second
-SQL services (SQL Server, SQL Agent, SSIS, SSAS, SSRS)
-SQL user connections
-SQL waits - network IO
-SQL waits - pagie IO latch
-SQL workload - active parallel threads
-TCPv4 segments retransmitted per sec
-Uptime

Overall we’ve found this solution to be convenient and extremely reliable. We have all our hosts, both Linux and Windows, in one monitoring system, which means streamlined management and easy group notifications. Checks are easy to add and the whole configuration is kept in GitHub for versioning. Automatic graphing aids in troubleshooting, and the system’s flexibility allows us to monitor just the things we want to and avoid alert fatigue.


0a4fc7bfa3c8aa22deff1b07a77c08a1

Luther Rochester
Sr. Database Administrator