Exporting Windows and SQL Server metrics to Prometheus with Sonar

15:16 reading time


A while back I wrote about integrating our Windows systems and SQL Server into our Nagios implementation. These days we’re looking to replace Nagios (and Ganglia) with Prometheus for metrics collection, monitoring, and alerting.

While exporters already exist for most of our Linux systems, it seems like not too many people are integrating their Windows metrics yet. There look to be two primary exporters as of this writing: a WMI exporter, and a package called Sonar that can export both WMI and Windows Performance Monitor counters. I chose to implement Sonar because we were already using several Windows perfmon counters that can’t be duplicated with WMI, including custom SQL Server counters, and I appreciate the flexibility of being able to use either metric type. I found the maintainers of Sonar to be helpful and responsive when I had questions. It also seems to have nice Docker integration capabilities that we aren’t using (yet) but have potential.

We’re using Grafana to visualize the metrics, and we’ve got a prototype alerting framework set up with Alertmanager which is integrated into our Slack and Pagerduty instances as well as email.

 

Installing Sonar on Windows

Sonar doesn’t have an installer package. The process involves downloading the files, giving them a directory to live in, and then creating a service from the executable. There’s documentation on the Infragravity site, but I’m going to explain the process we followed here.

  1. Download the zip install file from the repo.
  2. Create an install directory (we used C:\Program Files\Sonar).
  3. Unzip the files to the install directory.
  4. A service now needs to be created:
> sc.exe create sonard binpath= C:\Program files\Sonar\out\Sonard.exe start= auto obj= LocalSystem depend= "WinRM"

 

Configuring Sonar

There are two xml files in the install directory that will need to be modified to configure the exporter and specify the metrics desired:

 

C:\Program Files\Sonar\out\sonar.config - click to view

<?xml version="1.0"?>
<configuration>
  <configSections>
    <section name="Sonar" type="Infragravity.Sonar.SonarConfigurationSection, Sonar"/>
  </configSections>
  <appSettings>
    <add key="ConfigPath" value="C:\Program Files\Sonar\Sonar.config"/>
    <add key="RuntimeType" value="Service"/>
    <add key="LogLevel" value="Warning"/>
    <add key="LogPath" value="C:\Program Files\Sonar\Sonard.log"/>
    <add key="ExporterPort" value="9190"/>
    <add key="ExporterCacheMilliseconds" value="300000"/>
    <add key="ExporterEnabled" value="true"/>
  </appSettings>
</configuration>

What we changed from the default here were the ConfigPath (the path to the main Sonar config file) the LogPath (ditto but for the log file), and the ExporterPort value, which we set to 9190; also the ExporterCacheMilliseconds, which specifies the length of time that checks will be exposed to Prometheus for scraping. Before we raised this value, the stats would disappear from the exposed file before being regenerated on their schedules, which caused some inconsistencies in metric availability. The maintainers confirmed that raising this value will not prevent the metric from being refreshed on its regular schedule.

 

C:\Program Files\Sonar\Sonard.dll.config - click to view

<?xml version="1.0"?>
<configuration>
  <configSections>
    <section name="Sonar" type="Infragravity.Sonar.SonarConfigurationSection, Sonar"/>
  </configSections>
  <connectionStrings>
    <add name="perfmon" providerName="mspdh" connectionString="." />
    <add name="sonar" providerName="wsman" connectionString="Server=sonar;" />
 </connectionStrings>
  <Sonar>
    <Runtime scrapeIntervalSeconds="5" skipSSLCheck="true" threads="1"/>

    <Servers>
        <add name="sonar" url="http://localhost:5985/wsman" username="" password="" timeoutMilliseconds="1000" authType="Negotiate"/>
    </Servers>

    <Schedules>

    <add name="service_sql_server" query="win_wmi_service_sql_server" input="sonar" intervalSeconds="20" />
    <add name="disk_c" query="win_perfmon_disk_c" input="perfmon" intervalSeconds="20" />

    </Schedules>

    <Queries>

    <add name="win_wmi_service_sql_server" filter="SELECT ProcessID FROM Win32_Service WHERE Name = 'MSSQLSERVER'"
      resource="http://schemas.microsoft.com/wbem/wsman/1/wmi/root/cimv2/*" namespace="root\cimv2">
      <Tags> <add name="Name" value = "Name"/> </Tags>
      <Values> <add name="ProcessId" value="CimType.UInt32"/> </Values>
    </add>

    <add name="win_perfmon_disk_c" type="raw" filter=""
      resource="." namespace="LogicalDisk(C:)">
      <Tags> <add name="Name" value = "Name"/> </Tags>
      <Values>
        <add name="read_bytes_per_second" value="Disk Read Bytes/sec"/>
        <add name="write_bytes_per_second" value="Disk Read Bytes/sec"/>
        <add name="avg_queue_length" value="Avg. Disk Queue Length"/>
      </Values>
    </add>

    </Queries>
  </Sonar>
 </configuration>

This is the main config file that creates the connections to query as well as the actual queries and their schedules. This is an abbreviated example showing two checks, one WMI and one Perfmon. You can download a full config file from one of our SQL Servers with 23 metric queries here.

The connectionStrings section specifies each of the protocols: “perfmon” for Windows Performance Monitor checks, and “sonar” for WMI checks. Each query needs its own schedule specified. Note that some checks can be grouped together, as in the WMI disk size check which returns both overall size and free space. We found that boolean values don’t currently work in Sonar, so we opted to use PIDs to check to see if services are running.

Remember that you’ll need to restart the service after you make any changes to the config files in order to pick up the updated configs.

 

Other settings

  • WMI by default denies unencrypted querying, so we needed to allow that.
> C:\windows\system32\winrm.cmd set winrm/config/Service @{AllowUnencrypted = "true"}
  • We had to allow traffic between our subnets via an AWS security group rule on port 9190, which was configured via Terraform:
  # windows exporter scrape
egress {
  protocol = "tcp"
  from_port = 9190
  to_port = 9190
  cidr_blocks = ["0.0.0.0/0"]
}
  • Finally, the hosts to be monitored need to be added as static targets in Prometheus (via Ansible):
- job_name: 'windows'
  static_configs:
    - targets: ['server1:9190','server2:9190','server3:9190']

 

Is it working?

We can test by looking for the stats at the following url. The stats should be viewable in a browser: http://<server>:9190/metrics

The metrics should then be scraped into Prometheus, where you can query them:

Finally, I built a nifty dashboard in Grafana to visualize the metrics:

…and created some useful alerts (just a few examples shown here):

Alertmanager alerts - click to view

groups:
  - name: windows
    rules:
    - alert: win_disk_free
      expr: round(((win_wmi_disk_size_Size{Name=~"S:|C:|F:"} - win_wmi_disk_size_FreeSpace{Name=~"S:|C:|F:"} ) / win_wmi_disk_size_Size{Name=~"S:|C:|F:"} ) * 100) > 85
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "`{{ $labels.collector }}`: `{{ $labels.Name }}` drive is {{ $value }}% full!"
        description: "This disk is getting close to being full."

    - alert: win_load
      expr: round(win_perfmon_cpu_percent_cpu_time) > 90
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "`{{ $labels.collector }}`: CPU used is {{ $value }}% "
        description: "The CPU load is high on this host."

    - alert: sql_server_days_since_last_backup
      expr: win_perfmon_sql_server_days_since_last_backup_days > 3
      for: 30m
      labels:
        severity: critical
      annotations:
        summary: "`{{ $labels.collector }}`: it has been {{ $value }} days since a clean backup."
        description: "Please check the SQL Server backups."

    - alert: sql_server_service
      expr: win_wmi_service_sql_server_ProcessId < 1
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "`{{ $labels.collector }}`: The SQL Server service is down."
        description: "SQL Server service"

If you’d like to see how we created the custom SQL checks, you can see the steps in my previous monitoring post.

 

Checking certificate expiration on our AD servers

We wanted to be alerted when the cert on our LDAP servers was close to expiring. We were already using the Blackbox exporter for other metrics, so I thought I’d use it for this purpose too. You can’t use https to connect to LDAP, but you can use TCP with TLS. We also only wanted to be notified once per day, so I created a new route in Alertmanager with a repeat_interval of 24h that matches on a label I created called frequency. These are the additions I made to the config files (replacing <hostname> with our domain controller name):

prometheus.yml:

  - job_name: 'ldap_cert_check'
    scrape_interval: 2m
    metrics_path: /probe
    params:
      module: [tcp_cert]
    static_configs:
      - targets:
        - '<hostname>:636'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # Blackbox exporter.

blackbox_exporter.yml

  tcp_cert:
    prober: tcp
    tcp:
      tls: true

alertmanager.yml

  routes:
    # alerts that we only want to notify once per day
    - receiver: slack
      match:
        frequency: daily
      continue: false
      repeat_interval: 24h

alert:

    - alert: ldap_cert_expiration
      expr: round((probe_ssl_earliest_cert_expiry - time()) / 86402) < 15
      labels:
        severity: warning
        audience: "some_group"
        frequency: daily
      annotations:
        summary: "`{{ $labels.instance }}` has an ldap cert which will expire in {{ $value }} days."
        description: "The ldap certificate is expiring soon."  

 

Not fancy enough?

I created an ansible role to install Sonar, create the service, and push out all the config to our Windows VMs en masse. I grouped different servers based on their roles and the associated checks, so the config file is built using only the checks appropriate for each host. This makes use of Dan Fuchs’s fantastic Merge Vars plugin for Ansible.

sonar_ansible_role/main.yml - click to view

## this role downloads the sonar zip from gh, unzips it to the install location, creates the appropriate config files, creates a service, and starts it.

- name: create main sonar directory if not present
  win_file:
    path: C:\Program Files\Sonar
    state: directory

- name: see if install file is present
  win_stat:
    path: C:\Program Files\sonar\sonar-{{ sonar_version_short }}-win10-x64.zip
  register: install_file

- name: download install file
  win_get_url:
    url: https://github.com/infragravity/sonar-docker/releases/download/{{ sonar_version }}/sonar-{{ sonar_version_short }}-win10-x64.zip
    dest: C:\Program Files\sonar\sonar-{{ sonar_version_short }}-win10-x64.zip
    force: no
  when: not install_file.stat.exists

- name: create unzip powershell command from template
  win_template:
    src: sonar_unzip.ps1.j2
    dest: C:\Program Files\Sonar\sonar_unzip.ps1

- name: use the handy powershell command to unzip install file to its new home
  win_command: powershell "& " '"C:\Program Files\Sonar\sonar_unzip.ps1"'
  when: not install_file.stat.exists

- name: set wsman to allow unencrypted queries
  win_command: C:\windows\system32\winrm.cmd set winrm/config/Service @{AllowUnencrypted = \"true\"}
  when: not install_file.stat.exists

- name: create sonard service
  win_service:
    name: sonard
    path: C:\Program Files\Sonar\out\Sonard.exe
    start_mode: auto
    dependencies: ['WinRM']

- name: collect and merge checks
  merge_vars:
    suffix_to_merge: "sonar_checks__to_merge"
    merged_var_name: "sonar_checks"
    dedup: true
    expected_type: "dict"

- name: write sonard config file
  win_template:
    src: sonar.conf.j2
    dest: C:\Program Files\Sonar\Sonar.config
  register: sonar_config

- name: write sonard dll config file
  win_template:
    src: sonard.dll.config.j2
    dest: C:\Program Files\Sonar\out\Sonard.dll.config
  register: sonard_dll_config

- name: restart sonar service
  win_service:
    name: sonard
    state: restarted
  when: sonar_config.changed or sonard_dll_config.changed

 

We’ve been running Sonar on all our Windows servers for a couple of months now, and have had no issues thus far. I recommend it if you’re considering monitoring your Windows or SQL Server instances with Prometheus. According to their documentation, it can also be used with IndfluxDB.

 


0a4fc7bfa3c8aa22deff1b07a77c08a1

Luther Rochester
Sr. Database Administrator