Sensu and InfluxDB: Storing Data from Metrics Collection Checks

Sensu is a popular monitoring solution for both applications and infrastructure, designed to address the needs of a modern cloud computing environment.

The Sensu framework is composed of client and server applications that communicate via a message bus — RabbitMQ by default, although other transports can be used. Configuration is entirely done using JSON files, making it easy to integrate with automation tools like Ansible or Chef, and clients can be registered and deregistered without having to restart the server.

A copy of the client resides on each host in your infrastructure. Clients pull requests for health checks from the transport, execute them, and push the results back onto the message bus as an “event”. Sensu checks follow the same format as Nagios Plugins, which lets developers take advantage of a vast number of plugins in the Nagios ecosystem as well as those provided by the Sensu community. A check can be any program or script that writes data to STDOUT or STDERR and returns an exit code that corresponds to a given status: 0 for OK, 1 for Warning, 2 for Critical, and 3 or higher for unknown or custom statuses.
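
To make the exit-code convention concrete, here is a minimal sketch of the logic inside a threshold check, written in Python. The function names, the CheckExample prefix, and the default thresholds are all hypothetical; this is not code from any Sensu plugin.

```python
# Exit codes follow the Nagios plugin convention described above.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(value, warn, crit):
    """Return (exit_code, message) for a measured value and two thresholds."""
    if value is None:
        return UNKNOWN, "CheckExample UNKNOWN: no value collected"
    if value >= crit:
        return CRITICAL, f"CheckExample CRITICAL: {value:.2f} >= {crit}"
    if value >= warn:
        return WARNING, f"CheckExample WARNING: {value:.2f} >= {warn}"
    return OK, f"CheckExample OK: {value:.2f}"


def main(value, warn=80.0, crit=90.0):
    """Print the message to STDOUT and return the status as an exit code."""
    code, message = evaluate(value, warn, crit)
    print(message)  # Sensu captures STDOUT as the check output
    return code     # a real script would pass this to sys.exit()
```

A standalone script would finish with something like sys.exit(main(measured_value)), so that the client can read the status from the process exit code.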

Once the check results have been pushed to the message bus, one or more Sensu servers pull the events from the bus and handle them, processing the results, triggering alerts, or forwarding metrics to a long-term store.

Out of the box, Sensu doesn’t do anything with data that might be collected during a check, but provides the ability to configure handlers which will process and forward the data to an external store — in our case, InfluxDB.

By storing the metrics data, development teams can analyze it at a later date, whether that means looking at performance data to drive the engineering roadmap or using it as part of the incident response or postmortem process. Additionally, by querying metrics in a time series database, Sensu can perform checks against multiple datapoints, reducing noise and flapping alerts.

So how do we save this data?

Setup

This article assumes that you are already running Sensu as part of your infrastructure, with metrics collection checks running on your clients, and that you’ve set up an InfluxDB instance. If you don’t, you can find installation instructions here:

For this example we’ll be using the CPU Percentage Check from the Sensu CPU Checks Plugin to gather metrics about processor usage.

I have two hosts configured running Ubuntu 16.04: the Sensu server itself, which is running Sensu, Redis, RabbitMQ, and Uchiwa, and a second server for InfluxDB. The Sensu server is configured with a legacy-hosts subscription in addition to the dev and ubuntu-servers subscriptions common to both hosts. Subscriptions determine which checks are run by which clients; in this case, we’ll pretend that our existing infrastructure is configured to gather resource metrics via Sensu, while the InfluxDB server collects those metrics using Telegraf. We’ll use the legacy-hosts subscription to ensure that we only run our metrics collection checks on servers without Telegraf installed.

The Sensu InfluxDB Plugin

Sensu has a large number of community-contributed plugins that provide easy-to-use integrations between Sensu and third-party applications from Apache to Zookeeper, which can be found under the Sensu Community Plugins organization on GitHub.

We’ll be using the InfluxDB Sensu Plugin, which provides a number of integrations between InfluxDB and Sensu:

We can install the InfluxDB plugin to Sensu’s embedded Ruby environment, /opt/sensu/embedded/, using the following command:

$ sudo sensu-install influxdb

We’re already running a metrics check on our “legacy” hosts, so we want to set up the InfluxDB handler to deal with events as they’re received by the Sensu server. First, we’ll enable the handler by adding it to /etc/sensu/conf.d/handlers.json:

{
  "handlers": {
    "influxdb-tcp": {
      "type": "pipe",
      "command": "/opt/sensu/embedded/bin/metrics-influxdb.rb"
    }
  }
}

and add a configuration at /etc/sensu/conf.d/influx.json:

{
    "influxdb": {
        "host"          : "192.168.227.134",
        "port"          : "8086",
        "database"      : "sensumetrics"
    }
}

The InfluxDB instance in this example isn’t using any kind of authentication or SSL, so we’re using an extremely simple configuration, but there are a number of additional configuration options that can be specified.

Now that we have the handler configured, we want to invoke it whenever we get results from our CPU Percentage Check by adding the handler to the check configuration, in our case /etc/sensu/conf.d/cpu_percentage.json:

{
  "checks": {
    "cpu_metrics": {
      "type": "metric",
      "command": "metrics-cpu-pcnt-usage.rb",
      "subscribers": [
        "legacy-hosts"
      ],
      "interval": 10,
      "handlers": [
        "debug",
        "influxdb-tcp"
      ]
    }
  }
}

As you can see, we’ve configured the subscribers to include only legacy-hosts. Newer hosts will be set up with Telegraf, so we won’t need to run these checks there.

Now let’s restart the Sensu services to pick up any configuration changes:

$ sudo systemctl restart sensu-server sensu-api sensu-client

Collecting Metrics

At this point, we should be seeing data in InfluxDB. Sensu’s default format for metrics is Graphite plaintext; each metric is represented as a period-delimited path, a value, and a timestamp. The CPU Percentage Check we’re using returns nine metrics, as follows:

sensu.cpu.user 0.50 1515534170
sensu.cpu.nice 0.00 1515534170
sensu.cpu.system 0.00 1515534170
sensu.cpu.idle 99.50 1515534170
sensu.cpu.iowait 0.00 1515534170
sensu.cpu.irq 0.00 1515534170
sensu.cpu.softirq 0.00 1515534170
sensu.cpu.steal 0.00 1515534170
sensu.cpu.guest 0.00 1515534170
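
As a rough illustration of how simple this format is to consume, the following Python sketch splits check output into path, value, and timestamp. The GraphiteMetric type and parse_graphite function are names made up for this example and are not part of Sensu.

```python
from typing import NamedTuple


class GraphiteMetric(NamedTuple):
    path: str        # period-delimited metric path, e.g. "sensu.cpu.user"
    value: float
    timestamp: int   # Unix time in seconds


def parse_graphite(output):
    """Parse Graphite plaintext, one 'path value timestamp' entry per line."""
    metrics = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        path, value, timestamp = parts
        metrics.append(GraphiteMetric(path, float(value), int(timestamp)))
    return metrics
```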

InfluxDB, on the other hand, uses the concept of tags and fields for greater efficiency. An InfluxDB “measurement” can contain multiple values stored in fields, as well as indexed tags which can be used to perform more complex queries at a later date.

All of the metrics above could be stored in InfluxDB as a single measurement, with the host represented by a tag and each metric represented as a field. The same metrics in InfluxDB line protocol might look like this:

cpu,host=sensu user=0.50,nice=0.00,system=0.00,idle=99.50,iowait=0.00,irq=0.00,softirq=0.00,steal=0.00,guest=0.00 1515534170

Sensu’s InfluxDB plugin, however, is designed to work with any and all metrics that might be generated by checks, and so it doesn’t do much to parse the Graphite values into tags and fields. The plugin creates one measurement for each of the Graphite metrics, extracting the hostname from the beginning of the Graphite path and the metric name from the event data to be used as tags. So

sensu.cpu.user 0.50 1515534170

becomes

cpu_user,host=sensu,metric=cpu_percentage value=0.50 1515534170

This isn’t an ideal solution; it fails to take advantage of the efficiencies afforded by tags and fields, but it can be a quick way to get metrics into InfluxDB and onto a dashboard, and it’s likely the first thing many InfluxDB users will try.
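
The one-measurement-per-metric mapping can be sketched in a few lines of Python. This is only an approximation of the behavior described above, not the plugin’s actual code; the function name is hypothetical, and the field name value is an assumption based on the value column in the query output further down.

```python
def per_metric_line(graphite_line, check_name):
    """Turn one Graphite plaintext line into one single-field measurement."""
    path, value, timestamp = graphite_line.split()
    # The first path segment is treated as the host; the rest names the measurement.
    host, _, rest = path.partition(".")
    measurement = rest.replace(".", "_")
    return (
        f"{measurement},host={host},metric={check_name} "
        f"value={float(value)} {timestamp}"
    )
```

Every metric ends up in its own measurement with a single field, which is why nine Graphite lines become nine measurements.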

We can verify that data is being received by InfluxDB using the InfluxDB CLI. Log into the InfluxDB host and type influx at the prompt:

$ influx
Connected to http://localhost:8086 version 1.4.2
InfluxDB shell version: 1.4.2
>

Use the SHOW MEASUREMENTS command to verify that all metrics have been created:

> SHOW MEASUREMENTS
name: measurements
name
----
cpu_guest
cpu_idle
cpu_iowait
cpu_irq
cpu_nice
cpu_softirq
cpu_steal
cpu_system
cpu_user

Finally, we can query one of the measurements to see individual data points:

> SELECT * FROM cpu_idle WHERE time > now() - 1m
name: cpu_idle
time                host  metric         value
----                ----  ------         -----
1515534170000000000 sensu cpu_percentage 99.5
1515534270000000000 sensu cpu_percentage 100
1515534370000000000 sensu cpu_percentage 100
1515534470000000000 sensu cpu_percentage 100
1515534570000000000 sensu cpu_percentage 100
1515534670000000000 sensu cpu_percentage 100
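
If you would rather script this verification than type it into the CLI, InfluxDB 1.x also accepts InfluxQL over HTTP at its /query endpoint, with the database and query passed as the db and q parameters. The sketch below only builds the request URL; the helper name is hypothetical, and the base URL assumes the script runs on the InfluxDB host.

```python
from urllib.parse import urlencode


def build_query_url(base_url, database, influxql):
    """Build a URL for the InfluxDB 1.x /query endpoint."""
    params = urlencode({"db": database, "q": influxql})
    return f"{base_url.rstrip('/')}/query?{params}"


url = build_query_url(
    "http://localhost:8086",
    "sensumetrics",
    "SELECT * FROM cpu_idle WHERE time > now() - 1m",
)
```

The resulting URL can be fetched with any HTTP client, and the response comes back as JSON.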

An Alternative Approach: Sending Graphite plaintext to Telegraf

Creating a measurement for each metric already means we’re not taking full advantage of InfluxDB. On top of that, the way the Sensu InfluxDB Plugin works isn’t terribly efficient.

Sensu’s Graphite mutator comes with the following warning:

Note however that using this mutator as a mutator command can be very expensive, as Sensu has to spawn a new Ruby process to launch this script for each result of a metrics check. Consider instead producing the correct metric names from your plugin and sending them directly to Graphite via the socket handler. See https://groups.google.com/d/msg/sensu-users/1hkRSvL48ck/8Dhl98lR24kJ for more information.

The handler we’re using, metrics-influxdb.rb, suffers from the same problem, as the script needs to be run for each result. The discussion in the linked thread recommends using only_check_output, a built-in Sensu mutator extension that does not require Sensu to spawn a new process. It extracts the Graphite plaintext from the event data, which can then be forwarded along using the socket handler.
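
Conceptually, the mutation is tiny. The Python sketch below mimics it on a trimmed, hypothetical event; real Sensu events carry many more attributes, and the real mutator runs inside the Sensu server rather than as a separate script.

```python
import json


def only_check_output(event):
    """Return just the check output, dropping the rest of the event."""
    return event["check"]["output"]


# Trimmed, hypothetical event for illustration only.
event = {
    "client": {"name": "sensu"},
    "check": {
        "name": "cpu_metrics",
        "status": 0,
        "output": "sensu.cpu.user 0.50 1515534170\nsensu.cpu.idle 99.50 1515534170\n",
    },
}

full_event = json.dumps(event)      # what a pipe handler would read on STDIN
payload = only_check_output(event)  # what the socket handler would send instead
```

The payload is already valid Graphite plaintext, so nothing further needs to happen on the Sensu side before it goes out over the socket.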

Fortunately, both InfluxDB and Telegraf can ingest metrics in Graphite plaintext format, so this is a viable option. InfluxDB or Telegraf can be responsible for transforming the metrics as they come in, and Sensu can keep on doing Sensu things.

Note: While writing this post I came across a bug in Telegraf that caused it to reject packets that contained newlines, which meant it was rejecting the data being sent by Sensu. Fortunately it was an easy fix (Pull Request here: #3684), and the change is already in the master branch on GitHub! It will be present in the next release of Telegraf, or you can download one of the nightly builds linked in the repo to get started right away.

By default, metrics sent to Telegraf or InfluxDB in Graphite plaintext will be stored using the full Graphite path as the measurement name, similar to what the Sensu InfluxDB Plugin does, but both applications also provide template functionality, which tells InfluxDB or Telegraf how to transform the metrics into InfluxDB format.

Templates can be defined for each listener you have configured, and have a similar format to Graphite paths, with period-delimited values. The position of the value in the template determines how the same section in the Graphite path will be handled: as a tag, measurement, or field. You can find more information in the Template Documentation.

Let’s look again at our CPU Percentage metrics:

sensu.cpu.user 0.50 1515534170
sensu.cpu.nice 0.00 1515534170
sensu.cpu.system 0.00 1515534170
...

Each metric has the same format, so we can use a single template to parse them. They each have three sections in the path: the first for the host and the remaining two for the metric. We want to break each metric into a measurement and a field, so that we can have a single cpu measurement with fields for user, nice, system, etc. To accomplish this, we can use the following template:

host.measurement.field

Note: Many Sensu Metrics Collection Checks include an argument, --scheme, which can be used to customize the metric output. For the CPU Percentage check, the scheme defaults to #{Socket.gethostname}.cpu, which is good enough for us.
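
To see what the host.measurement.field template does to a path, here is a simplified Python sketch of positional template matching. It only handles templates with the same number of segments as the path; the real Telegraf and InfluxDB parsers also support wildcards, filters, and extra tags, none of which are modeled here. Both function names are hypothetical.

```python
def apply_template(template, path):
    """Map each segment of a Graphite path to the role named in the template."""
    roles = template.split(".")
    segments = path.split(".")
    if len(roles) != len(segments):
        raise ValueError(f"path {path!r} does not match template {template!r}")
    parsed = {"measurement": None, "field": None, "tags": {}}
    for role, segment in zip(roles, segments):
        if role in ("measurement", "field"):
            parsed[role] = segment
        else:
            parsed["tags"][role] = segment  # any other name becomes a tag key
    return parsed


def merge_points(template, graphite_lines):
    """Collect lines sharing measurement, tags, and timestamp into one point."""
    points = {}
    for line in graphite_lines:
        path, value, timestamp = line.split()
        parsed = apply_template(template, path)
        tags = tuple(sorted(parsed["tags"].items()))
        key = (parsed["measurement"], tags, int(timestamp))
        points.setdefault(key, {})[parsed["field"]] = float(value)
    return points
```

Fed the nine CPU lines from a single check run, merge_points returns one entry keyed by the cpu measurement and the host=sensu tag, with nine fields.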

Since both InfluxDB and Telegraf can process Graphite plaintext, where you decide to do the work depends on your workload, infrastructure, and tooling. If you’re only collecting a few types of metrics, can process them easily using a few templates, and have resources available on your InfluxDB box, you might want to ship the metrics there directly.

If you’re trying to collect metrics from many different services, however, you might find that you run into conflicts when creating templates, or that you’re putting an unreasonable amount of load on your InfluxDB server. In that case, it might make more sense to drop an instance of Telegraf alongside your application so that you can distribute the processing load and minimize the number of templates required per listener. This approach is more easily scaled, but requires additional configuration overhead.

For this example we’ll install Telegraf alongside Sensu and ship Graphite plaintext to it via a socket.

Telegraf Setup

First, install Telegraf on your Sensu host. You can find installation instructions for Telegraf on various platforms here:

Next, we’re going to disable the collection of system statistics that Telegraf performs by default, because they are redundant with Sensu’s Metrics Collection Checks. Make sure to comment out the [[inputs.cpu]], [[inputs.disk]], [[inputs.mem]], and [[inputs.network]] sections in the default configuration.

We’ll need to configure an input for Sensu to send metrics to as well. We’ll define a socket_listener, give it a port and a data format, and also specify our template for parsing the Graphite plaintext.

Add this section to the Telegraf config at /etc/telegraf/telegraf.conf:

[[inputs.socket_listener]]
  service_address = "udp://:8094"
  data_format = "graphite"
  templates = [
    "host.measurement.field"
  ]

Restart Telegraf to pick up the new configuration using sudo systemctl restart telegraf.
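
Before wiring up Sensu, it can be handy to push a test metric at the listener by hand. The Python sketch below sends Graphite plaintext over UDP using only the standard library. The function names are hypothetical, and the default host and port assume Telegraf is listening locally on 8094 as configured above.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one Graphite plaintext line; defaults to the current time."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}"


def send_graphite(lines, host="127.0.0.1", port=8094):
    """Send newline-terminated Graphite lines in a single UDP datagram."""
    payload = ("\n".join(lines) + "\n").encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

Calling send_graphite([format_metric("sensu.cpu.user", 0.5)]) should produce a point in the cpu measurement once Telegraf flushes to InfluxDB. UDP gives no delivery confirmation, so the database is the only place to confirm that it arrived.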

Sensu Setup

Configure a UDP handler in Sensu with the only_check_output mutator by adding this to your /etc/sensu/conf.d/handlers.json:

{
  "handlers": {
    "telegraf-graphite-handler": {
      "type": "udp",
      "socket": {
        "host": "127.0.0.1",
        "port": 8094
      },
      "mutator": "only_check_output"
    }
  }
}

Update the check so that it uses the new handler by editing /etc/sensu/conf.d/cpu_percentage.json:

{
  "checks": {
    "cpu_metrics": {
      "type": "metric",
      "command": "metrics-cpu-pcnt-usage.rb",
      "subscribers": [
        "legacy-hosts"
      ],
      "interval": 10,
      "handlers": [
        "debug",
        "telegraf-graphite-handler"
      ]
    }
  }
}

And finally, restart the Sensu services:

$ sudo systemctl restart sensu-server sensu-api sensu-client

Verifying That It Works

Our metrics are coming out of Sensu in this format:

sensu.cpu.user 0.50 1515534170
sensu.cpu.nice 0.00 1515534170
sensu.cpu.system 0.00 1515534170
sensu.cpu.idle 99.50 1515534170
sensu.cpu.iowait 0.00 1515534170
sensu.cpu.irq 0.00 1515534170
sensu.cpu.softirq 0.00 1515534170
sensu.cpu.steal 0.00 1515534170
sensu.cpu.guest 0.00 1515534170

running through this template:

host.measurement.field

and resulting in one measurement, cpu, with multiple fields. Let’s check that’s what we’re seeing in the database. Open up the InfluxDB CLI, select the sensumetrics database, show the measurements and select the last minute of data, like we did before:

> USE sensumetrics
Using database sensumetrics
> SHOW MEASUREMENTS
name
----
cpu
> SELECT * FROM cpu WHERE time > now() - 1m
name: cpu
time                 guest host  idle  iowait irq nice softirq steal system user
----                 ----- ----  ----  ------ --- ---- ------- ----- ------ ----
2018-01-09T20:32:09Z 0     sensu 98.51 0      0   0    0       0     0.5    1
2018-01-09T20:32:19Z 0     sensu 98.51 0      0   0    0       0     0.5    1
2018-01-09T20:32:29Z 0     sensu 98.51 0      0   0    0       0     0.5    1
2018-01-09T20:32:39Z 0     sensu 98.51 0      0   0    0       0     0.5    1
2018-01-09T20:32:49Z 0     sensu 98.51 0      0   0    0       0     0.5    1

Next Steps

We’re still doing a bit of unnecessary work by transforming the metrics from Graphite plaintext; at some point it might make sense to either update our checks to output data in InfluxDB line protocol, or use Telegraf to collect and ship metrics directly to InfluxDB from our applications.

Next week we’ll continue exploring the integration of Sensu and InfluxDB by creating a Metrics Check based on the data we’ve captured.