Synthetic Monitoring with Telegraf

There are two main modes for collecting data about your systems and software: the first is by collecting data from within the application itself, often called white-box monitoring, and the second is by querying the system from the outside and collecting data about the response.

Google’s SRE book defines white-box monitoring as “Monitoring based on metrics exposed by the internals of the system,” and black-box monitoring as “testing externally visible behavior as a user would see it.” Synthetic monitoring is an implementation of a black-box monitoring system which involves creating requests which simulate user activity.

With synthetic monitoring, the aim is to expose any active issues that the user might be experiencing with a system, such as a website being inaccessible. Since it represents real user pain, this data is especially useful as an alerting signal for paging.

This complements the white-box approach, which gives developers and operators insight into the internal functioning of the system, surfacing issues that may be invisible to the user, such as failures hidden by a successful retry, and providing invaluable information for debugging.

Telegraf can gather many white-box metrics using application-specific plugins like the ones for NGINX or MySQL, and you can instrument your applications using the InfluxDB client libraries, but you can also use Telegraf as a synthetic monitoring tool to monitor the status of your systems from the outside.

HTTP Response Input Plugin

Telegraf’s http_response input plugin checks the status of HTTP and HTTPS connections by polling an endpoint with a custom request and recording information about the result. The plugin’s configuration allows you to specify a list of URLs to query, define the request method, and send a custom request body or headers to simulate actions that might be taken by external users and systems. It also allows you to verify the behavior of those endpoints by checking that the responses match predefined strings or regular expressions. These options give us a lot of flexibility in terms of how we monitor our applications.

For each target server being polled, the plugin will write a point to InfluxDB with tags for the server (the target URL), request method, status code, and result, and fields with data about response times, whether the response string matched, the HTTP response code, and a numerical representation of the result called the result code.
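
As an illustration, a single point from this plugin in line protocol might look something like the following. Treat this as a sketch: the exact field set can vary between Telegraf versions.

http_response,server=https://www.influxdata.com,method=GET,status_code=200,result=success response_time=0.123,http_response_code=200i,response_string_match=1i,result_code=0i 1562067130000000000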

We can create a new block in our Telegraf configuration for each endpoint we want to monitor, as shown below. Telegraf will collect data for each config block once per collection interval.
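
For instance, a config monitoring two endpoints might look like this minimal sketch, where the second URL and its match string are just placeholders:

[[inputs.http_response]]
  ## URLs that share the same settings can be listed together
  urls = ["https://www.influxdata.com"]

[[inputs.http_response]]
  ## A separate block lets this endpoint use its own settings
  urls = ["https://docs.influxdata.com"]
  response_string_match = "documentation"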

Monitoring influxdata.com

Let’s look at a quick example: we’ll create a simple synthetic monitoring check that will tell us whether influxdata.com is up or not. Because we want these monitoring checks to come from outside of the system, we’ll need to set up some kind of independent infrastructure, separate from the rest of our systems, for running Telegraf. This could mean running in a different Availability Zone on AWS, or using a different cloud provider altogether. Since I don’t actually need long-lived infrastructure for this example, I’ll configure Telegraf to run on my Mac, which is external to the influxdata.com infrastructure.

I already have Telegraf installed using Homebrew, so the next step will be to create a new config file with our http_response settings. Here’s a snippet of what the inputs.http_response block would look like:

# HTTP/HTTPS request given an address a method and a timeout
[[inputs.http_response]]
  ## List of urls to query.
  urls = ["https://www.influxdata.com"]

[...]

  ## Optional substring or regex match in body of the response (case sensitive)
  response_string_match = "InfluxDB is the open source time series database"

This queries the InfluxData homepage and checks that the response body contains the phrase “InfluxDB is the open source […]”.

One thing to note is that Telegraf’s collection interval is especially important for this plugin, because it determines how often requests are made to the endpoint in question. Individual plugins can define their own collection interval by including an interval parameter in the appropriate config block, as sketched below. For the sake of example we’ll use the Telegraf defaults, but you’ll need to decide what an appropriate interval is for your own systems. You can find a complete configuration file in this gist.
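
A per-plugin override would look like this (the 30s value is arbitrary, chosen just for illustration):

[[inputs.http_response]]
  ## Poll this endpoint every 30s, regardless of the agent-level interval
  interval = "30s"
  urls = ["https://www.influxdata.com"]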

We can then launch a copy of Telegraf using the new config, and should see some output, as follows:

$ telegraf --config synthetic-telegraf.conf --debug
2019-07-01T11:51:52Z I! Starting Telegraf 1.10.4
2019-07-01T11:51:52Z I! Loaded inputs: http_response
2019-07-01T11:51:52Z I! Loaded aggregators: 
2019-07-01T11:51:52Z I! Loaded processors: 
2019-07-01T11:51:52Z I! Loaded outputs: influxdb
2019-07-01T11:51:52Z I! Tags enabled: host=noah-mbp.local
2019-07-01T11:51:52Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"noah-mbp.local", Flush Interval:10s
2019-07-01T11:51:52Z D! [agent] Connecting outputs
2019-07-01T11:51:52Z D! [agent] Attempting connection to output: influxdb
2019-07-01T11:51:52Z D! [agent] Successfully connected to output: influxdb
2019-07-01T11:51:52Z D! [agent] Starting service inputs
2019-07-01T11:52:10Z D! [outputs.influxdb] wrote batch of 1 metrics in 9.118061ms
2019-07-01T11:52:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-07-01T11:52:20Z D! [outputs.influxdb] wrote batch of 1 metrics in 7.672117ms
2019-07-01T11:52:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics.
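
Once points are being written, we can sanity-check the data from the influx CLI. Here’s a quick InfluxQL query, assuming Telegraf is writing to the default telegraf database:

$ influx -database telegraf
> SELECT response_time, result_code FROM http_response WHERE time > now() - 5m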

Next steps

The http_response plugin provides a lot of flexibility for creating monitoring requests, which you can use to more accurately model how users and applications interact with your site. For example, on influxdata.com you might want to verify that search is working by submitting a POST request and checking that the response includes text from the search results page, rather than just checking whether the page loads; a sketch of such a check follows below. Because synthetic monitoring is intended to model the user experience, the specific number, frequency, and implementation of your checks will depend on the design and function of your product, but in general you’re looking for things like slow response times or high rates of errors: things that affect user happiness.
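
Here’s a minimal sketch of what such a POST check might look like. The endpoint path, request body, and expected response string are hypothetical and would need to match your actual search implementation:

[[inputs.http_response]]
  ## Hypothetical search endpoint and form body
  urls = ["https://www.influxdata.com/search/"]
  method = "POST"
  body = "q=telegraf"
  ## Fail the check unless the results page text appears in the response
  response_string_match = "search results"
  [inputs.http_response.headers]
    Content-Type = "application/x-www-form-urlencoded"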

Since black-box monitoring often exposes issues that are already impacting users, you’ll also want to build a sane alerting strategy on top of this data. That might mean paging an engineer or dropping a notification in Slack, at which point whoever responds will have to turn to data better suited for debugging: white-box metrics and events.

Black-box monitoring isn’t a replacement for data captured from within your application, but it provides end-to-end coverage that’s useful in the most catastrophic of scenarios. Used in conjunction with white-box tools, it can give you that extra bit of confidence in the functioning of your software, making it a critical component of your monitoring system.