Cleaning and Interpreting Time Series Metrics with InfluxDB


This article was originally published in The New Stack and is reposted here with permission.

A look at how to use Flux for data cleansing and analytics through the browser and via Visual Studio Code.

Time series data is data you want to analyze and monitor over time. For example, you might want to know a plant's water levels over the course of the day, or how much sunlight it receives and when. That is a simple, easy-to-understand example, but on a larger scale the stakes can be much higher: you could be monitoring server infrastructure in a data center or the pressure of a machine on a factory floor.

In these cases, detecting failures and reacting in real time can be critical to avoiding an emergency. Time series data commonly takes the form of metrics, usually from IoT devices or server infrastructure.

Metrics normally arrive in a constant stream, such as a value every second, but sometimes they can be more sporadic. Raw time series metrics can benefit from cleanup and normalization before being exposed for broader use and storage. When dealing with large amounts of time series metrics, it helps to standardize the ways others can search through that data for specific time frames using easy-to-understand tags. There are many types of time series metrics, but for this blog post, we will focus on metrics from our internal storage engine, provided by one of our site reliability engineers (SREs).

(Image: physical vs. virtual)

For this tutorial, I will use InfluxDB’s time series data platform. The core of InfluxDB is a highly performant time series database that excels at processing millions of data points per second, but the platform also includes data collectors and scripting languages. This tutorial focuses on using Flux, the data-processing and querying language used by InfluxDB. Flux has many of the capabilities of a query language like SQL, but it also comes with built-in analysis and data science capabilities. Later, we will also use Flux to create alerts and downsampling tasks. In the future, InfluxDB will also include SQL integration, which will allow for a new way to query your data.

Flux is already built into InfluxDB, so there is no need for extra installation. I will demonstrate examples of how to leverage Flux for data cleansing and analytics through the browser and via Visual Studio Code. You can check out the Visual Studio Code extension here for more details. You can also use the command line or the Cloud UI to interact with your data.

(Image: a simple Flux query)

We will start with a simple Flux query to get more familiar with how Flux is written and how it works. Your bucket is your database name. Each bucket can be customized for the data it accepts and the length of time it retains that data. First, notice the range() function. Since this language is for time series, we logically need to query for data from a time range. In this example, our range is about 20 seconds (from the start and stop times). You could choose to omit the range, but that would return all the data in your bucket.
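The basic shape of such a query looks something like the following sketch; the bucket name and timestamps here are hypothetical:

```
from(bucket: "example-bucket")
    |> range(start: 2023-01-01T00:00:00Z, stop: 2023-01-01T00:00:20Z)
    |> filter(fn: (r) => r._measurement == "node_points_total")
```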

This is data from our measurement called “node_points_total.” It has a set of counters that increment every time a point gets written: if it’s a good point, the “ok” counter gets incremented; if it’s a “bad” point, the corresponding counter gets incremented. Here we’re calculating the total number of points that were successfully written. The top of the query filters down to the specific node and host we want to monitor. We search for points with four statuses (ok, denied, error and dropped). Then we pivot the data and calculate the percent that were ok with the map() function. The map() function generates a “percentage good” number, which allows you to say “99.98% of all points written were OK.”
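A sketch of that query might look like the following, assuming the statuses are stored as field keys; the node and host names shown here are hypothetical:

```
from(bucket: "example-bucket")
    |> range(start: -20s)
    |> filter(fn: (r) => r._measurement == "node_points_total")
    |> filter(fn: (r) => r.node == "node-01" and r.host == "host-01")
    |> filter(fn: (r) => r._field == "ok" or r._field == "denied" or r._field == "error" or r._field == "dropped")
    // One column per status so we can compute a ratio per row
    |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
    |> map(fn: (r) => ({r with percent_good: float(v: r.ok) / float(v: r.ok + r.denied + r.error + r.dropped) * 100.0}))
```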

(Image: Flux query calculating percent ok)

Below is the result of running the query, visualized as a single percentage.

(Image: single percent)

The next example uses a larger range: four days of data. That’s a lot! We use aggregateWindow() to get the sum of our values for every hour, so we return 96 results, one for each hour of the past four days. Not only is this a faster query, but if we looked at the result in a table, it would also be far easier to read 96 rows than 2 million!
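A sketch of that aggregation, again with a hypothetical bucket name:

```
from(bucket: "example-bucket")
    |> range(start: -4d)
    |> filter(fn: (r) => r._measurement == "node_points_total" and r._field == "ok")
    |> aggregateWindow(every: 1h, fn: sum)
```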

(Image: Flux query using aggregateWindow)

This is the last query graphed. Here we see a small dip to 99.88% “ok” writes over the span of a few days. Our service-level agreement (SLA) for cloud is 99.9% monthly availability, and this is an example of a specific day that had a small incident. Overall, we still meet our SLA goals, but it is good to notice even minor dips. This allows our SRE team to take action when needed.

(Image: Percent ok over time with aggregateWindow)

The following Flux is a simplified version of what we call downsampling. Downsampling is the process of reducing raw, high-precision data to lower-precision aggregates. We combine the aggregateWindow() function with the to() function to write the downsampled data to a new bucket. In this example, we get the mean value for every 10-minute interval of data and store that data point in a new bucket; a minimal sketch follows the list below.

This can be helpful for three problems:

  1. Making it easier to run analysis on a smaller data set and gain high-level insights from historical data.
  2. Cleaning up erroneous data.
  3. Storing a smaller data set for the long term and reducing your overall instance size while retaining the overall shape of your data.
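A minimal version of such a downsampling task might look like this, where the task name, bucket names and org are all hypothetical:

```
option task = {name: "downsample-node-points", every: 10m}

from(bucket: "example-bucket")
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "node_points_total")
    |> aggregateWindow(every: 10m, fn: mean)
    |> to(bucket: "example-bucket-downsampled", org: "example-org")
```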

(Image: writing downsampled data to a new bucket)

Finally, we will take a look at alerting on our data set. The first part is all about filtering down to the field we want to monitor; specifically, we filter for the “ok” status.

(Image: filtering the data set for alerting)

We use the quantile() function to determine when a value is outside the 95th percentile (95p). The compression parameter in our quantile() call controls how detailed the data is when determining the 95p; the larger the compression, the slower the calculation. You can find more information in the docs on how to set up the quantile() function.
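For example, a quantile calculation over the filtered “ok” values might look like this sketch; the bucket name and range are hypothetical:

```
from(bucket: "example-bucket")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "node_points_total" and r._field == "ok")
    |> quantile(q: 0.95, method: "estimate_tdigest", compression: 1000.0)
```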

This code checks that value and sets the type and level to “critical” or “info.” From there, we also create a Slack message that sends an alert if the status type is “critical.” We call an alert function here, but there’s no need to go in depth on it. The bottom line is that you can use Flux to build these alert tasks and receive alerts when your data is outside a defined boundary, or confirmation that it is doing OK. To learn more about creating alerts and notifications with Flux, take a look at the following documentation.
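As a rough sketch of the pattern (not the exact alert function used above), levels and a Slack notification could be wired up like this; the threshold, bucket, field name and webhook URL are all placeholders:

```
import "slack"

threshold = 99.9  // hypothetical SLA floor, as a percentage

from(bucket: "example-bucket")
    |> range(start: -10m)
    |> filter(fn: (r) => r._measurement == "node_points_total" and r._field == "percent_good")
    // Label each row based on the threshold
    |> map(fn: (r) => ({r with level: if r._value < threshold then "critical" else "info"}))
    |> filter(fn: (r) => r.level == "critical")
    // Send one Slack message per critical row via a placeholder webhook
    |> map(fn: (r) => ({r with sent: slack.message(
            url: "https://hooks.slack.com/services/XXX",
            token: "",
            channel: "",
            text: "Critical: percent ok is ${string(v: r._value)}%",
            color: "danger"
        )}))
```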

(Image: creating alerts and notifications with Flux)

These are just a few of the simpler capabilities for cleaning and interpreting your time series metrics in InfluxDB. We have a large amount of Flux documentation and examples you can reference, depending on your use case and needs. We love to see what people are building in our open source community, and we look forward to connecting with you in our community forums and Slack channel to see what amazing projects you build!