Tools for Time Series Data Science Problems with InfluxDB

This article was originally published in The New Stack and is reposted here with permission.

You might need to perform anomaly detection or forecasting if you’re working with time-series data. The first step before working on your time series is finding the right data store. To detect anomalies in your data or forecast it effectively, you will need a data store that can handle a large volume of data at a high ingest rate. Therefore, you might want to look at using a purpose-built time-series database. Time-series databases stand out from more common relational databases because instead of using rows and columns to quickly find relationships between data points, they are designed to handle the unique workloads of time-series data.

For this article, we will use the time-series database InfluxDB and discuss some of the tools you can use to perform forecasting and anomaly detection. We’ll also discuss some of the enhancements to the InfluxDB v2 Python Client Library that make it easier to query data from InfluxDB and apply data science tools to your time-series data.

Tools for time series data science problems

While you can use InfluxDB for some time-series data science problems, the scope of problems you can solve with it is somewhat limited. For most use cases, use InfluxDB to store all of your raw time-series data, then use Flux, the native query and scripting language for InfluxDB, to prepare your data.

Next, pull your data into another environment with a client library. If you’re tackling a time-series data science problem, consider using Python. I see more tools and packages for time series being developed in Python than in any other language. Let’s take a look at some of the most popular Python tools for time series.

Pandas is an open source Python library used for data manipulation. If you intend to use many of the libraries mentioned next, you’ll have to use Pandas to get the data into a Pandas DataFrame, as that is the expected data format. A Pandas DataFrame is a two-dimensional, size-mutable data structure. Pandas has gained popularity among the data science community for two reasons. First, it’s intuitive and has a great user experience. Transforming and reshaping data with Pandas is fun. Second, several data science libraries require that the input be a Pandas DataFrame.

TensorFlow is an open source machine learning and artificial intelligence platform. Data scientists use TensorFlow to build and train models using Python or JavaScript. TensorFlow is primarily used for deep learning applications. Deep learning is a type of machine learning that employs the use of neural networks. Neural networks are composed of computational or logical gates that control the flow and manipulation of data. The way in which they control the flow of data draws inspiration from the way the human brain learns, which is where neural networks derive their name from.

TensorFlow is geared toward beginners and experts alike. Neural networks are great forecasters for data with multiple features. Features are all of the related attributes that will help you make a forecast — more on that in the next tool description. You can also use TensorFlow to perform anomaly detection.

Keras is an open source wrapper for TensorFlow. While TensorFlow is geared toward beginners as well, it can still be overwhelming to those who are new to deep learning. Keras aims to simplify the process of learning TensorFlow by providing an interface for it. I recommend checking out this Keras tutorial on time-series forecasting for weather data. In it, you learn how to forecast temperature using 14 input features including pressure, humidity, temperature and wind speed. It also demonstrates how to perform some basic feature selection by generating a correlation plot.

Feature selection is the process of removing features that are redundant or irrelevant so they don’t degrade your model. A correlation plot or correlogram helps us visualize the correlation between different variables. Features that are highly correlated with each other are considered redundant. Features that aren’t correlated with any other feature are considered to be irrelevant. For example, humidity and relative humidity are probably going to be considered redundant features for a temperature forecast. You can also use Keras to perform anomaly detection.
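As a rough illustration of spotting redundant features, the sketch below computes pairwise Pearson correlations and flags pairs whose absolute correlation is above a cutoff. The feature names, the readings and the 0.95 cutoff are all made up for illustration, and the helper functions are hypothetical. With data already in a Pandas DataFrame, you would more likely start from DataFrame.corr().

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def redundant_pairs(features, cutoff=0.95):
    """Return feature-name pairs whose |correlation| is above the cutoff."""
    return [
        (a, b)
        for a, b in combinations(features, 2)
        if abs(pearson(features[a], features[b])) > cutoff
    ]


# Hypothetical readings; relative_humidity is just humidity rescaled.
features = {
    "humidity": [30.0, 45.0, 50.0, 65.0, 80.0],
    "relative_humidity": [0.30, 0.45, 0.50, 0.65, 0.80],
    "pressure": [1012.0, 1009.0, 1015.0, 1008.0, 1013.0],
}

print(redundant_pairs(features))
```

Here only the humidity and relative_humidity pair is flagged, so one of the two could be dropped before training.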

Prophet is a Python library for forecasting. It frames the forecasting problem as a curve-fitting exercise, that is, creating a mathematical model that provides a “best fit” line. Prophet is similar to Holt-Winters or triple exponential smoothing, a well-known statistical forecasting method. It was created to enable users of all backgrounds to make forecasts. Prophet doesn’t require specialized knowledge about time series data, which has several unique statistical properties, in order to make forecasts. Instead, it allows users to easily incorporate domain expertise into their model by specifying unique holidays, schedules and saturation points (or carrying capacities). Additionally, unlike other popular statistical forecasting methods, Prophet can handle idiosyncratic features and irregular time-series data well without requiring data preparation work. This is because Prophet performs interpolation and removes outliers automatically.

NeuralProphet is another Python library for forecasting. It is the successor to Prophet. It aims to be a hybrid approach between Prophet and deep learning. NeuralProphet aims to make near-future forecasts more accurate than what Prophet is capable of. In other words, NeuralProphet can make more accurate forecasts with sparse or less data. NeuralProphet achieves higher accuracy by using deep learning to identify and be sensitive to local context in time series. Local context refers to a subsection of seasonality or a particular shape within time-series data. Identifying local context can increase the forecast accuracy, especially when these shapes have periodicity. However, it’s worth noting that if you have ample reliable and cleaned historical data, then Prophet can outperform NeuralProphet by a slim margin.

ADTK (anomaly detection tool kit) is an open source Python package for rule-based anomaly detection in time-series data. ADTK is geared toward industrial IoT use cases. In most industrial IoT cases, operators have domain expertise around how their system should normally behave and what common anomalies or failure scenarios might look like.

ADTK allows users to build anomaly detection models by combining different modules. One type of module is detectors. Detectors find anomalous points. They include methods that look at quantiles, thresholds, level shifts, seasonality and more. ADTK allows the user to chain these functions together to build a model that, for example, evaluates whether the data exceeds a threshold while exhibiting a level shift of a certain amplitude and violating the seasonal pattern. If the data meets all those rule-based criteria, then an anomaly is present.
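To make the idea of chaining rules concrete, here is a minimal plain-Python sketch that combines a threshold rule with a level-shift rule and reports a point as anomalous only when both agree. This is not ADTK's API: the function names, the window size, the cutoffs and the readings are all hypothetical, and the seasonal rule is left out to keep the example short.

```python
def threshold_detector(series, high):
    """Flag points above a fixed upper threshold."""
    return [value > high for value in series]


def level_shift_detector(series, window, min_shift):
    """Flag points where the mean of the next `window` values differs from
    the mean of the previous `window` values by more than `min_shift`."""
    flags = [False] * len(series)
    for i in range(window, len(series) - window + 1):
        before = sum(series[i - window:i]) / window
        after = sum(series[i:i + window]) / window
        flags[i] = abs(after - before) > min_shift
    return flags


def all_of(*flag_lists):
    """Combine detector outputs: anomalous only if every detector agrees."""
    return [all(flags) for flags in zip(*flag_lists)]


# Hypothetical sensor readings with a jump at index 5.
readings = [10, 11, 10, 11, 10, 20, 21, 20, 21, 20]

anomalies = all_of(
    threshold_detector(readings, high=15),
    level_shift_detector(readings, window=3, min_shift=5),
)
anomalous_indexes = [i for i, flag in enumerate(anomalies) if flag]
print(anomalous_indexes)
```

Requiring every rule to fire keeps false positives down: a reading that is merely high, or a shift that stays below the threshold, is not reported.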

There are many other open source packages for anomaly detection, but ADTK is refreshing because of its relative simplicity. This package is ideal for identifying anomalies that follow certain well-defined statistical attributes or conversely for discovering anomalies in IoT data that exhibit very predictable or controlled behavior.

Telegraf is an open source collection agent for metrics and events. It’s database-agnostic but is part of the InfluxDB stack, so you can easily collect time-series data and write it to InfluxDB. However, it’s more than just a collection agent. You can also use Telegraf to process data. The Telegraf Execd Processor Plugin makes Telegraf extensible in any language. This includes making continuous forecasts on small batches of data with statistical forecasting algorithms. This repo contains an example of using Telegraf to make continuous forecasts on beer temperature to better regulate the fermentation process.

Amazon Forecast is a time-series forecasting service that allows users to easily make forecasts without requiring that they have specific training in model selection, training or deployment. Amazon Forecast offers several popular forecasting algorithms and neural nets including ARIMA, CNN-QR, DeepAR+, ETS (exponential smoothing), NPTS (non-parametric time series) and Prophet. Amazon Forecast makes model selection easier by allowing users to simultaneously train all of those forecasting algorithms. Amazon Forecast will also pick the winning algorithm after evaluating prediction accuracy with multiple metrics including root mean square error (RMSE), weighted quantile loss (wQL), mean absolute percentage error (MAPE), mean absolute scaled error (MASE) and weighted absolute percentage error (WAPE). Amazon Forecast also knows how to interpret those accuracy metrics and return the winning algorithm, so you don’t have to learn how to (statistics is hard!). Finally, the user can use the trained model to make predictions. Amazon Forecast also offers users a convenient way to include different standard feature sets, such as holidays, into their data before training to improve accuracy.
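For readers who want a feel for what a few of these accuracy measures compute, the sketch below implements the textbook definitions of RMSE, MAPE and WAPE in plain Python. The sample values are made up, and this is only an illustration of the formulas; how Amazon Forecast aggregates them across items and backtest windows may differ.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error (undefined when an actual value is 0)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def wape(actual, predicted):
    """Weighted absolute percentage error: total absolute error over total actuals."""
    total_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total_error / sum(abs(a) for a in actual)


# Hypothetical actuals and forecasts for four time steps.
actual = [100.0, 200.0, 400.0, 50.0]
predicted = [110.0, 190.0, 380.0, 60.0]

print(rmse(actual, predicted), mape(actual, predicted), wape(actual, predicted))
```

Note how the same forecast scores differently under each measure: MAPE weights the miss on the small value (50) as heavily as the others, while WAPE lets the larger actuals dominate.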

Enhancements to the InfluxDB v2 Python Client Library

While you can perform some forecasting and anomaly detection with Flux, you’ll probably want to take advantage of the language and tools you’re already familiar with. Some of the most popular tools for time-series data science problems are Python libraries. To use these popular Python packages together with InfluxDB, you’ll want to use the InfluxDB v2 Python client library.

In a previous post, we learned about how to obtain weather data from OpenWeatherMap API and store it in InfluxDB with the InfluxDB v2 Python Client Library. However, there are some additional client library features that improve the experience of using it. The InfluxDB v2 Python client library supports Pandas. You can write Pandas DataFrames and return the results of your query as a Pandas DataFrame.

The first enhancement is that the shape of your Pandas DataFrame can be more flexible than before. Previously a requirement of writing a DataFrame to InfluxDB was that you had to convert your timestamp column to a DataFrame index. Now you can simply specify which column is your timestamp column.

Additionally, many more timestamp formats are accepted. Finally, you can also specify the time zone of your timestamp column as a part of the write method, so you don’t have to do that conversion beforehand.

Ultimately all these changes mean it’s even easier to write data to InfluxDB with Pandas because users don’t have to spend more lines of code to transform their DataFrame to meet old requirements. In fact, you can directly read data from a CSV and write it to InfluxDB with:

import pandas as pd

from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import SYNCHRONOUS

url = "http://localhost:8086"
# or the URL of your Cloud instance, e.g. https://us-west-2-1.aws.cloud2.influxdata.com/
token = "my-token"
org = "my-org"

with InfluxDBClient(url=url, token=token, org=org) as client:

    df = pd.read_csv("path/to/CSV.csv")
    client \
        .write_api(write_options=SYNCHRONOUS) \
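        .write(bucket="my-bucket",
               record=df,
               # A measurement name is required when writing a DataFrame;
               # "my-measurement" is a placeholder.
               data_frame_measurement_name="my-measurement",
               data_frame_timestamp_column="Date",
               data_frame_timestamp_timezone="EST")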

Limitations and advantages of InfluxDB for time series data science problems

InfluxDB has the following limitations for tackling time-series data science problems:

  • There isn’t a lot of native tooling when it comes to tackling time-series data science problems with InfluxDB. While you can use InfluxDB for some basic forecasting or anomaly detection, you’ll likely want to use a client library to query your data and tackle your data science problems with other purpose-built tools.

InfluxDB has the following advantages when it comes to solving time-series data science problems:

  • InfluxDB can handle the ingest volumes that many time-series problems require.
  • You can use InfluxDB to preprocess your data to prepare it for any additional data science work.
  • You can perform some basic forecasting and anomaly detection with InfluxDB, which might suffice.

The advantage of using Flux, the query and data scripting language for InfluxDB, is that it contains a lot of functions and features for working with and manipulating time-series data including, but not limited to:

  • Transformation functions for statistical time-series analysis
    • Functions for correlation, covariance, standard deviation, quantile, spread, etc.
  • Transformation functions for dynamic statistical and fundamental time-series analysis
    • Functions for derivative, moving averages, time-weighted average, etc.
  • Technical momentum indicators for financial analysis
    • Chande momentum oscillators, Kaufman’s moving average, etc.
  • Math package for math applications
  • Geo package for geo-temporal data

The disadvantage of Flux is that while it has a map() function, which lets you iterate over rows in a query result, it doesn’t provide support for loops. This makes writing sophisticated forecasting or anomaly detection algorithms in Flux impossible. However, Flux has been used to write some basic anomaly detection algorithms including median absolute deviation and Naive Bayes. The median absolute deviation algorithm is used to find time series that deviate from a pack of like time series. Naive Bayes is a probabilistic classifier used to determine if an input belongs to a certain class.

An example of similar time series, which bears similarity to disk IO on a Linux kernel. Median absolute deviation is used to determine which series is deviating from the pack.
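The median absolute deviation idea can be sketched in a few lines of Python. This is an illustration of the general technique, not the Flux implementation: the function name, the host names, the readings and the threshold of 3 are assumptions, and the sketch scores a single timestamp rather than whole series.

```python
from statistics import median

# Scale factor that makes MAD comparable to a standard deviation
# for normally distributed data.
K = 1.4826


def mad_outliers(values_by_series, threshold=3.0):
    """Return names of series whose value deviates from the pack.

    `values_by_series` maps a series name to its value at one timestamp.
    """
    med = median(values_by_series.values())
    deviations = {name: abs(v - med) for name, v in values_by_series.items()}
    mad = median(deviations.values())
    if mad == 0:
        # At least half of the values equal the median; this sketch
        # simply skips scoring in that case.
        return []
    return [
        name for name, dev in deviations.items()
        if dev / (K * mad) > threshold
    ]


# Hypothetical disk IO readings for five hosts at a single timestamp.
snapshot = {"host_a": 101.0, "host_b": 99.0, "host_c": 100.0,
            "host_d": 102.0, "host_e": 140.0}

print(mad_outliers(snapshot))
```

Because both the center and the spread are medians, the single stray host barely affects the baseline it is compared against, which is why it stands out so clearly.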

Flux also contains the holtWinters() function, which applies double or triple exponential smoothing to time series data. Double and triple exponential smoothing are types of statistical time-series forecasting that use the data, trend and seasonality in an exponentially weighted average to make predictions.
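To show what double exponential smoothing does, here is a minimal Python sketch of Holt’s linear method, which tracks a smoothed level and a smoothed trend. It is not the code behind holtWinters(): the function name, the smoothing factors, the way the level and trend are initialized and the sample data are all assumptions. Triple exponential smoothing would add a third, seasonal component, which is omitted here.

```python
def holt_forecast(series, alpha, beta, steps):
    """Forecast `steps` values ahead with double exponential smoothing.

    alpha smooths the level, beta smooths the trend; both lie in (0, 1].
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        # New level: blend the observation with the previous projection.
        level = alpha * value + (1 - alpha) * (previous_level + trend)
        # New trend: blend the latest level change with the old trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, steps + 1)]


# Hypothetical readings that rise by 2 at every step.
temperatures = [2.0, 4.0, 6.0, 8.0, 10.0]
forecast = holt_forecast(temperatures, alpha=0.5, beta=0.5, steps=3)
print(forecast)
```

On this perfectly linear input the forecast simply continues the line (12, 14, 16); on noisy data, alpha and beta control how quickly the level and trend react to recent observations.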

Conclusion

I hope this blog post inspires you to take advantage of any of the aforementioned tools for time-series data science problems. I encourage you to take a look at the following repo which includes examples for how to work with many of these Python libraries and InfluxDB to make forecasts and perform anomaly detection.