Time Series Forecasting with Python and Facebook Kats

This article was written by Vidhi Chugh. Scroll down for the author’s bio.

Time series analysis is the study of a sequence of data points collected at constant time intervals. The analysis indicates how a variable or a group of variables changes over time and helps discover underlying trends and patterns.

Time series data is generally used for forecasting problems by predicting the likelihood of future data based on historical information. Weather forecasts, stock price predictions, and industry growth forecasts are some of the most popular applications of time series analysis.

Recent advancements in machine learning algorithms, like long short-term memory (LSTM) and Prophet, have led to significant improvements in forecast accuracy.

In this article, you’ll learn how to use InfluxDB to store and access time series data with its Python API and analyze it using the Facebook Kats library.

What is the Facebook Kats Toolkit

There are a number of Python libraries for analyzing time series data, including sktime, Prophet, Facebook Kats, and Darts. This post focuses on the Kats library because it’s a lightweight, easy-to-use framework for time series analysis. It offers various functionalities, including the following:

  • Forecasting: The Kats library provides a range of tools, including forecasting algorithms, ensembles, a meta-learning algorithm with hyperparameter tuning, backtesting, and empirical prediction intervals.

  • Detection: It detects patterns, like trends, seasonalities, anomalies, and change points.

  • Auto feature engineering and embedding: The TsFeatures module in Kats autogenerates features for supervised learning algorithms (see the sketch after this list).

  • Utilities: The Kats library provides time series simulators for learning and experimentation.
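
To make the detection and feature-engineering items concrete, here’s a minimal sketch on a made-up monthly series. OutlierDetector and TsFeatures are part of Kats’s public API, but the data and the spike at index 20 are purely illustrative:

import pandas as pd
from kats.consts import TimeSeriesData
from kats.detectors.outlier import OutlierDetector
from kats.tsfeatures.tsfeatures import TsFeatures

# a toy trending series with one artificial spike at index 20
df = pd.DataFrame({
    "time": pd.date_range("2020-01-01", periods=36, freq="MS"),
    "value": [100.0 + 2 * i + (300.0 if i == 20 else 0.0) for i in range(36)],
})
ts = TimeSeriesData(df)

# decompose the series and flag points that deviate strongly from it
detector = OutlierDetector(ts, decomp="additive")
detector.detector()
print(detector.outliers)  # timestamps of the flagged points

# autogenerate summary features for downstream supervised models
features = TsFeatures().transform(ts)
print(features)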

What is InfluxDB

Time series analysis requires a database suitable for storing and retrieving data effectively and efficiently. Here, you’ll use InfluxDB, one of the leading platforms for building time series applications. It’s a high-performing, open source engine with a large community, and it’s easy to use. In addition, it can be hosted locally or in the cloud.

Implementing time series forecasting with Python and Facebook Kats

In the following tutorial, you’ll learn how to create a simple forecasting data set that uses InfluxDB to store the data and then analyze it with Facebook Kats.

Prerequisites

Before you begin, you should have a basic understanding of Python syntax and command line/terminal commands.

All the code for this tutorial is available in this GitHub repository.

Connect to an InfluxDB Cloud instance

To get started with the InfluxDB Cloud instance, visit InfluxData’s website and click on Get InfluxDB in the upper-right-hand corner.

Select Use it for Free for the cloud-only account interface.

Then you’ll be taken to a sign-up page where you can enter the necessary information and sign up.

You need to select a cloud service provider (Amazon Web Services (AWS) was selected here) to store your InfluxDB time series data. You don’t need to be familiar with any of these services, as InfluxDB abstracts away the underlying complexities. Once you’ve selected your cloud provider, add your company name and agree to the terms.

Now you’re ready to begin setting up your database. Choose the plan of your preference. In this instance, a free plan is sufficient.

After selecting your plan, you’ll be taken to a Get Started screen that lists a number of programming languages. Select Python for this demo.

On the next page, you can watch a video that shows you how to set up InfluxDB. Click Next once you’re done.

Set up your local machine

To access InfluxDB through Python, you need to install the influxdb-client library on your machine. You can do this by running the following command in the terminal:

pip3 install influxdb-client

Please note: pip3 installs packages for Python 3.x and is the command used throughout this tutorial.

Generate your API token from the web interface by navigating to API Tokens > Generate API Token > All Access API Token. You’ll be using an All Access API Token for this tutorial, though you can also choose to generate a Custom API Token if you want to choose the authentication level of the user.

Run the following command in the terminal or command line to add your token as an environment variable:

export INFLUXDB_TOKEN="your token"

The token is kept out of the code as an environment variable so the code can be shared across teams without exposing credentials.

From here, you’ll be using a Python IDE or a Jupyter Notebook to write and read data to InfluxDB.
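
Before moving on, you can sanity-check the connection. The following is a minimal sketch, assuming the influxdb-client library installed above and the INFLUXDB_TOKEN environment variable exported earlier; replace the url and org placeholders with your own values:

import os
from influxdb_client import InfluxDBClient

# the token comes from the environment variable exported above
client = InfluxDBClient(
    url="your influx db custom url",
    token=os.environ.get("INFLUXDB_TOKEN"),
    org="your influx db org name",
)
print(client.ping())  # True if the InfluxDB instance is reachable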

Write data to InfluxDB

To write data to InfluxDB, you’ll need access to some data. Here, you’ll use the Air Passengers data set, a classic data set of monthly airline passenger counts from 1949 to 1960.

To begin, install the pandas library on your operating system using the following command:

pip3 install pandas

Create a file named writePassengerData.py, paste in the following code, and make sure AirPassengers.csv is in the same directory:

writePassengerData.py

import pandas as pd
import influxdb_client, os, time
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

# read the token from the INFLUXDB_TOKEN environment variable set earlier
token = os.environ.get("INFLUXDB_TOKEN")
org = "your influx db org name"
url = "your influx db custom url"
bucket = "your influx bucket name"

client = influxdb_client.InfluxDBClient(url=url, token=token, org=org)
write_api = client.write_api(write_options=SYNCHRONOUS)

df = pd.read_csv('AirPassengers.csv')

# write each row as a point tagged with its month
for i in df.index:
    point = (
        Point("passengers")
        .tag("month", df.iloc[i, 0])
        .field("passengers", df.iloc[i, 1])
    )
    write_api.write(bucket=bucket, org=org, record=point)
    time.sleep(1)  # separate points by 1 second

In the previous code block, make sure you use the org name entered earlier and your instance’s URL; the token is read from the INFLUXDB_TOKEN environment variable you exported before. Here, you import the necessary libraries, such as pandas for reading CSV data and influxdb_client for writing data to InfluxDB Cloud instances. Then you declare the string variables that hold information like the token, org, URL, and bucket. Next, you instantiate the client using InfluxDBClient() and activate the write API using the write_api() method. The CSV file is read using pandas.read_csv and stored in a data frame object. Then you iterate over each row in the data frame, create a temporary point object, and write the point object as a record to InfluxDB.
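
As an aside, if you don’t want to wait out the one-second sleep per row, the write API also accepts a list of points in a single call. Here’s a sketch using the same variables as the script above; because every point carries a unique month tag, the points remain distinct series even if the server assigns them near-identical timestamps:

# build all points up front and write them in one batched call
points = [
    Point("passengers")
    .tag("month", df.iloc[i, 0])
    .field("passengers", df.iloc[i, 1])
    for i in df.index
]
write_api.write(bucket=bucket, org=org, record=points)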

Next, run the file from your terminal:

python3 writePassengerData.py

Verify that the data is written to your InfluxDB Cloud bucket by going to InfluxDB Cloud and clicking on Buckets. Then click on your bucket and verify that the measurement name passengers is available:

Passengers data inside the bucket

Read data from InfluxDB

To read data from InfluxDB, create a notebook named timeSeriesAnalysis.ipynb and include the following code in it:

timeSeriesAnalysis.ipynb

import influxdb_client, os, time
from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

# read the token from the INFLUXDB_TOKEN environment variable set earlier
token = os.environ.get("INFLUXDB_TOKEN")
org = "your influx db org name"
url = "your influx db custom url"

client = influxdb_client.InfluxDBClient(url=url, token=token, org=org)
query_api = client.query_api()

# Flux query: the mean passenger count per month tag
query = """from(bucket:"your influx db bucket name")
  |> range(start: -1000m)
  |> filter(fn: (r) => r._measurement == "passengers")
  |> mean()"""

tables = query_api.query(query, org=org)

results = []
for table in tables:
    for record in table.records:
        results.append({'month': record.values.get('month'),
                        record.get_field(): record.get_value()})

In this code block, you import the same libraries as before. Next, you declare the variables that hold information like the token, org, and URL. Then you instantiate the client using InfluxDBClient() and activate the query API using query_api().

Please note: Querying is used as a synonym for reading in database terminology.

Once you’ve activated the query API, you can add details like the bucket and measurement names to the Flux query. Using the query API, you fetch the data into the tables object, iterate over the tables, parse the required information, and append it to a list.

After running this code, you’ll get a list of dictionaries containing your data that looks like this:

[{'month': '1949-03', 'passengers': 132.0},
 {'month': '1954-04', 'passengers': 227.0},
 {'month': '1952-06', 'passengers': 218.0},
 {'month': '1956-07', 'passengers': 413.0},
 {'month': '1958-06', 'passengers': 435.0},
 {'month': '1955-09', 'passengers': 312.0},
 {'month': '1956-02', 'passengers': 277.0},
 {'month': '1958-01', 'passengers': 340.0},
 {'month': '1954-11', 'passengers': 203.0},
 {'month': '1959-07', 'passengers': 548.0}]

Now it’s time to make this readable by converting it to a pandas data frame.
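
As an aside, the influxdb-client library can also return a pandas data frame directly via query_api.query_data_frame(), skipping the manual parsing. Here’s a sketch using the same query and client as above; the column names follow Flux’s annotated output, so the value arrives in _value next to the month tag, and depending on the query this call can return a single data frame or a list of them:

# fetch the query result straight into a data frame
df_alt = query_api.query_data_frame(query, org=org)
df_alt = df_alt[["month", "_value"]].rename(
    columns={"month": "time", "_value": "value"}
)
print(df_alt.head())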

Import libraries for time series forecasting

Now that you have read your data, it’s time to begin analyzing it.

In order to carry out the analysis, you need to install and import the following libraries:

pip3 install numpy
pip3 install kats
pip3 install statsmodels
pip3 install matplotlib

pandas, NumPy, and Matplotlib are installed for data manipulation and visualization, while warnings ships with the Python standard library and doesn’t need to be installed. From Kats, you import the SARIMA, Holt-Winters, and Prophet models for time series analysis, along with TimeSeriesData, which converts a standard data frame into a time series object consumable by the Kats library:

timeSeriesAnalysis.ipynb

import pandas as pd
import numpy as np
import sys
import matplotlib.pyplot as plt
import warnings
import statsmodels.api as sm

from kats.models.sarima import SARIMAModel, SARIMAParams
from kats.models.holtwinters import HoltWintersParams, HoltWintersModel
from kats.models.prophet import ProphetModel, ProphetParams
from kats.consts import TimeSeriesData

Convert data to a time series format

To convert the data to a time series object, begin by converting the results list to a data frame. Sort the data frame values by month in ascending order and rename the columns from “month” and “passengers” to “time” and “value”, respectively (these are the column names the Kats library expects). Finally, convert the data frame object to a time series data object:

timeSeriesAnalysis.ipynb

air_passengers_df = pd.DataFrame(results)
air_passengers_df.sort_values('month', inplace=True)
air_passengers_df.columns = ["time", "value"]
air_passengers_ts = TimeSeriesData(air_passengers_df)

Check for stationarity

Now it’s time to check visually whether the time series is stationary, which is a prerequisite for many time series models:

timeSeriesAnalysis.ipynb

plt.figure(figsize=(35, 20))
fig = plt.plot(air_passengers_df['time'], air_passengers_df["value"])
plt.xticks(rotation=90)
plt.show()

The time series is non-stationary, as the mean and variance are not constant:

Data is non-stationary

You can confirm this with the Augmented Dickey-Fuller test:

timeSeriesAnalysis.ipynb

from statsmodels.tsa.stattools import adfuller

X = air_passengers_df["value"]
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])

The null hypothesis of the ADF test is that the series has a unit root, meaning it’s non-stationary. The high p-value in the output means you can’t reject that hypothesis, so stationarity can’t be established for this series:

ADF Statistic: 0.815369
p-value: 0.991880

You can make the series stationary by differencing it when you run the following code:

timeSeriesAnalysis.ipynb

plt.figure(figsize=(35, 20))
fig = plt.plot(air_passengers_df['time'], air_passengers_df["value"].diff())
plt.xticks(rotation=90)
plt.show()

Differenced (stationary) time series

As you can see, the Augmented Dickey-Fuller test on the differenced series gives better results (the [1:] slice drops the NaN that diff() produces for the first observation):

timeSeriesAnalysis.ipynb

result = adfuller(X.diff()[1:])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])

Your output will look like this:

ADF Statistic: -2.829267
p-value: 0.054213

Now that the series is almost stationary, it’s time to plot the autocorrelation function (ACF) and partial autocorrelation function (PACF) charts.

ACF and PACF plots

ACF and PACF charts help determine the moving average (MA) lag (q) and the autoregressive (AR) lag (p):

timeSeriesAnalysis.ipynb

fig, ax = plt.subplots(2, 1)
fig.set_figheight(15)
fig.set_figwidth(15)
fig = sm.graphics.tsa.plot_acf(air_passengers_df["value"].diff()[1:], lags=50, ax=ax[0])
fig = sm.graphics.tsa.plot_pacf(air_passengers_df["value"].diff()[1:], lags=50, ax=ax[1])
plt.show()

Based on the following PACF and ACF charts, p = 2 and q = 1 are good values to begin with. The hyperparameters can be tuned further using a grid search over a holdout sample, as sketched after the charts:

Autocorrelation and partial autocorrelation charts
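
To run the grid search mentioned above, you can use a simple holdout scheme. The following is a minimal illustrative sketch (not Kats’s built-in tuner) that uses the SARIMAModel covered in the next section: it holds out the last twelve months, fits a model for each (p, q) candidate, and keeps the pair with the lowest mean absolute error on the holdout:

import numpy as np

# hold out the last 12 months for evaluation
train_ts = TimeSeriesData(air_passengers_df.iloc[:-12])
actual = air_passengers_df["value"].iloc[-12:].to_numpy()

best = None
for p in (1, 2, 3):
    for q in (1, 2):
        params = SARIMAParams(p=p, d=1, q=q, seasonal_order=(1, 0, 1, 12), trend='ct')
        model = SARIMAModel(data=train_ts, params=params)
        model.fit()
        fcst = model.predict(steps=12, freq="MS")
        mae = float(np.mean(np.abs(fcst["fcst"].to_numpy() - actual)))
        if best is None or mae < best[0]:
            best = (mae, p, q)

print("best (p, q) = (%d, %d), holdout MAE = %.1f" % (best[1], best[2], best[0]))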

SARIMA model

To train the SARIMA model and predict with it, start by declaring the p, d, and q params, where d = 1 for linearly trending data, along with a seasonal order whose period is twelve months (annual air travel seasonality). Instantiate a SARIMA model object using the training data and parameters and fit the model. Then predict using the trained model and plot the data and predictions as shown here:

timeSeriesAnalysis.ipynb

# declare SARIMA parameters - use the ACF/PACF charts and grid search
params = SARIMAParams(p=2, d=1, q=1, seasonal_order=(1, 0, 1, 12), trend='ct')

# train the SARIMA model
m = SARIMAModel(data=air_passengers_ts, params=params)
m.fit()

# forecast the next 30 months
fcst = m.predict(steps=30, freq="MS")

# visualize the predictions
m.plot()

Though the model is able to identify the seasonality and autocorrelation, it does not capture the growing amplitude of the seasonal swings, leading to a high error variance:

SARIMA predictions

Holt-Winters model

The Holt-Winters model overcomes this shortcoming of the SARIMA model by capturing the growing seasonal amplitude (via its multiplicative seasonal component) and generates predictions with higher confidence:

timeSeriesAnalysis.ipynb

# declare parameters for the Holt-Winters model
params = HoltWintersParams(trend="add", seasonal="mul", seasonal_periods=12)

# fit a Holt-Winters model
hw_model = HoltWintersModel(data=air_passengers_ts, params=params)
hw_model.fit()

# forecast the next 30 months
fcst = hw_model.predict(steps=30, alpha=0.1)

# plot the predictions
hw_model.plot()

Holt-Winters predictions

Prophet time series model

The Prophet time series model further improves on these predictions. The steps are similar to the two examples discussed previously:

timeSeriesAnalysis.ipynb

# declare parameters for the Prophet model - choose between additive and
# multiplicative seasonality; multiplicative gives better results here
params = ProphetParams(seasonality_mode='multiplicative')

# fit a Prophet model instance
model = ProphetModel(air_passengers_ts, params)
model.fit()

# forecast the next 30 months
fcst = model.predict(steps=30, freq="MS")

# visualize the predictions
model.plot()

The Prophet model predictions have the least error variance and, thus, the highest confidence:

Prophet predictions
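
The comparison so far is visual. To put rough numbers behind it, Kats ships a simple backtester; the following sketch assumes Kats’s BackTesterSimple API and reuses the parameter settings from the sections above, training each model on the first 75 percent of the series and reporting the mean absolute percentage error (MAPE) on the remaining 25 percent:

from kats.utils.backtesters import BackTesterSimple

candidates = [
    ("SARIMA", SARIMAModel,
     SARIMAParams(p=2, d=1, q=1, seasonal_order=(1, 0, 1, 12), trend='ct')),
    ("Holt-Winters", HoltWintersModel,
     HoltWintersParams(trend="add", seasonal="mul", seasonal_periods=12)),
    ("Prophet", ProphetModel,
     ProphetParams(seasonality_mode='multiplicative')),
]

# backtest each model on the same 75/25 split and compare MAPE
for name, model_class, model_params in candidates:
    backtester = BackTesterSimple(
        error_methods=["mape"],
        data=air_passengers_ts,
        params=model_params,
        train_percentage=75,
        test_percentage=25,
        model_class=model_class,
    )
    backtester.run_backtest()
    print(name, "MAPE:", backtester.errors["mape"])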

Conclusion

In this article, you learned about the significance of time series data and some of its key applications across multiple industries. You also learned how to set up an InfluxDB Cloud instance and write CSV data to it using the Python client library. From there, you read the data back from InfluxDB and analyzed it with Facebook Kats. The article concluded with a step-by-step walkthrough of three popular time series algorithms, namely SARIMA, Holt-Winters, and Prophet, along with a comparison of their performance.

If you’re looking for a platform for building and operating time series applications, check out InfluxDB. InfluxDB is open source and empowers developers to build and deploy transformative monitoring, analytics, and IoT applications faster and to scale. The platform can handle massive volumes of time series data produced by networks, IoT devices, apps, and containers.

About the author

Vidhi Chugh

Vidhi Chugh is an award-winning AI/ML innovation leader and a leading expert in data governance with a vision to build trustworthy AI solutions. She works at the intersection of data science, product and research teams to deliver business value and insights at Walmart Global Tech India. Learn more on her LinkedIn page and Medium profile.