Wouldn’t it be nice to be able to perfectly predict the future? We are a long way from being able to do that, but it is basically the goal of anybody working in the data science field: take a bunch of historical data and try to make future predictions based on it.
Time series data and machine learning
In this article you will learn about forecasting with time series data, in particular by combining InfluxDB to store time series data and TensorFlow to make predictions. Time series data poses a particular challenge for machine learning because each data point is correlated with the points that came before it, a property known as autocorrelation. Many data science algorithms assume that observations are independent of one another, so they can’t be applied to time series data without modification. As a result, working with time series data for machine learning is a bit different compared to other domains.
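To make autocorrelation concrete, here is a minimal sketch in plain Python. The sensor readings are synthetic and invented for illustration. The same values are measured twice, once in time order and once shuffled, to show that the correlation lives in the ordering rather than in the values themselves.

```python
import math
import random


def autocorrelation(series, lag=1):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    # Total variation around the mean (denominator of the estimator).
    denom = sum((x - mean) ** 2 for x in series)
    # Co-variation between each point and the point `lag` steps earlier.
    numer = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numer / denom


# Hypothetical hourly sensor readings: a smooth daily cycle plus a little noise.
rng = random.Random(0)
readings = [math.sin(2 * math.pi * t / 24) + rng.gauss(0, 0.1) for t in range(240)]

# The same values in random order: the time structure is destroyed.
shuffled = readings[:]
rng.shuffle(shuffled)

print(f"lag-1 autocorrelation, original order: {autocorrelation(readings):.2f}")
print(f"lag-1 autocorrelation, shuffled order: {autocorrelation(shuffled):.2f}")
```

The ordered series should score close to 1 and the shuffled one close to 0, which is why randomly shuffling rows before a train/test split, a common step for independent data, is usually a mistake with time series.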
What is TensorFlow?
TensorFlow is an open source framework for machine learning created by Google. It has rapidly become one of the most popular tools in the machine learning ecosystem. Over time TensorFlow has added more tools and features to not only make development of machine learning models easier, but also make deploying machine learning in production much easier.
What is InfluxDB?
InfluxDB is a purpose-built database for storing and working with time series data. It can ingest millions of data points per minute and makes that data available for fast querying, including for the types of workloads that standard databases tend to struggle with.
InfluxDB also provides a number of tools to make working with time series data easier beyond just storage. Telegraf is an open source server agent that has 250 plugins available to collect metrics. Telegraf can then use processor plugins to transform or enrich that data before outputting it to InfluxDB or 50 other available data store plugins.
Once your data is inside InfluxDB you can query your data using Flux, a query language designed for working with time series data. You can also use the built-in visualization tools or export to something like Grafana.
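As a rough illustration of what a Flux query looks like, the sketch below builds one as a Python string. The bucket, measurement, and field names are hypothetical, and the helper `build_flux_query` is invented for this example. It assumes the names come from trusted configuration, since they are interpolated directly into the query text.

```python
def build_flux_query(bucket, measurement, field, start="-30d", every="1h"):
    """Return a Flux query that downsamples one field to evenly spaced means."""
    return (
        f'from(bucket: "{bucket}")\n'
        f'  |> range(start: {start})\n'
        f'  |> filter(fn: (r) => r._measurement == "{measurement}")\n'
        f'  |> filter(fn: (r) => r._field == "{field}")\n'
        # Averaging into fixed windows turns irregular samples into a regular grid.
        f'  |> aggregateWindow(every: {every}, fn: mean, createEmpty: false)\n'
        f'  |> keep(columns: ["_time", "_value"])'
    )


# Hypothetical names: a "sensors" bucket holding "weather" measurements.
query = build_flux_query("sensors", "weather", "temperature")
print(query)
```

With the official `influxdb-client` Python package, a string like this can be passed to `query_api().query_data_frame(...)` on an `InfluxDBClient` to get the result back as a pandas DataFrame. Connection details such as the URL, token, and organization depend on your deployment and are left out here.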
InfluxDB is designed to make any situation where you are using time series data more developer friendly, but in this article we will focus on the benefits to data scientists and machine learning engineers. For this use case, InfluxDB is most helpful in streamlining data engineering, which will be covered in detail later.
Machine learning vs. statistical models
Machine learning and neural nets aren’t magic; in some cases it still makes sense to use more standard statistical methods for forecasting. Generally speaking, a statistical model requires fewer computing resources and works better for univariate forecasting. A downside of statistical models is that they typically require a lot of tuning and data preparation, which can be time-consuming. In a business environment where your time series data is sparse, irregular, and multivariate, or where you need to generate many different models, it might make sense to use machine learning. Machine learning also tends to work better when the forecasts themselves are multivariate in nature.
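For a sense of how lightweight a statistical method can be, here is a minimal sketch of simple exponential smoothing. The method is chosen here purely for illustration, since the article does not name a specific one, and the request counts are made up. The single parameter `alpha` is an example of the tuning mentioned above.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast after smoothing `series`.

    alpha in (0, 1]: higher values weight recent observations more heavily.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialise the level with the first observation
    for value in series[1:]:
        # New level = weighted blend of the latest value and the old level.
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical daily request counts for one service.
history = [120, 132, 128, 140, 138, 145, 150]
forecast = simple_exponential_smoothing(history, alpha=0.5)
print(f"next-day forecast: {forecast:.1f}")
```

The whole model is one number updated in a loop, which is why methods like this are cheap to run. The trade-off is that it handles a single variable and needs `alpha` chosen per series.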
An example of the tradeoffs in action is the M4 competition. The M competitions are a series of time series forecasting competitions that have been running since 1982, and M4 was held in 2018. The winning model came from Uber and was a hybrid that combined deep learning with statistical modeling. Many of the other top entries relied on standard statistical models and outperformed pure machine learning models for univariate time series forecasting.
In short, for business use cases where you have multivariate time series and your data is sparse, it makes sense to use machine learning because it is better suited to those data problems. If maximum accuracy is important, it might make sense to use a hybrid approach.
Why use TensorFlow with time series data
The primary reason TensorFlow is a good choice for working with time series data is the community. Because TensorFlow is one of the most popular machine learning frameworks available, you get the benefit of a number of helpful tools in the ecosystem. Its popularity also makes it easier to find material for educating your team and to hire experienced engineers who are familiar with TensorFlow.
TensorFlow Lite is another particularly interesting reason to go with TensorFlow for time series data use cases because it was designed for working on devices with lower amounts of computing power. IoT is a perfect use case for TensorFlow Lite and could allow you to use machine learning on the edge from collected time series sensor data. TensorFlow Lite is optimized for mobile and embedded devices, has a much smaller binary size than standard TensorFlow, and has a faster initialization time.
Why use TensorFlow with InfluxDB
InfluxDB is a database designed for working with time series data. When working with TensorFlow, using InfluxDB can simplify many problems related to data engineering and the overall data pipeline. InfluxDB compresses time series data efficiently, which saves on storage costs, and it can also act as a buffer when ingesting real-time data.
InfluxDB also improves the usability of your machine learning pipeline. You can apply your models to new data streams by simply modifying a query. Having your historical and real-time data in the same place also makes moving from testing to production more streamlined.
For a practical example of using InfluxDB with Keras, the high-level API for TensorFlow that makes building models even easier, you can check out this GitHub repository that shows you how to make weather predictions by using TensorFlow to create an LSTM neural network.
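To give a feel for the shape of such a model, here is a minimal sketch. It is not the code from the repository mentioned above. It assumes NumPy and TensorFlow 2.x are installed, and it uses a synthetic sine wave in place of temperatures queried from InfluxDB. The helper `make_windows`, the window length of 24, and the layer sizes are illustrative choices.

```python
import numpy as np
import tensorflow as tf


def make_windows(values, window):
    """Split a 1-D series into (samples, window, 1) inputs and next-step targets."""
    values = np.asarray(values, dtype="float32")
    X = np.stack([values[i : i + window] for i in range(len(values) - window)])
    y = values[window:]  # the value that follows each window
    return X[..., np.newaxis], y  # LSTM layers expect (samples, timesteps, features)


# Synthetic stand-in for hourly temperatures that would come from InfluxDB.
series = np.sin(np.arange(200) * 2 * np.pi / 24)
X, y = make_windows(series, window=24)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(24, 1)),
    tf.keras.layers.LSTM(16),  # summarises each 24-step window
    tf.keras.layers.Dense(1),  # predicts the next value
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=32, verbose=0)  # short run, illustration only

# Forecast the step that follows the most recent window.
next_value = model.predict(X[-1:], verbose=0)
print(next_value.shape)
```

Two epochs are far too few for a useful forecast. The point is the data layout: each training sample is a sliding window of past values, and the target is the value that comes immediately after it.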
The future of InfluxDB and data science
The volume of time series data continues to grow rapidly. As customers demand more reliable software, companies require more metrics at ever finer granularity. The number of IoT devices and sensors is also increasing. McKinsey estimates that IoT could generate between $5.5 trillion and $12.6 trillion in economic value globally by 2030.
All of this data is time series and that data isn’t being collected just to sit around. The primary goal of this data will be to make forecasts, increase efficiency, and improve reliability of everything in both the real world and virtual world of software.
InfluxDB will only continue to provide value for data science and machine learning workloads in the future. InfluxDB IOx is InfluxDB’s new storage engine. It is built on Apache Arrow and DataFusion, which will make InfluxDB even easier to integrate with the big data ecosystem of tools.
Now that you’ve had an introduction to how and why you’d want to use InfluxDB with TensorFlow and other data science workflows, I’ll leave you with some additional resources you can check out if you want a deeper dive into how InfluxDB can be used for data science.
- Forecasting with FB Prophet and InfluxDB
- Zeppelin, Spark, and InfluxDB for Big Data Time Series Scenarios
- Why Use K-Means for Time Series Data
- BIRCH for Anomaly Detection with InfluxDB
- Managing TensorFlow with InfluxDB
- Jupyter notebooks for anomaly detection and forecasting with InfluxDB