Getting Started with Time Series Data Science
By Anais Dotis-Georgiou / Mar 15, 2021 / InfluxDB, Community, Developer
Are you interested in performing time series forecasting or anomaly detection, but you don’t know where to start? If so, you’re not alone. There is an overwhelming variety of libraries, algorithms, and workflow recommendations for these tasks. As a Developer Advocate at InfluxDB, the leading time series database, I’ve researched time series data science methodologies and best practices for forecasting and anomaly detection. Today I want to summarize some important concepts about time series as well as share resources to get you started on your time series data science journey.
Why should a beginner interested in data science start learning about time series?
If you’re interested in becoming a data scientist, learning about data science as it pertains to time series is a great place to start. Time series data is data that is indexed chronologically. Because it’s indexed in time, each data point is often related to what came before. To explain what I mean, let’s take a look at weather data. The temperature in your city right now is correlated with the temperature an hour ago, and even with the temperature last week or at the same time last year. In other words, the temperature data is correlated with itself at other points in time. This statistical phenomenon is called autocorrelation, and it is one of the reasons that time series data is unique in the data world.
As a result, several data science algorithms that work well for other types of data don’t work as well for time series data. This is because many advanced prediction and anomaly detection algorithms, including neural networks, rely on the assumption that your dataset doesn’t exhibit certain statistical attributes common to time series data, like autocorrelation.
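To make autocorrelation concrete, here is a minimal sketch in plain Python. The temperature values are made up (a clean daily cycle, not real weather data), and `autocorrelation` is a hypothetical helper written for illustration rather than a function from any library. It measures how strongly the series correlates with a copy of itself shifted by a given number of hours.

```python
import math


def autocorrelation(values, lag):
    """Pearson correlation between a series and itself shifted by `lag` steps."""
    if lag <= 0 or lag >= len(values) - 1:
        raise ValueError("lag must be between 1 and len(values) - 2")
    x = values[:-lag]  # earlier observations
    y = values[lag:]   # the same series, `lag` steps later
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up hourly temperatures: one daily cycle repeated over 7 days.
hourly_temps = [15 + 5 * math.sin(2 * math.pi * hour / 24) for hour in range(24 * 7)]

lag_1 = autocorrelation(hourly_temps, 1)    # one hour apart
lag_12 = autocorrelation(hourly_temps, 12)  # half a day apart
lag_24 = autocorrelation(hourly_temps, 24)  # same hour on the previous day
print(f"lag 1h: {lag_1:.2f}, lag 12h: {lag_12:.2f}, lag 24h: {lag_24:.2f}")
```

With this idealized series, the values one hour apart and one day apart are strongly positively correlated, while values half a day apart are strongly negatively correlated. Real temperature data is noisier, so expect weaker correlations. If you already work with pandas, `Series.autocorr(lag=...)` performs the same kind of calculation.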
You can still use neural networks on time series data whose attributes violate the network’s assumptions, but you have to eliminate those attributes first. For example, you can reduce autocorrelation in your time series through differencing, but this type of data pre-processing can be tricky. Luckily, statistical algorithms are generally easier to understand than neural nets, and they are frequently excellent predictors and good at identifying anomalies. These two factors make learning about time series an excellent place for beginners to start their data science journey.
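As a rough illustration of differencing, the sketch below builds a synthetic random walk with drift (made-up data, seeded so it is repeatable) and compares its lag-1 autocorrelation before and after first-order differencing. The helper names `difference` and `lag1_autocorrelation` are hypothetical and exist only for this example. Differencing works well here because of how the series was constructed; on real data you may need seasonal differencing or other transformations.

```python
import math
import random
from itertools import accumulate


def difference(values, lag=1):
    """Return the series of changes between points `lag` steps apart."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]


def lag1_autocorrelation(values):
    """Pearson correlation between each point and the point before it."""
    x, y = values[:-1], values[1:]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Synthetic random walk with drift: each value is the previous one plus noise.
rng = random.Random(42)
steps = [0.1 + rng.gauss(0, 1) for _ in range(500)]
series = list(accumulate(steps))

differenced = difference(series)  # one point shorter than the original

before = lag1_autocorrelation(series)
after = lag1_autocorrelation(differenced)
print(f"lag-1 autocorrelation before: {before:.2f}, after: {after:.2f}")
```

The original series is highly autocorrelated because every value carries the full history of previous steps. After differencing, only the independent steps remain, so the lag-1 autocorrelation drops close to zero.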
Recommended tools for a beginner looking to learn about time series data science
The first step in performing forecasting or anomaly detection is to learn about the various algorithms and methods that exist to help you achieve your goal. Always make sure to research the underlying statistical assumptions of the algorithm you choose, and verify whether or not your data violates those assumptions. I always turn to Jupyter Notebooks for preliminary algorithm selection research. Notebooks let me try out algorithms on sample datasets to better understand various Python libraries and their time series algorithms. Once I feel that I understand the library and algorithm that I want to employ, I test the performance of that algorithm on my own dataset. I store all of my time series data in InfluxDB, and I use the InfluxDB Python client to pull specific datasets out for further analysis.
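Here is a minimal sketch of what pulling data out of InfluxDB for analysis might look like. It assumes the `influxdb-client` package for InfluxDB 2.x (with pandas installed for DataFrame support). The bucket name, measurement name, URL, token, and org are all placeholders, and `build_flux_query` and `fetch_dataframe` are hypothetical helpers for this example, not part of the client library.

```python
def build_flux_query(bucket, measurement, start="-1h"):
    """Return a Flux query selecting one measurement from a bucket over a time range."""
    return (
        f'from(bucket: "{bucket}")\n'
        f"  |> range(start: {start})\n"
        f'  |> filter(fn: (r) => r._measurement == "{measurement}")'
    )


def fetch_dataframe(url, token, org, flux_query):
    """Run a Flux query and return the result as a pandas DataFrame."""
    # Imported here so the query builder above works without the client installed.
    from influxdb_client import InfluxDBClient

    client = InfluxDBClient(url=url, token=token, org=org)
    try:
        return client.query_api().query_data_frame(query=flux_query, org=org)
    finally:
        client.close()


# Placeholder bucket and measurement names; replace them with your own.
query = build_flux_query("my-bucket", "temperature", start="-7d")
print(query)

# Requires a running InfluxDB instance and valid credentials (placeholders shown):
# df = fetch_dataframe("http://localhost:8086", "my-token", "my-org", query)
```

Keeping the query string separate from the client call makes it easy to inspect the Flux you are about to run, or paste it into the InfluxDB UI, before loading the result into a notebook.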
Time series data science resource for InfluxDB
While InfluxDB allows you to transform your data and even write custom functions for anomaly detection with Flux, I want to introduce you to the Notebooks repo. This repo contains a variety of Jupyter Notebooks to help you get started with InfluxDB and time series data science tasks. Within this repo you can learn how to:
- Get started with Python and InfluxDB
- Get started with Pandas and InfluxDB
- Use the Flux interpreter for Jupyter Notebooks
- Perform anomaly detection
  - Multiple time series
  - Single time series
- Perform forecasting
  - FB Prophet
Further reading on time series forecasting and anomaly detection with InfluxDB
The Notebooks repo consolidates material from several other blog posts. Here is a list of the relevant posts:
- Autocorrelation in Time Series Data
- When you Want Holt Winters Instead of Machine Learning
- Getting Started with InfluxDB and Pandas
- Getting Started with Python and InfluxDB v2.0
- Write Millions of Points From CSV to InfluxDB with the 2.0 Python Client
- Forecasting with FB Prophet and InfluxDB
- BIRCH for Anomaly Detection with InfluxDB
- Anomaly Detection with Median Absolute Deviation
- Zeppelin, Spark, and InfluxDB for Big Data Time Series Scenarios
- Why Use K-Means for Time Series Data? (Part One)
I hope the Notebooks repo helps you get started experimenting with various time series algorithms. If you are getting started with time series data science, please ask us for help and share your story! Share your thoughts, concerns, or questions in the comments section, on our community site, or in our Slack channel. We’d love to get your feedback and help you with any problems you run into!