What is time series analysis?
Time series analysis is the study of data collected at specific intervals over a period of time, with the purpose of identifying trend, seasonality, and residuals to aid in forecasting future events. It involves inferring what has happened to a series of data points in the past and attempting to predict future values.
Analyzing time series data allows extracting meaningful statistics and other characteristics of the data. Such analysis requires that the pattern of observed time series data is identified. Once the pattern is established, it can be interpreted, integrated with other data, and used for forecasting (which is fundamental for machine learning).
Importance of time series analysis
As more connected devices are deployed and more data is expected to be collected and processed in real time, the ability to handle time series data has become increasingly significant. Time series analysis can be used to:
- Show that data points taken over time may have an internal structure (such as autocorrelation, trend or seasonal variation)
- Understand the past as well as predict the future
Since the analysis is based on data plotted against time, the first step is to plot the data and observe any patterns that might occur over time.
Programming languages used
Many programming languages are used for time series analysis and data science. This section focuses on one of them, Flux.
Flux, developed by InfluxData, is one of the newest open source programming languages purpose-built for time series analysis. A data scripting and query language, Flux makes it easy to see change across time. Traditionally, grouping, shaping, and performing mathematical operations across large dynamic time series datasets is cumbersome. Flux makes working with these datasets much more elegant.
Flux is a powerful language for working with data and is available on top of InfluxData’s time series platform. It is unlocking exciting new time series use cases and allowing developers to work with data where it resides both within InfluxDB and other data sources such as MySQL, Google Bigtable, MariaDB, and Postgres. The InfluxData platform does the heavy lifting of collecting data, storing it, and providing computing power to analyze the data so software builders can focus on implementing solutions.
Flux makes it easy to create and share a dashboard. Flux is meant to empower every query and visualization tool so that they may bring together related data sets to generate insights using a common, powerful and unified language. Providing a tool to flexibly merge together data sources and analyze them across time is one of Flux’s primary use cases.
Time series analysis example using InfluxDB
To build a real-time risk monitoring system, Robinhood (a pioneer of commission-free investing) chose InfluxDB (an open source time series database) and Faust (an open source Python stream processing library). The architecture behind their system involves both time series anomaly detection (InfluxDB) and real-time stream processing (Faust/Kafka).
An example of infrastructure telemetry, collected with InfluxDB by Robinhood.
Robinhood alerted on the data with Faust, a real-time Python Library for Kafka Streams.
The aggregated data (yellow) is bounded by upper and lower limits (blue).
As the number of time series grows, understanding them and detecting anomalies by hand becomes very costly. This is where an anomaly detection system can intelligently alert you when something goes wrong.
The first anomaly detection solution that Robinhood tried was threshold-based alerting, in which an alert is triggered whenever the underlying data goes over or under a fixed threshold. Threshold-based alerting works well with very simple time series but fails to account for more complex ones. As shown below, the time series here has a trend. It's trending upwards, and there are some up-and-down patterns within that upward trend. A fixed threshold doesn't work well here: the series goes over the threshold and triggers an alert, drops back below it, and then goes over it again. So threshold-based alerting on a complex time series would require the same effort as checking the dashboard 24/7.
Threshold-based alerting works well with simple time series but fails to account for seasonality and trend.
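The failure mode described above can be reproduced with a small sketch. The series, the threshold value of 6.0, and the helper name `threshold_alerts` are made up for illustration and are not taken from Robinhood's system.

```python
import math

def threshold_alerts(values, upper):
    """Return the indices where the series crosses from at-or-below `upper` to above it."""
    alerts = []
    for i in range(1, len(values)):
        if values[i - 1] <= upper < values[i]:
            alerts.append(i)
    return alerts

# Hypothetical series: a slow upward trend plus a repeating up-and-down pattern.
series = [0.1 * t + 3.0 * math.sin(2 * math.pi * t / 12) for t in range(120)]

# Hypothetical fixed threshold.
alerts = threshold_alerts(series, upper=6.0)
print(len(alerts), alerts)
```

Because the up-and-down pattern rides on the trend, the same fixed threshold is crossed several times before the series finally stays above it, and each crossing fires a separate alert.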
To fix this problem, Robinhood alerted on data outside of three standard deviations. Defining your threshold from a standard deviation for anomaly detection is advantageous because it can help you detect anomalies on data that is non-stationary (like the example above). In other words, the threshold defined by a standard deviation will follow your data's trend. Robinhood defined an anomaly as anything more than three standard deviations away from the mean. For normally distributed data, about 99.7% of values lie within this range.
An example of data with a normal distribution. Data that is outside of three standard deviations away from the mean (shaded with green lines) accounts for only about 0.3% of all of the data.
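A minimal sketch of three-standard-deviation alerting follows. It assumes the mean and standard deviation are computed over a rolling window of the most recent points, which is one way to make the band follow the trend. The window size, the sample data, and the function name `rolling_sigma_anomalies` are illustrative assumptions, not details of Robinhood's implementation.

```python
import statistics

def rolling_sigma_anomalies(values, window=20, n_sigmas=3.0):
    """Flag points outside mean +/- n_sigmas * stdev of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.fmean(history)
        std = statistics.stdev(history)
        if abs(values[i] - mean) > n_sigmas * std:
            anomalies.append(i)
    return anomalies

# Hypothetical data: an upward trend with small alternating noise and one spike at index 60.
series = [0.5 * t + (1.0 if t % 2 else -1.0) for t in range(100)]
series[60] += 40.0

anomalies = rolling_sigma_anomalies(series)
print(anomalies)
```

Because the band is recomputed from recent history at every step, the steady upward trend is not flagged, while the spike is.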
Applications in various domains
Time series models are used to:
- Gain an understanding of the underlying forces and structure that produced the observed data
- Fit a model and proceed to forecasting, monitoring or feedback and feedforward control.
Applications span sectors such as:
- Budgetary analysis
- Census analysis
- Economic forecasting
- Inventory studies
- Process and quality control
- Sales forecasting
- Stock market analysis
- Utility studies
- Workload projections
- Yield projections
Understanding data stationarity
Stationarity is an important concept in time series analysis. Many useful analytical tools, statistical tests, and models rely on stationarity to perform forecasting. In many cases, it's necessary to determine whether the data was generated by a stationary process, resulting in stationary time series data. Conversely, it's sometimes useful to transform a non-stationary process into a stationary one in order to apply specific forecasting functions to it. A common method of stationarizing a time series is differencing, which can be used to remove a trend that is not of interest.
Stationarity in a time series is defined by a constant mean, variance, and autocorrelation. While there are several ways in which a series can be non-stationary (for instance, an increasing variance over time), a series can only be stationary in one way (when all these properties do not change over time).
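As a small illustration of differencing, the sketch below subtracts each value from the one that follows it. The sample series and the helper name `difference` are hypothetical.

```python
def difference(values, lag=1):
    """Return the lagged differences values[i] - values[i - lag]."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]

# Hypothetical series with a linear trend: the mean level keeps rising.
series = [2.0 * t + 5.0 for t in range(10)]

# After first-order differencing the trend is gone and the level is constant.
diffed = difference(series)
print(diffed)
```

A linear trend disappears after one round of differencing. Other kinds of non-stationarity, such as a variance that grows over time, may need a different transformation.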
Patterns that may be present within time series data
The variation or movement in a series can be understood through the following three components: trend, seasonality, and residuals. The first two components represent systematic types of time series variability. The third represents statistical noise (analogous to the error terms included in various types of statistical models). To visually explore a series, time series are often formally partitioned into each of these three components through a procedure referred to as time series decomposition, in which a time series is decomposed into its constituent components.
Trend refers to any systematic change in the level of a series — i.e., its long-term direction. Both the direction and slope (rate of change) of a trend may remain constant or change throughout the course of the series.
Unlike the trend component, the seasonal component of a series is a repeating pattern of increase and decrease in the series that occurs consistently throughout its duration. Seasonality is commonly thought of as a pattern that repeats within a period of one year, with quarterly or monthly seasons. However, seasons aren't confined to that time scale. Seasons can exist in the nanosecond range as well.
Residuals constitute what's left after you remove the seasonality and trend from the data.
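The sketch below shows one way to carry out such a decomposition. It assumes an additive model (series = trend + seasonality + residual), a known seasonal period, and a centered moving average as the trend estimate. These are illustrative choices, and the function names and sample data are hypothetical.

```python
def centered_moving_average(values, period):
    """Centered moving average; for an even period the two end points get half weight."""
    half = period // 2
    trend = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        if period % 2 == 0:
            total = sum(window) - 0.5 * (window[0] + window[-1])
        else:
            total = sum(window)
        trend[i] = total / period
    return trend

def decompose_additive(values, period):
    """Split a series into trend, seasonal, and residual parts (additive model)."""
    trend = centered_moving_average(values, period)
    detrended = [v - t if t is not None else None for v, t in zip(values, trend)]
    # Average the detrended values at each position within the cycle.
    seasonal_means = []
    for k in range(period):
        bucket = [d for d in detrended[k::period] if d is not None]
        seasonal_means.append(sum(bucket) / len(bucket))
    # Center the seasonal component so it sums to zero over one period.
    offset = sum(seasonal_means) / period
    seasonal = [seasonal_means[i % period] - offset for i in range(len(values))]
    # The residual is what is left after removing trend and seasonality.
    residual = [v - t - s if t is not None else None
                for v, t, s in zip(values, trend, seasonal)]
    return trend, seasonal, residual

# Hypothetical data: linear trend plus a pattern that repeats every 4 steps, no noise.
pattern = [3.0, -1.0, -4.0, 2.0]
series = [0.5 * t + pattern[t % 4] for t in range(24)]

trend, seasonal, residual = decompose_additive(series, period=4)
print(seasonal[:4])
```

The trend is undefined (`None`) for the first and last half-period, where a full centered window is not available. With this noise-free input the residuals are essentially zero.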
Methods of analyzing time series data
Time series analysis methods may be divided into two classes:
- Frequency-domain methods (these include spectral analysis and wavelet analysis)
In electronics, control systems engineering, and statistics, the frequency domain refers to the analysis of mathematical functions or signals with respect to frequency, rather than time.
- Time-domain methods (these include autocorrelation and cross-correlation analysis)
Time domain refers to the analysis of mathematical functions, physical signals or time series of economic or environmental data, with respect to time. (In the time domain, correlation and analysis can be made in a filter-like manner using scaled correlation, thereby mitigating the need to operate in the frequency domain.)
Additionally, time series analysis methods may be divided into two other types:
- Parametric: The parametric approaches assume that the underlying stationary stochastic process has a certain structure which can be described using a small number of parameters (for example, using an autoregressive or moving average model). In these approaches, the task is to estimate the parameters of the model that describes the stochastic process.
- Non-parametric: By contrast, non-parametric approaches explicitly estimate the covariance or the spectrum of the process without assuming that the process has any particular structure.
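To make the parametric idea concrete, here is a minimal sketch that assumes the process is a zero-mean first-order autoregressive model, x[t] = phi * x[t-1] + noise, and estimates its single parameter by least squares. The simulated data, the value 0.7, and the function name `fit_ar1` are assumptions for illustration only.

```python
import random

def fit_ar1(values):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise (zero-mean series assumed)."""
    num = sum(values[t] * values[t - 1] for t in range(1, len(values)))
    den = sum(values[t - 1] ** 2 for t in range(1, len(values)))
    return num / den

# Simulate a hypothetical AR(1) process with a known parameter.
rng = random.Random(0)
true_phi = 0.7
series = [0.0]
for _ in range(2000):
    series.append(true_phi * series[-1] + rng.gauss(0.0, 1.0))

# The whole process is summarized by one estimated number.
phi_hat = fit_ar1(series)
print(round(phi_hat, 3))
```

The estimate should land close to the value used in the simulation. A non-parametric approach would instead estimate the covariance or spectrum directly, without committing to this structure.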
Below is an overview of each of the above-mentioned methods.
Many time series show periodic behavior that can be very complex. Spectral analysis is a technique that allows us to discover underlying periodicities — it is one of the most widely used methods for data analysis in geophysics, oceanography, atmospheric science, astronomy, engineering, and other fields.
The spectral density can be estimated using an object known as a periodogram, which is the squared correlation between our time series and sine/cosine waves at the different frequencies spanned by the series. To perform spectral analysis, the data must first be transformed from the time domain to the frequency domain.
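The periodogram described above can be sketched directly from its definition by correlating the series with sine and cosine waves at each frequency. The scaling used here (dividing by the series length) is one of several conventions, and the sample signal and function name `periodogram` are hypothetical.

```python
import math

def periodogram(values):
    """Return (frequency, power) pairs for frequencies k/n, k = 1 .. n // 2."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    power = []
    for k in range(1, n // 2 + 1):
        # Correlate the series with a cosine and a sine wave at frequency k/n.
        cos_sum = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        sin_sum = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        power.append((k / n, (cos_sum ** 2 + sin_sum ** 2) / n))
    return power

# Hypothetical signal that repeats every 8 samples.
series = [math.sin(2 * math.pi * t / 8) for t in range(64)]

spectrum = periodogram(series)
peak_frequency = max(spectrum, key=lambda fp: fp[1])[0]
print(peak_frequency)
```

The peak sits at one cycle per 8 samples, which recovers the period built into the signal. This direct form takes time proportional to the square of the series length, so longer series are usually handled with a fast Fourier transform.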
What is a wavelet?
A wavelet is a function that is localized in time and frequency, generally with a zero mean. It is also a tool for decomposing a signal by location and frequency. Compare this with the Fourier transform, which decomposes a signal only into its frequency components and gives no information about where in time they occur.
Wavelets are analysis tools mainly for time series analysis and image analysis (not covered here). As a subject, wavelets are relatively new (1983 to present) and synthesize many new/old ideas.
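As a small illustration of decomposing a signal by location, the sketch below applies one level of the Haar transform, which is among the simplest wavelets. Choosing Haar, the sample signal, and the function name `haar_step` are assumptions made for this example.

```python
import math

def haar_step(values):
    """One level of the Haar wavelet transform: pairwise averages and differences."""
    if len(values) % 2:
        raise ValueError("length must be even")
    scale = math.sqrt(2.0)
    approx = [(values[i] + values[i + 1]) / scale for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / scale for i in range(0, len(values), 2)]
    return approx, detail

# Hypothetical signal with a single jump between positions 4 and 5.
signal = [1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0]

approx, detail = haar_step(signal)
# Only the pair that straddles the jump produces a nonzero detail coefficient.
nonzero = [i for i, d in enumerate(detail) if abs(d) > 1e-12]
print(nonzero)
```

The position of the nonzero detail coefficient shows where the change happens, which is the localization in time that a plain frequency decomposition does not provide.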
What is autocorrelation in time series data?
Autocorrelation is a type of serial dependence. Specifically, autocorrelation is when a time series is linearly related to a lagged version of itself. When you have a series of numbers where values can be predicted based on preceding values in the series, the series is said to exhibit autocorrelation. By contrast, correlation is simply when two different variables are linearly related.
Here's why autocorrelation matters. Often, one of the first steps in any data analysis is performing regression analysis. However, one of the assumptions of regression analysis is that the errors are not autocorrelated. This can be frustrating because if you run a regression on data whose errors are autocorrelated, your analysis will be misleading.
Additionally, some time series forecasting methods (specifically regression modeling) rely on the assumption that there isn't any autocorrelation in the residuals (the difference between the fitted model and the data). People often use the residuals to assess whether their model is a good fit while ignoring the assumption that the residuals have no autocorrelation (or that the errors are independent and identically distributed, or i.i.d.). This mistake can mislead people into believing that their model is a good fit when in fact it isn't.
Finally, perhaps the most compelling aspect of autocorrelation analysis is how it can help us uncover hidden patterns in our data and help us select the correct forecasting methods. Specifically, we can use it to help identify seasonality and trend in time series data. Additionally, analyzing the autocorrelation function (ACF) and partial autocorrelation function (PACF) together is necessary for selecting the appropriate ARIMA model for your time series prediction.
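A minimal sketch of the sample autocorrelation function shows how it exposes seasonality. The estimator below uses the common form that divides by the full-series sum of squares. The sample data and the function name `autocorrelation` are hypothetical.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    num = sum((values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag))
    return num / denom

# Hypothetical series that repeats every 4 steps.
pattern = [3.0, -1.0, -4.0, 2.0]
series = [pattern[t % 4] for t in range(40)]

# Autocorrelation at lags 0 through 8.
acf = [autocorrelation(series, k) for k in range(9)]
print([round(a, 2) for a in acf])
```

The autocorrelation is high at lags 4 and 8, which are multiples of the repeating period, and negative at lag 2. Peaks at regular lags like these are a typical sign of seasonality.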
Cross correlation is a measurement that tracks the movements of two variables or sets of data relative to each other, often as a function of the time lag between them. In its simplest version, it can be described in terms of an independent variable, X, and two dependent variables, Y and Z. If independent variable X influences variable Y and the two are positively correlated, then as the value of X rises so will the value of Y.
If the same is true of the relationship between X and Z, then as the value of X rises, so will the value of Z. Variables Y and Z can be said to be cross correlated because their behavior is positively correlated as a result of each of their individual relationships to variable X.
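One common way to quantify how two series move relative to each other is the sample cross-correlation computed at several lags. The sketch below assumes non-negative lags only. The sample data, where one series is a delayed copy of the other, and the function name `cross_correlation` are hypothetical.

```python
import math

def cross_correlation(x, y, lag):
    """Sample correlation between x[t] and y[t + lag] for lag >= 0 (equal-length series)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    scale_x = math.sqrt(sum((v - mean_x) ** 2 for v in x))
    scale_y = math.sqrt(sum((v - mean_y) ** 2 for v in y))
    num = sum((x[t] - mean_x) * (y[t + lag] - mean_y) for t in range(n - lag))
    return num / (scale_x * scale_y)

# Hypothetical data: y is x delayed by 3 steps (padded with zeros at the start).
x = [math.sin(2 * math.pi * t / 25) for t in range(100)]
y = [0.0] * 3 + x[:-3]

# Search lags 0 through 9 for the strongest relationship.
best_lag = max(range(10), key=lambda k: cross_correlation(x, y, k))
print(best_lag)
```

The lag with the highest cross-correlation recovers the delay between the two series.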
Parametric vs. nonparametric tests
Parametric tests assume underlying statistical distributions in the data. Therefore, several conditions of validity must be met for the result of a parametric test to be reliable. Nonparametric tests do not assume a particular distribution, so they can be applied even when the parametric conditions of validity are not met. This makes them more robust than parametric tests: they are valid in a broader range of situations.
When their conditions of validity are met, parametric tests generally have more statistical power than nonparametric tests, meaning a parametric test is more likely to lead to a rejection of H0 when H0 is false. Most of the time, the p-value associated with a parametric test will be lower than the p-value associated with a nonparametric equivalent run on the same data.
Time series models
There are many time series modeling and forecasting methods. The fitting of time series models can be an ambitious undertaking spanning several approaches. The user's application and preference determine the selection of the appropriate technique.