Automate Anomaly Detection for Time Series Data
By Jason Myers / Sep 08, 2022 / InfluxDB, Community
This article was originally published in The New Stack and is reposted here with permission.
Hundreds of billions of sensors produce vast amounts of time series data every day.
The sheer volume of data that companies collect makes it challenging to analyze and glean insights. Machine learning drastically accelerates time series data analysis so that companies can understand and act on their time series data to drive significant innovation and improvements.
Current estimates predict there will be over 1 trillion sensors producing time series data by 2025. To help companies deal with all this data, Ezako, a French technology company that specializes in artificial intelligence (AI) and time series data, created the Upalgo platform. Upalgo is a SaaS platform that applies machine learning to time series data, making that data more useful by automating the processes of anomaly detection and labeling, then iterating on those processes to improve data models.
The company caters primarily to the aerospace, automotive and telecom industries but can serve any vertical that deals with large amounts of sensor, telemetric and Internet of Things (IoT) data.
The Upalgo platform relies on InfluxDB as its datastore. The company tried several options before discovering InfluxDB, including relational databases, NoSQL databases and a combination of Hadoop and OpenTSDB, which together amounted to a NoSQL database with some adaptations for time series data.
None of these solutions provided the speed and capabilities for handling time series data that Upalgo needed. Key decision factors included InfluxDB’s windowing feature and its active developer community. The Ezako team viewed this as a critical resource for getting help for time series-specific problems from people working in the same areas. Using InfluxDB allowed the data scientists at Ezako to focus on data science and machine learning, not time series storage.
As a high-level overview, the Upalgo platform starts with a data collection API that sends data to InfluxDB. Because Upalgo needs to interact with many different systems, the Ezako team built a REST API that serves as a common layer that easily connects to other tech stacks. Through that API layer, the Upalgo UI can query data out of InfluxDB for visualizations. The machine learning processing layer can access the same data for analysis and write processed data to InfluxDB buckets for deeper analysis and fine-tuning data models.
Machine learning challenges
Even with InfluxDB serving as the core of the Upalgo platform, some inherent challenges remain when applying machine learning to time series data.
One is continuous data ingest. The system collects data around the clock, which consumes a baseline amount of processing resources at all times. ML practitioners need to account for this consumption when planning the other processes that run simultaneously, so they can optimize both continuous and non-continuous workloads to deliver the expected user experience.
Related to the first challenge is the read-intensive learning process. Building a data model requires a lot of data, which means very big read operations. On top of that, the learning process needs to be fast while sharing resources with other processes. This is where the REST API comes into play because it consolidates any read issues into a single tech layer, regardless of what processes or systems are running on top of it.
One of the key features of the Upalgo platform is anomaly detection. The platform provides many different algorithms for modeling and anomaly detection, so users can choose the best option for their data and business goals.
Regardless of the algorithm, however, machine learning requires a lot of data. For example, to begin building an anomaly detection model using the One-Class SVM or Isolation Forest algorithms, you need at least 1 million data points. Calculating features over non-overlapping 60-point windows, a typical window size for second- or hour-resolution data, produces roughly 16,600 windows for the algorithm to learn from. In reality, this isn’t a lot, and that’s only factoring in a single series.
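To make the arithmetic concrete, here is a minimal stdlib-Python sketch of non-overlapping windowing, with synthetic random data standing in for real sensor readings. The window size and the (mean, standard deviation) features are illustrative choices, not Upalgo’s actual configuration:

```python
import random
import statistics

WINDOW = 60           # points per window, as in the example above
N_POINTS = 1_000_000  # minimum series length cited for these algorithms

# Synthetic stand-in for one sensor series.
random.seed(42)
series = [random.gauss(0.0, 1.0) for _ in range(N_POINTS)]

# One feature vector (mean, standard deviation) per complete 60-point
# window; the trailing partial window is dropped.
features = [
    (statistics.fmean(chunk), statistics.pstdev(chunk))
    for chunk in (series[i:i + WINDOW] for i in range(0, N_POINTS, WINDOW))
    if len(chunk) == WINDOW
]

print(len(features))  # 16666 training windows from a single 1M-point series
```

Those ~16,600 feature vectors, not the raw million points, are what a windowed model such as Isolation Forest would actually learn from, which is why a million points is less data than it sounds.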
Models that include multiple series require an additional million data points for each additional series. So, a model that incorporates three series needs 3 million data points just to generate a base model. Some algorithms need even more data. A long short-term memory (LSTM) algorithm needs to learn the characteristics of raw data, which means 5 to 10 million data points during the learning phase.
A significant challenge with anomaly detection is the lack of a ground-truth baseline for the data. As a result, false positives and false negatives can occur. To mitigate these errors, the machine learning algorithms need more information about the data. That’s where labeling comes in.
Labeling data with machine learning
Labels are extra information attached to a dataset. They give algorithms more context about the data, which enables more effective analysis. One way labels help machine learning is by removing known anomalies from a dataset, which establishes a more truthful baseline for the data.
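As a rough illustration of that idea, here is a stdlib-Python sketch that excludes labeled anomalies before computing baseline statistics. The data points, injected outliers and label set are all invented for the example:

```python
import statistics

# Hypothetical dataset: (timestamp, value) points from one sensor.
points = [(t, 10.0) for t in range(100)]
points[17] = (17, 500.0)   # injected spike
points[63] = (63, -400.0)  # injected dropout

# Labels marking which timestamps were flagged as anomalous.
anomaly_labels = {17, 63}

# Drop labeled anomalies so the "normal" profile the model learns
# is not skewed by known outliers.
clean = [value for t, value in points if t not in anomaly_labels]

baseline_mean = statistics.fmean(clean)
baseline_std = statistics.pstdev(clean)
print(baseline_mean, baseline_std)  # 10.0 0.0
```

Without the labels, the two outliers would pull the baseline mean away from 10 and inflate the standard deviation, making genuine anomalies harder to separate from normal behavior.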
Labeling large datasets is a huge time commitment. It’s also a crucial aspect of training machine learning algorithms, so data scientists spend a lot of time labeling data. Upalgo automatically identifies anomalies, making it faster and easier for data scientists to find the things they need to label.
The platform’s labeling feature also allows users to manually apply several labels to a dataset. It then uses AI to examine the rest of the series and find similar patterns. This generates more labels, which means more information on the series and more accurate data models.
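A minimal sketch of this pattern-matching step, assuming a simple mean-centered Euclidean distance as the similarity measure (Upalgo’s actual matching algorithm is not public, so this is only an illustration of the concept):

```python
import math

def center(window):
    """Subtract the window mean so matching ignores level shifts."""
    m = sum(window) / len(window)
    return [x - m for x in window]

def distance(a, b):
    """Euclidean distance between two mean-centered windows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(center(a), center(b))))

# Synthetic series: flat signal with the same bump pattern in two places.
bump = [0.0, 2.0, 5.0, 2.0, 0.0]
series = [0.0] * 50
series[10:15] = bump
series[37:42] = bump

# The user manually labels the window starting at index 10.
labeled_start, width = 10, 5
template = series[labeled_start:labeled_start + width]

# Scan the rest of the series for windows shaped like the labeled one.
threshold = 0.5
matches = [
    i for i in range(len(series) - width + 1)
    if i != labeled_start
    and distance(series[i:i + width], template) < threshold
]
print(matches)  # the second bump, at index 37, is proposed as a new label
```

Each proposed match becomes a candidate label, so one manual annotation can seed many more across a long series.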
Making time series data count
Throughout the anomaly detection and labeling processes, Upalgo generates data visualizations for users, using InfluxQL to query data from InfluxDB because it can return large datasets quickly.
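A visualization query of this kind might look like the following InfluxQL, which downsamples a series for plotting. The measurement and field names here are placeholders, not names from the Upalgo platform:

```sql
-- One mean value per minute over the last 24 hours,
-- "engine_temp" and "value" are hypothetical names.
SELECT MEAN("value")
FROM "engine_temp"
WHERE time > now() - 24h
GROUP BY time(1m)
```

Downsampling in the database like this keeps the result set small enough to chart responsively, even when the underlying series holds millions of raw points.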
Ezako’s commitment to improving machine learning on time series data significantly accelerates key processes that ML data scientists face. InfluxDB provides the backend capabilities that allow the Ezako team to focus on data science, not infrastructure, and to provide the end-user experience that customers want.