A DevOps tutorial to set up intelligent, machine-learning-driven alerts
Last November at InfluxDays San Francisco, I was lucky enough to spend time with many InfluxData team experts. One of them was Nathaniel Cook, the lead developer of Kapacitor, who gave me some tips on writing TICKscripts. Now, I’d like to share them with you.
If you’re not yet familiar with machine learning applied to time series data, don’t worry. This short video will bring you up to speed (otherwise, feel free to skip it).
In this tutorial you will learn how to:
- detect anomalies using our favorite machine learning API: Loud ML!
- define a TICKscript in Kapacitor
- trigger alert notifications via Slack. If you prefer PagerDuty, you will only need to change a few lines in your settings (see our tutorial for those steps).
If you run an e-shop, we’ll show you how to detect anomalies (low traffic, high traffic, etc.) in online click patterns and trigger actions in response. Our first action is to send notifications to a Slack channel. You can change this last action to automate much more; we’ll cover that in our next tutorial. If you are interested, subscribe to our newsletter or follow us here on Medium!
This tutorial is organized into four parts:
- Basic setup: running the Docker containers
- Training your first time series model using Loud ML. Spoiler: no programming required
- Writing your first TICKscript to trigger intelligent notifications and/or automation
- Watching it all in action when the ML task detects abnormal data
If you need more guidance after taking this tutorial, you can also watch how we did it during the DevOps.com webinar in early 2019.
Part one: Basic setup. Run all five Docker containers in less than five minutes
We’re using the TICK-L stack throughout this tutorial: the InfluxData TICK stack (Telegraf, InfluxDB, Chronograf, and Kapacitor, four containers) augmented with the machine learning capabilities of Loud ML (one more container). To run all five containers at once, we’ll use the Docker Compose file available on GitHub.
First, clone the public Loud ML repository and change the directory to docker/. Then edit the notification settings and, if you have to, the database settings.
The kapacitor.conf file must always contain the following Slack-specific settings: workspace, channel, and url. For information about these settings, read more about incoming webhooks in the Slack documentation.
Here’s our very own sample which you’re free to use as a template:
All set. Now let’s start the containers.
Open the browser to the URL
The default database contains no data, so it’s not yet ready for machine learning! You can write your data stream using the HTTP API and the localhost:8086 endpoint or, if possible, point to an existing database in the docker-compose.yml file. In the latter scenario, stop and restart the Docker environment.
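If you go the HTTP API route, the sketch below shows roughly what a write could look like in Python. It assumes the InfluxDB 1.x /write endpoint and line protocol, the “telegraf” database and “metrics” measurement used later in this tutorial, and an integer field named value; the sample numbers and timestamps are made up for illustration.

```python
import urllib.request


def to_line_protocol(measurement, field, value, ts_ns):
    """Format one integer point in InfluxDB line protocol."""
    # The trailing "i" marks the field value as an integer;
    # the timestamp is expressed in nanoseconds.
    return f"{measurement} {field}={int(value)}i {ts_ns}"


def build_write_request(lines, host="localhost:8086", db="telegraf"):
    """Build (but do not send) a POST request for the /write endpoint."""
    body = "\n".join(lines).encode("utf-8")
    url = f"http://{host}/write?db={db}"
    return urllib.request.Request(url, data=body, method="POST")


# Three made-up click counts, 10 seconds apart (assumed sample data).
start_ns = 1_546_300_800 * 10**9
lines = [
    to_line_protocol("metrics", "value", clicks, start_ns + i * 10 * 10**9)
    for i, clicks in enumerate([12, 15, 9])
]
request = build_write_request(lines)

# Uncomment to actually send the points to a running InfluxDB container:
# urllib.request.urlopen(request)
```

The request is only built here, not sent, so you can inspect the payload before pointing it at your own database.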
Part two: Train a time series model using Loud ML
Your time series data could be clicks, hits, outside temperatures, or anything else you’re tracking. The more data history you have, the better for training deep neural networks.
We’ve covered how to do this in detail in a previous tutorial. Follow the steps and then come back here to continue.
Before we move on to writing the TICKscript, here are a few more snippets of information.
Throughout the rest of this tutorial, our model is named “telegraf_metrics_count_value_10s” and is configured to predict the data streamed to the “telegraf” database, in the “metrics” measurement, counting “values”. Let’s assume these values are clicks on a given web page.
One more thing: once your model training is completed, make sure you configure Loud ML to write its predictions and scores to a chosen location. In this tutorial, we’ll select the “kapacitor” sink, and then hit the “Save model” button.
Part three: Write a TICKscript to trigger intelligent notifications
TICKscript is used to define pipelines for processing data. Our pipeline will receive input data from Loud ML predictions, and then output notifications if alert conditions are met.
Click on “Manage Tasks” on the sidebar to create and edit the TICKscript.
The first section declares general variables such as model name, input measurement, output database, output retention policy, and output measurement.
Our second TICKscript section defines a data stream. It receives input from Loud ML and writes the data back into InfluxDB.
Our next TICKscript section sends a notification while the machine learning task is detecting an anomaly. Each data point is labelled as normal or abnormal using the is_anomaly boolean value. The first severity level is set to warning. We increase the severity to critical if five or more consecutive data points are detected as abnormal. In our sample model, that means a critical alert would be reported after 50 seconds (5 data points, 10 seconds apart). One more function defines the alert message and triggers the notification on Slack.
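To make the escalation rule concrete, here is a minimal Python sketch of the logic described above. It is not Kapacitor code: the function names are hypothetical, and the counter simply resets to 0 on normal points, which is a simplification rather than a reproduction of the stateCount node’s exact output.

```python
def consecutive_counts(flags):
    """Yield the running count of consecutive abnormal points (0 when normal)."""
    count = 0
    for flag in flags:
        count = count + 1 if flag else 0
        yield count


def severity(count, crit_threshold=5):
    """Map a consecutive-anomaly count to a severity level."""
    if count >= crit_threshold:
        return "critical"
    if count >= 1:
        return "warning"
    return "ok"


# Made-up is_anomaly values for eight consecutive data points.
is_anomaly = [False, True, True, True, True, True, True, False]
levels = [severity(c) for c in consecutive_counts(is_anomaly)]

# With one data point every 10 seconds, the fifth abnormal point
# corresponds to 5 * 10 = 50 seconds of abnormal data.
interval_s = 10
seconds_to_critical = 5 * interval_s

print(levels)
print(seconds_to_critical)
```

The first abnormal point raises a warning, the fifth in a row raises a critical alert, and the first normal point brings the level back to ok.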
Our last TICKscript section will send a notification when the situation returns to normal, as indicated by the machine learning task. Again, one more function will send this information to Slack.
Cool! Our final TICKscript now contains all we need (see the full script below if you’d like to copy and paste it). Let’s see it all in action!
Part four: Watch it all in action when the ML task detects abnormal data
Loud ML spots abnormal data automatically. This graph shows an alert notification in the GUI. Data visualization helps! In this tutorial, we decided to edit the default settings to use TICKscript alerting capabilities. So in addition to inserting annotations in the GUI, we are now sending alert notifications. Here’s an example:
The messages have been sent to our Slack channel for warnings and alarms. We used the stateCount node, which works well for counting abnormal data points, with yellow for warnings, red for critical, and green for normal (in our example, the data has dropped back to a lower score).
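If you’re curious how such colors can end up in a Slack message, the sketch below builds an incoming-webhook payload by hand in Python. It is an illustration only, not what Kapacitor sends internally: the channel name, message text and helper name are hypothetical. The good, warning and danger values are Slack’s named attachment colors, and the channel override may be ignored by newer app-based webhooks.

```python
import json

# Slack's named attachment colors: green, yellow, and red respectively.
LEVEL_COLORS = {"ok": "good", "warning": "warning", "critical": "danger"}


def build_slack_payload(level, message, channel="#alerts"):
    """Build the JSON body for a Slack incoming webhook (hypothetical helper)."""
    return {
        "channel": channel,  # assumed channel name
        "attachments": [
            {
                "fallback": message,
                "text": message,
                "color": LEVEL_COLORS[level],
            }
        ],
    }


payload = build_slack_payload("critical", "Abnormal click pattern detected")
body = json.dumps(payload)
print(body)
```

Posting this body to your webhook url with a Content-Type of application/json would produce a red-barred message in the channel.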
One alternative is to replace the stateCount node with the changeDetect node if you’re only interested in the beginning and end of abnormal events. TICKscript’s flexibility is excellent for fine-tuning the settings according to your needs.
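The idea behind change detection can be sketched in a few lines of Python. This is a simplified illustration with made-up data points, not the Kapacitor implementation; in particular, we assume here that the very first point is always emitted because there is nothing to compare it with.

```python
def change_detect(points, field):
    """Yield only the points whose field value differs from the previous point."""
    previous = object()  # sentinel that never equals a real value
    for point in points:
        if point[field] != previous:
            yield point
        previous = point[field]


# Made-up data points, 10 seconds apart.
points = [
    {"t": 0, "is_anomaly": False},
    {"t": 10, "is_anomaly": False},
    {"t": 20, "is_anomaly": True},
    {"t": 30, "is_anomaly": True},
    {"t": 40, "is_anomaly": True},
    {"t": 50, "is_anomaly": False},
    {"t": 60, "is_anomaly": False},
]
changes = list(change_detect(points, "is_anomaly"))
print([p["t"] for p in changes])
```

Only the start of the abnormal event (t=20) and its end (t=50) survive, plus the initial point, so you would get one notification per transition instead of one per data point.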
What we’ve learned
We’ve learned how to use TICKscript and how to configure Slack to trigger notifications on a given channel and workspace. If you would like to change the settings in the kapacitor.conf file to use PagerDuty, take a look at the excellent documentation on the InfluxData website, as well as the PagerDuty integration guide.
It’s now possible to generalize what we’ve seen in this tutorial and create a design pattern. You will find a generic TICKscript template available on GitHub using the link below. Values assigned to variables are no longer defined in the template, but left to the user. TICKscript templates give you the freedom to assign values (model name, database name, measurement name, etc.) without changing the core script design. Find more information about Kapacitor template tasks in the InfluxData documentation.
A generic TICKscript template is available on GitHub.
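As a rough illustration of assigning values to a template, the Python sketch below generates a JSON variables file in the type/value shape that Kapacitor template tasks accept. The variable names and the output values are hypothetical and will not match the template on GitHub; only the model name and the “metrics” measurement come from this tutorial.

```python
import json


def string_var(value):
    """Wrap a value in the type/value shape used by template variable files."""
    return {"type": "string", "value": value}


# Hypothetical variable names; check the template for the real ones.
template_vars = {
    "model": string_var("telegraf_metrics_count_value_10s"),
    "in_measurement": string_var("metrics"),
    "out_db": string_var("loudml"),            # assumed output database
    "out_rp": string_var("autogen"),           # assumed retention policy
    "out_measurement": string_var("anomalies"),  # assumed output measurement
}

vars_json = json.dumps(template_vars, indent=2)
print(vars_json)
```

Writing vars_json to a file gives you one small file per model, while the template itself stays untouched.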
And don’t forget, if you need more guidance, you can also watch how we did this tutorial live during the DevOps.com webinar in early 2019.
What’s next: AI-driven automation!
We’ve now highlighted Loud ML’s capabilities to learn normal and abnormal data trends, and then how it can act on abnormal data streams using TICKscripts.
One more step, and you can achieve complete AI-driven automation in your DevOps environment! Our next tutorials will explain how you can act and automate decisions, through three new case studies using Loud ML and TICKscripts.
Webinar: Join the DevOps.com webinar on January 16, 2019, when I’ll be showing you how to perform all the steps above live.