5. Intro to Kapacitor for Alerting and Anomaly Detection

Recorded: March 2017
In this session you’ll get a detailed overview of Kapacitor, InfluxDB’s native data processing engine. We’ll cover how to install, configure and build custom TICKscripts to enable alerting and anomaly detection.

Watch the Webinar

Watch the webinar “Intro to Kapacitor for Alerting and Anomaly Detection” by clicking on the download button on the right. This will open the recording.

Transcript

Here is an unedited transcript of the webinar “Intro to Kapacitor for Alerting and Anomaly Detection.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.

Speakers:
• Chris Churilo: Director Product Marketing, InfluxData
• Jack Zampolin: Developer Evangelist, InfluxData

Jack Zampolin 00:01.899 Thank you very much, Chris. Good morning, everyone. Today we’re going to be doing an introduction to Kapacitor. The slides are over in the Kapacitor webinar section and I’ll be running through those. My name’s Jack Zampolin. I’m a developer evangelist and sales engineer over here at InfluxData. I’m also running our community site, community.influxdata.com. If you have any usage questions, monitoring, time series related stuff, I would encourage you to drop by and ask a couple of those. Anyway. Let’s go ahead and get started. So goals for this training. I want to explain sort of how Kapacitor works under the covers: what it’s doing computationally when it’s generating these alerts and crunching these data streams. We want to review TICKscript syntax. So as many of you know, if you’ve done any work with Kapacitor at all, you have to write your alerts in this custom DSL called TICKscript. It can be a little bit hairy for people, but we’re going to walk through the basics and hopefully get you a better understanding of that. We will also run a Kapacitor instance and then create a TICKscript and run it on that. And then, at the end of the class, we’ll talk briefly about user-defined functions, which is one of the more exciting features of Kapacitor; it allows users to bring their own functions to do computation on some of these data streams. So machine learning algorithms, that kind of thing.

Jack Zampolin 01:49.046 Okay. Kapacitor is a time-series data processing utility. The title on this slide is slightly off. It’s a time-series data processing utility. It allows you to process data in streams or batches. So if you’re familiar with other similar tools, Kapacitor does allow you to act on data that’s streaming into the platform and it also has some tools to go out and grab some data. So kind of ETL-related tasks also work in Kapacitor. And then, these streams and batches of data, Kapacitor allows you to trigger alerts on those based on dynamic criteria. So let’s say you want to alert if a certain stream gives you a value that’s two or three standard deviations away from the moving average for the last 100 values. Those kinds of expressions you can easily write in TICKscript. TICKscript also uses InfluxQL to help you manipulate this data, so the same language that you’re familiar with from the Influx query language, that very SQL-like language, is also what you write in Kapacitor.
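For reference, an expression along those lines can be written with TICKscript’s built-in sigma() function, which tracks how many standard deviations the current point sits from the running mean. A minimal sketch, assuming a measurement called 'requests' with a numeric field called 'value' (neither is from the slides):

stream
    |from()
        .measurement('requests')
    |alert()
        // Trigger a critical alert when the point is more than
        // three standard deviations from the running mean.
        .crit(lambda: sigma("value") > 3.0)
        .log('/tmp/requests_sigma.log')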

Jack Zampolin 03:12.983 The primary way Kapacitor gets data into itself is through subscribing to data incoming to an Influx instance. So the way that works is when you start up your Kapacitor, in the configuration file, you would point it at an InfluxDB instance. And right when the Kapacitor first starts up, it’s going to reach out to that Influx instance and say, “Hey. I’m a Kapacitor. I need to subscribe to the points, the data that’s coming into you.” And Influx will ship any incoming data in real-time over to that Kapacitor instance to allow for alerting. And then also, Kapacitor can write data back into Influx. So many of you might be familiar with continuous queries if your database becomes too large and continuous queries are generating a load on your system. One way to offload that would be through use of Kapacitor. So those are the primary uses of Kapacitor.
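For reference, that subscription setup lives in kapacitor.conf. A minimal sketch, assuming InfluxDB is running locally on the default port:

[[influxdb]]
  # Kapacitor reaches out to this instance at startup and
  # creates a subscription so incoming writes are mirrored to it.
  enabled = true
  name = "localhost"
  default = true
  urls = ["http://localhost:8086"]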

Jack Zampolin 04:17.384 Oh, something I didn’t mention earlier. If you have any questions, please drop them in the chat or Q&A section, and I’ll be happy to answer those. So these are looking back at sort of what Kapacitor does, these are the key capabilities, so alerting and processing or ETL. So, obviously alerting, taking proactive steps based on metrics coming out of your system, auto scaling, alerting team members, sending messages to other applications based on data flows. Those kinds of tasks are what Kapacitor is designed for, but because all of this tooling can also be used to write data back into the database, it does excel at processing ETL jobs too. Written in Go, it’s very effective on multi-core systems and actually can handle quite a bit of load. So if you’re looking at our full stack of products, where does Kapacitor fit? You might be familiar with things like Bosun or Grafana alerting. There’s a number of different products out there. So if you’re going to put the full stack together, how does it look?

Jack Zampolin 05:49.861 At the center, you have InfluxDB which is the data store. You’ve got data flowing in from Telegraf, and then Kapacitor kind of sits there off to the side in a little loop receiving that data that’s coming in from Telegraf, potentially writing back to the database but also kicking off alerts. And then the last piece of the stack is Chronograf. If you’re not familiar, it’s a product that we currently have in beta, and there is a user interface for writing Kapacitor alerts. So that allows you to set error bars on a visual graph, and you don’t really have to write TICKscript. That’s generated for you. So if you’re having trouble with TICKscript, I would highly suggest you check out Chronograf. So design of the program itself, there’s two pieces to it. There’s a Daemon and then a command line interface that drives the API on the Daemon. So each of the Kapacitor CLI task calls has a corresponding API endpoint and the CLI just interprets those responses and makes them more human-readable. The Daemon is where most of the work gets done. It’s got the configuration for subscriptions: which Influx instances, which services are being reported to, PagerDuty, VictorOps, maybe we’re going to send post requests based on this data, or other outputs to write back to Influx.

Jack Zampolin 07:32.574 All of that’s configured on the Daemon. The TICKscripts are held in the Daemon store to allow it to alert on that data, and you use the CLI to make any changes to the state of the Kapacitor Daemon. The big concepts within Kapacitor itself are tasks, so a unit of work, that’s what you—that’s what you define when you write a TICKscript. There’s a couple of different types that we talked about early—earlier, streaming or batch. So, as I said, streaming will work on the stream of data that’s coming in, maybe you’re windowing it into minute segments and doing some sort of calculation there. That would be a streaming task. And then a batch task, it’s going to query the database, pull the data back, work on it, and then perform an action. TICKscripts, which we’ll talk about in a second, are an example of a directed acyclic graph, and that’s how the data flows through these computational nodes. Another important feature of Kapacitor is recordings. So this allows you to record a stream for a certain period of time and replay it back into Kapacitor to test tasks. So let’s say you have a certain data set that has an anomaly in it and you’re trying to test a function that you’ve written to see if it picks up that anomaly—that’s a great use-case—or just some normal usage data and you want to make sure that your thresholds aren’t being triggered so frequently that your team’s getting alert fatigue.

Jack Zampolin 09:28.935 Obviously, having tons of alerts on PagerDuty makes people not want to check their—check their phones. Any questions or comments so far? Okay. So the computational model on Kapacitor is a directed acyclic graph and this works for both streams and batches. Now, if you look at the image, at the top, that’s where your data’s coming in, and then each statement in the TICKscript creates a new node where that data will flow and then potentially branch off into other nodes based on different criteria, and then at the end, we kick out alerts. There’s a result of the computation. And it’s important to sort of understand logically that there’s data flowing through nodes and then some sort of result with Kapacitor because it really—really helps debugging. Kapacitor’s show command, if you’re looking for details on a task, gives you quite a bit of information. It, in fact, gives you a little representation of this directed acyclic graph and how much data has flowed to each part of it. That can be extremely helpful especially with some larger TICKscripts, some more complicated jobs.

Jack Zampolin 10:58.643 If you have a flipped greater than or equal to sign that will become immediately apparent by the amount of data that’s coming through different nodes. So important to understand, you don’t have to understand it too deeply, but just remember that the computational model is data flowing through nodes. So the primary difference here between batch and streaming is up at the top. Obviously, you got batch and stream. In the batch tasks—oh, let me back up here. These are examples of TICKscript. So very, very simple examples of TICKscript. It was designed with sort of a JavaScript-esque method chaining in mind. So in that batch task, we’re going to pipe to a query for the data we’re going to work on in a certain way, and it’s very similar with streams. So in the batch task you—so in the TICKscript syntax here, let’s dive into it a little bit more deeply, you can declare variables. Obviously, we’re calling requests our measurement. And then, you can use that down in your TICKscript. We’re going to first say, “This is our data coming from this stream.” So if you think about all the data coming into Influx, maybe there’s multiple databases, multiple retention policies, a lot of different measurements coming in. What we want to do with this particular TICKscript is take a piece of that stream and kind of peel it off and work on it individually.

Jack Zampolin 12:48.575 So we’re going to say from the requests measurement. And then, we’re going to write a where clause down here where "is_up" equals true and value is greater than 500. So in that measurement, this tells me there’s a field called value, and then there’s either a field or a tag that’s a Boolean. Now, you’ll notice that this does look very similar to InfluxQL. So select measurement from— select stuff from this measurement where, and then we give some conditions there. And then, finally, in this stream task, we’re going to window. So we’re going to store up in memory five minutes of data and do these computations on them. So we’ve stored this data stream, these windows, in this data variable. And then, down here, we’re going to do two different things with it. So if we think about that directed acyclic graph model, this node is going to split off into two nodes. One of the nodes is going to count the number of points that match our criteria. And then, the other one is going to take the average of them.
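Pieced together, the stream task being walked through here would look roughly like the following sketch. The field and tag names follow the walk-through; the window timing and the exact field passed to count() and mean() are assumptions:

var measurement = 'requests'

var data = stream
    |from()
        .measurement(measurement)
        .where(lambda: "is_up" == TRUE AND "value" > 500)
    |window()
        .period(5m)
        .every(5m)

// One branch counts the points that matched the criteria...
data
    |count('value')
    |httpOut('request_count')

// ...and the other takes their average.
data
    |mean('value')
    |httpOut('request_mean')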

Jack Zampolin 14:10.876 So we care how many are coming through with this Boolean check, and then we also care what the average value is of these high-value points here. And this TICKscript contains the examples of all of the different literals that Kapacitor supports: Booleans, numbers, int64 or float64, strings. And then also time durations, nanoseconds, milliseconds, seconds, minutes, and hours. We also have statements. So from variables, data measurements, and comments, you can see the JavaScript-esque double slashes. So that was just a stream node. Now, if we’re looking at a batch node, it looks very similar, but just a little bit different. Instead of peeling off a piece of this stream of data that’s coming into Kapacitor, we want to select a very specific piece of data from an existing database. So with that, we’re going to write a query. And this is just InfluxQL right here. So we’ve got select, mean, value from this CPU usage measurement, and then we’re adding one where clause. Once that data comes in—oh, sorry, and then down here, this period directive tells you how often to run, or how much data to pull back, so that would be where time is greater than now minus one minute. You could also add that in the query. And then, every 20 seconds tells you how often to run this query.

Jack Zampolin 16:04.585 So we’re pulling one minute of data every 20 seconds. We’re going to group that into 10-second buckets and also by CPU. So we want to alert if any of the individual CPUs on a machine is going over a certain threshold or maybe we’re doing a moving average kind of thing. So this is a much simpler directed acyclic graph, but it is still—there’s a batch node. It’s the initial node. And then we’re doing stuff beyond that. And each one of these is an additional node in that directed acyclic graph. Any questions?
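A sketch of that batch task, following the description above (the database, measurement, field, and threshold are assumptions rather than taken from the slide):

batch
    |query('SELECT mean("usage_user") FROM "telegraf"."autogen"."cpu"')
        // Pull back one minute of data, every 20 seconds.
        .period(1m)
        .every(20s)
        // Group into 10-second buckets, and also by the cpu tag.
        .groupBy(time(10s), 'cpu')
    |alert()
        .crit(lambda: "mean" > 90)
        .log('/tmp/cpu_batch_alerts.log')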

[silence]

Jack Zampolin 17:10.525 Okay. So we’ve been over one stream task before, but let’s just walk through it one more time. So the first node is stream. And from that stream, we just want stuff from the database mydb, in the retention policy myrp, called mymeasurement. So we’re peeling off a piece of that stream coming from the database. And then we’re applying a filter here. And then finally, we’re going to window the data after that. And that’s a common pattern for stream nodes in stream tasks. You’re going to peel off a piece of the data. And then you’re going to do some sort of computation on it. So looking more deeply into this stream task here, the window node holds a certain amount of data in memory. And then, once that period of time is elapsed, performs work on it and then outputs results. So it sort of takes that stream and turns it into quantizable units of work that act more like batches and are easy to reason about. So the period specifies the amount of data that will be given to the pipeline. So that 15 minutes there, we’re going to be holding 15 minutes of data in memory to perform operations on. The “every” specifies the frequency at which the data will be yielded to the pipeline. So that’s how often we’re performing computations.
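In TICKscript, that common stream-task pattern, peel off a piece of the stream, filter it, then window it, looks roughly like this sketch (the where() filter and the trailing mean() are assumptions added to make it self-contained):

stream
    |from()
        .database('mydb')
        .retentionPolicy('myrp')
        .measurement('mymeasurement')
        .where(lambda: "host" == 'server001')
    |window()
        // Hold 15 minutes of data in memory...
        .period(15m)
        // ...and hand it to the rest of the pipeline every 5 minutes.
        .every(5m)
    |mean('value')
    |httpOut('windowed_mean')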

Jack Zampolin 18:45.810 So every five minutes, we’re going to be performing a computation on 15 minutes of data. So that’s maybe our longer moving average that we want calculated periodically. Another important part of Kapacitor is understanding lambda expressions. You’ll normally see these in a WHERE clause. Anything that you need to evaluate, Boolean expressions, built-in functions, those are going to need to go in a lambda block. So in WHERE expressions, the first example would be, “where host is equal to X.” And that needs to get checked periodically and evaluated. And that piece of code there just gets passed directly down. And that’s what happens there. Built-in functions, much the same. So when we get all these values in, we’re going to take the square root of it. And if that’s greater than five, then we’re going to allow it through. So it’s just another WHERE condition. And that list of built-in functions is in the documentation for Kapacitor if you’re interested.
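Both lambda examples just mentioned, sketched into a small self-contained task (the measurement and field names are assumptions):

stream
    |from()
        .measurement('requests')
        // Lambda in a where clause: only pass points from one host.
        .where(lambda: "host" == 'serverA')
    // Lambda using a built-in function: only pass points whose
    // square root is greater than five.
    |where(lambda: sqrt("value") > 5.0)
    |httpOut('filtered')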

Jack Zampolin 20:06.390 So here in the stream, you would use the WHERE node to filter according to a lambda expression. This is an excellent example of that. Maybe you don’t want the data for each CPU, you just want the roll-up for each machine. That’s more helpful. And then finally, sort of the business end of Kapacitor is the alert node. And that alert node is what you pass your data to after you’ve decided the amount of data that you want. So down in the alert, we decide what level is this alert? What’s the message? And where am I sending it? So we’ve got a number of different services that are available to send alerts to. They’re listed down there. There’s also sort of the generic ones like the HTTP, which allows you to just send that message, something like “CPU usage is at this level,” to an arbitrary endpoint. And that’ll go out in JSON. So building servers that accept JSON, not the end of the world. There’s many different things you can do with this. Recently, we’ve implemented autoscaling for services on top of Docker Swarm, using that post functionality. So it’s quite flexible. And then if we go through this TICKscript line by line, we’re peeling off our stream of data. We’re restricting it even more with a WHERE clause.

Jack Zampolin 21:46.625 And then under the alert here, first we’re setting the message. If you’ll notice here, this is some special syntax. That’s Go templating. And in these alerts, there’s certain predefined values that you can pass into these templates. There’s more information about that in the Kapacitor documentation. But this level is info, warn, or crit. So CPU usage is maybe warn if it’s greater than 70. If it’s 75, then that would be a warn level and that would be the alert that gets sent out. So there’s more information that’s available. In these templated functions, you can write very customized alerts. So I would encourage you to check out the documentation further on that. And then finally, we’re going to post this to a Slack channel. So imagine you have a group of SREs watching Slack channels to see if there’s any issues with your infrastructure. Many teams are set up this way.
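Putting those pieces together, an alert node like the one being described might look roughly like this sketch (the thresholds, field name, and Slack channel are assumptions; .Level and .Tags are standard fields available to the Go template):

stream
    |from()
        .measurement('cpu')
        .where(lambda: "cpu" == 'cpu-total')
    |alert()
        // The message is a Go template; Kapacitor fills in the level.
        .message('CPU usage on {{ index .Tags "host" }} is {{ .Level }}')
        .info(lambda: "usage_user" > 60)
        .warn(lambda: "usage_user" > 70)
        .crit(lambda: "usage_user" > 85)
        // Post the rendered message to a Slack channel.
        .slack()
        .channel('#ops-alerts')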

Jack Zampolin 23:01.847 One other thing that Kapacitor does allow you to do is join data. One of the limitations of InfluxQL is that it doesn’t allow you to perform cross-measurement joins. Kapacitor does allow for that. So one of the excellent use-cases for this would be a continuous query. So maybe you’re looking to join two measurements, perform some computation, and then write back to the database, sort of query pre-optimization. This would allow you to do that. So I’m going to walk through this TICKscript line by line and talk about how a join is performed. So we’re going to save two streams in variables here. Var errors is a stream from the measurement errors. And requests is a stream from the measurement requests. So we’ve taken two little slices of this data stream coming into Kapacitor. We’re going to say, “join requests to errors.” And then “as” specifies the measurement names, moving forward in the TICKscript. Otherwise, Kapacitor will call them things like errors_measurements, errors_requests. The naming becomes confusing. Make sure you alias with that “as” after you join.

Jack Zampolin 24:30.124 And then we’re going to evaluate a function. We got a lambda expression in there. And then that’s errors.value, so that’s a field on the errors measurement. That’s how you reference that field, and the same thing over in requests. And that’s an error rate. So once that’s evaluated we’re going to call that an error rate. And that’s what a join looks like in Kapacitor. Obviously, they can get much more confusing, as you’ve probably seen joins becoming confusing in other languages. So this is a great time to take a little second and pause. The exercise here is going to be to write a simple TICKscript that streams data from the measurement mem, and issues a critical alert if the amount of free memory is less than 500 MB. I’m going to pause and take a second to let you do that. But before you do that, there is one question from [inaudible]. Are there any options to monitor over a certain amount of time and then alert once that amount is steady at that level? So I don’t quite understand that question. Are you asking for a window of data and then you’re looking for a target level to reach and once that level is reached, alert? If so, yeah, you’re definitely able to express that with Kapacitor TICKscript. Average CPU for 10 seconds and then alert. Yes, that’s absolutely supported. That’s kind of the normal case for a Kapacitor TICKscript. No problem, thank you.
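For reference, the join walk-through from a moment ago, written out as a sketch. The measurement and field names follow the description; the write back to InfluxDB at the end is an assumption illustrating the continuous-query use case:

var errors = stream
    |from()
        .measurement('errors')

var requests = stream
    |from()
        .measurement('requests')

// Join the two streams and alias them so the fields can be
// referenced as "errors.value" and "requests.value".
errors
    |join(requests)
        .as('errors', 'requests')
    // Compute the error rate and name the resulting field.
    |eval(lambda: "errors.value" / "requests.value")
        .as('error_rate')
    |influxDBOut()
        .database('mydb')
        .measurement('error_rates')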

Jack Zampolin 26:27.266 So we’re going to pause for a second and take a minute to allow you to do this exercise. And again, that’s write a simple TICKscript that streams data from the measurement mem that issues a critical alert if there’s less than 500 MB of free memory.

Jack Zampolin 26:42.418 [silence]

Chris Churilo 28:18.753 Okay, before we look at how to actually do the exercise, we have a question, and the question, Jack, is, can you query multiple measurements with conditions like, “sensor1_value is greater than X” and “sensor2_value is greater than Y”?

Jack Zampolin 28:43.346 Yes, I believe so. I mean, it depends on the shape of your data. So normally if you’ve got two different types of sensors those are either going to be modeled as a different measurement or if they’re modeled as the same measurement they’re going to be separated by tags. And using those WHERE filters to separate those tags is definitely doable. Now, that join that we just showed earlier, maybe if the sensor data’s in two separate measurements you peel off each of those little streams that you want, set some conditions for them, join those streams and then alert on those more complex conditions. So Kapacitor does absolutely support those types of things. Okay. Let’s go to the completion of the exercise here. So we would have a stream from the measurement mem, so we’re peeling off that stream and then our alert condition is going to be critical, lambda expression, free is less than 500 million. It looks like we’re missing a zero there. And then we’re going to log that to an alert file.
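Written out, that exercise solution is only a few lines; a sketch, with 500 MB expressed in bytes and an assumed log path:

stream
    |from()
        .measurement('mem')
    |alert()
        // Critical if free memory drops below 500 MB (in bytes).
        .crit(lambda: "free" < 500000000)
        .log('/tmp/mem_alerts.log')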

Jack Zampolin 30:13.846 That’s one of the other outputs of Kapacitor, you can log these alerts to a file and take any kind of programmatic action you want on them there. Each one of the alerts is just going to be a line of JSON. Does that make sense for everybody? Excellent.

Jack Zampolin 30:36.062 Okay, so this section is going to be on using the Kapacitor CLI, so how do you interact with Kapacitor? So defining a task, let’s say you have a file called cpu_alert.tick that contains the below TICKscript there. You would define that task, so add it to the Kapacitor daemon, with the following command, “kapacitor define” and then the name of the task. This is the unique identifier that you will reference it with in Kapacitor itself. Pardon me. That’s the unique identifier that you will reference it with in Kapacitor itself. So if you want to show how that task is running you would run, “kapacitor show cpu_alert.” So important that you name it something consistent. Normally, folks name the file the same thing as the alert, it helps for management.

Jack Zampolin 31:50.726 So, you’ll also have to tell Kapacitor what type of task this is. This is a stream task. You have to pass it the TICKscript. So this is the path to the file that we want to run. And then finally the last argument to kapacitor define that you need is dbrp. So which database retention policy combo does this particular TICKscript work on? And then once you have defined that task if you run, “kapacitor list tasks,” this is what you would see here. Name, which is the unique ID which you’re going to use to reference that task. The type of task. Whether it’s enabled, so this one’s not currently enabled or executing. You do have to take one more step in order to do that and then which data is this working on? And that’s telegraf.autogen.
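The full define command being described, as it would be typed at the shell (the file path is an assumption):

kapacitor define cpu_alert \
    -type stream \
    -tick ./cpu_alert.tick \
    -dbrp telegraf.autogen

# Then list the defined tasks to confirm it was added.
kapacitor list tasks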

Jack Zampolin 32:51.803 So I mentioned earlier, that’s the unique identifier that you use to show information about that task. Here’s that command, “kapacitor show cpu_alert.” And this is what the output looks like here. It gives you the name, any errors that are coming from it, the same information kapacitor list tasks gives us, the full TICKscript. And then down in the bottom, this is the output that shows the directed acyclic graph. So in our TICKscript here, the first node is a stream. And then we pipe that stream into a “from.” And then the “from,” we pipe into alert. And that’s what the directed acyclic graph looks like. So obviously, if you have joins, many more piping operations, there’s going to be more nodes in that. And once you’ve gone ahead and enabled that CPU alert task, it will give you more information about how much data is flowing through each of those nodes. And once you enable it, you should see enabled and executing both equal true.

Jack Zampolin 34:10.508 One of the other things that you might want to do with Kapacitor is record and replay a stream to test your task. So that would be, testing our CPU alert task. You would do that with the CLI, as well. You would say, “kapacitor record stream.” And then you give it a task. So that defines what data you’re recording, how long you want to record it. So that’s the full command there. And then once you run that, it’s going to give you back this long UUID. That is a recording ID. And then when you list recordings, you should see that pop up. And then in order to replay that recording, you need the recording ID and then the task name. And then you would just tell Kapacitor to replay that recording with a task CPU alert.
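Those record-and-replay steps, as CLI commands; the duration is arbitrary and the recording ID is a placeholder for the UUID the record command prints:

# Record 60 seconds of the data the task subscribes to.
kapacitor record stream -task cpu_alert -duration 60s

# List recordings to see the recording ID that was returned.
kapacitor list recordings

# Replay the recording against the task.
kapacitor replay -recording <recording-id> -task cpu_alert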

Jack Zampolin 35:17.701 So the exercise here is to define a task off of the following TICKscript.

[silence]

Jack Zampolin 36:19.040 [inaudible] asks, “Can I record for a given time period, starting at 11:30 and going for an hour?” Sounds like you have some irregular activity happening around midnight. Yeah. I think you might actually have to stay up and trigger that task at that time. There’s no way to run that. Well, I guess you could set Kapacitor on a cron because this is just a command line operation. So, yeah. Absolutely. Awesome. Cool. Okay. So next, how do we define that task? And this would be mem alert. It’s a stream type. We’ve got that TICKscript there that we didn’t necessarily name but if we’re calling it mem alert, we would probably call the file memalert.tick and then our dbrp is going to be telegraf.autogen. Finally, we’re going to do a brief overview of user-defined functions in Kapacitor. UDFs allow you to write your own functions or algorithms and plug them into Kapacitor. You can also build your custom function that runs in its own process and Kapacitor will deliver data to that via standard in or standard out, or another defined protocol, maybe over a UNIX socket.

Jack Zampolin 38:01.469 That allows you to use any programming language that supports encoding and decoding protocol buffers, so there’s protobuf libraries for all of the major programming languages. I think Go and Python are probably the most supported but we’re happy to add other languages. And then we’re going to have an example here in Python and this is just a very simple UDF example. Given that you have a UDF that computes a moving average—so we were talking about a moving average earlier. Within that user-defined function, there’s an API for defining what your options are and then with those options, they can be invoked in a TICKscript. So that moving average has a property called field, a property called size, and then we’re aliasing the data coming out of it there. And this is how you would reference that user-defined function within Kapacitor. There’s some more work to be done to obviously write your user-defined function. And then once it’s written, there is one step to tell Kapacitor about that and where it is.
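For reference, invoking a UDF like that from a TICKscript looks roughly like the sketch below; the field, size, and as options come from the example being described, while the UDF name movingAvg and the surrounding stream are assumptions. The registration step he mentions next is a configuration change, covered in a moment.

stream
    |from()
        .measurement('cpu')
    // '@' invokes a user-defined function by its registered name.
    @movingAvg()
        .field('usage_user')
        .size(100)
        .as('avg')
    |alert()
        .crit(lambda: "avg" > 90)
        .log('/tmp/udf_alerts.log')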

Jack Zampolin 39:26.980 And that’s essentially rewriting some config; it’s not an API operation, so there is a bit more involved there. If you want to learn more about this, there’s plenty of documentation on our website, docs.influxdata.com, in Kapacitor. So there’s a great getting-started guide. There’s a syntax guide in here, and then there are also some examples within the github.com/influxdata/kapacitor repo. There’s also an example folder with plenty of other examples there. So finally, we’re going to get to questions. There’s some subtext involved there.
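Before the questions, one note on that configuration step: registering a UDF is done in kapacitor.conf. A sketch for a process-based Python UDF, with the function name, interpreter, and script paths all assumptions:

[udf]
[udf.functions]
    [udf.functions.movingAvg]
        # Kapacitor launches this process and exchanges
        # protocol-buffer messages with it over stdin/stdout.
        prog = "/usr/bin/python2"
        args = ["-u", "/path/to/moving_avg.py"]
        timeout = "10s"
        [udf.functions.movingAvg.env]
            PYTHONPATH = "/path/to/kapacitor/udf/agent/py"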

Jack Zampolin 40:11.636 [inaudible] has one more, “Do I have to manually deploy the UDFs? I’m guessing yes.” The answer is yes. Obviously, if you’re using some sort of automation tooling like a Chef, or a Puppet, or an Ansible, many folks use that to deploy UDFs, but you do have to get them on to the machine somehow. Yeah. I have seen folks run UDFs containerized and then communicate over UNIX sockets on the same host. That’s also an option that opens you up to other levels of automation as well. But again, depends on your deployment environment. Any other questions? I’m also happy to answer questions on any of our other products at this time and if you have questions after this, please go to community.influxdata.com. Sign up and ask some questions. We’ll get those answered as soon as we can and there’s also a number of community members on there who are just answering questions. So if I or another InfluxData employee doesn’t get to it, somebody else might.
