Recorded: May 2017
In this video, we will show you how to get started with Kapacitor in InfluxCloud.
Here is an unedited transcript of the webinar “Kapacitor on InfluxCloud.” This is provided for those who prefer reading to watching the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
• Jack Zampolin: Developer Evangelist, InfluxData
Jack Zampolin 00:03.131 Hello. My name is Jack Zampolin, and I’m the Developer Evangelist over here at InfluxData. And today, we’re going to be giving an introduction to Kapacitor on InfluxCloud. By the end of this section, you’ll be able to describe what Kapacitor is and why you would want to use it. You’ll be able to work with an InfluxCloud Kapacitor instance, talk through briefly the Kapacitor computational model, and then understand how to work with Kapacitor itself, write the TICKscripts that you need to use to interact with Kapacitor, and use those scripts to process and alert on data. So the way that I like to think of Kapacitor at a high level is like a pipe. So data comes in one side, and there’s some gears on the inside that mash up that data based on certain rules, and then out the back comes transformed data or alerts. So if you’re familiar with other stream processing frameworks, this is kind of similar from a mental model standpoint. So Kapacitor is a time series data processing utility. It processes data in batches or streams. So in the streaming case, data is coming in continually at current time and then being transformed. And then in the batching use case, you’re querying a bulk of data from InfluxDB, transforming that somehow and then performing actions based on that. You can trigger alerts based on dynamic criteria, transform data via InfluxQL functions. There’s also some additional things that Kapacitor offers that InfluxDB does not, such as joins. And then you can also write that data back into Influx. One of the cool features of Kapacitor is that if you point it at an InfluxDB instance when you start it up, it can automatically subscribe to that data in InfluxDB, and it’ll be forked over to Kapacitor as it gets written to the database.
Jack Zampolin 02:23.161 So if we look at an architecture diagram of the full stack, this is where Kapacitor fits in, right over here. And you can see that the data is being written via subscription to Kapacitor, and then Kapacitor can also write data back to InfluxDB. In addition, it can also kick out alerts, and you can drive that through Chronograf. Chronograf does have a UI to allow you to write TICKscripts without having to write them all out by hand. So in order to get started, we’re going to need to download the Kapacitor CLI, spin up a cloud instance, and then we’re going to need to connect the CLI to the cloud instance. So on OS X, it’s a brew install. And then, obviously, on Debian and Ubuntu or RedHat, you’re going to have a wget and then a dpkg or yum. So I’m going to pause here for a minute to let you go ahead and get that. Next, we’re going to want to create a new Kapacitor instance in our InfluxCloud account. So go ahead and log in to your InfluxCloud account at cloud.influxdata.com, and buy a new Kapacitor node. I believe you need an InfluxDB cluster to connect the Kapacitor node to. So be sure you have that first. Next, we need to connect the Kapacitor CLI. So the Kapacitor CLI, when you’re running it, has a -url flag. That’s just looking for a connection string for Kapacitor, and you can actually set that by default by setting an environment variable, KAPACITOR_URL, with the full connection string to your Kapacitor instance, including the DB username and password. So the easiest way to do that is just to say export KAPACITOR_URL and provide your credentials there, and then the Kapacitor command line tooling will work as intended. And if you run kapacitor stats ingress, you should see some data underneath that. And I’ll pause for a minute to let you give that a try. And just to clarify, that DB user and DB password is the username and password for your InfluxCloud InfluxDB instance. That’s a lot of Influxes.
Jack Zampolin 05:05.947 Now that you’ve got set up, let’s take a second to talk about the Kapacitor computational model and how Kapacitor works under the hood. So the important objects that you’re going to see in the Kapacitor CLI if you start playing around with it are tasks and recordings. So what are tasks? A task is a unit of work. It’s defined by a TICKscript. It’s either a stream or a batch. So either that data’s coming in the frontend via a subscription to Influx, or it’s being written in via Telegraf, or Kapacitor’s querying an InfluxDB instance for that data and bringing it in that way. And tasks define a Directed Acyclic Graph. That’s a fancy word for pipeline. So if you think about a data pipeline, there’s different stages. The data gets transformed as it moves through the stages and then it comes out the back. TensorFlow uses this model. A lot of data processing frameworks use the pipeline model. We are no different. And then recordings allow you to save data from a stream or a batch, and it’s useful for testing your TICKscripts. So if you want to have this stream of data that produces a consistent result for any sort of automated testing, or as you’re developing your TICKscripts, you want a consistent data set to run against, recordings are an easy way to save that data. Just digging into the Directed Acyclic Graph computational model a little bit more, this is kind of what it would look like. So based on certain criteria, data flows through the different nodes in the graph and then finishes in an individual node. And this is based on the rules or tasks that you define in Kapacitor as to what these different nodes will be and how the data will process through. And this works for both streams and batches. So if you look at that number one up there, that data could be coming in via a stream. You could be windowing it in node two, and then also, by the same token, it could be a batch task.
So you could be querying that data via batch and then splitting it into a couple of different pieces for different calculations down the graph a little bit further.
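As a rough sketch of the batch case just described (the slide itself is not part of this transcript), a batch task might begin like the block below. The database, retention policy, measurement, and field names are assumptions based on Telegraf defaults, not values confirmed by the recording.

```tickscript
// Initial node: a batch task gets its data by querying InfluxDB.
batch
    // Chaining method (note the pipe): adds a query node to the graph.
    // "telegraf"."autogen"."cpu" and usage_idle are assumed names.
    |query('SELECT mean(usage_idle) FROM "telegraf"."autogen"."cpu"')
        // Property methods (note the dot): they configure the query node
        // rather than adding a new node.
        .period(1m)
        .every(20s)
        .groupBy(time(10s))
```

Further chaining methods added after the query node would become the downstream nodes of the graph, which is where the data would be split for different calculations.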
Jack Zampolin 10:27.706 So first, let’s go ahead and chat about chaining methods here. In the example above, query is the only chaining method there, and they’re denoted by that pipe. So if you’ll notice the pipe, the pipe denotes a chaining node. And then if we’re looking at this same TICKscript, the property methods are period, every, and groupBy. They’re modifying the query chaining method. So here, we’re going to talk through a few of the standard syntactical things that you’re going to see in TICKscripts. We have literals. So down here in fill, that tolerance is also a duration literal. You’ll see strings in every TICKscript. We also support integers as well as floats, but those are the primary literals that we support in TICKscripts. There are statements, variables, and comments. So here, the variable is set by var, and then we don’t really have an example of a statement in here I don’t think, but we’ll see some of those later. One thing that’s a little tricky about TICKscripts that I found is that the single and double quotes do mean very different things. And we were talking about statements earlier, this lambda here is a statement. That’s a good example of a statement. So in the quoting rules, the way that I like to think about this is single quotes are used to refer to the name of the thing. So cpu, this log. And double quotes refer to the actual value of the thing. So we’re evaluating that value if you’re using double quotes. So we want the value of usage_idle to check if that’s less than 70. And in that case, you would use the double quotes there. So lambda expressions. This is how you do that sort of evaluation in Kapacitor. So we use lambda expressions to define transformations on data points, as well as Boolean conditions that can act as filters. So the most canonical example of a lambda expression is a WHERE clause. So WHERE host, and notice here we’re using double quotes, we want to resolve host.
If the host is server 001, then that’s going to return true. So that’s a simple lambda expression. A more complex one is this sigma down here. This will critically alert if value is three standard deviations beyond a moving average. I’m not saying that exactly right, but you get the idea. There’s excellent documentation on sigma in Kapacitor as well.
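The syntax points above (variables, pipes and dots, the two kinds of quotes, and lambda expressions) can be pulled together in one small sketch. This is an illustration under assumptions, not the script from the slides: the host value `server001`, the field `usage_idle`, and the log path are hypothetical.

```tickscript
// `var` declares a variable that holds part of the pipeline.
var cpu = stream
    |from()
        // Single quotes: a string literal, here the measurement name.
        .measurement('cpu')
    // Double quotes inside a lambda: a reference to a tag or field,
    // resolved to its value for each point. 'server001' is hypothetical.
    |where(lambda: "host" == 'server001')

cpu
    |alert()
        // sigma() reports how many standard deviations the current value
        // is from the running mean of the values seen so far.
        .crit(lambda: sigma("usage_idle") > 3.0)
        // Hypothetical log file path.
        .log('/tmp/alerts.log')
```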
Jack Zampolin 13:48.921 There’s a number of different node types in TICKscripts. We’ve sort of talked about the generics, and now let’s talk about the specifics. So the initial nodes, there’s only two, batch and stream. And just to go over that again because it is kind of confusing. Batch nodes will query data directly from InfluxDB. Stream nodes operate on a stream of data that’s coming into Kapacitor. Your interior nodes there, you’ll see stuff like the InfluxQL node. You can just pass arbitrary InfluxQL. Join two time series based on time. UDF node, if you have any custom functions or user-defined functions that you would like to use to process the data, there’s a method for incorporating those, and you can reference them in the TICKscript. An eval node, so if you need to evaluate a lambda expression. A lot of stuff in there. And then as far as output nodes, there’s an alert node, an HTTP output node that exposes data at a REST API based on a programmatic set of rules, and then an InfluxDB output node. So the alert node has a number of different chaining methods that change the behavior of that. So whether you want to output to Alerta, or PagerDuty, or even just send a simple HTTP request with some JSON, which is very powerful in its own right, alert nodes are great for that. So we’re going to talk through a few of the different node types. First, we’ll go through batch. It’s the initial node, and the next node that you’re going to see is always going to be a query chaining method. So that’s an example of the batch node that we’ve seen a couple times here so far. So and as you’ll notice below there, period and every are utilities for describing the frequency and periodicity of the batch query execution. So what does that mean in English? .period(1m) means where time is greater than now minus one minute. So this query will run over the last minute of data.
Jack Zampolin 16:09.185 “Every” specifies how often the query/TICKscript runs. So every 20 seconds, this will query one minute of data. In that one minute of data, we’re going to group by time 10 seconds. So this TICKscript is processing 6 points of data every 20 seconds. So for the first exercise, let’s go ahead and write a simple TICKscript that works on batches of data generated by the query. So the query that we’ll see there is SELECT mean(busy) FROM "telegraf"."autogen"."cpu", and then we want to have a period of one hour, every one hour, and we want to group by all the tags. So I’m going to give you just a second to work on that. So the solution, it looks very similar to the one we saw earlier. We’re going to pass in a query there, period one hour, every one hour, and group by star. The next initial node is a stream node. If you’ve got a stream node, the next chaining method is always going to be from, and that’s going to sort of narrow down your stream. So you remember we were talking earlier about the big flow of water, and we’re forking off little streams of it. And then again, those methods down below, database, retention policy, measurement, and WHERE, that’s just narrowing down your search dimension. So as an exercise, write a simple TICKscript that streams data from the measurement cpu in the telegraf database. So we just saw an example of that. Let’s go ahead and take a second to write one.
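For the batch exercise described in this passage, one possible answer is sketched below. It uses the query as spoken in the talk, reading the retention policy as `autogen`; whether a field named `busy` exists in your own `cpu` measurement is an assumption to check against your data.

```tickscript
batch
    // Query text as given in the exercise; "busy" is the field named in the talk.
    |query('SELECT mean(busy) FROM "telegraf"."autogen"."cpu"')
        // Each run covers the last hour of data...
        .period(1h)
        // ...and the query runs once an hour.
        .every(1h)
        // Group by all tags.
        .groupBy(*)
```

The same arithmetic as the earlier example applies: with a one-minute period grouped by ten-second intervals, each run yields 60s / 10s = 6 points per series.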
Jack Zampolin 18:31.691 So stream, from, database telegraf. So we’re narrowing down that stream with these property methods here. Retention policy and measurement. And that gets you there. The next node I want to talk about is the alert node. So let’s say you’ve got a stream, from, and we’re going to group by service. There’s a bunch of stuff up there, and we’re manipulating the data. And then finally, we want to get down into the alerts. So once we’ve manipulated the data and figured out what’s important for us, we’ve narrowed our problem, we’re going to want to do something with that. So this is an example of a lot of the different chaining methods or property methods that you would have on alert. There are integrations with OpsGenie, PagerDuty, Sensu, Slack, VictorOps, Alerta, email. The exec plugin will run a binary on the host. And HipChat. As far as the HTTP post method, I’ve found that extremely useful. We’ve done a webinar on autoscaling Docker Swarm where we’re programmatically generating JSON data and sending it as the message with that HTTP post to integrate with other microservices and services in your infrastructure to take programmatic actions. And it’s a very, very powerful feature of Kapacitor. So just to go back here and talk about these different methods of alert. ID is sort of the identifier of the alert, and that’ll pop up in the message body. So we’re going to call this a Kapacitor alert, and whichever service is important, whichever service trips this condition is going to be what the alert ID is. And then as you see here, there’s a templating engine. We make certain values available. There’s a programmatic way to access those. This is essentially Go templating. So we’re going to pull the ID of the alert and its level. So you see the levels down there. Info, warning, crit. And then we’re going to pass the value in as well so we can show the operator how bad it is.
Info, warning, crit, just set the different alert levels, and they each take lambda expressions.
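A minimal sketch of an alert node along the lines described above follows. The measurement `errors`, the tag `service`, the field `value`, the thresholds, the URL, and the email address are all hypothetical placeholders; the template fields follow Kapacitor's Go-template convention for alert IDs and messages.

```tickscript
stream
    |from()
        // Hypothetical measurement carrying a "service" tag and a "value" field.
        .measurement('errors')
        .groupBy('service')
    |alert()
        // The alert ID is built from the service tag that tripped the condition.
        .id('kapacitor/{{ index .Tags "service" }}')
        // The message pulls in the ID, the level, and the offending value.
        .message('{{ .ID }} is {{ .Level }} value:{{ index .Fields "value" }}')
        // Each level takes its own lambda expression (thresholds are illustrative).
        .info(lambda: "value" > 10)
        .warn(lambda: "value" > 20)
        .crit(lambda: "value" > 30)
        // Placeholder endpoint and address; SMTP settings live in the Kapacitor config.
        .post('http://example.com/api/alert')
        .email('oncall@example.com')
```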
Jack Zampolin 21:12.493 Here’s an example of posting to two different endpoints, and then we also want to email that data to a particular email address. And the SMTP settings are in the Kapacitor configuration. So this exercise’s a little bit harder than the previous ones. Write a simple TICKscript that streams data. So a streaming TICKscript, stream from, streams data from the ‘mem’ measurement and issues a critical alert if the memory free is less than 500 megabytes. And go ahead and take a couple of minutes to do that.
Jack Zampolin 22:09.644 Okay. So this is more pseudocode than anything, but stream from measurement memory, and then we’re going to alert, and notice that crit there, we do need to pass the lambda expression. And then we’re going to log it. So we’re just going to write that data into a file. An important chaining node is the WHERE node. So it filters the incoming data according to a lambda expression. Here we’re just filtering based on host. One thing I’m going to go ahead and note again because I forget it all the time is double quotes mean that that is going to resolve. So that host is going to resolve to whichever host comes out of that group by up there. As this data’s flowing through the DAG, there’s going to be a host here. That’s a pretty excellent example. This also shows how to use variables to break your TICKscripts up into logical parts. This first part here just gives us the sums. So we’re summing up these values, and we’re grouping it by some stuff. And then down here, we’re going to do something with those sums. So we’re going to filter them slightly further, and then we’re going to alert on the ones that match our criteria. So in this exercise, we want to write a simple TICKscript that streams data from the memory measurement where the host is localhost. And it issues a critical alert if the memory free is less than 500 megabytes. So we’re going to build on that previous TICKscript we just wrote and try to add a WHERE clause. So I’ll go ahead and give you a couple of minutes to do that.
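One way the memory exercise with the WHERE filter might look is sketched here. It assumes the `mem` measurement has a `free` field reported in bytes (as Telegraf's memory input usually does) and uses a hypothetical log path; dropping the `|where` node gives the earlier version of the exercise.

```tickscript
stream
    |from()
        .measurement('mem')
    // Keep only points whose host tag resolves to 'localhost'.
    |where(lambda: "host" == 'localhost')
    |alert()
        // Assumes "free" is in bytes: 500 MB taken as 500,000,000 bytes.
        .crit(lambda: "free" < 500000000)
        // Hypothetical path; the Kapacitor user must be able to write to it.
        .log('/tmp/mem_alerts.log')
```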
Jack Zampolin 24:23.759 So here we go. The only difference there is we’re adding a WHERE after the measurement there. So you can see that pretty easily. That is a chaining method. Another important node is the window node. You’re going to use this with the stream. So what this does is to make it easier for computers to work with it. Instead of continually processing this incoming stream, we’re going to cut that stream every five minutes in this case and break that data up into discrete units that allow us to process it much like it would be in batch mode. So a window’s period specifies the amount of data that will be held in memory and yielded to the pipeline. So we’re keeping 10 minutes of data, and the window’s every specifies the frequency at which the data will be yielded to the pipeline. So we’re keeping 10 minutes of data, and we’re running this task every five minutes. If you’re running computationally intensive tasks here, be aware that for that period, you are going to need to keep 10 minutes of data in memory in order to execute this. So for very large data sets, this can quickly get expensive. Just be aware of that. So our exercise here is write a simple TICKscript that creates a 10-minute window out of the stream data from the measurement cpu. And we want that to emit every five minutes like in the example just before. And I’ll give you a little while to do this.
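A minimal sketch of the window exercise, assuming the Telegraf database and the `autogen` retention policy used elsewhere in the talk:

```tickscript
stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('cpu')
    |window()
        // Hold the last 10 minutes of points in memory...
        .period(10m)
        // ...and emit that window to the rest of the pipeline every 5 minutes.
        .every(5m)
```

Because the period is twice the every interval, consecutive windows overlap by five minutes, so each point is emitted in two windows.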
Jack Zampolin 26:34.200 So that simple stream from cpu TICKscript we wrote earlier, we just want to window that with period and every. So pretty simple there. And as we’re getting to more advanced things here, one of those examples is a join node. So here, we have two separate streams, two different measurements. You can’t do this in InfluxQL. We’re going to save those in variables, and then we’re going to go ahead and join them. So we’re going to take that first variable here and we’re going to join requests to that. So each of these measurements have different fields. You need to provide prefixes for both of those. So there’s the “as”, so this is going to prefix errors_field name to each of the fields there. And you can see an example of that down there in error_rate. Or no. I’m sorry. They’re period separated strings, and you can see that down there in errors.value. So that’s how that “as” works there. Tolerance. So to give yourself a little bit of wiggle room there. How are we joining this via time? If your data’s in nanosecond precision, it’s very hard to find things that match up in the same nanosecond. So we do offer that tolerance function there to group points together logically. There’s also a fill command. So if you’re missing points, we need to fill that somehow, and then we need to name the resulting stream. And then you can see down here, we’re evaluating. So we’re calculating a rate, an error rate as a matter of fact. This would be a great example of something you might want to write back into InfluxDB. There’s plenty more examples of join nodes in the docs. And if you run into any issues with those joins, please pop over to community.influxdata.com and we’d love to help you get started.
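The join described above can be sketched roughly as follows. The measurement names `errors` and `requests`, the field `value`, and the specific tolerance and fill values are assumptions for illustration, modeled on the kind of example the talk describes.

```tickscript
// Two separate streams, saved in variables.
var errors = stream
    |from()
        .measurement('errors')

var requests = stream
    |from()
        .measurement('requests')

// Join requests onto errors.
errors
    |join(requests)
        // Prefixes for each side: fields become "errors.value" and "requests.value".
        .as('errors', 'requests')
        // Duration literal: points whose timestamps fall within the tolerance
        // are treated as occurring at the same time.
        .tolerance(5m)
        // Value to use when one side has no point to join.
        .fill(0.0)
        // Name of the resulting stream.
        .streamName('error_rate')
    // Compute the error rate from the prefixed fields.
    |eval(lambda: "errors.value" / "requests.value")
        .as('rate')
```

Adding an InfluxDB output node after the `eval` would be one way to write the computed rate back to the database, as the talk suggests.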
Jack Zampolin 28:49.243 So as far as using Kapacitor, we’ve talked a lot about the domain-specific language that you need to write to generate these alerts. But how do we actually do anything with those TICKscripts? So this is what you would run on your command line interface to define alerts. One of the common tasks that you’re going to do is to define a task. So our CPU alert that we wrote earlier, we’ve expanded that example a little bit, and you can see that there. So we’ve got that TICKscript, and we’ve saved it in the file cpu_alert.tick. So when we’re defining the CPU alert, it’s kapacitor define, and then we give it a name. That name is a unique ID that’s going to be used to access that task within Kapacitor. We need to tell Kapacitor what type of TICKscript it is. So this one’s a stream. We need to pass it the TICKscript. So there. And then we also need to specify which database retention policy combination this TICKscript is going to act on. So that particularly will be telegraf.autogen. So once we’ve defined that task, when you run kapacitor list tasks, you should see the following output. It’ll give you the task name. It’ll give you the type. It’ll give you whether it’s enabled, so you do need to enable the tasks, whether it’s executing, and which database retention policies it is acting on. So one thing to note if you’re following along and you’re using this TICKscript, you’ll notice this will log to /tmp/alerts. You need to make sure that the /tmp folder is writable by the Kapacitor user. So you’re going to want to chown that folder or file as the Kapacitor user or else you might get some permission errors.
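A sketch of what `cpu_alert.tick` and its definition might look like is given below, with the CLI steps shown as comments so the block stays a single TICKscript file. The threshold comes from the `usage_idle` example earlier in the talk; the exact log file name is an assumption.

```tickscript
// cpu_alert.tick
//
// Define and list the task from the command line (shown here as comments):
//   kapacitor define cpu_alert -type stream -tick cpu_alert.tick -dbrp telegraf.autogen
//   kapacitor list tasks
stream
    |from()
        .measurement('cpu')
    |alert()
        // Critical when idle CPU drops below 70 percent.
        .crit(lambda: "usage_idle" < 70)
        // Assumed file name under /tmp; it must be writable by the Kapacitor user.
        .log('/tmp/alerts.log')
```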
Jack Zampolin 31:02.757 So here’s an example of recording and then replaying a stream. So this would be you’re taking a stream of your production data, running it through Kapacitor, and recording it so that you can develop some scripts based on that data. So you would say, kapacitor record stream. We want to give it a task. So that task is going to narrow down the faucet for us and give us the specific data that we’re looking for. And then we want to say how long we want to record it for. So that’s that command. So kapacitor record is going to run for a full 20 seconds. So your command line is going to hang for a little while. It’s going to look weird. And then it’s going to spit out this big UUID. So you can see the recording ID is that. So for convenience, saving that as an environment variable is an easy way to do it especially if you’re going to run some command line scripting. Normally, I’ll just save the output of that kapacitor record stream into a command-line variable and then use that to replay it later. So when you list recordings, you’ll see that recording ID as well as some more data about the recording. And then to replay that data back into Kapacitor, you can see the command there. And -fast, just as a note, will replay that full 20 seconds of data as fast as possible. If you don’t pass that, it’ll play it in real time. Especially when you’re testing, it doesn’t really matter how quickly the data comes in. Kapacitor is quite capable, and it can handle quite large quantities of data. So playing it in quicker gets you a quicker cycle to help you debug those issues. So showing a task, you would run kapacitor show and then you would pass it the ID of the task. So in this case, cpu_alert. And it’s going to give you some data on that.
Jack Zampolin 33:18.499 And in order to enable a task, you would say kapacitor enable and then you would pass in that cpu_alert. And it would set enabled and executing to true. I’m sorry, that’s cut off on the top, but you can see there that that’s kapacitor show cpu_alert. That’s the same thing we’re running over here. And you can see that it’s enabled false, executing false. Once you run kapacitor enable and then you pass it that task name, it will enable and begin execution of that task. kapacitor show is extremely helpful for debugging. If you see down there at the bottom, it says DOT. That’s a visual representation of the Directed Acyclic Graph. And there’s even some more output below that DOT that will show you how many data points have passed through each section of the Directed Acyclic Graph. This is especially useful when you’re testing WHERE filters. So maybe you’ve got a pretty complicated lambda expression that’s filtering out some data. When you replay your recording in to test the task, you might see that no data’s getting past that WHERE filter. If that’s not intended, it’s an easy way to know that that’s where we need to debug. So let’s take a second to try and define a task off of the following TICKscript. So if you need a database retention policy combo for this, I think it’s telegraf.autogen. So I’m going to go ahead and give you a minute to do that. So you’d see kapacitor define, and we’re running this on memory, so let’s just call it mem_alert. Obviously, if you’ve got more alerts, you might want some more complicated names than mem_alert. This is just for convenience. -type stream. We’ll pass it this TICKscript which we’re calling mem_alert.tick, and then we’re giving it a database retention policy. One other thing Kapacitor can do is act as a continuous query agent.
So if you’re familiar with continuous queries in InfluxDB, they look like this, but what they do is run continuously on data in order to reduce the resolution of it and make it quicker to query for a dashboard. So let’s say you’ve got some very high-resolution data coming in every 10 seconds, but you only need five-minute averages of that. So you would create a continuous query to use the aggregate mean or median to reduce that data into fewer data points to make it quicker to pull up for a dashboard.
Jack Zampolin 36:09.835 Kapacitor can perform the same function as you can imagine either via batch, so just specifying how frequently the query runs with period and every, and then also stream as well. But you would need a window in that case. And then as far as an output, it would be an InfluxDB out node as opposed to an alert node that you would use. So to change this CQ, which just has a simple median, and we’re moving it into a separate measurement in a different retention policy, and there’s some group by there as well, you could convert it to this particular task. So you see that batch there, so we’re just going to keep it as batch. We’re going to run that query. SELECT median(usage_idle) AS usage_idle FROM "telegraf"."default"."cpu". And then we’re going to want our period and every. So if you’ll notice, their period and every are five minutes. That’s defined by that GROUP BY time here. That’s how continuous queries decide how often to run and how much data to run over. There’s special syntax to change that, but that’s the way the CQ works. So we’ll go with period and every. Generally, if you’re running them in production, some data might come in late. So having the period be double the every, so running it on 10 minutes of data every 5 minutes, helps you catch some of those potentially out of order points and get accurate readings. We want to group by star. One thing we’re going to want to do there is align the data to standard boundaries, and then we’re going to want to write back into Influx. And if you’ll notice there, that’s the same database retention policy and measurement that we were talking about earlier. You can even offload some of that processing. So that median there that we used to be running in the query, we can offload that to Kapacitor to do as well.
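As a hedged sketch of the batch version of that continuous query: the query text follows the talk, while the output retention policy and the measurement name `median_cpu_idle` are hypothetical, since the talk does not name them.

```tickscript
batch
    |query('SELECT median(usage_idle) AS usage_idle FROM "telegraf"."default"."cpu"')
        // Matches the CQ's five-minute GROUP BY time interval. The talk suggests
        // a period longer than every in production to catch late-arriving points.
        .period(5m)
        .every(5m)
        // Align query start and stop times to even boundaries of the every interval.
        .align()
        // Group by all tags.
        .groupBy(*)
    |influxDBOut()
        .database('telegraf')
        // Hypothetical output retention policy and measurement.
        .retentionPolicy('default')
        .measurement('median_cpu_idle')
        .precision('s')
```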
Jack Zampolin 38:12.476 So the primary use case for using Kapacitor for CQs would be if your InfluxDB database is overloaded and you’re having performance issues with your continuous queries, you might want to offload that computational load to Kapacitor. So you can offload as much or as little of it as you want. If you want to do this with streaming queries, you can even reduce load on the database more, but in this example, we’re showing batch queries. And then again, streaming queries, just know you are going to have to keep all that data in RAM. So if you have some longer-term roll-ups, let’s say you’re rolling up a couple hours or days of data, you’re probably not going to want to do that in Kapacitor. This is an example of a CQ stream that we were talking about earlier. So the streams all work the same way. You’re peeling off a little stream of a larger flow. So in that from statement, we’re peeling off our stream. We’re going to window that data. So we’re going to break it up into blocks that we can deal with. Notice that period and every of five minutes again. That’s from the CQ earlier. Obviously, because it’s a stream, we’re not passing any of that computational load off to the database. So we’re just going to be running the median there. And then we’re going to be InfluxDB outing the data to our proper database retention policy and measurement. So if you’re making a decision between stream and batch, as I said earlier, it boils down to the available RAM and the time period being used. A stream task will have to keep all data in RAM for the specified period. So if the period’s an hour or a day, i.e. too long, you will need to first store the data in InfluxDB and then use batch to query that data back into Kapacitor. Stream tasks do have one advantage. Since it’s watching the stream of data, it understands time by the timestamps on the data. As such, there are no race conditions for whether a given point will make it into the window or not.
If you’re using a batch task, it’s still possible for a point to arrive late and be missed in a window. If you’re using batch tasks, it’s extremely important to set that period longer than your every so that you’re pulling in a little bit more data and potentially solving for those race conditions.
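The stream version of the continuous query discussed above might be sketched as follows. As with the batch sketch, the output retention policy and measurement name are hypothetical placeholders, and the input names follow the query quoted in the talk.

```tickscript
stream
    // Peel off the cpu stream from the larger flow.
    |from()
        .database('telegraf')
        .retentionPolicy('default')
        .measurement('cpu')
        .groupBy(*)
    // Break the stream into five-minute blocks; all of this is held in RAM.
    |window()
        .period(5m)
        .every(5m)
        .align()
    // The median is computed by Kapacitor rather than by the database.
    |median('usage_idle')
        .as('usage_idle')
    |influxDBOut()
        .database('telegraf')
        // Hypothetical output retention policy and measurement.
        .retentionPolicy('default')
        .measurement('median_cpu_idle')
        .precision('s')
```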
Jack Zampolin 40:39.018 Streaming doesn’t have as much of an issue. So that’s another thing to note. And thank you all very much. This is the end of the training for today. If you have any questions, please go ahead and post them over on community.influxdata.com. I’ll be happy to answer them over there. Also, one thing I would mention, if you’re trying out Kapacitor, I would encourage you to give Chronograf a try. These TICKscripts can be very difficult to get up and started with. And Chronograf provides you with a user interface where you can click down through your measurements, fields, tags, and construct these TICKscripts with a visual editor. And those TICKscripts use best practices so that you can see how people write these in production. And you can always pull those out of Kapacitor and modify them for your own use. So if you want something that’s not supported by the UI, maybe a UDF or some more advanced math that’s difficult to do in the UI, you can always start it in the UI. So peel your stream off in the UI as you will and then modify the TICKscripts on your own. So I would highly suggest checking out Chronograf. An easy way to do that is through the InfluxData sandbox. If you go to github.com/influxdata/sandbox, you’ll see it there. And obviously, Kapacitor on Cloud, much easier than self-hosted. So this is a great getting started experience. Thank you all very much.