Downsampling Your Data
Webinar Date: 2018-12-20 08:00:00 (Pacific Time)
In this session, you will learn downsampling strategies and techniques for your InfluxDB data.
Watch the Webinar
Watch the webinar “Downsampling Your Data” by filling out the form and clicking on the download button on the right. This will open the recording.
Transcript
Here is an unedited transcript of the webinar “Downsampling Your Data.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
• Chris Churilo: Director Product Marketing, InfluxData
• Noah Crowley: Developer Advocate, InfluxData
Chris Churilo 00:00:03.497 All right. Three minutes after the hour. As promised, we’re going to get started, and there’ll probably be other people that join us a few minutes after, but that’s not a problem. Good morning. My name is Chris Churilo. Today we are doing a training on downsampling your data with Noah Crowley, one of our awesome Developer Advocates. He’s based in New York City, and he also runs a Time Series Meetup in New York City, so if you happen to be in the area, check those out. He always has fabulous speakers there. So I want to remind everybody, if you do have any questions about the training material, or if you have questions about any part of the TICK Stack, please go ahead and put them in the chat or the Q&A panel. We’ll get them answered before the end. And if you don’t, or you come up with some questions later on, no problem. Just post them to the community site, and we will make sure that we get those answered. So with that, I will hand it off to Mr. Crowley.
Noah Crowley 00:00:58.074 Awesome. Thank you, Chris. Good morning, everybody, and welcome to part of the InfluxData Getting Started series. This is Downsampling Your Data. Like Chris mentioned, I’m Noah Crowley. I’m a Developer Advocate. It’s my job to talk to developers, find out how they’re using our product, what they like, what they don’t like, give them good resources for getting started, and bringing feedback back to the team about ways that we can improve. So I’ll have my email address and Twitter at the end of the presentation. If you want to reach out to me about anything either in this presentation or another presentation, please feel free. But without further ado, we’ll start this training.
Noah Crowley 00:01:41.408 So the agenda for today is to talk a little bit about downsampling. We’ll start out. We’ll go over what downsampling actually is. We’ll talk a little bit about why you might want to do it, and then we’ll talk about how specifically to downsample data using InfluxData. And there are two methods to do that. There’s one called continuous queries, which is a feature of the database itself, and then there’s a second application called Kapacitor, which is a stream and batch processing engine, which among other things, can be used for downsampling. So we’ll touch on both of those during the presentation. First of all, what is downsampling? The textbook definition says that it’s the process of reducing the sampling rate of a signal. So if you think about what that means, whenever you’re trying to measure something—generally with a time series database—you’re measuring it maybe at regular intervals, or at irregular intervals, but you’re measuring it periodically through time. And how frequently you take those samples, or equivalently the amount of time between them, is your sampling rate. So if you’re sampling something three times a second, then your sampling rate is three times a second. If you’re taking a measurement once an hour, then your sampling rate is once an hour. And what that means is it determines how much data you have and how much resolution you have, how closely you can look and examine the way that things change over time, either over long periods or over short periods.
Noah Crowley 00:03:09.584 So what does some full-resolution data look like? This is the usage for the user on a standard laptop. So they’re doing something throughout the day where every now and then, usage peaks. And maybe they’re running a task, or processing some photographs, or whatever it might be. But you can see these peaks in the full-resolution data. I’ll turn on my pointer so I can point around the screen. So these are sort of the peaks where you see usage goes along for a period of time, and then there’s a peak, and then more data. But if you look closely here, you start to lose a little bit of the detail in these areas. And that could potentially be a problem. It could potentially not be a problem. Maybe you’ll be able to zoom in there and see a little bit more of the details. But if this is what full-resolution data looks like, then this is what some downsampled data looks like. So in this technique, what we’re actually doing is we’re taking every Nth value from this graph, and we’re graphing it here. And what that gives us is a much cleaner representation of the data. It’s a lot easier to see, to read, to identify individual points, and potentially even to identify trends that are happening.
Noah Crowley 00:04:26.826 Another method of downsampling is to take a random subset of N points. So whereas in the previous example, we were going through and, say, taking every fifth, or every tenth, or every hundredth point and using that as our sample, here we’re going through and we’re randomly sampling the data that we already have to generate a new downsampled set. I actually didn’t really understand why you might want to do this as opposed to just taking every Nth sample, so I talked to one of our engineers, Michael Tusa, and he explained to me that one of the advantages of having this random subset of N points is that it preserves some of the statistical measures of the data that other methods might not always preserve, things like median, or mode, or average, and stuff like that. If you’re taking sub-samples of something at regular intervals, then it’s possible that you’ll lose data due to seasonality, or something that lines up with the intervals that you’re taking. So if you’re taking sub-samples every five samples, then maybe you have some sort of action that’s going on at exactly that time which you lose if you try to do that every-Nth selection. So actually taking a random subset might be able to preserve that data for you.
Noah Crowley 00:05:49.381 Another method of downsampling is to go ahead and take the average of N points. So every five points, we take the average, we plot that on a graph, and this is what we get. This is a pretty effective method. It gives you a nice visualization of the general trends. As you can see, this sort of maps better to the actual data than either this or this. Actually, the N values are pretty close as well. But each of them has sort of a different representation of the data, so it’s important to understand when you’re downsampling, how you’re downsampling because it does affect the ultimate visualizations, and the way that you do analysis and alerting, and all sorts of other stuff.
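For reference, here is roughly how each of those three approaches maps onto InfluxQL. This is a sketch rather than anything from the slides, and it assumes the Telegraf cpu measurement and usage_user field that come up in the examples later on; FIRST() per interval stands in for “every Nth value,” SAMPLE() takes a random subset, and MEAN() averages each window:

    -- keep one point per interval (a close analogue of "every Nth value")
    SELECT first("usage_user") FROM "cpu" WHERE time > now() - 1h GROUP BY time(5m)

    -- take a random subset of N points
    SELECT sample("usage_user", 10) FROM "cpu" WHERE time > now() - 1h

    -- average the points in each interval
    SELECT mean("usage_user") FROM "cpu" WHERE time > now() - 1h GROUP BY time(5m)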
Noah Crowley 00:06:31.817 Ultimately though, the question is why would you want to do this? Full-resolution data, for most cases, is better. There’s more of it. You can see more detail. You can dive in and really examine stuff. You’re not missing anything. There’s nothing that’s been obscured by your downsampling methods, or whatever you might be doing. But unfortunately, full-resolution data does have some downsides. It’s hard to work with. It can take up more disk space. It can utilize more network communications. If you’re doing processing on that data, that means more CPU. In general, full-resolution data takes more resources, and more resources mean more money. In addition to the back end, when you’re actually storing the data, and transmitting it from your various applications, or IoT devices, you then have some issues on the front end when you’re trying to do visualizations. So full-resolution data is difficult to visualize in the browser. The browser is not the most efficient computing environment, and so loading it up with thousands or millions of data points and then trying to graph them on the screen can cause a lot of sadness for your end users. So something that we do in Chronograf is we downsample based on an interval, which is calculated based on the time window that you’re looking at. So that helps us alleviate the issues in the browser, but you still have to worry about what you’re doing with full-resolution data on the disk. So if you can afford it, if you have massive server farms, and you’re willing to spend ten times as much on your monitoring as you do on your business, then you’re good to go. But in most cases, the spend on monitoring becomes disproportionately large, and you have to think about ways that you can save money and still get value out of the data that you’re collecting. And downsampling is a really great way to do that.
Noah Crowley 00:08:39.362 How can we do downsampling intelligently? What is smart downsampling? A lot of the times when we’re dealing with data like this, we know that the relevance of the data decreases with age. So you know that, “Hey, I might need this full-resolution data immediately because I’m monitoring the response times of my application, and I need to know in five, or ten, or fifteen seconds if something is going wrong because I’ve got an SLA that I have to hold up, and if things are down for a few minutes, that already starts to eat into our SLA for the year.” So having that immediate high-resolution data is really important, but then two weeks later when you don’t need to react to an incident anymore, the value is no longer as important. You don’t necessarily need those sub-second collections. You could potentially look at something over a minute or an hour, but the general idea is that new data, which is high-resolution, is more valuable than old data, and so if you can downsample it and save a little bit of cost in terms of your resources, then that’s a smart thing to do. So as new data becomes old, we decrease the resolution. This is also helpful in places like IoT. You might find that you want to keep full-resolution data at the edge where you’re making decisions about your devices, and how you need to actuate things, and respond to changes, but then once you transmit the data up to your database for a long-term store, you can potentially downsample it because then you’re only really concerned about longer-term trends, and trend analysis. What happened over the course of six months? When you’re looking at stuff in that kind of a time frame, then it becomes less important to understand what happened in the last fifteen or twenty seconds.
Noah Crowley 00:10:32.328 So this is often called roll-ups. And this is sort of another definition of downsampling, a common definition within the example of collecting time series data for monitoring systems and for IoT. You might also see downsampling in the context of audio processing and stuff like that, and in those cases, this concept of roll-ups doesn’t really apply. But if you’re collecting data for analysis, then roll-ups is a really common pattern, and we’ve sort of folded into what we’re calling downsampling here. So what that means is that we’re actually going to expire the older, high-resolution data. We’re going to replace it with a lower-resolution version, and decimated is just the mathematical term for going from something that is high-resolution to low-resolution. And the idea being that we don’t want to just downsample our data and then store both copies because if that’s what we were doing, then we would have significantly more data than we started with, and the value of downsampling is completely lost. So once you’ve downsampled the data, it’s really important to expire that old data and get rid of it so that you can realize the cost benefits of downsampling.
Noah Crowley 00:11:49.679 So in terms of InfluxData, there are two ways to do downsampling. The first, as I mentioned earlier, is called continuous queries. These are queries that run periodically within the database itself. So you’re able to create these from the same query engine that you would use for anything else, either through the Influx CLI, or you can do it in Chronograf. But you can create these queries, and you can have them run periodically in the background so that they’re always running, and they’re always calculating these downsamples for you. You could also use Kapacitor, which is a separate application that you would then need to also install and manage. Fortunately, it’s as easy to get up and running as the rest of the TICK Stack. It’s a single Go binary. We provide packages for a lot of package managers, so oftentimes, it’s as simple as saying, “brew install kapacitor” or “apt-get install kapacitor,” or something like that. So those are the two methods of doing downsampling. And then in terms of doing expiry, so that we can actually have the full solution for doing roll-ups, we have something called retention policies in the database. And retention policies are self-explanatory. The name represents exactly what it is: it’s how long the data is going to be retained in the database. So you can say, “Hey, I want to store this data in this retention policy, which is two weeks. And so after two weeks, I want the data to go away. And I want this data to be stored in a retention policy that has an infinite duration so that it’ll stick around forever.” And then once you have those two retention policies in place, you can send the high-resolution data to the retention policy that gets purged every two weeks, and you can send the roll-up data to your retention policy that’s kept around for an infinite duration. So by mixing and matching either continuous queries or Kapacitor with retention policies, you’re able to get a really effective roll-up process within InfluxData itself.
Noah Crowley 00:14:01.816 So first, we’ll dive into continuous queries. And we’ll go over a little bit of how they work, what the syntax looks like, what some of the pros and cons there might be. So this is a really basic example of continuous query syntax. Again, you can run this from the Influx command line. So if you just open up the command line, just type influx, and it should connect you automatically to your database if it’s running locally. If it’s running remotely, you’ll need to add a few arguments so you can make sure to connect. But once you’re connected, you can just enter a query in this format, and it’ll create a continuous query for you. So let’s just parse through this real quickly, piece by piece. You start out by saying CREATE CONTINUOUS QUERY. That’s how the database knows what you’re trying to do. You give the continuous query a name, which can be anything, and you give it a database that you want it to operate on. And then you can BEGIN the query itself. And in our case, the query is going to be a SELECT. We’re going to select the field that we want to downsample. We’re going to describe where we want to eventually insert it with the INTO clause. We’re going to say FROM, which is the measurement that we want to select from. So we have the actual, individual measurements in each database. And then we have GROUP BY, which is going to describe how we’re going to actually aggregate this data. Generally, we’re going to give it a time interval, so in our case, we might look at roll-ups over every five minutes or something like that. And then you’re going to give it a number of tags as well, so that you can do a more efficient query of the data on the back end.
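Putting those pieces together, the general shape is roughly this (a sketch with placeholders, not the exact text of the slide):

    CREATE CONTINUOUS QUERY <cq_name> ON <database>
    BEGIN
      SELECT <function>(<field>)
      INTO <destination_measurement>
      FROM <source_measurement>
      GROUP BY time(<interval>), <tag_keys>
    END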
Noah Crowley 00:15:45.215 So let’s look at a specific example of what that might look like. So in this case, we’re creating a continuous query called my_cq. We’re creating it on the Telegraf database. The Telegraf database is created automatically by the Telegraf application itself, and it’s where information is sent by default from the Telegraf collection agent. So in this case, we’re still looking at that usage_user number that we were looking at before. So what Telegraf is doing is it’s periodically collecting performance data about the machine that it’s running on, and then it’s sending it over to the Telegraf database. So we’ve specified that we want to run this continuous query on that database, and so we’re able to select the average of the specific field that we want to look at. We’ll describe where we want to then send that data once it’s been downsampled. In this case, we want to send it to telegraf.rollup_5m.cpu. And let’s break down exactly what that is. So the first part of that name, telegraf, is just the database that you’re writing into. The second part of the name is actually the retention policy that you want to write into. So in our case, we have two retention policies here. We have autogen, which is the default retention policy that’s created automatically when the Telegraf database is first created, and then we have a second retention policy called rollup_5m. And so what we’re doing here is we’re selecting the mean of usage_user, we’re inserting it into the Telegraf database, specifically using the rollup_5m retention policy, and then into the measurement cpu, which matches the measurement we’re pulling from. And you can see that down here. We’re pulling from telegraf, from the autogen retention policy, and from the cpu measurement. Finally, because we’re doing a roll-up over five minutes, we want to group by time, and then we want to grab all of the tags that are in there and carry them along into the new, downsampled data.
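The continuous query being described would look something like this (reconstructed from the narration, so the exact formatting on the slide may differ):

    CREATE CONTINUOUS QUERY "my_cq" ON "telegraf"
    BEGIN
      SELECT mean("usage_user")
      INTO "telegraf"."rollup_5m"."cpu"
      FROM "telegraf"."autogen"."cpu"
      GROUP BY time(5m), *
    END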
Noah Crowley 00:17:57.479 So this is pretty cool. This is going to give us exactly what we want. We’ll start getting these five-minute roll-ups of the average of the measurement over that time, so the mean CPU usage for the individual user. But there is one downside, and that is that selecting mean of usage_user actually rewrites the name of the field as you’re downsampling it. So as this data comes out into the rollup_5m retention policy, it’s not going to be labeled usage_user the same way that it was when you were extracting it from the database. And this can cause problems when you’re trying to do visualizations and things like that because now you’re looking for a different set of series than you were before. And maybe you designed your dashboards to take that into account, which is totally a good way to go, but maybe you haven’t. And so there’s another little tidbit that we can add to this query in order to make things a little bit easier to use. And that’s the AS clause. And basically, all the AS clause is doing is preserving the original field name. Otherwise, it would come out with the name of the function that you were applying to it, so you would be writing the new data under the field name mean. But what we want to do instead is write the new data as usage_user so that we can look at both of these two retention policies and we can say, “Okay. We have a usage_user in one place, and we have another usage_user in another place. Those are the same data, except one is downsampled and one is not.” So that’s really an effective technique that you should definitely use if you’re writing continuous queries.
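With that change, the continuous query becomes something like this (again a reconstruction, using the same names as above):

    CREATE CONTINUOUS QUERY "my_cq" ON "telegraf"
    BEGIN
      SELECT mean("usage_user") AS "usage_user"
      INTO "telegraf"."rollup_5m"."cpu"
      FROM "telegraf"."autogen"."cpu"
      GROUP BY time(5m), *
    END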
Noah Crowley 00:19:44.217 You might be interested in downsampling more than just an individual measurement like we did in the previous example. So in this case, you’re actually going to select the mean of everything, all of the various fields. You’re going to select that into telegraf, into the rollup_5m retention policy, and you’re going to go ahead and give it the :MEASUREMENT backreference, which will automatically fill in the destination measurement based on what’s being selected. You’re going to select that from telegraf.autogen, from all of the measurements in there, and then you’ll group by time for five minutes, and downsample exactly as we did before. This is a more advanced way of doing things, and it definitely has some downsides as well. Using select star with INTO, the query actually converts the tags in the current measurement to fields in the new measurement. So that can be problematic, especially if you have things that were previously differentiated by a tag value, because InfluxDB can actually overwrite those points if they were differentiated by tag value, especially if they had the same time. So the way that we get around that is by adding a GROUP BY on the tag keys to preserve the tags as tags. And that’s what’s happening down here. So that’ll actually group on all of the various tags and make sure that those are preserved as they get written to the new retention policy. This is still potentially problematic for writing dashboards because field names will still be different, so you’d have to make sure that you write your dashboards to account for that.
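A sketch of what such a wildcard continuous query might look like (the name cq_all_5m is just a placeholder):

    CREATE CONTINUOUS QUERY "cq_all_5m" ON "telegraf"
    BEGIN
      SELECT mean(*)
      INTO "telegraf"."rollup_5m".:MEASUREMENT
      FROM "telegraf"."autogen"./.*/
      GROUP BY time(5m), *
    END

Note that the downsampled fields still come out as mean_usage_user and so on, which is the field-naming caveat mentioned here.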
Noah Crowley 00:21:44.938 So there are a few issues with continuous queries in general that I sort of touched on a little bit already. One of them is this issue of having to rename things as you downsample. That definitely causes problems when you’re building out dashboards and things like that. You also can’t query across retention policies currently, which can be a little bit frustrating. So what we often see users do is having a dashboard for their roll-up data, as well as a dashboard for their non-roll-up data, or potentially using template variables in there to try and get at their roll-up versus their non-roll-up. But having two separate dashboards usually is effective because the first set of dashboards with the high-resolution data is usually used for doing things like incident response and taking immediate action, whereas the roll-ups are generally used for that longer-term analysis. So that can often be an okay pattern, but it is one of the gotchas in using continuous queries. Another issue that you potentially have is that it will place extra load on your InfluxDB instance. How much extra load really depends on what kind of work you’re doing. If you only have one or two continuous queries, and you don’t have a lot of people querying the database generally to get data, then that’s not such a big deal. But if you are doing hundreds of continuous queries, and potentially one for each, individual measurement name so that you can make sure that the measurements in the roll-up match the measurements in the original, high-resolution data, then all of a sudden, you have considerably more work being done. And it’s something that you really need to be aware of if you’re going to be using continuous queries. Keep an eye on how many of them are running, how long they’re taking, and whether or not you have enough resources to deal with that.
Noah Crowley 00:23:45.483 It’s also limited to things that are expressible in InfluxQL, and there are a couple of things that are not possible in InfluxQL. The first is that there are no cross-measurement joins. This is a problem, not only for continuous queries, but also for other types of math, and dashboarding, and things like that. The way that we’re hoping to address this is actually with a new query language that we’re putting together, called IFQL, and a more general-purpose query engine that will be able to take queries in both InfluxQL and IFQL, and potentially even other query languages, and do things like joins across measurements. So if that’s something you’re interested in, our CTO, Paul Dix, has a number of talks about IFQL, and the new query engine, and where things are going with that, and I definitely recommend them. I’m sure you can find those on our website or our YouTube page, but if not, go ahead and reach out to me, and I can send you a link.
Noah Crowley 00:24:51.105 And the other issue with InfluxQL is that there aren’t really any rolling windows in InfluxQL. And we’ll talk about exactly what a rolling window is when we get to Kapacitor, because that’s one of the things that Kapacitor lets you do. But it’s basically saying that like, I want to be able to compute an average of the data over the last half an hour, but I want to do that every five minutes. So there is overlap in those computations in that every five minutes, I’m including twenty-five minutes of data from the previous computation, but that tends to be a really useful pattern because you do want these sort of longer windows of downsampling that are coming in every five minutes. So you’re maybe looking at averages over a longer period of time, but those are being updated really regularly.
Noah Crowley 00:25:45.353 Another issue, which relates to the extra load on your InfluxDB instance, is that all CQ processing and execution runs on a single node in the cluster. This is really only an issue if you’re an Enterprise customer and you’re running a clustered version of the product, but it can be something that you need to pay attention to if that is the case because there’s no way currently to distribute that load across your various nodes in the cluster, meaning that one of your nodes, if you have a lot of CQs running, could really behave differently than the other nodes, and cause some problems. And finally, the last caveat is that you can’t have consistent field names without having many continuous queries. So this is what I was talking about before where you want to select something as. Most of the time, if you’re just doing a naive continuous query, it’s going to convert this field A into mean field A, and then you don’t have those consistent field names, and that can cause problems when you’re writing queries, or when you’re building dashboards, and stuff like that. And the only way to really get around that issue is to have a large number of continuous queries, which of course, exacerbates the issue of placing extra load on your InfluxDB instance. So a general rule of thumb with continuous queries is that if you only have a few of them, if you have a lot of head room on your open source instance, then these are a really great solution because they don’t require you to spin up additional software. They don’t require you to build things in Kapacitor. You can just write queries in the format that you’re familiar with, and have the database run them itself. But performance considerations are always an issue, so if you can get away with this, and not worry about the performance, and not worry about some of the other potential caveats, then continuous queries are definitely a good way to go.
Noah Crowley 00:27:48.840 But if you really need something like rolling windows or consistent field names and stuff like that, then you might want to end up looking at Kapacitor. So Kapacitor is another element of what we call the TICK Stack. So the TICK Stack is Telegraf, T, InfluxDB, I, Chronograf, C, and Kapacitor, K. And it’s our platform for metrics and events from collection to visualization. It’s an [inaudible] solution to dealing with time series data. And what Kapacitor is, it’s a stream and batch processing engine. So it can do a lot of what continuous queries do and then more on top of that. So batching versus streaming. These two methods of functionality are really valuable, again, depending on your workload. So with continuous queries, you’re always operating in this mode of batches. You’re making queries, you’re doing operations on them, and then you’re loading that data back into the database. And that’s something that Kapacitor can do as well. You can say, “Hey, Kapacitor, I want you to query InfluxDB periodically, every five minutes, or every ten minutes, get back those results, do some sort of computation on them, and then output them,” wherever you want to output them. In our case, we’re going to be writing it back to the database because we’re doing downsampling, but Kapacitor can also be used for things like alerting, or writing to external data sources, or bringing in additional data from something like a MySQL database, and adding that to some tags, and writing it back to the DB.
Noah Crowley 00:29:36.310 So Kapacitor is actually really, really flexible, and batching is a good way to work with it because it’s sort of more, well, we’ll talk about that in a second. The benefits of batching are that it doesn’t buffer as much of the data in RAM. It can do the query, get the data back, and then write it back out. It’s not having to continuously stream data points from InfluxDB. But one of the downsides is that it places additional query load on Influx. So the way that this is working is very similar to the way that continuous queries are working, in that they’re something that runs periodically that does work, and that work is being done on the InfluxDB instance because that’s where the queries are being executed. An alternative to that is to use streaming with Kapacitor. And with streaming, the writes to the database are actually mirrored directly from the InfluxDB instance to Kapacitor. So this puts much less load on the instance because instead of executing queries, it can just forward those writes along to Kapacitor, and it doesn’t have to do a lot of work. The downside of that is that Kapacitor then needs to buffer all of that data in RAM, and it means that there’s an additional write load being placed on Kapacitor because now it has to get each individual piece of data into its system, and make sure that it’s in the right place, and then do computation on it, and all that stuff. So the benefit of that is that you can actually move Kapacitor onto its own instance. You can scale it separately from InfluxDB. And so that gives you a little bit more flexibility in terms of how you’re dealing with resources, and how you’re actually scaling out the architecture of your data pipeline.
Noah Crowley 00:31:28.880 So this is what a batch TICKscript looks like. So the other caveat, or the other downside with Kapacitor, is that it has its own language that you have to learn. It’s a really powerful language, but it can be a little difficult for newcomers to learn in the beginning. But basically, the idea is that in Kapacitor, you’re going to start with something, and then you’re going to add processing and computation to it. So in this case, we’re starting with a batch, and the first thing that we’re going to add to the batch—whoops. Sorry about that. The first thing that we’re going to add to the batch by piping into it is this query. So the query format’s a little bit convoluted, but you have query, open parenthesis, three quotes, and then your query itself. And in this case, the query is going to be the same as the query that we used for continuous queries earlier. We’re going to select the mean of usage_user, as usage_user, so we’re preserving those field names, which helps a lot with dashboarding and things like that, from telegraf.autogen.cpu, which is the same Telegraf database with the autogen retention policy and the cpu measurement. We’re going to give it some arguments, and basically, the way that arguments are added is by chaining them using a dot, so you have dot, period, which is five minutes, dot, every, which is five minutes. That means that we’re actually going to select five minutes’ worth of data every five minutes. So we’re going to be doing downsampling of the last five minutes of data every five minutes, very similar to the way a continuous query would’ve worked, and then we’re going to write that using the InfluxDB out node to the database telegraf, and the retention policy rollup_5m.
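The batch TICKscript being walked through looks roughly like this. It is reconstructed from the narration; the .measurement('cpu') property on influxDBOut() is an assumption added so the output lands back in the cpu measurement:

    batch
        |query('''SELECT mean("usage_user") AS "usage_user" FROM "telegraf"."autogen"."cpu"''')
            .period(5m)
            .every(5m)
        |influxDBOut()
            .database('telegraf')
            .retentionPolicy('rollup_5m')
            .measurement('cpu')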
Noah Crowley 00:33:20.939 So in functionality, the way that this works is actually really, really similar to a continuous query. You’re doing that query, you’re doing the computations, and then you’re writing it back to the database to a different retention policy. The only major difference is in this example, that’s happening on Kapacitor instead of InfluxDB. Again, the reason for doing that usually is to reduce load on your InfluxDB instance, or because you had a significant data processing pipeline that you want to do a lot more computation stuff that you might not be able to do just with InfluxDB. So that’s a very similar example of the batch TICKscript, and sort of what it looks like. Functionally equivalent to a continuous query.
Noah Crowley 00:34:09.016 The other thing that you can do is you can actually write a TICKscript that’s going to stream data from InfluxDB. So this is what the streaming TICKscript looks like. Again, you start with the type of process that it is, and you can see here at the top it’s a stream, whereas back here, it said batch. So this time, we’re going to stream. Where we’re going to stream from is going to be the measurement CPU, and we’re going to group by all. This is grouping by tags so that we can maintain the tag name the same way that we did before. We’re going to give it a window of a period of every five minutes. Sorry, a period of five minutes, and an every of five minutes. And this, again, functions the same way that the continuous queries do. It’s every five minutes, we’re going to do this computation over the last five minutes of data. So the period is how much data that we’re going to do computation on, and the every is how often we’re going to do that computation. The computation that we’re going to do in this case is calculating the mean of usage user, and then writing it back into the database as usage user. The writing back into the database happens using the InfluxDB out node. We specified the database Telegraf, and the retention policy roll-up of five minutes. So this will do, functionally, the same thing that the batch TICKscript does, and that a continuous query does, except it’ll do it in a streaming way, where it’ll actually mirror the writes from InfluxDB into Kapacitor, and then do this math on them. But ultimately, the results that you get out of it will be the same. So while this process is a little bit different, the end result should be the same because you’ll have these periods of five minutes where you’re taking the average, and those are happening every five minutes.
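Here is roughly what that stream TICKscript looks like, reconstructed from the description (the .database() on from() and the .measurement() on influxDBOut() are assumptions added for completeness):

    stream
        |from()
            .database('telegraf')
            .measurement('cpu')
            .groupBy(*)
        |window()
            .period(5m)
            .every(5m)
        |mean('usage_user')
            .as('usage_user')
        |influxDBOut()
            .database('telegraf')
            .retentionPolicy('rollup_5m')
            .measurement('cpu')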
Noah Crowley 00:36:04.265 But like we mentioned before, one of the advantages of using a TICKscript in Kapacitor like this is that it gives you the ability to start doing these rolling windows, which is something that you can’t do with continuous queries. And rolling windows are very powerful, and they’re something that people like to use for gathering data about their signals, and doing processing, and signal analysis, and stuff like that.
Noah Crowley 00:36:30.770 So this is the exact same TICKscript as before. The only thing that’s been changed here now is that the period has been set to twenty-five minutes instead of five minutes. So what exactly does that mean? So there’s the five-minute one, and now we’ve changed it to twenty-five minutes. So what that means is that every five minutes, we’re going to be looking at the last twenty-five minutes of data, and calculating the average across that, writing that to the database, waiting another five minutes, looking at the last twenty-five minutes of data, calculating the average for that, and writing it back to the database. So ultimately, what that means is that the rate of change of those averages should be a little bit slower because you are reusing some of the same data to calculate that average, and only a little bit of the data at the end has changed. That can be a really powerful technique; it’s essentially how things like moving averages, and exponential moving averages, and some more complex algorithms that you can run on time series data work. They help to smooth out what you’re seeing and give you a view into the data that is not as reactive to spikes and changes, or potentially smaller issues within the period itself. So that’s one of the advantages—excuse me—of using TICKscripts and Kapacitor: the ability to do these rolling windows and to write these larger rolling functions back into the database.
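In the stream script above, only the window node changes, to something like:

    |window()
        .period(25m)   // compute over the last twenty-five minutes of data...
        .every(5m)     // ...and recompute and emit the result every five minutes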
Noah Crowley 00:38:06.962 There are some issues with TICKscript, just as there are issues with continuous queries. It’s really important to understand your workload, and what exactly is going on inside your system, so that you can make an intelligent decision as to what you want to use. With Kapacitor, renaming things is still a problem. You still have to do this mean of usage_user as usage_user, and there is no way to wildcard that functionality. So if you want to do this downsampling across a large number of measurements in your InfluxDB database, and you want to do it with Kapacitor and TICKscripts, and you want to make sure that you maintain those metric names, then what you actually have to do is create template tasks and load the metric names into those template tasks, and calculate them that way, as shown in the sketch below. If you’re okay with writing specific dashboards for your roll-ups, then you don’t have to worry about that. But that is something to keep in mind as you’re doing downsampling, whether it’s with continuous queries or with Kapacitor.
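As a rough sketch of that workaround (not from the slides; the file name and var names are hypothetical), a Kapacitor template task declares the measurement and field as vars, and you then define one small task per metric from it:

    // downsample_template.tick (hypothetical name)
    var measurement string
    var field string

    stream
        |from()
            .database('telegraf')
            .measurement(measurement)
            .groupBy(*)
        |window()
            .period(5m)
            .every(5m)
        |mean(field)
            .as(field)
        |influxDBOut()
            .database('telegraf')
            .retentionPolicy('rollup_5m')
            .measurement(measurement)

You would register the template once with kapacitor define-template, and then create each per-measurement task with kapacitor define, supplying the measurement and field values in a vars file.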
Noah Crowley 00:39:19.402 Another issue is that you can’t do general queries across measurements the way that you can with continuous queries. You have to specify mean of usage_user, and there’s no way, again, within TICKscript and Kapacitor to make this into some kind of template or wildcard variable. So you need to specify the measurement that you want to do calculations on, and then do the calculations on that measurement. Template tasks are a good workaround for that, but that is one of the known issues that people definitely run into when doing this kind of stuff. And it gets a little bit worse when you start doing trailing roll-ups, which are sort of roll-ups of roll-ups of roll-ups, because that renaming problem comes up repeatedly.
Noah Crowley 00:40:10.667 So again, there are some issues with Kapacitor, just as there are some issues with continuous queries, and to reiterate, that’s why it’s really important that you understand exactly what workload you’re dealing with, and where the advantages outweigh the potential issues. Stream tasks currently aren’t feasible in high-throughput InfluxDB instances. This is because all write load gets mirrored onto a single Kapacitor instance, and if you have a really high-throughput InfluxDB instance, the overhead of doing that mirroring is potentially problematic. Again, there are ways around that. You can throw a load balancer into the mix, which will be able to write data to both locations at once. You could also have a more sophisticated data pipeline with Kafka, and having writes going into there, and then various places pulling them, and stuff like that. So there’s always workarounds and things that you can do, but that’s something to keep in mind is that since you’re mirroring all of that write load onto a single Kapacitor instance, you could potentially have issues with just mirroring large amounts of data. And the second big issue is that TICKscript doesn’t have a support for wildcard operations. So the only way to downsample all the data is to have individual tasks, and again, this is where the templating comes in, for measurements and fields, and have tasks for each one of those so that you can go ahead and iterate through each individual measurement and field that you need to downsample.
Noah Crowley 00:41:51.821 If you’re trying to make the decision between whether to use continuous queries or Kapacitor, here are some good guidelines for helping you make that decision. Continuous queries are built into InfluxDB. That means if you already have an InfluxDB instance deployed, and it’s not doing too much work, then you can just go ahead and start using continuous queries on that instance immediately, without any sort of infrastructure overhead or having to deploy, monitor, and configure new software. So that’s really valuable. If you’re an ace with InfluxQL, then continuous queries are potentially a win for you because you can just use those same queries directly, and everything will operate the way that you expect it to. Often, this will be good enough for simple downsampling. So if you have a few continuous queries, or a few measurements that you need to downsample, if you know what those measurements are, and you can go ahead and use SELECT ... AS to maintain the names across the various retention policies, then continuous queries are a really good way to go. They do place a little bit of extra load on InfluxDB, but if you’re doing simple downsampling and you don’t have a really heavily-loaded instance, then that should be good for you. Kapacitor, on the other hand, runs externally to InfluxDB. So it means that none of that extra load is being placed on the instance, but it means that you then have to install another piece of software, and monitor it, and make sure that it works. So trade-offs everywhere. It uses TICKscript, which has significantly more functionality than InfluxQL, but it is a new language to learn. It’s something that you can learn pretty quickly, but there are some caveats, and we definitely have some users who find it to be a little bit confusing. But again, it’s there. It has a lot of really nice, built-in functionality, and you can do a lot more stuff with TICKscript than just downsampling. So downsampling might be a good entryway into Kapacitor, and the other stuff that you can do—those more complex operations, being able to compute things in different ways, being able to write your own user-defined functions. That’s part of the TICKscript specification: you can actually call out to functions that you write in Python or another language, and use those to do your downsampling as well. So TICKscript and Kapacitor are really, really powerful in that kind of way, and can give you many more options than just working with continuous queries. The downside, aside from having to run this external application, is that it buffers streaming data in RAM. So if you’ve got a spare instance with a ton of RAM, and you can throw Kapacitor on there, then you can just mirror all those writes, send them to RAM, and you’re all good to go. But if you’re resource-constrained, then that could potentially be an issue for you.
Noah Crowley 00:45:07.554 Again, there’s no one-size-fits-all solution. Continuous queries are good for some workloads. Kapacitor is better for others. If you have specific questions about what might work for you, I highly recommend heading over to the community site, at community.influxdata.com, and giving us a little bit of information about what your workload is, what it looks like, and myself, or one of the other developer advocates, or one of the members of the community, will get back and help you walk through that process of deciding which of these is best for you.
Noah Crowley 00:45:46.308 So I have one last little bit of detail for this presentation, and that’s going through retention policies themselves, and how exactly they get set up. We use them in a lot of places, both in the TICKscripts and in the continuous queries, in terms of writing to those new retention policies. And they’re basically just the way to expire that old data. So what we did before—I’ll come back to the general syntax, actually. The first thing that we did before is we altered the autogen retention policy on our database so that it has a duration of ten days. So what we’re doing there is we’re taking the autogen retention policy, which is the default retention policy that’s created when you create a database, and we’re resetting it so that the duration, which is normally infinite, is now ten days. We’re setting the replication factor to one, etc., etc., and then we’re creating a second retention policy called rollup_5m, where the duration is infinite. So the difference here is that we changed the autogen policy to have a duration of ten days, which means that after ten days, we’re going to get rid of that data, and then here, we have the new retention policy rollup_5m, which is going to be infinite. So we’re going to keep the roll-up data around forever, but we’re going to get rid of the high-resolution data after ten days.
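Those two statements would look something like this (reconstructed; the database name telegraf is assumed from the earlier examples, and INF is the keyword for an infinite duration):

    ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 10d REPLICATION 1
    CREATE RETENTION POLICY "rollup_5m" ON "telegraf" DURATION INF REPLICATION 1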
Noah Crowley 00:47:08.922 And this is just the general format for what creating a retention policy looks like in InfluxQL. It’s create retention policy. You give it a name. You tell it which DB you want to create the retention policy on. You give it a duration. The duration is how long the data will stick around before being expired. And zero is actually the code for infinite, so if you say duration zero, it will never expire the data. If you say duration one day, it’ll get rid of it after a day. And replication is for Enterprise and clustered setups where you potentially have three, or five, or some large number of nodes, and you don’t necessarily need to persist the data across all of them. So as you’re creating a retention policy, you can also set a replication factor, and in more advanced-use cases, there are ways that you can play around with—maybe you have data that is less important, so you don’t need to replicate it as much, and maybe the high-resolution data you want replicated more, and things like that. So if you’re an Enterprise customer, definitely reach out to somebody at the company and make sure that you’re using those in the best ways. But again, really simple syntax. But all of this needs to be done sort of at the beginning of the process to make sure that you have all of the retention policies in place so that you can make those decisions, ultimately, between whether you want to do continuous queries, or TICKscripts, or Kapacitor, or anything like that.
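The general format, as described, is roughly:

    CREATE RETENTION POLICY <rp_name> ON <database>
      DURATION <duration>
      REPLICATION <n>
      [DEFAULT]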
Noah Crowley 00:48:43.237 So that’s about it for me. I think we’re running close on time, so I’ll wrap things up. My email address and Twitter are on the screen, so please, if you have any questions or comments that come up after the session, tomorrow or the next day, you can feel free to reach out to me. I also highly recommend, again, the community site, community.influxdata.com. There’s a lot of good conversation going on there, and you can see what other people are doing, what solutions they’ve come to for downsampling their data, and you can have conversations about whether that would also work for you. So with that, I think I’m done, and I’ll turn it back over to Chris to see if we have any questions.
Chris Churilo 00:49:27.295 Looks like we don’t because Noah, I think you did a very thorough job. I think all you guys have to do is just take another listen to this and you too will be able to give this training. So thank you so much, Noah.
Noah Crowley 00:49:39.707 Sure.
Chris Churilo 00:49:39.831 As I mentioned, I will post this recording later on today so you can take another listen, and the automated email will go out tomorrow. But if you’re really anxious, just click on the link that you used to register for this training, and you’ll be able to get to the on-demand video. So thank you so much, Noah, and please take Noah up on his advice to go to the community site because he does live there, and he will be able to help you with any of your projects when you’re using InfluxData. So thanks, everybody, and have a wonderful day.