7. Downsampling Your Data

In this session, you will learn downsampling strategies and techniques for your InfluxDB data.

Watch the Webinar

Watch the webinar “Downsampling Your Data” by clicking on the download button on the right. This will open the recording.

Transcript

Here is an unedited transcript of the webinar “Downsampling Your Data.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors.

Speakers:
• Jack Zampolin: Developer Evangelist, InfluxData

Jack Zampolin 00:00.604 Okay, so welcome to the Thursday morning webinars, everybody. Today, we’re going to be talking about downsampling your data. My name is Jack Zampolin. I’m the Developer Evangelist over here at InfluxData. So I do stuff like this. All right. So let’s go ahead and get started. So today, we’re going to talk about downsampling in InfluxDB. What composes that? There are two main parts: continuous queries, CQs, and retention policies, RPs. So during this webinar, we’re going to talk about CQs and how to create them, talk about RPs and how to create them, and then we’re going to talk about how to use them both together to downsample your data. We’re also going to talk about common issues folks run into.

Jack Zampolin 01:04.032 So what are continuous queries? Continuous queries are queries that run automatically and periodically on new data that’s coming into Influx, and they store the results of that query somewhere else in the database. Why would you use this feature? If you’re familiar with SQL, it has stored procedures. Some other databases have similar things. It’s kind of like running a query on a cron job. And the two primary use cases there are downsampling, that is, reducing the resolution of very high-resolution data. Let’s say you’ve got one-second data coming in and you only need one-minute roll-ups of that data. That’s why you would use a CQ. That’s downsampling. And then, the other use case is pre-calculation of expensive queries. So let’s say you’ve got something where you’re trying to pull the top five servers by utilization out of a list, and you’ve got to do some computation there. Maybe you’ve got to compute a utilization percentage over a whole day of rolled-up data. Those kinds of expensive queries that run on dashboards can sometimes slow things down. Using CQs, you can pre-calculate them and then just be pulling a very small amount of data.

Jack Zampolin 02:28.747 So here’s the basic continuous query syntax. You create a continuous query with a name on a database, and then we have the query that we’re going to run in there, and that query must have an INTO clause. So INTO tells the query where to put the resulting data. Otherwise, it would just sort of go away. The GROUP BY time interval is also important. That controls how often the CQ runs and over how much data. So here’s an example of that in practice.
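
For reference, that basic syntax looks like the following. This is a sketch: “cq_name”, “db_name”, and the measurement and field names are placeholders, not values from the webinar slides.

    CREATE CONTINUOUS QUERY "cq_name" ON "db_name"
    BEGIN
      SELECT mean("field") INTO "target_measurement"
      FROM "source_measurement"
      GROUP BY time(1m)
    END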

[silence]

Jack Zampolin 03:31.502 We’re using this CQ to downsample the CPU data into a roll-up interval of one minute. So here, we’ve got 10-second granularity data, standard collectd or Telegraf, and we’re downsampling that into one-minute data. Pretty easy example. So here’s what you’re going to see in the logs when that runs. You can see down there it spits out the full query, and it says it’s executing the continuous query. It gives the name as well. So as I noted before, it executes at the same interval as the GROUP BY time interval, and it executes a single query that covers the time range between now and now minus the GROUP BY time interval. So just that most recent data.
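
The query on that slide isn’t captured in the transcript, but a CQ for this 10-second-to-1-minute roll-up would look roughly like this, assuming the standard Telegraf “cpu” measurement and its “usage_idle” field:

    CREATE CONTINUOUS QUERY "cpu_1m" ON "telegraf"
    BEGIN
      SELECT mean("usage_idle") AS "usage_idle" INTO "cpu_1m"
      FROM "cpu"
      GROUP BY time(1m), *
    END

The trailing * in the GROUP BY preserves all tags in the downsampled series.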

Jack Zampolin 04:37.659 So if you’re sitting there thinking, you might say, “There’s a problem with that. I’ve got some points that might come in out of order, and if I’m just rolling up the last minute every minute, I might miss some of those out-of-order points, and they might not get included. So I would like to run this continuous query on, maybe, five minutes of data every minute, and I’m going to constantly rewrite those past minutes where I might have had missed data. And this five-minute interval is going to catch all of the data that I care about.” So in that kind of case, you’re going to need a RESAMPLE EVERY ... FOR clause in your continuous query. And you can see an example of that there. So what does RESAMPLE EVERY ... FOR do? Take RESAMPLE EVERY 2m FOR 3m: EVERY defines how often the CQ runs, and FOR defines how much data it runs over. So FOR three minutes means the query runs with a WHERE clause of time between now minus three minutes and now, and we’re still going to group that by time(1m).
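
A sketch of that RESAMPLE clause, reusing the hypothetical names from above: this CQ runs every two minutes, and each run covers the last three minutes of data in one-minute buckets.

    CREATE CONTINUOUS QUERY "cpu_1m" ON "telegraf"
    RESAMPLE EVERY 2m FOR 3m
    BEGIN
      SELECT mean("usage_idle") AS "usage_idle" INTO "cpu_1m"
      FROM "cpu"
      GROUP BY time(1m), *
    END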

Jack Zampolin 06:04.636 And, again, in your logs, you would now see the continuous query running every two minutes. Any questions? And while I’m giving this presentation, I do like to answer questions. So I’m sorry I didn’t mention this earlier, but if you do have any questions, please drop those in the chat or in the QA section, and I’ll answer them as they come in. This every and for thing gets a little confusing sometimes. It’s kind of difficult to keep them straight in your head, but EVERY is how often the CQ runs and FOR is the time range over which it will run.

Jack Zampolin 06:57.581 So before we dive into retention policies, I’m going to pause there and just take a second to wait for any questions, if anyone has any.

[silence]

Jack Zampolin 07:32.989 Adlow asks, “About CQs, how do I define a CQ for all measurements in all of my databases?” Ad, there’s no way to do that currently. There’s no way to say, “I want to apply this roll-up procedure to every piece of data that comes in.” There are some ways to apply a CQ to every measurement in an individual database, but you’ll have to create that continuous query in each of your databases. Does that answer your question?

[silence]

Jack Zampolin 08:17.502 So just to be more explicit, Ad, there’s no way to do that in InfluxDB right now. You would have to define one CQ for each of your databases. There’s a measurement backreference that you can use in your INTO clause. That gets into some more advanced syntax that I’m happy to discuss later.

Jack Zampolin 08:39.065 Paul asks, “When using such a query, what fields will be called when using mean(*)?” I don’t think that you can pass a wildcard in for any of these queries. You do need to specify the exact fields, and that’s because our fields are typed. So how those kinds of aggregate functions like “mean” handle a wildcard, I’d have to check the Influx documentation. Okay. Well, I might be off on that, but I’m fairly certain that for “mean”, it’s only going to pick up numeric fields. And for all of our aggregates, it would only pick up numeric fields. Does that answer your question, Paul?

Jack Zampolin 09:30.009 Okay. So on to retention policies. What are retention policies? They determine how long you would like to keep data in the database. So this is the syntax for creating a retention policy. You create the retention policy on a database name with a duration, and a replication factor as well. That last one isn’t germane to the open-source software; it’s a remnant from when clustering was still in the open source. And then, finally on there, there’s also shard duration. This controls the length of your shards. So a little bit more information on that. Data in InfluxDB is sharded on disk by time. So the files themselves each cover a window of time, a week or however long. The default shard duration when you create a new database is one week, so that each week, there’s a shard on disk that represents the data for that week for that database and retention policy combination. Those are your durations. And then, obviously, replication factor just has to do with clustering. And shard duration, I spoke about briefly, but there is much more in the documentation.
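
The statement itself has this shape; the policy name, database name, and durations below are placeholders:

    CREATE RETENTION POLICY "one_week" ON "db_name"
      DURATION 1w REPLICATION 1 SHARD DURATION 1h DEFAULT

SHARD DURATION and DEFAULT are optional; DEFAULT makes this the retention policy that writes land in when none is specified.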

Jack Zampolin 11:07.936 So this is how you would define a retention policy. And if you’ll notice there, you see the shard group duration. Shard group duration is also dependent on the duration of the retention policy. If you’ll notice, a one-day retention policy (meaning any data that’s older than a day will get dropped) only has a shard group duration of one hour. So shorter retention policy durations mean shorter shard group durations, but you can also explicitly set the shard group duration.

[silence]

Jack Zampolin 11:58.551 So Cliff asks, “I’m currently running on a single-host instance, but I’m about to migrate to a two-instance cluster. Should I update my retention policy’s shard setting?” There’s nothing about that migration that calls for updating your retention policy shard settings. Yeah. Okay.

Jack Zampolin 12:27.935 Adlow asks how to use downsampled data with Grafana: a typical example would be having CPU data where you downsample data older than seven days, but you want to show one graph with both current and older data. Adlow, that’s a current technical limitation of the database and not something we can do. It’s a major feature that we’re working on in our roadmap. I wouldn’t anticipate that feature before the end of the year, but it should be sometime within the next year, I would imagine.

Jack Zampolin 13:03.972 Paul also asks, “Automatically downsample database with backreferencing?” Yeah. So that’s any numeric fields, as I said earlier. If they’re float or integer fields, they will be downsampled using that function. And I believe what it does is create a new field named function_fieldname, for example mean_usage_idle, in your new measurement. So that’s any numeric fields. String fields will be dropped. Does that make sense, Paul? Awesome.

Jack Zampolin 13:53.044 Okay. So this is an ALTER RETENTION POLICY statement. If you have an existing retention policy and you’d like to change settings on it, this is how you would do it. If you were running in a cluster, you would need to rebalance the cluster after that, so just something to keep in mind. Here’s how to run that.
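
A sketch of that statement, with placeholder names and durations:

    ALTER RETENTION POLICY "one_week" ON "db_name"
      DURATION 2w SHARD DURATION 2h DEFAULT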

[silence]

Jack Zampolin 14:46.052 Adlow asks, “How can you measure the size of a database based on a date? I would like to free up some space before applying a retention policy, but I’d like to know how much space would be freed.” There’s no native way to pull that out of the database. However, your shards on disk represent discrete units of time. And what the retention policies are going to do under the hood is drop those shards. So if you run a SHOW SHARDS command on your database, you can see what span of time each of those shards represents. Shard IDs are unique to the instance, so if you’ve got a couple of databases and retention policies, they’ll be skipping numbers, but they’ll be monotonically increasing. So the oldest shards are the lowest-numbered shards and the newest shards are the highest-numbered shards. When you apply that retention policy, you’ll drop a certain number of those shards off disk based on how far back you want to go. I would do that calculation based on the size of the shard files on disk, and those are in the data directory for InfluxDB, which lives at /var/lib/influxdb on Linux installations or in ~/.influxdb on a Mac. Does that make sense, Adlow? Awesome.
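
The command itself, for anyone following along:

    -- Lists every shard with its database, retention policy, and start/end times
    SHOW SHARDS

Each shard’s files live under the data directory at data/<database>/<retention_policy>/<shard_id>, so sizing those directories gives you the space estimate.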

Jack Zampolin 16:26.967 Okay, so how would you combine continuous queries and retention policies to downsample your data? So we’re looking to downsample 10-second resolution Telegraf data to 5-minute resolution data. So we’re going to take 30 points and compress them down into 1. We want to store the 10-second resolution data for one week, and then we want to store the 5-minute resolution data for four weeks. And in order to follow along with this tutorial, you’re going to need a working InfluxDB instance. So we would first create the database. If you have Telegraf running against your InfluxDB instance already, you will have this telegraf database. Another way to do this is to just start Telegraf. We’re then going to create another retention policy on telegraf for that one-week data. And when you run that, you’ll see the retention policy’s been created. If you’ll notice, the shard group duration is one hour as well. We’re also going to want to create that four-week retention policy. Notice that we’ve made the one-week retention policy our default. So any data that’s written into telegraf without specifying a retention policy will automatically go into that one-week retention policy.
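
Those setup steps would look roughly like this; “one_week” and “four_weeks” are assumed names for the two policies:

    CREATE DATABASE "telegraf"
    CREATE RETENTION POLICY "one_week" ON "telegraf" DURATION 1w REPLICATION 1 DEFAULT
    CREATE RETENTION POLICY "four_weeks" ON "telegraf" DURATION 4w REPLICATION 1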

[silence]

Jack Zampolin 18:14.639 Then we’re going to create our continuous query. Notice we’re going to be reading from the one-week retention policy on telegraf, our default, and writing into our four-week retention policy for longer-term retention.
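
A sketch of that CQ, assuming the retention policy names above and a “cpu_average” target measurement for the roll-ups:

    CREATE CONTINUOUS QUERY "cpu_5m" ON "telegraf"
    BEGIN
      SELECT mean("usage_idle") AS "usage_idle"
      INTO "telegraf"."four_weeks"."cpu_average"
      FROM "telegraf"."one_week"."cpu"
      GROUP BY time(5m), *
    END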

[silence]

Jack Zampolin 18:41.944 Now, in this case, once we’ve set up our retention policies, we’re going to install Telegraf and start the service. So we’re going to wait about five minutes for our data. And once that five minutes is up, you should have seen the continuous query run in your logs, and you should see some data in each one of those measurements. So if you look here, querying from the one-week CPU measurement, we get a certain set of data there. And then if we query from the four-week CPU average measurement, we will see the data below, and those are five-minute roll-ups, if you’ll notice by the timestamps. And I’m going to pause there for any questions about downsampling in InfluxDB.
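
The verification queries would be along these lines, with the same assumed names:

    SELECT * FROM "telegraf"."one_week"."cpu" LIMIT 5
    SELECT * FROM "telegraf"."four_weeks"."cpu_average" LIMIT 5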

[silence]

Jack Zampolin 20:42.821 Paul asks, “How does this work at scale? It seems normal to me to potentially want to downsample all data received, but most of the examples typically talk about dealing with a single field.” Well, Paul, at scale, I’ve seen people attack this a number of different ways in InfluxDB installations. So you and Adlow were asking questions about using the splat operator and the measurement backreference. We added those features to continuous queries to make it easier to do that use case that you’re describing right there. So between the splat operator and the measurement backreference, it’s pretty easy to write a CQ that downsamples all measurements in the database. Now, obviously, when we’re talking about string fields and such, things get a little stickier. It really depends on what your requirements are and exactly what you need to do. One thing that’s nice about InfluxDB is it is quite performant. And most dashboard-type queries, if you’re running over the last couple hours of data at most, are going to return within acceptable latencies without downsampling. So it really does depend on exactly what you’re trying to do and exactly what your system needs to accomplish. Does that make sense?

[silence]

Jack Zampolin 22:22.671 Paul, yes, it is. And while we don’t currently have histogram in the query language, most larger CQs I’ve seen like that are max, min, mean, and then a couple of percentiles: p90, p99, p50, stuff like that. So there is definitely a way to do that. Any more questions?

[silence]

Jack Zampolin 23:11.330 Adlow asks, “For me, the use case is I’m an admin of Influx and I have several users, each one with their own characteristics. But I want to limit the amount of data they store, so I need a general way to achieve that.” Yeah. So with that, retention policies, obviously, and downsampling in general, would be a good way to do that. If you have a manageable number of users, depending on how many several is, writing custom CQs for them, maybe one level of downsampling, wouldn’t be a whole ton of overhead. Is that the type of use case we’re talking about, not too many users? And does the splat operator and the measurement backreference work for you, Adlow?

[silence]

Jack Zampolin 24:21.023 Okay. Well, as we’ve talked about a couple of times, you can’t downsample string fields. So if the measurement is all numeric fields, you can use the splat operator like this. Here, let me write this query out.

[silence]
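
The query written out during the pause isn’t in the transcript, but combining the splat operator with the measurement backreference looks roughly like this, carrying over the database and retention policy names from the tutorial:

    CREATE CONTINUOUS QUERY "downsample_all" ON "telegraf"
    BEGIN
      SELECT mean(*) INTO "telegraf"."four_weeks".:MEASUREMENT
      FROM /.*/
      GROUP BY time(5m), *
    END

FROM /.*/ matches every measurement in the database, and :MEASUREMENT writes each one into a measurement of the same name under the target retention policy.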

Jack Zampolin 25:37.362 And Adlow, that is the splat operator being used with the measurement backreference there, and that’ll do a whole database, but just the numeric fields. Good deal.

Jack Zampolin 25:53.589 Okay. So some common issues. One is working with historical data. If you have a bunch of data that you haven’t downsampled and you’re wanting to start a CQ on the new data, how do you deal with the old data? The best way to do that is with INTO queries. So you would just write the exact same query that you were going to use with your CQ, but do a one-off for a specific time range. And this is an example of that.
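
A sketch of such a one-off INTO query, with an illustrative time range; an explicit WHERE time clause is required when an aggregate query uses GROUP BY time():

    SELECT mean("usage_idle") AS "usage_idle"
    INTO "telegraf"."four_weeks"."cpu_average"
    FROM "telegraf"."one_week"."cpu"
    WHERE time >= '2017-01-01T00:00:00Z' AND time < '2017-02-01T00:00:00Z'
    GROUP BY time(5m), *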

Jack Zampolin 26:34.501 So Cliff asks, “In my earlier question, I incorrectly asked about the shard setting. It should have said replication level. I’m currently running on a single-host instance, but I’m about to migrate to a two-instance cluster. Should I update my retention policy’s replication setting?” Yes, and I would work with cloud support to get the instance rebalanced after that. Off the top of my head, I’m not quite sure what the exact procedure is, so just send in a ticket to support@influxdb.com. But yes, I would imagine you are going to need to update your retention policy setting. And I think that cloud support will kick off that rebalance for you. So I hope that answers your question. No problem.

Jack Zampolin 27:32.989 Another common issue that some people run into is missing data in CQ results. As we were talking about earlier, string fields will generally be dropped. But in this case, this INTO query will pull that data in. So if you’re using that measurement backreference and the splat operator to downsample your fields wholesale, you could write another CQ to separately downsample those string fields. And this would be an example of that.
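
A sketch of such a string-field CQ: aggregate functions like mean() won’t accept strings, but selector functions like last() work on all field types, so they can carry string fields forward. The measurement and field names here are hypothetical:

    CREATE CONTINUOUS QUERY "downsample_status" ON "telegraf"
    BEGIN
      SELECT last("status") AS "status"
      INTO "telegraf"."four_weeks"."service_status"
      FROM "telegraf"."one_week"."service_status"
      GROUP BY time(5m), *
    END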

Jack Zampolin 28:23.195 Alexander Kruger asks, “Is it on your future agenda to make an easier automatic downsampling like in Graphite?” Alexander, as I’ve said a couple of times, yes, it is. It’s absolutely on our roadmap. It’s definitely a major feature that we know we need and we’re planning to work on. I wouldn’t anticipate it this year, but sometime early next year, I would imagine. That’s when we’re going to get to it. It’s a very, very complicated feature. There are a number of edge cases. It requires us to really dig into the storage engine at a very deep level and design something that works for a lot of different use cases. So while yes, it is in the plan, it’s a little further down the road. And one more thing that I will say about that is Graphite and most other time series databases are just designed to store metrics. They will only store floats and integers. We allow a more diverse data model, and one of the downsides of allowing a more diverse data model is that it does make things like downsampling harder. And the system you see now is the result of that. It works excellently for integer and float fields, and you can write a query that will effectively downsample an entire database for integer and float fields, but not for string fields. And that’s one of the major edge cases. Also, with large numbers of fields, it’s difficult to understand what’s happening there. Maybe you want to apply different downsampling techniques to different fields within the same measurement. Then you’ve got to write a custom CQ. So we’re working on solving those problems, and that is definitely on our roadmap.

Jack Zampolin 30:18.088 Paul asks: “At what point does one move away from CQ to Kapacitor?” And there’s a very easy answer to that, Paul. It’s as soon as you become CPU-bound on your host. The first thing I would eliminate is continuous queries and offload that to Kapacitor. Continuous queries are very CPU-heavy. Does that answer your question there? Excellent.

Jack Zampolin 30:51.311 Okay. This is an example of missing some tag data in a CQ. If you’re missing tag data, grouping by that tag will preserve it. Otherwise, you’ll just have it ungrouped. So another issue is configuring the schedule of the CQ: grouping things properly into buckets and making sure that works.
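
For example, with the assumed Telegraf CPU data from earlier, naming the “host” tag in the GROUP BY keeps per-host series in the roll-up; omit it and the results are aggregated across all hosts:

    CREATE CONTINUOUS QUERY "cpu_5m_by_host" ON "telegraf"
    BEGIN
      SELECT mean("usage_idle") AS "usage_idle"
      INTO "telegraf"."four_weeks"."cpu_average"
      FROM "telegraf"."one_week"."cpu"
      GROUP BY time(5m), "host"
    END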

[silence]

Jack Zampolin 32:11.214 So, some common issues that you’re going to run into when writing to retention policies. It’s very easy to create a retention policy and not write into it. That DEFAULT keyword is a little bit tricky, and making sure that you’re specifying your database and retention policy combo properly is very important. So there’s an example of using the HTTP API to specify a retention policy. And then we’ve got a couple of different CLI options. If you’re using retention policies, make sure to specify them, and my advice would be to always use the full database.retentionpolicy.measurement syntax when specifying any measurement identifier.
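
As a sketch of those options: the HTTP /write endpoint takes db and rp query parameters (for example, /write?db=telegraf&rp=four_weeks), and the influx CLI supports INSERT INTO for writing to a specific retention policy. The point and names below are illustrative:

    -- Write a point into the four_weeks retention policy from the influx CLI
    INSERT INTO four_weeks cpu_average,host=server01 usage_idle=92.5

    -- Query using the fully qualified database.retentionpolicy.measurement form
    SELECT * FROM "telegraf"."four_weeks"."cpu_average"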

[silence]

Jack Zampolin 33:24.702 And then, again, when writing to retention policies, make sure that you select the proper retention policy that you want to use. And then, the final thing here: Paul asked a question earlier about moving away from CQs to Kapacitor. So how exactly does that work? This allows you to offload the computation that’s involved in a CQ to a separate host. This is what you would implement if CQs are using too many resources. Kapacitor essentially works in one of two ways, either batch or stream. So in the batch case, it works exactly like a CQ. It will query the data from Influx, and then write it back into a different part of the database. In the stream case, any data that’s coming into the database, InfluxDB will stream over to Kapacitor, and then Kapacitor will window and aggregate that data and write it back into the database. So the batch case gives you a lot more control, like EVERY and FOR, and the stream case is kind of a standard continuous query without the special syntax, and that data just gets written right back into the database. And this is used in many, many installations. At scale, this is probably the best way to do this.
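
A rough sketch of the batch-task version in Kapacitor’s TICKscript, mirroring the tutorial CQ; the names carry over from the earlier examples, and this is illustrative rather than a drop-in task (the task’s database and retention policy also need to be specified when you define it):

    batch
        |query('SELECT mean("usage_idle") AS usage_idle FROM "telegraf"."one_week"."cpu"')
            .period(5m)
            .every(5m)
            .groupBy(time(5m), 'host')
        |influxDBOut()
            .database('telegraf')
            .retentionPolicy('four_weeks')
            .measurement('cpu_average')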

Jack Zampolin 34:57.666 Okay. And we do have some further reading and additional resources and I’ll be sticking around for a little while to take questions. That is the end of our webinar today.
