InfluxDB Tech Tips – June 2016
Recorded: June 2016
In this talk, Regan Kuchan from the InfluxData Support and Services team will walk you through her InfluxDB top tech tips on how to resolve common (and not so common) issues you might encounter with InfluxDB.
Watch the Webinar
Watch the webinar “InfluxDB Tech Tips – June 2016” by clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “InfluxDB Tech Tips – June 2016.” This is provided for those who prefer to read than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Regan Kuchan: Technical Writer, InfluxData
Regan Kuchan 00:08.647 Hi. I’m Regan. I’m on the support team here at InfluxData, and I write a blog that comes out every week that covers some of the frequently asked questions that we get on the support team as well as some more general tech tips for working with InfluxDB. And today, I’m going to be going over some of the topics that I’ve covered in that blog. So group topics into three main categories. The first category is series cardinality. So how to think about series cardinality and how to calculate series cardinality are very common questions that we get. So we’ll definitely be going over that today. The second category is INTO queries. INTO queries are great for moving data around your database. They’re also good for doing things you didn’t know you could do with InfluxQL. And, finally, they have this very specific, weird behavior that hopefully you will be aware of by the end of this talk. And then the last category is just a couple tips that I couldn’t figure out how to categorize in—anyway.
Regan Kuchan 01:05.845 So let’s start with series cardinality. What is it? The basic definition that we give to people—I’ll go into more detail, but bear with me for now. But the basic definition is just the number of unique measurement and tag set combinations on your instance. That’s it. And why is it so important? Why do we keep talking about it and making sure that you understand it? It comes down to if you have unbounded series cardinality, so if the number of series on your database are going to keep increasing over time, forever and always, your RAM needs are also going to keep increasing over time, forever and always. And that can lead to bad things. So now that we know what series cardinality is and why it’s so important, let’s go into an example to kind of think about it on a deeper level. And I have a really simple data set up here. We’re not going to worry about having multiple measurements. We only have one measurement. And the one measurement is called Records. And Records has three tag keys. And those are email, first name, and last name. And then each tag key has two tag values, and those are all in the last column there.
Regan Kuchan 02:08.636 So when you first hear the definition of series cardinality, which is all combinations of measurement and tag sets in your database, it might be convincing to look at this and just say, “Okay. I’ll calculate every single possible tag set combination of email, first name, and last name.” And to do that you would just do two times two times two and that would give you a series cardinality of eight. If that made no sense, I’ve written out every single combination of the email tag, the first name tag, and the last name tag here. And if it still doesn’t make sense, you’re right. And that’s because the first name Penelope is always going to be associated with the email [email protected] The first name Barry is never actually going to be associated with the email [email protected] So when you’re thinking about the actual tag set combinations in this measurement and your actual series cardinality, you’re only going to have two combinations. So you’re only going to have a series cardinality of two because Penelope and Braithwaite are always going to be associated with the email [email protected] And then Barry and Curmudgeon are always going to be associated with [email protected] We call tags like email independent tags. Then we call tags like first name and last name dependent tags. And they’re dependent tags because they’re already scoped by another tag, including first name and last name, is not actually going to increase your series cardinality. So what should you take away from all of this? Ultimately, it’s probably going to be easiest to estimate your series cardinality by just calculating every single tag set combination per measurement and then adding that up across each measurement in the database. But just know that that process can lead to an overestimate of your series cardinality because of the presence of things like dependent tags.
Regan Kuchan 03:57.311 Yeah. As you can imagine, your scheme is probably going to get a lot more complicated than just having one measurement and then three tag keys and then two tag values per tag key. And a lot people ask us this question, which is, “Can I just clear the database to figure out what my series cardinality is?” And the answer is yes. These two queries will do just that. The first one will return series cardinality by database. And the second one will return series cardinality for every database across your entire instance. Moving on to the INTO query part of this talk. So INTO queries are just InfluxQL queries that write the results of a query to another database or measurement. And why are they so important? They’re great for things like downsampling. Downsampling is taking something like 10-second resolution data, aggregating those data to the 1-hour resolution, and then saving that 1-hour resolution data in your database. But they also provide the same functionality as nested functions and HAVING clauses. I’m not going to go into downsampling today. There is a guide in the docs that I recommend checking out if you are interested in that topic. I am going to talk about how to use INTO queries as a workaround for nested functions and HAVING clauses.
Regan Kuchan 05:13.801 So the two queries up here are invalid and InfluxDB. The first one has a nested function. A nested function is just a function within a function. And most InfluxQL functions do not actually work with nesting. The second query has a HAVING clause. And for those of you who are unfamiliar with HAVING clauses, this query will just return the mean value of Bs if that mean value was greater than two. So now I’m going to talk about how to get the same results as that invalid query by using an INTO query. It’s pretty simple. All you’re going to do is break that invalid query into two steps. And the first step is going to be the INTO query. And that’s going to calculate the mean value of Bs grouped by Hive. It’s going to write the results of that query into the measurement, new mes. And then the second query is essentially the outermost function in that invalid query. And it’s just going to calculate the sum of the data that, you just wrote to new mes, and that’s it. The workaround for getting the same HAVING clause functionality is basically the same. You’re going to break that invalid query into two queries. And the first one is the INTO query. And, again, it’s just going to calculate the mean value of Bs grouped by Hive. And it’s going to write the results of that query into the measurement, new mes. And then the second query is essentially the HAVING clause part of it. So you’re going to query the data that you just wrote to new mes and ask for only those value of mean_Bs that are greater than two.
Regan Kuchan 06:47.109 So if you’re looking at this and you’re saying, “Great. Now I have to run two queries instead of one. This sucks,” not necessarily. So you can turn the INTO queries that I was talking about into continuous queries, and InfluxDB’s continuous queries run automatically and periodically on the database. So I have the syntax up here. But, overall, this query would just run every 15 minutes on the database automatically. And then you’d be back to only having to manually execute one query to get the information that you were after. Now I’m going to talk about kind of the weird behavior that I mentioned that INTO queries have. And I’m going to do it with an example because doing it without an example is kind of hard. Let’s say you have a measurement and it’s called French Bulldogs. And that measurement has two points. And all you want to do is take the two points in that measurement and write it to a new measurement called Dogs. And that’s actually really easy to do with an INTO query. That’s what I’m doing in step two. InfluxDB says, “Okay, I wrote two points.” Great. That’s what we would expect. And then you go to query, the data in Dogs, and you only see one point. So what happened? Well, INTO queries turn tags into fields. So color, which was a tag in French Bulldogs, is now a field in Dogs. The tag color was the only thing that was differentiating the Rumpelstiltskin point from the Princess point. So when you move into Dogs, color is no longer a tag. There’s nothing differentiating Rumpelstiltskin from Princess. So InfluxDB assumes that you have a duplicate point and it just overwrites Rumpelstiltskin with Princess, if that makes sense [laughter].
Regan Kuchan 08:29.172 The fix for this is actually really straightforward. All you have to do is put any tag that you want to preserve as a tag in the GROUP BY clause. And you can see I did that in the second step here. I moved color from that select area into the GROUP BY clause area. InfluxDB, again, says it wrote two points. And now since color is preserved as a tag, you have two points in Dogs. So while it is a very weird thing to have happen, it’s pretty easy to work around. And if you know that you’re always going to want to preserve all tags as tags, just put a GROUP BY star in all of your INTO queries and you won’t have to worry about it. Okay. Now we’re in the final couple slides here. I have two really quick tips, so bear with me. The first one is just how to get human readable timestamps in InfluxDB’s CLI. The command line interface is just InfluxDB’s shell for working with the database. And by default, it returns timestamps in those rather unhelpful epoch format. And all you have to do is type precision RFC3339 and rerun the query, and you’ll have beautifully formatted timestamps.
Regan Kuchan 09:38.748 The last tip for today is what to do if you’re getting the expected identifier error unexpectedly. So say you’re going along and you write some data, insert monsters exist equals true. And then when you go to query that data, you get a weird error, which is error parsing query found exists expected identifier. Assuming there are no other problems with your query, it might be the case that one of your identifiers is an InfluxQL keyword. In this case, exists is an InfluxQL keyword. So to query it, all you have to do is double quote it. Identifiers are database names, retention policy names, measurement names, field keys, and tag keys. And then if you check out that link at the bottom, you’ll find a complete list of InfluxQL keywords. All right. So now you have two very random tips for working with InfluxDB. You know virtually everything about series cardinality. And you know quite a bit about INTO queries. Where can you go from here to learn more? Well, I highly recommend checking out the docs. In particular, there is a frequently encountered issues page in the InfluxDB docs that covers a lot of what I just talked about and more. There is also the blog that I mentioned at the start of the talk. And it comes out every Thursday morning. So if you enjoyed this talk, check it out. If you hated it, ignore it [laughter]. That’s fine. And then the last one are just some longer technical papers. And they cover things like what is time series? Why does it matter? How to migrate your data from older versions of InfluxDB, and there’s also an awesome one on performance tuning that goes over more about what I’m sure is everyone’s favorite topic at this point, series cardinality. So thanks so much for listening. That’s it. [applause]