In this session you will learn how to tune your queries for performance plus strategies for effective schema design.
Watch the webinar “Optimizing Your TICK Stack” by clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Optimizing Your TICK Stack,” provided for those who prefer to read rather than watch. Please note that the transcript is raw. We apologize for any transcription errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Jack Zampolin: Developer Evangelist, InfluxData
Jack Zampolin 00:01.947 Thank you very much, Chris. Today, I’m going to be talking about optimizing the TICK Stack. So that’s going to be schema design, hardware, and a couple of other various, miscellaneous things. As Chris said, my name’s Jack Zampolin. I’m the Developer Evangelist over here at InfluxData. And if you have any questions during the webinar, please feel free to drop them in the Q&A or in the chat section. And I’ll answer those as they come in. So please feel free to do that. Anyway, I’m going to go ahead and get started.
Jack Zampolin 00:39.861 So what are we going to cover today? We’re going to briefly go through the data model. We’re going to understand tradeoffs involved in schema design. We’re going to figure out some best practices there. We’re going to talk through hardware requirements. And we’re going to talk about performance, so tonight is very heavy on performance tradeoffs and schema design. It’s important to remember about the InfluxDB data model: tags are indexed, fields are not indexed, and all points are indexed by time. So, if we back up a little bit, each instance of InfluxDB—if you spin up the open source version—is going to have databases inside it. Each of those databases has many Retention Policies. Each of those Retention Policies has measurements, which are essentially tables in SQL.
Jack Zampolin 01:41.957 And within those tables, the tags, which are indexed, are kind of like indexed columns. And the fields, which are not indexed, are just like normal columns with data in them. And everything is indexed by time. So as we go forward, please be sure to remember that. So, when we’re talking about schema design, there are a few no-nos. We’re going to go through the no-nos, and then we’re going to talk about what to do instead.
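To make that hierarchy concrete, here is what a single point looks like in InfluxDB line protocol. The measurement, tag, and field names below are invented for illustration; the shape (measurement, comma-separated tag set, field set, nanosecond timestamp) is the part that matters:

```
# measurement,tag_set            field_set                    timestamp (ns)
cpu,host=server-5,region=us-west usage_idle=87.2,usage_user=8.1 1495462140000000000
```

The tags (host, region) are indexed and cheap to filter and group by; the fields (usage_idle, usage_user) are not indexed; and every point is indexed by its timestamp.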
Jack Zampolin 04:35.058 So, if we look here, many of you might be familiar with this format: cpu.server-5.us-west, and then a single value field and a timestamp. This is essentially Graphite, and it encodes a lot of information in the measurement name, which makes it very hard to query. So, if you wanted to pull up just the statistics for server-5, you’d have to write a regular expression and then pull that data out separately, and the same to pull all your servers from us-west. Those queries are non-performant, and this leads the database to store the data in a non-performant way. Please don’t write Graphite directly into the database with no tagging at all.
Jack Zampolin 05:27.273 What we would like you to do is encode that information as tags instead. And if you look down, you’ll see the same points in tag format. There are many collectors out there—Sensu, collectd, Telegraf, Graphite, and a number of others, the Diamond collector being one I can think of. So what if I have something that sends data like this? Well, there are a couple of tools we offer that can sit between your data source and InfluxDB, run the data through a parser, and add those tags based on these period-delimited strings. This Graphite template parsing language is available as a service in the core database itself, InfluxDB, as well as a parser plugin in Telegraf. And if you see here, the first part of the template, sensu.metric.*—that’s the filter. So I’m saying: for points that match this filter, starting with sensu.metric, this is how I need to parse them. Each period-separated position in the template maps to a position in your string there. So we’re going to ignore sensu.metric. We’re going to call the measurement net. This position here is the server tag. This is an interface tag, and then these are the field names. So you get points that look like this down there. Any questions about that?
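As a rough sketch of what that template mapping does, here is an illustrative parser in Python. This is not Telegraf’s actual implementation, and the metric string and template are hypothetical; it just shows how each period-separated template position maps onto the metric:

```python
def parse_graphite(metric: str, template: str):
    """Map a period-delimited Graphite metric onto a measurement,
    tags, and a field name using a Graphite-style template.
    Empty template positions are skipped."""
    parts = metric.split(".")
    spec = template.split(".")
    measurement, field, tags = None, "value", {}
    for value, name in zip(parts, spec):
        if name == "measurement":
            measurement = value
        elif name == "field":
            field = value
        elif name:                    # any other name becomes a tag key
            tags[name] = value
    return measurement, tags, field

# The two leading empty positions skip "sensu.metric":
print(parse_graphite("sensu.metric.net.server-5.eth0.rx_bytes",
                     "..measurement.host.interface.field"))
# → ('net', {'host': 'server-5', 'interface': 'eth0'}, 'rx_bytes')
```

Telegraf’s graphite parser follows the same positional idea, with extra features (wildcards, default tags) described in its documentation.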
Jack Zampolin 07:23.463 Cool. Another bad thing to do is overload tags. This Graphite-style schema here—hostname.datacenter, all crammed into a single server tag—makes the data more difficult to query and doesn’t gain you anything. It’s also nonintuitive. So separate that data out into different tags; then you can break it down by region and by host. Also, if somebody tries to filter by a hostname and wonders why nothing gets returned, it’s because there is .us-west on the end of it. Another bad idea: don’t use tags that have high variability—UUIDs, hashes, or random strings.
Jack Zampolin 08:18.807 Now, the important caveat to remember here is that many use cases will have a UUID that you need to filter by. That’s totally fine. Like sensor ID or host ID—you’ll have a number of those, but it’s a fixed number. The number of hosts you have is based on how large your installation is, and that’s nothing to worry about. But if you try to add a row ID to InfluxDB—I’ve seen folks do this. They’re familiar with SQL databases, they’re wondering where the primary key is, and they try to add a tag that will act as a primary key. That’s a very bad idea. It creates a ton of tag values and takes up a lot of space in the in-memory index, leading to your database becoming highly unperformant pretty quickly. So, if you do have these requirements—let’s say session ID, request ID, those kinds of things—you can consider vertically sharding across instances, or using tag prefix groupings. Maybe you take the first part of your UUID and store that as a tag. There are fewer of those, and you can filter partially by that. Or move data from tags to fields. In the example above, moving request ID to a field might help, so you’re not tagging every unique request, just each session. You’ve got fewer sessions in a day, so that’s not as much of an issue. Another thing you can do to support more series cardinality is to use a cluster. This allows you to horizontally shard series across multiple instances.
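The prefix-grouping idea can be pictured in a few lines. In this sketch the full ID goes into a field (unindexed, cheap), and only a short prefix becomes a tag, so the tag has a small, bounded value set. The function name and point layout are hypothetical:

```python
def prefix_bucket(request_id: str, length: int = 2) -> dict:
    """Keep the high-cardinality ID as a field; tag only a short
    prefix. Two hex characters give at most 256 tag values, no
    matter how many distinct IDs you write."""
    return {
        "tags": {"id_prefix": request_id[:length]},
        "fields": {"request_id": request_id},
    }

point = prefix_bucket("deadbeef-0000-4000-8000-000000000000")
print(point["tags"])    # → {'id_prefix': 'de'}
```

You can then filter with a cheap indexed lookup on `id_prefix` and narrow down to the exact `request_id` with a field comparison.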
Jack Zampolin 10:18.783 Tags that are independent have a cost associated with them. So what does that mean? If you look at week, weekday, hour, and minute, those are all derived from the timestamp, so whichever one of those has the highest number of values determines the cardinality for that whole group. But host over there is completely independent of those time-related tags. So, for every separate host, that’s going to create a unique series for every combination of those time-related tags. That leads to very high cardinality, because the number of series created is the product of the number of tag values in each of those independent tags. So independent tags do have the cost of creating a lot of series in the database—something to be aware of. And, again, columns can be stored as fields instead, and you can do WHERE lookups on those. The only thing you’re not going to be able to do is GROUP BY. But in this example, grouping by hour or week is something you can accomplish with the time basis of the database itself.
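The cardinality arithmetic for independent tags is just a product. With made-up value counts for the tags mentioned above:

```python
from math import prod

# Series cardinality is the product of the number of distinct
# values of each independent tag (hypothetical counts).
tag_values = {"host": 100, "weekday": 7, "hour": 24}

print(prod(tag_values.values()))  # → 16800 unique series
```

Tagging by host alone would give 100 series; adding the two independent time-derived tags multiplies that by 7 × 24, which is why time-derived tags are usually better handled by the database’s own time index.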
Jack Zampolin 11:50.823 And, again, the converse of this is that dependent tags have minimal cost. The best example I can think of is location. If you have a sensor ID that has a location associated with it, you can have a lot of tags describing that location without adding any additional cardinality—provided the sensor lives in one place. Let’s say it’s monitoring temperature on the outside of a building. It’s got a zip code. It’s got a state. It’s got a country code. You can add all of those additional tags without adding any additional cardinality, because sensor ID is the tag with the most values, and all the rest of them are dependent on that sensor ID. In this example, we’re adding first name as a tag for email, and because each one of those first names maps to exactly one email, you’re not adding any additional series. You’re just adding more descriptiveness and more ability to query your data.
Jack Zampolin 13:11.293 Using the same name for a tag and a field is bad practice. It forces you to use special query syntax to retrieve the data, and it’s very confusing for users—even as somebody who has accidentally designed systems where I did this [laughter], it’s confusing for anyone querying the data. So just try not to do that. Differentiate those names somehow. Maybe our user tag is user type and our user field is user ID. Something to think about there.
Jack Zampolin 13:57.183 Using too few tags is a very bad idea. If we’re only tagging our data with the region, here’s what you’re probably going to run into. Fields are not indexed, so if you’re trying to filter by host in a query, you’re going to have to scan all of that data. It’s very expensive, and your queries are going to be slower—field WHERE lookups versus tag WHERE lookups are generally about an order of magnitude slower. You can’t GROUP BY fields. And remember, a series is defined by the measurement and the unique tag set: if points in the same series have the same timestamp, the system will store the union of the field sets, so last write wins. You’re going to end up overwriting a lot of your data if you don’t have enough tags and you’re not being descriptive enough about your data.
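That “union of the field sets, last write wins” behavior can be mimicked in a tiny sketch. The key structure here is a simplification for illustration, not InfluxDB’s actual storage format:

```python
# A point is identified by (measurement, tag set, timestamp).
# Writing the same key again merges field sets; later values win.
store = {}

def write(measurement, tags, fields, ts):
    key = (measurement, frozenset(tags.items()), ts)
    store.setdefault(key, {}).update(fields)   # union, last write wins

# Two "different" points that collapse into one series + timestamp:
write("cpu", {"region": "us-west"}, {"usage": 40.0}, 1000)
write("cpu", {"region": "us-west"}, {"usage": 90.0, "temp": 61.0}, 1000)

key = ("cpu", frozenset({"region": "us-west"}.items()), 1000)
print(store[key])   # → {'usage': 90.0, 'temp': 61.0}; the first usage is gone
```

Had the second point carried a distinguishing tag such as host, the two writes would have landed in separate series and nothing would have been overwritten.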
Jack Zampolin 16:13.009 Don’t create too many logical containers. Don’t write to too many databases or Retention Policies. If you’re thinking you’re going to have hundreds or thousands of databases, there’s probably a better way to design that system. I’ve seen folks split each customer into a separate database and have up to hundreds or thousands of databases. When you get into the thousands of databases, there are generally some performance issues. You start bumping into open file handle limits because of the way we store data on disk, and there’s a lot more overhead associated with handling all of those files and answering queries—more RAM, additional CPU. It’s also difficult to join that data in the database itself. You can use regular expressions to search across multiple measurements within an individual database, but if data is spread across databases, there’s no way—with the query language natively—to manipulate that data together. You would have to do it client-side or with a tool like Kapacitor. It’s just not good practice. Anyway, don’t have too many databases.
Jack Zampolin 17:44.416 So that’s a lot of don’t do this, don’t do that. What should I do as a conscientious database designer? What I always like to do is start by thinking: what kind of queries do I want to run? By knowing which questions you need to answer from your data, you can write a performant, working schema for your use case. So, if I want to run this query here, what can we deduce about our schema? If we want to count anything or average it—any arithmetical operations—that’s going to need to be a field. So we know that alice here is going to be a field. Tags are stored in the database as strings, so there’s no type information associated with them. Fields can be bools, strings, floats, or int64s. So those kinds of arithmetic operations need to be run on fields.
Jack Zampolin 19:03.114 Timestamp collision workarounds—will this be fixed in the future? I don’t know what you mean by “will this be fixed.” But the best way to work around timestamp collisions is to write the data at a higher precision so that you’re not having those collisions, or to separate your data into more series. It’s not generally a problem I see folks run into if you’re using Telegraf or any of the other standard collectors; they do a very good job of avoiding timestamp collisions. Just be sure that you’re properly describing your data, giving it enough tags to separate it into enough logical containers, and make sure you’re writing it with high enough precision that you’re not going to run into timestamp collisions. Does that make sense?
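A quick illustration of why write precision matters: two timestamps that are distinct at nanosecond precision collide once they are truncated to seconds. The timestamp values are invented:

```python
NS_PER_SEC = 1_000_000_000

t1 = 1_495_462_140_000_000_001   # two writes within the same second,
t2 = 1_495_462_140_000_000_002   # distinct at nanosecond precision

# At nanosecond precision the points stay separate:
print(t1 == t2)                               # → False

# Truncated to second precision they collide, and the second write
# would overwrite the first point's matching fields:
print(t1 // NS_PER_SEC == t2 // NS_PER_SEC)   # → True
```

So writing with second precision when events arrive faster than once a second silently merges points; nanosecond precision (or extra distinguishing tags) keeps them apart.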
Jack Zampolin 20:09.844 I’m going to assume that it does. If you’ve got any more questions, Sartorius, just toss them in the chat again. “Yes, but with nanosecond precision, we still have collisions.” I would say add more tags to your data. I’d be interested to see where you’re getting collisions with nanosecond precision. If you go to community.influxdata.com and share some more details about that, I’d be happy to answer your question. Thank you, Sartorius. Awesome. So another thing to think about when you’re designing a schema: only fields can store numbers. I’ve chatted about this a few times. Anything numeric that you want to run comparisons on, you’re going to want to store as a field.
Jack Zampolin 21:02.622 The GROUP BY clause cannot accept fields. This is important to understand about schema design in Influx; it’s just a fundamental limitation of the database. We can’t GROUP BY fields. We can only GROUP BY tags. So anything that’s in a GROUP BY clause, we’re going to want to be a tag. Hostname is a common one, sensor ID, stuff like that. GROUP BY tag.
Jack Zampolin 21:33.492 And some general things to keep in mind, as we said just a second ago: anything in a GROUP BY clause must be a tag. Anything that you want to pass into a function must be a field. Anything that uses comparison operators can be a tag or a field. Obviously, greater than or less than needs to be a field, because that’s a numeric operator. But if it’s equal to or not equal to, that can be a tag. If you’re going to use a regular expression to filter—let’s say you have information encoded in your hostnames, which is very common in DevOps, and you want to use regular expressions to match a list of hosts—that’s going to need to be a tag. And then, again, if you would lose information by storing it as a string, use a field. Any questions before I move on to the exercises?
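Those rules can all be seen in one hypothetical InfluxQL query (the measurement, tag, and field names are invented for illustration):

```sql
-- mean() takes a field          -> usage_idle must be a field
-- GROUP BY takes tags           -> host must be a tag
-- equality works on either      -> region as a tag gives an indexed lookup
-- regex matching on hostnames   -> host must be a tag
SELECT mean("usage_idle")
FROM "cpu"
WHERE "region" = 'us-west' AND "host" =~ /^web-/
GROUP BY time(10m), "host"
```

Working backwards from a query like this is exactly the design process described above: each clause tells you whether a column needs to be a tag or a field.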
Jack Zampolin 23:01.362 Okay. We’re going to go ahead and move through the exercises now. So this exercise is going to be a schema design exercise. We’ve got a number of sensors, about 10,000 of them. So we’ve been running a very successful business here. We’re measuring the air quality at a bunch of different points all throughout San Francisco. And these sensors are emitting air quality data every 10 seconds. So when those sensors emit data, this is the data they emit. So there’s some location data, zipcode, latitude, longitude, city name, and device_id, and then there’s some pollution data. So smog level, CO2 parts per million, atmospheric lead, sulfur dioxide. So as an exercise, why would it be a bad idea to make latitude or longitude a tag instead of a field? And if you have any ideas on this, just drop it in the chat. I’m going to mute my mic here for just a second and let you guys think about this question.
Jack Zampolin 24:45.883 And Paul’s got it right. Too many possible values is going to lead to some high cardinality there. As you know, latitude and longitude are generally high-precision floats, and those aren’t known for low cardinality. So do try to avoid using lat or long as a tag instead of a field. One thing that I’ve found very helpful with geographic use cases like this is geo-hashing: using just a very short prefix of the geo-hash as a tag to group things into areas you can query over, and then storing the full-precision geo-hash, or the latitude and longitude, as a field to pull out the exact location data. That works pretty well for location-based data.
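The geo-hash trick can be sketched as follows. This is a standard geohash encoder, nothing InfluxDB-specific: nearby coordinates share a common prefix, so tagging points with a short prefix (say the first 5 characters) buckets nearby sensors together while the full-precision lat/long stays in fields:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lon: float, precision: int = 5) -> str:
    """Standard geohash: interleave longitude/latitude bisection
    bits and pack them 5 at a time into base32 characters."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, even, out = 0, 0, True, []
    while len(out) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits = bits * 2 + 1
            rng[0] = mid
        else:
            bits = bits * 2
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            out.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(out)

# A sensor near downtown San Francisco:
print(geohash(37.7749, -122.4194))   # → '9q8yy' -- short prefix as a tag
```

A 5-character geohash covers a cell roughly 5 km across, so 10,000 sensors in one city collapse into a small, bounded set of tag values.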
Jack Zampolin 25:41.154 So, as we just talked about, they’re both going to have high cardinality and could result in a large number of series. Why would it be a good idea to make them tags instead of fields? Fast lookups on latitude and longitude, and grouping by latitude and longitude. Maybe we’re not storing super-high-precision latitudes and longitudes—we’re just storing one decimal place, so relatively large areas—and we can do WHERE searches by those. If our devices don’t move, latitude and longitude might be dependent on device_id, and storing them won’t increase series cardinality. So those are considerations to think about. As we’re designing this schema, the following queries are important. We’re going to call our measurement pollutants. We’ve got a couple of things we need to GROUP BY, and we’ve got a couple of things that we need as fields, because we need to aggregate or otherwise perform arithmetic operations on them.
Jack Zampolin 27:04.867 So what’s the best way to organize our data to support the queries we want? And if you think about this, we’re basically thinking about FROM these queries here, what are we going to want as tags, and what are we going to want as fields? So I’m going to give you guys a second to think through this, and then we’re going to talk through a couple of different possible schemas.
Jack Zampolin 27:32.806 Okay. So let’s go over the schema for this problem here. How are we going to organize our data to support the queries we want? Well, here’s one potential schema. We’ve got our measurement, pollutants. We’ve got tags: city, device_id, and zipcode. And then we’ve got a bunch of fields: lat, long, smog level, CO2 parts per million, lead, and sulfur dioxide level. And here are a couple of example points in line protocol.
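Since the slide itself isn’t visible in a transcript, points for that first schema might look roughly like this in line protocol. The field values and timestamps are invented:

```
pollutants,city=sf,device_id=12,zipcode=94107 lat=37.77,long=-122.42,co_ppm=410.5,smog_level=2.3,lead=0.01,so2_level=0.4 1495462140000000000
pollutants,city=sf,device_id=13,zipcode=94103 lat=37.78,long=-122.41,co_ppm=398.2,smog_level=1.9,lead=0.02,so2_level=0.3 1495462150000000000
```

With ~10,000 devices, device_id dominates the series count; city and zipcode are dependent on it (a stationary device has one of each), so they add descriptiveness without multiplying cardinality.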
Jack Zampolin 28:35.137 Another potential schema would be the same as the last one, except we would have those latitude and longitude values as tags. Does anyone have any questions or comments about either of those schemas? Which one do we think would be best, and what are the pros and cons of either of those?
Jack Zampolin 29:03.087 Okay. Moving on: hardware requirements. So, if you’re running the database, you’re going to want to understand how large a machine you need to run it on based on your workload. InfluxDB is known for its performance, so we’re pretty easy on your hardware, but there are some things to keep in mind. This is going to depend pretty heavily on your load profile: how many field values a second need to get persisted to disk. The number of field values coming into the database every second essentially determines your write load. Another part of load is the number of queries—that’s going to affect memory consumption. And then, obviously, unique series, which we’ve talked about a few times—that’s also going to affect memory consumption, through your indexes. So, if we look here: 25,000 values a second, five queries a second, fewer than 100,000 unique series—you can run that on very small hardware. And then obviously there are moderate and high loads. And then the performance limits of a single server: anything over one million values a second, more than 100 queries a second—generally, I’d say that’s between 20 and 50 people in chairs refreshing Grafana [laughter]—and over 10 million unique series.
Jack Zampolin 30:43.381 Paul has a question. Are there any metrics collected by Telegraf that may allow us to see what our current load profile may be? Yes. Telegraf has an InfluxDB plugin that gets a lot of these values. I would check out the documentation on that. The number of series in the database is in there definitely, the number of writes a second is in there, and the number of queries per second you can pull out as well. So the Telegraf InfluxDB plugin has all of this. So, for those load profiles there, these are the different hardware requirements you’re going to want. Low load you can handle pretty gracefully on two to four CPU cores and two to four gigs of RAM. InfluxDB being written in Go does effectively utilize multiple cores. We’ve got a number of background processes. So having at least two CPUs to make the database run performantly is a good idea.
Jack Zampolin 31:45.373 Yes, Sartorius, the internal database does have some of that information as well. The write throughput especially is going to be in the internal database, and I believe the series counts are in there as well. But you might need the Telegraf plugin to pull those out. And then moderate load, four to six CPU cores, 8 to 32 gigs of RAM depending on, obviously, the number of series in your query load. And then we do advise folks run on SSDs. So those IOPS requirements are very low for most SSDs but just something to be aware of. Running on spinning disk will reduce your write throughput by about two-thirds. And then for high load or any pretty serious production load, I do recommend that folks always run eight or more cores. And in that high load scenario, you’re going to have a lot of series and new queries, so 32 gigs or more of RAM. I’ve seen folks run with hundreds of gigabytes of RAM depending on the number of series and, obviously, that number is a little more flexible than the CPUs number. But just something to keep in mind.
Jack Zampolin 33:19.340 General performance tips. We’re pretty CPU-heavy, with the caveat that different workloads can be memory-heavy. When I do testing on AWS, I use the c3 series. Memory usage is heavily correlated with the number of unique series in the database—I’ve said that a number of times [laughter]. It’s an important thing to understand about Influx, and something that separates Time Series Databases from other databases: we keep those series keys in memory for very quick lookups. Anything that involves a lot of queries is also going to require additional memory. So let’s say you’re more in that low-load scenario in terms of number of series and the number of values you’re writing every second, but you have a lot of people looking at dashboards: that means additional memory, and you might start seeing some CPU contention when you scale up those clients as well. So just something to be aware of. And please run the database on SSDs. It just runs better.
Jack Zampolin 34:38.888 Storage requirements. A lot of folks want to know, “How much storage am I going to need to satisfy my use case? I need to store all of this data for a year or forever.” And while we’re calculating storage space for Influx, we do have a question here from Sartorius: does downsampling improve performance? That’s an interesting question—yes and no. Downsampling will essentially reduce the size of your historical data on disk, but that historical data on disk doesn’t have a whole lot of bearing on most of the queries you’re going to end up doing. If you’re using the database for DevOps monitoring, or for an Internet of Things application where people are looking at current sensor values most often, or even a financial use case where you have stock market data streaming in, what you really care about is now, the last 15 minutes, or the last hour. That data is always going to be in a current shard within the database and will be very performant. The old downsampled data is going to be performant for long-term queries. So let’s say you’re querying a year of data trying to pick out some trends—downsampling will make those queries more performant. So in that way, downsampling improves performance. But there’s no real downside to having all of that data on disk: if you’re not querying it, we’re not going to pull it up, and keeping it around is fairly cheap. So, good. Thank you.
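The downsampling described here is typically set up with a retention policy plus a continuous query. A hedged sketch in InfluxQL—the database, retention policy, measurement, and field names are all hypothetical:

```sql
-- Keep raw data for 30 days; keep downsampled data for a year:
CREATE RETENTION POLICY "raw"  ON "pollution" DURATION 30d REPLICATION 1 DEFAULT
CREATE RETENTION POLICY "year" ON "pollution" DURATION 52w REPLICATION 1

-- Roll raw points up into hourly means for long-term queries,
-- preserving all tags with GROUP BY *:
CREATE CONTINUOUS QUERY "cq_hourly" ON "pollution" BEGIN
  SELECT mean("co_ppm") AS "co_ppm"
  INTO "pollution"."year"."pollutants_1h"
  FROM "pollution"."raw"."pollutants"
  GROUP BY time(1h), *
END
```

Recent queries hit the raw, in-memory-hot shards; year-long trend queries hit the much smaller hourly rollup.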
Jack Zampolin 36:39.357 So non-string fields take up approximately three bytes per point—the way I normally say that is three bytes per float or integer value. So one billion points is around three gigs on disk. String values require variable space determined by the string compression. The string compression algorithm we use is Snappy, and I think that’s generally around 60% compression—obviously, that depends on how many times values are repeated. Let’s say you’re storing an alert state as a field value. There are going to be a lot of repeated values there; we’ll store pointers to those values, and they’ll compress very well. So it depends on the data you write. An important thing to remember is that measurement names, tag keys, and tag values are only stored once per series, not per point, and have very minimal impact on storage needs. So load up on those tags if they’re descriptive and not in conflict.
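Those numbers make back-of-the-envelope sizing easy. Here is the arithmetic for the low-load profile from the hardware slide; the 25,000 values-per-second workload is an assumption for illustration:

```python
BYTES_PER_VALUE = 3               # approx. per non-string field value
values_per_sec = 25_000           # assumed ingest rate
seconds_per_year = 365 * 24 * 3600

bytes_per_year = values_per_sec * BYTES_PER_VALUE * seconds_per_year
print(f"{bytes_per_year / 1e12:.2f} TB/year")   # → 2.37 TB/year
```

String fields and index overhead come on top of that, but for numeric telemetry this estimate is usually in the right ballpark.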
Jack Zampolin 37:42.840 And that’s our webinar today. I’m happy to stick around and take questions—I’ll be here for the next 10 to 15 minutes, answering the chat and the Q&A. I hope everyone got a lot out of this webinar, and thank you all very much. So it looks like there is one question in the Q&A. John asks, “Can you comment on adding an alert flag as a field versus a separate measurement? Example: temperature over a range.” And Mikale asks, “Can I see this video again?” Mikale, we will upload this video, and I think Chris is going to send it out. John, as to your question there: how would you store it as a separate measurement? I’m not quite understanding that. If it’s an alert flag, it might want to be a tag, too—you might want to GROUP BY an alert state and figure out which hosts are in an individual alert state. But having it as a measurement, I’m not quite understanding what you’re saying there. If you could add some clarification, I would appreciate that.
Jack Zampolin 39:25.048 Sartorius asks, will cross-measurement queries come in the future? You can query across multiple measurements currently, using regular expressions in your FROM clause. It’s not very performant, but it’s doable. As far as cross-measurement joins, I do know those are features that are on the roadmap, so yes, I do hope to get those eventually. Mikale says, “We try to use TICKscript, but we have a problem with Kapacitor because we have a time lag between hosts and InfluxDB and Kapacitor.” That’s okay. Yeah.
Jack Zampolin 40:18.443 John, sorry meant tag as opposed to measurement. John, it really depends on your use case. For tags, generally those sensor alerts, there’s a few different states, so they’re really not going to add a whole ton of cardinality. And you’re going to want to maybe GROUP BY those. Probably storing as a tag would generally be advantageous; however, I’ve definitely seen people store them as field. And I’ve had use cases where storing sort of an alert state like that as a field, maybe it’s not super critical to have, but it’s nice and descriptive and my client-side application needs to do something with that value, in that case, storing it as a field’s just cheaper. So does that make sense? Awesome.
Jack Zampolin 41:28.780 So, Mikale, it sounds like you’ve got a lot of deadman alerts that are firing inaccurately due to time lag between Influx and Kapacitor. If that’s the case, I would suggest trying to reduce the latency between InfluxDB and Kapacitor. We normally run them either on the same host, if it’s a smaller installation, or on hosts in the same data center connected over a virtual private network to reduce latency between them. Ideally, data from InfluxDB does not have to travel over the open Internet to reach Kapacitor—that’s pretty slow. So reducing that latency would be the way to go there.
Jack Zampolin 42:20.646 How can we disable tasks for maintenance hosts? Maybe for sync and PTD. That’s a feature that’s coming in the future for Kapacitor. There’s no way to currently disable alerts without disabling the entire TICK script. This also depends on exactly how your TICK scripts are laid out, Mikale. Sartorius asks, “Kapacitor, InfluxDB on the same machine is okay for performance?” The short answer, no, long answer, it depends on how much performance you need out of them and how large the machine is. But for lower load scenarios, that’s okay. For higher load scenarios, please do be sure to separate those on the separate machines. Does that answer your question?
Jack Zampolin 43:28.635 Mikale, did I get to your question there about the Kapacitor stuff? There were a number of different issues there, and I was trying to answer them all. Did I get everything for you? So we had a couple more questions in here. Sartorius asks, “Can Kapacitor get data from another source than Influx to make decisions?” Yes, Sartorius. I like to think of Kapacitor as having a frontend and a backend: the frontend ingests data, and the backend spits out alerts and manipulated data. Any source that writes InfluxDB line protocol can be routed through Kapacitor, so you can use Telegraf to write directly to Kapacitor and make alerts that way. I’ve definitely used Kapacitor in that configuration before, and it absolutely works great. So yes, you can do that. Mikale says, “I’m late, and I didn’t see the part about the Docker Stack. Where can I see the video?” Chris will send out an email with a link to this video afterwards, and this particular video was not about Docker. We did a webinar earlier this week on using the TICK Stack with Docker Swarm and autoscaling. That video will be up soon as well.
Chris Churilo 44:45.726 Oh, it’s actually live already, so I’ll put the link in here.
Jack Zampolin 44:47.694 Oh. Oh, excellent. Okay, cool. So, Mikale, if you’re asking about the Docker stuff, we do have that. So, yeah. Are those the Docker things that you were looking for? It looks like we lost Mikale.