In this webinar, Michael DeSa defines what time series data is (and isn’t), explains how the time series problem domain differs from more traditional data workloads like full-text search, and examines how InfluxData is differentiated from other proposed solutions (1 hr). Recorded February 2017
Watch the webinar “Introduction to Time Series” by clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Introduction to Time Series.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Michael DeSa: Software Engineer, InfluxData
Michael DeSa 01:01.402 So that first question there, which is: what is time series data? Time series is a sequence of data points, typically consisting of successive measurements made from the same source over a time interval. If that’s a little bit abstract—you don’t really know what that means—what it really means, when it comes down to it, is that if you were to take your data and throw it up onto a graph somewhere, one of the axes on your graph would always be time. So one of the dimensions of your data is always time. Time matters, and the data changes over time. So here we have the #royalbaby usage over time. This was right around the time that, I think it was Prince Harry—not Prince Harry, Charles, I believe, was having a baby, and you could see when the tweets per hour were trending. So here in the x-axis we have time, y-axis we have some metric. This is time series data. This next example we have here is of something that is not time series data. Here we have this web graph of things where you have users, you have apps, you have friends, you have various other sorts of things, and you can like things, you can cook things, you can listen to things. And the important thing about this data here is the state of the graph and the relationships between the various objects in the graph, not how that graph changes over time. And that’s a key thing that we’re going to keep coming back to here. If your data does not evolve over time, or if you don’t care how your data evolves over time, it’s probably not a time series use case.
Michael DeSa 02:48.861 Here we have some weather data where we’re monitoring things like the dew point, the temperature, and some other metrics. And we can see here we’ve got historical graphs as well as these kinds of gauges that tell us the current state of something. This is time series data. The next thing we have here is not time series data; it’s something you see a lot in machine learning, where you’re trying to do some sort of classification. You want to determine the classification of things like this: if you have very high athletics but low grades, you’re maybe a jock. If you have very high athletics and very high grades, maybe you’re a superstar. And the important thing about the data here is that it’s clustered in this way, and the relationship between grades and athletics and what that means for your social status in some kind of mythical high school here. And so the important thing here is, again, these clusters, not how these clusters change over time, and therefore it is not time series data. Here we have what I would consider to be the epitome of time series, at least in the monitoring world, where we’re looking at system metrics from a server, so things like the disk write ops, the amount of disk used, the short-term load, and the resident set size of a process on some server somewhere. And in this example here we have things graphed in Grafana. So this is, here, x-axis time, y-axis some metric, definitely time series data.
Michael DeSa 04:44.096 So if we take that data that we had earlier where we were, say, looking at the tweets per hour, if we then just said, “Well, we really only care about the distribution of tweets by population as a whole,” that might not be the most time-series-oriented use case, and so for that reason, I would consider that not to be time series data. However, it is possible that you would want to know, in the last hour, what was the distribution of tweets by population? And you want to look at how that distribution changes over time, and that would be something that you could have a time series use case for. Here, again, we have something that is, without a doubt, not time series. We have x-axis temperature, which has no relation to time, and y-axis price. No, this is definitely not time series. There’s no time on this graph anywhere. And then here we have—ah, this one’s not playing. There was a video here. I’ll be sure to share out some slides that have the video, but here we have the viral spread of Ed Sheeran’s Sing, where you could see the green spots and the red spots gradually expanding over time. And that lends itself, again, to a time series use case. This would be perfect for something to go into a Time Series Database.
Michael DeSa 06:17.051 And here, again, we have some CPU load that we’re monitoring on a number of different systems. X-axis time, y-axis some metric. Without a doubt, this is time series data. And then, this one here is a little bit fuzzy. In the x-axis we have time spent studying. In the y-axis we have student grades. Time spent studying is actually a metric, not an ordering; it’s “how long was I studying?” That ends up being the thing that you’re storing, and then student grades is the other metric. So while time spent studying isn’t necessarily a relationship of time, you could store this type of data in a Time Series Database, though it may not be the best-suited thing. But it is possible.
Michael DeSa 07:17.969 Within time series, there’s two major categories. There’s regular time series, sort of typically known as metrics. These are the things on that blue graph there that come in at a fixed regular interval. And then there is irregular time series which are things like events. So that’s that red graph down at the bottom there where I don’t necessarily know when the data’s going to come in, I just know that it’s going to come in. So, again, regular time series is measurements gathered at regular time intervals, think metrics. So monitoring your system, monitoring your servers, things like this. And then think of irregular time series. You want to think of events, or in the IoT use case, typically, sensors only report data on a state change. So when the state changes, there’s an event for that. The data gets reported at an irregular interval, meaning I don’t necessarily know when the data’s going to come in.
Michael DeSa 08:23.976 So the summarization of events becomes fairly important. Applying a lot of forecasting techniques and sort of just getting a general sort of feel of your data often requires having these sorts of events or this irregular data turned into regular data. So a good example of something is if I’m, say, monitoring the price of Apple stock. These trades come in very, very frequently, and I want to really know, over the last 10 minutes, what were the trades that happened. So I’d get a bunch of different events for trades and I want to sum all of those together to give me a single thing for the last 10 minutes. And then, similarly, if I’m, say, monitoring an application, a web application, and each endpoint I have is receiving requests and then giving out responses, and I want to measure the latency of those requests. Then each time I get a request, I want to kick off or trigger some kind of event. But I really don’t care about the individual event. I care about some kind of aggregated window or maybe some sort of histogram of all the request latencies that helps me get an idea of what’s really going on. So the summarization of events and sort of data over time spans is something that’s very, very important for time series data.
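That windowed roll-up can be sketched in a few lines of Python (illustrative only: the function name and sample trade data are invented for this example; in InfluxDB itself you would typically use a GROUP BY time() query rather than client-side code):

```python
from collections import defaultdict

def summarize(events, window_seconds=600):
    """Roll irregular events up into regular, fixed-width windows.

    events: (unix_timestamp, value) pairs arriving at arbitrary times.
    Returns {window_start: sum_of_values} for each 10-minute window.
    """
    buckets = defaultdict(float)
    for ts, value in events:
        window_start = ts - (ts % window_seconds)  # align to window boundary
        buckets[window_start] += value
    return dict(buckets)

# Trades arriving at irregular times; the first two fall in one window.
trades = [(1000, 2.0), (1090, 1.5), (1700, 4.0)]
print(summarize(trades))  # {600: 3.5, 1200: 4.0}
```

The same idea extends to the request-latency example: accumulate a per-window list or histogram instead of a sum.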
Michael DeSa 09:54.392 So now we have an idea of what time series data is. Well, what is a Time Series Database? Time Series Databases are optimized for collecting, storing, retrieving, and processing time series data. And this is a little bit different from, say, a document database, where you’re really storing general documents and you perform operations that, say, modify that document, or you do something like joining documents together. The important thing here is, a document database is not particularly suited for storing just a single numeric value or a single string value and how that single value, say, changes over time. Similarly, for a search database: using something like Elasticsearch, I can put metrics into Elasticsearch, but the database itself is doing a lot of extra work to pull out those metrics, mostly because it’s not suited for individual datapoint metrics over a time period. It’s just something that it happens to also be able to do, with varying levels of efficiency.
Michael DeSa 11:12.782 And then, the final thing here, sort of compared to traditional databases, relational databases, in the time series world, it’s very frequent that your data shape is changing and you’re querying data in an individual series, so you really care about columns. So these more tradition, sort of tabular storage systems aren’t particularly suited for time series data, specifically. So the key thing is there, you want to think about this kind of sequence of datapoints or sequence of metrics along at timestamp is really what you want to think about when you’re thinking about time series. Which is really some sort of columnar structure that time series lends itself to.
Michael DeSa 11:58.373 So the time series workload is particularly unique because of lifecycle management. With time series, it’s very common that you only need the last month’s worth of data at a very high resolution. You don’t want to keep around 20 billion points or 20 billion individual metrics forever. And so you want some way to manage the lifecycle of your data: a way to retire old data somewhere else, or compress it down into a smaller format, because you really don’t need that high-resolution data for extended periods of time. Similarly, you want to have very efficient summarization techniques. So you want to be able to summarize, very efficiently, what the maximum value of this thing in the last hour was; summarization over some time span is the key thing there. And then you have very large scans of many, many, many records. So you have billions, and billions, and billions of records that you have to comb through, and you need some way to present that onwards to a client. Maybe you have a billion rows, and pulling a billion individual points out of the database is probably infeasible, so you want some sort of summarization to pull things together so you get a realistic view of the world.
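As a rough sketch of that lifecycle idea in Python (the retire function is hypothetical, invented for illustration; in InfluxDB itself this is what retention policies and continuous queries handle for you):

```python
def retire(points, now, keep, window=3600):
    """Keep recent points at full resolution; compress everything older
    into an hourly-max summary so the raw data can be dropped."""
    cutoff = now - keep
    fresh = [(ts, v) for ts, v in points if ts >= cutoff]
    summary = {}
    for ts, v in points:
        if ts < cutoff:
            w = ts - (ts % window)  # hour bucket the old point falls in
            summary[w] = max(summary.get(w, float("-inf")), v)
    return fresh, summary

pts = [(100, 1.0), (200, 5.0), (4000, 2.0), (8000, 3.0)]
fresh, summary = retire(pts, now=10000, keep=4000)
print(fresh)    # [(8000, 3.0)]
print(summary)  # {0: 5.0, 3600: 2.0}
```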
Michael DeSa 13:26.880 There’s some other types of databases that get used for time series. As I mentioned, Cassandra. So Cassandra is a columnar distributed database, and you can store time series data in Cassandra. The biggest issue here, the biggest issue I see, is very often you end up implementing a lot of things that a Time Series Database kind of just gives you for free. And I think that theme is pretty common across the other types of databases, whether it be Redis, MongoDB, or Elasticsearch: you can do time series in these databases, you just have a lot more work that you have to do to make that use case work. And so there’s a trade-off there: how much overhead are you willing to deal with? How much extra work do you want to have to put into maintaining this time series use case and the custom tooling around it? Off to the other side here, we have some other Time Series Databases. I’ll go through these in order, starting with Kx. Kx Systems makes a database called kdb, which is very popular in the financial industry. And the biggest issue I see here is kdb is in-memory and disk-backed, but very often we have users that have use cases that need more than one machine, and you couldn’t get enough memory on a single machine to hold all of that data. Next, we have Graphite. Graphite is a legacy Time Series Database, and whether or not it’s actually time series is a little bit debatable. It’s more for metrics and monitoring, and less for a generic time series use case. And it does things like automatic roll-ups of your data, and you often lose the individual irregular data that you might want to be storing in a Time Series Database.
Michael DeSa 15:28.891 The next one over here is OpenTSDB. So OpenTSDB is not actually a database in itself; it’s a layer on top of HBase. And so the biggest concern I see from people here is you need to manage HBase in addition to this layer on top of it. And managing HBase is quite an ordeal, and most people don’t want to undertake something like that. And then finally, down here we have Riak TS. A company called Basho makes a database called Riak, and they altered their database to be a little bit more suited for time series data. And it’s shown some promise, though I haven’t seen too much written about it in the last few months. There’s one that’s not on here that I’d like to acknowledge, which is Prometheus. And Prometheus is very good in the kind of DevOps monitoring world. The biggest issue that I see there is managing it in a highly available state, and doing some sort of scale-out with it becomes infeasible. And for those reasons, I would suggest that it is maybe not the most suited for a generic time series use case or a use case where you have a scale-out load or things like that.
Michael DeSa 16:47.743 Now we’re going to move on to some time series use cases. The primary use cases that we see are IoT, DevOps, and real-time analytics. And we’ll kind of go through those just briefly and talk about what each of those things are. So, in the IoT space, we see people like factories or oil and gas companies or sort of smart-run infrastructure companies that want to have some metrics associated with what they’re doing. So, let’s say, take the oil and gas example. Go down to some oil drill somewhere and there’s going to be tons of sensors. Each of these things is kicking off metrics. They want to be able to store that and query that data in some kind of real-time fashion. And so that’s what we see in the IoT use case in the industrial setting. For consumer, most people have seen things like Fitbit where you can monitor things like the number of steps you take per day or per hour and sort of look at the distribution of those things there. That’s definitely an IoT use case that we see a lot. In the DevOps use case, we see custom monitoring. Whether that be monitoring your application, your users, events, servers, sort of that whole spectrum of things there. We’ve seen intrusion detection systems being applied where the data gets stored into InfluxDB, and so there’s sort of that space there. And then real-time analytics where you’re storing something that’s pertinent to your business in real time. So this could be a number of different things. I’m struggling to think of the one off the top of my head, but we do see that fairly frequently, as well.
Michael DeSa 18:37.297 So if I’m coming out into this space and I need to build some sort of time series solution, why should you choose InfluxDB over the other options that we’ve talked about? The number one thing in my mind, and the number one thing at this company that we hold as the standard, is we want all of our tools to be very easy to get started with. In my experience, whether it be with Cassandra, or HBase, or any of these other systems, there’s very often a large overhead in getting started, and that kind of limits me when I’m building something out for the very first time. I don’t want the database to become something I have to think about and manage when I’m just starting out. And so, for that reason, we want InfluxDB to be very, very, very easy to get started with. So we try to make it as easy as possible. The next thing, in line with being easy to get started with: most people that have some sort of software experience have dealt with SQL, and for that reason, we derived our query language from SQL. So easy to get started with, familiar query language. Also in line with being easy to get started with, we didn’t want to have external dependencies. For OpenTSDB, you have to run an HBase cluster. There’s a thing called Metrictank; for that, you have to run a Cassandra cluster. So we didn’t want any of that. We just wanted to have a single binary that you put on your machine and then you can run it, and that’s all you have to do. You don’t have to manage some external thing. The only thing you have to deal with is InfluxDB. Next, we wanted to allow for both regular and irregular time series. It’s pretty common that you want to store irregular events in the database, and in the IoT world, most of the use cases are based off of events.
And so some of the other time series solutions, like Graphite, don’t necessarily allow for irregular time series in the best way so we wanted something that was, again, very easy to use whether you’re doing regular or irregular time series.
Michael DeSa 20:47.156 We’re horizontally scalable. So when you start out building your application, you only need, really, a single instance. But say you start getting a little bit more load, you need some high availability, you need to scale out the performance of your cluster; you can really do that with InfluxDB. That being said, it is a commercial engagement, but we hope that the open source server should get you far enough along that, at some point, you’ll be happy to pay for scale-out. And then the final point down there that I’d like to make, which is: we’re not just a database. InfluxDB or InfluxData is not just a database. We’re really building a whole Time Series Platform. And the reason comes back to that ease of use, where we build a collector, Telegraf, that you can use to get data into InfluxDB in a variety of different formats. We have InfluxDB, where you can really store that data; Chronograf, where you can visualize that data; and then Kapacitor, where you can process that time series data. And we’re aware that there are other collectors out there, whether that be collectd or others, and other visualization tools out there, whether it be Grafana or a number of other things. And there are various other stream processors or batch processors like Spark, or Flink, or Samza; there’s a wide array of them. We’re aware that all of these things exist. And the main reason we make all of them is we want to have something that, if you haven’t made a decision about this and you need to get off the ground and up and running, you have a way to do so. And you can keep it all in one stack and it will work seamlessly. And so that’s where I see the most value in InfluxDB: we can leverage the entire stack to provide an easier out-of-the-box experience so that you can spend less time worrying about your infrastructure and more time building your own application.
Michael DeSa 22:56.963 On that note, we’re going to go into the InfluxDB data model. And for that, we’re going to start with a typical time series graph. In the x-axis we have time, y-axis we have price—sorry. Spaced out here, jumped a little bit ahead. We’re going to start with this graph here and we’re going to look up at the label on the graph. We call this label the measurement. It’s a high-level grouping for all of the data beneath the graph. Off to the side, we have the legend data, or metadata. We call this metadata tags, and tags are indexed values in InfluxDB. The collection of all of the tags for a single legend item—so the blue circle with the A, or the green square with the AA—we call a tagset. So the blue circle with the A is ticker=A,market=NASDAQ. Then we have y-axis values. These y-axis values are called fields, and they can be ints, strings, floats, or bools. Just as the collection of all the tags is called the tagset, the collection of all the fields is called the fieldset. Note that in this case there’s only one field, called price, but we could have many. There’s no reason we couldn’t have price, and volume, and high price, low price, and so on and so forth. And then finally, we have the thing that makes this time series, which is that timestamp down at the bottom, which is represented as the number of nanoseconds since midnight on January 1st, 1970.
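Putting those four pieces together, a single point can be modeled as plain data. Here's a minimal Python sketch (the names are illustrative, not an InfluxDB client API):

```python
# One point: measurement + tagset + fieldset + nanosecond timestamp.
point = {
    "measurement": "stock_price",
    "tags": {"ticker": "A", "market": "NASDAQ"},  # indexed metadata (tagset)
    "fields": {"price": 177.03},                  # measured values (fieldset)
    "timestamp": 1422568543702900257,             # ns since 1970-01-01T00:00:00Z
}

def series_key(p):
    """A series is identified by measurement plus tagset (tags sorted by key)."""
    tagset = ",".join(f"{k}={v}" for k, v in sorted(p["tags"].items()))
    return f"{p['measurement']},{tagset}"

print(series_key(point))  # stock_price,market=NASDAQ,ticker=A
```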
Michael DeSa 24:50.839 For a long time, we used to represent the data that came into the database as JSON. And a while back we had some performance issues with encoding and decoding JSON, and for that reason, we looked at kind of the other formats that were out there and we developed our own textual format called the line protocol. And the line protocol is as follows. It goes: measurement, comma, tagset, space, fieldset, space, timestamp. So here we have an example of stock_price,ticker=A,market=NASDAQ price=177.03 large timestamp. So just to give you an idea of what points kind of look like, we have a measurement, and then our tagset, our fields, and then the timestamp. And then we can see that different measurement and tagsets have different fields, and so on and so forth. This brings me to what is referred to as a series in InfluxDB. So a series is all points that have a common measurement and tagset, or all things that have a common measurement and tagset. So that’s that first section of the line protocol that we saw there. And you can think of it as all points on that blue line, all points on that yellow line, all points on that green line, all belong to the same series. So the next thing is an individual point can be thought of as a measurement plus a tagset plus a timestamp. So in that way, you can kind of think of the timestamp as the ID for a point in a series. Sort of a lot of words there but drawing some sort of SQL analogies.
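A minimal Python sketch of rendering that format (illustrative only: real line protocol also requires escaping commas and spaces and quoting string field values, which this skips):

```python
def to_line_protocol(measurement, tags, fields, timestamp=None):
    """Render a point as: measurement,tagset fieldset timestamp."""
    tagset = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    fieldset = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    line = f"{measurement},{tagset} {fieldset}"
    if timestamp is not None:  # omitted -> server assigns receive time
        line += f" {timestamp}"
    return line

line = to_line_protocol(
    "stock_price",
    {"ticker": "A", "market": "NASDAQ"},
    {"price": 177.03},
    1422568543702900257,
)
print(line)  # stock_price,market=NASDAQ,ticker=A price=177.03 1422568543702900257
```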
Michael DeSa 26:42.455 Here we have a couple more examples of some points in line protocol. We could have a measurement cpu that has a single tag called host equal to server1, has a single field called value equal to 100, and then some large timestamp. Below that we have the measurement temperature that has two tags, zip code and country, that store the values 94107 and USA respectively. And then in this case, we have two fields, value and humidity, which store 75 and 10. And the important thing to note here is that there is no timestamp. This is perfectly valid line protocol. If you do not specify a timestamp, InfluxDB will assign the time that the point was received, on the server that received it. There are a couple of problems with that—you could have a race condition: if for some reason your request fails and you need to retry it, your data might be slightly out of order. For that reason, we always recommend assigning timestamps. The final example that we have here is a measurement called response time that has two tags, method and precision, that store get and millisecond respectively. And finally, we have a single field called value equal to 12i, and then a timestamp. An important thing to note here is that the i on the end, following the 12, denotes that the value is an integer. All other value types that we’ve seen here have been treated as floating-point numbers.
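Going the other direction, here is a small Python sketch of decoding those examples (simplified: it assumes no escaped spaces or commas and handles only numeric fields; the trailing i integer marker and the optional timestamp behave as described above):

```python
def parse_field_value(raw):
    """A trailing 'i' marks an integer; bare numbers are floats."""
    if raw.endswith("i"):
        return int(raw[:-1])
    return float(raw)

def parse_line(line):
    """Split simplified line protocol into its four parts."""
    parts = line.split(" ")
    head, fieldset = parts[0], parts[1]
    # No timestamp means the server would assign its own receive time.
    timestamp = int(parts[2]) if len(parts) == 3 else None
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=") for pair in tag_pairs)
    fields = {k: parse_field_value(v)
              for k, v in (pair.split("=") for pair in fieldset.split(","))}
    return measurement, tags, fields, timestamp

m, tags, fields, ts = parse_line("response_time,method=GET value=12i 1422568543702900257")
print(fields["value"], ts)  # 12 1422568543702900257
```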
Michael DeSa 28:12.994 So this brings me to writing data in InfluxDB. Doing so is very easy. The first thing that you’re ever going to do when you start writing data into a database is actually create a database. So you’re going to start InfluxDB up in some background process and then type influx, and that’s going to open the InfluxDB CLI. And you should see something very similar to what we see here, which is connected to http://localhost:8086 version 0.9—you should see something slightly newer, so something more like 1.2 or 1.1. And then once you’ve done that, you can type create database mydb, which will create the database mydb for you. We can then verify that the database was created by issuing a show databases command. And when we issue a show databases command, we’ll see two databases there: one of them being internal and the other one being mydb. So internal is a database that just kind of comes along and stores various internal statistics about the InfluxDB process, and mydb is the database that we just created. In the past, I’ve seen some people have some confusion about: what is name? Name is actually just the column header that we don’t have sectioned off. So there’s no database called “name”. Once we’ve created the database, the next thing we want to do is use the database. Using the database sets the context for all further queries or commands. So we’re going to say use mydb, and it says using database mydb. Now all further queries that we issue will have that context set.
Michael DeSa 30:03.238 Once we’re in the CLI, we can insert some data, just to get a feel for how things work, by issuing an insert command. So you say insert cpu,host=server1,location=us-west value=10, with no timestamp, and then hit enter. Then insert cpu,host=server1,location=london value=11, and then insert cpu,host=server2,location=us-west value=12. So just go through and insert each of those points. And once that’s done, we can verify that all of the data was actually written by issuing a select star from cpu. The important thing to note here is, if you have a real instance, you may not want to issue that specific query unless you have some protections built into the database already. If you have, say, 30 billion records under the cpu measurement and you say select star, it will try to pull all of those off disk. If you don’t want that to happen, the InfluxDB configuration file does have settings that let you prevent it, but they are off by default. The following two things here: if you issued show series, it would show you the series that are in the database, and we should see three of them, because we inserted three points in three different series. And if we issued show measurements, we should see just that cpu measurement. And on that note, I am out of slides. So I’m happy to answer any questions or things that may come up. And thank you again for your time, and I’ll be in the Q&A answering questions. Thank you.
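To see why show series returns three entries while show measurements returns one, the same distinction can be mimicked in a few lines of Python (an illustration of the concept, not of InfluxDB internals):

```python
# The three inserted points, reduced to (measurement, tagset).
points = [
    ("cpu", (("host", "server1"), ("location", "us-west"))),
    ("cpu", (("host", "server1"), ("location", "london"))),
    ("cpu", (("host", "server2"), ("location", "us-west"))),
]

series = {(m, tags) for m, tags in points}   # distinct measurement+tagset
measurements = {m for m, _ in points}        # distinct measurement only
print(len(series), len(measurements))  # 3 1
```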