Why Build a Time Series Data Platform?
Please note: This blog was originally published on DB-Engines and can be found here.
In the middle of the big data hype cycle, before IoT was at the top of everyone’s buzz list, before Cloud Native was common lingo, and before large enterprises had started to de-silo their infrastructure monitoring and metrics data, Paul Dix, Founder of InfluxData, began building a purpose-built Time Series Platform. Flash forward to today: time series is now the fastest growing database segment, and the market is clearly moving beyond the repurposed Cassandra and HBase implementations that defined the segment at that time. The following post is a firsthand account from Paul Dix outlining the problems he witnessed and why he built a modern Time Series Platform.
I am frequently asked: “Why build a database specifically for time series?” The implication is that a general SQL database can act as a TSDB by ordering on some time column, or that you can build on top of a distributed database like Cassandra. While it’s possible to use these solutions for time series problems, doing so is incredibly time consuming and requires significant development effort. I talked to other engineers to see what they had done and found a common set of tasks that pointed to the need for a common Time Series Platform. Everyone seemed to be reinventing the wheel, so it looked like there was a gap in the market for something built specifically for time series.
In this post, I’ll define the time series problem, lay out what differentiates time series from other use cases and database workloads and look at other approaches I’ve seen taken to handle the unique requirements of time series data. Finally, I’ll look at the advantages of building specifically for time series.
Defining The Time Series Problem
Let’s first define time series data and then look at how others have tried to solve for this before moving to a Time Series Platform.
When I refer to time series data, I think of two different types of time series: regular and irregular.
- Regular time series are familiar to developers in the DevOps or metrics space. These are simply measurements that are taken at fixed intervals of time, like every 10 seconds. This is also seen quite frequently in sensor data use cases, like taking a reading at regular intervals from a sensor. The important thing about regular time series is that they represent a summarization of some underlying raw event stream or distribution. Summarization is very useful when looking for patterns or visualizing data sets that have more events than you have pixels to draw.
- The second type of time series, irregular, corresponds to discrete events. These could be requests to an API, trades in a stock market, or really any kind of event that you'd want to track in time. It's possible to induce a regular time series from an irregular one. For example, if you want to calculate the average response time from an API in 1 minute intervals, you can aggregate the individual requests to produce the regular time series.
My belief is a modern TSDB needs to be able to handle both regular and irregular events and metrics.
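To make the "induce a regular series from an irregular one" idea concrete, here is a minimal sketch in Python. The function name `average_per_minute` and the sample request data are hypothetical, and timestamps are assumed to be integer Unix seconds; it is an illustration of the aggregation, not how any particular database implements it.

```python
from collections import defaultdict

def average_per_minute(events):
    """Aggregate irregular (timestamp_seconds, value) events into 1-minute averages.

    Returns a sorted list of (window_start_seconds, mean_value) pairs.
    Minutes with no events are omitted rather than filled in.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in events:
        window = ts - (ts % 60)  # truncate to the start of the minute
        sums[window] += value
        counts[window] += 1
    return [(w, sums[w] / counts[w]) for w in sorted(sums)]

# Hypothetical API requests: (unix seconds, response time in ms)
requests = [(0, 100.0), (15, 200.0), (59, 300.0), (61, 50.0), (185, 70.0)]
regular = average_per_minute(requests)
print(regular)  # [(0, 200.0), (60, 50.0), (180, 70.0)]
```

Note that the minute starting at 120 seconds has no requests and therefore no output row; whether to fill such gaps is a choice the sketch leaves open.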
The other part of time series is that it’s common to have metadata that describes the series and that users may want to query on later. This could be a hostname, application, region, sensor ID, building, stock name, portfolio name, or really any dimension on which a time series might be queried. Adding metadata to time series allows you to slice and dice them and create summaries across different dimensions. This means that a series consists of the metadata that describes it plus an ordered list of (time, value) pairs. Metadata is represented as a measurement name, tag key/value pairs, and a field name.
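One way to picture this model is as a dictionary keyed by the series metadata. The sketch below is an assumption-laden illustration in Python: `SeriesKey`, `write`, and `select` are hypothetical names, and an in-memory dict stands in for real storage.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class SeriesKey:
    measurement: str
    tags: tuple       # sorted (key, value) pairs, so equal tag sets compare equal
    field_name: str

store = defaultdict(list)  # SeriesKey -> list of (time, value) pairs

def write(measurement, tags, field_name, time, value):
    key = SeriesKey(measurement, tuple(sorted(tags.items())), field_name)
    store[key].append((time, value))

def select(measurement, field_name, **tag_filter):
    """Return every series whose tags include all of the requested tag values."""
    wanted = set(tag_filter.items())
    return {
        key: sorted(points)
        for key, points in store.items()
        if key.measurement == measurement
        and key.field_name == field_name
        and wanted <= set(key.tags)
    }

write("cpu", {"host": "a", "region": "us-west"}, "usage", 10, 0.5)
write("cpu", {"host": "b", "region": "us-west"}, "usage", 10, 0.7)
write("cpu", {"host": "c", "region": "eu"}, "usage", 10, 0.2)

west = select("cpu", "usage", region="us-west")  # two series match
```

Filtering on `region` here scans every series key; a real system would index the tags, which is the kind of work the rest of this post is about.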
Time Series Applications & Scale
Now that we’ve defined what time series are, let’s dig into what makes them different from other database use cases and workloads.
- Time series data needs to focus on fast ingestion. That is, you're always inserting new data. Most often, these are append operations where you're adding only recent time series data, although users do sometimes need historical backfill, and with sensor data use cases, we frequently see lagged data collection. Even with the latter, you're usually appending recent data to each individual series.
- High-precision data is kept for some short period of time with longer retention periods for summary data at medium or lower precision. One way to think about this is the raw high-precision samples and summaries for 5 minute and 1 hour intervals. Operationally this means that you must be constantly deleting data from the database. The high-precision data is resident for a short window and then should be evicted. This is a very different workload than what a normal database is designed to handle.
- An agent or the database itself must continuously compute summaries from the high-precision data for longer term storage. These could be simple aggregates like first, last, min, max, sum, count or could include more complex computations like percentiles or histograms.
- The query pattern of time series can be quite different from other database workloads. In most cases, a query will pull a range of data back for a requested time range. For databases that can compute aggregates and downsamples on the fly, they will frequently churn through many records to pull back the result set for a query. Quickly iterating through many records to compute an aggregate is critical for the time series use case.
Common applications of time series data include:
- Server and application monitoring
- Real-time analytics
- IoT sensor data monitoring and control
The data for each of these is different, but they frequently take the same general shape. In the server monitoring case we’re taking regular measurements for tracking things like CPU, hard disk, network, and memory utilization. It’s also common to take measurements to instrument third-party services like Apache, Redis, NGINX, MySQL, and many others. Series usually have metadata information like the server name, the region, the service name, and the metric being measured. It’s not uncommon to have 200 or more measurements (unique series) per server. Let’s get a rough idea of a DevOps data set for a day. Say we have 100 servers and each has 200 unique measurements to collect. That means we have 20,000 unique series. Further, let’s say that we’re collecting this data every 10 seconds. That means in the course of a day we’re collecting 86,400 / 10 = 8,640 values per series for a total of 20,000 * 8,640 = 172,800,000 values for each day.
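The arithmetic above can be checked, and re-run for other fleet sizes, with a few lines of Python. The function name `values_per_day` is hypothetical; the inputs are the ones from the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def values_per_day(servers, series_per_server, interval_seconds):
    """Return (unique series, samples per series per day, total values per day)."""
    series = servers * series_per_server
    samples_per_series = SECONDS_PER_DAY // interval_seconds
    return series, samples_per_series, series * samples_per_series

series, samples, total = values_per_day(100, 200, 10)
print(series, samples, total)  # 20000 8640 172800000
```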
The Problem of Using a SQL Database for Time Series
Many of our users started off working with time series by storing their data in common SQL RDBMSes like PostgreSQL or MySQL. Generally they find this works for a time, but things start to fall apart as the scale of the data increases. If we take our server monitoring example from before, there are a few ways to structure things, but there are some challenges.
|Approach|Challenge|Further challenge|
|---|---|---|
|Create a single table to store everything with the series name, the value, and a time.|Requires a separate lookup index if we want to search on anything other than the specific name (like server, metric, service, etc.). This naive implementation would have a table that gets 172M new records per day, which would quickly cause a problem because of the sheer size of the table.|With time series, it's common to have high-precision data that is kept around only for a short period of time. This means that soon you'll be doing just as many deletes as inserts, which isn't something a traditional DB is designed to handle well.|
|Create a separate table per day or some other period of time.|Requires the developer to write application code to tie the data from the different tables together.|More code must be written to compute summary statistics for lower-precision data and to periodically drop old tables.|
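As a rough illustration of the single-table approach, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for PostgreSQL or MySQL. The table name `points`, the column names, and the retention cutoff are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE points (series TEXT NOT NULL, time INTEGER NOT NULL, value REAL NOT NULL)"
)
# Without this index, range queries on one series scan the whole table.
conn.execute("CREATE INDEX points_series_time ON points (series, time)")

# Ten samples for one series, taken every 10 seconds.
rows = [("cpu.server01", t, 0.5) for t in range(0, 100, 10)]
conn.executemany("INSERT INTO points VALUES (?, ?, ?)", rows)

# Retention: once the table is full, every batch of inserts is eventually
# matched by a batch of deletes like this one.
cutoff = 50
deleted = conn.execute("DELETE FROM points WHERE time < ?", (cutoff,)).rowcount
remaining = conn.execute("SELECT COUNT(*) FROM points").fetchone()[0]
print(deleted, remaining)  # 5 5
```

Even in this toy form, the application owns the retention policy and the index design, and searching by server or metric would need additional columns or lookup tables.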
Then there’s the issue of scaling past what a single SQL server can handle. Sharding segments of the time series to different servers is a common technique but requires more application-level code to handle it.
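The application-level routing that sharding requires might look something like the sketch below. Hashing on the series name is only one possible scheme (sharding by time range is another), and the server names and function names are hypothetical.

```python
import zlib

SERVERS = ["db0", "db1", "db2"]  # hypothetical SQL servers

def shard_for(series_name, shard_count):
    """Map a series name to a shard index.

    crc32 is used because it is stable across processes, unlike Python's
    built-in hash() for strings, which is randomized per process by default.
    """
    return zlib.crc32(series_name.encode("utf-8")) % shard_count

def route(series_name):
    """Pick the server that every read and write for this series must go to."""
    return SERVERS[shard_for(series_name, len(SERVERS))]

target = route("cpu.server01")
```

A query that spans many series now has to fan out to several servers and merge the results, and changing the number of servers remaps most series. Both are more code for the application to carry.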
Conclusion: Relational technologies were not designed for the specific demands of time series, and trying to bend them to that purpose is impractical.
Building on Distributed Databases
After initially working with a more standard relational database, many will look at distributed databases like Cassandra or HBase. As with the SQL variant, building a time series solution on top of Cassandra requires quite a bit of application-level code.
First, you need to decide how to structure the data. Rows in Cassandra get stored to one replication group, which means that you need to think about how to structure your row keys to ensure that the cluster is properly utilized without creating hot spots for writes and reads. Then, once you’ve decided how to arrange the data, you need to write application logic to do additional query processing for the time series use case. You’ll also end up writing downsampling logic to handle creating lower-precision samples that can be used for longer-term visualizations. Finally, once you have the basics wired up, it will be a continual chore to ensure that you get the query performance you need when querying many time series and computing aggregates across different dimensions.
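To give a feel for the row key decision, here is a sketch of one common pattern: bucketing each series by day so that no single partition grows without bound. This is an assumption for illustration, not a scheme the post prescribes; the function names are hypothetical and no Cassandra driver is involved.

```python
from datetime import datetime, timedelta, timezone

def partition_key(series_id, ts):
    """Key rows by (series, day) so a series' writes move to a new partition each day."""
    return (series_id, ts.strftime("%Y-%m-%d"))

def partitions_for_range(series_id, start, end):
    """List every partition key that a query over [start, end] has to read."""
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    keys = []
    while day <= end:
        keys.append(partition_key(series_id, day))
        day += timedelta(days=1)
    return keys

start = datetime(2024, 1, 1, 22, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 3, 1, 0, tzinfo=timezone.utc)
keys = partitions_for_range("cpu.server01", start, end)
# [('cpu.server01', '2024-01-01'), ('cpu.server01', '2024-01-02'), ('cpu.server01', '2024-01-03')]
```

The query side shows the cost: a 27-hour range touches three partitions, and the application has to issue those reads and stitch the results together itself.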
Conclusion: Writing all of this application code is frequently a multi-month project requiring competent backend engineers.
Advantages of Building Specifically for Time Series
So this brings us back around to the point of this post: Why build a Time Series Data Platform?
One of our goals when building a Time Series Platform was to optimize for a user’s or developer’s time to value. That is, the faster they get their problem solved and are up and running, the better the experience will be. That means that if we see users frequently writing code or creating projects to solve the same problems, we’ll try to pull that into our platform or database. The less code a developer has to write to solve their problem, the faster they’ll be done.
Time is Peculiar
Other than the obvious usability goals, we also saw that we could optimize the database around some of the peculiarities of time series. The data is insert-only, it needs to be aggregated and downsampled, and high-precision data needs to be evicted automatically when users want to free up space. We could also build compression that was optimized for time series data, and we organized the data in a way that indexes tags for efficient queries. At the database level, there were many optimizations available to us.
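The downsample-then-evict cycle described here can be sketched in a few lines of Python. This is an illustration of the workload under simple assumptions (integer timestamps in seconds, everything in memory), with hypothetical function names; it says nothing about how the database implements it internally.

```python
def downsample(points, window):
    """Summarize (time, value) points into fixed windows of `window` seconds."""
    summaries = {}
    for t, v in sorted(points):
        start = t - (t % window)  # start of the window this point falls in
        s = summaries.get(start)
        if s is None:
            summaries[start] = {"first": v, "last": v, "min": v, "max": v, "sum": v, "count": 1}
        else:
            s["last"] = v
            s["min"] = min(s["min"], v)
            s["max"] = max(s["max"], v)
            s["sum"] += v
            s["count"] += 1
    return summaries

def evict(points, now, retention):
    """Keep only the points that are newer than the retention window."""
    return [(t, v) for t, v in points if t >= now - retention]

raw = [(0, 1.0), (100, 3.0), (299, 2.0), (300, 5.0), (650, 4.0)]
five_min = downsample(raw, 300)           # 5-minute summaries, kept long term
raw = evict(raw, now=700, retention=400)  # raw samples older than 400 s are dropped
```

After this runs, `five_min` still describes the first five minutes (three samples, minimum 1.0, maximum 3.0) even though the raw points from that window are gone.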
Going Beyond a Database to Make Development Easier
The other advantage in building specifically for time series is that we could go beyond the database. We’ve found that most users run into a common set of problems they need to solve: how to collect the data, how to store it, how to process and monitor it, and how to visualize it.
We’ve also found that having a common API makes it easier for the community to build solutions around our stack. We have the line protocol to represent time series data, our HTTP API for writing and querying, and Kapacitor for processing. This means that over time, we can have pre-built components for the most common use cases.
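For readers who have not seen the line protocol, a point is written as a measurement name, optional comma-separated tags, one or more fields, and a timestamp. The Python sketch below is a deliberately simplified formatter: the helper names are hypothetical, it handles only float field values, and it does not cover every escaping rule. In practice a client library would do this for you.

```python
def _escape(text, chars):
    """Backslash-escape each of the given characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text

def to_line(measurement, tags, fields, timestamp_ns):
    """Format one point with float fields as a line protocol string (simplified)."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # tags are written in sorted key order
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + repr(float(value))
        for key, value in sorted(fields.items())
    )
    return f"{head} {body} {timestamp_ns}"

line = to_line(
    "cpu",
    {"region": "us west", "host": "server01"},
    {"usage_idle": 92.5},
    1700000000000000000,
)
print(line)  # cpu,host=server01,region=us\ west usage_idle=92.5 1700000000000000000
```

Because the format is plain text with the metadata inline, any tool that can produce such lines can feed data into the stack.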
We find that we can get better performance than more generalized databases while also reducing the developer effort to get a solution up by at least an order of magnitude. Doing something that might have taken months to get running on Cassandra or MySQL could take as little as an afternoon using our stack. And that’s exactly what we’re trying to achieve.
By focusing on time series, we can solve problems for application developers so that they can focus on the code that creates unique value inside their app.
About the Author: Paul is the creator of InfluxDB. He has helped build software for startups, large companies and organizations like Microsoft, Google, McAfee, Thomson Reuters, and Air Force Space Command. He is the series editor for Addison Wesley's Data & Analytics book and video series. In 2010 Paul wrote the book Service Oriented Design with Ruby and Rails for Addison Wesley. In 2009 he started the NYC Machine Learning Meetup, which now has over 10,000 members. Paul holds a degree in computer science from Columbia University.