Time series database (TSDB) explained
What is a time series database?
A time series database (TSDB) is a database optimized for time-stamped or time series data. Time series data are simply measurements or events that are tracked, monitored, downsampled, and aggregated over time. This could be server metrics, application performance monitoring, network data, sensor data, events, clicks, trades in a market, and many other types of analytics data.
A time series database is built specifically for handling time-stamped metrics, events, and measurements. A TSDB is optimized for measuring change over time. The properties that make time series data very different from other data workloads are data lifecycle management, summarization, and large range scans of many records.
Why is a time series database important now?
Time series databases are not new, but the first-generation time series databases were primarily focused on financial data, the volatility of stock trading, and systems built to support trading. Yet the fundamental conditions of computing have changed dramatically over the last decade. Everything has become compartmentalized. Monolithic mainframes have vanished, replaced by serverless functions, microservices, and containers.
Today, everything that can be a component is a component. In addition, we are witnessing the instrumentation of every available surface in the material world — streets, cars, factories, power grids, ice caps, satellites, clothing, phones, microwaves, milk containers, planets, human bodies. Everything has, or will have, a sensor. So now, everything inside and outside the company is emitting a relentless stream of metrics and events or time series data.
This means that the underlying platforms need to evolve to support these new workloads — more data points, more data sources, more monitoring, more controls. What we’re witnessing, and what the times demand, is a paradigmatic shift in how we approach our data infrastructure and how we approach building, monitoring, controlling, and managing systems. What we need is a performant, scalable, purpose-built TSDB.
What distinguishes the time series workload?
Time series databases have key architectural design properties that make them very different from other databases. These include: time-stamp data storage and compression, data lifecycle management, data summarization, ability to handle large time series dependent scans of many records, and time series aware queries.
For example, with a time series database it is common to request a summary of data over a large time period. This requires scanning a range of data points to perform some computation, such as the percentage increase of a metric this month over the same period in each of the last six months, summarized by month. This kind of workload is very difficult to optimize for with a distributed key-value store. TSDBs are optimized for exactly this use case and can deliver millisecond-level query times over months of data.

Another example: with time series databases, it’s common to keep high-precision data around for only a short period of time. That data is aggregated and downsampled into longer-term trend data. This means that every data point that goes into the database will have to be deleted once its retention period is up. This kind of data lifecycle management is difficult for application developers to implement on top of regular databases. They must devise schemes for cheaply evicting large sets of data and constantly summarizing that data at scale. With a time series database, this functionality is provided out of the box.
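To make the second example concrete, here is a minimal sketch in Python of the two chores described above: downsampling raw points into coarser buckets and evicting raw points past their retention period. The function names, the in-memory lists, and the sample data are assumptions for illustration only; a real TSDB does this work inside its storage engine rather than in application code.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def evict(points, now, retention_seconds):
    """Keep only the points that are newer than the retention cutoff."""
    cutoff = now - retention_seconds
    return [(ts, value) for ts, value in points if ts >= cutoff]


# Hypothetical raw samples: one reading every 10 seconds for two minutes.
raw = [(t, float(t // 10)) for t in range(0, 120, 10)]

trend = downsample(raw, bucket_seconds=60)           # one point per minute
recent = evict(raw, now=120, retention_seconds=60)   # raw data kept for 60 s
```

Even in this toy form, the application has to decide when to run each step, where the downsampled series lives, and how to delete cheaply at scale, which is the burden the paragraph above describes.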
Independent ranking of top 15 time series databases
Time series databases are the fastest growing segment in the database industry. But which time series database is the best and most popular? There are many ways of determining popularity, but an independent website, DB-Engines, ranks databases based on search engine popularity, social media mentions, job postings, and technical discussion volume.
Time series – the fastest growing database category
DB-Engines also ranks time series database management systems (Time Series DBMS) according to their popularity. Time series databases are the fastest growing segment of the database industry over the past year.
Time series databases vs. other databases
Time series databases are often compared to other databases, most commonly distributed databases like Cassandra, MongoDB, or HBase. When comparing time series databases with Cassandra, MongoDB, or HBase, there are some stark differences. First, those databases require a significant investment in developer time and code to recreate the functionality that TSDBs provide out of the box.
Specifically, developers will need to write code to shard the data across the cluster, to aggregate and downsample it, to evict data and manage its lifecycle, and to summarize it. They’ll have to create an API to write to and query their new service, and write tools for data collection. They’ll also need to introduce a real-time processing system and write code for monitoring and alerting. Finally, they’ll need to write a visualization engine to display the time series data to the user.
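As an illustration of just the first item on that list, the sketch below routes points to shards with a stable hash of the series key. It is an assumption-laden toy, not how any particular database shards data: the `shard_for` and `route` helpers and the tuple layout are hypothetical, and real systems also have to handle rebalancing, replication, and time-based partitioning.

```python
import hashlib
from collections import defaultdict


def shard_for(series_key: str, num_shards: int) -> int:
    """Map a series key to a shard index using a stable hash."""
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route(points, num_shards):
    """Group (series_key, timestamp, value) points by destination shard."""
    by_shard = defaultdict(list)
    for series_key, ts, value in points:
        by_shard[shard_for(series_key, num_shards)].append((series_key, ts, value))
    return dict(by_shard)


# Hypothetical points from two series.
points = [
    ("cpu.serverA", 1, 0.5),
    ("cpu.serverB", 1, 0.7),
    ("cpu.serverA", 2, 0.6),
]
routed = route(points, num_shards=4)
```

Every point of a given series lands on the same shard, so a range scan over one series touches one node; everything else on the list still remains to be written.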
Time series databases vs. Elasticsearch
People often ask what separates a time series database from Elasticsearch. Elasticsearch is designed and purpose-built for search, and for that use case it’s an excellent choice. However, for time series data it’s like putting a square peg into a round hole.
The API is difficult to work with, so developers take much longer to get going. Elasticsearch’s performance is also far worse than that of the various time series databases. For write throughput, TSDBs typically outperform Elasticsearch by 5-10x, depending on the schema and the time series database. Query speed on a specific time series is 5-100x worse with Elasticsearch than with a TSDB, depending on the range of time being queried. Finally, on-disk size is 10-15x larger on Elasticsearch than with most time series databases if you need to query the raw data later. With a configuration that summarizes the data before it goes into the database, Elasticsearch’s on-disk size is still 3-4x larger than a time series database’s. Overall, time series databases outperform Elasticsearch on almost every measure.
Time series databases vs. MongoDB
MongoDB is an open-source, document-oriented database, colloquially known as a NoSQL database, written primarily in C++. Though it’s not generally considered a true TSDB, its creators often promote its use for time series workloads.
It offers modeling primitives in the form of timestamps and bucketing, which give users the ability to store and query time series data. MongoDB is a general-purpose document store. MongoDB is intended to store “schema-less” data, in which each object may have a different structure. In practice, MongoDB is typically used to store large, variable-sized payloads represented as JSON or BSON objects. Both because of MongoDB’s generality, and because of its design as a schema-less datastore, MongoDB does not take advantage of the highly-structured nature of time series data. In particular, time series data is composed of tags (key/value string pairs) and sequences of time-stamped numbers (which are the values being measured). As a result, MongoDB must be specifically configured to work with time series data.
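The bucketing pattern mentioned above can be sketched without any database driver. The Python below groups samples for one series into one document per hour, which is the kind of structure an application would then store in a document database. The document layout, field names, and one-hour bucket width are assumptions for illustration, not MongoDB’s own schema.

```python
def bucket_id(series: str, ts: int, bucket_seconds: int = 3600) -> str:
    """Build a document id from the series name and the bucket start time."""
    start = ts - ts % bucket_seconds
    return f"{series}:{start}"


def add_sample(buckets: dict, series: str, tags: dict, ts: int, value: float) -> None:
    """Append one sample to the bucket document that covers its timestamp."""
    key = bucket_id(series, ts)
    doc = buckets.setdefault(
        key, {"_id": key, "series": series, "tags": tags, "samples": []}
    )
    doc["samples"].append({"ts": ts, "value": value})


# Hypothetical usage: two samples in the first hour, one in the second.
buckets = {}
tags = {"host": "serverA"}
add_sample(buckets, "cpu.user", tags, 0, 0.1)
add_sample(buckets, "cpu.user", tags, 60, 0.2)
add_sample(buckets, "cpu.user", tags, 3600, 0.3)
```

The tags are stored once per bucket rather than once per sample, which is the saving bucketing buys; the cost is that the application, not the database, owns this layout.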
Time series databases vs. Cassandra
Cassandra is a distributed, non-relational database written in Java, originally built at Facebook and open-sourced in 2008. It became a top-level project of the Apache Software Foundation in 2010.
It is a general-purpose platform that provides a partitioned row store, which offers features of both key-value and column-oriented data stores. Though it provides excellent tools for building a scalable, distributed database, Cassandra lacks most key features of a time series database. Thus, a common pattern is to build application logic on top of Cassandra to handle the missing functionality. Cassandra requires major upfront engineering effort to be useful. Using it requires familiarity with Cassandra column families, rows, wide rows, CQL, compact storage, partition keys, and secondary indexes. These are general Cassandra concepts and are not particular to the time series use case. Cassandra also requires domain-specific decision-making.
Old time series databases
There are different types of data models when it comes to time series databases, as described below.
Graphite comparison: Graphite is an older time series database monitoring tool that runs equally well on low-end hardware or cloud infrastructure.
Teams use Graphite to track the performance of their websites, applications, business services, and networked servers. It marked the start of a new generation of monitoring tools, making it easier than ever to store, retrieve, share, and visualize time series data. Graphite was originally designed and written by Chris Davis at Orbitz in 2006 as a side project that ultimately grew to be their foundational monitoring tool. In 2008, Orbitz allowed Graphite to be released under the open source Apache 2.0 license. Graphite stores numeric samples for named time series and expresses a value and its associated metadata as a period-delimited string followed by the value and a timestamp. These are commonly called ‘points’:
prod.sequencer-142.ingen.com.cpu.user 0.0 1473802170
prod.sequencer-142.ingen.com.cpu.nice 1.3 1473802170
prod.sequencer-142.ingen.com.cpu.system 2.3 1473802170
With this method, the metadata associated with the various measurements (in the above case, CPU measurements) is transmitted again with every point in every interval. For something like a standard Sensu CPU check, Graphite will easily emit 6-10 different metrics in the above format for each CPU on each host, so that extra metadata quickly adds up. Additionally, in Graphite each of these strings is stored in a separate file and takes up its own index space.
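A short Python sketch makes the repetition visible. It parses the three points shown above, assuming the whitespace-separated "path value timestamp" layout they use, and extracts the dotted prefix that every point repeats. The `GraphitePoint` type and helper names are hypothetical and are not part of Graphite itself.

```python
from typing import NamedTuple


class GraphitePoint(NamedTuple):
    path: str
    value: float
    timestamp: int


def parse_point(line: str) -> GraphitePoint:
    """Parse one 'path value timestamp' line."""
    path, value, timestamp = line.split()
    return GraphitePoint(path, float(value), int(timestamp))


def shared_prefix(points) -> str:
    """Return the leading dotted path components common to all points."""
    common = []
    for parts in zip(*(p.path.split(".") for p in points)):
        if len(set(parts)) != 1:
            break
        common.append(parts[0])
    return ".".join(common)


lines = [
    "prod.sequencer-142.ingen.com.cpu.user 0.0 1473802170",
    "prod.sequencer-142.ingen.com.cpu.nice 1.3 1473802170",
    "prod.sequencer-142.ingen.com.cpu.system 2.3 1473802170",
]
points = [parse_point(line) for line in lines]
prefix = shared_prefix(points)
```

Here `prefix` is the environment, host, and subsystem portion of the path, sent three times along with three copies of the same timestamp, while only the last path component and the value actually differ between points.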
OpenTSDB comparison: OpenTSDB is a scalable, distributed time series database written in Java and built on top of HBase. It was originally authored by Benoît Sigoure at StumbleUpon beginning in 2010 and open-sourced under the LGPL. OpenTSDB is not a standalone time series database. Instead, it relies on HBase as its data storage layer, so the OpenTSDB Time Series Daemons (TSDs in OpenTSDB parlance) effectively provide the functionality of a query engine with no shared state between instances. Running HBase underneath can add a significant amount of operational cost and overhead in a production deployment.

In OpenTSDB’s data model, time series are identified by a set of arbitrary key-value pairs, and each value belongs to exactly one measurement; each value may have tags associated with it. All data for a metric is stored together, limiting the cardinality of metrics. OpenTSDB does not have a full query language but allows simple aggregation and math via its API. OpenTSDB supports up to millisecond resolution. This limit matters more as sub-millisecond operations become more common, because finer resolution is what allows timestamps to be stored accurately for events that occur in close temporal proximity to one another. One caveat about OpenTSDB is that it is primarily designed for generating dashboard graphs, not for satisfying arbitrary queries or for storing data exactly. This has implications for how it should be used.
Riak comparison: Riak is a distributed NoSQL key-value data store that offers high availability, fault tolerance, operational simplicity, and scalability. Riak TS is a key/value store optimized for fast reads and writes of time series data. Like all time series databases, Riak TS is built to handle the unique needs of time series applications, ensuring high availability, data accuracy, and scale.
KDB+ comparison: Kdb+ is a column-based relational time series database with in-memory capabilities, developed and marketed by Kx Systems. Kdb+ has nanosecond timestamp precision, time ordered querying, and aggregation across time buckets.
What makes InfluxDB time series database unique?
InfluxDB was built from the ground up as a purpose-built time series database; it was not repurposed from another kind of database. Time was built in from the beginning. InfluxDB is not just a simple database: it is part of a comprehensive platform that supports the collection, storage, monitoring, visualization and alerting of time series data.
The whole InfluxData platform is built from an open source core. InfluxData is an active contributor to the Telegraf, InfluxDB, Chronograf and Kapacitor (TICK) projects (the “I,C,K” from the TICK Stack is being collapsed into a single binary in InfluxDB 2.0) and also sells InfluxDB Enterprise and InfluxDB Cloud built on this open source core. The InfluxDB data model is quite different from other time series solutions like Graphite, RRD, or OpenTSDB. InfluxDB has a line protocol for sending time series data which takes the following form:
measurement-name tag-set field-set timestamp
The measurement name is a string, the tag set is a collection of key/value pairs where all values are strings, and the field set is a collection of key/value pairs where the values can be int64, float64, bool, or string. On the wire, the tag set is attached to the measurement name with a comma, and spaces separate it from the field set and the timestamp. The measurement name and tag sets are kept in an inverted index, which makes lookups for specific series very fast. For example, if we have CPU metrics:
cpu,host=serverA,region=uswest idle=23,user=42,system=12 1464623548s
Timestamps in InfluxDB can have second, millisecond, microsecond, or nanosecond precision. The microsecond and nanosecond scales make InfluxDB a good choice for use cases in finance and scientific computing where other solutions would be excluded. Compression varies with the level of precision the user needs.

On disk, the data is organized in a columnar-style format in which contiguous blocks of time are stored per measurement, tag set, and field. Each field is therefore laid out sequentially on disk for blocks of time, which makes calculating aggregates on a single field a very fast operation.

There is no limit to the number of tags and fields that can be used. Other time series solutions don’t support multiple fields, which can make their network protocols bloated when transmitting data with shared tag sets. Most other time series solutions only support float64 values, which means the user is unable to encode additional metadata along with the time series. Even OpenTSDB and KairosDB, which support tags (unlike Graphite and RRD), have limitations on the number of tags that can be used: at around 5 to 6 tags, the user will start seeing hot spots within their cluster of HBase or Cassandra machines. InfluxDB doesn’t have this limitation.

The InfluxDB data model is designed specifically for time series. It pushes the developer in the right direction to get good performance out of the database by indexing tags and keeping fields unindexed. It’s flexible in that many data types are supported, and the user can have many fields and tags.
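The data model can be illustrated with a small Python sketch that assembles a line from a measurement name, a tag set, a field set, and a timestamp. This is a simplified formatter written for this article, not an InfluxDB client: the `to_line` and `format_field` helpers are hypothetical, escaping of spaces and commas in names and tag values is omitted, and the timestamp is written as a bare integer on the assumption that its precision is declared separately when the data is written.

```python
def format_field(value) -> str:
    """Render one field value: bool, int64, float64, or string."""
    if isinstance(value, bool):          # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"               # a trailing 'i' marks an integer field
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'                # string fields are double-quoted


def to_line(measurement: str, tags: dict, fields: dict, timestamp: int) -> str:
    """Build 'measurement,tag-set field-set timestamp' (no name escaping)."""
    tag_part = "".join(f",{key}={tags[key]}" for key in sorted(tags))
    field_part = ",".join(f"{key}={format_field(val)}" for key, val in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp}"


line = to_line(
    "cpu",
    {"region": "uswest", "host": "serverA"},
    {"idle": 23.0, "user": 42.0, "system": 12.0},
    1464623548,
)
```

All three CPU values travel in one line that shares a single tag set and timestamp, which is the multi-field advantage described above. Tags are sorted by key here as a tidy convention; the fields keep the order in which they were supplied.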
Available as InfluxDB open source, InfluxDB Cloud & InfluxDB Enterprise