The need for a purpose-built time series platform
Storing time-stamped or time series data is not new. We have been storing time series in databases since the advent of the computer. Initially, time series were just added to existing general-purpose datastores, like MySQL. First-generation time series platforms, like kdb+, RRDtool and Graphite, were introduced almost two decades ago, primarily to analyze and provide insight into individual systems within data centers and to serve edge-case use cases like high-frequency financial data, stock volatility and algorithmic trading applications.
Why is time series the fastest growing database category?
According to industry analysts and database tracking sites like DB-Engines, time series databases are the fastest growing segment in the database market. The reason is clear: there has been an explosion in the amount of data being created as all things in the virtual world (databases, networks, containers, systems, applications) and the physical world (homes, cities, factories, power grids) are instrumented, creating relentless streams of time-stamped data for organizations. What was formerly relevant only to a few specific use cases is now vital to every business that uses insights from data to inspire better customer experiences, automate factory floors, and build previously unthinkable applications.
This means that the underlying data platforms need to evolve to support the new workloads — more data points, more data sources, more monitoring, more controls, and the need for real value in real time. And that’s where purpose-built time series platforms step in.
Runtime requirements for a next-generation time series platform
Designed for volume
As stacks, sensors and systems are increasingly instrumented, they produce larger volumes of data, collected at higher frequencies. Data ingest rates of millions of points per second are commonplace. All this data needs to be ingested in a non-blocking way and stored in a compressible form to conserve finite compute and storage resources.
Designed for real-time actions
Today’s world is mercilessly real-time. Customers, businesses and ops need access to data in real time. They need to identify patterns, establish predictions, control systems, and get the insights necessary to stay ahead of the curve. Data should be available and queryable as soon as it is written. Basic monitoring is too passive. Advances in machine learning and analytics make automation and self-regulating actions a reality. The time series platform must be able to trigger actions, perform automated control functions, be self-regulating, and provide the basis for performing actions based on predictive trends.
Designed for cloud scale
The world demands that systems are available 24x7x365 and automatically scale up and down depending on demand. They must be deployable across different cloud, on-premise, and edge infrastructures without undue complexity. They need to make optimal use of resources, for instance keeping only what is needed in memory, compressing data on disk when necessary, and moving less relevant data to cold storage for historical analysis.
Developer requirements for a next-generation time series platform
The fast adoption of technology by developers and DevOps engineers is critical for business success. Enterprises can no longer dictate the technology developers use; instead, savvy developers are experimenting and identifying solutions that they can bring to the enterprise. Next-generation platforms must be built with developer requirements in mind.
Developers need a platform that is ready to go on the cloud or via download in just a few clicks. They need to see productivity in minutes. The platform needs to understand agile development-test-deployment cycles and repeat them all in minutes, not days or weeks. The platform must be elegant and simple to use, free of external dependencies, yet open and flexible enough for complex deployments.
Open source core
Open source is not about licensing — it’s about sharing ideas and information, participating and collaborating on solutions, and a community where the whole is greater than the sum of its parts. It’s about full transparency where nothing is hidden.
Built to scale
Developers must be able to start small and be confident that their platform can evolve and grow to meet both their operational needs and the enterprise’s business goals. They must be able to go from prototype to production in days, not years, and to scale up to handle more data sources, larger data volumes and more users as their solutions become wildly successful.
We built InfluxDB from the ground up to support next-generation requirements and continue to evolve the product to meet these requirements. InfluxDB is built on an open source core and completely written in Go. You can download it and get up and running, with zero external dependencies, in just minutes. Or we can run it for you on InfluxDB Cloud, where you can rely on the InfluxData team to keep the software up to date and fully optimized. Developer happiness is important to us, and we have an active, dedicated community that we recommend you join to network with like-minded developers, share best practices, and get answers to your questions.
InfluxDB is the leading time series platform, offering a highly scalable data ingestion and storage engine that is highly efficient at collecting, storing, querying, visualizing, and taking action on data streams in real time. It offers downsampling and data retention policies to support keeping high-value, high-precision data in memory and moving lower-value data to disk. It is built in a cloud-native fashion, providing scalability across multiple deployment topologies, including cloud, on-premises, and hybrid environments.
We are never done. We are constantly innovating to support these next-generation workloads and focus on developer happiness and productivity. Our goal is to enable developers to focus on the business applications they’re building rather than the infrastructure required to support the solution. That goal is why we created a purpose-built time series platform rather than just building on top of an existing general-purpose database.
Why general-purpose databases fail by design
Many general-purpose databases are adding some support for time-stamped data, and while it might be tempting to use those, they are fundamentally not designed for the new workloads and real-time enterprise requirements. Several types of databases get pulled up for comparison; most of these are distributed databases like Cassandra, MongoDB or HBase.
Comparing these general-purpose databases to InfluxDB, a purpose-built time series platform, reveals some stark differences. These databases require significant investment in developer time and code to recreate the functionality provided out of the box with InfluxDB. Developers will need to:
- Write code for sharding data across clusters, aggregation and downsampling functions, data eviction and lifecycle management, and summarization
- Create an API to write and query their new service
- Write tools for data collection
- Introduce a real-time processing system and write code for monitoring and alerting
- Write a visualization engine to display the time series data to the user
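To give a sense of the kind of code this list implies, here is a minimal sketch of just two of the items, downsampling and data eviction, written against plain in-memory lists. The function names, the (timestamp, value) tuple layout, and the choice of a mean as the aggregate are assumptions made for illustration; they are not InfluxDB’s implementation or the API of any of the databases discussed here.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of the bucket it falls into.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def evict_expired(points, now, retention_seconds):
    """Keep only the points that fall inside the retention window."""
    cutoff = now - retention_seconds
    return [(ts, value) for ts, value in points if ts >= cutoff]


# Hypothetical raw samples: (unix_seconds, value)
raw = [(0, 1.0), (10, 3.0), (60, 5.0), (70, 7.0), (130, 9.0)]

rollup = downsample(raw, bucket_seconds=60)
recent = evict_expired(raw, now=130, retention_seconds=60)

print(rollup)
print(recent)
```

Even this toy version leaves out everything that makes the problem hard in production: persistence, concurrent writers, late-arriving points, and distributing the work across a cluster.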
InfluxDB vs. Elasticsearch
Elasticsearch is designed for search and is an excellent choice for that function. However, for time series data, it’s like putting a square peg into a round hole.
Its API is difficult to work with, so developers take much more time to get up and running. InfluxDB far outperforms the Elasticsearch platform when handling time series data. For write throughput, InfluxDB typically outperforms Elasticsearch by about 3.8x, depending on the schema. Query speed on specific time series is 7.7x slower with Elasticsearch than with InfluxDB, depending on the range of time being queried. Finally, on-disk size is 9-12x larger on Elasticsearch than on InfluxDB if you need to store the raw data for querying later. Even with a config that summarizes the data before it goes into the database, Elasticsearch’s on-disk size is larger than InfluxDB’s. It really comes down to using the right tool for the job: Elasticsearch is great for search but not for time series data and analytics.
InfluxDB vs. MongoDB
MongoDB is an open source, document-oriented database, colloquially known as a NoSQL database, written in C and C++. Though it’s not considered a true time series database (TSDB), it’s often promoted as capable of handling time series workloads.
It offers modeling primitives in the form of timestamps and bucketing, which give users the ability to store and query time series data. MongoDB is designed as a “schema-less” data store in which each object may have a different structure. In practice, MongoDB is typically used to store large, variable-sized payloads represented as JSON or BSON objects. Because of both its generality and its schema-less design, MongoDB isn’t able to take advantage of the highly structured nature of time series data. In particular, time series data is composed of tags (key/value string pairs) and sequences of time-stamped numbers (the values being measured). As a result, MongoDB must be specifically configured to work with time series data, and even then it handles that data inefficiently. Again, it comes down to using the right tool for the job: MongoDB is great for documents and arbitrary objects but not for time series data and real-time analytics at scale.
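The structure described above, a fixed set of string tags plus a sequence of time-stamped numbers, can be sketched as a small data type. This is an illustration only: the Series class, its field names, and the key format are hypothetical and do not represent the storage layout of InfluxDB or MongoDB.

```python
from dataclasses import dataclass, field


@dataclass
class Series:
    """One time series: tags are stored once, points are compact pairs."""
    name: str
    tags: dict                                   # key/value string pairs
    points: list = field(default_factory=list)   # (timestamp, value) pairs

    def key(self):
        # Sort the tags so the same tag set always yields the same identity,
        # regardless of the order in which the tags were supplied.
        tag_part = ",".join(f"{k}={v}" for k, v in sorted(self.tags.items()))
        return f"{self.name},{tag_part}" if tag_part else self.name

    def append(self, timestamp, value):
        self.points.append((timestamp, float(value)))


# Hypothetical measurement with two tags
cpu = Series("cpu", {"region": "us-west", "host": "server01"})
cpu.append(1473802170, 0.0)
cpu.append(1473802180, 1.3)

print(cpu.key())
print(cpu.points)
```

Because every point in a series shares the same tags, a store that knows this structure can keep the tags once per series and treat the points as a dense, ordered column of numbers. A store built for arbitrary documents has no such guarantee to exploit.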
InfluxDB vs. Cassandra
Cassandra is a distributed, non-relational database written in Java. Developed by Facebook, the project was open-sourced in 2008 and became a top-level Apache Software Foundation project in 2010.
It’s a general-purpose platform that provides a partitioned row store, which offers features of both key-value and column-oriented data stores. Though it provides excellent tools for building a scalable, distributed database, Cassandra lacks most key features of a TSDB. Thus, a common pattern is to build application logic on top of Cassandra to handle the missing functionality. Cassandra requires major upfront engineering effort to be useful. It is probably possible to force-fit Cassandra to handle time series data at scale, but why invest time and effort? And even if you do, performance outcomes leave a lot to be desired.
First-generation TSDBs vs. InfluxDB
Graphite is an older time series database monitoring tool that runs equally well on low-end hardware or cloud infrastructure.
Teams use Graphite to track the performance of their websites, applications, business services and networked servers. Graphite was originally designed and written by Chris Davis at Orbitz in 2006 as a side project that ultimately grew to become the company’s foundational monitoring tool. It marked the start of a new generation of monitoring tools, making it easier than ever to store, retrieve, share, and visualize time series data. In 2008, Orbitz allowed Graphite to be released under the open source Apache 2.0 license. Graphite stores numeric samples for named time series and expresses a value and its associated metadata with period delimited strings. These are commonly called
prod.sequencer-142.ingen.com.cpu.user 0.0 1473802170
prod.sequencer-142.ingen.com.cpu.nice 1.3 1473802170
prod.sequencer-142.ingen.com.cpu.system 2.3 1473802170
With this method, the metadata associated with the measurements (the CPU measurements in the example above) is transmitted multiple times in every interval. That means that for something like a standard Sensu CPU check, Graphite will easily emit 6-10 different metrics in the above format for each CPU on each host. That extra metadata adds up quickly. Additionally, in Graphite, each of the strings is stored in a different file and takes up index space.
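The repetition is easy to see by parsing the three example lines. The sketch below assumes each line has the shape shown above (a period-delimited path, a value, and a Unix timestamp separated by whitespace); the helper name parse_line is made up for this example.

```python
import os

# The three example metrics shown above.
lines = [
    "prod.sequencer-142.ingen.com.cpu.user 0.0 1473802170",
    "prod.sequencer-142.ingen.com.cpu.nice 1.3 1473802170",
    "prod.sequencer-142.ingen.com.cpu.system 2.3 1473802170",
]


def parse_line(line):
    """Split one line into (path, value, timestamp)."""
    path, value, timestamp = line.split()
    return path, float(value), int(timestamp)


parsed = [parse_line(line) for line in lines]
paths = [path for path, _, _ in parsed]

# commonprefix compares character by character, so it also works on
# period-delimited metric paths, not just filesystem paths.
shared = os.path.commonprefix(paths)

# Every line after the first re-sends the shared prefix.
repeated_chars = len(shared) * (len(paths) - 1)

print(shared)
print(repeated_chars)
```

In this sample, the only part of each path that differs is the final segment (user, nice, system), and all three lines carry the same timestamp; everything else is sent again with every metric.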
OpenTSDB is a scalable, distributed time series database written in Java and built on top of HBase. It was originally authored by Benoît Sigoure at StumbleUpon in 2010 and open-sourced under LGPL. OpenTSDB is not a standalone time series database. Instead, it relies upon HBase as its data storage layer, so the OpenTSDB Time Series Daemons (TSDs in OpenTSDB parlance) effectively provide the functionality of a query engine with no shared state between instances. This can require a significant amount of additional operational cost and overhead to manage in a production deployment.
In OpenTSDB’s data model, a time series is identified by a set of arbitrary key-value pairs; each value belongs to exactly one measurement and may have tags associated with it. All data for a metric is stored together, limiting the cardinality of metrics. OpenTSDB does not have a full query language but allows simple aggregation and math via its API. OpenTSDB supports up to millisecond resolution, which is increasingly important as sub-millisecond operations become more common, and it allows the freedom to accurately store timestamps for events that may occur in close temporal proximity to one another. One caveat about OpenTSDB is that it is primarily designed for generating dashboard graphs, not for satisfying arbitrary queries or for storing data. These limitations affect how it should be used.
Riak is a distributed NoSQL key-value data store that offers high availability, fault tolerance, operational simplicity and scalability. Riak TS is a key/value store optimized for fast reads and writes of time series data. And like all time series databases, Riak TS is built to handle the unique needs of time series applications ensuring high availability, data accuracy and scale.
kdb+ is a column-based relational time series database with in-memory capabilities, developed and sold by Kx Systems. Kdb+ has nanosecond timestamp precision, time-ordered querying, and aggregation across time buckets.
Available as InfluxDB open source, InfluxDB Cloud & InfluxDB Enterprise