Announcing InfluxDB IOx - The Future Core of InfluxDB Built with Rust and Arrow
On November 12, 2013, I gave the first public talk about InfluxDB titled: InfluxDB, an open source distributed time series database. In that talk I introduced InfluxDB and outlined what I meant when I talked about time series: specifically, it was any data that you might ask questions about over time. As examples I presented metrics (which is what people normally think of when you talk about time series), along with financial market data, user analytics data, log data, events, and even Twitter timelines. More broadly, I claimed that all data you perform analytics on is time series data. Meaning, anytime you’re doing data analysis, you’re doing it either over time or as a snapshot in time.
In this post, I’m going to lay out a vision for the future of InfluxDB and introduce you to a new project that will form the basis of it: InfluxDB IOx (pronounced eye-ox, short for iron oxide). As the title of the post says, this new project is written in Rust (iron oxide, natch) with Apache Arrow as the core. Before I do a deeper dive into the project and its underlying technology, I’d like to look at what the original goals of InfluxDB were and how things have changed since its introduction to the world seven years ago.
But first, here are answers to some questions you might have on reading the headline:
- InfluxDB 2.x Open Source will continue forward in parallel with regular point releases.
- InfluxDB IOx will support SQL (natively), InfluxQL, and Flux (via 2.0 and APIs with separate daemons).
- InfluxDB IOx will become an optional storage backend for InfluxDB in a future point release.
- InfluxDB Cloud customers will have InfluxDB IOx as an optional backend for new buckets early next year.
- InfluxDB Cloud customers will have a seamless, zero-downtime transition to this new technology at some point in 2021.
- InfluxDB Enterprise customers will have a commercially supported version of InfluxDB IOx and InfluxDB Enterprise in the latter half of 2021.
- InfluxDB IOx will have its own builds and can be run separately from the rest of the platform.
My original vision for InfluxDB was that it would be used for all kinds of analytic tasks, both real-time and larger scale. That it would be useful for server and application performance monitoring, sensor data applications and other analytics applications. The basic insight was that time series is a useful abstraction for building applications in these domains and InfluxDB would form the core infrastructure for those applications.
InfluxDB as it exists today is a great database for the metrics use case. It’s also quite good for analytics, with the caveat that you do not have data with significant cardinality. That is, the tags you have defined don’t have too many unique values. With the addition of Flux as a lightweight query and scripting language, you can now do complex analytics within the database and even interact with other third-party APIs and databases at query time.
With users doing more in the database, the limitation on cardinality is something that is coming up more frequently as a blocker. For example, we’d like to be able to store and work with distributed tracing data, but this type of use case is out of reach for the current underlying InfluxDB architecture. We’d also like to be welcoming to a broader audience of users that might prefer to work in SQL and other scripting languages.
On the distributed side of things, InfluxDB as it exists in open source today is not a distributed database and doesn’t give users tools to create a distributed operational environment with it. We held out that functionality for our commercial offerings in our Cloud and Enterprise products. From the perspective of our open source efforts, this was an unfortunate, but necessary limitation at the time since we needed to build a business to be able to support all of the open source work we do.
In the intervening seven years since the introduction of InfluxDB, we’ve seen the addition of many other time series databases to the open and closed source landscape. We’ve also seen a significant change in the infrastructure software landscape with Kubernetes rising to prominence, the decline of Hadoop and the rise of more generic object storage with more flexible software computing stacks built on top.
Requirements for the next evolution of InfluxDB
With all these changes in the broader ecosystem and the increasing needs of our users, we deliberated over where InfluxDB should go and settled on a set of higher-level requirements to work towards.
- No limits on cardinality. Write any kind of event data and don't worry about what a tag or field is.
- Best-in-class performance on analytics queries in addition to our already well-served metrics queries.
- Separate compute from storage and tiered data storage. The DB should use cheaper object storage as its long-term durable store.
- Operator control over memory usage. The operator should be able to define how much memory is used for each of buffering, caching, and query processing.
- Operator-controlled replication. The operator should be able to set fine-grained replication rules on each server.
- Operator-controlled partitioning. The operator should be able to define how data is split up amongst many servers and on a per-server basis.
- Operator control over topology including the ability to break up and decouple server tasks for write buffering and subscriptions, query processing, and sorting and indexing for long term storage.
- Designed to run in an ephemeral containerized environment. That is, it should be able to run with no locally attached storage.
- Bulk data export and import.
- Fine-grained subscriptions for some or all of the data.
- Broader ecosystem compatibility. Where possible, we should aim to use and embrace emerging standards in the data and analytics ecosystem.
- Run at the edge and in the datacenter. Federated by design.
- Embeddable scripting for in-process computation.
This is a fairly broad set of requirements. Also, requirements like the ones on replication and partitioning currently don’t fall under our open source umbrella. There are many ramifications to attempting to build towards these requirements. The first I’d like to talk about is licensing.
Why MIT & Apache 2 licenses matter
The open source software we produce at InfluxData is licensed under the permissive MIT license. We’ve long held the view that our open source code should be truly open and our commercial code should be separate and closed. As I mentioned before, we’ve held back high availability and clustering features for our commercial offering. Given that our new requirements call out replication, data partitioning, and other distributed aspects, we had to revisit this decision again.
One of the reasons we created InfluxDB in the first place was because we saw many developers and organizations re-creating the same wheel when it came to distributed time series databases. We thought then, as we do now, that time series is a specific use case that benefits from a system designed for it explicitly.
This was my experience when building time series APIs on top of Cassandra, Redis and other systems. I ended up having to write a significant amount of application level code to enable the kind of functionality that I needed. I thought InfluxDB could be that shared infrastructure.
However, without distributed features in the core open source project, this left a significant gap in the market. This is evidenced by the appearance of so many time series databases over the last few years. If we want InfluxDB to be the building block for future projects that need to work with and operate on time series at scale, we need to find a way to open source more, rather than less.
Further, this work must be licensed under a true open source license that is permissive. If large companies can’t adopt it, they’ll end up inventing it themselves. If other companies can’t build their business on it and expose its API to their users, they’ll end up building it themselves. If cloud providers won’t host it, they’ll end up creating their own versions that may not be compatible and may not be built upon open APIs.
Infrastructure projects that license under source available or restricted permissions are by their very nature evolutionary dead ends. A limited set of users can adopt the software, and an even more restricted set of users can actually build a product on top of it. They can often use restricted license projects for internal purposes, but often not for external customer-facing ones. These limitations mean that many developers have no choice but to build something else either from scratch or from other projects that are actually open source.
I understand why companies choose to go this route, but it’s severely limiting. It’s also a shame since many of them built from a core that was actually open source to begin with. Community and open source aren’t about how much value your individual company manages to capture. Open source is about creating a cambrian explosion in software that creates value that far outstrips what a single vendor manages to monetize. Open source is not a zero sum game. A larger community and ecosystem brings more opportunity to all vendors.
We have a vision of InfluxDB becoming the basis for countless analytics, sensor, monitoring and data analysis projects and companies. The only way this might become a reality is if we come from a starting point that is extensible, adoptable and commercializable by anyone.
In service of this larger goal, InfluxDB IOx will be dual-licensed under MIT & Apache 2 with no restrictions.
That being said, InfluxDB IOx is designed to run as a loosely federated shared-nothing architecture. The operator should have full control over the behavior of the individual servers. This also means that any production scenario involving multiple servers will likely need an additional piece of software for operations.
In the most basic scenarios, this could be as simple as some shell scripts and a cron job. In more complex scenarios involving ephemeral environments and many servers, this can be a complex piece of software. This operational tooling is where we will have our commercial offering. This will come first in our Cloud product and later as an on-premise offering as part of InfluxDB Enterprise.
The design of the system was deliberate in terms of how the architecture should work, but also how the commercialization should work. We want our Cloud product to run the exact same bits as the open source project, not a fork. We want the on-premise commercial offering to be complementary to the open source, not a replacement.
Conway’s Law says you ship your org chart. Dix’s maxim says your licensing strategy is your commercialization strategy, whether by accident or design. With InfluxDB IOx, ours is to have permissively licensed open source software that is maximally useful for its project goal paired with commercial software that is complementary to the open source.
Why InfluxDB IOx is a new project
There are a few requirements that make this effort something that we’d be unable to build towards incrementally. In an ideal world, you can refactor and incrementally build on your existing code base to get to where you want to go. This is great for evolutionary improvement, but difficult for significant leaps forward.
The toughest requirement to meet with how InfluxDB is currently built is for unlimited cardinality. Here’s how data is organized in InfluxDB. Data is sent via our Line Protocol and looks like this:
cpu,host=serverA,region=west user=23.2,system=53.2 1604944036000000000
This is data for the cpu measurement with tags of host and region and two fields of user and system with a trailing nanosecond epoch timestamp. InfluxDB ingests this data and indexes it by:
measurement, tag key/value pairs, field name
That is an individual time series where you have a time-ordered list of value, time pairs. Further, the measurement name, tag key and value, and field name are indexed in an inverted index. So the core of InfluxDB is actually two databases: an inverted index and time series runs. This makes it very fast to look up time series by measurement name, or filter them down by tag dimensions.
However, this means that any new tag value that comes in creates a new series and must be indexed in the inverted index. Over time this means that the higher your cardinality becomes, the bigger your inverted index is. In pathological cases like distributed tracing, you get unique IDs on every row. This means that your secondary index will be larger than the underlying time series data itself. You also spend a massive amount of CPU cycles and memory on this continuous indexing.
One workaround for these use cases is to put this data into fields. However, that causes the user to have to think about tags vs. fields and in some cases limits what they can do at query time as fields are not indexed in any way.
If we want to handle unlimited cardinality, the split structure of inverted index and time series would need to change. However, this structure is core to how the database is designed.
Strict memory control requirements make a refactor difficult as well. This means we can’t use MMAP, which is how all data and index files are used in the current InfluxDB. MMAP is a great tool and many modern databases have been built with it. However, in containerized environments it is tricky to work with. Also, MMAP doesn’t align well with our requirement of being able to run without a locally attached disk.
Finally, the requirements for object storage as the durability layer and bulk data import and export are difficult to deliver on with the underlying storage engine that we’ve built. The design fundamentally assumes a locally attached SSD and doesn’t lend itself to exporting some of those files to object storage and importing them later at query time. Bulk data import and export is difficult to realize with our bifurcated index and time series database structure.
These underlying changes conspire to create a situation where we’re unable to gradually bring the core of InfluxDB forward. We needed to radically rethink how the core of the database was organized.
Betting on Rust, Apache Arrow and columnar databases
Once we realized that we’d need to rework a significant portion of the core, we started to think about what tools we could use to make this effort faster, more reliable, and more community oriented. It’s no secret that I’m a longtime fan of both Rust as a systems language and Apache Arrow as an in-memory analytics toolset. I think these two are the long-term future for systems software and OLAP and data analytics. They’ve both made tremendous progress over the last few years.
Rust gives us more fine grained control over runtime behavior and memory management. As an added bonus it makes concurrent programming easier and eliminates data races. Its packaging system, Crates.io, is fantastic and includes everything you need out of the box. With the addition of async/await last fall, I thought it was time to start seriously considering it.
Apache Arrow defines an in-memory format for columnar data along with Parquet, a durable persistence format, and Flight, a client/server framework and protocol for “high performance transfer of large datasets over network interfaces”. As an added bonus, within the Rust set of Apache Arrow tools is DataFusion, a Rust native SQL query engine for Apache Arrow. Given that we’re building with DataFusion as the core, this means that InfluxDB IOx will support a subset of SQL out of the box with expanding functionality as the DataFusion project matures both for use in InfluxDB IOx and elsewhere through the development efforts of collaborators outside of InfluxData. However, InfluxDB IOx also supports InfluxQL and Flux. Once again, we’re taking backwards compatibility seriously.
The last part of this architecture is the structure of the database. Our bet is that a columnar database optimized for time series can be used as the basis for future InfluxDB. This is the structure of Apache Arrow and DataFusion, and lines up well with our goal for broader ecosystem compatibility. Here’s how we’ve mapped InfluxDB’s data model onto a columnar database model:
- Measurements become tables (i.e. each measurement is a table)
- Tags and fields become columns in those tables (so they're scoped by measurement)
- Tag keys and field keys must be unique within a measurement
- Time is a column
In addition to this organization scheme, we’ve chosen Parquet as the long-term persistence format. Each Parquet file contains some of the data for a single table, which means each one contains the data for a single measurement. Through our research, we’ve found that we can achieve as good or better compression on real-world data with Parquet as the persistence format than we do with our own TSM engine.
Further, we break all of the data into partitions. How that data is split into partitions is defined by the operator or user when they create the database. The most common partitioning scheme will simply be based on time (for example: every 2 hours).
For each partition, we keep a summary of what it contains that can be kept in memory. This summary includes what tables exist, what their columns are, and what the min and max values of those columns are. This means that the query planner can rule out large portions of the data set before execution by simply looking at the partition metadata.
This partitioning scheme also makes it easier for us to use object storage as the long-term store and to manage the data lifecycle from memory to object storage to indexed Parquet files.
Columnar databases aren’t new, so why are we building yet another one? We weren’t able to find one in open source that was optimized for time series. On the query side, we needed dictionary support and windowed aggregates to be first-class and optimized. On the persistence side, we needed something that was intended to separate compute from storage.
Nothing specific in this project is revolutionary. Columnar databases have been extensively researched over the last few decades or so. Separating compute from storage with object storage is another thing that has been gaining momentum over the last ten years. Snowflake represents one of the more visible recent examples in the closed source realm.
We think the combination and composition of these tools together, with Apache Arrow as the core in an open source server project, represents a new and interesting offering in the open source world. We think this is something that future analytics and monitoring projects both real-time and at scale might use as a building block.
We’re already contributing back to Arrow in the Rust and Go languages. This doubles down on our commitment there. However, some of our needs are outside the goals of Apache Arrow. For example, we need a system that can keep compressed data in memory and execute queries against it with late materialization. To that end, we’re creating extensions to DataFusion that will make InfluxDB IOx able to work with more time series data in-memory than we’d otherwise be able to handle.
This project is still in its very early stages. We’re not currently producing builds, and there is no documentation beyond the InfluxDB IOx project README. The team is a small focused group of senior engineers and our efforts are in parallel to all the efforts of our larger engineering organization on the rest of the platform. Our goal is to produce open source builds early next year along with offering it in alpha in InfluxDB Cloud.
We’re talking about this now because we thought it was important to let the community know where we’re headed. We also want to open up the project so that others that take interest in it can follow along and even contribute.
Journey for InfluxDB Open Source users
For open source InfluxDB 2.0 users, this has no immediate impact. This work is still quite early and likely won’t be ready for production usage for a little while. You can expect that a future InfluxDB 2.x point release will be able to use InfluxDB IOx as an optional storage and query backend.
Work on the InfluxDB 2.x line will continue during and after the development of InfluxDB IOx. This represents no breaking change to the API. However, additional functionality in the 2.x API will be landing over time, and will be usable by the community as it becomes available. The development and adoption cycle of InfluxDB IOx should be incremental rather than all at once like the 1.x to 2.0 release cycle.
Journey for InfluxDB customers
For our InfluxDB Cloud customers, they will have an option to use this new technology as a backend for new buckets at some point in the first half of next year. As we mature the technology and get more operational experience with it, we will start planning on upgrading existing customers and buckets. This will have no impact on the API and will be delivered with zero downtime.
For our enterprise customers, we will be integrating InfluxDB IOx with InfluxDB Enterprise and offering a product to help run this software in production in Kubernetes environments. However, Kubernetes isn’t a requirement to run InfluxDB Enterprise or InfluxDB IOx. It just enables a larger set of features relating to cluster flexibility and elasticity.
Right now everything is happening either on the InfluxDB IOx repository itself or in our community Slack channel #influxdb_iox. You can also post questions or comments on the InfluxDB IOx category on our community Discourse.
There’s so much more I wanted to get to in this post. I didn’t even really get into the core technical details. If you’re hungry for more of that, I’ll be hosting a monthly tech deep dive presentation on the 2nd Wednesday of every month at 8:00 AM Pacific Time.
Lastly, if you’re interested in this project and want to get paid to work on it full-time, we’re hiring. We’re looking for people with one or more skillsets in: Rust, distributed systems, and columnar databases. The downside is that you’d have to work with me directly, but the upside is that you’d get to work with my brilliant teammates. Send an email to [email protected] and CC me [email protected].