Bringing it all together: Speed, performance, and efficiency in InfluxDB 3.0
Jason Myers
Nov 03, 2023
For most of the past year, we here at InfluxData focused on shipping the latest version of InfluxDB. To date, we launched three commercial products (InfluxDB Cloud Serverless, InfluxDB Cloud Dedicated, and InfluxDB Clustered), with more open source options on the way. All the while, we claimed that this latest version of InfluxDB surpasses anything we built before. We’re not in the business of making empty promises, so this post draws together all the information currently available to support those claims in one place.
There are several inter-related factors and developments that contribute to the overall success of InfluxDB, and my goal is to draw clear connections between them. While some of this is necessarily simplified, I encourage readers to check out the in-depth, technical articles linked throughout if you’re interested in the details. So, without further ado, let’s dive in.
At the outset of this journey, InfluxData founder Paul Dix decided to write the new database version in Rust. While Rust isn’t the easiest language to work with, it has many built-in advantages. It was also the case that some of the open source projects we used to build the new InfluxDB were written in Rust, but we’ll get back to those in a minute.
In addition to the shift from Go to Rust, we reconfigured the architecture of the database. One of the key decisions in this regard was to separate compute and storage. This allows InfluxDB to scale each of those components individually, giving users more flexibility with how they can scale their database.
One of the key challenges this new version of InfluxDB sought to solve was the cardinality problem. Because older versions wrote an index of the data to disk, the number of series that users could ingest and query without impacting performance depended on the amount of memory committed to writing and managing the index (see this short explainer video). For high cardinality use cases, like traces, InfluxDB struggled.
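To make the cardinality problem concrete, here is a minimal sketch of how the number of unique series grows with the number of tag values. The tag names and counts are hypothetical, and the key format only loosely imitates a line protocol series key. This is an illustration of the arithmetic, not how any InfluxDB version builds its index.

```python
from itertools import product

# Hypothetical tag sets for a tracing workload. The names and sizes are
# illustrative assumptions, not figures from InfluxDB.
tags = {
    "service": [f"svc-{i}" for i in range(20)],
    "host": [f"host-{i}" for i in range(50)],
    "trace_id": [f"trace-{i}" for i in range(100)],
}


def series_keys(measurement, tag_values):
    """Yield one series key per unique combination of tag values."""
    names = sorted(tag_values)
    for combo in product(*(tag_values[name] for name in names)):
        pairs = ",".join(f"{name}={value}" for name, value in zip(names, combo))
        yield f"{measurement},{pairs}"


# A per-series index needs one entry per unique key, so its size grows
# with the product of the tag value counts.
cardinality = len(set(series_keys("spans", tags)))

# Dropping the high-cardinality tag shrinks the index by a factor of 100.
without_trace_id = {k: v for k, v in tags.items() if k != "trace_id"}
low_cardinality = len(set(series_keys("spans", without_trace_id)))

print(cardinality)      # 20 * 50 * 100 = 100000
print(low_cardinality)  # 20 * 50 = 1000
```

A single tag with many distinct values, like a trace ID, multiplies the series count, which is why trace-style workloads strained an index that had to track every series.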
We knew that optimizing InfluxDB to handle large time series workloads required us to solve the cardinality problem.
The road traveled
Spoiler alert: We solved the cardinality problem. It’s a bit of a chicken-and-egg question as to which came first: the performance gains in the new InfluxDB or support for unlimited cardinality data. These and several other items are interrelated.
One of the most critical decisions our team made was to build the new version on the Apache Arrow ecosystem. The FDAP stack (Apache Arrow Flight, DataFusion, Arrow, and Parquet) provided a lot of core functionality, along with upstream open source tools that we could both use and contribute to. Not only did this expand our internal commitment to open source, but it also allowed us to contribute time-specific code to these projects that facilitated the features of InfluxDB and simultaneously made InfluxDB more interoperable with other solutions built on the FDAP stack or its components.
A columnar database
Using Arrow as the data representation layer allowed us to build InfluxDB as a columnar database. This is a key development because it is a major shift from versions 1.x/2.x. One of the reasons that we opted for a columnar approach is that it lends itself to better data compression. The columnar approach also allowed us to rethink our data model. We simplified our data model so each measurement groups data together instead of separating it into individual time series. This means the database has less work to do at the point of ingest and can therefore ingest data faster.
The columnar framework makes it easier to compress data on a per-column basis. This is a huge boost in its own right. We coupled this with Apache Parquet as our data persistence format. Parquet is designed to work with columnar data and provides high data compression ratios. So, once the data gets to a Parquet file, we’re able to compress it even further.
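As a rough illustration of why a columnar layout compresses well, the sketch below transposes a few hypothetical rows into columns and run-length encodes the repetitive ones. Run-length encoding is one of the encodings Parquet supports, but this is a toy stand-in written with the Python standard library. It does not use the Arrow or Parquet APIs, and the rows and column names are made up for the example.

```python
from itertools import groupby

# Hypothetical rows for one measurement: (time, host, region, usage).
rows = [
    (1, "host-a", "us-east", 0.50),
    (2, "host-a", "us-east", 0.52),
    (3, "host-a", "us-east", 0.51),
    (4, "host-b", "us-east", 0.70),
    (5, "host-b", "us-east", 0.71),
]
names = ("time", "host", "region", "usage")

# Transpose the row-oriented records into one list per column.
columns = {name: list(values) for name, values in zip(names, zip(*rows))}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Values of the same kind sit next to each other in a column, so
# repeated tag values collapse into a handful of runs.
host_runs = run_length_encode(columns["host"])
region_runs = run_length_encode(columns["region"])

print(host_runs)    # [('host-a', 3), ('host-b', 2)]
print(region_runs)  # [('us-east', 5)]
```

In a row-oriented layout the same tag values are interleaved with timestamps and field values, so there are no long runs to collapse.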
There’s no judgment if you skipped the “how” sections to get to the results of all our hard work.
At the outset, we wanted to eliminate the bottlenecks that prevented InfluxDB from being able to handle the full range of time series workloads. This meant solving the cardinality problem, which meant the database needed to be able to ingest and query vast amounts of data in real-time without impacting performance.
To accomplish that, we restructured the database, which included separating compute and storage to make them independently scalable. We streamlined the data ingest process so that it uses fewer resources and doesn’t rely on an on-disk index. In fact, v3.0 can ingest data with 45x more throughput than previous open source versions.
Storage and compression
Once all that data hits InfluxDB, it keeps fresh and frequently queried data in a hot storage tier. Older data goes to a cold storage tier as Apache Parquet files. Thanks to a columnar format and the compression-friendly Parquet format, InfluxDB 3.0 delivers high-ratio compression. That means that you can store more high-fidelity data in less space.
For use cases that rely on historical analysis, this is a huge win because you don’t need to choose between analytical integrity and storage costs. For the cold storage tier, we use low-cost cloud storage, like Amazon S3. When we combine all those factors, the result can be a cost savings of 90%+.
So, at this point we can ingest a ton of data and store it in a cost-effective manner. Now we just need to be able to query that data and analyze it. For the query side of things, we leaned into Apache DataFusion, which is a state-of-the-art query engine built in Rust that uses Arrow as its in-memory model. In short, it’s really fast and works well with the rest of the database. An additional advantage to DataFusion is that it allowed us to build in native support for SQL, which a lot of users asked for over the years. Not only does SQL support reduce barriers to entry for many people, but the sheer speed of DataFusion also helps InfluxDB 3.0 deliver real-time results.
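For readers who haven’t queried time series data with SQL before, here is a small sketch of the kind of windowed aggregation that SQL support makes possible. It uses SQLite from the Python standard library purely as a stand-in, not DataFusion or InfluxDB, and the `cpu` table, its columns, and the sample values are hypothetical. The SQL dialect in InfluxDB 3.0 will differ in its details, such as its time functions.

```python
import sqlite3

# In-memory SQLite database standing in for a SQL-capable time series store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu (time INTEGER, host TEXT, usage REAL)")
conn.executemany(
    "INSERT INTO cpu VALUES (?, ?, ?)",
    [
        (0, "host-a", 0.5),
        (30, "host-a", 0.75),
        (60, "host-a", 1.0),
        (0, "host-b", 0.25),
        (90, "host-b", 0.5),
    ],
)

# Average usage per host in 60-second buckets. Integer division rounds
# each timestamp (in seconds) down to the start of its bucket.
query = """
    SELECT host, (time / 60) * 60 AS bucket, AVG(usage) AS avg_usage
    FROM cpu
    GROUP BY host, bucket
    ORDER BY host, bucket
"""
result = conn.execute(query).fetchall()
conn.close()

for host, bucket, avg_usage in result:
    print(host, bucket, avg_usage)
```

The point is the shape of the query: anyone who already knows `GROUP BY` can express a downsampling query without learning a separate query language.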
The last piece of the puzzle is data analysis and visualization. In version 3.0, we wanted to get back to focusing on the core database. Instead of investing our resources in custom visualization tools, we think it makes more sense to leverage other best-of-breed tools. In other words, instead of making this version of InfluxDB try to do everything, we focused on integration and interoperability. It has a native integration with Grafana and can connect to Apache Superset and Tableau, with integrations for other tools in active development.
Being built on open source also facilitates other integrations, such as those with artificial intelligence (AI) and machine learning (ML) solutions. ‘Real-world’ AI tools, which differ from generative AI tools, typically rely on time series data. These are the solutions that drive automation and predictive models for industrial operations, making them more efficient and effective. At the same time, these tools require large amounts of data to train their models, and the volume and velocity of time series data make it a key source. InfluxDB functions as the intermediary between data sources, AI models, and end-user analysis by managing that data and making it available to AI/ML tools in real-time.
To hammer home the point, check out these benchmarks, comparing InfluxDB 3.0 with previous open source versions. The speed and performance gains against our own – already leading – product are significant.
As time series data becomes increasingly critical across industries and sectors, the sheer volume of data produced requires a solution that can keep up at a real-world pace, in real-time, without sacrificing performance. InfluxDB 3.0 is that solution. And the best part? Version 3.0 is just the beginning; it’s only going to get better from here.