Why InfluxDB Cloud, Powered by IOx is a Big Deal to Me
By Rick Spencer / Mar 10, 2023 / InfluxDB IOx, Community
From time to time throughout my career, I have been involved in projects with dramatic releases when we built and delivered something very new and very special. The release of InfluxDB Cloud, powered by IOx (referred to as “InfluxDB IOx” for short below) absolutely meets those criteria. I want to explain my personal views of why this release is so impactful and why I am so excited to be part of it.
For more information on the motivation and technologies that went into building this new database engine, check out the blog posts here, here, and here.
First and foremost, we designed InfluxDB IOx to be fast for large time series workloads. This means:
Fast queries for the leading edge of data. Typically, but not always, this equates to the last two hours of data, though all queries should show a significant performance boost.
Fast ingest of massive amounts of data.
The smallest possible on-disk footprint.
Users no longer need to make compromises with their data and can write as much data as necessary, and query it efficiently. For a technologist and Product Manager, what could be more exciting than delivering the things that users want the most?
Expanded approach to open source
I have been an open source contributor and community member since at least 2008. InfluxData’s commitment to open source is one of the core things that attracted me to the company. InfluxData considers itself an open source company, a belief held so firmly that we encoded it in our company value statements. In the past, this entailed working in the open and maintaining open source software that was ready for users to pick up and use.
Apache Arrow project
However, with InfluxDB IOx, we went beyond simply delivering code for our projects. Instead, we are working upstream, contributing to the Apache Arrow project, with a special focus on Apache DataFusion, the SQL query engine for Apache Arrow.
What does this mean for the open source community? While it was always nice that users could see, tinker with, and contribute to InfluxDB code, in practice this had limited utility. For example, what if you wanted to build a different kind of database, say a location database? Sure, you could get some ideas by looking at the InfluxDB code base, but it wouldn’t help you actually build your database. The Apache Arrow project fundamentally changes what’s possible by providing you with the actual components to assemble your own high performance database to meet your specific requirements by extending the work in the Apache Arrow project. This approach to open source truly enables the community.
Doing work that both strengthens the company and creates technical and economic opportunity for the wider community is deeply satisfying for me.
Embracing emerging standards means we don’t have to provide every piece of functionality ourselves. Users have a range of tools and services that they prefer, and by eschewing a walled garden approach, we can cooperate with other developers and companies to help satisfy all types of users.
SQL query language
Clearly, SQL isn’t an “emerging standard,” but the route we chose to implement SQL support is. As described in detail elsewhere, IOx uses DataFusion for querying, and DataFusion uses SQL as the query language. This investment in DataFusion means that anyone who knows even a little bit of SQL can query time series data in InfluxDB, and SQL experts can make heavy use of that same data. Additionally, when other contributing developers from the community and other companies improve DataFusion, those improvements flow into InfluxDB.
InfluxDB IOx stores files in Parquet as the native file format. Parquet is a sibling Apache project that the Apache Arrow libraries read and write natively, and as such the two technologies work together very well. Parquet files deliver significant data compression, especially the way InfluxDB IOx uses them. Parquet is also an open standard with many high quality libraries to read and write Parquet files. As such, Parquet is emerging as the standard file format for interchanging large analytical datasets, whether that means running jobs on the data in place, or moving it to a service. Using Parquet makes it possible for services that rely on large datasets, such as anomaly detection, AI/ML, visualizations, etc., to easily work with InfluxDB IOx, and to do so without a time- or compute-intensive export step.
Flight SQL is also a standard in the Apache Arrow project. It is a client/server protocol for sending SQL statements and returning results in Arrow format. Any database is free to implement Flight SQL. As a result, a small set of Flight SQL drivers is enough to connect almost any dashboard, visualization, or BI tool to the database.
For example, we created a Grafana-to-Flight SQL plugin and are in the process of contributing that to Grafana, so anyone who uses a database that supports Flight SQL can use that plugin. Similarly, we created a Flight SQL SQLAlchemy dialect so that anyone can use it to communicate between their database and Apache Superset.
You can find the repos where we contributed these resources here:
Grafana-to-FlightSQL plugin here (awaiting official inclusion in Grafana’s plugin library)
Superset support via an upstream adapter that we wrote here.
A good quality JDBC Flight SQL driver and an ODBC Flight SQL driver already exist upstream. You can use these drivers with a plethora of tools to unlock access to your data. If you have teams that want to use tools like Tableau or PowerBI, they can use these established drivers to access data in InfluxDB.
Truth be told, I am particularly excited about our support for Apache Superset. This is a full-featured, combination BI/dashboarding tool that is part of the Apache foundation. It is a very accessible Python codebase, and is well-supported by several large companies. I highly recommend trying Superset if you haven’t already.
Querying data in InfluxDB IOx
We designed InfluxDB IOx from the ground up to support SQL queries, and for those queries to be fast. But standard SQL lacks some core time series capabilities. Because of our investment in DataFusion, we were able to implement some of these core time series functions directly upstream. Not only does this ensure that these functions are performant, but it also means that anyone in the Apache Arrow community can benefit from them and even contribute improvements to them.
Currently, there are three sets of time-series-specific functions in DataFusion:
date_bin() - this function assigns each row's timestamp to a fixed-width time window, so you can group rows by window and apply an aggregate.
selector_first(), selector_last() - these functions provide the first and last row of a table that meet specific criteria.
time_bucket_gapfill() - this function returns windowed data, but if there are windows that lack data, it will fill those gaps.
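To make the windowing and gap-filling behavior described above more concrete, here is a small pure-Python sketch of the idea. It does not call DataFusion or InfluxDB; the helper names mean_by_window and gapfill, the sample readings, and the choice to fill gaps with None are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_bin(interval, ts, origin=EPOCH):
    """Return the start of the interval-wide window that contains ts."""
    return ts - (ts - origin) % interval


def mean_by_window(points, interval):
    """Group (timestamp, value) pairs by window and average each group."""
    sums = {}
    for ts, value in points:
        window = date_bin(interval, ts)
        total, count = sums.get(window, (0.0, 0))
        sums[window] = (total + value, count + 1)
    return {w: total / count for w, (total, count) in sorted(sums.items())}


def gapfill(windows, interval, start, stop, fill=None):
    """Emit every window in [start, stop), using fill where data is missing."""
    filled = {}
    window = date_bin(interval, start)
    while window < stop:
        filled[window] = windows.get(window, fill)
        window += interval
    return filled


def at(hour, minute):
    return datetime(2023, 3, 10, hour, minute, tzinfo=timezone.utc)


# Made-up sensor readings with a 20 minute hole in the middle.
points = [(at(10, 2), 1.0), (at(10, 7), 3.0), (at(10, 31), 5.0)]
ten_minutes = timedelta(minutes=10)

windows = mean_by_window(points, ten_minutes)
filled = gapfill(windows, ten_minutes, at(10, 0), at(10, 40))

for window, value in filled.items():
    print(window.isoformat(), value)
```

Without gap filling, the result has only the 10:00 and 10:30 windows; with it, the empty 10:10 and 10:20 windows appear as explicit rows, which is usually what a chart or a downstream model expects.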
So far I have discussed how the InfluxDB IOx engine leverages the Apache Arrow project to enable users to write time-series-specific queries, using a familiar language (SQL) and their tool(s) of choice, and that these low latency queries act on the leading edge of data. But the Apache Arrow and Flight SQL combination brings another critical advantage.
Remember that Apache Arrow is designed to move large amounts of columnar data and to allow tools to operate effectively on that data. Therefore, upstream libraries in the Apache Arrow project allow users to query large amounts of data, efficiently bring it onto their clients, and operate on that data in interesting ways.
For me, the most exciting aspect of this is that pyarrow, the Python Arrow library, has built-in Pandas support. Pandas is by far the most popular data manipulation library, used by developers, business analysts, AI/ML practitioners, et al. Plotly Express and Neural Prophet are just a couple of examples of richly functional libraries built on Pandas. Long-time Python users will welcome the easy interoperability between InfluxDB IOx and these libraries. That said, Arrow has libraries available for many different languages.
Remember, we designed InfluxDB IOx to process large time series workloads quickly, and its support for the Arrow libraries carries that goal all the way to your client code.
While we made a lot of changes to InfluxDB, from a user’s perspective, we did not change the way data gets written to InfluxDB. This is because we optimized InfluxDB for ingesting large amounts of time series data long ago, and, in fact, this continues to be a core point of differentiation. Your Telegraf configs and line protocol code still work the same with the InfluxDB IOx engine.
In truth, we implemented some changes in the configuration of an IOx-powered InfluxDB instance that enable even faster data ingestion and “Time to Be Readable.” But users don’t need to change the way they write data to InfluxDB to receive these benefits.
Ultimately, this means that InfluxDB Cloud, powered by IOx, can ingest more data, faster than any other database that you might consider for such a workload.
Even easier to use and for more use cases
One of the challenges of operating a time series database is that the things you want to measure can come and go. For example, new sensor types may arrive on your factory floor, or you may start observing a new kind of server infrastructure.
With most databases, these changes are highly disruptive because you must deploy a schema change to account for the new data. These schema updates can be labor intensive and risky. InfluxDB always handled this problem by being “schema on write.” This means you can simply write data with a new schema and InfluxDB handles the changes for you behind the scenes. There is no change to this functionality with InfluxDB IOx.
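The idea of “schema on write” can be sketched with a toy in-memory table that adds a column the first time it sees a new field. This is only a conceptual model with a hypothetical class name and made-up data; it is not how InfluxDB stores or evolves schemas internally.

```python
class SchemaOnWriteTable:
    """Toy table whose columns grow as new fields show up in writes."""

    def __init__(self):
        self.columns = []  # column names, in the order first seen
        self.rows = []

    def write(self, point):
        # Any field not seen before becomes a new column; no migration step.
        for name in point:
            if name not in self.columns:
                self.columns.append(name)
        self.rows.append(dict(point))

    def scan(self):
        # Rows written before a column existed read back as None for it.
        return [tuple(row.get(c) for c in self.columns) for row in self.rows]


table = SchemaOnWriteTable()
table.write({"time": 1, "machine": "press-1", "temp": 71.2})

# A new vibration sensor arrives on the factory floor: just write it.
table.write({"time": 2, "machine": "press-1", "temp": 71.9, "vibration": 0.03})

print(table.columns)
for row in table.scan():
    print(row)
```

The second write introduces the vibration column without any explicit schema change, and the earlier row simply has no value for it.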
In previous versions of InfluxDB, users had to be cognizant of the concept of cardinality to maintain good performance. InfluxDB IOx completely removes this concern. The way it stores and queries data does not require any notion of cardinality. This means that users can create a schema based on the way they think about the data, and how it is collected, without worrying about slowing down their queries!
Every journey starts with a single step
This release is just the beginning for InfluxDB IOx. We are already hard at work on the next steps. This includes:
Optimize, optimize, optimize. We already see lots of opportunities to make InfluxDB IOx even faster, so expect the database to get faster and faster over the coming months.
Fast InfluxQL. We are implementing native InfluxQL support into DataFusion. This means that users with workloads from previous versions of InfluxDB who used InfluxQL for their queries will be able to get the benefits of InfluxDB IOx with minimal changes to their application.
Single Tenant and Enterprise versions. We are happy to release InfluxDB IOx to new Multi-tenant customers, but over the course of the year, you can expect offerings for customers who want InfluxDB IOx in a single tenant managed service, as well as those who want to manage InfluxDB IOx themselves.
InfluxDB IOx is a real game changer in multiple ways, and I have to admit that I feel exhilarated. This feeling reminds me of paddling out to a break of waves when surfing. That combination of excitement and bracing for the unknown really makes me feel alive.
Aside from deepening InfluxData’s commitment to open source, it allows better performance on bigger workloads, with support for all of your team’s favorite tools. If you are an existing InfluxDB user, you will have to think about how you use InfluxDB a bit differently to realize these benefits, but I think you will agree that the benefits will be extreme as you bring your new time series workloads to InfluxDB IOx.
At the time of writing, InfluxDB IOx is the engine powering InfluxDB Cloud in two specific regions. Sign up for an InfluxDB Cloud account in either the US East 1 (Virginia) or EU Central 1 (Frankfurt) region. We plan to roll out the InfluxDB IOx engine to additional regions and cloud providers in the near future.