Designing a Parquet Catalog for InfluxDB IOx
One of the things we needed to either adopt or build for InfluxDB IOx is a database catalog. If you haven’t heard us talk about it yet, InfluxDB IOx (pronounced eye-ox) is the new in-memory columnar database that uses object storage for persistence. We’re building it as the future core of InfluxDB. A database catalog usually contains the definitions of a database’s structure like schema and indexes. For systems that use object storage for persistence, they also may contain additional information like files (object locations) and summary statistics.
The SQL standard specifies a way to access the catalog through information_schema tables. IOx supports this method of getting information about the schema of the database. You can see it here in the output of the show tables command executed against the paulco_mydata database in the IOx SQL REPL:
paulco_mydata> show tables; +---------------+--------------------+---------------+------------+ | table_catalog | table_schema | table_name | table_type | +---------------+--------------------+---------------+------------+ | public | iox | cpu | BASE TABLE | | public | iox | disk | BASE TABLE | | public | iox | diskio | BASE TABLE | | public | iox | mem | BASE TABLE | | public | iox | processes | BASE TABLE | | public | iox | swap | BASE TABLE | | public | iox | system | BASE TABLE | | public | system | chunks | BASE TABLE | | public | system | columns | BASE TABLE | | public | system | chunk_columns | BASE TABLE | | public | system | operations | BASE TABLE | | public | information_schema | tables | VIEW | | public | information_schema | columns | VIEW | +---------------+--------------------+---------------+------------+ Returned 13 rows in 4.085615ms paulco_mydata>
While IOx supports this more standard method of getting at schema information, the catalog this post talks about is focused on tracking what exists in object storage. We needed an efficient way to keep track of schema and statistics information for the Parquet files that InfluxDB IOx writes to object storage. Given that, we had to decide whether to design our own solution or to adopt a standard from some other project. We took a look at Apache Hive, Delta Lake, and Apache Iceberg as potential implementations. In the end we decided to implement our own design, and in the rest of this post, I’ll talk about that design and compare it a bit to the others.
There are a number of specific requirements we needed to fill with a catalog implementation. First and foremost, we need pointers to the list of Parquet files that are part of the database. However, we also want to support soft deletes, giving users the ability to delete data, but have it still be around for some period of time if it needs to be restored. This means that not all Parquet files in object storage are part of the queryable set.
One of the tricks that IOx will use to optimize performance is to simply avoid evaluating as much of the data in a database as possible. It does this by partitioning data and then pruning out irrelevant partitions at query time based on the query. After that, we execute the query on the relevant partitions to get a result back as quickly as possible. To do this, the database needs to know the schema and the summary statistics of what exists in the various partitions and the files within them. Without the catalog, the database would need to scan all Parquet files and read the summary data from them at startup.
InfluxDB IOx must also be able to efficiently keep track of many thousands of tables in the database. IOx users may have a database in which all data goes into one wide table (likely for tracing use cases), but they also have others where data will be split across many tables (likely for metrics use cases). IOx users will be able to use tables in their schema to help them organize it logically as well as segmenting it for better query performance. We’ll also need schema for all these tables so we can do quick lookups for queries and validation for writes.
Lastly, the catalog will need to support tombstone records for deletes. This makes it possible to have deletes reflected in queries immediately without having to actually rewrite the Parquet files themselves.
All those requirements taken together ruled out using either Hive, Delta Lake, or Iceberg. These standards are also heavily tied to the JVM and/or the Hadoop based ecosystem. IOx must be deployable at the edge, so we need something lightweight and written in Rust. None of the existing standards has a mature Rust implementation.
Hive doesn’t support soft deletes, atomic commits or time travel. Further, it relies on expensive LIST operations to build its view of the world.
Delta Lake, created at DataBricks, is more focused on a single table with multiple writers. The design of IOx carefully avoids the need for concurrency control because each server gets a unique identifier, which is the first part of the path in object storage. We don’t have a need to support multiple writers so we don’t need that complexity.
Iceberg seems to be the most promising in the list for our needs. However, it is focused on single-table use cases, which rules it out. Only Java and Python implementations exist, and it looks like there’s a decent amount of complexity to implement the spec.
So we went with our own design that borrows many of the concepts from these three projects. In addition to supporting the features we need, we chose to use the Parquet metadata format in Apache Thrift to keep information about metadata and statistics. This is convenient because we’ll use that same code path when reading the catalog or when reading Parquet files directly.
We have more details about the InfluxDB IOx catalog written up in the GitHub repository. Or if you prefer to watch a talk, Marco Neumann, the Influx engineer who designed the catalog, gave a talk on the InfluxDB IOx catalog design at our monthly tech talk last week. If this kind of detail on the implementation of IOx is interesting to you, please join us for the monthly InfluxDB IOx Tech Talk on the 2nd Wednesday of every month at 8:00 am Pacific.