Apache Parquet is an open source columnar data file format that emerged from a collaboration between Twitter and Cloudera and was designed for fast processing of complex data. Initially built to handle data interchange for the Apache Hadoop ecosystem, it has since been adopted by a number of open source projects like Delta Lake, Apache Iceberg and InfluxDB 3.0, and is compatible with big data systems like Hive, Drill, Impala, Presto, Spark, Kudu, Redshift, BigQuery, Snowflake, ClickHouse, and others.
Parquet supports different encoding and compression schemes on a per-column basis, which allows for efficient data storage and bulk retrieval. Dictionary and run-length (RLE) encoding are supported, as is a wide variety of data types, including numerical data, textual data, and structured data like JSON. Parquet is self-describing: each file carries metadata (the schema, plus statistics such as min/max values and Bloom filters) on a per-column basis. This helps the query planner and executor decide what needs to be read and decoded from the Parquet file.
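To make the two encodings concrete, here is a minimal sketch in plain Python. It is a simplification for illustration only: the function names and the sample column are made up, and real Parquet writers use a more compact hybrid of run-length encoding and bit-packing rather than Python lists.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []   # distinct values, in first-seen order
    index_of = {}     # value -> position in `dictionary`
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical column of repeated strings, as is common for sensor types.
column = ["temp", "temp", "temp", "humidity", "humidity", "temp"]

dictionary, indices = dictionary_encode(column)   # small ints instead of strings
runs = run_length_encode(indices)                 # runs of repeated ints
decoded = [dictionary[i] for i in run_length_decode(runs)]
```

Columns with few distinct values or long runs of repeats benefit most, which is why applying these schemes per column rather than per row pays off.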
Additionally, it is built to support nested data structures, using the record shredding and assembly algorithm first described in Google's Dremel paper.
Advantages of Parquet
Performance — I/O and compute will be reduced since a subset of the columns can be scanned when performing queries.
Compression — Organizing homogeneous data into columns allows for better compression.
Open Source — Robust open-source ecosystem, constantly being improved and maintained by a vibrant community.
How Parquet stores data
Let’s use an example to see how Parquet stores data in columns. The following is a simple data set of 3 sensors with their IDs, types, and locations. When you save it in a row-based format like CSV or Avro, it looks like this:
When you store it in Parquet columnar format, your data is organized with each specific data type in a column like the following:
The row-based format is easy to understand since it is the typical format that we see in spreadsheets. However, for any type of large-scale querying, columnar formats have the advantage, since a query can read only the subset of columns it needs. In addition, storage is more cost effective, since row-based formats like CSV do not compress as efficiently as Parquet.
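The difference between the two layouts can be sketched in a few lines of Python. The sensor values below are invented for illustration (they are not the data from the original example), and `to_columns` is a hypothetical helper, not part of any Parquet library.

```python
# Row-based layout: one record per sensor, all attributes stored together.
rows = [
    {"sensor_id": 1, "type": "temperature", "location": "kitchen"},
    {"sensor_id": 2, "type": "humidity", "location": "garage"},
    {"sensor_id": 3, "type": "temperature", "location": "attic"},
]


def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Columnar layout: all values of one attribute stored together.
columns = to_columns(rows)

# A query that needs only locations touches a single column ...
locations = columns["location"]

# ... while the row layout forces a pass over every full record.
locations_from_rows = [record["location"] for record in rows]
```

Both approaches return the same answer, but the columnar version reads 3 values while the row-based version has to walk through all 9.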
InfluxDB and the Parquet file format
Here is an example that we will use to demonstrate how InfluxDB stores time series data in the Parquet file format.
The measurement maps to one or more Parquet files, and the tags, fields, and timestamps map to individual columns. These columns use different encoding schemes to get the best possible compression.
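As a rough illustration of that mapping, the sketch below pivots a handful of time series points into one set of columns per measurement. The point structure, the sample values, and the `points_to_columns` helper are all assumptions made for this example; this is not InfluxDB's actual implementation, and it ignores details such as a tag and a field sharing the same name.

```python
# Hypothetical points: each has a measurement, tags, fields, and a timestamp.
points = [
    {"measurement": "weather", "tags": {"location": "us-midwest"},
     "fields": {"temperature": 82.0}, "time": 1000},
    {"measurement": "weather", "tags": {"location": "us-east"},
     "fields": {"temperature": 75.0, "humidity": 45.0}, "time": 2000},
]


def points_to_columns(points):
    """Group points by measurement and pivot tags, fields, and time into columns."""
    by_measurement = {}
    for point in points:
        by_measurement.setdefault(point["measurement"], []).append(point)

    tables = {}
    for measurement, group in by_measurement.items():
        # Every tag key and field key seen in this measurement becomes a column.
        names = sorted(
            {key for p in group for key in p["tags"]}
            | {key for p in group for key in p["fields"]}
        )
        columns = {name: [] for name in names}
        columns["time"] = []
        for p in group:
            values = {**p["tags"], **p["fields"]}
            for name in names:
                # Points that lack a given tag or field get a null entry.
                columns[name].append(values.get(name))
            columns["time"].append(p["time"])
        tables[measurement] = columns
    return tables


tables = points_to_columns(points)
weather = tables["weather"]
```

Each resulting list holds values of a single kind (tag strings, float fields, or integer timestamps), which is what lets a different encoding be chosen per column.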
In addition, the data is organized in non-overlapping time ranges. Here we have 3 sample Parquet files for the same measurement, split into 3 different time groupings.
This makes queries faster, since a query for a specific time range only needs to read the files that cover that range. In addition, storing time series data in this way allows for easy eviction of old data to save space, if required.
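The idea can be sketched as follows, assuming each file's minimum and maximum timestamps are known up front. The `FileMeta` class, the file names, and the integer timestamps are hypothetical and chosen only to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    path: str
    min_time: int  # earliest timestamp in the file (inclusive)
    max_time: int  # latest timestamp in the file (inclusive)


def files_for_range(files, start, end):
    """Return only the files whose time range overlaps [start, end]."""
    return [f for f in files if f.max_time >= start and f.min_time <= end]


def evict_older_than(files, cutoff):
    """Drop files whose data lies entirely before the cutoff."""
    return [f for f in files if f.max_time >= cutoff]


# Three files for one measurement, covering non-overlapping time ranges.
files = [
    FileMeta("weather_0.parquet", 0, 99),
    FileMeta("weather_1.parquet", 100, 199),
    FileMeta("weather_2.parquet", 200, 299),
]

selected = files_for_range(files, 120, 180)   # only the middle file qualifies
remaining = evict_older_than(files, 100)      # the oldest file is evicted
```

Because the ranges do not overlap, both operations work on whole files: a query skips files without opening them, and eviction is a file delete rather than a rewrite.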