<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>InfluxData Blog - Raphael Taylor-Davies</title>
    <description>Posts by Raphael Taylor-Davies on the InfluxData Blog</description>
    <link>https://www.influxdata.com/blog/author/raphael-taylor-davies/</link>
    <language>en-us</language>
    <lastBuildDate>Tue, 01 Aug 2023 07:35:00 +0000</lastBuildDate>
    <pubDate>Tue, 01 Aug 2023 07:35:00 +0000</pubDate>
    <ttl>1800</ttl>
    <item>
      <title>Aggregating Millions of Groups Fast in Apache Arrow DataFusion</title>
      <description>&lt;h2 id="tldr"&gt;TLDR&lt;/h2&gt;

&lt;p&gt;Grouped aggregations are a core part of any analytic tool, creating understandable summaries of huge data volumes. &lt;a href="https://arrow.apache.org/datafusion/"&gt;Apache Arrow DataFusion&lt;/a&gt;’s parallel aggregation capability is 2-3x faster in version &lt;a href="https://crates.io/crates/datafusion/28.0.0"&gt;28.0.0&lt;/a&gt; for queries with a large number (10,000 or more) of groups.&lt;/p&gt;

&lt;p&gt;Improving aggregation performance matters to us as users of DataFusion. Both InfluxDB, a &lt;a href="https://github.com/influxdata/influxdb"&gt;time series data platform&lt;/a&gt;, and Coralogix, a &lt;a href="https://coralogix.com/?utm_source=InfluxDB&amp;amp;utm_medium=Blog&amp;amp;utm_campaign=organic"&gt;full-stack observability&lt;/a&gt; platform, aggregate vast amounts of raw data to monitor and create insights for our customers. Improving DataFusion’s performance lets us provide better user experiences by generating insights faster with fewer resources. Because DataFusion is open source and released under the permissive &lt;a href="https://github.com/apache/arrow-datafusion/blob/main/LICENSE.txt"&gt;Apache 2.0&lt;/a&gt; license, the whole DataFusion community benefits as well.&lt;/p&gt;

&lt;p&gt;With the new optimizations, DataFusion’s grouping speed is now close to DuckDB, a system that regularly reports &lt;a href="https://duckdblabs.github.io/db-benchmark/"&gt;great&lt;/a&gt; &lt;a href="https://duckdb.org/2022/03/07/aggregate-hashtable.html#experiments"&gt;grouping&lt;/a&gt; benchmark performance numbers. Figure 1 contains a representative sample of &lt;a href="https://github.com/ClickHouse/ClickBench/tree/main"&gt;ClickBench&lt;/a&gt; queries run against a single Parquet file, and the full results are at the end of this article.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2lBOgZwjHynFfVmveJFlum/51f93636fba965fb58c4f63ac6b72090/ClickBench-single_Parquet_file.png" alt="ClickBench-single Parquet file" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; Query performance for ClickBench queries 16, 17, 18, and 19 on a single Parquet file for DataFusion 27.0.0, DataFusion 28.0.0, and DuckDB 0.8.1.&lt;/p&gt;

&lt;h2 id="introduction-to-high-cardinality-grouping"&gt;Introduction to high cardinality grouping&lt;/h2&gt;

&lt;p&gt;Aggregation is a fancy word for computing summary statistics across many rows that have the same value in one or more columns. We call the rows with the same values &lt;em&gt;groups&lt;/em&gt; and “high cardinality” means there are a large number of distinct groups in the dataset. At the time of writing, a “large” number of groups in analytic engines is around 10,000.&lt;/p&gt;

&lt;p&gt;For example, the &lt;a href="https://github.com/ClickHouse/ClickBench"&gt;ClickBench&lt;/a&gt; &lt;em&gt;hits&lt;/em&gt; dataset contains 100 million anonymized user clicks across a set of websites. ClickBench Query 17 is:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT "UserID", "SearchPhrase", COUNT(*) 
FROM hits
GROUP BY "UserID", "SearchPhrase" 
ORDER BY COUNT(*) 
DESC LIMIT 10;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In English, this query finds “the top ten (user, search phrase) combinations, across all clicks” and produces the following results (there are no search phrases for the top ten users):&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;UserID&lt;/th&gt;
      &lt;th&gt;Search Phrase&lt;/th&gt;
      &lt;th&gt;Count (UInt8(1))&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;1313338681122956954&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;29097&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;1907779576417363396&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;25333&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;2305303682471783379&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;10597&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;7982623143712728547&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;6669&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;7280399273658728997&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;6408&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;1090981537032625727&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;6196&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;5730251990344211405&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;6019&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;6018350421959114808&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;5990&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;835157184735512989&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;5209&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;770542365400669095&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;4906&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The ClickBench dataset contains&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;99,997,497 total rows &lt;a href="#sup-1"&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;17,630,976 different users (distinct UserIDs) &lt;a href="#sup-2"&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;6,019,103 different search phrases &lt;a href="#sup-3"&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;24,070,560 distinct combinations &lt;a href="#sup-4"&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt; of (UserID, SearchPhrase)&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, to answer the query, DataFusion must map each of the 100M different input rows into one of the &lt;strong&gt;24 million different groups&lt;/strong&gt;, and keep count of how many such rows there are in each group.&lt;/p&gt;

&lt;h2 id="the-solution"&gt;The solution&lt;/h2&gt;

&lt;p&gt;Like most concepts in databases and other analytic systems, the basic ideas of this algorithm are straightforward and taught in introductory computer science courses. You could compute the query with a program such as this &lt;a href="#sup-5"&gt;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;import pandas as pd
from collections import defaultdict
from operator import itemgetter

# read file
hits = pd.read_parquet('hits.parquet', engine='pyarrow')

# build groups
counts = defaultdict(int)
for index, row in hits.iterrows():
    group = (row['UserID'], row['SearchPhrase']);
    # update the dict entry for the corresponding key
    counts[group] += 1

# Print the top 10 values
print (dict(sorted(counts.items(), key=itemgetter(1), reverse=True)[:10]))&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This approach, while simple, is both slow and very memory inefficient. It requires over 40 seconds to compute the results for less than 1% of the dataset &lt;a href="#sup-6"&gt;&lt;sup&gt;6&lt;/sup&gt;&lt;/a&gt;. Both DataFusion 28.0.0 and DuckDB 0.8.1 compute results in under 10 seconds for the &lt;em&gt;entire&lt;/em&gt; dataset.&lt;/p&gt;
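&lt;p&gt;For contrast, the same query shape can be expressed with pandas’ built-in vectorized grouping, which avoids the per-row Python loop entirely. This is a sketch on a toy frame, not one of the benchmark scripts:&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for the hits table; the real query runs over ~100M rows
hits = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 2, 3],
    "SearchPhrase": ["", "", "", "cat", "cat", ""],
})

# Vectorized grouped COUNT(*), then top 10 groups by count
top = (
    hits.groupby(["UserID", "SearchPhrase"])
        .size()
        .sort_values(ascending=False)
        .head(10)
)
print(top)
```

&lt;p&gt;Even this vectorized form runs single threaded; the two-phase parallel scheme described below is what lets an engine keep every core busy.&lt;/p&gt;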

&lt;p&gt;To answer this query quickly and efficiently, you have to write your code such that it:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Keeps all cores busy aggregating via parallelized computation&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Updates aggregate values quickly, using vectorizable loops that are easy for compilers to translate into the high performance &lt;a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data"&gt;SIMD&lt;/a&gt; instructions available in modern CPUs.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The rest of this article explains how grouping works in DataFusion and the improvements we made in 28.0.0.&lt;/p&gt;

&lt;h3 id="two-phase-parallel-partitioned-grouping"&gt;Two phase parallel partitioned grouping&lt;/h3&gt;

&lt;p&gt;Both DataFusion 27 and 28 use state-of-the-art two-phase parallel hash-partitioned grouping, similar to other high-performance vectorized engines like &lt;a href="https://duckdb.org/2022/03/07/aggregate-hashtable.html"&gt;DuckDB’s Parallel Grouped Aggregates&lt;/a&gt;. In pictures this looks like:&lt;/p&gt;

&lt;p&gt;&lt;img style="padding: 20px 0px;" src="//images.ctfassets.net/o7xu9whrs0u9/5WBvyhluMQrCxVuGZXcvbX/7b355fd60c87a2fb6b50d4a2b5962b60/parallel_partitioned_grouping.png" alt="parallel partitioned grouping" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2:&lt;/strong&gt; Two phase repartitioned grouping: data flows from bottom (source) to top (results) in two phases. First (steps 1 and 2), each core reads the data into a core-specific hash table, computing intermediate aggregates without any cross-core coordination. Then (steps 3 and 4) DataFusion divides (“repartitions”) the data into distinct subsets by group value, and sends each subset to a specific core, which computes the final aggregate.&lt;/p&gt;

&lt;p&gt;The two phases are critical for keeping cores busy in a multi-core system. Both phases use the same hash table approach (explained in the next section), but differ in how the groups are distributed and the partial results emitted from the accumulators. The first phase aggregates data as soon as possible after it is produced. However, as shown in Figure 2, the groups can be anywhere in any input, so the same group is often found on many different cores. The second phase uses a hash function to redistribute data evenly across the cores, so each group value is processed by exactly one core which emits the final results for that group.&lt;/p&gt;
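&lt;p&gt;The two phases can be sketched in miniature. The following hypothetical Python models each core’s input stream as a list and a grouped &lt;code class="language-python"&gt;COUNT(*)&lt;/code&gt; as the aggregate:&lt;/p&gt;

```python
from collections import defaultdict

def two_phase_count(streams, n_cores=2):
    """Toy two-phase grouped COUNT(*) over several input streams."""
    # Phase 1: each "core" aggregates its own stream into a local hash
    # table, with no cross-core coordination.
    partials = []
    for stream in streams:
        local = defaultdict(int)
        for group in stream:
            local[group] += 1
        partials.append(local)

    # Repartition: route each partial (group, count) to the core that owns
    # hash(group) % n_cores, so each group lands on exactly one core.
    shuffled = [defaultdict(int) for _ in range(n_cores)]
    for local in partials:
        for group, count in local.items():
            shuffled[hash(group) % n_cores][group] += count

    # Phase 2: each core merges the partial counts it owns and emits
    # final results for its groups.
    final = {}
    for bucket in shuffled:
        final.update(bucket)
    return final

counts = two_phase_count([["a", "b", "a"], ["b", "c"]])
print(counts)
```

&lt;p&gt;The key property is the repartition step: because routing depends only on the group’s hash, every occurrence of a group lands on the same core in phase two.&lt;/p&gt;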

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/g8yqMdk62nuJ4sRBgO4z5/b2e515fe0dbc065a9df281cc354eb545/Core-A-Core-B.png" alt="Core-A-Core-B" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3:&lt;/strong&gt; Group value distribution across 2 cores during aggregation phases. In the first phase, every group value &lt;code class="language-python"&gt;1, 2, 3, 4&lt;/code&gt; is present in the input stream processed by each core. In the second phase, after repartitioning, the group values &lt;code class="language-python"&gt;1&lt;/code&gt; and &lt;code class="language-python"&gt;2&lt;/code&gt; are processed only by core A, and values &lt;code class="language-python"&gt;3&lt;/code&gt; and &lt;code class="language-python"&gt;4&lt;/code&gt; are processed only by core B.&lt;/p&gt;

&lt;p&gt;There are some additional subtleties in the &lt;a href="https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/physical_plan/aggregates/row_hash.rs"&gt;DataFusion implementation&lt;/a&gt; not mentioned above due to space constraints, such as:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The policy of when to emit data from the first phase’s hash table (e.g. because the data is partially sorted)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Handling specific filters per aggregate (due to the &lt;code class="language-python"&gt;FILTER&lt;/code&gt; SQL clause)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Data types of intermediate values (which may not be the same as the final output for some aggregates such as AVG).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Action taken when memory use exceeds its budget.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id="hash-grouping"&gt;Hash grouping&lt;/h3&gt;

&lt;p&gt;DataFusion queries can compute many different aggregate functions for each group, both &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/aggregate_functions.html"&gt;built in&lt;/a&gt; and/or user defined &lt;a href="https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.AggregateUDF.html"&gt;&lt;code class="language-python"&gt;AggregateUDFs&lt;/code&gt;&lt;/a&gt;. The state for each aggregate function, called an &lt;em&gt;accumulator&lt;/em&gt;, is tracked with a hash table (DataFusion uses the excellent &lt;a href="https://docs.rs/hashbrown/latest/hashbrown/index.html"&gt;hashbrown&lt;/a&gt; &lt;a href="https://docs.rs/hashbrown/latest/hashbrown/raw/struct.RawTable.html"&gt;RawTable API&lt;/a&gt;), which logically stores the “index” identifying the specific group value.&lt;/p&gt;

&lt;h3 id="hash-grouping-in-2700"&gt;Hash grouping in 27.0.0&lt;/h3&gt;

&lt;p&gt;As shown in Figure 4, DataFusion 27.0.0 stores the data in a &lt;a href="https://github.com/apache/arrow-datafusion/blob/4d93b6a3802151865b68967bdc4c7d7ef425b49a/datafusion/core/src/physical_plan/aggregates/utils.rs#L38-L50"&gt;&lt;code class="language-python"&gt;GroupState&lt;/code&gt;&lt;/a&gt; structure which, unsurprisingly, tracks the state for each group. The state for each group consists of:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The actual value of the group columns, in &lt;a href="https://docs.rs/arrow-row/latest/arrow_row/index.html"&gt;Arrow Row&lt;/a&gt; format.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In-progress accumulations (e.g. the running counts for the &lt;code class="language-python"&gt;COUNT&lt;/code&gt; aggregate) for each group, in one of two possible formats &lt;a href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/expr/src/accumulator.rs#L24-L49"&gt;&lt;code class="language-python"&gt;Accumulator&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/aggregate/row_accumulator.rs#L26-L46"&gt;&lt;code class="language-python"&gt;RowAccumulator&lt;/code&gt;&lt;/a&gt;).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Scratch space for tracking which rows match each aggregate in each batch.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2xIILntP7GQkTIRmL45tqZ/4b979836303cf232abfa1eb057c14659/Hash_grouping_in_27-0-0.png" alt="Hash grouping in 27-0-0" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 4:&lt;/strong&gt; Hash group operator structure in DataFusion 27.0.0. A hash table maps each group to a GroupState which contains all the per-group states.&lt;/p&gt;

&lt;p&gt;To compute the aggregate, DataFusion performs the following steps for each input batch:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Calculate hash using &lt;a href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/hash_utils.rs#L264-L307"&gt;efficient vectorized code&lt;/a&gt;, specialized for each data type.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Determine group indexes for each input row using the hash table (creating new entries for newly seen groups).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href="https://github.com/apache/arrow-datafusion/blob/4ab8be57dee3bfa72dd105fbd7b8901b873a4878/datafusion/core/src/physical_plan/aggregates/row_hash.rs#L562-L602"&gt;Update&lt;/a&gt; &lt;code class="language-python"&gt;Accumulator&lt;/code&gt;s &lt;a href="https://github.com/apache/arrow-datafusion/blob/4ab8be57dee3bfa72dd105fbd7b8901b873a4878/datafusion/core/src/physical_plan/aggregates/row_hash.rs#L562-L602"&gt;for each group that had input rows,&lt;/a&gt; assembling the rows into a contiguous range for vectorized accumulator if there are a sufficient number of them.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;DataFusion also stores the hash values in the table to avoid potentially costly hash recomputation when resizing the hash table.&lt;/p&gt;
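&lt;p&gt;The three per-batch steps can be illustrated with a hypothetical Python sketch, where a plain dict stands in for the &lt;code class="language-python"&gt;RawTable&lt;/code&gt; and a single &lt;code class="language-python"&gt;COUNT&lt;/code&gt; accumulator stands in for the accumulator set:&lt;/p&gt;

```python
class HashGrouper:
    """Toy per-batch flow: hash a key, map it to a group index, accumulate."""

    def __init__(self):
        self.group_index = {}   # group key -&gt; dense group index
        self.counts = []        # COUNT accumulator state, one slot per group

    def process_batch(self, keys):
        # Steps 1+2: look up (or create) a dense index for each row's group.
        # Hashing happens inside the dict lookup.
        indexes = []
        for key in keys:
            idx = self.group_index.setdefault(key, len(self.group_index))
            if idx == len(self.counts):
                self.counts.append(0)   # newly seen group: new accumulator slot
            indexes.append(idx)

        # Step 3: update accumulators using the group indexes.
        for idx in indexes:
            self.counts[idx] += 1

g = HashGrouper()
g.process_batch([("u1", ""), ("u2", "cat"), ("u1", "")])
g.process_batch([("u1", "")])
print(g.counts)
```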

&lt;p&gt;This scheme works very well for a relatively small number of distinct groups: all accumulators are efficiently updated with large contiguous batches of rows.&lt;/p&gt;

&lt;p&gt;However, this scheme is not ideal for high cardinality grouping due to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Multiple allocations per group&lt;/strong&gt; for the group value row format, as well as for the &lt;code class="language-python"&gt;RowAccumulator&lt;/code&gt;s and each &lt;code class="language-python"&gt;Accumulator&lt;/code&gt;. The &lt;code class="language-python"&gt;Accumulator&lt;/code&gt; may have additional allocations within it as well.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Non-vectorized updates:&lt;/strong&gt; Accumulator updates often fall back to a slower non-vectorized form because the number of distinct groups is large (and thus number of values per group is small) in each input batch.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id="hash-grouping-in-2800"&gt;Hash grouping in 28.0.0&lt;/h3&gt;

&lt;p&gt;For 28.0.0, we rewrote the core group-by implementation following traditional system optimization principles: fewer allocations, type specialization, and aggressive vectorization.&lt;/p&gt;

&lt;p&gt;DataFusion 28.0.0 uses the same RawTable and still stores group indexes. The major differences, as shown in Figure 5, are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Group values are stored either&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;Inline in the &lt;code class="language-python"&gt;RawTable&lt;/code&gt; (for single columns of primitive types), where the conversion to Row format costs more than its benefit&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;In a separate &lt;a href="https://docs.rs/arrow-row/latest/arrow_row/struct.Row.html"&gt;Rows&lt;/a&gt; structure with a single contiguous allocation for all group values, rather than an allocation per group. Accumulators manage the state for all groups internally, so the code that updates intermediate values is a tight, type-specialized loop. The new &lt;a href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/aggregate/groups_accumulator/mod.rs#L66-L75"&gt;&lt;code class="language-python"&gt;GroupsAccumulator&lt;/code&gt;&lt;/a&gt; interface makes these highly efficient update loops possible.&lt;/p&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img style="padding: 20px;" src="//images.ctfassets.net/o7xu9whrs0u9/EGybGfabKW1L5WTQpwOez/4a3347b6fb6fcff103a11e2478f925a3/Group_State.png" alt="Group State" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 5:&lt;/strong&gt; Hash group operator structure in DataFusion 28.0.0. Group values are stored either directly in the hash table, or in a single allocation using the arrow Row format. The hash table contains group indexes. A single &lt;code class="language-python"&gt;GroupsAccumulator&lt;/code&gt; stores the per-aggregate state for all groups.&lt;/p&gt;

&lt;p&gt;This new structure improves performance significantly for high cardinality groups due to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Reduced allocations:&lt;/strong&gt; There are no longer any individual allocations per group.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Contiguous native accumulator states:&lt;/strong&gt; Type-specialized accumulators store the values for all groups in a single contiguous allocation using a &lt;a href="https://doc.rust-lang.org/std/vec/struct.Vec.html"&gt;Rust Vec&amp;lt;T&amp;gt;&lt;/a&gt; of some native type.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Vectorized state update:&lt;/strong&gt; The inner aggregate update loops, which are type-specialized and in terms of native Vecs, are well-vectorized by the Rust compiler (thanks &lt;a href="https://llvm.org/"&gt;LLVM&lt;/a&gt;!).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
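&lt;p&gt;A hypothetical &lt;code class="language-python"&gt;COUNT&lt;/code&gt; accumulator in the spirit of &lt;code class="language-python"&gt;GroupsAccumulator&lt;/code&gt; illustrates all three points: per-group state for every group lives in one contiguous array, and each batch is applied with a single vectorized scatter update (numpy standing in for the Rust &lt;code class="language-python"&gt;Vec&amp;lt;T&amp;gt;&lt;/code&gt; loops):&lt;/p&gt;

```python
import numpy as np

class CountGroupsAccumulator:
    """All groups' counts live in one contiguous array, not one box per group."""

    def __init__(self):
        self.counts = np.zeros(0, dtype=np.int64)

    def update_batch(self, group_indices):
        # Grow the single allocation to cover any newly seen group indexes
        grow = max(group_indices.max() + 1 - len(self.counts), 0)
        self.counts = np.concatenate([self.counts, np.zeros(grow, dtype=np.int64)])
        # One vectorized scatter-add per batch, no per-group allocations
        np.add.at(self.counts, group_indices, 1)

acc = CountGroupsAccumulator()
acc.update_batch(np.array([0, 1, 0, 2]))
acc.update_batch(np.array([2, 2]))
print(acc.counts)  # [2 1 3]
```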

&lt;h3 id="notes"&gt;Notes&lt;/h3&gt;

&lt;p&gt;Some vectorized grouping implementations store the accumulator state row-wise directly in the hash table, which often uses modern CPU caches efficiently. Managing accumulator state in columnar fashion may sacrifice some cache locality; however, it ensures the size of the hash table remains small even when there are large numbers of groups and aggregates, making it easier for the compiler to vectorize the accumulator update.&lt;/p&gt;

&lt;p&gt;Depending on the cost of recomputing hash values, DataFusion 28.0.0 may or may not store the hash values in the table. This optimizes the tradeoff between the cost of computing the hash value (which is expensive for strings, for example) vs. the cost of storing it in the hash table.&lt;/p&gt;

&lt;p&gt;One subtlety that arises from pushing state updates into GroupsAccumulators is that each accumulator must handle similar variations with/without filtering and with/without nulls in the input. DataFusion 28.0.0 uses a templated &lt;a href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/aggregate/groups_accumulator/accumulate.rs#L28-L54"&gt;&lt;code class="language-python"&gt;NullState&lt;/code&gt;&lt;/a&gt; which encapsulates these common patterns across accumulators.&lt;/p&gt;
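&lt;p&gt;The common pattern that &lt;code class="language-python"&gt;NullState&lt;/code&gt; factors out can be sketched as a single masked update shared by accumulators. This hypothetical Python uses numpy boolean arrays in place of Arrow validity and filter bitmaps:&lt;/p&gt;

```python
import numpy as np

def accumulate_counts(counts, group_indices, valid, filter_mask=None):
    """Update per-group counts, skipping NULL inputs and filtered-out rows."""
    if filter_mask is None:
        mask = valid
    else:
        mask = np.logical_and(valid, filter_mask)
    # Only rows that are non-null AND pass the FILTER clause contribute
    np.add.at(counts, group_indices[mask], 1)
    return counts

counts = np.zeros(3, dtype=np.int64)
groups = np.array([0, 1, 2, 0])
valid = np.array([True, True, False, True])   # row 2 is NULL
filt = np.array([True, False, True, True])    # FILTER clause drops row 1
accumulate_counts(counts, groups, valid, filt)
print(counts)  # [2 0 0]
```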

&lt;p&gt;The code structure is heavily influenced by the fact DataFusion is implemented using &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, a new(ish) systems programming language focused on speed and safety. Rust heavily discourages many of the traditional pointer casting “tricks” used in C/C++ hash grouping implementations. The DataFusion aggregation code is almost entirely &lt;a href="https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html#:~:text=Safe%20Rust%20is%20the%20true,Undefined%20Behavior%20(a.k.a.%20UB)."&gt;&lt;code class="language-python"&gt;safe&lt;/code&gt;&lt;/a&gt;, deviating into &lt;code class="language-python"&gt;unsafe&lt;/code&gt; only when necessary. (Rust is a great choice because it makes DataFusion fast, easy to embed, and prevents many crashes and security issues often associated with multi-threaded C/C++ code).&lt;/p&gt;

&lt;h2 id="clickbench-results"&gt;ClickBench results&lt;/h2&gt;

&lt;p&gt;The full results of running the &lt;a href="https://github.com/ClickHouse/ClickBench/tree/main"&gt;ClickBench&lt;/a&gt; queries against the single Parquet file with DataFusion 27.0.0, DataFusion 28.0.0, and DuckDB 0.8.1 are below. These numbers were run on a GCP &lt;code class="language-python"&gt;e2-standard-8&lt;/code&gt; machine with 8 cores and 32 GB of RAM, using the scripts &lt;a href="https://github.com/alamb/datafusion-duckdb-benchmark"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As the industry moves towards data systems assembled from components, it is increasingly important that they exchange data using open standards such as &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt; and &lt;a href="https://parquet.apache.org/"&gt;Parquet&lt;/a&gt; rather than custom storage and in-memory formats. Thus, this benchmark uses a single input Parquet file, a setup representative of many DataFusion deployments and aligned with the current trend in analytics of avoiding a costly load/transformation into a custom storage format prior to query.&lt;/p&gt;

&lt;p&gt;DataFusion now reaches near-DuckDB-speeds querying Parquet data. While we don’t plan to engage in a benchmarking shootout with a team that literally wrote &lt;a href="https://dl.acm.org/doi/abs/10.1145/3209950.3209955"&gt;Fair Benchmarking Considered Difficult&lt;/a&gt;, hopefully everyone can agree that DataFusion 28.0.0 is a significant improvement.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/TiYSwJknkrb7Si1O2lHoe/5f8f044dd6db8ca9d20549dac95c82e3/ClickBench_-_all_queries.png" alt="ClickBench - all queries" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 6:&lt;/strong&gt; Performance of DataFusion 27.0.0, DataFusion 28.0.0, and DuckDB 0.8.1 on all 43 ClickBench queries against a single &lt;code class="language-python"&gt;hits.parquet&lt;/code&gt; file. Lower is better.&lt;/p&gt;

&lt;h3 id="notes-1"&gt;Notes&lt;/h3&gt;

&lt;p&gt;DataFusion 27.0.0 was not able to run several queries due to either planner bugs (Q9, Q11, Q12, Q14) or running out of memory (Q33). DataFusion 28.0.0 solves those issues.&lt;/p&gt;

&lt;p&gt;DataFusion is faster than DuckDB on queries 21 and 22, likely due to optimized implementations of string pattern matching.&lt;/p&gt;

&lt;h2 id="conclusion-performance-matters"&gt;Conclusion: performance matters&lt;/h2&gt;

&lt;p&gt;Improving aggregation performance by more than a factor of two allows developers building products and projects with DataFusion to spend more time on value-added domain specific features. We believe building systems with DataFusion is much faster than trying to build something similar from scratch. DataFusion increases productivity because it eliminates the need to rebuild well-understood, but costly to implement, analytic database technology. While we’re pleased with the improvements in DataFusion 28.0.0, we are by no means done and are pursuing &lt;a href="https://github.com/apache/arrow-datafusion/issues/7000"&gt;(Even More) Aggregation Performance&lt;/a&gt;. The future for performance is bright.&lt;/p&gt;

&lt;h2 id="acknowledgments"&gt;Acknowledgments&lt;/h2&gt;

&lt;p&gt;DataFusion is a &lt;a href="https://arrow.apache.org/datafusion/contributor-guide/communication.html"&gt;community effort&lt;/a&gt; and this work would not have been possible without contributions from many in the community. A special shout out to &lt;a href="https://github.com/sunchao"&gt;sunchao&lt;/a&gt;, &lt;a href="https://github.com/yjshen"&gt;yjshen&lt;/a&gt;, &lt;a href="https://github.com/yahoNanJing"&gt;yahoNanJing&lt;/a&gt;, &lt;a href="https://github.com/mingmwang"&gt;mingmwang&lt;/a&gt;, &lt;a href="https://github.com/ozankabak"&gt;ozankabak&lt;/a&gt;, &lt;a href="https://github.com/mustafasrepo"&gt;mustafasrepo&lt;/a&gt;, and everyone else who contributed ideas, reviews, and encouragement &lt;a href="https://github.com/apache/arrow-datafusion/pull/6800"&gt;during&lt;/a&gt; this &lt;a href="https://github.com/apache/arrow-datafusion/pull/6904"&gt;work&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="about-datafusion"&gt;About DataFusion&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;Apache Arrow DataFusion&lt;/a&gt; is an extensible query engine and database toolkit, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, that uses &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt; as its in-memory format. DataFusion, along with &lt;a href="https://calcite.apache.org/"&gt;Apache Calcite&lt;/a&gt;, Facebook’s &lt;a href="https://github.com/facebookincubator/velox"&gt;Velox&lt;/a&gt;, and similar technology are part of the next generation “&lt;a href="https://www.usenix.org/publications/login/winter2018/khurana"&gt;Deconstructed Database&lt;/a&gt;” architectures, where new systems are built on a foundation of fast, modular components, rather than as a single tightly integrated system.&lt;/p&gt;

&lt;hr /&gt;

&lt;div id="sup-1"&gt;
&lt;sup&gt;1&lt;/sup&gt; SELECT COUNT(*) FROM &lt;code&gt;'hits.parquet'&lt;/code&gt;;&lt;/div&gt;

&lt;div id="sup-2"&gt;
&lt;sup&gt;2&lt;/sup&gt; SELECT COUNT(DISTINCT "UserID") as num_users FROM &lt;code&gt;'hits.parquet'&lt;/code&gt;;  &lt;/div&gt;

&lt;div id="sup-3"&gt;
&lt;sup&gt;3&lt;/sup&gt; SELECT COUNT(DISTINCT "SearchPhrase") as &lt;code&gt;num_phrases&lt;/code&gt; FROM &lt;code&gt;'hits.parquet'&lt;/code&gt;;  &lt;/div&gt;

&lt;div id="sup-4"&gt;
&lt;sup&gt;4&lt;/sup&gt; SELECT COUNT(*) FROM (SELECT DISTINCT "UserID", "SearchPhrase" FROM &lt;code&gt;'hits.parquet'&lt;/code&gt;) &lt;/div&gt;

&lt;div id="sup-5"&gt;
&lt;sup&gt;5&lt;/sup&gt; Full script at &lt;a href="https://github.com/alamb/datafusion-duckdb-benchmark/blob/main/hash.py" target="_blank"&gt;hash.py&lt;/a&gt; &lt;/div&gt;

&lt;div id="sup-6"&gt;
&lt;sup&gt;6&lt;/sup&gt; &lt;a href="https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_%7B%7D.parquet" target="_blank"&gt; hits_0.parquet&lt;/a&gt;, one of the files from the partitioned ClickBench dataset, which has 100,000 rows and is 117 MB in size. The entire dataset has 100,000,000 rows in a single 14 GB Parquet file. The script did not complete on the entire dataset after 40 minutes, and used 212GB RAM at peak.
 &lt;/div&gt;
</description>
      <pubDate>Tue, 01 Aug 2023 07:35:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion/</guid>
      <category>Developer</category>
      <author>Andrew Lamb, Raphael Taylor-Davies, Daniël Heres (InfluxData)</author>
    </item>
    <item>
      <title>Querying Parquet with Millisecond Latency</title>
      <description>&lt;p&gt;We believe that querying data in &lt;a href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; files directly can achieve similar or better storage efficiency and query performance than most specialized file formats. While it requires significant engineering effort, the benefits of Parquet’s open format and broad ecosystem support make it the obvious choice for a wide class of data systems.&lt;/p&gt;

&lt;p&gt;In this article we explain several advanced techniques needed to query data stored in the Parquet format quickly that we implemented in the &lt;a href="https://docs.rs/parquet/27.0.0/parquet/"&gt;Apache Arrow Rust Parquet reader&lt;/a&gt;. Together these techniques make the Rust implementation one of the fastest, if not the fastest, implementations for querying Parquet files — be it on local disk or remote object storage. It is able to query GBs of Parquet in a &lt;a href="https://github.com/tustvold/access-log-bench"&gt;matter of milliseconds&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="background"&gt;Background&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; is an increasingly popular open format for storing &lt;a href="https://www.influxdata.com/glossary/olap/"&gt;analytic datasets&lt;/a&gt;, and has become the de-facto standard for cost-effective, DBMS-agnostic data storage. Initially created for the Hadoop ecosystem, Parquet’s reach now expands broadly across the data analytics ecosystem due to its compelling combination of:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;High compression ratios&lt;/li&gt;
  &lt;li&gt;Amenability to commodity blob-storage such as S3&lt;/li&gt;
  &lt;li&gt;Broad ecosystem and tooling support&lt;/li&gt;
  &lt;li&gt;Portability across many different platforms and tools&lt;/li&gt;
  &lt;li&gt;Support for &lt;a href="https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/"&gt;arbitrarily structured data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Increasingly, other systems, such as &lt;a href="https://duckdb.org/2021/06/25/querying-parquet.html"&gt;DuckDB&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html#c-spectrum-overview"&gt;Redshift&lt;/a&gt; allow querying data stored in Parquet directly, but support is still often a secondary consideration compared to their native (custom) file formats. Such formats include the DuckDB &lt;code class="language-markup"&gt;.duckdb&lt;/code&gt; file format, the Apache IoTDB &lt;a href="https://github.com/apache/iotdb/blob/master/tsfile/README.md"&gt;TsFile&lt;/a&gt;, the &lt;a href="https://www.vldb.org/pvldb/vol8/p1816-teller.pdf"&gt;Gorilla format&lt;/a&gt;, and others.&lt;/p&gt;

&lt;p&gt;For the first time, access to the same sophisticated query techniques, previously only available in closed source commercial implementations, is now available as open source. The required engineering capacity comes from large, well-run open source projects with global contributor communities, such as &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt; and &lt;a href="https://impala.apache.org/"&gt;Apache Impala&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="parquet-file-format"&gt;Parquet file format&lt;/h2&gt;

&lt;p&gt;Before diving into the details of efficiently reading from &lt;a href="https://www.influxdata.com/glossary/apache-parquet/"&gt;Parquet&lt;/a&gt;, it is important to understand the file layout. The file format is carefully designed to quickly locate the desired information, skip irrelevant portions, and decode what remains efficiently.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The data in a Parquet file is broken into horizontal slices called RowGroups&lt;/li&gt;
  &lt;li&gt;Each RowGroup contains a single ColumnChunk for each column in the schema&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, the following diagram illustrates a Parquet file with three columns “A”, “B” and “C” stored in two RowGroups for a total of 6 ColumnChunks.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/6LB0vApMrjmHuYoJpl1PhV/9f52582c5adad182853a36f9013ba365/Parquet_File_Format_Diagram_12.05.2022v1.png" alt="Parquet File Format Diagram 12.05.2022v1" /&gt;&lt;/p&gt;

&lt;p&gt;The logical values for a ColumnChunk are written using one of the many &lt;a href="https://parquet.apache.org/docs/file-format/data-pages/encodings/"&gt;available encodings&lt;/a&gt; into one or more Data Pages appended sequentially in the file. At the end of a Parquet file is a footer, which contains important metadata, such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The file’s schema information such as column names and types&lt;/li&gt;
  &lt;li&gt;The locations of the RowGroups and ColumnChunks in the file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The footer may also contain other specialized data structures:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Optional statistics for each ColumnChunk including min/max values and null counts&lt;/li&gt;
  &lt;li&gt;Optional pointers to &lt;a href="https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L926-L932"&gt;OffsetIndexes&lt;/a&gt; containing the location of each individual Page&lt;/li&gt;
  &lt;li&gt;Optional pointers to &lt;a href="https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L938"&gt;ColumnIndex&lt;/a&gt; containing summary statistics, such as min/max values, for each Page&lt;/li&gt;
  &lt;li&gt;Optional pointers to &lt;a href="https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L621-L630"&gt;BloomFilterData&lt;/a&gt;, which can quickly check if a value is present in a ColumnChunk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, the logical structure of the two RowGroups and six ColumnChunks in the previous diagram might be stored in a Parquet file as shown in the following diagram (not to scale). The pages for the ColumnChunks come first, followed by the footer. The data, the effectiveness of the encoding scheme, and the settings of the Parquet encoder determine the number and size of the pages needed for each ColumnChunk. In this case, ColumnChunk 1 required two pages while ColumnChunk 6 required only one page. In addition to other information, the footer contains the locations of each Data Page and the types of the columns.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/4zaQPMjZtgqEennJJuK6hX/3c6f41d792ec0560e7134646376a2489/Parquet_File_Format_Diagram_2_12.05.2022v1.png" alt="Parquet File Format Diagram 2 12.05.2022v1" /&gt;&lt;/p&gt;

&lt;p&gt;There are many important criteria to consider when creating Parquet files such as how to optimally order/cluster data and structure it into RowGroups and Data Pages. Such “physical design” considerations are complex, worthy of their own series of articles, and not addressed in this blog post. Instead, we focus on how to use the available structure to make queries very fast.&lt;/p&gt;

&lt;h2 id="optimizing-queries"&gt;Optimizing queries&lt;/h2&gt;

&lt;p&gt;In any query processing system, the following techniques generally improve performance:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Reduce the data that must be transferred from secondary storage for processing (reduce I/O)&lt;/li&gt;
  &lt;li&gt;Reduce the computational load for decoding the data (reduce CPU)&lt;/li&gt;
  &lt;li&gt;Interleave/pipeline the reading and decoding of the data (improve parallelism)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The same principles apply to querying Parquet files, as we describe below:&lt;/p&gt;

&lt;h2 id="decode-optimization"&gt;Decode optimization&lt;/h2&gt;

&lt;p&gt;Parquet achieves impressive compression ratios by using &lt;a href="https://parquet.apache.org/docs/file-format/data-pages/encodings/"&gt;sophisticated encoding techniques&lt;/a&gt; such as run-length encoding, dictionary encoding, delta encoding, and others. Consequently, the CPU-bound task of decoding can dominate query latency. Parquet readers can use a number of techniques to improve the latency and throughput of this task, as we have done in the Rust implementation.&lt;/p&gt;

&lt;h3 id="vectorized-decode"&gt;Vectorized decode&lt;/h3&gt;

&lt;p&gt;Most analytic systems decode multiple values at a time to a columnar memory format, such as Apache Arrow, rather than processing data row-by-row. This is often called vectorized or columnar processing, and is beneficial because it:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Amortizes dispatch overheads to switch on the type of column being decoded&lt;/li&gt;
  &lt;li&gt;Improves cache locality by reading consecutive values from a ColumnChunk&lt;/li&gt;
  &lt;li&gt;Often allows multiple values to be decoded in a single instruction&lt;/li&gt;
  &lt;li&gt;Avoids many small heap allocations by making a single large allocation, yielding significant savings for variable length types such as strings and byte arrays&lt;/li&gt;
&lt;/ul&gt;
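&lt;p&gt;As a minimal illustration of the batching benefit (a toy sketch only, not the actual arrow-rs decoder or the real Parquet RLE format), consider decoding a run-length encoded column into a single pre-allocated buffer:&lt;/p&gt;

```rust
/// Simplified run-length encoded column: (value, run_length) pairs.
/// Illustrative only; real Parquet RLE is a hybrid bit-packed format.
fn decode_rle(runs: &[(i64, usize)]) -> Vec<i64> {
    // Size the output once, avoiding a heap allocation per value
    let total: usize = runs.iter().map(|(_, len)| len).sum();
    let mut out = Vec::with_capacity(total);
    for &(value, len) in runs {
        // One branch per run, not per value: dispatch cost is amortized
        out.extend(std::iter::repeat(value).take(len));
    }
    out
}

fn main() {
    // Two runs decode into five contiguous values
    assert_eq!(decode_rle(&[(7, 3), (9, 2)]), vec![7, 7, 7, 9, 9]);
}
```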

&lt;p&gt;Thus, the Rust Parquet reader implements specialized decoders for reading Parquet directly into a &lt;a href="https://www.influxdata.com/glossary/column-database/"&gt;columnar&lt;/a&gt; memory format (Arrow Arrays).&lt;/p&gt;

&lt;h3 id="streaming-decode"&gt;Streaming decode&lt;/h3&gt;

&lt;p&gt;There is no relationship between which rows are stored in which Pages across ColumnChunks. For example, the logical values for the 10,000th row may be in the first page of column A and in the third page of column B.&lt;/p&gt;

&lt;p&gt;The simplest approach to vectorized decoding, and the one often initially implemented in Parquet decoders, is to decode an entire RowGroup (or ColumnChunk) at a time.&lt;/p&gt;

&lt;p&gt;However, given Parquet’s high compression ratios, a single RowGroup may well contain millions of rows. Decoding so many rows at once is non-optimal because it:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Requires large amounts of intermediate RAM:&lt;/strong&gt; typical in-memory formats optimized for processing, such as Apache Arrow, require much more memory than their Parquet-encoded form.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Increases query latency:&lt;/strong&gt; Subsequent processing steps (like filtering or aggregation) can only begin once the entire RowGroup (or ColumnChunk) is decoded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As such, the best Parquet readers support “streaming” data out by producing configurably sized batches of rows on demand. The batch size must be large enough to amortize decode overhead, but small enough for efficient memory usage and to allow downstream processing to begin concurrently while the subsequent batch is decoded.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2qCu93BPhnyTqeNGzVq1Q5/878e133452698c4341817fc581a37c96/Parquet_File_Streaming_Decode_Diagram_12.05.2022v1.png" alt="Parquet File Streaming Decode Diagram 12.05.2022v1" /&gt;&lt;/p&gt;
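&lt;p&gt;The streaming interface can be sketched as an iterator that yields bounded batches on demand. The &lt;code class="language-rust"&gt;BatchReader&lt;/code&gt; type here is hypothetical and elides all of the real per-column page state described below:&lt;/p&gt;

```rust
/// A toy "streaming" decoder that yields rows in fixed-size batches on
/// demand, rather than materializing an entire RowGroup at once.
struct BatchReader {
    rows_remaining: usize,
    batch_size: usize,
}

impl Iterator for BatchReader {
    type Item = usize; // number of rows in the next decoded batch

    fn next(&mut self) -> Option<usize> {
        if self.rows_remaining == 0 {
            return None;
        }
        let n = self.batch_size.min(self.rows_remaining);
        self.rows_remaining -= n;
        // A real reader would decode `n` rows from buffered pages here, so
        // downstream operators can start before the RowGroup is finished
        Some(n)
    }
}

fn main() {
    // 2,500 rows streamed as batches of at most 1,024 rows
    let batches: Vec<usize> =
        BatchReader { rows_remaining: 2_500, batch_size: 1_024 }.collect();
    assert_eq!(batches, vec![1_024, 1_024, 452]);
}
```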

&lt;p&gt;While streaming is not a complicated feature to explain, the stateful nature of decoding, especially across multiple columns and &lt;a href="https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/"&gt;arbitrarily nested data&lt;/a&gt;, where the relationship between rows and values is not fixed, requires &lt;a href="https://github.com/apache/arrow-rs/blob/b7af85cb8dfe6887bb3fd43d1d76f659473b6927/parquet/src/arrow/record_reader/mod.rs"&gt;complex intermediate buffering&lt;/a&gt; and significant engineering effort to handle correctly.&lt;/p&gt;

&lt;h3 id="dictionary-preservation"&gt;Dictionary preservation&lt;/h3&gt;

&lt;p&gt;Dictionary Encoding, also called &lt;a href="https://pandas.pydata.org/docs/user_guide/categorical.html"&gt;categorical&lt;/a&gt; encoding, is a technique where each value in a column is not stored directly, but instead, an index in a separate list called a “Dictionary” is stored. This technique achieves many of the benefits of &lt;a href="https://en.wikipedia.org/wiki/Third_normal_form#:~:text=Third%20normal%20form%20(3NF)%20is,in%201971%20by%20Edgar%20F."&gt;third normal form&lt;/a&gt; for columns that have repeated values (low &lt;a href="https://www.influxdata.com/glossary/cardinality/"&gt;cardinality&lt;/a&gt;) and is especially effective for columns of strings such as “City”.&lt;/p&gt;

&lt;p&gt;The first page in a ColumnChunk can optionally be a dictionary page, containing a list of values of the column’s type. Subsequent pages within this ColumnChunk can then encode an index into this dictionary, instead of encoding the values directly.&lt;/p&gt;

&lt;p&gt;Given the effectiveness of this encoding, if a Parquet decoder simply decodes dictionary data into the native type, it will inefficiently replicate the same value over and over again, which is especially disastrous for string data. To handle dictionary-encoded data efficiently, the encoding must be preserved during decode. Conveniently, many columnar formats, such as the Arrow &lt;a href="https://docs.rs/arrow/27.0.0/arrow/array/struct.DictionaryArray.html"&gt;DictionaryArray&lt;/a&gt;, support such compatible encodings.&lt;/p&gt;

&lt;p&gt;Preserving dictionary encoding drastically improves performance when reading to an Arrow array, in some cases in excess of &lt;a href="https://github.com/apache/arrow-rs/pull/1180"&gt;60x&lt;/a&gt;, as well as using significantly less memory.&lt;/p&gt;
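&lt;p&gt;The idea can be sketched with a toy dictionary-encoded string column, loosely analogous to Arrow’s &lt;code class="language-rust"&gt;DictionaryArray&lt;/code&gt;. The types and byte accounting are illustrative, not the actual Arrow memory layout:&lt;/p&gt;

```rust
/// A dictionary-encoded string column: each row stores a small integer key
/// into a list of distinct values, rather than the string itself.
struct DictColumn {
    dictionary: Vec<String>,
    keys: Vec<u16>,
}

impl DictColumn {
    fn value(&self, row: usize) -> &str {
        &self.dictionary[self.keys[row] as usize]
    }
    /// Rough payload size if the dictionary encoding is preserved
    fn encoded_bytes(&self) -> usize {
        self.dictionary.iter().map(|s| s.len()).sum::<usize>()
            + self.keys.len() * std::mem::size_of::<u16>()
    }
    /// Rough payload size if every value is naively materialized
    fn materialized_bytes(&self) -> usize {
        self.keys.iter().map(|&k| self.dictionary[k as usize].len()).sum()
    }
}

fn main() {
    // A low-cardinality "City" column with 1,000 rows and 2 distinct values
    let col = DictColumn {
        dictionary: vec!["Boston".into(), "Berlin".into()],
        keys: (0..1000).map(|i| (i % 2) as u16).collect(),
    };
    assert_eq!(col.value(1), "Berlin");
    // Preserving the dictionary stores ~2KB of keys instead of ~6KB of strings
    assert!(col.encoded_bytes() < col.materialized_bytes());
}
```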

&lt;p&gt;The major complicating factor for preserving dictionaries is that the dictionaries are stored per ColumnChunk, and therefore the dictionary changes between RowGroups. The reader must automatically recompute a dictionary for batches that span multiple RowGroups, while also optimizing for the case that batch sizes divide evenly into the number of rows per RowGroup. Additionally, a column may be only &lt;a href="https://github.com/apache/parquet-format/blob/111dbdcf8eff2e9f8e0d4e958cecbc7e00028aca/README.md?plain=1#L194-L199"&gt;partly dictionary encoded&lt;/a&gt;, further complicating implementation. More information on this technique and its complications can be found in the &lt;a href="https://arrow.apache.org/blog/2019/09/05/faster-strings-cpp-parquet/"&gt;blog post&lt;/a&gt; on applying this technique to the C++ Parquet reader.&lt;/p&gt;

&lt;h2 id="projection-pushdown"&gt;Projection pushdown&lt;/h2&gt;

&lt;p&gt;The most basic Parquet optimization, and the one most commonly described for Parquet files, is &lt;em&gt;projection pushdown&lt;/em&gt;, which reduces both I/O and CPU requirements. Projection in this context means “selecting some but not all of the columns.” Given how Parquet organizes data, it is straightforward to read and decode only the ColumnChunks required for the referenced columns.&lt;/p&gt;

&lt;p&gt;For example, consider a SQL query of the form&lt;/p&gt;

&lt;p&gt;&lt;code class="language-sql"&gt;SELECT B from table where A &amp;gt; 35&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This query only needs data for columns A and B (and not C) and the projection can be “pushed down” to the Parquet reader.&lt;/p&gt;

&lt;p&gt;Specifically, using the information in the footer, the Parquet reader can entirely skip fetching (I/O) and decoding (CPU) the Data Pages that store data for column C (ColumnChunk 3 and ColumnChunk 6 in our example).&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/7zolMaASAFpmL7z2UHmA0e/c2f80f4cd664d598b32a48ce6fa9ffe3/Parquet_File_Projection_Pushdown_Diagram_12.05.2022v1.png" alt="Parquet File Projection Pushdown Diagram 12.05.2022v1" /&gt;&lt;/p&gt;
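&lt;p&gt;The bookkeeping can be sketched as follows, using hypothetical footer metadata types: only the byte ranges of ColumnChunks belonging to projected columns are ever fetched or decoded.&lt;/p&gt;

```rust
/// Hypothetical footer metadata: the byte range of one ColumnChunk,
/// tagged with the column it belongs to (illustrative types only).
struct ColumnChunkMeta {
    column: &'static str,
    byte_range: std::ops::Range<u64>,
}

/// Return only the byte ranges that must be read for the projected columns;
/// everything else is never fetched from storage or decoded.
fn project(chunks: &[ColumnChunkMeta], projection: &[&str]) -> Vec<std::ops::Range<u64>> {
    chunks
        .iter()
        .filter(|c| projection.contains(&c.column))
        .map(|c| c.byte_range.clone())
        .collect()
}

fn main() {
    // SELECT B FROM table WHERE A > 35 needs columns A and B, but not C
    let chunks = [
        ColumnChunkMeta { column: "A", byte_range: 0..100 },
        ColumnChunkMeta { column: "B", byte_range: 100..250 },
        ColumnChunkMeta { column: "C", byte_range: 250..400 },
    ];
    let needed = project(&chunks, &["A", "B"]);
    // Column C's ColumnChunk (bytes 250..400) is skipped entirely
    assert_eq!(needed, vec![0..100, 100..250]);
}
```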

&lt;h2 id="predicate-pushdown"&gt;Predicate pushdown&lt;/h2&gt;

&lt;p&gt;Similar to projection pushdown, &lt;strong&gt;predicate&lt;/strong&gt; pushdown also avoids fetching and decoding data from Parquet files, but does so using filter expressions. This technique typically requires closer integration with a query engine such as &lt;a href="https://arrow.apache.org/datafusion/"&gt;DataFusion&lt;/a&gt;, to determine valid predicates and evaluate them during the scan. Unfortunately without careful API design, the Parquet decoder and query engine can end up tightly coupled, preventing reuse (e.g. there are different Impala and Spark implementations in &lt;a href="https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_predicate_pushdown_parquet.html#concept_pgs_plb_mgb"&gt;Cloudera Parquet Predicate Pushdown docs&lt;/a&gt;). The Rust Parquet reader uses the &lt;a href="https://docs.rs/parquet/27.0.0/parquet/arrow/arrow_reader/struct.RowSelector.html"&gt;RowSelection&lt;/a&gt; API to avoid this coupling.&lt;/p&gt;

&lt;h3 id="rowgroup-pruning"&gt;RowGroup pruning&lt;/h3&gt;

&lt;p&gt;The simplest form of predicate pushdown, supported by many Parquet based query engines, uses the statistics stored in the footer to skip entire RowGroups. We call this operation RowGroup pruning, and it is analogous to &lt;a href="https://docs.oracle.com/database/121/VLDBG/GUID-E677C85E-C5E3-4927-B3DF-684007A7B05D.htm#VLDBG00401"&gt;partition pruning&lt;/a&gt; in many classical &lt;a href="https://www.influxdata.com/glossary/data-warehouse/"&gt;data warehouse&lt;/a&gt; systems.&lt;/p&gt;

&lt;p&gt;For the example query above, if the maximum value for A in a particular RowGroup is less than 35, the decoder can skip fetching and decoding any ColumnChunks from that &lt;strong&gt;entire&lt;/strong&gt; RowGroup.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/5W1ZN8oCkbZAMcqRAwrLJz/1d19e94d3ad8a60a08fa848eea2b4fcb/Parquet_File_RowGroup_Pruning_Diagram_12.05.2022v1.png" alt="Parquet File RowGroup Pruning Diagram 12.05.2022v1" /&gt;&lt;/p&gt;
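&lt;p&gt;RowGroup pruning can be sketched as follows (illustrative types; real footer statistics are optional and typed per column). Note that statistics can only prove absence, never presence: a RowGroup that survives pruning may still contain no matching rows.&lt;/p&gt;

```rust
/// Hypothetical min/max statistics for column "A" in one RowGroup,
/// as they might appear in the Parquet footer.
struct RowGroupStats {
    a_min: i64,
    a_max: i64,
}

/// Keep only the indexes of RowGroups whose statistics might satisfy A > 35.
fn prune(groups: &[RowGroupStats]) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, s)| s.a_max > 35) // max <= 35 proves no row can match
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let groups = [
        RowGroupStats { a_min: 0, a_max: 20 },  // max < 35: skipped entirely
        RowGroupStats { a_min: 30, a_max: 90 }, // may contain matches: kept
    ];
    assert_eq!(prune(&groups), vec![1]);
}
```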

&lt;p&gt;Note that pruning on minimum and maximum values is effective for many data layouts and column types, but not all. Specifically, it is not as effective for columns with many distinct pseudo-random values (e.g. identifiers or uuids). Thankfully for this use case, Parquet also supports per ColumnChunk &lt;a href="https://github.com/apache/parquet-format/blob/master/BloomFilter.md"&gt;Bloom Filters&lt;/a&gt;. We are actively working on &lt;a href="https://github.com/apache/arrow-rs/issues/3023"&gt;adding bloom filter&lt;/a&gt; support in Apache Rust’s implementation.&lt;/p&gt;

&lt;h3 id="page-pruning"&gt;Page pruning&lt;/h3&gt;

&lt;p&gt;A more sophisticated form of predicate pushdown uses the optional &lt;a href="https://github.com/apache/parquet-format/blob/master/PageIndex.md"&gt;page index&lt;/a&gt; in the footer metadata to rule out entire Data Pages. The decoder decodes only the corresponding rows from other columns, often skipping entire pages.&lt;/p&gt;

&lt;p&gt;This optimization is complicated by the fact that pages in different ColumnChunks often contain different numbers of rows. While the page index may identify the needed pages for one column, pruning a page from that column doesn’t immediately rule out entire pages in the other columns.&lt;/p&gt;

&lt;p&gt;Page pruning proceeds as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Uses the predicates in combination with the page index to identify pages to skip&lt;/li&gt;
  &lt;li&gt;Uses the offset index to determine what row ranges correspond to non-skipped pages&lt;/li&gt;
  &lt;li&gt;Computes the intersection of ranges across non-skipped pages, and decodes only those rows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This last point is highly non-trivial to implement, especially for nested lists where &lt;a href="https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/"&gt;a single row may correspond to multiple values&lt;/a&gt;. Fortunately, the Rust Parquet reader hides this complexity internally, and can decode arbitrary &lt;a href="https://docs.rs/parquet/27.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html"&gt;RowSelections&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, to scan Columns A and B, stored in 5 Data Pages as shown in the figure below:&lt;/p&gt;

&lt;p&gt;If the predicate is A &amp;gt; 35,&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Page 1 is pruned using the page index (its maximum value is 20), leaving a RowSelection of [200-&amp;gt;onwards]&lt;/li&gt;
  &lt;li&gt;The Parquet reader skips Page 3 entirely (as its last row index is 99)&lt;/li&gt;
  &lt;li&gt;Only the relevant rows are decoded, by reading Pages 2, 4, and 5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the predicate is instead A &amp;gt; 35 AND B = “F”, the page index is even more effective:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Using A &amp;gt; 35, yields a RowSelection of [200-&amp;gt;onwards] as before&lt;/li&gt;
  &lt;li&gt;Using B = “F”, on the remaining Page 4 and Page 5 of B, yields a RowSelection of [100-244]&lt;/li&gt;
  &lt;li&gt;Intersecting the two RowSelections leaves a combined RowSelection [200-244]&lt;/li&gt;
  &lt;li&gt;Parquet reader only decodes those 50 rows from Page 2 and Page 4.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/1vgrpsarMMbWmHK7bPmbEl/c431046fe1c6e53ade36c96e068bb9dc/Parquet_File_Page_Pruning_Diagram_12.05.2022v1.png" alt="Parquet File Page Pruning Diagram 12.05.2022v1" width="600" height="auto" /&gt;&lt;/p&gt;
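&lt;p&gt;The core range-intersection step can be sketched as follows (a toy version; the parquet crate’s RowSelection handles this internally, including nested data). The open-ended [200-&amp;gt;onwards] selection is capped at an assumed 350 rows here purely for illustration:&lt;/p&gt;

```rust
/// Intersect two sorted, non-overlapping lists of inclusive row ranges,
/// as a reader must do when combining per-column page index results.
fn intersect(a: &[(u64, u64)], b: &[(u64, u64)]) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    for &(a_start, a_end) in a {
        for &(b_start, b_end) in b {
            let start = a_start.max(b_start);
            let end = a_end.min(b_end);
            if start <= end {
                // The ranges overlap; only these rows need decoding
                out.push((start, end));
            }
        }
    }
    out
}

fn main() {
    // A > 35 keeps rows [200, 350]; B = "F" keeps rows [100, 244]
    let combined = intersect(&[(200, 350)], &[(100, 244)]);
    assert_eq!(combined, vec![(200, 244)]);
}
```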

&lt;p&gt;Support for reading and writing these indexes from Arrow C++, and by extension pyarrow/pandas, is tracked in &lt;a href="https://issues.apache.org/jira/browse/PARQUET-1404"&gt;PARQUET-1404&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="late-materialization"&gt;Late materialization&lt;/h3&gt;

&lt;p&gt;The two previous forms of predicate pushdown only operated on metadata stored for RowGroups, ColumnChunks, and Data Pages prior to decoding values. However, the same techniques also extend to values of one or more columns &lt;em&gt;after&lt;/em&gt; decoding them but prior to decoding other columns, which is often called “late materialization”.&lt;/p&gt;

&lt;p&gt;This technique is especially effective when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The predicate is very selective, i.e. filters out large numbers of rows&lt;/li&gt;
  &lt;li&gt;Each row is large, either due to wide rows (e.g. JSON blobs) or many columns&lt;/li&gt;
  &lt;li&gt;The selected data is clustered together&lt;/li&gt;
  &lt;li&gt;The columns required by the predicate are relatively inexpensive to decode, e.g. PrimitiveArray / DictionaryArray&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is additional discussion about the benefits of this technique in &lt;a href="https://issues.apache.org/jira/browse/SPARK-36527"&gt;SPARK-36527&lt;/a&gt; and &lt;a href="https://docs.cloudera.com/cdw-runtime/cloud/impala-reference/topics/impala-lazy-materialization.html"&gt;Impala&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, given the predicate A &amp;gt; 35 AND B = “F” from above, where the engine used the page index to determine that only 50 rows within the RowSelection of [200-244] could match, with late materialization the Parquet decoder:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Decodes the 50 values of Column A&lt;/li&gt;
  &lt;li&gt;Evaluates A &amp;gt; 35 on those 50 values&lt;/li&gt;
  &lt;li&gt;In this case, only 5 rows pass, resulting in the RowSelection:
    &lt;ul&gt;
      &lt;li&gt;RowSelection[205-206]&lt;/li&gt;
      &lt;li&gt;RowSelection[238-240]&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Only decodes the 5 rows for column B for those selections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/16g37IQh6ljdiA0u87sQbt/1d677280f4f0853303ce33de9db9ec5f/Parquet_File_Late_Materialization_Diagram_12.05.2022v1.png" alt="Parquet File Late Materialization Diagram 12.05.2022v1" /&gt;&lt;/p&gt;
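&lt;p&gt;The steps above can be sketched as follows, with a stand-in &lt;code class="language-rust"&gt;decode&lt;/code&gt; function representing the expensive per-column decode step (the column contents are contrived so that exactly the five example rows pass the predicate):&lt;/p&gt;

```rust
/// Stand-in for the per-column Parquet decode step: materialize only the
/// values for the selected row indexes.
fn decode(column: &[i64], selection: &[usize]) -> Vec<i64> {
    selection.iter().map(|&i| column[i]).collect()
}

fn main() {
    // Contrived columns: A is > 35 only at rows 205-206 and 238-240
    let col_a: Vec<i64> = (0..300)
        .map(|i| if (205..=206).contains(&i) || (238..=240).contains(&i) { 40 } else { 10 })
        .collect();
    let col_b: Vec<i64> = (0..300).map(|i| i * 10).collect();

    // Rows surviving the page index: [200, 244]
    let selection: Vec<usize> = (200..=244).collect();

    // 1. Decode the selected values of column A and evaluate A > 35 on them
    let a_vals = decode(&col_a, &selection);
    // 2. Refine the selection to the rows where the predicate holds
    let refined: Vec<usize> = selection
        .iter()
        .zip(&a_vals)
        .filter(|&(_, &a)| a > 35)
        .map(|(&i, _)| i)
        .collect();
    assert_eq!(refined, vec![205, 206, 238, 239, 240]);

    // 3. Only now decode column B, for the 5 refined rows alone
    let b_vals = decode(&col_b, &refined);
    assert_eq!(b_vals.len(), 5);
}
```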

&lt;p&gt;In certain cases, such as our example where B stores single character values, the cost of late materialization machinery can outweigh the savings in decoding. However, the savings can be substantial when some of the conditions listed above are fulfilled. The query engine must decide which predicates to push down and in which order to apply them for optimal results.&lt;/p&gt;

&lt;p&gt;While it is outside the scope of this document, the same technique can be applied for multiple predicates as well as predicates on multiple columns. See the &lt;a href="https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html"&gt;RowFilter&lt;/a&gt; interface in the Parquet crate for more information, and the &lt;a href="https://github.com/apache/arrow-datafusion/blob/58b43f5c0b629be49a3efa0e37052ec51d9ba3fe/datafusion/core/src/physical_plan/file_format/parquet/row_filter.rs#L40-L70"&gt;row_filter&lt;/a&gt; implementation in DataFusion.&lt;/p&gt;

&lt;h2 id="io-pushdown"&gt;I/O pushdown&lt;/h2&gt;

&lt;p&gt;While Parquet was designed for efficient access on the &lt;a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html"&gt;HDFS distributed file system&lt;/a&gt;, it works very well with commodity blob storage systems such as AWS S3 as they have very similar characteristics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Relatively slow “random access” reads&lt;/strong&gt;: it is much more efficient to read large (MBs) sections of data in each request than issue many requests for smaller portions&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Significant latency before retrieving the first byte&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;High per-request cost&lt;/strong&gt;: Often billed per request, regardless of number of bytes read, which incentivizes fewer requests that each read a large contiguous section of data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To read optimally from such systems, a Parquet reader must:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Minimize the number of I/O requests, while also applying the various pushdown techniques to avoid fetching large amounts of unused data.&lt;/li&gt;
  &lt;li&gt;Integrate with the appropriate task scheduling mechanism to interleave I/O and processing on the data that is fetched to avoid pipeline bottlenecks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As these are substantial engineering and integration challenges, many Parquet readers still require the files to be fetched in their entirety to local storage.&lt;/p&gt;

&lt;p&gt;Fetching entire files in order to process them is not ideal for several reasons:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;High Latency:&lt;/strong&gt; Decoding cannot begin until the entire file is fetched (Parquet metadata is at the end of the file, so the decoder must see the end prior to decoding the rest)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Wasted work:&lt;/strong&gt; Fetching the entire file retrieves all the necessary data, but also potentially large amounts of unnecessary data that will be skipped after reading the footer, which unnecessarily increases cost.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Requires costly “locally attached” storage (or memory):&lt;/strong&gt; Many cloud environments do not offer computing resources with locally attached storage – they either rely on expensive network block storage such as AWS EBS or else restrict local storage to certain classes of VMs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Avoiding the need to buffer the entire file requires a sophisticated Parquet decoder, integrated with the I/O subsystem, that can initially fetch and decode the metadata followed by ranged fetches for the relevant data blocks, interleaved with the decoding of Parquet data. This optimization requires careful engineering to fetch large enough blocks of data from the object store that the per request overhead doesn’t dominate gains from reducing the bytes transferred. &lt;a href="https://issues.apache.org/jira/browse/SPARK-36529"&gt;SPARK-36529&lt;/a&gt; describes the challenges of sequential processing in more detail.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/6uRMH3yBp7B9xJ6R3BWtGC/3971a116b1ce47186705c0aab5e199b4/Parquet_File_IO_Pushdown_Diagram_12.05.2022v1.png" alt="Parquet File IO Pushdown Diagram 12.05.2022v1" /&gt;&lt;/p&gt;

&lt;p&gt;Not included in this diagram are details such as coalescing requests and ensuring the minimum request sizes needed for an actual implementation.&lt;/p&gt;
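&lt;p&gt;Request coalescing can be sketched as follows (a toy version of what such an implementation must do, not the actual object_store API): byte ranges closer together than some threshold merge into a single larger request, trading a few extra bytes transferred for far fewer high-latency, per-billed requests.&lt;/p&gt;

```rust
/// Merge nearby byte ranges into fewer, larger object store requests.
/// Ranges whose gap is at most `max_gap` bytes are fetched together.
fn coalesce(mut ranges: Vec<std::ops::Range<u64>>, max_gap: u64) -> Vec<std::ops::Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<std::ops::Range<u64>> = Vec::new();
    for r in ranges {
        // Merge into the previous request if the gap is small enough
        let merge = match out.last() {
            Some(last) => r.start <= last.end + max_gap,
            None => false,
        };
        if merge {
            let last = out.last_mut().unwrap();
            last.end = last.end.max(r.end);
        } else {
            out.push(r);
        }
    }
    out
}

fn main() {
    // Three page reads, two of them nearly adjacent
    let fetches = coalesce(vec![0..100, 120..200, 10_000..11_000], 64);
    // One request covers the first two pages; the distant one stays separate
    assert_eq!(fetches, vec![0..200, 10_000..11_000]);
}
```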

&lt;p&gt;The Rust Parquet crate provides an async Parquet reader that can efficiently read from any &lt;a href="https://docs.rs/parquet/latest/parquet/arrow/async_reader/trait.AsyncFileReader.html"&gt;AsyncFileReader&lt;/a&gt;. This reader:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Efficiently reads from any storage medium that supports range requests&lt;/li&gt;
  &lt;li&gt;Integrates with Rust’s futures ecosystem to avoid blocking threads waiting on network I/O &lt;a href="https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/"&gt;and easily can interleave CPU and network&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Requests multiple ranges simultaneously, to allow the implementation to coalesce adjacent ranges, fetch ranges in parallel, etc.&lt;/li&gt;
  &lt;li&gt;Uses the pushdown techniques described previously to eliminate fetching data where possible&lt;/li&gt;
  &lt;li&gt;Integrates easily with the Apache Arrow &lt;a href="https://docs.rs/object_store/latest/object_store/"&gt;object_store&lt;/a&gt; crate which you can read more about &lt;a href="https://www.influxdata.com/blog/rust-object-store-donation/"&gt;here&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To give a sense of what is possible, the following picture shows a timeline of fetching the footer metadata from remote files, using that metadata to determine what Data Pages to read, and then fetching data and decoding simultaneously. This process often must be done for more than one file at a time in order to match network latency, bandwidth, and available CPU.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/23iUnmSLDBw5yyM291OhRm/d912f29c1d0dbba75d1f79bfd22e1b80/Parquet_File_IO_Pushdown_Diagram_2_12.05.2022v1.png" alt="Parquet File IO Pushdown Diagram 2 12.05.2022v1" /&gt;&lt;/p&gt;

&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We hope you enjoyed reading about the Parquet file format and the various techniques used to quickly query Parquet files.&lt;/p&gt;

&lt;p&gt;We believe that the reason most open source implementations of Parquet do not have the breadth of features described in this post is that it takes a monumental effort that was previously only possible at well-financed commercial enterprises, which kept their implementations closed source.&lt;/p&gt;

&lt;p&gt;However, with the growth and quality of the Apache Arrow community, both Rust practitioners and the wider Arrow community, our ability to collaborate and build a cutting-edge open source implementation is exhilarating and immensely satisfying. The technology described in this blog is the result of the contributions of many engineers spread across companies, hobbyists, and the world in several repositories, notably &lt;a href="https://github.com/apache/arrow-datafusion"&gt;Apache Arrow DataFusion&lt;/a&gt;, &lt;a href="https://github.com/apache/arrow-rs"&gt;Apache Arrow&lt;/a&gt; and &lt;a href="https://github.com/apache/arrow-ballista"&gt;Apache Arrow Ballista.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are interested in joining the DataFusion Community, please &lt;a href="https://arrow.apache.org/datafusion/contributor-guide/communication.html"&gt;get in touch&lt;/a&gt;.&lt;/p&gt;
</description>
      <pubDate>Wed, 07 Dec 2022 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/querying-parquet-millisecond-latency/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/querying-parquet-millisecond-latency/</guid>
      <category>Product</category>
      <category>Use Cases</category>
      <author>Raphael Taylor-Davies, Andrew Lamb (InfluxData)</author>
    </item>
    <item>
      <title>Rust Object Store Donation</title>
      <description>&lt;p&gt;Today we are happy to officially announce that InfluxData has donated a &lt;a href="https://github.com/apache/arrow-rs/issues/2030"&gt;generic object store implementation to the Apache Arrow project&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Using this crate, the same code can easily interact with AWS S3, Azure Blob Storage, Google Cloud Storage, local files, memory, and more by a simple runtime configuration change.&lt;/p&gt;

&lt;p&gt;You can find the &lt;a href="https://crates.io/crates/object_store"&gt;latest release on crates.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We expect this will accelerate the pace of innovation within the Rust ecosystem. Whether you are building a cloud-agnostic service to handle user-uploaded videos, images, and documents, a high-performance analytics system, or something else that needs access to commodity object storage, this crate can help you and we can’t wait to see what people build with it.&lt;/p&gt;

&lt;h2 id="why-do-you-need-an-object-store-crate"&gt;Why do you need an object store crate?&lt;/h2&gt;

&lt;p&gt;Aside from providing bulk data storage for many cloud-based services, we believe the future of analytic systems in particular involves querying data stored on object storage.&lt;/p&gt;

&lt;p&gt;Object store is the generic term for what might be loosely described as an “infinite FTP server in the cloud” that offers almost unlimited, highly available, and durable key-value storage on demand. Alongside virtual machines and block storage, object storage is one of the key commodity services provided by all modern cloud service providers. Examples include &lt;a href="https://aws.amazon.com/s3/"&gt;S3&lt;/a&gt;, &lt;a href="https://azure.microsoft.com/en-gb/services/storage/blobs/#overview"&gt;Microsoft Azure Blob Storage&lt;/a&gt;, &lt;a href="https://cloud.google.com/storage/"&gt;Google Cloud Storage&lt;/a&gt;, &lt;a href="https://min.io/"&gt;MinIO&lt;/a&gt;, &lt;a href="https://docs.ceph.com/en/quincy/radosgw/index.html"&gt;Ceph Object Gateway&lt;/a&gt;, &lt;a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Introduction"&gt;HDFS&lt;/a&gt;, and others.&lt;/p&gt;

&lt;p&gt;To achieve this near-infinite scaling, object stores provide a subset of the functionality of traditional file systems such as &lt;a href="https://docs.microsoft.com/en-us/windows-server/storage/file-server/ntfs-overview"&gt;NTFS&lt;/a&gt; or &lt;a href="https://docs.kernel.org/admin-guide/ext4.html"&gt;ext4&lt;/a&gt;. Specifically, they identify objects with a “key” and store arbitrary bytes as a value:&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/3a6on0Fu2MNaSu2Ps49H2p/3c190cbcf671e631ad24310295ec0e1d/rust-object-store-donation-figure-1.PNG" alt="Rust Object Store Donation - Figure 1" /&gt;
&lt;strong&gt;Figure 1:&lt;/strong&gt; Object stores store arbitrary bytes identified by a string key.&lt;/p&gt;

&lt;p&gt;Unlike filesystems, object stores typically lack an explicit notion of directories, and best practice uses a restricted subset of ASCII for keys. Instead, path-like traversal is achieved using LIST operations with a prefix, and illegal character sequences are percent-encoded.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/3SEfVhuyhjwhTuaTPXY3IU/a7135851515724c94b2532b6f82e0612/rust-object-store-donation-figure-2.PNG" alt="Rust Object Store Donation - Figure 2" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2:&lt;/strong&gt; Object stores can LIST objects with a specified prefix, which can be used to group files together. In this example, asking for objects with prefix “&lt;code class="language-rust"&gt;/pictures/&lt;/code&gt;” results in all the &lt;code class="language-rust"&gt;.jpg&lt;/code&gt; objects, while asking for prefix “&lt;code class="language-rust"&gt;/parquet/&lt;/code&gt;” results in all the &lt;code class="language-rust"&gt;.parquet&lt;/code&gt; objects.&lt;/p&gt;
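&lt;p&gt;The key-value model plus prefix LIST can be sketched with a toy in-memory store (illustrative only; the object_store crate’s real API is async and returns richer object metadata):&lt;/p&gt;

```rust
use std::collections::BTreeMap;

/// A toy in-memory "object store": arbitrary bytes identified by string keys,
/// with no real directories. Path-like grouping falls out of LIST-with-prefix.
struct ObjectStore {
    objects: BTreeMap<String, Vec<u8>>,
}

impl ObjectStore {
    /// "Directory" traversal is just a prefix scan over the sorted keys
    fn list(&self, prefix: &str) -> Vec<&str> {
        self.objects
            .keys()
            .filter(|k| k.starts_with(prefix))
            .map(|k| k.as_str())
            .collect()
    }
}

fn main() {
    let mut objects = BTreeMap::new();
    objects.insert("/pictures/cat.jpg".to_string(), vec![1]);
    objects.insert("/pictures/dog.jpg".to_string(), vec![2]);
    objects.insert("/parquet/data.parquet".to_string(), vec![3]);
    let store = ObjectStore { objects };

    // Asking for the "/pictures/" prefix yields only the .jpg objects
    assert_eq!(store.list("/pictures/"), vec!["/pictures/cat.jpg", "/pictures/dog.jpg"]);
    assert_eq!(store.list("/parquet/").len(), 1);
}
```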

&lt;p&gt;Consistently listing and traversing the quasi-directory structure encoded in object keys, across object store implementations and local file systems, is a common source of frustration: not only do filesystems behave very differently from object stores, but each object store implementation has its own quirks.&lt;/p&gt;
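&lt;p&gt;To illustrate the quasi-directory model, the following stand-alone sketch (a toy model, not the object_store API) emulates directory traversal over a flat key space using prefix filtering, similar in spirit to how LIST-with-delimiter calls report “common prefixes”:&lt;/p&gt;

```rust
use std::collections::{BTreeMap, BTreeSet};

// Toy model of an object store: a flat, sorted map from key to bytes.
// Directory-style traversal is emulated purely by filtering on prefixes.
fn list_with_prefix<'a>(
    store: &'a BTreeMap<String, Vec<u8>>,
    prefix: &str,
) -> Vec<&'a str> {
    store
        .keys()
        .filter(|k| k.starts_with(prefix))
        .map(|k| k.as_str())
        .collect()
}

// Derive the first-level "subdirectories" under a prefix, the way a
// LIST-with-delimiter call reports common prefixes.
fn common_prefixes(store: &BTreeMap<String, Vec<u8>>, prefix: &str) -> BTreeSet<String> {
    list_with_prefix(store, prefix)
        .into_iter()
        .filter_map(|k| {
            let rest = &k[prefix.len()..];
            rest.split_once('/').map(|(dir, _)| format!("{prefix}{dir}/"))
        })
        .collect()
}

fn main() {
    let mut store = BTreeMap::new();
    for key in ["pictures/a.jpg", "pictures/b.jpg", "parquet/1.parquet"] {
        store.insert(key.to_string(), Vec::new());
    }
    // All objects "inside" the pictures/ pseudo-directory
    println!("{:?}", list_with_prefix(&store, "pictures/"));
    // The top-level pseudo-directories
    println!("{:?}", common_prefixes(&store, ""));
}
```

&lt;p&gt;Note there is no directory object anywhere: “pictures/” exists only because some keys happen to start with it.&lt;/p&gt;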

&lt;p&gt;Having a focused, easy-to-use, high-performance, async object store library, written in idiomatic Rust, frees you from worrying about these details and lets you instead focus on your system’s logic. The underlying implementation is abstracted away from application code, and can easily be selected at runtime, allowing the same binary to run in multiple clouds.&lt;/p&gt;

&lt;p&gt;This flexibility also facilitates local development, as it allows testing against a local filesystem, or even an in-memory store, without requiring any additional binaries such as &lt;a href="https://min.io/"&gt;MinIO&lt;/a&gt;, while still allowing the use of familiar tools such as &lt;code class="language-rust"&gt;ls&lt;/code&gt; and &lt;code class="language-rust"&gt;cat&lt;/code&gt; or your choice of file browser.&lt;/p&gt;

&lt;h2 id="how-to-use-it"&gt;How to use it?&lt;/h2&gt;

&lt;p&gt;Here is a simplistic example that counts the zero bytes in files stored on remote object storage:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;use std::sync::Arc;

use futures::stream::{FuturesOrdered, StreamExt};
use object_store::{path::Path, ObjectStore};

#[tokio::main]
async fn main() {
    let object_store: Arc&amp;lt;dyn ObjectStore&amp;gt; = get_object_store();

    // List all objects in the "parquet" prefix (aka directory)
    let path: Path = "parquet".try_into().unwrap();
    let list_stream = object_store
        .list(Some(&amp;amp;path))
        .await
        .expect("Error listing files");

    // For each listed object, fetch its contents and count the zeros
    list_stream
        .map(|meta| async {
            let meta = meta.expect("Error listing");

            // Fetch the bytes from object storage as a stream
            let stream = object_store
                .get(&amp;amp;meta.location)
                .await
                .unwrap()
                .into_stream();

            // Count the zeros in each chunk and sum the counts
            let num_zeros = stream
                .map(|bytes| {
                    let bytes = bytes.unwrap();
                    bytes.iter().filter(|b| **b == 0).count()
                })
                .collect::&amp;lt;Vec&amp;lt;usize&amp;gt;&amp;gt;()
                .await
                .into_iter()
                .sum::&amp;lt;usize&amp;gt;();

            (meta.location.to_string(), num_zeros)
        })
        .collect::&amp;lt;FuturesOrdered&amp;lt;_&amp;gt;&amp;gt;()
        .await
        .collect::&amp;lt;Vec&amp;lt;_&amp;gt;&amp;gt;()
        .await
        .into_iter()
        .for_each(|i| println!("{} has {} zeros", i.0, i.1));
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which prints out something like:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;test_fixtures/parquet/1.parquet has 174 zeros
test_fixtures/parquet/2.parquet has 53 zeros&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As written, the code lists the files (the LIST requests are paginated under the covers) and fetches all of their contents in parallel. This may not be great if there are thousands of files. However, we can easily take advantage of Rust’s stream combinators and change&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;.collect::&amp;lt;FuturesOrdered&amp;lt;_&amp;gt;&amp;gt;()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;to&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;.buffered(10)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;which limits the program to at most 10 GET requests in flight at any time.&lt;/p&gt;

&lt;p&gt;The coolest part of the object_store crate is that the same code works for all the different object stores – the only thing that changes is the definition of &lt;code class="language-rust"&gt;get_object_store&lt;/code&gt;.&lt;/p&gt;
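&lt;p&gt;Each cloud backend lives behind a Cargo feature flag, so a binary only compiles the implementations it needs. A minimal sketch of the dependency declaration follows; the version number is illustrative, and you should check the crate documentation for the exact feature names in the version you use:&lt;/p&gt;

```toml
[dependencies]
# Enable only the cloud backends you need; the local filesystem
# and in-memory stores are always available.
object_store = { version = "0.4", features = ["aws", "azure", "gcp"] }
```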

&lt;p&gt;To read from S3:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;fn get_object_store() -&amp;gt; Arc&amp;lt;dyn ObjectStore&amp;gt; {
    let s3 = AmazonS3Builder::new()
        .with_access_key_id(ACCESS_KEY_ID)
        .with_secret_access_key(SECRET_KEY)
        .with_region(REGION)
        .with_bucket_name(BUCKET_NAME)
        .build()
        .expect("error creating s3");

    Arc::new(s3)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To read from Azure:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;fn get_object_store() -&amp;gt; Arc&amp;lt;dyn ObjectStore&amp;gt; {
    let azure = MicrosoftAzureBuilder::new()
        .with_account(STORAGE_ACCOUNT)
        .with_access_key(ACCESS_KEY)
        .with_container_name(BUCKET_NAME)
        .build()
        .expect("error creating azure");

    Arc::new(azure)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To read from GCP:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;fn get_object_store() -&amp;gt; Arc&amp;lt;dyn ObjectStore&amp;gt; {
    let gcs = GoogleCloudStorageBuilder::new()
        .with_service_account_path(PATH_TO_SERVICE_ACCOUNT_JSON)
        .with_bucket_name(BUCKET_NAME)
        .build()
        .expect("error creating gcs");
    Arc::new(gcs)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To read from the local filesystem:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;fn get_object_store() -&amp;gt; Arc&amp;lt;dyn ObjectStore&amp;gt; {
    let local_fs =
        LocalFileSystem::new_with_prefix(PREFIX)
          .expect("Error creating local file system");
    Arc::new(local_fs)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To reiterate, the major benefit is that you do not have to integrate different abstractions for the different object stores – the client code is always the same, and under the covers it uses the appropriate optimized implementation.&lt;/p&gt;
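&lt;p&gt;The pattern that makes this possible is ordinary Rust trait-object dispatch. The following simplified, synchronous sketch (the real ObjectStore trait is async and far richer; the trait and types here are hypothetical) shows how client code can depend only on the trait while the backend is chosen at runtime:&lt;/p&gt;

```rust
use std::sync::Arc;

// Simplified, synchronous stand-in for the pattern: client code depends
// only on the trait object, never on a concrete backend. (The real
// ObjectStore trait is async and has a much larger surface area.)
trait Store {
    fn get(&self, key: &str) -> Vec<u8>;
}

struct InMemory;
impl Store for InMemory {
    fn get(&self, _key: &str) -> Vec<u8> {
        b"in-memory bytes".to_vec()
    }
}

struct LocalFs;
impl Store for LocalFs {
    fn get(&self, _key: &str) -> Vec<u8> {
        b"local-fs bytes".to_vec()
    }
}

// The backend is selected at runtime, e.g. from configuration, so the
// same binary can run against different storage systems.
fn get_store(kind: &str) -> Arc<dyn Store> {
    match kind {
        "memory" => Arc::new(InMemory),
        _ => Arc::new(LocalFs),
    }
}

fn main() {
    // Identical client code for every backend
    for kind in ["memory", "file"] {
        let store = get_store(kind);
        println!("{} -> {} bytes", kind, store.get("parquet/1.parquet").len());
    }
}
```

&lt;p&gt;Swapping the in-memory backend for a real filesystem or cloud store changes only &lt;code class="language-rust"&gt;get_store&lt;/code&gt;, mirroring how the examples above only vary &lt;code class="language-rust"&gt;get_object_store&lt;/code&gt;.&lt;/p&gt;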

&lt;p&gt;The &lt;a href="https://crates.io/crates/object_store"&gt;object_store&lt;/a&gt; crate is also extensible, allowing other object storage systems to be plugged in while retaining the ability to read files from the local filesystem and to take advantage of the optimized file access some systems offer – see &lt;a href="https://docs.rs/object_store/latest/object_store/enum.GetResult.html"&gt;GetResult&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A more full-featured and working example can be found in the &lt;a href="https://github.com/alamb/rust_object_store_demo"&gt;rust_object_store_demo&lt;/a&gt; repository.&lt;/p&gt;

&lt;h2 id="why-donate-to-apache"&gt;Why donate to Apache&lt;/h2&gt;

&lt;p&gt;The dream for Rust is the development productivity of Python or Ruby with the speed and memory efficiency of C/C++. Part of delivering this dream is ensuring that it integrates easily with the broader technology ecosystem, and in modern analytic systems this increasingly means data on object storage.&lt;/p&gt;

&lt;p&gt;Thus, it is important to make it easy, and yet still efficient, for Rust programs to read and write data to object stores (e.g. AWS S3, Google Cloud Storage, Azure Blob Storage). There are individual crates that implement cloud-provider-specific SDKs, such as &lt;a href="https://crates.io/crates/rusoto_s3"&gt;rusoto_s3&lt;/a&gt; or &lt;a href="https://crates.io/crates/azure_storage"&gt;azure_storage&lt;/a&gt;; however, accessing the most common feature set via the same interface is often what is needed to accelerate the development of cross-cloud analytic systems. This crate is explicitly NOT meant to replace the full-blown cloud SDKs, but instead to provide a consistent object store abstraction that is portable across the many different underlying implementations.&lt;/p&gt;

&lt;p&gt;We had exactly this requirement when we set out to develop &lt;a href="https://github.com/influxdata/influxdb"&gt;influxdb_iox&lt;/a&gt;. InfluxDB and InfluxData Cloud run on AWS, GCP, Azure, and on-prem, and we needed IOx to do so as well. We could not find an existing library that suited our needs, so the InfluxData IOx team developed one within our project.&lt;/p&gt;

&lt;p&gt;This effort was originally implemented by Rust Ecosystem Legend Carol (Nichols || Goulding) @&lt;a href="https://github.com/carols10cents"&gt;carols10cents&lt;/a&gt; (co-author of &lt;a href="https://doc.rust-lang.org/stable/book/"&gt;the Rust Book&lt;/a&gt;) and heavily extended by &lt;a href="mailto:mneumann@influxdata.com"&gt;Marco Neumann&lt;/a&gt; and &lt;a href="mailto:raphael@influxdata.com"&gt;Raphael Taylor-Davies&lt;/a&gt; as we crafted its integration into DataFusion.&lt;/p&gt;

&lt;p&gt;IOx uses the Rust, Apache Arrow, Apache Parquet and DataFusion projects, which we also contribute to heavily, and it was increasingly important that IOx’s object store interactions were efficient via DataFusion. As we investigated the alternatives, we hit the point where this required deeper integration with the object store.&lt;/p&gt;

&lt;p&gt;We hope that this donation further accelerates the creation of high-quality analytic systems in Rust and can’t wait to see what the community builds with it! We especially hope that the alignment with Apache Arrow will permit an elegantly integrated experience with libraries that can easily and efficiently read arrow-compatible files, such as parquet, CSV and newline-delimited JSON, natively from local or remote object storage. For applications that desire SQL or other higher level query engine capabilities, check out &lt;a href="https://github.com/apache/arrow-datafusion"&gt;Apache Arrow DataFusion&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can see more about the donation, and its rationale in &lt;a href="https://github.com/influxdata/object_store_rs/issues/41"&gt;this GitHub issue&lt;/a&gt; and &lt;a href="https://github.com/apache/arrow-rs/issues/2030"&gt;this one as well&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="whats-next"&gt;What’s next&lt;/h2&gt;

&lt;p&gt;In the near term, we plan better integration with the &lt;a href="https://docs.rs/parquet/latest/parquet/"&gt;parquet&lt;/a&gt; crate. In particular, the &lt;a href="https://docs.rs/parquet/latest/parquet/arrow/async_reader/index.html"&gt;async parquet reader&lt;/a&gt; was explicitly developed with a generic object_store crate in mind. It currently supports projection and row-group-level predicate pushdown to minimize the data fetched from object storage, and support for page- and row-level predicate pushdown is likely to land in the next release, slated for the 22nd of August 2022.&lt;/p&gt;

&lt;p&gt;We also expect to continue to improve the integration with &lt;a href="https://github.com/apache/arrow-datafusion"&gt;Apache Arrow DataFusion&lt;/a&gt;, ensuring it provides best in class support for querying data from object storage, efficiently decoupling IO from CPU-bound work, and making the most efficient use of modern multicore processors.&lt;/p&gt;

&lt;p&gt;Finally, there is an ongoing effort to move away from depending on large SDKs such as rusoto and the Azure SDK for Rust. Whilst they have served us well, moving away from them will significantly reduce the dependency burden, simplify the implementation, and further improve consistency across the various implementations.&lt;/p&gt;

&lt;h2 id="join-the-community"&gt;Join the community&lt;/h2&gt;

&lt;p&gt;We think a thriving community drives everyone forward. We encourage you to check out the &lt;a href="https://docs.rs/object_store/latest/object_store/"&gt;crate&lt;/a&gt; and lend us a hand! Try it out in your project and let us know how it goes, or find us on GitHub &lt;a href="https://github.com/apache/arrow-rs/tree/master/object_store"&gt;here&lt;/a&gt;. There is a list of good open items for newcomers &lt;a href="https://github.com/apache/arrow-rs/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22+label%3Aobject-store"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="kudos"&gt;Kudos&lt;/h2&gt;

&lt;p&gt;Thank you to &lt;a href="mailto:raphael@influxdata.com"&gt;Raphael Taylor-Davies&lt;/a&gt;, &lt;a href="mailto:paul@influxdata.com"&gt;Paul Dix&lt;/a&gt;, &lt;a href="mailto:ntran@influxdata.com"&gt;Nga Tran&lt;/a&gt;, and &lt;a href="mailto:mneumann@influxdata.com"&gt;Marco Neumann&lt;/a&gt; who reviewed early versions of this document and contributed many improvements.&lt;/p&gt;
</description>
      <pubDate>Mon, 22 Aug 2022 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/rust-object-store-donation/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/rust-object-store-donation/</guid>
      <category>Use Cases</category>
      <category>Developer</category>
      <author>Andrew Lamb, Raphael Taylor-Davies (InfluxData)</author>
    </item>
  </channel>
</rss>
