<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>InfluxData Blog - Raphael Taylor-Davies</title>
    <description>Posts by Raphael Taylor-Davies on the InfluxData Blog</description>
    <link>https://www.influxdata.com/blog/author/raphael-taylor-davies/</link>
    <language>en-us</language>
    <lastBuildDate>Tue, 01 Aug 2023 07:35:00 +0000</lastBuildDate>
    <pubDate>Tue, 01 Aug 2023 07:35:00 +0000</pubDate>
    <ttl>1800</ttl>
    <item>
      <title>Aggregating Millions of Groups Fast in Apache Arrow DataFusion</title>
      <description>&lt;h2 id="tldr"&gt;TLDR&lt;/h2&gt;

&lt;p&gt;Grouped aggregations are a core part of any analytic tool, creating understandable summaries of huge data volumes. &lt;a href="https://arrow.apache.org/datafusion/"&gt;Apache Arrow DataFusion&lt;/a&gt;’s parallel aggregation capability is 2-3x faster in version &lt;a href="https://crates.io/crates/datafusion/28.0.0"&gt;28.0.0&lt;/a&gt; for queries with a large number (10,000 or more) of groups.&lt;/p&gt;

&lt;p&gt;Improving aggregation performance matters to us as users of DataFusion. Both InfluxDB, a &lt;a href="https://github.com/influxdata/influxdb"&gt;time series data platform&lt;/a&gt;, and Coralogix, a &lt;a href="https://coralogix.com/?utm_source=InfluxDB&amp;amp;utm_medium=Blog&amp;amp;utm_campaign=organic"&gt;full-stack observability&lt;/a&gt; platform, aggregate vast amounts of raw data to monitor and create insights for our customers. Improving DataFusion’s performance lets us provide better user experiences by generating insights faster with fewer resources. Because DataFusion is open source and released under the permissive &lt;a href="https://github.com/apache/arrow-datafusion/blob/main/LICENSE.txt"&gt;Apache 2.0&lt;/a&gt; license, the whole DataFusion community benefits as well.&lt;/p&gt;

&lt;p&gt;With the new optimizations, DataFusion’s grouping speed is now close to DuckDB, a system that regularly reports &lt;a href="https://duckdblabs.github.io/db-benchmark/"&gt;great&lt;/a&gt; &lt;a href="https://duckdb.org/2022/03/07/aggregate-hashtable.html#experiments"&gt;grouping&lt;/a&gt; benchmark performance numbers. Figure 1 contains a representative sample of &lt;a href="https://github.com/ClickHouse/ClickBench/tree/main"&gt;ClickBench&lt;/a&gt; queries run against a single Parquet file, and the full results are at the end of this article.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2lBOgZwjHynFfVmveJFlum/51f93636fba965fb58c4f63ac6b72090/ClickBench-single_Parquet_file.png" alt="ClickBench-single Parquet file" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1:&lt;/strong&gt; Query performance for ClickBench queries 16, 17, 18, and 19 on a single Parquet file for DataFusion 27.0.0, DataFusion 28.0.0, and DuckDB 0.8.1.&lt;/p&gt;

&lt;h2 id="introduction-to-high-cardinality-grouping"&gt;Introduction to high cardinality grouping&lt;/h2&gt;

&lt;p&gt;Aggregation is a fancy word for computing summary statistics across many rows that have the same value in one or more columns. We call the rows with the same values &lt;em&gt;groups&lt;/em&gt; and “high cardinality” means there are a large number of distinct groups in the dataset. At the time of writing, a “large” number of groups in analytic engines is around 10,000.&lt;/p&gt;

&lt;p&gt;For example, the &lt;a href="https://github.com/ClickHouse/ClickBench"&gt;ClickBench&lt;/a&gt; &lt;em&gt;hits&lt;/em&gt; dataset contains 100 million anonymized user clicks across a set of websites. ClickBench Query 17 is:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT "UserID", "SearchPhrase", COUNT(*) 
FROM hits
GROUP BY "UserID", "SearchPhrase" 
ORDER BY COUNT(*) 
DESC LIMIT 10;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In English, this query finds “the top ten (user, search phrase) combinations, across all clicks” and produces the following results (there are no search phrases for the top ten users):&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;UserID&lt;/th&gt;
      &lt;th&gt;Search Phrase&lt;/th&gt;
      &lt;th&gt;Count (UInt8(1))&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;1313338681122956954&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;29097&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;1907779576417363396&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;25333&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;2305303682471783379&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;10597&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;7982623143712728547&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;6669&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;7280399273658728997&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;6408&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;1090981537032625727&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;6196&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;5730251990344211405&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;6019&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;6018350421959114808&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;5990&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;835157184735512989&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;5209&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;770542365400669095&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt;4906&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The ClickBench dataset contains&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;99,997,497 total rows &lt;a href="#sup-1"&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;17,630,976 different users (distinct UserIDs) &lt;a href="#sup-2"&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;6,019,103 different search phrases &lt;a href="#sup-3"&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;24,070,560 distinct combinations &lt;a href="#sup-4"&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt; of (UserID, SearchPhrase)&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, to answer the query, DataFusion must map each of the 100M different input rows into one of the &lt;strong&gt;24 million different groups&lt;/strong&gt;, and keep count of how many such rows there are in each group.&lt;/p&gt;

&lt;h2 id="the-solution"&gt;The solution&lt;/h2&gt;

&lt;p&gt;Like most concepts in databases and other analytic systems, the basic ideas of this algorithm are straightforward and taught in introductory computer science courses. You could compute the query with a program such as this &lt;a href="#sup-5"&gt;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-python"&gt;import pandas as pd
from collections import defaultdict
from operator import itemgetter

# read file
hits = pd.read_parquet('hits.parquet', engine='pyarrow')

# build groups
counts = defaultdict(int)
for index, row in hits.iterrows():
    group = (row['UserID'], row['SearchPhrase']);
    # update the dict entry for the corresponding key
    counts[group] += 1

# Print the top 10 values
print (dict(sorted(counts.items(), key=itemgetter(1), reverse=True)[:10]))&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This approach, while simple, is both slow and very memory inefficient. It requires over 40 seconds to compute the results for less than 1% of the dataset &lt;a href="#sup-6"&gt;&lt;sup&gt;6&lt;/sup&gt;&lt;/a&gt;. Both DataFusion 28.0.0 and DuckDB 0.8.1 compute results in under 10 seconds for the &lt;em&gt;entire&lt;/em&gt; dataset.&lt;/p&gt;
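&lt;p&gt;For contrast, the same query shape can be expressed with pandas’ built-in vectorized grouping, which avoids the per-row Python loop entirely. This is a sketch on a toy frame, not one of the benchmark scripts:&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for the hits table; the real query runs over ~100M rows
hits = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 2, 3],
    "SearchPhrase": ["", "", "", "cat", "cat", ""],
})

# Vectorized grouped COUNT(*), then top 10 groups by count
top = (
    hits.groupby(["UserID", "SearchPhrase"])
        .size()
        .sort_values(ascending=False)
        .head(10)
)
print(top)
```

&lt;p&gt;Even this vectorized form runs single threaded; the two-phase parallel scheme described below is what lets an engine keep every core busy.&lt;/p&gt;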

&lt;p&gt;To answer this query quickly and efficiently, you have to write your code such that it:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Keeps all cores busy aggregating via parallelized computation&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Updates aggregate values quickly, using vectorizable loops that are easy for compilers to translate into the high performance &lt;a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data"&gt;SIMD&lt;/a&gt; instructions available in modern CPUs.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The rest of this article explains how grouping works in DataFusion and the improvements we made in 28.0.0.&lt;/p&gt;

&lt;h3 id="two-phase-parallel-partitioned-grouping"&gt;Two phase parallel partitioned grouping&lt;/h3&gt;

&lt;p&gt;Both DataFusion 27 and 28 use state-of-the-art two-phase parallel hash-partitioned grouping, similar to other high-performance vectorized engines like &lt;a href="https://duckdb.org/2022/03/07/aggregate-hashtable.html"&gt;DuckDB’s Parallel Grouped Aggregates&lt;/a&gt;. In pictures this looks like:&lt;/p&gt;

&lt;p&gt;&lt;img style="padding: 20px 0px;" src="//images.ctfassets.net/o7xu9whrs0u9/5WBvyhluMQrCxVuGZXcvbX/7b355fd60c87a2fb6b50d4a2b5962b60/parallel_partitioned_grouping.png" alt="parallel partitioned grouping" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2:&lt;/strong&gt; Two phase repartitioned grouping: data flows from bottom (source) to top (results) in two phases. First (steps 1 and 2), each core reads the data into a core-specific hash table, computing intermediate aggregates without any cross-core coordination. Then (steps 3 and 4) DataFusion divides (“repartitions”) the data into distinct subsets by group value, and sends each subset to a specific core, which computes the final aggregate.&lt;/p&gt;

&lt;p&gt;The two phases are critical for keeping cores busy in a multi-core system. Both phases use the same hash table approach (explained in the next section), but differ in how the groups are distributed and the partial results emitted from the accumulators. The first phase aggregates data as soon as possible after it is produced. However, as shown in Figure 2, the groups can be anywhere in any input, so the same group is often found on many different cores. The second phase uses a hash function to redistribute data evenly across the cores, so each group value is processed by exactly one core which emits the final results for that group.&lt;/p&gt;
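&lt;p&gt;The two phases can be sketched in miniature. The following hypothetical Python models each core’s input stream as a list and a grouped &lt;code class="language-python"&gt;COUNT(*)&lt;/code&gt; as the aggregate:&lt;/p&gt;

```python
from collections import defaultdict

def two_phase_count(streams, n_cores=2):
    """Toy two-phase grouped COUNT(*) over several input streams."""
    # Phase 1: each "core" aggregates its own stream into a local hash
    # table, with no cross-core coordination.
    partials = []
    for stream in streams:
        local = defaultdict(int)
        for group in stream:
            local[group] += 1
        partials.append(local)

    # Repartition: route each partial (group, count) to the core that owns
    # hash(group) % n_cores, so each group lands on exactly one core.
    shuffled = [defaultdict(int) for _ in range(n_cores)]
    for local in partials:
        for group, count in local.items():
            shuffled[hash(group) % n_cores][group] += count

    # Phase 2: each core merges the partial counts it owns and emits
    # final results for its groups.
    final = {}
    for bucket in shuffled:
        final.update(bucket)
    return final

counts = two_phase_count([["a", "b", "a"], ["b", "c"]])
print(counts)
```

&lt;p&gt;The key property is the repartition step: because routing depends only on the group’s hash, every occurrence of a group lands on the same core in phase two.&lt;/p&gt;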

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/g8yqMdk62nuJ4sRBgO4z5/b2e515fe0dbc065a9df281cc354eb545/Core-A-Core-B.png" alt="Core-A-Core-B" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3:&lt;/strong&gt; Group value distribution across 2 cores during aggregation phases. In the first phase, every group value &lt;code class="language-python"&gt;1, 2, 3, 4&lt;/code&gt; is present in the input stream processed by each core. In the second phase, after repartitioning, the group values &lt;code class="language-python"&gt;1&lt;/code&gt; and &lt;code class="language-python"&gt;2&lt;/code&gt; are processed only by core A, and values &lt;code class="language-python"&gt;3&lt;/code&gt; and &lt;code class="language-python"&gt;4&lt;/code&gt; are processed only by core B.&lt;/p&gt;

&lt;p&gt;There are some additional subtleties in the &lt;a href="https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/physical_plan/aggregates/row_hash.rs"&gt;DataFusion implementation&lt;/a&gt; not mentioned above due to space constraints, such as:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The policy of when to emit data from the first phase’s hash table (e.g. because the data is partially sorted)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Handling specific filters per aggregate (due to the &lt;code class="language-python"&gt;FILTER&lt;/code&gt; SQL clause)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Data types of intermediate values (which may not be the same as the final output for some aggregates such as AVG).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Action taken when memory use exceeds its budget.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id="hash-grouping"&gt;Hash grouping&lt;/h3&gt;

&lt;p&gt;DataFusion queries can compute many different aggregate functions for each group, both &lt;a href="https://arrow.apache.org/datafusion/user-guide/sql/aggregate_functions.html"&gt;built in&lt;/a&gt; and/or user defined &lt;a href="https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.AggregateUDF.html"&gt;&lt;code class="language-python"&gt;AggregateUDFs&lt;/code&gt;&lt;/a&gt;. The state for each aggregate function, called an &lt;em&gt;accumulator&lt;/em&gt;, is tracked with a hash table (DataFusion uses the excellent &lt;a href="https://docs.rs/hashbrown/latest/hashbrown/index.html"&gt;hashbrown&lt;/a&gt; &lt;a href="https://docs.rs/hashbrown/latest/hashbrown/raw/struct.RawTable.html"&gt;RawTable API&lt;/a&gt;), which logically stores the “index” identifying the specific group value.&lt;/p&gt;

&lt;h3 id="hash-grouping-in-2700"&gt;Hash grouping in 27.0.0&lt;/h3&gt;

&lt;p&gt;As shown in Figure 4, DataFusion 27.0.0 stores the data in a &lt;a href="https://github.com/apache/arrow-datafusion/blob/4d93b6a3802151865b68967bdc4c7d7ef425b49a/datafusion/core/src/physical_plan/aggregates/utils.rs#L38-L50"&gt;&lt;code class="language-python"&gt;GroupState&lt;/code&gt;&lt;/a&gt; structure which, unsurprisingly, tracks the state for each group. The state for each group consists of:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The actual value of the group columns, in &lt;a href="https://docs.rs/arrow-row/latest/arrow_row/index.html"&gt;Arrow Row&lt;/a&gt; format.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In-progress accumulations (e.g. the running counts for the &lt;code class="language-python"&gt;COUNT&lt;/code&gt; aggregate) for each group, in one of two possible formats &lt;a href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/expr/src/accumulator.rs#L24-L49"&gt;&lt;code class="language-python"&gt;Accumulator&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/aggregate/row_accumulator.rs#L26-L46"&gt;&lt;code class="language-python"&gt;RowAccumulator&lt;/code&gt;&lt;/a&gt;).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Scratch space for tracking which rows match each aggregate in each batch.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2xIILntP7GQkTIRmL45tqZ/4b979836303cf232abfa1eb057c14659/Hash_grouping_in_27-0-0.png" alt="Hash grouping in 27-0-0" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 4:&lt;/strong&gt; Hash group operator structure in DataFusion 27.0.0. A hash table maps each group to a GroupState which contains all the per-group states.&lt;/p&gt;

&lt;p&gt;To compute the aggregate, DataFusion performs the following steps for each input batch:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Calculate hash using &lt;a href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/hash_utils.rs#L264-L307"&gt;efficient vectorized code&lt;/a&gt;, specialized for each data type.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Determine group indexes for each input row using the hash table (creating new entries for newly seen groups).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href="https://github.com/apache/arrow-datafusion/blob/4ab8be57dee3bfa72dd105fbd7b8901b873a4878/datafusion/core/src/physical_plan/aggregates/row_hash.rs#L562-L602"&gt;Update&lt;/a&gt; &lt;code class="language-python"&gt;Accumulator&lt;/code&gt;s &lt;a href="https://github.com/apache/arrow-datafusion/blob/4ab8be57dee3bfa72dd105fbd7b8901b873a4878/datafusion/core/src/physical_plan/aggregates/row_hash.rs#L562-L602"&gt;for each group that had input rows,&lt;/a&gt; assembling the rows into a contiguous range for vectorized accumulator if there are a sufficient number of them.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;DataFusion also stores the hash values in the table to avoid potentially costly hash recomputation when resizing the hash table.&lt;/p&gt;
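&lt;p&gt;The three per-batch steps can be illustrated with a hypothetical Python sketch, where a plain dict stands in for the &lt;code class="language-python"&gt;RawTable&lt;/code&gt; and a single &lt;code class="language-python"&gt;COUNT&lt;/code&gt; accumulator stands in for the accumulator set:&lt;/p&gt;

```python
class HashGrouper:
    """Toy per-batch flow: hash a key, map it to a group index, accumulate."""

    def __init__(self):
        self.group_index = {}   # group key -&gt; dense group index
        self.counts = []        # COUNT accumulator state, one slot per group

    def process_batch(self, keys):
        # Steps 1+2: look up (or create) a dense index for each row's group.
        # Hashing happens inside the dict lookup.
        indexes = []
        for key in keys:
            idx = self.group_index.setdefault(key, len(self.group_index))
            if idx == len(self.counts):
                self.counts.append(0)   # newly seen group: new accumulator slot
            indexes.append(idx)

        # Step 3: update accumulators using the group indexes.
        for idx in indexes:
            self.counts[idx] += 1

g = HashGrouper()
g.process_batch([("u1", ""), ("u2", "cat"), ("u1", "")])
g.process_batch([("u1", "")])
print(g.counts)
```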

&lt;p&gt;This scheme works very well for a relatively small number of distinct groups: all accumulators are efficiently updated with large contiguous batches of rows.&lt;/p&gt;

&lt;p&gt;However, this scheme is not ideal for high cardinality grouping due to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Multiple allocations per group&lt;/strong&gt; for the group value row format, as well as for the &lt;code class="language-python"&gt;RowAccumulator&lt;/code&gt;s and each &lt;code class="language-python"&gt;Accumulator&lt;/code&gt;. The &lt;code class="language-python"&gt;Accumulator&lt;/code&gt; may have additional allocations within it as well.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Non-vectorized updates:&lt;/strong&gt; Accumulator updates often fall back to a slower non-vectorized form because the number of distinct groups is large (and thus number of values per group is small) in each input batch.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id="hash-grouping-in-2800"&gt;Hash grouping in 28.0.0&lt;/h3&gt;

&lt;p&gt;For 28.0.0, we rewrote the core group-by implementation following traditional system optimization principles: fewer allocations, type specialization, and aggressive vectorization.&lt;/p&gt;

&lt;p&gt;DataFusion 28.0.0 uses the same RawTable and still stores group indexes. The major differences, as shown in Figure 5, are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Group values are stored either&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;Inline in the &lt;code class="language-python"&gt;RawTable&lt;/code&gt; (for single columns of primitive types), where the conversion to Row format costs more than its benefit&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;In a separate &lt;a href="https://docs.rs/arrow-row/latest/arrow_row/struct.Row.html"&gt;Rows&lt;/a&gt; structure with a single contiguous allocation for all group values, rather than an allocation per group. Accumulators manage the state for all groups internally, so the code that updates intermediate values is a tight, type-specialized loop. The new &lt;a href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/aggregate/groups_accumulator/mod.rs#L66-L75"&gt;&lt;code class="language-python"&gt;GroupsAccumulator&lt;/code&gt;&lt;/a&gt; interface makes these highly efficient update loops possible.&lt;/p&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img style="padding: 20px;" src="//images.ctfassets.net/o7xu9whrs0u9/EGybGfabKW1L5WTQpwOez/4a3347b6fb6fcff103a11e2478f925a3/Group_State.png" alt="Group State" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 5:&lt;/strong&gt; Hash group operator structure in DataFusion 28.0.0. Group values are stored either directly in the hash table, or in a single allocation using the arrow Row format. The hash table contains group indexes. A single &lt;code class="language-python"&gt;GroupsAccumulator&lt;/code&gt; stores the per-aggregate state for all groups.&lt;/p&gt;

&lt;p&gt;This new structure improves performance significantly for high cardinality groups due to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Reduced allocations:&lt;/strong&gt; There are no longer any individual allocations per group.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Contiguous native accumulator states:&lt;/strong&gt; Type-specialized accumulators store the values for all groups in a single contiguous allocation using a &lt;a href="https://doc.rust-lang.org/std/vec/struct.Vec.html"&gt;Rust Vec&amp;lt;T&amp;gt;&lt;/a&gt; of some native type.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Vectorized state update:&lt;/strong&gt; The inner aggregate update loops, which are type-specialized and in terms of native Vecs, are well-vectorized by the Rust compiler (thanks &lt;a href="https://llvm.org/"&gt;LLVM&lt;/a&gt;!).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
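&lt;p&gt;A hypothetical &lt;code class="language-python"&gt;COUNT&lt;/code&gt; accumulator in the spirit of &lt;code class="language-python"&gt;GroupsAccumulator&lt;/code&gt; illustrates all three points: per-group state for every group lives in one contiguous array, and each batch is applied with a single vectorized scatter update (numpy standing in for the Rust &lt;code class="language-python"&gt;Vec&amp;lt;T&amp;gt;&lt;/code&gt; loops):&lt;/p&gt;

```python
import numpy as np

class CountGroupsAccumulator:
    """All groups' counts live in one contiguous array, not one box per group."""

    def __init__(self):
        self.counts = np.zeros(0, dtype=np.int64)

    def update_batch(self, group_indices):
        # Grow the single allocation to cover any newly seen group indexes
        grow = max(group_indices.max() + 1 - len(self.counts), 0)
        self.counts = np.concatenate([self.counts, np.zeros(grow, dtype=np.int64)])
        # One vectorized scatter-add per batch, no per-group allocations
        np.add.at(self.counts, group_indices, 1)

acc = CountGroupsAccumulator()
acc.update_batch(np.array([0, 1, 0, 2]))
acc.update_batch(np.array([2, 2]))
print(acc.counts)  # [2 1 3]
```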

&lt;h3 id="notes"&gt;Notes&lt;/h3&gt;

&lt;p&gt;Some vectorized grouping implementations store the accumulator state row-wise directly in the hash table, which often uses modern CPU caches efficiently. Managing accumulator state in columnar fashion may sacrifice some cache locality; however, it ensures the size of the hash table remains small even when there are large numbers of groups and aggregates, making it easier for the compiler to vectorize the accumulator update.&lt;/p&gt;

&lt;p&gt;Depending on the cost of recomputing hash values, DataFusion 28.0.0 may or may not store the hash values in the table. This optimizes the tradeoff between the cost of computing the hash value (which is expensive for strings, for example) vs. the cost of storing it in the hash table.&lt;/p&gt;

&lt;p&gt;One subtlety that arises from pushing state updates into GroupsAccumulators is that each accumulator must handle similar variations with/without filtering and with/without nulls in the input. DataFusion 28.0.0 uses a templated &lt;a href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/aggregate/groups_accumulator/accumulate.rs#L28-L54"&gt;&lt;code class="language-python"&gt;NullState&lt;/code&gt;&lt;/a&gt; which encapsulates these common patterns across accumulators.&lt;/p&gt;
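&lt;p&gt;The common pattern that &lt;code class="language-python"&gt;NullState&lt;/code&gt; factors out can be sketched as a single masked update shared by accumulators. This hypothetical Python uses numpy boolean arrays in place of Arrow validity and filter bitmaps:&lt;/p&gt;

```python
import numpy as np

def accumulate_counts(counts, group_indices, valid, filter_mask=None):
    """Update per-group counts, skipping NULL inputs and filtered-out rows."""
    if filter_mask is None:
        mask = valid
    else:
        mask = np.logical_and(valid, filter_mask)
    # Only rows that are non-null AND pass the FILTER clause contribute
    np.add.at(counts, group_indices[mask], 1)
    return counts

counts = np.zeros(3, dtype=np.int64)
groups = np.array([0, 1, 2, 0])
valid = np.array([True, True, False, True])   # row 2 is NULL
filt = np.array([True, False, True, True])    # FILTER clause drops row 1
accumulate_counts(counts, groups, valid, filt)
print(counts)  # [2 0 0]
```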

&lt;p&gt;The code structure is heavily influenced by the fact DataFusion is implemented using &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, a new(ish) systems programming language focused on speed and safety. Rust heavily discourages many of the traditional pointer casting “tricks” used in C/C++ hash grouping implementations. The DataFusion aggregation code is almost entirely &lt;a href="https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html#:~:text=Safe%20Rust%20is%20the%20true,Undefined%20Behavior%20(a.k.a.%20UB)."&gt;&lt;code class="language-python"&gt;safe&lt;/code&gt;&lt;/a&gt;, deviating into &lt;code class="language-python"&gt;unsafe&lt;/code&gt; only when necessary. (Rust is a great choice because it makes DataFusion fast, easy to embed, and prevents many crashes and security issues often associated with multi-threaded C/C++ code).&lt;/p&gt;

&lt;h2 id="clickbench-results"&gt;ClickBench results&lt;/h2&gt;

&lt;p&gt;The full results of running the &lt;a href="https://github.com/ClickHouse/ClickBench/tree/main"&gt;ClickBench&lt;/a&gt; queries against the single Parquet file with DataFusion 27.0.0, DataFusion 28.0.0, and DuckDB 0.8.1 are below. These numbers were run on a GCP &lt;code class="language-python"&gt;e2-standard-8&lt;/code&gt; machine with 8 cores and 32 GB of RAM, using the scripts &lt;a href="https://github.com/alamb/datafusion-duckdb-benchmark"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As the industry moves towards data systems assembled from components, it is increasingly important that they exchange data using open standards such as &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt; and &lt;a href="https://parquet.apache.org/"&gt;Parquet&lt;/a&gt; rather than custom storage and in-memory formats. Thus, this benchmark uses a single input Parquet file, a setup representative of many DataFusion deployments and aligned with the current trend in analytics of avoiding a costly load/transformation into a custom storage format prior to query.&lt;/p&gt;

&lt;p&gt;DataFusion now reaches near-DuckDB-speeds querying Parquet data. While we don’t plan to engage in a benchmarking shootout with a team that literally wrote &lt;a href="https://dl.acm.org/doi/abs/10.1145/3209950.3209955"&gt;Fair Benchmarking Considered Difficult&lt;/a&gt;, hopefully everyone can agree that DataFusion 28.0.0 is a significant improvement.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/TiYSwJknkrb7Si1O2lHoe/5f8f044dd6db8ca9d20549dac95c82e3/ClickBench_-_all_queries.png" alt="ClickBench - all queries" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 6:&lt;/strong&gt; Performance of DataFusion 27.0.0, DataFusion 28.0.0, and DuckDB 0.8.1 on all 43 ClickBench queries against a single &lt;code class="language-python"&gt;hits.parquet&lt;/code&gt; file. Lower is better.&lt;/p&gt;

&lt;h3 id="notes-1"&gt;Notes&lt;/h3&gt;

&lt;p&gt;DataFusion 27.0.0 was not able to run several queries due to either planner bugs (Q9, Q11, Q12, Q14) or running out of memory (Q33). DataFusion 28.0.0 solves those issues.&lt;/p&gt;

&lt;p&gt;DataFusion is faster than DuckDB on queries 21 and 22, likely due to optimized implementations of string pattern matching.&lt;/p&gt;

&lt;h2 id="conclusion-performance-matters"&gt;Conclusion: performance matters&lt;/h2&gt;

&lt;p&gt;Improving aggregation performance by more than a factor of two allows developers building products and projects with DataFusion to spend more time on value-added domain specific features. We believe building systems with DataFusion is much faster than trying to build something similar from scratch. DataFusion increases productivity because it eliminates the need to rebuild well-understood, but costly to implement, analytic database technology. While we’re pleased with the improvements in DataFusion 28.0.0, we are by no means done and are pursuing &lt;a href="https://github.com/apache/arrow-datafusion/issues/7000"&gt;(Even More) Aggregation Performance&lt;/a&gt;. The future for performance is bright.&lt;/p&gt;

&lt;h2 id="acknowledgments"&gt;Acknowledgments&lt;/h2&gt;

&lt;p&gt;DataFusion is a &lt;a href="https://arrow.apache.org/datafusion/contributor-guide/communication.html"&gt;community effort&lt;/a&gt; and this work would not have been possible without contributions from many in the community. A special shout out to &lt;a href="https://github.com/sunchao"&gt;sunchao&lt;/a&gt;, &lt;a href="https://github.com/yjshen"&gt;yjshen&lt;/a&gt;, &lt;a href="https://github.com/yahoNanJing"&gt;yahoNanJing&lt;/a&gt;, &lt;a href="https://github.com/mingmwang"&gt;mingmwang&lt;/a&gt;, &lt;a href="https://github.com/ozankabak"&gt;ozankabak&lt;/a&gt;, &lt;a href="https://github.com/mustafasrepo"&gt;mustafasrepo&lt;/a&gt;, and everyone else who contributed ideas, reviews, and encouragement &lt;a href="https://github.com/apache/arrow-datafusion/pull/6800"&gt;during&lt;/a&gt; this &lt;a href="https://github.com/apache/arrow-datafusion/pull/6904"&gt;work&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="about-datafusion"&gt;About DataFusion&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arrow.apache.org/datafusion/"&gt;Apache Arrow DataFusion&lt;/a&gt; is an extensible query engine and database toolkit, written in &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;, that uses &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt; as its in-memory format. DataFusion, along with &lt;a href="https://calcite.apache.org/"&gt;Apache Calcite&lt;/a&gt;, Facebook’s &lt;a href="https://github.com/facebookincubator/velox"&gt;Velox&lt;/a&gt;, and similar technology are part of the next generation “&lt;a href="https://www.usenix.org/publications/login/winter2018/khurana"&gt;Deconstructed Database&lt;/a&gt;” architectures, where new systems are built on a foundation of fast, modular components, rather than as a single tightly integrated system.&lt;/p&gt;

&lt;hr /&gt;

&lt;div id="sup-1"&gt;
&lt;sup&gt;1&lt;/sup&gt; SELECT COUNT(*) FROM &lt;code&gt;'hits.parquet'&lt;/code&gt;;&lt;/div&gt;

&lt;div id="sup-2"&gt;
&lt;sup&gt;2&lt;/sup&gt; SELECT COUNT(DISTINCT "UserID") as num_users FROM &lt;code&gt;'hits.parquet'&lt;/code&gt;;  &lt;/div&gt;

&lt;div id="sup-3"&gt;
&lt;sup&gt;3&lt;/sup&gt; SELECT COUNT(DISTINCT "SearchPhrase") as &lt;code&gt;num_phrases&lt;/code&gt; FROM &lt;code&gt;'hits.parquet'&lt;/code&gt;;  &lt;/div&gt;

&lt;div id="sup-4"&gt;
&lt;sup&gt;4&lt;/sup&gt; SELECT COUNT(*) FROM (SELECT DISTINCT "UserID", "SearchPhrase" FROM &lt;code&gt;'hits.parquet'&lt;/code&gt;) &lt;/div&gt;

&lt;div id="sup-5"&gt;
&lt;sup&gt;5&lt;/sup&gt; Full script at &lt;a href="https://github.com/alamb/datafusion-duckdb-benchmark/blob/main/hash.py" target="_blank"&gt;hash.py&lt;/a&gt; &lt;/div&gt;

&lt;div id="sup-6"&gt;
&lt;sup&gt;6&lt;/sup&gt; &lt;a href="https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_%7B%7D.parquet" target="_blank"&gt; hits_0.parquet&lt;/a&gt;, one of the files from the partitioned ClickBench dataset, which has 100,000 rows and is 117 MB in size. The entire dataset has 100,000,000 rows in a single 14 GB Parquet file. The script did not complete on the entire dataset after 40 minutes, and used 212GB RAM at peak.
 &lt;/div&gt;
</description>
      <pubDate>Tue, 01 Aug 2023 07:35:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion/</guid>
      <category>Developer</category>
      <author>Andrew Lamb, Raphael Taylor-Davies, Daniël Heres (InfluxData)</author>
    </item>
    <item>
      <title>Querying Parquet with Millisecond Latency</title>
      <description>&lt;p&gt;We believe that querying data in &lt;a href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; files directly can achieve similar or better storage efficiency and query performance than most specialized file formats. While it requires significant engineering effort, the benefits of Parquet’s open format and broad ecosystem support make it the obvious choice for a wide class of data systems.&lt;/p&gt;

&lt;p&gt;In this article we explain several advanced techniques needed to query data stored in the Parquet format quickly that we implemented in the &lt;a href="https://docs.rs/parquet/27.0.0/parquet/"&gt;Apache Arrow Rust Parquet reader&lt;/a&gt;. Together these techniques make the Rust implementation one of the fastest, if not the fastest, implementations for querying Parquet files — be it on local disk or remote object storage. It is able to query GBs of Parquet in a &lt;a href="https://github.com/tustvold/access-log-bench"&gt;matter of milliseconds&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="background"&gt;Background&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; is an increasingly popular open format for storing &lt;a href="https://www.influxdata.com/glossary/olap/"&gt;analytic datasets&lt;/a&gt;, and has become the de-facto standard for cost-effective, DBMS-agnostic data storage. Initially created for the Hadoop ecosystem, Parquet’s reach now expands broadly across the data analytics ecosystem due to its compelling combination of:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;High compression ratios&lt;/li&gt;
  &lt;li&gt;Amenability to commodity blob-storage such as S3&lt;/li&gt;
  &lt;li&gt;Broad ecosystem and tooling support&lt;/li&gt;
  &lt;li&gt;Portability across many different platforms and tools&lt;/li&gt;
  &lt;li&gt;Support for &lt;a href="https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/"&gt;arbitrarily structured data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Increasingly, other systems, such as &lt;a href="https://duckdb.org/2021/06/25/querying-parquet.html"&gt;DuckDB&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html#c-spectrum-overview"&gt;Redshift&lt;/a&gt; allow querying data stored in Parquet directly, but support is still often a secondary consideration compared to their native (custom) file formats. Such formats include the DuckDB &lt;code class="language-markup"&gt;.duckdb&lt;/code&gt; file format, the Apache IoTDB &lt;a href="https://github.com/apache/iotdb/blob/master/tsfile/README.md"&gt;TsFile&lt;/a&gt;, the &lt;a href="https://www.vldb.org/pvldb/vol8/p1816-teller.pdf"&gt;Gorilla format&lt;/a&gt;, and others.&lt;/p&gt;

&lt;p&gt;For the first time, access to the same sophisticated query techniques, previously only available in closed source commercial implementations, is now available as open source. The required engineering capacity comes from large, well-run open source projects with global contributor communities, such as &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt; and &lt;a href="https://impala.apache.org/"&gt;Apache Impala&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="parquet-file-format"&gt;Parquet file format&lt;/h2&gt;

&lt;p&gt;Before diving into the details of efficiently reading from &lt;a href="https://www.influxdata.com/glossary/apache-parquet/"&gt;Parquet&lt;/a&gt;, it is important to understand the file layout. The file format is carefully designed to quickly locate the desired information, skip irrelevant portions, and decode what remains efficiently.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The data in a Parquet file is broken into horizontal slices called RowGroups&lt;/li&gt;
  &lt;li&gt;Each RowGroup contains a single ColumnChunk for each column in the schema&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, the following diagram illustrates a Parquet file with three columns “A”, “B” and “C” stored in two RowGroups for a total of 6 ColumnChunks.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/6LB0vApMrjmHuYoJpl1PhV/9f52582c5adad182853a36f9013ba365/Parquet_File_Format_Diagram_12.05.2022v1.png" alt="Parquet File Format Diagram 12.05.2022v1" /&gt;&lt;/p&gt;

&lt;p&gt;The logical values for a ColumnChunk are written using one of the many &lt;a href="https://parquet.apache.org/docs/file-format/data-pages/encodings/"&gt;available encodings&lt;/a&gt; into one or more Data Pages appended sequentially in the file. At the end of a Parquet file is a footer, which contains important metadata, such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The file’s schema information such as column names and types&lt;/li&gt;
  &lt;li&gt;The locations of the RowGroups and ColumnChunks in the file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The footer may also contain other specialized data structures:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Optional statistics for each ColumnChunk including min/max values and null counts&lt;/li&gt;
  &lt;li&gt;Optional pointers to &lt;a href="https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L926-L932"&gt;OffsetIndexes&lt;/a&gt; containing the location of each individual Page&lt;/li&gt;
  &lt;li&gt;Optional pointers to &lt;a href="https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L938"&gt;ColumnIndex&lt;/a&gt; containing summary statistics, such as min/max values, for each Page&lt;/li&gt;
  &lt;li&gt;Optional pointers to &lt;a href="https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L621-L630"&gt;BloomFilterData&lt;/a&gt;, which can quickly check if a value is present in a ColumnChunk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, the logical structure of the two RowGroups and six ColumnChunks in the previous diagram might be stored in a Parquet file as shown in the following diagram (not to scale). The pages for the ColumnChunks come first, followed by the footer. The data, the effectiveness of the encoding scheme, and the settings of the Parquet encoder determine the number and size of the pages needed for each ColumnChunk. In this case, ColumnChunk 1 required two pages while ColumnChunk 6 required only one page. In addition to other information, the footer contains the locations of each Data Page and the types of the columns.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/4zaQPMjZtgqEennJJuK6hX/3c6f41d792ec0560e7134646376a2489/Parquet_File_Format_Diagram_2_12.05.2022v1.png" alt="Parquet File Format Diagram 2 12.05.2022v1" /&gt;&lt;/p&gt;

&lt;p&gt;There are many important criteria to consider when creating Parquet files such as how to optimally order/cluster data and structure it into RowGroups and Data Pages. Such “physical design” considerations are complex, worthy of their own series of articles, and not addressed in this blog post. Instead, we focus on how to use the available structure to make queries very fast.&lt;/p&gt;

&lt;h2 id="optimizing-queries"&gt;Optimizing queries&lt;/h2&gt;

&lt;p&gt;In any query processing system, the following techniques generally improve performance:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Reduce the data that must be transferred from secondary storage for processing (reduce I/O)&lt;/li&gt;
  &lt;li&gt;Reduce the computational load for decoding the data (reduce CPU)&lt;/li&gt;
  &lt;li&gt;Interleave/pipeline the reading and decoding of the data (improve parallelism)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The same principles apply to querying Parquet files, as we describe below:&lt;/p&gt;

&lt;h2 id="decode-optimization"&gt;Decode optimization&lt;/h2&gt;

&lt;p&gt;Parquet achieves impressive compression ratios by using &lt;a href="https://parquet.apache.org/docs/file-format/data-pages/encodings/"&gt;sophisticated encoding techniques&lt;/a&gt; such as run-length encoding, dictionary encoding, delta encoding, and others. Consequently, the CPU-bound task of decoding can dominate query latency. Parquet readers can use a number of techniques to improve the latency and throughput of this task, as we have done in the Rust implementation.&lt;/p&gt;

&lt;h3 id="vectorized-decode"&gt;Vectorized decode&lt;/h3&gt;

&lt;p&gt;Most analytic systems decode multiple values at a time to a columnar memory format, such as Apache Arrow, rather than processing data row-by-row. This is often called vectorized or columnar processing, and is beneficial because it:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Amortizes dispatch overheads to switch on the type of column being decoded&lt;/li&gt;
  &lt;li&gt;Improves cache locality by reading consecutive values from a ColumnChunk&lt;/li&gt;
  &lt;li&gt;Often allows multiple values to be decoded in a single instruction&lt;/li&gt;
  &lt;li&gt;Avoids many small heap allocations by making a single large allocation, yielding significant savings for variable length types such as strings and byte arrays&lt;/li&gt;
&lt;/ul&gt;
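&lt;p&gt;As a minimal illustration of the batching benefit (a toy sketch only, not the actual arrow-rs decoder or the real Parquet RLE format), consider decoding a run-length encoded column into a single pre-allocated buffer:&lt;/p&gt;

```rust
/// Simplified run-length encoded column: (value, run_length) pairs.
/// Illustrative only; real Parquet RLE is a hybrid bit-packed format.
fn decode_rle(runs: &[(i64, usize)]) -> Vec<i64> {
    // Size the output once, avoiding a heap allocation per value
    let total: usize = runs.iter().map(|(_, len)| len).sum();
    let mut out = Vec::with_capacity(total);
    for &(value, len) in runs {
        // One branch per run, not per value: dispatch cost is amortized
        out.extend(std::iter::repeat(value).take(len));
    }
    out
}

fn main() {
    // Two runs decode into five contiguous values
    assert_eq!(decode_rle(&[(7, 3), (9, 2)]), vec![7, 7, 7, 9, 9]);
}
```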

&lt;p&gt;Thus, the Rust Parquet reader implements specialized decoders for reading Parquet directly into a &lt;a href="https://www.influxdata.com/glossary/column-database/"&gt;columnar&lt;/a&gt; memory format (Arrow Arrays).&lt;/p&gt;

&lt;h3 id="streaming-decode"&gt;Streaming decode&lt;/h3&gt;

&lt;p&gt;There is no relationship between which rows are stored in which Pages across ColumnChunks. For example, the logical values for the 10,000th row may be in the first page of column A and in the third page of column B.&lt;/p&gt;

&lt;p&gt;The simplest approach to vectorized decoding, and the one often initially implemented in Parquet decoders, is to decode an entire RowGroup (or ColumnChunk) at a time.&lt;/p&gt;

&lt;p&gt;However, given Parquet’s high compression ratios, a single RowGroup may well contain millions of rows. Decoding so many rows at once is non-optimal because it:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Requires large amounts of intermediate RAM:&lt;/strong&gt; typical in-memory formats optimized for processing, such as Apache Arrow, require much more memory than their Parquet-encoded form.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Increases query latency:&lt;/strong&gt; Subsequent processing steps (like filtering or aggregation) can only begin once the entire RowGroup (or ColumnChunk) is decoded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As such, the best Parquet readers support “streaming” data out by producing configurably sized batches of rows on demand. The batch size must be large enough to amortize decode overhead, but small enough for efficient memory usage and to allow downstream processing to begin concurrently while the subsequent batch is decoded.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2qCu93BPhnyTqeNGzVq1Q5/878e133452698c4341817fc581a37c96/Parquet_File_Streaming_Decode_Diagram_12.05.2022v1.png" alt="Parquet File Streaming Decode Diagram 12.05.2022v1" /&gt;&lt;/p&gt;
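&lt;p&gt;The streaming interface can be sketched as an iterator that yields bounded batches on demand. The &lt;code class="language-rust"&gt;BatchReader&lt;/code&gt; type here is hypothetical and elides all of the real per-column page state described below:&lt;/p&gt;

```rust
/// A toy "streaming" decoder that yields rows in fixed-size batches on
/// demand, rather than materializing an entire RowGroup at once.
struct BatchReader {
    rows_remaining: usize,
    batch_size: usize,
}

impl Iterator for BatchReader {
    type Item = usize; // number of rows in the next decoded batch

    fn next(&mut self) -> Option<usize> {
        if self.rows_remaining == 0 {
            return None;
        }
        let n = self.batch_size.min(self.rows_remaining);
        self.rows_remaining -= n;
        // A real reader would decode `n` rows from buffered pages here, so
        // downstream operators can start before the RowGroup is finished
        Some(n)
    }
}

fn main() {
    // 2,500 rows streamed as batches of at most 1,024 rows
    let batches: Vec<usize> =
        BatchReader { rows_remaining: 2_500, batch_size: 1_024 }.collect();
    assert_eq!(batches, vec![1_024, 1_024, 452]);
}
```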

&lt;p&gt;While streaming is not a complicated feature to explain, the stateful nature of decoding, especially across multiple columns and &lt;a href="https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/"&gt;arbitrarily nested data&lt;/a&gt;, where the relationship between rows and values is not fixed, requires &lt;a href="https://github.com/apache/arrow-rs/blob/b7af85cb8dfe6887bb3fd43d1d76f659473b6927/parquet/src/arrow/record_reader/mod.rs"&gt;complex intermediate buffering&lt;/a&gt; and significant engineering effort to handle correctly.&lt;/p&gt;

&lt;h3 id="dictionary-preservation"&gt;Dictionary preservation&lt;/h3&gt;

&lt;p&gt;Dictionary Encoding, also called &lt;a href="https://pandas.pydata.org/docs/user_guide/categorical.html"&gt;categorical&lt;/a&gt; encoding, is a technique where each value in a column is not stored directly, but instead, an index in a separate list called a “Dictionary” is stored. This technique achieves many of the benefits of &lt;a href="https://en.wikipedia.org/wiki/Third_normal_form#:~:text=Third%20normal%20form%20(3NF)%20is,in%201971%20by%20Edgar%20F."&gt;third normal form&lt;/a&gt; for columns that have repeated values (low &lt;a href="https://www.influxdata.com/glossary/cardinality/"&gt;cardinality&lt;/a&gt;) and is especially effective for columns of strings such as “City”.&lt;/p&gt;

&lt;p&gt;The first page in a ColumnChunk can optionally be a dictionary page, containing a list of values of the column’s type. Subsequent pages within this ColumnChunk can then encode an index into this dictionary, instead of encoding the values directly.&lt;/p&gt;

&lt;p&gt;Given the effectiveness of this encoding, if a Parquet decoder simply decodes dictionary data into the native type, it will inefficiently replicate the same value over and over again, which is especially disastrous for string data. To handle dictionary-encoded data efficiently, the encoding must be preserved during decode. Conveniently, many columnar formats, such as the Arrow &lt;a href="https://docs.rs/arrow/27.0.0/arrow/array/struct.DictionaryArray.html"&gt;DictionaryArray&lt;/a&gt;, support such compatible encodings.&lt;/p&gt;

&lt;p&gt;Preserving dictionary encoding drastically improves performance when reading to an Arrow array, in some cases in excess of &lt;a href="https://github.com/apache/arrow-rs/pull/1180"&gt;60x&lt;/a&gt;, as well as using significantly less memory.&lt;/p&gt;
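&lt;p&gt;The idea can be sketched with a toy dictionary-encoded string column, loosely analogous to Arrow’s &lt;code class="language-rust"&gt;DictionaryArray&lt;/code&gt;. The types and byte accounting are illustrative, not the actual Arrow memory layout:&lt;/p&gt;

```rust
/// A dictionary-encoded string column: each row stores a small integer key
/// into a list of distinct values, rather than the string itself.
struct DictColumn {
    dictionary: Vec<String>,
    keys: Vec<u16>,
}

impl DictColumn {
    fn value(&self, row: usize) -> &str {
        &self.dictionary[self.keys[row] as usize]
    }
    /// Rough payload size if the dictionary encoding is preserved
    fn encoded_bytes(&self) -> usize {
        self.dictionary.iter().map(|s| s.len()).sum::<usize>()
            + self.keys.len() * std::mem::size_of::<u16>()
    }
    /// Rough payload size if every value is naively materialized
    fn materialized_bytes(&self) -> usize {
        self.keys.iter().map(|&k| self.dictionary[k as usize].len()).sum()
    }
}

fn main() {
    // A low-cardinality "City" column with 1,000 rows and 2 distinct values
    let col = DictColumn {
        dictionary: vec!["Boston".into(), "Berlin".into()],
        keys: (0..1000).map(|i| (i % 2) as u16).collect(),
    };
    assert_eq!(col.value(1), "Berlin");
    // Preserving the dictionary stores ~2KB of keys instead of ~6KB of strings
    assert!(col.encoded_bytes() < col.materialized_bytes());
}
```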

&lt;p&gt;The major complicating factor for preserving dictionaries is that the dictionaries are stored per ColumnChunk, and therefore the dictionary changes between RowGroups. The reader must automatically recompute a dictionary for batches that span multiple RowGroups, while also optimizing for the case that batch sizes divide evenly into the number of rows per RowGroup. Additionally, a column may be only &lt;a href="https://github.com/apache/parquet-format/blob/111dbdcf8eff2e9f8e0d4e958cecbc7e00028aca/README.md?plain=1#L194-L199"&gt;partly dictionary encoded&lt;/a&gt;, further complicating implementation. More information on this technique and its complications can be found in the &lt;a href="https://arrow.apache.org/blog/2019/09/05/faster-strings-cpp-parquet/"&gt;blog post&lt;/a&gt; on applying this technique to the C++ Parquet reader.&lt;/p&gt;

&lt;h2 id="projection-pushdown"&gt;Projection pushdown&lt;/h2&gt;

&lt;p&gt;The most basic Parquet optimization, and the one most commonly described for Parquet files, is &lt;em&gt;projection pushdown&lt;/em&gt;, which reduces both I/O and CPU requirements. Projection in this context means “selecting some but not all of the columns.” Given how Parquet organizes data, it is straightforward to read and decode only the ColumnChunks required for the referenced columns.&lt;/p&gt;

&lt;p&gt;For example, consider a SQL query of the form&lt;/p&gt;

&lt;p&gt;&lt;code class="language-sql"&gt;SELECT B from table where A &amp;gt; 35&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This query only needs data for columns A and B (and not C) and the projection can be “pushed down” to the Parquet reader.&lt;/p&gt;

&lt;p&gt;Specifically, using the information in the footer, the Parquet reader can entirely skip fetching (I/O) and decoding (CPU) the Data Pages that store data for column C (ColumnChunk 3 and ColumnChunk 6 in our example).&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/7zolMaASAFpmL7z2UHmA0e/c2f80f4cd664d598b32a48ce6fa9ffe3/Parquet_File_Projection_Pushdown_Diagram_12.05.2022v1.png" alt="Parquet File Projection Pushdown Diagram 12.05.2022v1" /&gt;&lt;/p&gt;
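&lt;p&gt;The bookkeeping can be sketched as follows, using hypothetical footer metadata types: only the byte ranges of ColumnChunks belonging to projected columns are ever fetched or decoded.&lt;/p&gt;

```rust
/// Hypothetical footer metadata: the byte range of one ColumnChunk,
/// tagged with the column it belongs to (illustrative types only).
struct ColumnChunkMeta {
    column: &'static str,
    byte_range: std::ops::Range<u64>,
}

/// Return only the byte ranges that must be read for the projected columns;
/// everything else is never fetched from storage or decoded.
fn project(chunks: &[ColumnChunkMeta], projection: &[&str]) -> Vec<std::ops::Range<u64>> {
    chunks
        .iter()
        .filter(|c| projection.contains(&c.column))
        .map(|c| c.byte_range.clone())
        .collect()
}

fn main() {
    // SELECT B FROM table WHERE A > 35 needs columns A and B, but not C
    let chunks = [
        ColumnChunkMeta { column: "A", byte_range: 0..100 },
        ColumnChunkMeta { column: "B", byte_range: 100..250 },
        ColumnChunkMeta { column: "C", byte_range: 250..400 },
    ];
    let needed = project(&chunks, &["A", "B"]);
    // Column C's ColumnChunk (bytes 250..400) is skipped entirely
    assert_eq!(needed, vec![0..100, 100..250]);
}
```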

&lt;h2 id="predicate-pushdown"&gt;Predicate pushdown&lt;/h2&gt;

&lt;p&gt;Similar to projection pushdown, &lt;strong&gt;predicate&lt;/strong&gt; pushdown also avoids fetching and decoding data from Parquet files, but does so using filter expressions. This technique typically requires closer integration with a query engine such as &lt;a href="https://arrow.apache.org/datafusion/"&gt;DataFusion&lt;/a&gt;, to determine valid predicates and evaluate them during the scan. Unfortunately without careful API design, the Parquet decoder and query engine can end up tightly coupled, preventing reuse (e.g. there are different Impala and Spark implementations in &lt;a href="https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_predicate_pushdown_parquet.html#concept_pgs_plb_mgb"&gt;Cloudera Parquet Predicate Pushdown docs&lt;/a&gt;). The Rust Parquet reader uses the &lt;a href="https://docs.rs/parquet/27.0.0/parquet/arrow/arrow_reader/struct.RowSelector.html"&gt;RowSelection&lt;/a&gt; API to avoid this coupling.&lt;/p&gt;

&lt;h3 id="rowgroup-pruning"&gt;RowGroup pruning&lt;/h3&gt;

&lt;p&gt;The simplest form of predicate pushdown, supported by many Parquet based query engines, uses the statistics stored in the footer to skip entire RowGroups. We call this operation RowGroup pruning, and it is analogous to &lt;a href="https://docs.oracle.com/database/121/VLDBG/GUID-E677C85E-C5E3-4927-B3DF-684007A7B05D.htm#VLDBG00401"&gt;partition pruning&lt;/a&gt; in many classical &lt;a href="https://www.influxdata.com/glossary/data-warehouse/"&gt;data warehouse&lt;/a&gt; systems.&lt;/p&gt;

&lt;p&gt;For the example query above, if the maximum value for A in a particular RowGroup is less than 35, the decoder can skip fetching and decoding any ColumnChunks from that &lt;strong&gt;entire&lt;/strong&gt; RowGroup.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/5W1ZN8oCkbZAMcqRAwrLJz/1d19e94d3ad8a60a08fa848eea2b4fcb/Parquet_File_RowGroup_Pruning_Diagram_12.05.2022v1.png" alt="Parquet File RowGroup Pruning Diagram 12.05.2022v1" /&gt;&lt;/p&gt;
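&lt;p&gt;RowGroup pruning can be sketched as follows (illustrative types; real footer statistics are optional and typed per column). Note that statistics can only prove absence, never presence: a RowGroup that survives pruning may still contain no matching rows.&lt;/p&gt;

```rust
/// Hypothetical min/max statistics for column "A" in one RowGroup,
/// as they might appear in the Parquet footer.
struct RowGroupStats {
    a_min: i64,
    a_max: i64,
}

/// Keep only the indexes of RowGroups whose statistics might satisfy A > 35.
fn prune(groups: &[RowGroupStats]) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        .filter(|(_, s)| s.a_max > 35) // max <= 35 proves no row can match
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let groups = [
        RowGroupStats { a_min: 0, a_max: 20 },  // max < 35: skipped entirely
        RowGroupStats { a_min: 30, a_max: 90 }, // may contain matches: kept
    ];
    assert_eq!(prune(&groups), vec![1]);
}
```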

&lt;p&gt;Note that pruning on minimum and maximum values is effective for many data layouts and column types, but not all. Specifically, it is not as effective for columns with many distinct pseudo-random values (e.g. identifiers or uuids). Thankfully for this use case, Parquet also supports per ColumnChunk &lt;a href="https://github.com/apache/parquet-format/blob/master/BloomFilter.md"&gt;Bloom Filters&lt;/a&gt;. We are actively working on &lt;a href="https://github.com/apache/arrow-rs/issues/3023"&gt;adding bloom filter&lt;/a&gt; support in Apache Rust’s implementation.&lt;/p&gt;

&lt;h3 id="page-pruning"&gt;Page pruning&lt;/h3&gt;

&lt;p&gt;A more sophisticated form of predicate pushdown uses the optional &lt;a href="https://github.com/apache/parquet-format/blob/master/PageIndex.md"&gt;page index&lt;/a&gt; in the footer metadata to rule out entire Data Pages. The decoder decodes only the corresponding rows from other columns, often skipping entire pages.&lt;/p&gt;

&lt;p&gt;This optimization is complicated by the fact that pages in different ColumnChunks often contain different numbers of rows. While the page index may identify the needed pages for one column, pruning a page from that column doesn’t immediately rule out entire pages in the other columns.&lt;/p&gt;

&lt;p&gt;Page pruning proceeds as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Uses the predicates in combination with the page index to identify pages to skip&lt;/li&gt;
  &lt;li&gt;Uses the offset index to determine what row ranges correspond to non-skipped pages&lt;/li&gt;
  &lt;li&gt;Computes the intersection of ranges across non-skipped pages, and decodes only those rows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This last point is highly non-trivial to implement, especially for nested lists where &lt;a href="https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/"&gt;a single row may correspond to multiple values&lt;/a&gt;. Fortunately, the Rust Parquet reader hides this complexity internally, and can decode arbitrary &lt;a href="https://docs.rs/parquet/27.0.0/parquet/arrow/arrow_reader/struct.RowSelection.html"&gt;RowSelections&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, to scan Columns A and B, stored in 5 Data Pages as shown in the figure below:&lt;/p&gt;

&lt;p&gt;If the predicate is A &amp;gt; 35,&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Page 1 is pruned using the page index (its maximum value is 20), leaving a RowSelection of [200-&amp;gt;onwards]&lt;/li&gt;
  &lt;li&gt;The Parquet reader skips Page 3 entirely (as its last row index is 99)&lt;/li&gt;
  &lt;li&gt;Only the relevant rows are decoded, by reading Pages 2, 4, and 5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the predicate is instead A &amp;gt; 35 AND B = “F”, the page index is even more effective:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Using A &amp;gt; 35, yields a RowSelection of [200-&amp;gt;onwards] as before&lt;/li&gt;
  &lt;li&gt;Using B = “F”, on the remaining Page 4 and Page 5 of B, yields a RowSelection of [100-244]&lt;/li&gt;
  &lt;li&gt;Intersecting the two RowSelections leaves a combined RowSelection [200-244]&lt;/li&gt;
  &lt;li&gt;Parquet reader only decodes those 50 rows from Page 2 and Page 4.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/1vgrpsarMMbWmHK7bPmbEl/c431046fe1c6e53ade36c96e068bb9dc/Parquet_File_Page_Pruning_Diagram_12.05.2022v1.png" alt="Parquet File Page Pruning Diagram 12.05.2022v1" width="600" height="auto" /&gt;&lt;/p&gt;
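&lt;p&gt;The core range-intersection step can be sketched as follows (a toy version; the parquet crate’s RowSelection handles this internally, including nested data). The open-ended [200-&amp;gt;onwards] selection is capped at an assumed 350 rows here purely for illustration:&lt;/p&gt;

```rust
/// Intersect two sorted, non-overlapping lists of inclusive row ranges,
/// as a reader must do when combining per-column page index results.
fn intersect(a: &[(u64, u64)], b: &[(u64, u64)]) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    for &(a_start, a_end) in a {
        for &(b_start, b_end) in b {
            let start = a_start.max(b_start);
            let end = a_end.min(b_end);
            if start <= end {
                // The ranges overlap; only these rows need decoding
                out.push((start, end));
            }
        }
    }
    out
}

fn main() {
    // A > 35 keeps rows [200, 350]; B = "F" keeps rows [100, 244]
    let combined = intersect(&[(200, 350)], &[(100, 244)]);
    assert_eq!(combined, vec![(200, 244)]);
}
```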

&lt;p&gt;Support for reading and writing these indexes from Arrow C++, and by extension pyarrow/pandas, is tracked in &lt;a href="https://issues.apache.org/jira/browse/PARQUET-1404"&gt;PARQUET-1404&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="late-materialization"&gt;Late materialization&lt;/h3&gt;

&lt;p&gt;The two previous forms of predicate pushdown only operated on metadata stored for RowGroups, ColumnChunks, and Data Pages prior to decoding values. However, the same techniques also extend to values of one or more columns &lt;em&gt;after&lt;/em&gt; decoding them but prior to decoding other columns, which is often called “late materialization”.&lt;/p&gt;

&lt;p&gt;This technique is especially effective when:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The predicate is very selective, i.e. filters out large numbers of rows&lt;/li&gt;
  &lt;li&gt;Each row is large, either due to wide rows (e.g. JSON blobs) or many columns&lt;/li&gt;
  &lt;li&gt;The selected data is clustered together&lt;/li&gt;
  &lt;li&gt;The columns required by the predicate are relatively inexpensive to decode, e.g. PrimitiveArray / DictionaryArray&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is additional discussion about the benefits of this technique in &lt;a href="https://issues.apache.org/jira/browse/SPARK-36527"&gt;SPARK-36527&lt;/a&gt; and &lt;a href="https://docs.cloudera.com/cdw-runtime/cloud/impala-reference/topics/impala-lazy-materialization.html"&gt;Impala&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, given the predicate A &amp;gt; 35 AND B = “F” from above, where the engine used the page index to determine that only 50 rows within the RowSelection of [200-244] could match, with late materialization the Parquet decoder:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Decodes the 50 values of Column A&lt;/li&gt;
  &lt;li&gt;Evaluates A &amp;gt; 35 on those 50 values&lt;/li&gt;
  &lt;li&gt;In this case, only 5 rows pass, resulting in the RowSelection:
    &lt;ul&gt;
      &lt;li&gt;RowSelection[205-206]&lt;/li&gt;
      &lt;li&gt;RowSelection[238-240]&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Only decodes the 5 rows for column B for those selections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/16g37IQh6ljdiA0u87sQbt/1d677280f4f0853303ce33de9db9ec5f/Parquet_File_Late_Materialization_Diagram_12.05.2022v1.png" alt="Parquet File Late Materialization Diagram 12.05.2022v1" /&gt;&lt;/p&gt;
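&lt;p&gt;The steps above can be sketched as follows, with a stand-in &lt;code class="language-rust"&gt;decode&lt;/code&gt; function representing the expensive per-column decode step (the column contents are contrived so that exactly the five example rows pass the predicate):&lt;/p&gt;

```rust
/// Stand-in for the per-column Parquet decode step: materialize only the
/// values for the selected row indexes.
fn decode(column: &[i64], selection: &[usize]) -> Vec<i64> {
    selection.iter().map(|&i| column[i]).collect()
}

fn main() {
    // Contrived columns: A is > 35 only at rows 205-206 and 238-240
    let col_a: Vec<i64> = (0..300)
        .map(|i| if (205..=206).contains(&i) || (238..=240).contains(&i) { 40 } else { 10 })
        .collect();
    let col_b: Vec<i64> = (0..300).map(|i| i * 10).collect();

    // Rows surviving the page index: [200, 244]
    let selection: Vec<usize> = (200..=244).collect();

    // 1. Decode the selected values of column A and evaluate A > 35 on them
    let a_vals = decode(&col_a, &selection);
    // 2. Refine the selection to the rows where the predicate holds
    let refined: Vec<usize> = selection
        .iter()
        .zip(&a_vals)
        .filter(|&(_, &a)| a > 35)
        .map(|(&i, _)| i)
        .collect();
    assert_eq!(refined, vec![205, 206, 238, 239, 240]);

    // 3. Only now decode column B, for the 5 refined rows alone
    let b_vals = decode(&col_b, &refined);
    assert_eq!(b_vals.len(), 5);
}
```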

&lt;p&gt;In certain cases, such as our example where B stores single character values, the cost of late materialization machinery can outweigh the savings in decoding. However, the savings can be substantial when some of the conditions listed above are fulfilled. The query engine must decide which predicates to push down and in which order to apply them for optimal results.&lt;/p&gt;

&lt;p&gt;While it is outside the scope of this document, the same technique can be applied for multiple predicates as well as predicates on multiple columns. See the &lt;a href="https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowFilter.html"&gt;RowFilter&lt;/a&gt; interface in the Parquet crate for more information, and the &lt;a href="https://github.com/apache/arrow-datafusion/blob/58b43f5c0b629be49a3efa0e37052ec51d9ba3fe/datafusion/core/src/physical_plan/file_format/parquet/row_filter.rs#L40-L70"&gt;row_filter&lt;/a&gt; implementation in DataFusion.&lt;/p&gt;

&lt;h2 id="io-pushdown"&gt;I/O pushdown&lt;/h2&gt;

&lt;p&gt;While Parquet was designed for efficient access on the &lt;a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html"&gt;HDFS distributed file system&lt;/a&gt;, it works very well with commodity blob storage systems such as AWS S3 as they have very similar characteristics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Relatively slow “random access” reads&lt;/strong&gt;: it is much more efficient to read large (MBs) sections of data in each request than issue many requests for smaller portions&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Significant latency before retrieving the first byte&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;High per-request cost&lt;/strong&gt;: Often billed per request, regardless of number of bytes read, which incentivizes fewer requests that each read a large contiguous section of data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To read optimally from such systems, a Parquet reader must:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Minimize the number of I/O requests, while also applying the various pushdown techniques to avoid fetching large amounts of unused data.&lt;/li&gt;
  &lt;li&gt;Integrate with the appropriate task scheduling mechanism to interleave I/O and processing on the data that is fetched to avoid pipeline bottlenecks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As these are substantial engineering and integration challenges, many Parquet readers still require the files to be fetched in their entirety to local storage.&lt;/p&gt;

&lt;p&gt;Fetching entire files in order to process them is not ideal for several reasons:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;High Latency:&lt;/strong&gt; Decoding cannot begin until the entire file is fetched (Parquet metadata is at the end of the file, so the decoder must see the end prior to decoding the rest)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Wasted work:&lt;/strong&gt; Fetching the entire file retrieves all the necessary data, but also potentially large amounts of unnecessary data that will be skipped after reading the footer, which unnecessarily increases cost.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Requires costly “locally attached” storage (or memory):&lt;/strong&gt; Many cloud environments do not offer computing resources with locally attached storage – they either rely on expensive network block storage such as AWS EBS or else restrict local storage to certain classes of VMs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Avoiding the need to buffer the entire file requires a sophisticated Parquet decoder, integrated with the I/O subsystem, that can initially fetch and decode the metadata followed by ranged fetches for the relevant data blocks, interleaved with the decoding of Parquet data. This optimization requires careful engineering to fetch large enough blocks of data from the object store that the per request overhead doesn’t dominate gains from reducing the bytes transferred. &lt;a href="https://issues.apache.org/jira/browse/SPARK-36529"&gt;SPARK-36529&lt;/a&gt; describes the challenges of sequential processing in more detail.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/6uRMH3yBp7B9xJ6R3BWtGC/3971a116b1ce47186705c0aab5e199b4/Parquet_File_IO_Pushdown_Diagram_12.05.2022v1.png" alt="Parquet File IO Pushdown Diagram 12.05.2022v1" /&gt;&lt;/p&gt;

&lt;p&gt;Not included in this diagram are details such as coalescing requests and ensuring the minimum request sizes needed for an actual implementation.&lt;/p&gt;
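&lt;p&gt;Request coalescing can be sketched as follows (a toy version of what such an implementation must do, not the actual object_store API): byte ranges closer together than some threshold merge into a single larger request, trading a few extra bytes transferred for far fewer high-latency, per-billed requests.&lt;/p&gt;

```rust
/// Merge nearby byte ranges into fewer, larger object store requests.
/// Ranges whose gap is at most `max_gap` bytes are fetched together.
fn coalesce(mut ranges: Vec<std::ops::Range<u64>>, max_gap: u64) -> Vec<std::ops::Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<std::ops::Range<u64>> = Vec::new();
    for r in ranges {
        // Merge into the previous request if the gap is small enough
        let merge = match out.last() {
            Some(last) => r.start <= last.end + max_gap,
            None => false,
        };
        if merge {
            let last = out.last_mut().unwrap();
            last.end = last.end.max(r.end);
        } else {
            out.push(r);
        }
    }
    out
}

fn main() {
    // Three page reads, two of them nearly adjacent
    let fetches = coalesce(vec![0..100, 120..200, 10_000..11_000], 64);
    // One request covers the first two pages; the distant one stays separate
    assert_eq!(fetches, vec![0..200, 10_000..11_000]);
}
```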

&lt;p&gt;The Rust Parquet crate provides an async Parquet reader that can efficiently read from any &lt;a href="https://docs.rs/parquet/latest/parquet/arrow/async_reader/trait.AsyncFileReader.html"&gt;AsyncFileReader&lt;/a&gt;. This reader:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Efficiently reads from any storage medium that supports range requests&lt;/li&gt;
  &lt;li&gt;Integrates with Rust’s futures ecosystem to avoid blocking threads waiting on network I/O &lt;a href="https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/"&gt;and easily can interleave CPU and network&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Requests multiple ranges simultaneously, to allow the implementation to coalesce adjacent ranges, fetch ranges in parallel, etc.&lt;/li&gt;
  &lt;li&gt;Uses the pushdown techniques described previously to eliminate fetching data where possible&lt;/li&gt;
  &lt;li&gt;Integrates easily with the Apache Arrow &lt;a href="https://docs.rs/object_store/latest/object_store/"&gt;object_store&lt;/a&gt; crate which you can read more about &lt;a href="https://www.influxdata.com/blog/rust-object-store-donation/"&gt;here&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To give a sense of what is possible, the following picture shows a timeline of fetching the footer metadata from remote files, using that metadata to determine what Data Pages to read, and then fetching data and decoding simultaneously. This process often must be done for more than one file at a time in order to match network latency, bandwidth, and available CPU.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/23iUnmSLDBw5yyM291OhRm/d912f29c1d0dbba75d1f79bfd22e1b80/Parquet_File_IO_Pushdown_Diagram_2_12.05.2022v1.png" alt="Parquet File IO Pushdown Diagram 2 12.05.2022v1" /&gt;&lt;/p&gt;

&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We hope you enjoyed reading about the Parquet file format and the various techniques used to quickly query Parquet files.&lt;/p&gt;

&lt;p&gt;We believe that the reason most open source implementations of Parquet do not have the breadth of features described in this post is that it takes a monumental effort that was previously only possible at well-financed commercial enterprises, which kept their implementations closed source.&lt;/p&gt;

&lt;p&gt;However, with the growth and quality of the Apache Arrow community, both Rust practitioners and the wider Arrow community, our ability to collaborate and build a cutting-edge open source implementation is exhilarating and immensely satisfying. The technology described in this blog is the result of the contributions of many engineers spread across companies, hobbyists, and the world in several repositories, notably &lt;a href="https://github.com/apache/arrow-datafusion"&gt;Apache Arrow DataFusion&lt;/a&gt;, &lt;a href="https://github.com/apache/arrow-rs"&gt;Apache Arrow&lt;/a&gt; and &lt;a href="https://github.com/apache/arrow-ballista"&gt;Apache Arrow Ballista.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are interested in joining the DataFusion Community, please &lt;a href="https://arrow.apache.org/datafusion/contributor-guide/communication.html"&gt;get in touch&lt;/a&gt;.&lt;/p&gt;
</description>
      <pubDate>Wed, 07 Dec 2022 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/querying-parquet-millisecond-latency/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/querying-parquet-millisecond-latency/</guid>
      <category>Product</category>
      <category>Use Cases</category>
      <author>Raphael Taylor-Davies, Andrew Lamb (InfluxData)</author>
    </item>
    <item>
      <title>Rust Object Store Donation</title>
      <description>&lt;p&gt;Today we are happy to officially announce that InfluxData has donated a &lt;a href="https://github.com/apache/arrow-rs/issues/2030"&gt;generic object store implementation to the Apache Arrow project&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Using this crate, the same code can easily interact with AWS S3, Azure Blob Storage, Google Cloud Storage, local files, memory, and more by a simple runtime configuration change.&lt;/p&gt;

&lt;p&gt;You can find the &lt;a href="https://crates.io/crates/object_store"&gt;latest release on crates.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We expect this will accelerate the pace of innovation within the Rust ecosystem. Whether you are building a cloud-agnostic service to handle user-uploaded videos, images, and documents, a high-performance analytics system, or something else that needs access to commodity object storage, this crate can help you and we can’t wait to see what people build with it.&lt;/p&gt;

&lt;h2 id="why-do-you-need-an-object-store-crate"&gt;Why do you need an object store crate?&lt;/h2&gt;

&lt;p&gt;Aside from providing bulk data storage for many cloud-based services, we believe the future of analytic systems in particular involves querying data stored on object storage.&lt;/p&gt;

&lt;p&gt;Object store is the generic term for what might be loosely described as an “infinite FTP server in the cloud” that offers almost unlimited, highly available, and durable key-value storage on demand. Alongside virtual machines and block storage, object storage is one of the key commodity services provided by all modern cloud service providers. Examples include &lt;a href="https://aws.amazon.com/s3/"&gt;S3&lt;/a&gt;, &lt;a href="https://azure.microsoft.com/en-gb/services/storage/blobs/#overview"&gt;Microsoft Azure Blob Storage&lt;/a&gt;, &lt;a href="https://cloud.google.com/storage/"&gt;Google Cloud Storage&lt;/a&gt;, &lt;a href="https://min.io/"&gt;MinIO&lt;/a&gt;, &lt;a href="https://docs.ceph.com/en/quincy/radosgw/index.html"&gt;Ceph Object Gateway&lt;/a&gt;, &lt;a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Introduction"&gt;HDFS&lt;/a&gt;, and others.&lt;/p&gt;

&lt;p&gt;To achieve this near-infinite scaling, object stores provide a subset of the functionality of traditional file systems such as &lt;a href="https://docs.microsoft.com/en-us/windows-server/storage/file-server/ntfs-overview"&gt;NTFS&lt;/a&gt; or &lt;a href="https://docs.kernel.org/admin-guide/ext4.html"&gt;ext4&lt;/a&gt;. Specifically, they identify objects with a “key” and store arbitrary bytes as a value:&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/3a6on0Fu2MNaSu2Ps49H2p/3c190cbcf671e631ad24310295ec0e1d/rust-object-store-donation-figure-1.PNG" alt="Rust Object Store Donation - Figure 1" /&gt;
&lt;strong&gt;Figure 1:&lt;/strong&gt; Object stores store arbitrary bytes identified by a string key.&lt;/p&gt;

&lt;p&gt;Unlike filesystems, object stores typically lack an explicit notion of directories, and best practice uses a restricted subset of ASCII for keys. Instead, path-like traversal is achieved using LIST operations with a prefix, and illegal character sequences are percent-encoded.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/3SEfVhuyhjwhTuaTPXY3IU/a7135851515724c94b2532b6f82e0612/rust-object-store-donation-figure-2.PNG" alt="Rust Object Store Donation - Figure 2" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2:&lt;/strong&gt; Object stores can LIST objects with a specified prefix, which can be used to group files together. In this example, asking for objects with prefix “&lt;code class="language-rust"&gt;/pictures/&lt;/code&gt;” results in all the &lt;code class="language-rust"&gt;.jpg&lt;/code&gt; objects, while asking for prefix “&lt;code class="language-rust"&gt;/parquet/&lt;/code&gt;” results in all the &lt;code class="language-rust"&gt;.parquet&lt;/code&gt; objects.&lt;/p&gt;
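&lt;p&gt;The key-value model plus prefix LIST can be sketched with a toy in-memory store (illustrative only; the object_store crate’s real API is async and returns richer object metadata):&lt;/p&gt;

```rust
use std::collections::BTreeMap;

/// A toy in-memory "object store": arbitrary bytes identified by string keys,
/// with no real directories. Path-like grouping falls out of LIST-with-prefix.
struct ObjectStore {
    objects: BTreeMap<String, Vec<u8>>,
}

impl ObjectStore {
    /// "Directory" traversal is just a prefix scan over the sorted keys
    fn list(&self, prefix: &str) -> Vec<&str> {
        self.objects
            .keys()
            .filter(|k| k.starts_with(prefix))
            .map(|k| k.as_str())
            .collect()
    }
}

fn main() {
    let mut objects = BTreeMap::new();
    objects.insert("/pictures/cat.jpg".to_string(), vec![1]);
    objects.insert("/pictures/dog.jpg".to_string(), vec![2]);
    objects.insert("/parquet/data.parquet".to_string(), vec![3]);
    let store = ObjectStore { objects };

    // Asking for the "/pictures/" prefix yields only the .jpg objects
    assert_eq!(store.list("/pictures/"), vec!["/pictures/cat.jpg", "/pictures/dog.jpg"]);
    assert_eq!(store.list("/parquet/").len(), 1);
}
```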

&lt;p&gt;Consistently listing and traversing the quasi-directory structure encoded in object keys, across object store implementations and local file systems, is a common source of frustration: not only do filesystems behave very differently from object stores, but each object store implementation has its own quirks.&lt;/p&gt;
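&lt;p&gt;To illustrate the quasi-directory model, the following stand-alone sketch (a toy model, not the object_store API) emulates directory traversal over a flat key space using prefix filtering, similar in spirit to how LIST-with-delimiter calls report “common prefixes”:&lt;/p&gt;

```rust
use std::collections::{BTreeMap, BTreeSet};

// Toy model of an object store: a flat, sorted map from key to bytes.
// Directory-style traversal is emulated purely by filtering on prefixes.
fn list_with_prefix<'a>(
    store: &'a BTreeMap<String, Vec<u8>>,
    prefix: &str,
) -> Vec<&'a str> {
    store
        .keys()
        .filter(|k| k.starts_with(prefix))
        .map(|k| k.as_str())
        .collect()
}

// Derive the first-level "subdirectories" under a prefix, the way a
// LIST-with-delimiter call reports common prefixes.
fn common_prefixes(store: &BTreeMap<String, Vec<u8>>, prefix: &str) -> BTreeSet<String> {
    list_with_prefix(store, prefix)
        .into_iter()
        .filter_map(|k| {
            let rest = &k[prefix.len()..];
            rest.split_once('/').map(|(dir, _)| format!("{prefix}{dir}/"))
        })
        .collect()
}

fn main() {
    let mut store = BTreeMap::new();
    for key in ["pictures/a.jpg", "pictures/b.jpg", "parquet/1.parquet"] {
        store.insert(key.to_string(), Vec::new());
    }
    // All objects "inside" the pictures/ pseudo-directory
    println!("{:?}", list_with_prefix(&store, "pictures/"));
    // The top-level pseudo-directories
    println!("{:?}", common_prefixes(&store, ""));
}
```

&lt;p&gt;Note there is no directory object anywhere: “pictures/” exists only because some keys happen to start with it.&lt;/p&gt;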

&lt;p&gt;Having a focused, easy-to-use, high-performance, async object store library, written in idiomatic Rust, frees you from worrying about these details and lets you instead focus on your system’s logic. The underlying implementation is abstracted away from application code, and can easily be selected at runtime, allowing the same binary to run in multiple clouds.&lt;/p&gt;

&lt;p&gt;This flexibility also facilitates local development, as it allows testing against a local filesystem, or even an in-memory store, without requiring any additional binaries such as &lt;a href="https://min.io/"&gt;MinIO&lt;/a&gt;, while still allowing the use of familiar tools such as &lt;code class="language-rust"&gt;ls&lt;/code&gt; and &lt;code class="language-rust"&gt;cat&lt;/code&gt; or your choice of file browser.&lt;/p&gt;

&lt;h2 id="how-to-use-it"&gt;How to use it?&lt;/h2&gt;

&lt;p&gt;Here is a simplistic example that counts the zero bytes in files stored on remote object storage:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;use std::sync::Arc;

use futures::stream::{FuturesOrdered, StreamExt};
use object_store::{path::Path, ObjectStore};

#[tokio::main]
async fn main() {
    let object_store: Arc&amp;lt;dyn ObjectStore&amp;gt; = get_object_store();

    // List all objects in the "parquet" prefix (aka directory)
    let path: Path = "parquet".try_into().unwrap();
    let list_stream = object_store
        .list(Some(&amp;amp;path))
        .await
        .expect("Error listing files");

    // For each listed object, fetch its contents and count the zeros
    list_stream
        .map(|meta| async {
            let meta = meta.expect("Error listing");

            // Fetch the bytes from object storage as a stream
            let stream = object_store
                .get(&amp;amp;meta.location)
                .await
                .unwrap()
                .into_stream();

            // Count the zeros in each chunk and sum the counts
            let num_zeros = stream
                .map(|bytes| {
                    let bytes = bytes.unwrap();
                    bytes.iter().filter(|b| **b == 0).count()
                })
                .collect::&amp;lt;Vec&amp;lt;usize&amp;gt;&amp;gt;()
                .await
                .into_iter()
                .sum::&amp;lt;usize&amp;gt;();

            (meta.location.to_string(), num_zeros)
        })
        .collect::&amp;lt;FuturesOrdered&amp;lt;_&amp;gt;&amp;gt;()
        .await
        .collect::&amp;lt;Vec&amp;lt;_&amp;gt;&amp;gt;()
        .await
        .into_iter()
        .for_each(|i| println!("{} has {} zeros", i.0, i.1));
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which prints out something like:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;test_fixtures/parquet/1.parquet has 174 zeros
test_fixtures/parquet/2.parquet has 53 zeros&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As written, the code lists the files (the LIST requests are paginated under the covers) and fetches all of their contents in parallel. This may not be great if there are thousands of files. However, we can easily take advantage of Rust’s stream combinators and change&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;.collect::&amp;lt;FuturesOrdered&amp;lt;_&amp;gt;&amp;gt;()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;to&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;.buffered(10)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;which limits the program to at most 10 GET requests in flight at any time.&lt;/p&gt;

&lt;p&gt;The coolest part of the object_store crate is that the same code works for all the different object stores – the only thing that changes is the definition of &lt;code class="language-rust"&gt;get_object_store&lt;/code&gt;.&lt;/p&gt;
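&lt;p&gt;Each cloud backend lives behind a Cargo feature flag, so a binary only compiles the implementations it needs. A minimal sketch of the dependency declaration follows; the version number is illustrative, and you should check the crate documentation for the exact feature names in the version you use:&lt;/p&gt;

```toml
[dependencies]
# Enable only the cloud backends you need; the local filesystem
# and in-memory stores are always available.
object_store = { version = "0.4", features = ["aws", "azure", "gcp"] }
```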

&lt;p&gt;To read from S3:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;fn get_object_store() -&amp;gt; Arc&amp;lt;dyn ObjectStore&amp;gt; {
    let s3 = AmazonS3Builder::new()
        .with_access_key_id(ACCESS_KEY_ID)
        .with_secret_access_key(SECRET_KEY)
        .with_region(REGION)
        .with_bucket_name(BUCKET_NAME)
        .build()
        .expect("error creating s3");

    Arc::new(s3)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To read from Azure:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;fn get_object_store() -&amp;gt; Arc&amp;lt;dyn ObjectStore&amp;gt; {
    let azure = MicrosoftAzureBuilder::new()
        .with_account(STORAGE_ACCOUNT)
        .with_access_key(ACCESS_KEY)
        .with_container_name(BUCKET_NAME)
        .build()
        .expect("error creating azure");

    Arc::new(azure)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To read from GCP:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;fn get_object_store() -&amp;gt; Arc&amp;lt;dyn ObjectStore&amp;gt; {
    let gcs = GoogleCloudStorageBuilder::new()
        .with_service_account_path(PATH_TO_SERVICE_ACCOUNT_JSON)
        .with_bucket_name(BUCKET_NAME)
        .build()
        .expect("error creating gcs");
    Arc::new(gcs)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To read from the local filesystem:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-rust"&gt;fn get_object_store() -&amp;gt; Arc&amp;lt;dyn ObjectStore&amp;gt; {
    let local_fs =
        LocalFileSystem::new_with_prefix(PREFIX)
          .expect("Error creating local file system");
    Arc::new(local_fs)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To reiterate, the major benefit is that you do not have to integrate different abstractions for the different object stores – the client code is always the same, and under the covers it uses the appropriate optimized implementation.&lt;/p&gt;
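&lt;p&gt;The pattern that makes this possible is ordinary Rust trait-object dispatch. The following simplified, synchronous sketch (the real ObjectStore trait is async and far richer; the trait and types here are hypothetical) shows how client code can depend only on the trait while the backend is chosen at runtime:&lt;/p&gt;

```rust
use std::sync::Arc;

// Simplified, synchronous stand-in for the pattern: client code depends
// only on the trait object, never on a concrete backend. (The real
// ObjectStore trait is async and has a much larger surface area.)
trait Store {
    fn get(&self, key: &str) -> Vec<u8>;
}

struct InMemory;
impl Store for InMemory {
    fn get(&self, _key: &str) -> Vec<u8> {
        b"in-memory bytes".to_vec()
    }
}

struct LocalFs;
impl Store for LocalFs {
    fn get(&self, _key: &str) -> Vec<u8> {
        b"local-fs bytes".to_vec()
    }
}

// The backend is selected at runtime, e.g. from configuration, so the
// same binary can run against different storage systems.
fn get_store(kind: &str) -> Arc<dyn Store> {
    match kind {
        "memory" => Arc::new(InMemory),
        _ => Arc::new(LocalFs),
    }
}

fn main() {
    // Identical client code for every backend
    for kind in ["memory", "file"] {
        let store = get_store(kind);
        println!("{} -> {} bytes", kind, store.get("parquet/1.parquet").len());
    }
}
```

&lt;p&gt;Swapping the in-memory backend for a real filesystem or cloud store changes only &lt;code class="language-rust"&gt;get_store&lt;/code&gt;, mirroring how the examples above only vary &lt;code class="language-rust"&gt;get_object_store&lt;/code&gt;.&lt;/p&gt;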

&lt;p&gt;The &lt;a href="https://crates.io/crates/object_store"&gt;object_store&lt;/a&gt; crate is also extensible, allowing other object storage systems to be plugged in while retaining the ability to read files from the local filesystem and to take advantage of the optimized file access some systems offer – see &lt;a href="https://docs.rs/object_store/latest/object_store/enum.GetResult.html"&gt;GetResult&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A more full-featured and working example can be found in the &lt;a href="https://github.com/alamb/rust_object_store_demo"&gt;rust_object_store_demo&lt;/a&gt; repository.&lt;/p&gt;

&lt;h2 id="why-donate-to-apache"&gt;Why donate to Apache&lt;/h2&gt;

&lt;p&gt;The dream for Rust is the development productivity of Python or Ruby with the speed and memory efficiency of C/C++. Part of delivering this dream is ensuring that it integrates easily with the broader technology ecosystem, and in modern analytic systems this increasingly means data on object storage.&lt;/p&gt;

&lt;p&gt;Thus, it is important to make it easy, and yet still efficient, for Rust programs to read and write data to object stores (e.g. AWS S3, Google Cloud Storage, Azure Blob Storage). There are individual crates that implement cloud-provider-specific SDKs, such as &lt;a href="https://crates.io/crates/rusoto_s3"&gt;rusoto_s3&lt;/a&gt; or &lt;a href="https://crates.io/crates/azure_storage"&gt;azure_storage&lt;/a&gt;; however, accessing the most common feature set via the same interface is often what is needed to accelerate the development of cross-cloud analytic systems. This crate is explicitly NOT meant to replace the full-blown cloud SDKs, but instead to provide a consistent object store abstraction that is portable across the many different underlying implementations.&lt;/p&gt;

&lt;p&gt;We had exactly this requirement when we set out to develop &lt;a href="https://github.com/influxdata/influxdb"&gt;influxdb_iox&lt;/a&gt;. InfluxDB and InfluxData Cloud run on AWS, GCP, Azure, and on-prem, and we needed IOx to do so as well. We could not find an existing library that suited our needs, so the InfluxData IOx team developed one within our project.&lt;/p&gt;

&lt;p&gt;This effort was originally implemented by Rust Ecosystem Legend Carol (Nichols || Goulding) @&lt;a href="https://github.com/carols10cents"&gt;carols10cents&lt;/a&gt; (co-author of &lt;a href="https://doc.rust-lang.org/stable/book/"&gt;the Rust Book&lt;/a&gt;) and heavily extended by &lt;a href="mailto:mneumann@influxdata.com"&gt;Marco Neumann&lt;/a&gt; and &lt;a href="mailto:raphael@influxdata.com"&gt;Raphael Taylor-Davies&lt;/a&gt; as we crafted its integration into DataFusion.&lt;/p&gt;

&lt;p&gt;IOx uses the Rust, Apache Arrow, Apache Parquet and DataFusion projects, which we also contribute to heavily, and it was increasingly important that IOx’s object store interactions were efficient via DataFusion. As we investigated the alternatives, we hit the point where this required deeper integration with the object store.&lt;/p&gt;

&lt;p&gt;We hope that this donation further accelerates the creation of high-quality analytic systems in Rust and can’t wait to see what the community builds with it! We especially hope that the alignment with Apache Arrow will permit an elegantly integrated experience with libraries that can easily and efficiently read arrow-compatible files, such as parquet, CSV and newline-delimited JSON, natively from local or remote object storage. For applications that desire SQL or other higher level query engine capabilities, check out &lt;a href="https://github.com/apache/arrow-datafusion"&gt;Apache Arrow DataFusion&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can see more about the donation, and its rationale in &lt;a href="https://github.com/influxdata/object_store_rs/issues/41"&gt;this GitHub issue&lt;/a&gt; and &lt;a href="https://github.com/apache/arrow-rs/issues/2030"&gt;this one as well&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="whats-next"&gt;What’s next&lt;/h2&gt;

&lt;p&gt;In the near term, we plan better integration with the &lt;a href="https://docs.rs/parquet/latest/parquet/"&gt;parquet&lt;/a&gt; crate. In particular, the &lt;a href="https://docs.rs/parquet/latest/parquet/arrow/async_reader/index.html"&gt;async parquet reader&lt;/a&gt; was explicitly developed with a generic object_store crate in mind. It currently supports projection and row-group-level predicate pushdown to minimize the data fetched from object storage, and support for page- and row-level predicate pushdown is likely to land in the next release, slated for the 22nd of August 2022.&lt;/p&gt;

&lt;p&gt;We also expect to continue to improve the integration with &lt;a href="https://github.com/apache/arrow-datafusion"&gt;Apache Arrow DataFusion&lt;/a&gt;, ensuring it provides best in class support for querying data from object storage, efficiently decoupling IO from CPU-bound work, and making the most efficient use of modern multicore processors.&lt;/p&gt;

&lt;p&gt;Finally, there is an ongoing effort to move away from depending on large SDKs such as rusoto and the Azure SDK for Rust. Whilst they have served us well, moving away from them will significantly reduce the dependency burden, simplify the implementation, and further improve consistency across the various implementations.&lt;/p&gt;

&lt;h2 id="join-the-community"&gt;Join the community&lt;/h2&gt;

&lt;p&gt;We think a thriving community drives everyone forward. We encourage you to check out the &lt;a href="https://docs.rs/object_store/latest/object_store/"&gt;crate&lt;/a&gt; and lend us a hand! Try it out in your project and let us know how it goes, or find us on GitHub &lt;a href="https://github.com/apache/arrow-rs/tree/master/object_store"&gt;here&lt;/a&gt;. There is a list of good open items for newcomers &lt;a href="https://github.com/apache/arrow-rs/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22+label%3Aobject-store"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="kudos"&gt;Kudos&lt;/h2&gt;

&lt;p&gt;Thank you to &lt;a href="mailto:raphael@influxdata.com"&gt;Raphael Taylor-Davies&lt;/a&gt;, &lt;a href="mailto:paul@influxdata.com"&gt;Paul Dix&lt;/a&gt;, &lt;a href="mailto:ntran@influxdata.com"&gt;Nga Tran&lt;/a&gt;, and &lt;a href="mailto:mneumann@influxdata.com"&gt;Marco Neumann&lt;/a&gt; who reviewed early versions of this document and contributed many improvements.&lt;/p&gt;
</description>
      <pubDate>Mon, 22 Aug 2022 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/rust-object-store-donation/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/rust-object-store-donation/</guid>
      <category>Use Cases</category>
      <category>Developer</category>
      <author>Andrew Lamb, Raphael Taylor-Davies (InfluxData)</author>
    </item>
  </channel>
</rss>
