<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>InfluxData Blog - Andrew Lamb</title>
    <description>Posts by Andrew Lamb on the InfluxData Blog</description>
    <link>https://www.influxdata.com/blog/author/andrew-lamb/</link>
    <language>en-us</language>
    <lastBuildDate>Thu, 03 Apr 2025 07:00:00 +0000</lastBuildDate>
    <pubDate>Thu, 03 Apr 2025 07:00:00 +0000</pubDate>
    <ttl>1800</ttl>
    <item>
      <title>Optimizing SQL (and DataFrames) in DataFusion: Part 2</title>
      <description>&lt;p&gt;&lt;em&gt;Part 2: Optimizers in Apache DataFusion&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://www.influxdata.com/blog/optimizing-sql-dataframes-part-one/?utm_source=website&amp;amp;utm_medium=direct&amp;amp;utm_campaign=optimizing_sql_dataframes_part_two&amp;amp;utm_content=blog"&gt;first part of this post&lt;/a&gt;, we discussed what a Query Optimizer is, what role it plays, and how industrial optimizers are organized. In this second post, we describe various optimizations found in &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and other industrial systems in more detail.&lt;/p&gt;

&lt;p&gt;DataFusion contains high-quality, full-featured implementations for &lt;em&gt;Always Optimizations&lt;/em&gt; and &lt;em&gt;Engine Specific Optimizations&lt;/em&gt; (defined in Part 1). Optimizers are implemented as rewrites of &lt;code class="language-markup"&gt;LogicalPlan&lt;/code&gt; in the &lt;a href="https://github.com/apache/datafusion/tree/main/datafusion/optimizer"&gt;logical optimizer&lt;/a&gt; or rewrites of &lt;code class="language-markup"&gt;ExecutionPlan&lt;/code&gt; in the &lt;a href="https://github.com/apache/datafusion/tree/main/datafusion/physical-optimizer"&gt;physical optimizer&lt;/a&gt;. This design means the same optimizer passes are applied for SQL and DataFrame queries, as well as plans for other query language frontends such as &lt;a href="https://github.com/influxdata/influxdb3_core/tree/26a30bf8d6e2b6b3f1dd905c4ec27e3db6e20d5f/iox_query_influxql"&gt;InfluxQL&lt;/a&gt; in InfluxDB 3, &lt;a href="https://github.com/GreptimeTeam/greptimedb/blob/0bd322a078cae4f128b791475ec91149499de33a/src/query/src/promql/planner.rs#L1"&gt;PromQL&lt;/a&gt; in &lt;a href="https://greptime.com/"&gt;Greptime&lt;/a&gt;, and &lt;a href="https://github.com/vega/vegafusion/tree/dc15c1b9fc7d297f12bea919795d58cda1c88fcf/vegafusion-core/src/planning"&gt;Vega&lt;/a&gt; in &lt;a href="https://vegafusion.io/"&gt;VegaFusion&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="always-optimizations"&gt;Always optimizations&lt;/h2&gt;

&lt;p&gt;Some optimizations are so important that they are found in almost all query engines and are typically the first to be implemented, as they provide the largest cost-benefit ratio (and performance is terrible without them).&lt;/p&gt;

&lt;h3 id="predicatefilter-pushdown"&gt;Predicate/Filter Pushdown&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why&lt;/strong&gt;: Discards unneeded &lt;em&gt;rows&lt;/em&gt; as early as possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What&lt;/strong&gt;: Moves filters “down” in the plan so they run earlier in execution, as shown in Figure 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Implementations&lt;/strong&gt;: &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/push_down_filter.rs"&gt;DataFusion&lt;/a&gt;, &lt;a href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/filter_pushdown.cpp"&gt;DuckDB&lt;/a&gt;, &lt;a href="https://github.com/ClickHouse/ClickHouse/blob/master/src/Processors/QueryPlan/Optimizations/filterPushDown.cpp"&gt;ClickHouse&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The earlier data is filtered out in the plan, the less work the rest of the plan has to do. Most mature databases aggressively use filter pushdown/early filtering combined with techniques such as partition and storage pruning (e.g., &lt;a href="https://blog.xiangpeng.systems/posts/parquet-to-arrow/"&gt;Parquet Row Group pruning&lt;/a&gt;) for performance.&lt;/p&gt;

&lt;p&gt;An extreme and somewhat contrived example query:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT city, COUNT(*) FROM population GROUP BY city HAVING city = ‘BOSTON’;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Semantically, HAVING is &lt;a href="https://www.datacamp.com/tutorial/sql-order-of-execution"&gt;evaluated after&lt;/a&gt; GROUP BY in SQL. However, computing the population of all cities and discarding everything except Boston is much slower than computing only the population for Boston, so most Query Optimizers will evaluate the filter before the aggregation.&lt;/p&gt;
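
&lt;p&gt;Conceptually, the optimizer rewrites the plan as if the filter had been written in a &lt;code class="language-markup"&gt;WHERE&lt;/code&gt; clause instead. Here is a sketch of the equivalent SQL (the actual rewrite happens on the plan, not the query text):&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT city, COUNT(*) FROM population WHERE city = 'BOSTON' GROUP BY city;&lt;/code&gt;&lt;/pre&gt;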

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/4RLCtj5BxYM6rMZEJaTWDL/db7a6d4d1edad92f39627efe95bd1fcd/Screenshot_2025-04-02_at_8.12.37_AM.png" alt="Figure 1" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Filter Pushdown. In (&lt;strong&gt;A&lt;/strong&gt;), without filter pushdown, the operator processes more rows, reducing efficiency. In (&lt;strong&gt;B&lt;/strong&gt;) with filter pushdown, the operator receives fewer rows, resulting in less overall work and a faster and more efficient query.&lt;/p&gt;

&lt;h3 id="projection-pushdown"&gt;Projection pushdown&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why&lt;/strong&gt;: Discards unneeded &lt;em&gt;columns&lt;/em&gt; as early as possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; Pushes “projection” (keeping only certain columns) earlier in the plan, as shown in Figure 2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Implementations:&lt;/strong&gt; &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/projection_pushdown.rs"&gt;DataFusion&lt;/a&gt;, &lt;a href="https://github.com/duckdb/duckdb/blob/a8a6a080c8809d5d4b3c955e9f113574f6f0bfe0/src/optimizer/pushdown/pushdown_projection.cpp"&gt;DuckDB&lt;/a&gt;, &lt;a href="https://github.com/ClickHouse/ClickHouse/blob/master/src/Processors/QueryPlan/Optimizations/optimizeUseNormalProjection.cpp"&gt;ClickHouse&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly to the motivation for &lt;em&gt;Filter Pushdown&lt;/em&gt;, the earlier the plan stops doing something, the less work it does overall and the faster it runs. For Projection Pushdown, if columns are not needed later in a plan, copying the data to the output of other operators is unnecessary, and the costs of copying can add up. For example, in Figure 3 of Part 1, the &lt;code class="language-markup"&gt;species&lt;/code&gt; column is only needed to evaluate the filter within the scan, and the &lt;code class="language-markup"&gt;notes&lt;/code&gt; column is never used, so copying them through the rest of the plan is unnecessary.&lt;/p&gt;

&lt;p&gt;Projection Pushdown is especially effective and important for column store databases, where the storage format itself (such as &lt;a href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt;) supports efficiently reading only a subset of required columns. It is &lt;a href="https://blog.xiangpeng.systems/posts/parquet-pushdown/"&gt;especially powerful in combination with filter pushdown&lt;/a&gt;. Projection Pushdown is still important but less effective for row-oriented formats such as JSON or CSV, where each column in each row must be parsed even if it is not used in the plan.&lt;/p&gt;
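
&lt;p&gt;For example, in the following query over the &lt;code class="language-markup"&gt;observations&lt;/code&gt; table from Part 1, only the &lt;code class="language-markup"&gt;location&lt;/code&gt;, &lt;code class="language-markup"&gt;population&lt;/code&gt;, and &lt;code class="language-markup"&gt;species&lt;/code&gt; columns are referenced, so with projection pushdown a Parquet-backed scan never decodes the &lt;code class="language-markup"&gt;observation_time&lt;/code&gt; or &lt;code class="language-markup"&gt;notes&lt;/code&gt; columns (a sketch reusing the Part 1 schema):&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT location, AVG(population)
FROM observations
WHERE species = 'contrarian spider'
GROUP BY location;&lt;/code&gt;&lt;/pre&gt;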

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/dp3eJCvF5vzntCg9OzAFF/8cbbe5a600426493bf9085a6b4f8a0c5/image5.png" alt="image5" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2:&lt;/strong&gt; In (&lt;strong&gt;A&lt;/strong&gt;), without projection pushdown, the operator receives more columns, reducing efficiency. In (&lt;strong&gt;B&lt;/strong&gt;), with projection pushdown, the operator receives fewer columns, leading to optimized execution.&lt;/p&gt;

&lt;h3 id="limit-pushdown"&gt;Limit Pushdown&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why&lt;/strong&gt;: The earlier the plan stops generating data, the less overall work it does, and some operators have more efficient implementations when a limit is present.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; Pushes limits (maximum row counts) down in a plan as early as possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Implementations:&lt;/strong&gt; &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/push_down_limit.rs"&gt;DataFusion&lt;/a&gt;, &lt;a href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/limit_pushdown.cpp"&gt;DuckDB&lt;/a&gt;, &lt;a href="https://github.com/ClickHouse/ClickHouse/blob/master/src/Processors/QueryPlan/Optimizations/limitPushDown.cpp"&gt;ClickHouse&lt;/a&gt;, Spark (&lt;a href="https://github.com/apache/spark/blob/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/LimitPushDownThroughWindow.scala"&gt;Window&lt;/a&gt; and &lt;a href="https://github.com/apache/spark/blob/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushProjectionThroughLimit.scala"&gt;Projection&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Often, queries have a &lt;code class="language-markup"&gt;LIMIT&lt;/code&gt; or other clause that allows them to stop generating results early, so the sooner they can stop execution, the more efficiently they will execute.&lt;/p&gt;

&lt;p&gt;In addition, DataFusion and other systems have more efficient implementations of some operators that can be used if there is a limit. The classic example is replacing a full sort + limit with a &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.TopK.html"&gt;TopK&lt;/a&gt; operator that only tracks the top values using a heap. Similarly, DataFusion’s Parquet reader stops fetching and opening additional files once the limit is hit.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/f90969f55b454d92a6249446e0baeca1/fa8fce28e353df8cf056acc26c584f20/unnamed.png" alt="" /&gt;
&lt;strong&gt;Figure 3&lt;/strong&gt;: In (&lt;strong&gt;A&lt;/strong&gt;), without limit pushdown, all data is sorted and everything except the first few rows is discarded. In (&lt;strong&gt;B&lt;/strong&gt;), with limit pushdown, the Sort is replaced with a TopK operator, which does much less work.&lt;/p&gt;
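
&lt;p&gt;For example, a query such as the following sketch can be answered by a TopK operator that maintains only the 10 largest &lt;code class="language-markup"&gt;population&lt;/code&gt; values seen so far in a heap, instead of sorting the entire input and discarding all but 10 rows:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT location, population
FROM observations
ORDER BY population DESC
LIMIT 10;&lt;/code&gt;&lt;/pre&gt;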

&lt;h3 id="expression-simplification--constant-folding"&gt;Expression Simplification / Constant Folding&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why&lt;/strong&gt;: Evaluating the same expression for each row when its value doesn’t change is wasteful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What&lt;/strong&gt;: Partially evaluates and/or algebraically simplifies expressions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Implementations:&lt;/strong&gt; &lt;a href="https://github.com/apache/datafusion/tree/main/datafusion/optimizer/src/simplify_expressions"&gt;DataFusion&lt;/a&gt;, DuckDB (has several &lt;a href="https://github.com/duckdb/duckdb/tree/7b18f0f3691c1b6367cf68ed2598d7034e14f41b/src/optimizer/rule"&gt;rules&lt;/a&gt; such as &lt;a href="https://github.com/duckdb/duckdb/blob/7b18f0f3691c1b6367cf68ed2598d7034e14f41b/src/optimizer/rule/constant_folding.cpp"&gt;constant folding&lt;/a&gt;, and &lt;a href="https://github.com/duckdb/duckdb/blob/7b18f0f3691c1b6367cf68ed2598d7034e14f41b/src/optimizer/rule/comparison_simplification.cpp"&gt;comparison simplification&lt;/a&gt;), &lt;a href="https://github.com/apache/spark/blob/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala"&gt;Spark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If an expression doesn’t change from row to row, it is better to evaluate it &lt;strong&gt;once&lt;/strong&gt; during planning. This is a classic compiler technique used in database systems.&lt;/p&gt;

&lt;p&gt;For example, given a query that finds all values from the current year:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT … WHERE extract(year from time_column) = extract(year from now())&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Evaluating &lt;code class="language-markup"&gt;extract(year from now())&lt;/code&gt; on every row is much more expensive than evaluating it once during planning, so the optimizer computes the value once and rewrites the query to use a constant:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT … WHERE extract(year from time_column) = 2025&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Furthermore, it is often possible to push such predicates &lt;strong&gt;into&lt;/strong&gt; scans.&lt;/p&gt;
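
&lt;p&gt;The same pass also applies algebraic identities. A few illustrative rewrites of this kind (a sketch; the exact rule sets are in the implementations linked above):&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;-- Constant folding: arithmetic on literals is evaluated at plan time
WHERE x &amp;lt; 1 + 2        -- becomes: x &amp;lt; 3
-- Boolean identities
WHERE y = 5 AND true   -- becomes: y = 5
WHERE z &amp;gt; 10 OR false  -- becomes: z &amp;gt; 10&lt;/code&gt;&lt;/pre&gt;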

&lt;h3 id="rewriting-code-classlanguage-markupouter-joincode--code-classlanguage-markupinner-joincode"&gt;Rewriting &lt;code class="language-markup"&gt;OUTER JOIN&lt;/code&gt; → &lt;code class="language-markup"&gt;INNER JOIN&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; &lt;code class="language-markup"&gt;INNER JOIN&lt;/code&gt; implementations are almost always faster (as they are simpler) than &lt;code class="language-markup"&gt;OUTER JOIN&lt;/code&gt; implementations. &lt;code class="language-markup"&gt;INNER JOIN&lt;/code&gt;s impose fewer restrictions on other optimizer passes (such as join reordering and additional filter pushdown).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What&lt;/strong&gt;: In cases where &lt;code class="language-markup"&gt;null&lt;/code&gt; rows introduced by an &lt;code class="language-markup"&gt;OUTER JOIN&lt;/code&gt; will not appear in the results, it can be rewritten to an &lt;code class="language-markup"&gt;INNER JOIN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Implementations:&lt;/strong&gt; &lt;a href="https://github.com/apache/datafusion/blob/6028474969f0bfead96eb7f413791470afb6bf82/datafusion/optimizer/src/eliminate_outer_join.rs"&gt;DataFusion&lt;/a&gt;, &lt;a href="https://github.com/apache/spark/blob/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala#L124-L158"&gt;Spark&lt;/a&gt;, &lt;a href="https://github.com/ClickHouse/ClickHouse/blob/master/src/Processors/QueryPlan/Optimizations/convertOuterJoinToInnerJoin.cpp"&gt;ClickHouse&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For example, given a query such as the following:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT …
FROM orders LEFT OUTER JOIN customer ON (orders.cid = customer.id)
WHERE customer.last_name = ‘Lamb’&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;code class="language-markup"&gt;LEFT OUTER JOIN&lt;/code&gt; keeps all rows in &lt;code class="language-markup"&gt;orders&lt;/code&gt; that don’t have a matching customer but fills in the fields with &lt;code class="language-markup"&gt;null&lt;/code&gt;. All such rows will be filtered out by &lt;code class="language-markup"&gt;customer.last_name = ‘Lamb’&lt;/code&gt; and thus an &lt;code class="language-markup"&gt;INNER JOIN&lt;/code&gt; produces the same answer. This is illustrated in Figure 4.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/1i0kseWtDBoBnvhVxTeAhy/7d28a9ad6902e3eaf6800524a79dcbda/Screenshot_2025-04-02_at_7.36.06_AM.png" alt="Figure 4" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 4&lt;/strong&gt;: Rewriting &lt;code class="language-markup"&gt;OUTER JOIN&lt;/code&gt; to &lt;code class="language-markup"&gt;INNER JOIN&lt;/code&gt;. In (A), the original query contains an &lt;code class="language-markup"&gt;OUTER JOIN&lt;/code&gt; and a filter on &lt;code class="language-markup"&gt;customer.last_name&lt;/code&gt;, which filters out all rows that might be introduced by the &lt;code class="language-markup"&gt;OUTER JOIN&lt;/code&gt;. In (B), the &lt;code class="language-markup"&gt;OUTER JOIN&lt;/code&gt; is converted to an &lt;code class="language-markup"&gt;INNER JOIN&lt;/code&gt;, and a more efficient implementation can be used.&lt;/p&gt;

&lt;h2 id="engine-specific-optimizations"&gt;Engine specific optimizations&lt;/h2&gt;

&lt;p&gt;As discussed in Part 1 of this blog, optimizers also contain a set of passes that are still always good to do but are closely tied to the specifics of the query engine. This section describes some common types.&lt;/p&gt;

&lt;h3 id="subquery-rewrites"&gt;Subquery Rewrites&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why&lt;/strong&gt;: Implementing subqueries by running a query for each row of the outer query is very expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What&lt;/strong&gt;: It is possible to rewrite subqueries as joins, which often perform much better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Implementations:&lt;/strong&gt; DataFusion (&lt;a href="https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/decorrelate.rs"&gt;one&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/decorrelate_predicate_subquery.rs"&gt;two&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/scalar_subquery_to_join.rs"&gt;three&lt;/a&gt;), &lt;a href="https://github.com/apache/spark/blob/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala"&gt;Spark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Evaluating subqueries a row at a time is so expensive that execution engines in high-performance analytic systems such as DataFusion and &lt;a href="https://vertica.com/"&gt;Vertica&lt;/a&gt; may not support row-at-a-time evaluation, given how terrible the performance would be. Instead, analytic systems rewrite such queries into joins, which can perform 100s or 1000s of times faster for large datasets. However, transforming subqueries to joins requires “exotic” join semantics such as &lt;code class="language-markup"&gt;SEMI JOIN&lt;/code&gt;, &lt;code class="language-markup"&gt;ANTI JOIN&lt;/code&gt;,  and variations on how to treat equality with null&lt;sup&gt;1&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;For a simple example, consider a query like this:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT customer.name 
FROM customer 
WHERE (SELECT sum(value) 
       FROM orders WHERE
       orders.cid = customer.id) &amp;gt; 10;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This can be rewritten into:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT customer.name 
FROM customer 
JOIN (
  SELECT customer.id as cid_inner, sum(value) s 
  FROM orders 
  GROUP BY customer.id
 ) ON (customer.id = cid_inner AND s &amp;gt; 10);&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We don’t have space to detail this transformation or explain why it is so much faster to run, but this and many other transformations allow efficient subquery evaluation.&lt;/p&gt;
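
&lt;p&gt;To give a flavor of the “exotic” join semantics mentioned above: an &lt;code class="language-markup"&gt;EXISTS&lt;/code&gt; subquery is typically decorrelated into a &lt;code class="language-markup"&gt;SEMI JOIN&lt;/code&gt;, which emits each outer row at most once no matter how many inner rows match (a sketch using the tables from the example above):&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT customer.name
FROM customer
WHERE EXISTS (SELECT 1 FROM orders WHERE orders.cid = customer.id);
-- is rewritten to plan-level pseudo-SQL like:
--   customer SEMI JOIN orders ON (orders.cid = customer.id)&lt;/code&gt;&lt;/pre&gt;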

&lt;h3 id="optimized-expression-evaluation"&gt;Optimized expression evaluation&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why&lt;/strong&gt;: The capabilities of expression evaluation vary from system to system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What&lt;/strong&gt;: Optimize expression evaluation for the particular execution environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Implementations&lt;/strong&gt;: There are many examples of this type of optimization, including DataFusion’s &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/common_subexpr_eliminate.rs"&gt;Common Subexpression Elimination&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/blob/8f3f70877febaa79be3349875e979d3a6e65c30e/datafusion/optimizer/src/simplify_expressions/unwrap_cast.rs#L70"&gt;unwrap_cast&lt;/a&gt;, and &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/extract_equijoin_predicate.rs"&gt;identifying equality join predicates&lt;/a&gt;. DuckDB &lt;a href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/in_clause_rewriter.cpp"&gt;rewrites IN clauses&lt;/a&gt; and &lt;a href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/sum_rewriter.cpp"&gt;SUM expressions&lt;/a&gt;. Spark also &lt;a href="https://github.com/apache/spark/blob/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala"&gt;unwraps casts in binary comparisons&lt;/a&gt; and &lt;a href="https://github.com/apache/spark/blob/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala"&gt;adds special runtime filters&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To give a specific example of what DataFusion’s common subexpression elimination does, consider this query that refers to a complex expression multiple times:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT date_bin('1 hour', time, '1970-01-01')
FROM table 
WHERE date_bin('1 hour', time, '1970-01-01') &amp;gt;= '2025-01-01 00:00:00'
ORDER BY date_bin('1 hour', time, '1970-01-01')&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Evaluating &lt;code class="language-markup"&gt;date_bin('1 hour', time, '1970-01-01')&lt;/code&gt; each time it is encountered is inefficient compared to calculating its result once and reusing that result when it is encountered again (similar to caching). This reuse is called &lt;em&gt;Common Subexpression Elimination&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Some execution engines implement this optimization internally in their expression evaluation engines, but DataFusion represents it explicitly using a separate Projection plan node, as illustrated in Figure 5. Effectively, the query above is rewritten to the following:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT time_chunk 
FROM(SELECT date_bin('1 hour', time, '1970-01-01') as time_chunk 
     FROM table)
WHERE time_chunk &amp;gt;= '2025-01-01 00:00:00'
ORDER BY time_chunk&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/81dbadd751754e85a51e558f15090792/4f363fc8d296732756b9ca167e24b9a4/unnamed.png" alt="" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 5:&lt;/strong&gt; Adding a Projection to evaluate a common complex subexpression decreases the work for subsequent stages.&lt;/p&gt;

&lt;h3 id="algorithm-selection"&gt;Algorithm Selection&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why&lt;/strong&gt;: Different engines have different specialized operators for certain operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What:&lt;/strong&gt; Selects specific implementations from the available operators based on properties of the query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Implementations:&lt;/strong&gt; DataFusion’s &lt;a href="https://github.com/apache/datafusion/blob/8f3f70877febaa79be3349875e979d3a6e65c30e/datafusion/physical-optimizer/src/enforce_sorting/mod.rs"&gt;EnforceSorting&lt;/a&gt; pass uses sort-optimized implementations, Spark’s &lt;a href="https://github.com/apache/spark/blob/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteAsOfJoin.scala"&gt;rewrite uses a special operator for ASOF joins&lt;/a&gt;, and ClickHouse’s &lt;a href="https://github.com/ClickHouse/ClickHouse/blob/7d15deda4b33282f356bb3e40a190d005acf72f2/src/Interpreters/ExpressionAnalyzer.cpp#L1066-L1080"&gt;join algorithm selection&lt;/a&gt; (such as &lt;a href="https://github.com/ClickHouse/ClickHouse/blob/7d15deda4b33282f356bb3e40a190d005acf72f2/src/Interpreters/ExpressionAnalyzer.cpp#L1022"&gt;when to use MergeJoin&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;For example, DataFusion uses a &lt;code class="language-markup"&gt;TopK&lt;/code&gt; (&lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.TopK.html"&gt;source&lt;/a&gt;) operator rather than a full &lt;code class="language-markup"&gt;Sort&lt;/code&gt; if there is also a limit on the query. Similarly, it may choose the more efficient &lt;code class="language-markup"&gt;PartialOrdered&lt;/code&gt; grouping operation when the data is sorted on the group keys, or a &lt;code class="language-markup"&gt;MergeJoin&lt;/code&gt; when the join inputs are sorted on the join keys.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2bf3413ccff3466991828b72cf0a07d9/62399e70917e8653cf50fe2ee7e92b65/unnamed.png" alt="" /&gt;
&lt;strong&gt;Figure 6:&lt;/strong&gt; An example of a specialized operation for grouping. In (&lt;strong&gt;A&lt;/strong&gt;), input data has no specified ordering, and DataFusion uses a hashing-based grouping operator (&lt;a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/aggregates/row_hash.rs"&gt;source&lt;/a&gt;) to determine distinct groups. In (&lt;strong&gt;B&lt;/strong&gt;), when the input data is ordered by the group keys, DataFusion uses a specialized grouping operator (&lt;a href="https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/src/aggregates/order"&gt;source&lt;/a&gt;) to find boundaries that separate groups.&lt;/p&gt;
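
&lt;p&gt;For example, if the source declares that its data is sorted on the group keys (in DataFusion, e.g., via &lt;code class="language-markup"&gt;CREATE EXTERNAL TABLE ... WITH ORDER&lt;/code&gt;), a query such as this sketch can use the ordered grouping operator from Figure 6 (&lt;strong&gt;B&lt;/strong&gt;):&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;-- Assumes 'observations' is declared as sorted on (location)
SELECT location, AVG(population)
FROM observations
GROUP BY location;&lt;/code&gt;&lt;/pre&gt;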

&lt;h3 id="using-statistics-directly"&gt;Using Statistics Directly&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why&lt;/strong&gt;: Using pre-computed statistics from a table, without actually reading or opening files, is much faster than processing data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What&lt;/strong&gt;: Replace calculations on data with the value from statistics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Implementations:&lt;/strong&gt; &lt;a href="https://github.com/apache/datafusion/blob/8f3f70877febaa79be3349875e979d3a6e65c30e/datafusion/physical-optimizer/src/aggregate_statistics.rs"&gt;DataFusion&lt;/a&gt;, &lt;a href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/statistics_propagator.cpp"&gt;DuckDB&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some queries, such as the classic &lt;code class="language-markup"&gt;SELECT COUNT(*) FROM my_table&lt;/code&gt; used for data exploration, can be answered using statistics only. Optimizers often have access to statistics for other reasons (such as Access Path and Join Order Selection), and statistics are commonly stored in analytic file formats. For example, the &lt;a href="https://docs.rs/parquet/latest/parquet/file/metadata/index.html"&gt;Metadata&lt;/a&gt; of Apache Parquet files stores MIN, MAX, and COUNT information.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/89e2ad4cee9d443cba1f166a8551641c/fbb25ac39187b1eaa9193621b8622e63/unnamed.png" alt="" /&gt;
&lt;strong&gt;Figure 7:&lt;/strong&gt; When the aggregation result is already stored in the statistics, the query can be evaluated using the values from statistics without looking at any compressed data. The Optimizer replaces the Aggregation operation with values from statistics.&lt;/p&gt;
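
&lt;p&gt;Queries of the following form can often be answered this way, without scanning any data, when exact statistics are available (a sketch):&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT COUNT(*) FROM my_table;
SELECT MIN(time), MAX(time) FROM my_table;&lt;/code&gt;&lt;/pre&gt;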

&lt;h2 id="access-path-and-join-order-selection"&gt;Access path and join order selection&lt;/h2&gt;

&lt;h3 id="overview"&gt;Overview&lt;/h3&gt;

&lt;p&gt;Last but certainly not least are optimizations that choose between plans with potentially (very) different performance. The major options in this category are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Join Order:&lt;/strong&gt; In what order should tables be combined using JOINs?&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Access Paths:&lt;/strong&gt; Which copy of the data or index should be read to find matching tuples?&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Materialized_view"&gt;Materialized View&lt;/a&gt;: Can the query can be rewritten to use a materialized view (partially computed query results)? This topic deserves its own blog (or book); we don’t discuss it further here.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/26Aq3hmRoZqPlT1BfGkc3Z/7d120feb39922c135909e936db832c77/Screenshot_2025-04-02_at_7.35.08_AM.png" alt="Figure 8" /&gt;
&lt;strong&gt;Figure 8:&lt;/strong&gt; Access Path and Join Order Selection in Query Optimizers. Optimizers use heuristics to enumerate some subset of potential join orders (shape) and access paths (color). The plan with the lowest estimated cost is chosen according to some cost model. In this case, Plan 2, with a cost of 180,000, is chosen for execution as it has the lowest estimated cost.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This class of optimizations is a hard problem for at least the following reasons:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Exponential Search Space&lt;/strong&gt;: The number of potential plans increases exponentially as the number of joins and indexes increases.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Performance Sensitivity&lt;/strong&gt;: Often, different plans that are very similar in structure perform very differently. For example, swapping the input order to a hash join can result in 1000x or more (yes, thousandfold!) run time differences.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cardinality Estimation Errors&lt;/strong&gt;: Determining the optimal plan relies on cardinality estimates (e.g., how many rows will come out of each join). Estimating this cardinality is a &lt;a href="https://www.vldb.org/pvldb/vol9/p204-leis.pdf"&gt;known hard problem&lt;/a&gt;, and in practice, queries with as few as three joins often have large cardinality estimation errors.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id="heuristics-and-cost-based-optimization"&gt;Heuristics and Cost-Based Optimization&lt;/h3&gt;

&lt;p&gt;Industrial optimizers handle these problems using a combination of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Heuristics:&lt;/strong&gt; Prune the search space and avoid considering plans that are (almost) never good. Examples include considering only left-deep trees or using &lt;code class="language-markup"&gt;Foreign Key&lt;/code&gt; / &lt;code class="language-markup"&gt;Primary Key&lt;/code&gt; relationships to pick the build side of a hash join.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost Model&lt;/strong&gt;: Given the smaller set of candidate plans, the Optimizer then estimates their cost and picks the one with the lowest cost.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For some examples, you can read about &lt;a href="https://docs.databricks.com/aws/en/optimizations/cbo"&gt;Spark’s cost-based optimizer&lt;/a&gt; or look at the code for DataFusion’s &lt;a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs"&gt;join selection&lt;/a&gt; and DuckDB’s &lt;a href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/join_order/cost_model.cpp"&gt;cost model&lt;/a&gt; and &lt;a href="https://github.com/duckdb/duckdb/blob/84c87b12fa9554a8775dc243b4d0afd5b407321a/src/optimizer/join_order/plan_enumerator.cpp#L469-L472"&gt;join-order enumeration&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, the use of heuristics and (imprecise) cost models means optimizers must:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Make deep assumptions about the execution environment:&lt;/strong&gt; For example, the heuristics often include assumptions that joins implement &lt;a href="https://www.alibabacloud.com/blog/alibaba-cloud-analyticdb-for-mysql-create-ultimate-runtimefilter-capability_600228"&gt;sideways information passing (RuntimeFilters)&lt;/a&gt; or that Join operators always preserve a particular input.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use one particular objective function:&lt;/strong&gt; There are almost always trade-offs between desirable plan properties, such as execution speed, memory use, and robustness in the face of cardinality estimation errors. Industrial optimizers typically have one cost function that attempts to balance these properties, or a series of hard-to-use indirect tuning knobs to control the behavior.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Require statistics&lt;/strong&gt;: Typically cost models require up-to-date statistics, which can be expensive to compute, must be kept up to date as new data arrives, and often have trouble capturing the nonuniformity of real-world datasets.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id="join-ordering-in-datafusion"&gt;Join Ordering in DataFusion&lt;/h3&gt;

&lt;p&gt;DataFusion purposely does not include a sophisticated cost-based optimizer. Instead, in keeping with its &lt;a href="https://docs.rs/datafusion/latest/datafusion/#design-goals"&gt;design goals&lt;/a&gt;, it provides a reasonable default implementation along with extension points to customize behavior.&lt;/p&gt;

&lt;p&gt;Specifically, DataFusion includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A “Syntactic Optimizer” (joins run in the order listed in the query&lt;sup&gt;2&lt;/sup&gt;) with basic join reordering (&lt;a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs"&gt;source&lt;/a&gt;) to prevent join disasters&lt;/li&gt;
  &lt;li&gt;Support for &lt;a href="https://docs.rs/datafusion/latest/datafusion/common/struct.ColumnStatistics.html"&gt;ColumnStatistics&lt;/a&gt; and &lt;a href="https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html"&gt;Table Statistics&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;The framework for &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_expr/struct.AnalysisContext.html#structfield.selectivity"&gt;filter selectivity&lt;/a&gt; + join cardinality estimation&lt;/li&gt;
  &lt;li&gt;APIs for easily rewriting plans, such as the &lt;a href="https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview"&gt;TreeNode API&lt;/a&gt; and &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.swap_inputs"&gt;reordering joins&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This combination of features, along with &lt;a href="https://docs.rs/datafusion/latest/datafusion/execution/session_state/struct.SessionStateBuilder.html#method.with_physical_optimizer_rule"&gt;custom optimizer passes&lt;/a&gt;, lets users customize the behavior to their use case, such as custom indexes like &lt;a href="https://uwheel.rs/post/datafusion_uwheel/"&gt;uWheel&lt;/a&gt; and &lt;a href="https://docs.google.com/presentation/d/1mHDw1uZcOwlpUO3mA8aqSyk7IqeovpSuXG27clowXWE/edit#slide=id.p"&gt;materialized views&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The rationale for including only a basic optimizer is that any particular set of heuristics and cost model is unlikely to work well for the wide variety of DataFusion users because they have different tradeoffs.&lt;/p&gt;

&lt;p&gt;For example, some users may always have access to adequate resources, want the fastest query execution, and are willing to tolerate runtime errors or a performance cliff when there is insufficient memory. Other users, however, may be willing to accept a slower maximum performance in return for more predictable performance when running in a resource-constrained environment. This approach is not universally agreed upon. One of us has &lt;a href="https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers"&gt;previously argued the case for specialized optimizers&lt;/a&gt; in a more academic paper, and the topic comes up regularly in the DataFusion community (e.g., &lt;a href="https://github.com/apache/datafusion/issues/9846#issuecomment-2566568654"&gt;this recent comment&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Note: We are &lt;a href="https://github.com/apache/datafusion/issues/3929"&gt;actively improving&lt;/a&gt; this part of the code to help people write their own optimizers (🎣 come help us define and implement it!)&lt;/p&gt;

&lt;h2 id="to-summarize"&gt;To summarize&lt;/h2&gt;

&lt;p&gt;Optimizers are awesome, and we hope these two posts have demystified what they are and how they are implemented in industrial systems. As with many parts of modern query engine design, the common techniques are well known but require substantial effort to get right. DataFusion’s industrial-strength optimizers can and do serve many real-world systems well, and we expect that number to grow over time.&lt;/p&gt;

&lt;p&gt;We also think DataFusion provides interesting opportunities for optimizer research. As we discussed, there are still unsolved problems, such as optimal join ordering. Experiments in papers often use academic systems or modify optimizers in open source but tightly integrated systems (for example, the recent &lt;a href="https://www.vldb.org/pvldb/vol17/p1350-justen.pdf"&gt;POLAR paper&lt;/a&gt; uses DuckDB). However, this style means the research is constrained to the set of heuristics and structure provided by those particular systems. Hopefully, DataFusion’s documentation, &lt;a href="https://dl.acm.org/doi/10.1145/3626246.3653368"&gt;newly citeable SIGMOD paper&lt;/a&gt;, and modular design will encourage more broadly applicable research in this area.&lt;/p&gt;

&lt;p&gt;And finally, as always, if you are interested in working on query engines and learning more about how they are designed and implemented, please &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;join our community&lt;/a&gt;. We welcome first-time contributors as well as long-time participants to the fun of building a database together.&lt;/p&gt;

&lt;hr /&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;See &lt;a href="https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf"&gt;Unnesting Arbitrary Queries&lt;/a&gt; from Neumann and Kemper for a more academic treatment.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;One of my favorite terms I learned from Andy Pavlo’s CMU online lectures.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
</description>
      <pubDate>Thu, 03 Apr 2025 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/optimizing-sql-dataframes-part-two/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/optimizing-sql-dataframes-part-two/</guid>
      <category>Developer</category>
      <author>Andrew Lamb, Mustafa Akur (InfluxData)</author>
    </item>
    <item>
      <title>Optimizing SQL (and DataFrames) in DataFusion: Part 1</title>
      <description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Sometimes Query Optimizers are seen as a sort of black magic, &lt;a href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;“the most challenging problem in computer science,”&lt;/a&gt; according to Father Pavlo, or some behind-the-scenes player. We believe this perception is because:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;One must implement the rest of a database system (data storage, transactions, SQL parser, expression evaluation, plan execution, etc.) &lt;strong&gt;before&lt;/strong&gt; the optimizer becomes critical&lt;sup&gt;1&lt;/sup&gt;.&lt;/li&gt;
  &lt;li&gt;Some parts of the optimizer are tightly tied to the rest of the system (e.g., storage or indexes), so many classic optimizers are described with system-specific terminology.&lt;/li&gt;
&lt;li&gt;Some optimizer tasks, such as access path selection and join ordering, are known challenges that are not yet (practically) solved—maybe they really do require black magic 🤔.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;However, Query Optimizers are no more complicated in theory or practice than other parts of a database system, as we will argue in a series of posts:&lt;/p&gt;

&lt;p&gt;Part 1:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Review what a Query Optimizer is, what it does, and why you need one for SQL and DataFrames.&lt;/li&gt;
&lt;li&gt;Describe how industrial Query Optimizers are structured and the standard classes of optimizations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Part 2:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Describe the optimization categories with examples and pointers to implementations.&lt;/li&gt;
  &lt;li&gt;Describe &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt;’s rationale and approach to query optimization, specifically for access path and join ordering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After reading these blogs, we hope people will use DataFusion to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build their own system-specific optimizers.&lt;/li&gt;
  &lt;li&gt;Perform practical academic research on optimization (especially researchers working on new optimizations / join ordering—looking at you &lt;a href="https://15799.courses.cs.cmu.edu/spring2025/"&gt;CMU 15-799&lt;/a&gt;, next year).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="query-optimizer-background"&gt;Query Optimizer background&lt;/h2&gt;

&lt;p&gt;The key pitch for querying databases, and likely the key to the longevity of SQL (despite people’s love/hate relationship—see &lt;a href="https://db.cs.cmu.edu/seminar2025/"&gt;SQL or Death? Seminar Series – Spring 2025&lt;/a&gt;), is that it disconnects the &lt;code class="language-markup"&gt;WHAT&lt;/code&gt; you want to compute from the &lt;code class="language-markup"&gt;HOW&lt;/code&gt; to do it. SQL is a &lt;em&gt;declarative&lt;/em&gt; language: it describes what answers are desired, in contrast to an &lt;em&gt;imperative&lt;/em&gt; language such as Python, where you describe how to do the computation, as shown in Figure 1.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/4amC5zMhas941GjCbgiQvj/52d0c3963cf1544b0d278fbbd8d3fa1d/figure-1.png" alt="figure-1" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Query Execution: Users describe the answer they want using either a DataFrame or SQL. The query planner or DataFrame API translates that description into an &lt;em&gt;Initial Plan&lt;/em&gt;, which is correct but slow. The Query Optimizer then rewrites the initial plan to an &lt;em&gt;Optimized Plan&lt;/em&gt;, which computes the same results but faster and more efficiently. Finally, the Execution Engine executes the optimized plan producing results.&lt;/p&gt;

&lt;h2 id="sql-dataframes-logicalplan-equivalence"&gt;SQL, DataFrames, LogicalPlan equivalence&lt;/h2&gt;

&lt;p&gt;Given their name, it is not surprising that Query Optimizers can improve the performance of SQL queries. However, it is under-appreciated that this also applies to DataFrame-style APIs.&lt;/p&gt;

&lt;p&gt;Classic DataFrame systems such as &lt;a href="https://pandas.pydata.org/"&gt;pandas&lt;/a&gt; and &lt;a href="https://pola.rs/"&gt;Polars&lt;/a&gt; (by default) execute eagerly and thus have limited opportunities for optimization. However, more modern APIs such as Polars’ &lt;a href="https://docs.pola.rs/user-guide/lazy/using/"&gt;lazy API&lt;/a&gt;, Apache Spark &lt;a href="https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes"&gt;DataFrame&lt;/a&gt;, and DataFusion’s &lt;a href="https://datafusion.apache.org/user-guide/dataframe.html"&gt;DataFrame&lt;/a&gt; are much faster as they use the design shown in Figure 1 and apply many query optimization techniques.&lt;/p&gt;

&lt;h2 id="example-of-query-optimizer"&gt;Example of Query Optimizer&lt;/h2&gt;

&lt;p&gt;This section motivates the value of a Query Optimizer with an example. Let’s say you have some observations of animal behavior, as illustrated in Table 1.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Location&lt;/th&gt;
      &lt;th&gt;Species&lt;/th&gt;
      &lt;th&gt;Population&lt;/th&gt;
      &lt;th&gt;Observation Time&lt;/th&gt;
      &lt;th&gt;Notes&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;North&lt;/td&gt;
      &lt;td&gt;contrarian spider&lt;/td&gt;
      &lt;td&gt;100&lt;/td&gt;
      &lt;td&gt;2025-02-21T10:00:00Z&lt;/td&gt;
      &lt;td&gt;Watched Me&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;…&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;South&lt;/td&gt;
      &lt;td&gt;contrarian spider&lt;/td&gt;
      &lt;td&gt;234&lt;/td&gt;
      &lt;td&gt;2025-02-23T11:23:00Z&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;: Example observational data.&lt;/p&gt;

&lt;p&gt;If a user wants to know the average population of some species over the last month, they can write a SQL query or a DataFrame program such as the following:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT location, AVG(population)
FROM observations
WHERE species = ‘contrarian spider’ AND 
  observation_time &amp;gt;= now() - interval '1 month'
GROUP BY location&lt;/code&gt;&lt;/pre&gt;

&lt;pre class=""&gt;&lt;code class="language-bash"&gt;df.scan("observations")
  .filter(col("species").eq("contrarian spider"))
  .filter(col("observation_time").ge(now()).sub(interval('1 month')))
  .agg(vec![col(location)], vec![avg(col("population")])&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Within DataFusion, both the SQL and DataFrame are translated into the same &lt;code class="language-markup"&gt;LogicalPlan&lt;/code&gt;, a “tree of relational operators.” This is a fancy way of saying data flow graphs where the edges represent tabular data (rows + columns) and the nodes represent a transformation (see &lt;a href="https://youtu.be/EzZTLiSJnhY"&gt;this DataFusion overview video&lt;/a&gt; for more details). The initial &lt;code class="language-markup"&gt;LogicalPlan&lt;/code&gt; for the queries above is shown in Figure 2.&lt;/p&gt;
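
&lt;p&gt;In text form, the initial plan looks roughly like the following (a sketch in the spirit of DataFusion’s &lt;code class="language-markup"&gt;EXPLAIN&lt;/code&gt; output, not its exact rendering):&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-markup"&gt;Aggregate: groupBy=[location], aggr=[AVG(population)]
  Filter: species = 'contrarian spider' AND observation_time &amp;gt;= now() - interval '1 month'
    TableScan: observations&lt;/code&gt;&lt;/pre&gt;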

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/18Gsm7FSLzL9EY3Kpgps0f/10a18799cab00006012398fc8b927f2f/Part1-figure2.png" alt="Part1-figure2" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2&lt;/strong&gt;: Example initial &lt;code class="language-markup"&gt;LogicalPlan&lt;/code&gt; for SQL and DataFrame query. The plan is read from bottom to top, computing the results in each step.&lt;/p&gt;

&lt;p&gt;The optimizer’s job is to take this query plan and rewrite it into an alternate plan that computes the same results but faster, such as the one shown in Figure 3.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/7fLcpx7snCKCfYEH7hDP4/2dec74ab814a9092e46d56a8a044fc6e/figure-3.png" alt="figure-3" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt;: An example optimized plan that computes the same result as the plan in Figure 2 more efficiently. The diagram highlights where the optimizer has applied &lt;em&gt;Projection Pushdown&lt;/em&gt;, &lt;em&gt;Filter Pushdown&lt;/em&gt;, and &lt;em&gt;Constant Evaluation&lt;/em&gt;. Note that this is a simplified example for explanatory purposes, and actual optimizers such as the one in DataFusion perform additional tasks such as choosing specific aggregation algorithms.&lt;/p&gt;

&lt;h2 id="query-optimizer-implementation"&gt;Query Optimizer implementation&lt;/h2&gt;

&lt;p&gt;Industrial optimizers, such as those in DataFusion (&lt;a href="https://github.com/apache/datafusion/tree/334d6ec50f36659403c96e1bffef4228be7c458e/datafusion/optimizer/src"&gt;source&lt;/a&gt;), ClickHouse (&lt;a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes"&gt;source&lt;/a&gt;, &lt;a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations"&gt;source&lt;/a&gt;), DuckDB (&lt;a href="https://github.com/duckdb/duckdb/tree/4afa85c6a4dacc39524d1649fd8eb8c19c28ad14/src/optimizer"&gt;source&lt;/a&gt;), and Apache Spark (&lt;a href="https://github.com/apache/spark/tree/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer"&gt;source&lt;/a&gt;), are implemented as a series of passes or rules that rewrite a query plan. The overall optimizer is composed of a sequence of these rules,&lt;sup&gt;6&lt;/sup&gt; as shown in Figure 4. The specific order of the rules also often matters, but we will not discuss this detail in this post.&lt;/p&gt;

&lt;p&gt;A multi-pass design is standard because it helps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Understand, implement, and test each pass in isolation&lt;/li&gt;
  &lt;li&gt;Easily extend the optimizer by adding new passes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/34WAXPaNlEDKt6UKd1UJJx/b0ccedad9126a8433f76440b3130be12/Screenshot_2025-03-20_at_2.26.08_PM.png" alt="Figure 4" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 4&lt;/strong&gt;: Query Optimizers are implemented as a series of rules that each rewrite the query plan. Each rule’s algorithm is expressed as a transformation of a previous plan.&lt;/p&gt;

&lt;p&gt;There are three major classes of optimizations in industrial optimizers:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Always Optimizations&lt;/strong&gt;: These are always good to do and thus are always applied. This class of optimization includes expression simplification, predicate pushdown, and limit pushdown. These optimizations are typically simple in theory, though they require nontrivial amounts of code and tests to implement in practice.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Engine Specific Optimizations:&lt;/strong&gt; These optimizations take advantage of specific engine features, such as how expressions are evaluated or what particular hash or join implementations are available.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Access Path and Join Order Selection&lt;/strong&gt;: These passes choose one access method per table and a join order for execution, typically using heuristics and a cost model to make tradeoffs between the options. Databases often have multiple ways to access the data (e.g., index scan or full-table scan), as well as many potential orders to combine (join) multiple tables. These methods compute the same result but can vary drastically in performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This brings us to the end of Part 1. In &lt;a href="https://www.influxdata.com/blog/optimizing-sql-dataframes-part-two/"&gt;Part 2&lt;/a&gt;, we will explain these classes of optimizations in more detail and provide examples of how they are implemented in DataFusion and other systems.&lt;/p&gt;

&lt;hr /&gt;

&lt;ol&gt;
  &lt;li&gt;And thus in academic classes, by the time you get around to an optimizer the semester is over and everyone is ready for the semester to be done. Once industrial systems mature to the point where the optimizer is a bottleneck, the shiny newness of the &lt;a href="https://en.wikipedia.org/wiki/Gartner_hype_cycle"&gt;hype cycle&lt;/a&gt; has worn off and it is likely in the trough of disappointment.&lt;/li&gt;
&lt;/ol&gt;
</description>
      <pubDate>Mon, 31 Mar 2025 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/optimizing-sql-dataframes-part-one/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/optimizing-sql-dataframes-part-one/</guid>
      <category>Developer</category>
      <author>Andrew Lamb, Mustafa Akur (InfluxData)</author>
    </item>
    <item>
      <title>2025: The Year of 1,000 DataFusion-Based Systems </title>
      <description>&lt;p&gt;&lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; has reached an inflection point. It has matured beyond early adopters and is now a viable choice for anyone building highly performant analytic systems. I predict 2025 will bring a significant acceleration in the number of systems built on DataFusion, and my focus this year is to help drive that growth.&lt;/p&gt;

&lt;h2 id="the-journey-from-0-to-1000-projects"&gt;&lt;strong&gt;The journey from 0 to 1,000 projects&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;Two years ago, when introducing DataFusion to VCs and early collaborators, I had an ambitious goal: 1,000 projects powered by DataFusion. That number was aspirational—bold enough to challenge but grounded enough to feel achievable. I think we may hit that goal in 2025.&lt;/p&gt;

&lt;p&gt;DataFusion achieved several key milestones in 2024 as it matured from a promising technology to a building block for highly performant systems:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion"&gt;Elevated to a Top-Level Project within the Apache Software Foundation (ASF)&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Hosted the first in-person meetup in &lt;a href="https://github.com/apache/datafusion/discussions/8152"&gt;Austin, Texas&lt;/a&gt;, followed by others in San Francisco, Seattle, Belgrade, and more.&lt;/li&gt;
&lt;li&gt;Published a research paper at &lt;a href="https://2024.sigmod.org/"&gt;ACM SIGMOD 2024&lt;/a&gt;, one of the world’s leading database conferences.&lt;/li&gt;
  &lt;li&gt;Gained adoption by a growing number of &lt;a href="https://datafusion.apache.org/user-guide/introduction.html#known-users"&gt;database products and companies&lt;/a&gt;&lt;sup&gt;[1]&lt;/sup&gt;, with increased &lt;a href="https://datafusion.apache.org/user-guide/concepts-readings-events.html"&gt;media attention&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The year closed with a major breakthrough: DataFusion 43.0.0 became the &lt;a href="https://www.influxdata.com/blog/apache-datafusion-fastest-single-node-querying-engine/?utm_source=website&amp;amp;utm_medium=direct&amp;amp;utm_campaign=datafusion_2025_influxdb&amp;amp;utm_content=blog"&gt;fastest engine for querying Apache Parquet files in ClickBench&lt;/a&gt;, marking the first time a Rust-based engine surpassed traditional C/C++ engines.&lt;/p&gt;

&lt;p&gt;These milestones didn’t happen by chance—they are the result of eight years of relentless development from hundreds of individuals and countless engineering hours. Figure 1 shows my subjective appraisal of DataFusion’s timeline and my prediction of its acceleration over the next few years:
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/3s8NlVGxEIK0eVS04MhX8m/773d7eb0c10f6535c45985fab98716b3/02-a.png" alt="Figure 1" /&gt;
&lt;strong&gt;Figure 1:&lt;/strong&gt; Major milestones in the DataFusion project lifetime and my estimates of project adoption. I predict 2025 will be very exciting.&lt;/p&gt;

&lt;h2 id="early-adopters-including-influxdb-3"&gt;&lt;strong&gt;2020-2023 early adopters, including InfluxDB 3&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;InfluxData &lt;a href="https://www.influxdata.com/blog/apache-arrow-parquet-flight-and-their-ecosystem-are-a-game-changer-for-olap/?utm_source=website&amp;amp;utm_medium=direct&amp;amp;utm_campaign=datafusion_2025_influxdb&amp;amp;utm_content=blog"&gt;recognized DataFusion’s potential early on&lt;/a&gt; and bet on it for the rebuild of InfluxDB in Rust, along with the rest of the &lt;a href="https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/?utm_source=website&amp;amp;utm_medium=direct&amp;amp;utm_campaign=datafusion_2025_influxdb&amp;amp;utm_content=blog"&gt;FDAP stack&lt;/a&gt;—Apache Arrow Flight, Apache DataFusion, Apache Arrow, and Apache Parquet—all ASF technologies. At the time, DataFusion was still in its infancy, developed primarily by its creator, Andy Grove, during his spare time.&lt;/p&gt;

&lt;p&gt;Creating a high-performance time series engine using well-known columnar and vectorization techniques was central to the InfluxDB 3 design. Such an engine requires significant knowledge and investment and had previously been available only to a small number of companies and elite research institutions. We believed that the combination of being written in Rust, an ASF project, and part of the Arrow ecosystem would attract other users to DataFusion, who would both benefit and help provide the engineering needed. That bet has paid off, with over 94 individuals contributing to the &lt;a href="https://github.com/apache/datafusion/blob/main/dev/changelog/44.0.0.md"&gt;most recent release&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;InfluxData wasn’t alone in recognizing DataFusion’s potential. Companies like &lt;a href="https://coralogix.com/"&gt;Coralogix&lt;/a&gt;, &lt;a href="https://greptime.com/"&gt;Greptime&lt;/a&gt;, and &lt;a href="https://www.synnada.ai/"&gt;Synnada&lt;/a&gt; also embraced DataFusion, betting that building on its foundation and contributing to its development would allow them to deliver better products more quickly and cost-effectively than doing it entirely by themselves.&lt;/p&gt;

&lt;p&gt;This collective investment helped grow DataFusion and its community while delivering tangible benefits to early adopters. While the journey came with challenges, the returns have been undeniably high.&lt;/p&gt;

&lt;p&gt;Today, in InfluxDB 3, every aspect of data processing flows through a DataFusion plan after &lt;a href="https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/?utm_source=website&amp;amp;utm_medium=direct&amp;amp;utm_campaign=datafusion_2025_influxdb&amp;amp;utm_content=blog"&gt;Line Protocol&lt;/a&gt; parsing. This includes writing and compacting Apache Parquet files and executing SQL, InfluxQL, and Flux queries. Our multi-tenant production systems alone execute 10s of millions of DataFusion plans daily. Improvements from the broader DataFusion community flow directly into InfluxDB 3, with &lt;a href="https://github.com/apache/datafusion/issues/10826"&gt;many&lt;/a&gt; &lt;a href="https://github.com/apache/datafusion/issues/10258"&gt;of&lt;/a&gt; &lt;a href="https://github.com/apache/datafusion/issues/11408"&gt;our&lt;/a&gt; &lt;a href="https://github.com/apache/arrow-rs/issues/6206"&gt;bug&lt;/a&gt; &lt;a href="https://github.com/apache/datafusion/issues/11032"&gt;reports&lt;/a&gt; or &lt;a href="https://github.com/apache/datafusion/pull/12400#pullrequestreview-2297246575"&gt;SQL feature requests&lt;/a&gt; from customers resolved upstream by other contributors—requiring only a version upgrade.&lt;/p&gt;

&lt;h2 id="gaining-momentum"&gt;&lt;strong&gt;2023-2025: gaining momentum&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;Major companies with dedicated engineering teams are now building and deploying DataFusion-based systems across diverse contexts while contributing back to the project. This virtuous cycle has driven rapid innovation in performance and features, with adoption still in its early stages. The past two years have been a turning point, with engineers from leading tech companies such as Apple, eBay, Kuaishou, Airbnb, TikTok, Huawei, and Alibaba contributing significantly to DataFusion.&lt;/p&gt;

&lt;p&gt;A key milestone came last year when Apple developers built a replacement for Spark query execution using DataFusion, which they &lt;a href="https://arrow.apache.org/blog/2024/03/06/comet-donation/"&gt;donated to ASF&lt;/a&gt; and is now developed as &lt;a href="https://datafusion.apache.org/comet/"&gt;Apache DataFusion Comet&lt;/a&gt;. This not only demonstrated Apple’s confidence in DataFusion but also inspired additional contributions from the broader open source community, accelerating its growth.&lt;/p&gt;

&lt;h3 id="integration-into-the-open-data-lake"&gt;&lt;strong&gt;Integration into the Open Data Lake&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;In 2025, adoption of DataFusion is set to surge as the industry embraces &lt;a href="https://sympathetic.ink/2024/11/07/The-Advent-Of-The-Open-Data-Lake.html"&gt;Open Data Lake&lt;/a&gt; architectures. The data landscape is evolving into a constellation of specialized processing systems, each tailored for unique use cases, as illustrated in Figure 2. 
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/Eiywu2wyKPhGAm40IMeDg/0b9f3d30ba94b64d7634119369043e2c/01-a.png" alt="Figure 2" /&gt;
&lt;strong&gt;Figure 2&lt;/strong&gt;: Next-generation analytics: a constellation of different tools with a shared storage layer based on the open Apache Parquet and Apache Iceberg formats stored on Object Storage such as AWS S3, &lt;a href="https://cloud.google.com/storage"&gt;GCP Cloud Storage&lt;/a&gt;, and &lt;a href="https://azure.microsoft.com/en-us/products/storage/blobs"&gt;Azure Blob Storage&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These systems will share the same underlying data stored in the &lt;a href="https://www.influxdata.com/glossary/apache-parquet/?utm_source=website&amp;amp;utm_medium=direct&amp;amp;utm_campaign=datafusion_2025_influxdb&amp;amp;utm_content=blog"&gt;Apache Parquet&lt;/a&gt; open format, organized by &lt;a href="https://iceberg.apache.org/"&gt;Apache Iceberg&lt;/a&gt;, and tailored to different use cases. Achieving high performance in this architecture requires advanced, vectorized analytic technology—an area where DataFusion excels due to its &lt;a href="https://www.apache.org/licenses/LICENSE-2.0"&gt;permissive licensing&lt;/a&gt;, &lt;a href="https://docs.rs/datafusion/latest/datafusion/index.html#design-goals"&gt;extensible design&lt;/a&gt;, and &lt;a href="https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/#:~:text=RSS-,Apache%20DataFusion%20is%20now%20the%20fastest%20single,for%20querying%20Apache%20Parquet%20files&amp;amp;text=I%20am%20extremely%20excited%20to,Clickhouse%20using%20the%20same%20hardware"&gt;exceptional Parquet performance&lt;/a&gt;. The Rust-based implementations of &lt;a href="https://github.com/delta-io/delta-rs"&gt;Delta Lake&lt;/a&gt;, &lt;a href="https://github.com/apache/iceberg-rust"&gt;Apache Iceberg&lt;/a&gt;, and &lt;a href="https://github.com/apache/hudi-rs"&gt;Apache Hudi&lt;/a&gt;, all built using DataFusion, highlight its central role in the shift toward open, modular data architectures.&lt;/p&gt;

&lt;p&gt;To support this proliferation, I expect significant additional investment from the DataFusion community to improve the technological underpinnings of querying in this new architecture. Efforts include &lt;a href="https://github.com/apache/datafusion/issues/13456"&gt;simplifying and accelerating remote file queries&lt;/a&gt; and exploring &lt;a href="https://blog.haoxp.xyz/posts/caching-datafusion/"&gt;advanced caching strategies&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="streamlining-adoption-for-downstream-users"&gt;&lt;strong&gt;Streamlining Adoption for Downstream Users&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;Another major theme for investment in 2025 will be reducing friction for downstream users when adopting new versions of DataFusion. Recent efforts in DataFusion to complete projects such as &lt;a href="https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/"&gt;StringView&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion/issues/8709"&gt;Window Function Migration&lt;/a&gt; solidified its foundation, but the velocity of changes also caused &lt;a href="https://github.com/apache/datafusion/issues/13525"&gt;challenges downstream for some upgrades&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As the ecosystem grows, ensuring the smooth adoption of updates becomes increasingly critical. We are &lt;a href="https://github.com/apache/datafusion/issues/13648"&gt;discussing ways to improve this process&lt;/a&gt; as well as clarifying the &lt;a href="https://github.com/apache/datafusion/issues/12357"&gt;criteria for adding new features and what belongs in core DataFusion&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By balancing innovation with stability, the DataFusion community aims to maintain its rapid velocity of improvements while making it easier for users and contributors to keep pace.&lt;/p&gt;

&lt;h3 id="next-level-quality-bashing-pesky-bugs"&gt;&lt;strong&gt;Next Level Quality: Bashing Pesky Bugs&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;As DataFusion matures, users tend to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Expect more in terms of the breadth and depth of functionality (e.g., SQL and type support)&lt;/li&gt;
  &lt;li&gt;Run increasingly complicated queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These trends naturally expose feature gaps and bugs. For example, given that InfluxDB 3 executes tens of millions of DataFusion plans per day on InfluxData production systems, we find occasional and increasingly &lt;a href="https://github.com/apache/datafusion/issues/13748"&gt;esoteric issues&lt;/a&gt; that we report and help fix.&lt;/p&gt;

&lt;p&gt;This “hardening” phase is a natural step for any successful software on its path to maturity and widespread adoption. While fixing these bugs can be tedious, it is a straightforward task requiring focused engineering effort. I am confident in our community’s ability to drive up the quality level.&lt;/p&gt;

&lt;p&gt;DataFusion already benefits from &lt;a href="https://datafusion.apache.org/contributor-guide/testing.html"&gt;extensive test coverage&lt;/a&gt;, and I predict we will see additional focus on automated industrial testing. Examples include &lt;a href="https://www.linkedin.com/in/bruceritchie"&gt;Bruce Ritchie’s&lt;/a&gt; work on &lt;a href="https://github.com/apache/datafusion/issues/13811"&gt;running DataFusion on the SQLite test corpus&lt;/a&gt; and &lt;a href="https://github.com/2010YOUY01"&gt;Yongting You’s&lt;/a&gt; efforts to &lt;a href="https://github.com/apache/datafusion/issues/11030"&gt;run SQLancer&lt;/a&gt; on DataFusion. InfluxData plans to contribute significantly to this area as well, and I hope other companies using DataFusion will do the same.&lt;/p&gt;

&lt;h3 id="pushing-the-limits-of-performance"&gt;&lt;strong&gt;Pushing the Limits of Performance&lt;/strong&gt;&lt;/h3&gt;

&lt;p&gt;One of DataFusion’s core principles is world-class performance: applications built on DataFusion can focus mostly on their specific features while taking advantage of DataFusion’s performance (&lt;a href="https://llvm.org/"&gt;LLVM&lt;/a&gt; is my favorite, though admittedly geeky, analogy).&lt;/p&gt;

&lt;p&gt;DataFusion has already optimized most of the “low-hanging fruit,” so continued performance improvements require careful and focused engineering. We continue to see performance projects such as &lt;a href="https://github.com/apache/datafusion/issues/12680"&gt;vectorized group keys&lt;/a&gt; and &lt;a href="https://github.com/apache/datafusion/pull/12978"&gt;improved pruning&lt;/a&gt;, but the quality bar keeps rising. We will need ongoing help from the community to find, implement, evaluate, and verify these improvements.&lt;/p&gt;

&lt;p&gt;I am particularly excited about the possibility of working with academic groups—there is a wealth of talent and focused time for low-level performance optimization among PhD students. Additional collaboration can accelerate the adoption of students’ work into real-world systems and make DataFusion faster, and I am excited to help make it happen.&lt;/p&gt;

&lt;h2 id="the-year-ahead"&gt;&lt;strong&gt;The year ahead&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;2025 will be very exciting as more DataFusion-based systems hit the market, solidifying its place as a foundational building block for analytic and data platforms. The future of the data stack is composable, and DataFusion will be one key component. While challenges are inevitable, the community (and I) will focus on driving it forward as fast as possible while maintaining a stable foundation, leading to a thriving ecosystem.&lt;/p&gt;

&lt;p&gt;I’ll close with my usual appeal (aka 🎣 attempt): DataFusion is an open source project driven by &lt;a href="https://datafusion.apache.org/contributor-guide/index.html#open-contribution-and-assigning-tickets"&gt;open contributions&lt;/a&gt;. We welcome and encourage &lt;em&gt;contributions from everyone&lt;/em&gt;. Review capacity remains our most limited but most impactful resource, and I encourage companies and individuals to dedicate time to reviewing code, testing proposals, and helping maintain the project.&lt;/p&gt;

&lt;p&gt;Finally, I want to express my gratitude to InfluxData. It was InfluxData’s vision and early recognition of DataFusion’s potential that introduced me to the project and supported my contributions over the past 4.5 years. This has allowed me to engage deeply—reviewing &lt;a href="https://github.com/alamb"&gt;countless PRs&lt;/a&gt;, contributing features (both directly and indirectly related to InfluxDB 3), writing &lt;a href="https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/?utm_source=website&amp;amp;utm_medium=direct&amp;amp;utm_campaign=datafusion_2025_influxdb&amp;amp;utm_content=blog"&gt;many&lt;/a&gt; &lt;a href="https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion/?utm_source=website&amp;amp;utm_medium=direct&amp;amp;utm_campaign=datafusion_2025_influxdb&amp;amp;utm_content=blog"&gt;blog&lt;/a&gt; &lt;a href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/?utm_source=website&amp;amp;utm_medium=direct&amp;amp;utm_campaign=datafusion_2025_influxdb&amp;amp;utm_content=blog"&gt;posts&lt;/a&gt;, traveling for meetups, and supporting my role as the project’s PMC Chair.&lt;/p&gt;

&lt;p&gt;2025 will be a pivotal year for DataFusion, and I look forward to seeing the innovation this community will drive.&lt;/p&gt;

&lt;p&gt;&lt;sup&gt;[1]&lt;/sup&gt; The numbers on those lists seem modest, but they only include people who have written publicly about their use. I know of many internal projects/data systems not listed that also use DataFusion.&lt;/p&gt;

&lt;p&gt;&lt;sup&gt;[2]&lt;/sup&gt; I am a database internals developer, after all. How cool is that!!&lt;/p&gt;
</description>
      <pubDate>Wed, 08 Jan 2025 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/datafusion-2025-influxdb/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/datafusion-2025-influxdb/</guid>
      <category>Developer</category>
      <author>Andrew Lamb (InfluxData)</author>
    </item>
    <item>
      <title>​Apache DataFusion Meetup: Chicago December 2024 Recap</title>
      <description>&lt;p&gt;This past week, I attended and spoke at the &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; &lt;a href="https://lu.ma/eq5myc5i"&gt;Meetup&lt;/a&gt; in Chicago, Illinois. Inspired by &lt;a href="https://www.linkedin.com/in/samican-tandogdu"&gt;Sami Tandogdu&lt;/a&gt;&lt;a href="https://www.linkedin.com/in/samican-tandogdu"&gt;’s&lt;/a&gt; (Synnada) great &lt;a href="https://github.com/apache/datafusion/discussions/11431#discussioncomment-10832070"&gt;recap of the DataFusion Belgrade meetup&lt;/a&gt;, I figured I would try it myself.&lt;/p&gt;

&lt;p&gt;First of all, huge thanks to &lt;a href="https://1871.com"&gt;1871&lt;/a&gt;, &lt;a href="https://pydantic.dev/"&gt;Pydantic&lt;/a&gt;, and (of course) InfluxData for sponsoring the event; to Adrian who did much of the work organizing; and to Xiangpeng and Adrian for some of these pictures. 
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/7f45b4796dad4b44ad936157828123c3/1a6663e18426bcb3e89c49c97805c163/unnamed.jpg" alt="" /&gt;
Around 25 DataFusion enthusiasts attended, learned from talks hosted by project contributors, and discussed ideas for the future. The meetup felt somewhat unique as almost all attendees were using DataFusion in their products or projects. This led to some great discussions and a visceral feeling that the adoption of DataFusion is increasing. Below is a summary of the four featured talks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“Building a Real-Time Data Lake with DataFusion”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/adrian-garcia-badaracco/"&gt;Adrian Garcia Badaracco&lt;/a&gt; - Founding Engineer, Pydantic 
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/4f0360abb3a649ac91143d610307c4d6/4e027dc0cff91b21ff7506a1652d4b9d/unnamed.jpg" alt="" /&gt;
First up was Adrian, a founding engineer at Pydantic. His team is building the database for &lt;a href="https://pydantic.dev/logfire"&gt;pydantic LogFire&lt;/a&gt;, an observability platform. Adrian gave an overview of how Pydantic uses DataFusion to build a near real-time data lake for observability data and some details of their indexing and metadata store. &lt;a href="https://youtu.be/XxEtOf-MzNA?si=CHSqzeRr1Sh7ZpD-"&gt;VIDEO&lt;/a&gt; / &lt;a href="https://docs.google.com/presentation/d/1BWuVyGzF1_iQRcmZDfoDg4V5kVlefw1fDGRt7TqxYsU/edit"&gt;SLIDES&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;”Practical Data Science in Robotics Using DataFusion”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/timsaucer/"&gt;Tim Saucer&lt;/a&gt; - Director of Simulation &amp;amp; Infrastructure, &lt;a href="https://maymobility.com/?"&gt;May Mobility&lt;/a&gt;
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/9da08785e3b44833a24ba2844db288fa/9baf84dce90db239166d03a948674cc8/unnamed.jpg" alt="" /&gt;
Next up was Tim Saucer, a contributor and &lt;a href="https://datafusion.apache.org/contributor-guide/governance.html#roles"&gt;committer&lt;/a&gt; on DataFusion whose work focuses on the Python bindings. Tim spoke about data science in robotics and how DataFusion can be used to address some of the challenges particular to that field. &lt;a href="https://youtu.be/CsqWOxFWK9w?si=AsCUDxDWbKtBoA_F"&gt;VIDEO&lt;/a&gt; / &lt;a href="https://timsaucer.com/data/Practical_Robotics_with_DataFusion.pdf"&gt;SLIDES&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“Practical Disaggregated Cache for DataFusion”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Xiangpeng Hao (&lt;a href="https://github.com/XiangpengHao"&gt;@XiangpengHao&lt;/a&gt;) - PhD Student, UW Madison
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/39cb285b68c241309fd2ee001fcf943b/8862112aaf377e0bb390f0dd18acdede/unnamed.jpg" alt="" /&gt;
The next speaker was Xiangpeng Hao, a fourth-year PhD student at the University of Wisconsin-Madison, studying and building database and storage systems. He spoke about his work building SplitSQL, a disaggregated cache for modern data analytics, also built on DataFusion. A former InfluxData intern, he contributed heavily in that role to the &lt;a href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/"&gt;StringView integration in Apache DataFusion&lt;/a&gt; and to Parquet metadata handling. &lt;a href="https://youtu.be/woqXLE25gMc?si=1Irx1i70qjmujBj3"&gt;VIDEO&lt;/a&gt; / &lt;a href="https://github.com/user-attachments/files/18214371/datafusion-cache.pdf"&gt;SLIDES&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“Building InfluxDB 3 with the FDAP Stack”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Andrew Lamb (&lt;a href="https://github.com/alamb"&gt;@alamb&lt;/a&gt;) - Staff Engineer, DataFusion, PMC chair, InfluxData 
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2f22b33d182343e4beec6e791c91a81b/56994a92242832876051db844c978587/unnamed.jpg" alt="" /&gt;
Finally, it was my turn to speak about the rationale for why and how we built InfluxDB 3 using the FDAP stack, with a focus on the DataFusion aspects. Sorry for the somewhat goofy picture and the fact I forgot to turn on the microphone for the recording. &lt;a href="https://youtu.be/-1D25zAfB00?si=QY5EB5x9kKXGDVus"&gt;VIDEO&lt;/a&gt; (no sound 🤦 ) / &lt;a href="https://docs.google.com/presentation/d/1AiI5r8LtnbDkceBW1NSlqBmD07xxq1JJWQFkRZ35Ef8/edit#slide=id.g2e79518841d_0_95"&gt;SLIDES&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to the speakers, it was great to meet &lt;a href="https://www.linkedin.com/in/alexwilcoxson/"&gt;Alex Wilcoxson&lt;/a&gt;, Michael Maletich, and others from &lt;a href="https://www.relativity.com/"&gt;Relativity Software&lt;/a&gt;, who are building a document discovery platform using DataFusion, as well as &lt;a href="https://www.linkedin.com/in/michael-ward-5377859/"&gt;Michael Ward&lt;/a&gt; of DataFusion-Python fame. Also present were &lt;a href="https://www.linkedin.com/in/camuel/"&gt;Camuel Gilyadov&lt;/a&gt; and &lt;a href="https://www.linkedin.com/in/sergei-turukin/"&gt;Sergei Turukin&lt;/a&gt; from &lt;a href="https://www.embucket.com/"&gt;Embucket&lt;/a&gt;, who are working on a new DataFusion-powered project, and &lt;a href="https://www.linkedin.com/in/devan-benz-b03a8894/"&gt;Devan Benz&lt;/a&gt;, a fellow Influxer working on database internals. After lunch, we had some informal conversations about topics such as the future of the project, building secondary indexes, performance, and the DataFusion-Python roadmap.&lt;/p&gt;

&lt;p&gt;While running around meeting other users is somewhat exhausting, I think it is important during this stage of the project’s growth. As its adoption takes off, building a community that can sustain the project over the long term is more important than ever, and I am very excited, as always, to be a part of that.&lt;/p&gt;
</description>
      <pubDate>Mon, 06 Jan 2025 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/apache-datafusion-meetup-recap-2024-influxdb/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/apache-datafusion-meetup-recap-2024-influxdb/</guid>
      <category>Developer</category>
      <author>Andrew Lamb (InfluxData)</author>
    </item>
    <item>
      <title>Apache DataFusion is Now the Fastest Single Node Engine for Querying Apache Parquet Files</title>
      <description>&lt;p&gt;&lt;em&gt;This blog was originally published on &lt;a href="https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/"&gt;Apache DataFusion Project News &amp;amp; Blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I am extremely excited to announce that &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; &lt;a href="https://crates.io/crates/datafusion"&gt;43.0.0&lt;/a&gt; is the fastest engine for querying Apache Parquet files in &lt;a href="https://benchmark.clickhouse.com/"&gt;ClickBench&lt;/a&gt;. It is faster than both &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; and &lt;a href="https://clickhouse.com/chdb"&gt;chDB&lt;/a&gt;/&lt;a href="https://clickhouse.com/"&gt;ClickHouse&lt;/a&gt; using the same hardware. It also marks the first time a &lt;a href="https://www.rust-lang.org/"&gt;Rust&lt;/a&gt;-based engine holds the top spot, a spot previously held by traditional C/C++-based engines.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/36c1a65a1a1f4db0ba64ba756d653290/adac1cbf6a10ed51a35403030af030d0/unnamed.png" alt="" /&gt;
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/c00c85a41fa04466b4a9c25f6b6b3e2a/fc899e5c6e5afb32bd1ed58c60022076/unnamed.png" alt="" /&gt;
&lt;strong&gt;Figure 1&lt;/strong&gt;: 2024-11-16 &lt;a href="https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQiI6ZmFsc2UsIkFsbG95REIgKHR1bmVkKSI6ZmFsc2UsIkF0aGVuYSAocGFydGl0aW9uZWQpIjpmYWxzZSwiQXRoZW5hIChzaW5nbGUpIjpmYWxzZSwiQXVyb3JhIGZvciBNeVNRTCI6ZmFsc2UsIkF1cm9yYSBmb3IgUG9zdGdyZVNRTCI6ZmFsc2UsIkJ5Q29uaXR5IjpmYWxzZSwiQnl0ZUhvdXNlIjpmYWxzZSwiY2hEQiAoRGF0YUZyYW1lKSI6ZmFsc2UsImNoREIgKFBhcnF1ZXQsIHBhcnRpdGlvbmVkKSI6dHJ1ZSwiY2hEQiI6ZmFsc2UsIkNpdHVzIjpmYWxzZSwiQ2xpY2tIb3VzZSBDbG91ZCAoYXdzKSI6ZmFsc2UsIkNsaWNrSG91c2UgQ2xvdWQgKGF6dXJlKSI6ZmFsc2UsIkNsaWNrSG91c2UgQ2xvdWQgKGdjcCkiOmZhbHNlLCJDbGlja0hvdXNlIChkYXRhIGxha2UsIHBhcnRpdGlvbmVkKSI6ZmFsc2UsIkNsaWNrSG91c2UgKGRhdGEgbGFrZSwgc2luZ2xlKSI6ZmFsc2UsIkNsaWNrSG91c2UgKFBhcnF1ZXQsIHBhcnRpdGlvbmVkKSI6dHJ1ZSwiQ2xpY2tIb3VzZSAoUGFycXVldCwgc2luZ2xlKSI6ZmFsc2UsIkNsaWNrSG91c2UgKHdlYikiOmZhbHNlLCJDbGlja0hvdXNlIjpmYWxzZSwiQ2xpY2tIb3VzZSAodHVuZWQpIjpmYWxzZSwiQ2xpY2tIb3VzZSAodHVuZWQsIG1lbW9yeSkiOmZhbHNlLCJDbG91ZGJlcnJ5IjpmYWxzZSwiQ3JhdGVEQiI6ZmFsc2UsIkNydW5jaHkgQnJpZGdlIGZvciBBbmFseXRpY3MgKFBhcnF1ZXQpIjpmYWxzZSwiRGF0YWJlbmQiOmZhbHNlLCJEYXRhRnVzaW9uIChQYXJxdWV0LCBwYXJ0aXRpb25lZCkiOnRydWUsIkRhdGFGdXNpb24gKFBhcnF1ZXQsIHNpbmdsZSkiOmZhbHNlLCJBcGFjaGUgRG9yaXMiOmZhbHNlLCJEcnVpZCI6ZmFsc2UsIkR1Y2tEQiAoRGF0YUZyYW1lKSI6ZmFsc2UsIkR1Y2tEQiAoUGFycXVldCwgcGFydGl0aW9uZWQpIjp0cnVlLCJEdWNrREIiOmZhbHNlLCJFbGFzdGljc2VhcmNoIjpmYWxzZSwiRWxhc3RpY3NlYXJjaCAodHVuZWQpIjpmYWxzZSwiR2xhcmVEQiI6ZmFsc2UsIkdyZWVucGx1bSI6ZmFsc2UsIkhlYXZ5QUkiOmZhbHNlLCJIeWRyYSI6ZmFsc2UsIkluZm9icmlnaHQiOmZhbHNlLCJLaW5ldGljYSI6ZmFsc2UsIk1hcmlhREIgQ29sdW1uU3RvcmUiOmZhbHNlLCJNYXJpYURCIjpmYWxzZSwiTW9uZXREQiI6ZmFsc2UsIk1vbmdvREIiOmZhbHNlLCJNb3RoZXJEdWNrIjpmYWxzZSwiTXlTUUwgKE15SVNBTSkiOmZhbHNlLCJNeVNRTCI6ZmFsc2UsIk94bGEiOmZhbHNlLCJQYW5kYXMgKERhdGFGcmFtZSkiOmZhbHNlLCJQYXJhZGVEQiAoUGFycXVldCwgcGFydGl0aW9uZWQpIjp0cnVlLCJQYXJhZGVEQiAoUGFycXVldCwgc2luZ2xlKSI6ZmFsc2UsIlBpbm90IjpmYWxzZSwiUG9sYXJzIChEYXRhRnJhbWUpIjpmYWxzZSwiUG9zdGdyZVNRTCAodHVuZWQpIjpmYWxzZSwiUG9zdGdyZVNRTCI6ZmFsc2UsIlF1ZXN0REIgKHBhcnRpdGlvbmVkKSI6ZmFsc2UsIlF1ZXN0REIiOmZhbHNlLCJSZWRzaGlmdCI6ZmFsc2UsIlNpbmdsZVN0b3JlIjpmYWxzZSwiU25vd2ZsYWtlIjpmYWxzZSwiU1FMaXRlIjpmYWxzZSwiU3RhclJvY2tzIjpmYWxzZSwiVGFibGVzcGFjZSI6ZmFsc2UsIlRlbWJvIE9MQVAgKGNvbHVtbmFyKSI6ZmFsc2UsIlRpbWVzY2FsZURCIChubyBjb2x1bW5zdG9yZSkiOmZhbHNlLCJUaW1lc2NhbGVEQiI6ZmFsc2UsIlRpbnliaXJkIChGcmVlIFRyaWFsKSI6ZmFsc2UsIlVtYnJhIjpmYWxzZX0sInR5cGUiOnsiQyI6dHJ1ZSwiY29sdW1uLW9yaWVudGVkIjp0cnVlLCJQb3N0Z3JlU1FMIGNvbXBhdGlibGUiOnRydWUsIm1hbmFnZWQiOnRydWUsImdjcCI6dHJ1ZSwic3RhdGVsZXNzIjp0cnVlLCJKYXZhIjp0cnVlLCJDKysiOnRydWUsIk15U1FMIGNvbXBhdGlibGUiOnRydWUsInJvdy1vcmllbnRlZCI6dHJ1ZSwiQ2xpY2tIb3VzZSBkZXJpdmF0aXZlIjp0cnVlLCJlbWJlZGRlZCI6dHJ1ZSwic2VydmVybGVzcyI6dHJ1ZSwiZGF0YWZyYW1lIjp0cnVlLCJhd3MiOnRydWUsImF6dXJlIjp0cnVlLCJhbmFseXRpY2FsIjp0cnVlLCJSdXN0Ijp0cnVlLCJzZWFyY2giOnRydWUsImRvY3VtZW50Ijp0cnVlLCJzb21ld2hhdCBQb3N0Z3JlU1FMIGNvbXBhdGlibGUiOnRydWUsInRpbWUtc2VyaWVzIjp0cnVlfSwibWFjaGluZSI6eyIxNiB2Q1BVIDEyOEdCIjp0cnVlLCI4IHZDUFUgNjRHQiI6dHJ1ZSwic2VydmVybGVzcyI6dHJ1ZSwiMTZhY3UiOnRydWUsImM2YS40eGxhcmdlLCA1MDBnYiBncDIiOnRydWUsIkwiOnRydWUsIk0iOnRydWUsIlMiOnRydWUsIlhTIjp0cnVlLCJjNmEubWV0YWwsIDUwMGdiIGdwMiI6ZmFsc2UsIjE5MkdCIjp0cnVlLCIyNEdCIjp0cnVlLCIzNjBHQiI6dHJ1ZSwiNDhHQiI6dHJ1ZSwiNzIwR0IiOnRydWUsIjk2R0IiOnRydWUsImRldiI6dHJ1ZSwiNzA4R0IiOnRydWUsImM1bi40eGxhcmdlLCA1MDBnYiBncDIiOnRydWUsIkFuYWx5dGljcy0yNTZHQiAoNjQgdkNvcmVzLCAyNTYgR0IpIjp0cnVlLCJjNS40eGxhcmdlLCA1MDBnYiBncDIiOnRydWUsImM2YS40eGxhcmdlLCAxNTAwZ2IgZ3AyIjp0cnVlLCJjbG91ZCI6dHJ1ZSwiZGMyLjh4bGFyZ2UiOnRydWUsInJhMy4xNnhsYXJnZSI6dHJ1ZSwicmEzLjR4bGFyZ2UiOnRydWUsInJhMy54bHBsdXMiOnRydWUsIlMyIjp0cnVlLCJTMjQiOnRydWUsIjJYTCI6dHJ1ZSwiM1hMIjp0cnVlLCI0WEwiOnRydWUsIlhMIjp0cnVlLCJMMSAtIDE2Q1BVIDMyR0IiOnRydWUsImM2YS40eGxhcmdlLCA1MDBnYiBncDMiOnRydWV9LCJjbHVzdGVyX3NpemUiOnsiMSI6dHJ1ZSwiMiI6dHJ1ZSwiNCI6dHJ1ZSwiOCI6dHJ1ZSwiMTYiOnRydWUsIjMyIjp0cnVlLCI2NCI6dHJ1ZSwiMTI4Ijp0cnVlLCJzZXJ2ZXJsZXNzIjp0cnVlfSwibWV0cmljIjoiaG90IiwicXVlcmllcyI6W3RydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWVdfQ=="&gt;ClickBench Results&lt;/a&gt; for the ‘hot’&lt;sup&gt;1&lt;/sup&gt; run against the partitioned 14 GB Parquet dataset (100 files, each ~140MB) on a &lt;code class="language-markup"&gt;c6a.4xlarge&lt;/code&gt; (16 CPU/32 GB RAM) machine. Measurements are relative (&lt;code class="language-markup"&gt;1.x&lt;/code&gt;) to results using different hardware.
&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Best-in-class performance on Parquet is now available to anyone. DataFusion’s open design lets you start quickly with a full-featured query engine, including SQL, data formats, catalogs, and more, and then customize any behavior you need. I predict the continued emergence of new classes of data systems now that creators can focus the bulk of their innovation on areas such as query languages, system integrations, and data formats rather than trying to play catch-up with core engine performance.&lt;/p&gt;
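
&lt;p&gt;To make the “start quickly” claim concrete, here is a minimal sketch, assuming the &lt;code class="language-markup"&gt;datafusion&lt;/code&gt; and &lt;code class="language-markup"&gt;tokio&lt;/code&gt; crates (the file name and query are illustrative), that runs SQL over a Parquet file:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-rust"&gt;use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::main]
async fn main() -&amp;gt; Result&amp;lt;()&amp;gt; {
    // A SessionContext provides the full engine: SQL parser, catalogs,
    // optimizers, and the vectorized execution runtime
    let ctx = SessionContext::new();

    // Register a local Parquet file as a table (the path is illustrative)
    ctx.register_parquet("hits", "hits.parquet", ParquetReadOptions::default())
        .await?;

    // Run SQL and print the result batches
    let df = ctx.sql("SELECT COUNT(*) FROM hits").await?;
    df.show().await?;
    Ok(())
}&lt;/code&gt;&lt;/pre&gt;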

&lt;p&gt;ClickBench also includes results for proprietary storage formats, which require costly load/export steps, making them useful in fewer use cases and thus much less important than open formats (though the idea of use case specific formats is interesting&lt;sup&gt;2&lt;/sup&gt;).&lt;/p&gt;

&lt;p&gt;This blog post highlights some of the techniques we used to achieve this performance, and celebrates the teamwork involved.&lt;/p&gt;

&lt;h2 id="a-strong-history-of-performance-improvements"&gt;A strong history of performance improvements&lt;/h2&gt;

&lt;p&gt;Performance has long been a core focus for DataFusion’s community, and speed attracts users and contributors. Recently, the community has become even more focused on performance, including in July 2024, when &lt;a href="https://www.linkedin.com/in/mehmet-ozan-kabak/"&gt;Mehmet Ozan Kabak&lt;/a&gt;, CEO of &lt;a href="https://www.synnada.ai/"&gt;Synnada&lt;/a&gt;, again &lt;a href="https://github.com/apache/datafusion/issues/11442#issuecomment-2226834443"&gt;suggested focusing on performance&lt;/a&gt;. This got many of us excited (who doesn’t love a challenge!), and we have subsequently rallied to improve performance steadily, release after release, as shown in Figure 2.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/1b983da404424052b17a8b47291ab744/095e2c317a228d6f5f8e0ce99836df3e/unnamed.png" alt="" /&gt;
&lt;strong&gt;Figure 2&lt;/strong&gt;: ClickBench performance improved over 30% between DataFusion 34 (released Dec 2023) and DataFusion 43 (released Nov 2024). 
&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Like all good optimization efforts, ours took sustained effort, as DataFusion ran out of &lt;a href="https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion/"&gt;single 2x performance improvements&lt;/a&gt; several years ago. Working together, our community of engineers from around the world&lt;sup&gt;3&lt;/sup&gt; and across all experience levels&lt;sup&gt;4&lt;/sup&gt; pulled it off (check out &lt;a href="https://github.com/apache/datafusion/issues/12821"&gt;this discussion&lt;/a&gt; to get a sense). It may be a &lt;a href="https://db.cs.cmu.edu/seminar2024/"&gt;“hobo sandwich”&lt;/a&gt;&lt;sup&gt;5&lt;/sup&gt;, but it is a tasty one!&lt;/p&gt;

&lt;p&gt;Of course, most of these techniques have been implemented and described before, but until now they were only available in proprietary systems such as &lt;a href="https://www.vertica.com/"&gt;Vertica&lt;/a&gt;, &lt;a href="https://www.databricks.com/product/photon"&gt;Databricks Photon&lt;/a&gt;, or &lt;a href="https://www.snowflake.com/en/"&gt;Snowflake&lt;/a&gt;, or in tightly integrated open source systems such as &lt;a href="https://duckdb.org/"&gt;DuckDB&lt;/a&gt; or &lt;a href="https://clickhouse.com/"&gt;ClickHouse&lt;/a&gt;, which were not designed to be extended.&lt;/p&gt;

&lt;h3 id="stringview"&gt;StringView&lt;/h3&gt;

&lt;p&gt;Performance improved for all queries when DataFusion switched to using Arrow &lt;code class="language-markup"&gt;StringView&lt;/code&gt;. Using &lt;code class="language-markup"&gt;StringView&lt;/code&gt; “just” saves some copies and avoids one memory access for certain comparisons. However, these copies and comparisons happen to occur in many of the hottest loops during query processing, so optimizing them resulted in noticeable performance improvements.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/3f036fe937874c7fb7610a965471ee6a/ff20c6448250d0f769b277e9033492e0/unnamed.png" alt="" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3:&lt;/strong&gt; Figure from &lt;a href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/"&gt;Using StringView / German Style Strings to Make Queries Faster: Part 1&lt;/a&gt; showing how StringView saves copying data in many cases. 
&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Using StringView to make DataFusion faster for ClickBench required substantial careful, low level optimization work described in &lt;a href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/"&gt;Using StringView / German Style Strings to Make Queries Faster: Part 1&lt;/a&gt; and &lt;a href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/"&gt;Part 2&lt;/a&gt;. However, it &lt;em&gt;also&lt;/em&gt; required extending the rest of DataFusion’s operations to support the new type. You can get a sense of the magnitude of the work required by looking at the 100+ pull requests linked to the epic in arrow-rs (&lt;a href="https://github.com/apache/arrow-rs/issues/5374"&gt;here&lt;/a&gt;) and three major epics (&lt;a href="https://github.com/apache/datafusion/issues/10918"&gt;here&lt;/a&gt;, &lt;a href="https://github.com/apache/datafusion/issues/11790"&gt;here&lt;/a&gt;, and &lt;a href="https://github.com/apache/datafusion/issues/11752"&gt;here&lt;/a&gt;) in DataFusion.&lt;/p&gt;

&lt;p&gt;Here is a partial list of people involved in the project (I am sorry to those I forgot):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Arrow&lt;/strong&gt;: &lt;a href="https://github.com/XiangpengHao"&gt;Xiangpeng Hao&lt;/a&gt; (InfluxData’s amazing 2024 summer intern and UW Madison PhD), &lt;a href="https://github.com/ariesdevil"&gt;Yijun Zhao&lt;/a&gt; from Databend Labs, and &lt;a href="https://github.com/tustvold"&gt;Raphael Taylor-Davies&lt;/a&gt; laid the foundation. &lt;a href="https://github.com/RinChanNOWWW"&gt;RinChanNOW&lt;/a&gt; from Tencent and &lt;a href="https://github.com/a10y"&gt;Andrew Duffy&lt;/a&gt; from SpiralDB helped push it along in the early days, and &lt;a href="https://github.com/viirya"&gt;Liang-Chi Hsieh&lt;/a&gt; and &lt;a href="https://github.com/Dandandan"&gt;Daniël Heres&lt;/a&gt; reviewed and provided guidance.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;DataFusion&lt;/strong&gt;: &lt;a href="https://github.com/XiangpengHao"&gt;Xiangpeng Hao&lt;/a&gt; again charted the initial path, and &lt;a href="https://github.com/Weijun-H"&gt;Alex Huang&lt;/a&gt;, &lt;a href="https://github.com/dharanad"&gt;Dharan Aditya&lt;/a&gt;, &lt;a href="https://github.com/Lordworms"&gt;Lordworms&lt;/a&gt;, &lt;a href="https://github.com/goldmedal"&gt;Jax Liu&lt;/a&gt;, &lt;a href="https://github.com/wiedld"&gt;wiedld&lt;/a&gt;, &lt;a href="https://github.com/tlm365"&gt;Tai Le Manh&lt;/a&gt;, &lt;a href="https://github.com/my-vegetable-has-exploded"&gt;yi wang&lt;/a&gt;, &lt;a href="https://github.com/doupache"&gt;doupache&lt;/a&gt;, &lt;a href="https://github.com/jayzhan211"&gt;Jay Zhan&lt;/a&gt;, &lt;a href="https://github.com/xinlifoobar"&gt;Xin Li&lt;/a&gt;, and &lt;a href="https://github.com/Kev1n8"&gt;Kaifeng Zheng&lt;/a&gt; made it real.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;DataFusion String Function Migration&lt;/strong&gt;: &lt;a href="https://github.com/tshauck"&gt;Trent Hauck&lt;/a&gt; organized the effort and set the patterns, &lt;a href="https://github.com/goldmedal"&gt;Jax Liu&lt;/a&gt; made a clever testing framework, and &lt;a href="https://github.com/austin362667"&gt;Austin Liu&lt;/a&gt;, &lt;a href="https://github.com/demetribu"&gt;Dmitrii Bu&lt;/a&gt;, &lt;a href="https://github.com/tlm365"&gt;Tai Le Manh&lt;/a&gt;, &lt;a href="https://github.com/PsiACE"&gt;Chojan Shang&lt;/a&gt;, &lt;a href="https://github.com/devanbenz"&gt;WeblWabl&lt;/a&gt;, &lt;a href="https://github.com/Lordworms"&gt;Lordworms&lt;/a&gt;, &lt;a href="https://github.com/thinh2"&gt;iamthinh&lt;/a&gt;, &lt;a href="https://github.com/Omega359"&gt;Bruce Ritchie&lt;/a&gt;, &lt;a href="https://github.com/Kev1n8"&gt;Kaifeng Zheng&lt;/a&gt;, and &lt;a href="https://github.com/xinlifoobar"&gt;Xin Li&lt;/a&gt; bashed out the conversions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id="parquet"&gt;Parquet&lt;/h3&gt;

&lt;p&gt;Part of DataFusion’s speed in ClickBench is reading Parquet files (really) quickly, which reflects invested effort in the Parquet reading system (&lt;a href="https://www.influxdata.com/blog/querying-parquet-millisecond-latency/"&gt;see Querying Parquet with Millisecond Latency&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html"&gt;DataFusion ParquetExec&lt;/a&gt; (built on &lt;a href="https://crates.io/crates/parquet"&gt;Rust Parquet&lt;/a&gt;) is now the most sophisticated open source Parquet reader I know of. It has every optimization we can think of for reading Parquet, including projection pushdown, predicate pushdown (row group metadata, page index, and bloom filters), limit pushdown, parallel reading, interleaved I/O, and late materialized filtering (coming soon by default). Recent work from &lt;a href="https://github.com/itsjunetime"&gt;June&lt;/a&gt; &lt;a href="https://github.com/apache/datafusion/pull/12135"&gt;unblocked a remaining hurdle&lt;/a&gt; for enabling late materialized filtering, and conveniently &lt;a href="https://github.com/XiangpengHao"&gt;Xiangpeng Hao&lt;/a&gt; is working on the &lt;a href="https://github.com/apache/arrow-datafusion/issues/3463"&gt;final piece&lt;/a&gt; (no pressure 😅).&lt;/p&gt;
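
&lt;p&gt;Many of these behaviors are controlled through session configuration. As a small sketch (assuming a recent DataFusion release), late materialized filtering can already be opted into via the documented &lt;code class="language-markup"&gt;datafusion.execution.parquet.pushdown_filters&lt;/code&gt; setting:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-rust"&gt;use datafusion::prelude::{SessionConfig, SessionContext};

fn main() {
    // Opt in to late materialized filtering, which is not yet on by
    // default at the time of writing (see the issue linked above)
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", true);
    let ctx = SessionContext::new_with_config(config);
    // ... register Parquet files and run queries with `ctx` as usual
    let _ = ctx;
}&lt;/code&gt;&lt;/pre&gt;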

&lt;h3 id="skipping-partial-aggregation-when-it-doesnt-help"&gt;Skipping partial aggregation when it doesn’t help&lt;/h3&gt;

&lt;p&gt;Many ClickBench queries are aggregations that summarize millions of rows, a common task for reporting and dashboarding. DataFusion uses state-of-the-art &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html#tymethod.state"&gt;two-phase aggregation&lt;/a&gt; plans. Normally, two-phase aggregation works well, as the first phase consolidates many rows immediately after reading, while the data is still in cache. However, for certain “high-cardinality” aggregate queries (those with large numbers of groups), &lt;a href="https://github.com/apache/datafusion/issues/6937"&gt;the two-phase aggregation strategy used in DataFusion was inefficient&lt;/a&gt;, manifesting in slower performance relative to other engines for ClickBench queries such as&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") 
FROM hits 
GROUP BY "WatchID", "ClientIP" -- **** 13M distinct Groups  ****
ORDER BY c DESC 
LIMIT 10;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For such queries, the first aggregation phase does not meaningfully reduce the number of rows, so it wastes significant effort. &lt;a href="https://github.com/korowa"&gt;Eduard Karacharov&lt;/a&gt; solved this problem with a &lt;a href="https://github.com/apache/datafusion/pull/11627"&gt;dynamic strategy&lt;/a&gt; to bypass the first phase when it is not working efficiently, shown in Figure 4.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/59fbiZTn9KWnnazzYcNkMH/46eac477f2d5a581388d38a838cf782c/skipping-partial-aggregation.png" alt="Figure 4" /&gt;
&lt;strong&gt;Figure 4&lt;/strong&gt;: Diagram from &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html#tymethod.state"&gt;DataFusion API docs&lt;/a&gt; showing when the multi-phase grouping is not effective.&lt;/p&gt;
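
&lt;p&gt;The shape of the strategy is easy to sketch. The probe below is illustrative only; the function, thresholds, and numbers are made up for this example and are not DataFusion’s actual code or configuration:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-rust"&gt;/// Illustrative sketch (not DataFusion's actual code or configuration):
/// after sampling PROBE_ROWS input rows, if the partial aggregation
/// phase produced nearly as many groups as rows, it is not reducing
/// cardinality, so stream rows directly to the final phase instead.
fn should_skip_partial(input_rows: usize, num_groups: usize) -&amp;gt; bool {
    const PROBE_ROWS: usize = 100_000; // sample size before deciding
    const RATIO_THRESHOLD: f64 = 0.8; // groups-per-row cutoff

    if input_rows &amp;lt; PROBE_ROWS {
        return false; // not enough data observed to decide yet
    }
    (num_groups as f64) / (input_rows as f64) &amp;gt; RATIO_THRESHOLD
}

fn main() {
    // 13M distinct groups from 14M rows: the partial phase barely helps
    assert!(should_skip_partial(14_000_000, 13_000_000));
    // 100 groups from 14M rows: the partial phase is very effective
    assert!(!should_skip_partial(14_000_000, 100));
}&lt;/code&gt;&lt;/pre&gt;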

&lt;h3 id="optimized-multi-column-grouping"&gt;Optimized multi-column grouping&lt;/h3&gt;

&lt;p&gt;Another method for improving analytic database performance is to provide specialized (aka highly optimized) versions of operations for different data types, which the system picks at runtime based on the query. Like other systems, DataFusion has specialized code for handling different types of group columns. For example, there is &lt;a href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs"&gt;special code&lt;/a&gt; that handles &lt;code class="language-markup"&gt;GROUP BY int_id&lt;/code&gt; and different &lt;a href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/single_group_by/bytes.rs"&gt;specialized code&lt;/a&gt; that handles &lt;code class="language-markup"&gt;GROUP BY string_id&lt;/code&gt;.&lt;/p&gt;
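
&lt;p&gt;As a rough sketch of what such runtime specialization looks like (the enum and function here are hypothetical stand-ins, not DataFusion’s internal types, and it assumes a recent &lt;code class="language-markup"&gt;arrow_schema&lt;/code&gt; crate with &lt;code class="language-markup"&gt;Utf8View&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-rust"&gt;use arrow_schema::DataType;

/// Hypothetical stand-in for DataFusion's internal group-values storage
#[derive(Debug, PartialEq)]
enum GroupStorage {
    Primitive, // specialized: native integer keys stored directly
    Bytes,     // specialized: string keys stored in a byte arena
    Rows,      // general fallback: row-encodes all key columns
}

/// Pick a specialized implementation for a single group column of a
/// supported type; otherwise fall back to the general row format
fn choose_group_storage(group_types: &amp;amp;[DataType]) -&amp;gt; GroupStorage {
    match group_types {
        [DataType::Int32 | DataType::Int64] =&amp;gt; GroupStorage::Primitive,
        [DataType::Utf8 | DataType::Utf8View] =&amp;gt; GroupStorage::Bytes,
        _ =&amp;gt; GroupStorage::Rows,
    }
}

fn main() {
    assert_eq!(choose_group_storage(&amp;amp;[DataType::Int64]), GroupStorage::Primitive);
    // multiple group columns: no specialized version for every combination
    assert_eq!(
        choose_group_storage(&amp;amp;[DataType::Utf8, DataType::Int64]),
        GroupStorage::Rows
    );
}&lt;/code&gt;&lt;/pre&gt;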

&lt;p&gt;When a query groups by multiple columns, it is trickier to apply this technique. For example, &lt;code class="language-markup"&gt;GROUP BY string_id, int_id&lt;/code&gt; and &lt;code class="language-markup"&gt;GROUP BY int_id, string_id&lt;/code&gt; have different optimal structures, but it is not possible to include specialized versions for all possible combinations of group column types.&lt;/p&gt;

&lt;p&gt;DataFusion includes &lt;a href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/row.rs#L33-L39"&gt;a general Row based mechanism&lt;/a&gt; that works for any combination of column types, but this general mechanism copies each value twice as shown in Figure 5. The cost of this copy &lt;a href="https://github.com/apache/datafusion/issues/9403"&gt;is especially high for variable length strings and binary data&lt;/a&gt;.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/71SnFPRebmOeHM6KEsy731/8a4f85077b3333aff3ab9692dc837af0/row-based-storage.png" alt="Figure 5" /&gt;
&lt;strong&gt;Figure 5&lt;/strong&gt;: Prior to DataFusion 43.0.0, queries with multiple group columns used Row based group storage and copied each group value twice. This copy consumes a substantial amount of the query time for queries with many distinct groups, such as several of the queries in ClickBench.
&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Many optimizations in databases boil down to simply avoiding copies, and this was no exception. The trick was to figure out how to avoid copies without causing per-column comparison overhead to dominate or complexity to get out of hand. In a great example of diligent and disciplined engineering, &lt;a href="https://github.com/jayzhan211"&gt;Jay Zhan&lt;/a&gt; tried &lt;a href="https://github.com/apache/datafusion/pull/10937"&gt;several&lt;/a&gt; &lt;a href="https://github.com/apache/datafusion/pull/10976"&gt;different&lt;/a&gt; approaches until arriving at the &lt;a href="https://github.com/apache/datafusion/pull/12269"&gt;one shipped in DataFusion 43.0.0&lt;/a&gt;, shown in Figure 6.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/4lFNUYxCJl7qjWI4L4PrcL/1807df43e659de337508bd47c614f412/column-based-storage.png" alt="Figure 6" /&gt;
&lt;strong&gt;Figure 6&lt;/strong&gt;: DataFusion 43.0.0’s new columnar group storage copies each group value exactly once, which is significantly faster when grouping by multiple columns.
&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;p&gt;Huge thanks as well to &lt;a href="https://github.com/eejbyfeldt"&gt;Emil Ejbyfeldt&lt;/a&gt; and &lt;a href="https://github.com/Dandandan"&gt;Daniël Heres&lt;/a&gt; for their help reviewing and to &lt;a href="https://github.com/Rachelint"&gt;Rachelint (kamille)&lt;/a&gt; for reviewing and contributing a faster &lt;a href="https://github.com/apache/datafusion/pull/12996"&gt;vectorized append and compare for multi group&lt;/a&gt;, which will be released in DataFusion 44. The discussion on &lt;a href="https://github.com/apache/datafusion/issues/9403"&gt;the ticket&lt;/a&gt; is another great example of the power of the DataFusion community working together to build great software.&lt;/p&gt;

&lt;h2 id="whats-next-"&gt;What’s next 🚀&lt;/h2&gt;

&lt;p&gt;Just as I expect the performance of other engines to improve, DataFusion itself has several more performance improvements lined up:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href="https://github.com/apache/datafusion/pull/11943#top"&gt;Intermediate results blocked management&lt;/a&gt; (thanks again &lt;a href="https://github.com/Rachelint"&gt;Rachelint Kamille&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://github.com/apache/datafusion/issues/3463"&gt;Enable parquet filter pushdown by default #3463&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We are also talking about what to focus on over the &lt;a href="https://github.com/apache/datafusion/issues/13274"&gt;next three months&lt;/a&gt; and are always looking for people to help! If you want to geek out (obsess??) about performance and other features with engineers from around the world, &lt;a href="https://datafusion.apache.org/contributor-guide/communication.html"&gt;we would love you to join us&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="additional-thanks"&gt;Additional thanks&lt;/h2&gt;

&lt;p&gt;In addition to the people called out above, thanks:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href="https://github.com/pmcgleenon"&gt;Patrick McGleenon&lt;/a&gt; for running ClickBench and gathering this data (&lt;a href="https://github.com/apache/datafusion/issues/13099#issuecomment-2478314793"&gt;source&lt;/a&gt;).&lt;/li&gt;
  &lt;li&gt;Everyone I missed in the shoutouts—there are so many of you. We appreciate everyone.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;I have dreamed about DataFusion being at the top of the ClickBench leaderboard for several years. I often watched with envy improvements in systems backed by large VC investments, internet companies, or world class research institutions, and doubted that we could pull off something similar in an open source project with always limited time.&lt;/p&gt;

&lt;p&gt;The fact that we have now surpassed those other systems in query performance speaks to the power and possibility of focusing on community and aligning the collective enthusiasm and skills towards a common goal. Of course, being on the top in any particular benchmark is likely fleeting as other engines will improve, but so will DataFusion!&lt;/p&gt;

&lt;p&gt;I love working on DataFusion—the people, the quality of the code, my interactions, and the results we have achieved together far surpass my expectations, as well as most of my other software development experiences. I can’t wait to see what people will build next, and hope to &lt;a href="https://github.com/apache/datafusion"&gt;see you online&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Note that DuckDB is slightly faster on the ‘cold’ run.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Want to try your hand at a custom format for ClickBench fame/glory? &lt;a href="https://github.com/apache/datafusion/issues/13448"&gt;Make DataFusion the fastest engine in ClickBench with custom file format&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;We have contributors from North America, South America, Europe, Asia, Africa, and Australia.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Undergraduates, PhD students, junior engineers, and getting-kind-of-crotchety experienced engineers.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Thanks to Andy Pavlo, I love that nomenclature.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
</description>
      <pubDate>Mon, 25 Nov 2024 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/apache-datafusion-fastest-single-node-querying-engine/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/apache-datafusion-fastest-single-node-querying-engine/</guid>
      <category>Developer</category>
      <author>Andrew Lamb (InfluxData)</author>
    </item>
    <item>
      <title>Using StringView / German Style Strings to Make Queries Faster: Part 2 - String Operations</title>
      <description>&lt;h2 id="section-3-faster-string-operations"&gt;Section 3: Faster String Operations&lt;/h2&gt;

&lt;p&gt;In the &lt;a href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/" title="Using StringView / German Style Strings to Make Queries Faster: Part 1 - Reading Parquet"&gt;first post&lt;/a&gt;, we discussed the nuances required to accelerate Parquet loading using StringViewArray by reusing buffers and reducing copies. In this second part of the post, we describe the rest of the journey: implementing additional efficient operations for real query processing.&lt;/p&gt;

&lt;h3 id="section-31-faster-comparison"&gt;Section 3.1 Faster comparison&lt;/h3&gt;

&lt;p&gt;String comparison is ubiquitous; it is the core of &lt;a href="https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/index.html"&gt;cmp&lt;/a&gt;, &lt;a href="https://docs.rs/arrow/latest/arrow/compute/fn.min.html"&gt;min&lt;/a&gt;/&lt;a href="https://docs.rs/arrow/latest/arrow/compute/fn.max.html"&gt;max&lt;/a&gt;, and &lt;a href="https://docs.rs/arrow/latest/arrow/compute/kernels/comparison/fn.like.html"&gt;like&lt;/a&gt;/&lt;a href="https://docs.rs/arrow/latest/arrow/compute/kernels/comparison/fn.ilike.html"&gt;ilike&lt;/a&gt; kernels. StringViewArray is designed to accelerate such comparisons using the inlined prefix—the key observation is that, in many cases, only the first few bytes of the string determine the string comparison results.&lt;/p&gt;

&lt;p&gt;For example, to compare the strings InfluxDB with Apache DataFusion, we only need to look at the first byte to determine the string ordering or equality. In this case, since &lt;code class="language-markup"&gt;A&lt;/code&gt; is earlier in the alphabet than &lt;code class="language-markup"&gt;I&lt;/code&gt;, Apache DataFusion sorts first, and we know the strings are not equal. Despite only needing the first byte, comparing these strings when stored as a StringArray requires two memory accesses: 1) load the string offset and 2) use the offset to locate the string bytes. For low-level operations such as &lt;code class="language-markup"&gt;cmp&lt;/code&gt; that are invoked millions of times in the very hot paths of queries, avoiding this extra memory access can make a measurable difference in query performance.&lt;/p&gt;

&lt;p&gt;For StringViewArray, typically only one memory access is needed to load the view struct. Only if the result cannot be determined from the prefix is the second memory access required. For the example above, there is no need for the second access. This technique is very effective in practice: the second access is never necessary for the more than &lt;a href="https://www.vldb.org/pvldb/vol17/p148-zeng.pdf"&gt;60% of real-world strings that are shorter than 12 bytes&lt;/a&gt;, as they are stored completely in the prefix.&lt;/p&gt;

&lt;p&gt;However, functions that operate on strings must be specialized to take advantage of the inlined prefix. In addition to low-level comparison kernels, we implemented &lt;a href="https://github.com/apache/arrow-rs/issues/5374"&gt;a wide range&lt;/a&gt; of other StringViewArray operations that cover the functions and operations seen in ClickBench queries. Supporting StringViewArray in all string operations takes quite a bit of effort, and thankfully the Arrow and DataFusion communities are already hard at work doing so (see &lt;a href="https://github.com/apache/datafusion/issues/11752"&gt;https://github.com/apache/datafusion/issues/11752&lt;/a&gt; if you want to help out).&lt;/p&gt;
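
&lt;p&gt;Applications see this specialization through the ordinary comparison kernels. A minimal sketch, assuming a recent arrow-rs:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-rust"&gt;use arrow::array::StringViewArray;
use arrow::compute::kernels::cmp::lt;

fn main() {
    let a = StringViewArray::from(vec!["Apache DataFusion", "InfluxDB"]);
    let b = StringViewArray::from(vec!["InfluxDB", "Apache DataFusion"]);

    // "InfluxDB" (8 bytes) is fully inlined in its view; the longer
    // string still carries a 4-byte inlined prefix, so both comparisons
    // are decided from the view structs without touching the buffers
    let result = lt(&amp;amp;a, &amp;amp;b).unwrap();
    assert!(result.value(0)); // "A..." sorts before "I..."
    assert!(!result.value(1));
}&lt;/code&gt;&lt;/pre&gt;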

&lt;h3 id="section-32-faster-code-classlanguage-markuptakecode-and-code-classlanguage-markupfiltercode"&gt;Section 3.2: Faster &lt;code class="language-markup"&gt;take&lt;/code&gt; and &lt;code class="language-markup"&gt;filter&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;After a filter operation such as &lt;code class="language-markup"&gt;WHERE url &amp;lt;&amp;gt; ''&lt;/code&gt; to avoid processing empty URLs, DataFusion will often &lt;em&gt;coalesce&lt;/em&gt; results to form a new array with only the passing elements. This coalescing ensures the batches are sufficiently sized to benefit from &lt;a href="https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf"&gt;vectorized processing&lt;/a&gt; in subsequent steps.&lt;/p&gt;

&lt;p&gt;The coalescing operation is implemented using the &lt;a href="https://docs.rs/arrow/latest/arrow/compute/fn.take.html"&gt;take&lt;/a&gt; and &lt;a href="https://arrow.apache.org/rust/arrow/compute/kernels/filter/fn.filter.html"&gt;filter&lt;/a&gt; kernels in arrow-rs. For StringArray, these kernels require copying the string contents to a new buffer without “holes” in between. This copy can be expensive especially when the new array is large.&lt;/p&gt;

&lt;p&gt;However, &lt;code class="language-markup"&gt;take&lt;/code&gt; and &lt;code class="language-markup"&gt;filter&lt;/code&gt; for StringViewArray can avoid the copy by reusing buffers from the old array. The kernels only need to create a new list of &lt;code class="language-markup"&gt;view&lt;/code&gt;s that point at the same strings within the old buffers. Figure 1 illustrates the difference between the output of both string representations. StringArray creates two new strings at offsets 0-17 and 17-32, while StringViewArray simply points to the original buffer at offsets 0 and 25.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/a34495da2cf8403fb5ee6f860c70dee1/c6725b9b608ea2699335bb9f2dd5ac76/unnamed.png" alt="" /&gt;
  Figure 1: Zero-copy &lt;code class="language-markup"&gt;take&lt;/code&gt;/&lt;code class="language-markup"&gt;filter&lt;/code&gt; for StringViewArray&lt;/p&gt;
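
&lt;p&gt;Here is a small sketch of the kernel in action, assuming a recent arrow-rs; the same call works on a StringArray, but there it must copy the surviving string bytes:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-rust"&gt;use arrow::array::{Array, BooleanArray, StringViewArray};
use arrow::compute::kernels::filter::filter;

fn main() {
    // The long string's bytes live in a data buffer; its view records
    // only (length, prefix, buffer index, offset)
    let urls = StringViewArray::from(vec![
        "https://example.com/a/url/longer/than/twelve/bytes",
        "",
    ]);
    let not_empty = BooleanArray::from(vec![true, false]);

    // For StringViewArray this is zero-copy: the output's views point
    // into the same buffers as the input's
    let filtered = filter(&amp;amp;urls, &amp;amp;not_empty).unwrap();
    assert_eq!(filtered.len(), 1);
}&lt;/code&gt;&lt;/pre&gt;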

&lt;h3 id="section-33-when-to-gc"&gt;Section 3.3: When to GC?&lt;/h3&gt;

&lt;p&gt;Zero-copy take/filter is great for generating large arrays quickly, but it is suboptimal for highly selective filters, where most of the strings are filtered out. When the cardinality drops, StringViewArray buffers become sparse—only a small subset of the bytes in the buffer’s memory are referred to by any &lt;code class="language-markup"&gt;view&lt;/code&gt;. This leads to excessive memory usage, especially in a &lt;a href="https://github.com/apache/datafusion/issues/11628"&gt;filter-then-coalesce scenario&lt;/a&gt;. For example, a StringViewArray with 10M strings may only refer to 1M strings after some filter operations; however, due to zero-copy take/filter, the buffers holding the original 10M strings cannot be released or reused.&lt;/p&gt;

&lt;p&gt;To release unused memory, we implemented a &lt;a href="https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.gc"&gt;garbage collection (GC)&lt;/a&gt; routine that consolidates the data into a new buffer so the old sparse buffer(s) can be released. Because the GC operation copies strings, just as StringArray does, we must be careful about when to call it. If we call GC too early, we cause unnecessary copying, losing much of the benefit of StringViewArray. If we call GC too late, we hold large buffers for too long, increasing memory use and decreasing cache efficiency. The &lt;a href="https://pola.rs/posts/polars-string-type/"&gt;Polars blog&lt;/a&gt; on StringView also refers to the challenge presented by garbage collection timing.&lt;/p&gt;

&lt;p&gt;arrow-rs implements the GC process, but it is up to users to decide when to call it. We leveraged the semantics of the query engine and observed that the &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/coalesce_batches/struct.CoalesceBatchesExec.html"&gt;CoalesceBatchesExec&lt;/a&gt; operator, which merges smaller batches into larger ones, is often used where the record cardinality is expected to shrink, which aligns perfectly with the GC scenario in StringViewArray. We therefore &lt;a href="https://github.com/apache/datafusion/pull/11587"&gt;implemented the GC procedure&lt;/a&gt; inside CoalesceBatchesExec&lt;sup&gt;1&lt;/sup&gt;, with a heuristic that estimates when the buffers are too sparse.&lt;/p&gt;
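
&lt;p&gt;arrow-rs exposes the routine as &lt;code class="language-markup"&gt;GenericByteViewArray::gc&lt;/code&gt; (linked above). Here is a hedged sketch of what a call-site decision could look like; the heuristic below is illustrative and differs from the one actually used in CoalesceBatchesExec:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-rust"&gt;use arrow::array::{Array, StringViewArray};

/// Illustrative sparseness heuristic (DataFusion's internal one differs):
/// consolidate when the views refer to less than half of the bytes the
/// array keeps alive in its data buffers
fn maybe_gc(array: StringViewArray) -&amp;gt; StringViewArray {
    let held: usize = array.data_buffers().iter().map(|b| b.len()).sum();
    let used: usize = array
        .views()
        .iter()
        .map(|v| *v as u32) // the low 32 bits of a view hold the length
        .filter(|len| *len &amp;gt; 12) // inlined strings use no buffer bytes
        .map(|len| len as usize)
        .sum();
    if held &amp;gt; 0 &amp;amp;&amp;amp; used * 2 &amp;lt; held {
        array.gc() // copies the live strings into compact new buffers
    } else {
        array
    }
}

fn main() {
    let array = StringViewArray::from(vec!["a string comfortably longer than twelve bytes"]);
    assert_eq!(maybe_gc(array).len(), 1);
}&lt;/code&gt;&lt;/pre&gt;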

&lt;h3 id="section-34-the-art-of-function-inlining-not-too-much-not-too-little"&gt;Section 3.4: The art of function inlining: not too much, not too little&lt;/h3&gt;

&lt;p&gt;Like string inlining, &lt;em&gt;function&lt;/em&gt; inlining is the process of embedding a short function into the caller to avoid the overhead of function calls (caller/callee register saves). Usually, the Rust compiler does a good job of deciding when to inline. However, it is possible to override its default using the &lt;a href="https://doc.rust-lang.org/reference/attributes/codegen.html#the-inline-attribute"&gt;#[inline(always)] directive&lt;/a&gt;. In performance-critical code, inlining allows us to organize large functions into smaller ones without paying the runtime cost of function invocation.&lt;/p&gt;

&lt;p&gt;However, function inlining is &lt;strong&gt;&lt;em&gt;not&lt;/em&gt;&lt;/strong&gt; always better, as it leads to larger function bodies that are harder for LLVM to optimize (for example, suboptimal &lt;a href="https://en.wikipedia.org/wiki/Register_allocation"&gt;register spilling&lt;/a&gt;) and risk overflowing the CPU’s instruction cache. We observed several performance regressions where function inlining caused &lt;em&gt;slower&lt;/em&gt; performance when implementing the StringViewArray comparison kernels. Careful inspection and tuning of the code was required to aid the compiler in generating efficient code. More details can be found in this PR: &lt;a href="https://github.com/apache/arrow-rs/pull/5900"&gt;https://github.com/apache/arrow-rs/pull/5900&lt;/a&gt;.&lt;/p&gt;
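
&lt;p&gt;For illustration only (this is not the arrow-rs kernel code), the pattern looks like the following: force-inline the tiny, hot helper and leave the inlining decision for larger functions to the compiler:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-rust"&gt;/// A tiny helper called millions of times in hot comparison loops:
/// a reasonable candidate for forced inlining
#[inline(always)]
fn prefix(view: u128) -&amp;gt; u32 {
    // In a view, bytes 0-3 hold the length and bytes 4-7 the prefix
    (view &amp;gt;&amp;gt; 32) as u32
}

/// A quick negative test: if lengths or prefixes differ, the strings
/// cannot be equal, and the data buffers are never touched. The
/// inlining decision for this larger function is left to the compiler.
fn definitely_not_equal(a: u128, b: u128) -&amp;gt; bool {
    (a as u32) != (b as u32) || prefix(a) != prefix(b)
}

fn main() {
    let v = 0x6463_6261_0000_0004_u128; // length 4, prefix "abcd"
    assert!(!definitely_not_equal(v, v));
}&lt;/code&gt;&lt;/pre&gt;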

&lt;h3 id="section-35-buffer-size-tuning"&gt;Section 3.5: Buffer size tuning&lt;/h3&gt;

&lt;p&gt;StringViewArray permits multiple buffers, which enables a flexible buffer layout and potentially reduces the need to copy data. However, a large number of buffers slows down the performance of other operations. For example, &lt;a href="https://docs.rs/arrow/latest/arrow/array/trait.Array.html#tymethod.get_array_memory_size"&gt;get_array_memory_size&lt;/a&gt;() needs to sum the memory size of each buffer, which takes a long time with thousands of small buffers. In certain cases, we found that multiple calls to &lt;a href="https://docs.rs/arrow/latest/arrow/compute/fn.concat_batches.html"&gt;concat_batches&lt;/a&gt; led to arrays with millions of buffers, which was prohibitively expensive.&lt;/p&gt;

&lt;p&gt;For example, consider a StringViewArray with the previous default buffer size of 8 KB. With this configuration, holding 4GB of string data requires almost half a million buffers! Larger buffer sizes are needed for larger arrays, but we cannot arbitrarily increase the default buffer size, as small arrays would consume too much memory (most arrays require at least one buffer). Buffer sizing is especially problematic in query processing, as we often need to construct small batches of string arrays, and the sizes are unknown at planning time.&lt;/p&gt;

&lt;p&gt;To balance the buffer size trade-off, we again leverage the query processing (DataFusion) semantics to decide when to use larger buffers. While coalescing batches, we combine multiple small string arrays and set a smaller buffer size to keep the total memory consumption low. In string aggregation, we aggregate over an entire DataFusion partition, which can generate a large number of strings, so we set a larger buffer size (2MB).&lt;/p&gt;

&lt;p&gt;To handle situations where the semantics are unknown, we also &lt;a href="https://github.com/apache/arrow-rs/pull/6136"&gt;implemented&lt;/a&gt; a classic dynamic exponential buffer size growth strategy, which starts with a small buffer size (8KB) and doubles the size of each new buffer up to 2MB. We implemented this strategy in arrow-rs and enabled it by default so that other users of StringViewArray can also benefit from this optimization. See this issue for more details: &lt;a href="https://github.com/apache/arrow-rs/issues/6094"&gt;https://github.com/apache/arrow-rs/issues/6094&lt;/a&gt;.&lt;/p&gt;
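
&lt;p&gt;A short sketch of the observable effect through the builder, assuming a recent arrow-rs where the exponential growth strategy is the default:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-rust"&gt;use arrow::array::StringViewBuilder;

fn main() {
    let mut builder = StringViewBuilder::new();
    // ~8 MB of string data; each value is too long to be inlined
    let value = "x".repeat(64);
    for _ in 0..131_072 {
        builder.append_value(&amp;amp;value);
    }
    let array = builder.finish();

    // A fixed 8 KB block size would need ~1,024 buffers; exponential
    // growth (8 KB doubling up to 2 MB) produces far fewer
    println!("buffers: {}", array.data_buffers().len());
}&lt;/code&gt;&lt;/pre&gt;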

&lt;h3 id="section-36-end-to-end-query-performance"&gt;Section 3.6: End-to-end query performance&lt;/h3&gt;

&lt;p&gt;We have made significant progress in optimizing StringViewArray filtering operations. Now, let’s test it in the real world to see how it works!&lt;/p&gt;

&lt;p&gt;Let’s consider ClickBench query 22, which selects multiple string fields (URL, Title, and SearchPhrase) and applies several filters.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;SELECT 
  "SearchPhrase", 
  MIN("URL"), MIN("Title"), COUNT(\*) AS c, COUNT(DISTINCT "UserID") 
FROM hits 
WHERE 
  "Title" LIKE '%Google%' AND 
  "URL" NOT LIKE '%.google.%' AND 
  "SearchPhrase" &amp;lt;&amp;gt; '' 
GROUP BY "SearchPhrase" 
ORDER BY c DESC 
LIMIT 10;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We ran the benchmark using the following command in the DataFusion repo. Again, the &lt;code class="language-markup"&gt;--string-view&lt;/code&gt; option means we use StringViewArray instead of StringArray.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;cargo run --profile release-nonlto --bin dfbench -- clickbench --queries-path benchmarks/queries/clickbench/queries.sql --iterations 3 --query 22 --path benchmarks/data/hits.parquet --string-view&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To eliminate the impact of the faster Parquet reading using StringViewArray (see the first part of this blog), Figure 2 plots only the time spent in FilterExec. Without StringViewArray, the filter takes 7.17s; with StringViewArray, the filter only takes 4.86s, a 32% reduction in time. Moreover, we see a 17% improvement in end-to-end query performance.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/85d6c7df75864ae7ac09e428e1611576/d60513f4b56381b0814d0998378753bd/unnamed.png" alt="" /&gt;
Figure 2: StringViewArray reduces the filter time by 32% on ClickBench query 22.&lt;/p&gt;

&lt;h2 id="section-4-faster-string-aggregation"&gt;Section 4: Faster String Aggregation&lt;/h2&gt;

&lt;p&gt;So far, we have discussed how to exploit two StringViewArray features: reduced copying and faster filtering. This section focuses on a third: reusing string bytes for repeated string values.&lt;/p&gt;

&lt;p&gt;As described in part one of this blog, if two strings have identical values, StringViewArray can use two different &lt;code class="language-markup"&gt;view&lt;/code&gt;s pointing at the same buffer range, thus avoiding repeating the string bytes in the buffer. This makes StringViewArray similar to an Arrow &lt;a href="https://docs.rs/arrow/latest/arrow/array/struct.DictionaryArray.html"&gt;DictionaryArray&lt;/a&gt; that stores Strings—both array types work well for strings with only a few distinct values.&lt;/p&gt;

&lt;p&gt;Deduplicating string values can significantly reduce memory consumption in StringViewArray. However, this process is expensive: it involves hashing every string and maintaining a hash table, so it cannot be done by default when creating a StringViewArray. We introduced an &lt;a href="https://docs.rs/arrow/latest/arrow/array/builder/struct.GenericByteViewBuilder.html#method.with_deduplicate_strings"&gt;opt-in string deduplication mode&lt;/a&gt; in arrow-rs for advanced users who know their data has a small number of distinct values, and where the benefits of reduced memory consumption outweigh the additional overhead of array construction.&lt;/p&gt;
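
&lt;p&gt;Using the builder looks roughly like this (a minimal sketch; the string values are made up):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;use arrow::array::{Array, StringViewBuilder};

fn main() {
    // Opt in to deduplication: worthwhile only when there are few distinct
    // values, since every append now pays for hashing and a hash table probe.
    let mut builder = StringViewBuilder::new().with_deduplicate_strings();
    for _ in 0..3 {
        // Longer than 12 bytes, so the bytes land in a buffer; all three
        // views can point at the same buffer range.
        builder.append_value("this string is longer than 12 bytes");
    }
    let array = builder.finish();
    assert_eq!(array.len(), 3);
}&lt;/code&gt;&lt;/pre&gt;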

&lt;p&gt;Once again, we leverage DataFusion query semantics to identify StringViewArray with duplicate values, such as aggregation queries with multiple group keys. For example, some &lt;a href="https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/queries.sql"&gt;ClickBench queries&lt;/a&gt; group by two columns:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;UserID (an integer with close to 1M distinct values)&lt;/li&gt;
  &lt;li&gt;MobilePhoneModel (a string with less than a hundred distinct values)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this case, the output row count is &lt;code class="language-markup"&gt;count(distinct UserID) * count(distinct MobilePhoneModel)&lt;/code&gt;, which is 100M. Each string value of MobilePhoneModel is repeated 1M times. With StringViewArray, we can save space by pointing the repeated values to the same underlying buffer.&lt;/p&gt;

&lt;p&gt;Faster string aggregation with StringView is part of a larger project to &lt;a href="https://github.com/apache/datafusion/issues/7000"&gt;improve DataFusion aggregation performance&lt;/a&gt;. We have a &lt;a href="https://github.com/apache/datafusion/pull/11794"&gt;proof of concept implementation&lt;/a&gt; with StringView that can improve multi-column string aggregation by 20%. We would love your help to get it production-ready!&lt;/p&gt;

&lt;h2 id="section-5-stringview-pitfalls"&gt;Section 5: StringView Pitfalls&lt;/h2&gt;

&lt;p&gt;Most existing blog posts (including this one) focus on the benefits of using StringViewArray over other string representations such as StringArray. As we have discussed, even though it requires a significant engineering investment to realize, StringViewArray is a major improvement over StringArray in many cases.&lt;/p&gt;

&lt;p&gt;However, there are several cases where StringViewArray is slower than StringArray. For completeness, we have listed those instances here:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Tiny strings (when strings are shorter than 8 bytes)&lt;/strong&gt;: every element of the StringViewArray consumes at least 16 bytes of memory—the size of the &lt;code&gt;view&lt;/code&gt; struct. For an array of tiny strings, StringViewArray consumes more memory than StringArray and thus can cause slower performance due to additional memory pressure on the CPU cache (see the sketch after this list).&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Many repeated short strings&lt;/strong&gt;: Similar to the first point, StringViewArray can be slower and require more memory than a DictionaryArray because 1) it can only reuse the bytes in the buffer when the strings are longer than 12 bytes and 2) 32-bit offsets are always used, even when a smaller size (8-bit or 16-bit) could represent all the distinct values.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Filtering:&lt;/strong&gt; As we mentioned above, StringViewArrays often consume more memory than the corresponding StringArray, and memory bloat quickly dominates performance without GC. However, invoking GC also reduces the benefit of less copying, so it must be carefully tuned.&lt;/li&gt;
&lt;/ol&gt;
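
&lt;p&gt;As a rough illustration of the first pitfall, this sketch compares the reported memory size of the two representations for an array of 2-byte strings (exact numbers will vary by arrow-rs version and allocation behavior):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;use arrow::array::{Array, StringArray, StringViewArray};

fn main() {
    let values = vec!["ab"; 1000]; // tiny 2-byte strings
    let plain = StringArray::from(values.clone());
    let view = StringViewArray::from(values);
    // StringArray: ~2 data bytes + 4 offset bytes per element.
    // StringViewArray: a 16-byte view per element, data fully inlined.
    println!("StringArray:     {} bytes", plain.get_array_memory_size());
    println!("StringViewArray: {} bytes", view.get_array_memory_size());
}&lt;/code&gt;&lt;/pre&gt;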

&lt;h2 id="section-6-conclusion-and-takeaways"&gt;Section 6: Conclusion and Takeaways&lt;/h2&gt;

&lt;p&gt;In these two blog posts, we discussed what it takes to implement StringViewArray in arrow-rs and then integrate it into DataFusion. Our evaluations on ClickBench queries show that StringView can improve the performance of string-intensive workloads by up to 2x.&lt;/p&gt;

&lt;p&gt;Given that DataFusion already &lt;a href="https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQiI6ZmFsc2UsIkF0aGVuYSAocGFydGl0aW9uZWQpIjpmYWxzZSwiQXRoZW5hIChzaW5nbGUpIjpmYWxzZSwiQXVyb3JhIGZvciBNeVNRTCI6ZmFsc2UsIkF1cm9yYSBmb3IgUG9zdGdyZVNRTCI6ZmFsc2UsIkJ5Q29uaXR5IjpmYWxzZSwiQnl0ZUhvdXNlIjpmYWxzZSwiY2hEQiAoUGFycXVldCwgcGFydGl0aW9uZWQpIjpmYWxzZSwiY2hEQiI6ZmFsc2UsIkNpdHVzIjpmYWxzZSwiQ2xpY2tIb3VzZSBDbG91ZCAoYXdzKSI6ZmFsc2UsIkNsaWNrSG91c2UgQ2xvdWQgKGF3cykgUGFyYWxsZWwgUmVwbGljYXMgT04iOmZhbHNlLCJDbGlja0hvdXNlIENsb3VkIChBenVyZSkiOmZhbHNlLCJDbGlja0hvdXNlIENsb3VkIChBenVyZSkgUGFyYWxsZWwgUmVwbGljYSBPTiI6ZmFsc2UsIkNsaWNrSG91c2UgQ2xvdWQgKEF6dXJlKSBQYXJhbGxlbCBSZXBsaWNhcyBPTiI6ZmFsc2UsIkNsaWNrSG91c2UgQ2xvdWQgKGdjcCkiOmZhbHNlLCJDbGlja0hvdXNlIENsb3VkIChnY3ApIFBhcmFsbGVsIFJlcGxpY2FzIE9OIjpmYWxzZSwiQ2xpY2tIb3VzZSAoZGF0YSBsYWtlLCBwYXJ0aXRpb25lZCkiOmZhbHNlLCJDbGlja0hvdXNlIChkYXRhIGxha2UsIHNpbmdsZSkiOmZhbHNlLCJDbGlja0hvdXNlIChQYXJxdWV0LCBwYXJ0aXRpb25lZCkiOmZhbHNlLCJDbGlja0hvdXNlIChQYXJxdWV0LCBzaW5nbGUpIjpmYWxzZSwiQ2xpY2tIb3VzZSAod2ViKSI6ZmFsc2UsIkNsaWNrSG91c2UiOmZhbHNlLCJDbGlja0hvdXNlICh0dW5lZCkiOmZhbHNlLCJDbGlja0hvdXNlICh0dW5lZCwgbWVtb3J5KSI6ZmFsc2UsIkNsb3VkYmVycnkiOmZhbHNlLCJDcmF0ZURCIjpmYWxzZSwiQ3J1bmNoeSBCcmlkZ2UgZm9yIEFuYWx5dGljcyAoUGFycXVldCkiOmZhbHNlLCJEYXRhYmVuZCI6ZmFsc2UsIkRhdGFGdXNpb24gKFBhcnF1ZXQsIHBhcnRpdGlvbmVkKSI6dHJ1ZSwiRGF0YUZ1c2lvbiAoUGFycXVldCwgc2luZ2xlKSI6ZmFsc2UsIkFwYWNoZSBEb3JpcyI6ZmFsc2UsIkRydWlkIjpmYWxzZSwiRHVja0RCIChQYXJxdWV0LCBwYXJ0aXRpb25lZCkiOnRydWUsIkR1Y2tEQiI6ZmFsc2UsIkVsYXN0aWNzZWFyY2giOmZhbHNlLCJFbGFzdGljc2VhcmNoICh0dW5lZCkiOmZhbHNlLCJHbGFyZURCIjpmYWxzZSwiR3JlZW5wbHVtIjpmYWxzZSwiSGVhdnlBSSI6ZmFsc2UsIkh5ZHJhIjpmYWxzZSwiSW5mb2JyaWdodCI6ZmFsc2UsIktpbmV0aWNhIjpmYWxzZSwiTWFyaWFEQiBDb2x1bW5TdG9yZSI6ZmFsc2UsIk1hcmlhREIiOmZhbHNlLCJNb25ldERCIjpmYWxzZSwiTW9uZ29EQiI6ZmFsc2UsIk1vdGhlcmR1Y2siOmZhbHNlLCJNeVNRTCAoTXlJU0FNKSI6ZmFsc2UsIk15U1FMIjpmYWxzZSwiT3hsYSI6ZmFsc2UsIlBhcmFkZURCIChQYXJxdWV0LCBwYXJ0aXRpb25lZCkiOmZhbHNlLCJQYXJhZGVEQiAoUGFycXVldCwgc2luZ2xlKSI6ZmFsc2UsIlBpbm90IjpmYWxzZSwiUG9zdGdyZVNRTCAodHVuZWQpIjpmYWxzZSwiUG9zdGdyZVNRTCI6ZmFsc2UsIlF1ZXN0REIgKHBhcnRpdGlvbmVkKSI6ZmFsc2UsIlF1ZXN0REIiOmZhbHNlLCJSZWRzaGlmdCI6ZmFsc2UsIlNlbGVjdERCIjpmYWxzZSwiU2luZ2xlU3RvcmUiOmZhbHNlLCJTbm93Zmxha2UiOmZhbHNlLCJTUUxpdGUiOmZhbHNlLCJTdGFyUm9ja3MiOmZhbHNlLCJUYWJsZXNwYWNlIjpmYWxzZSwiVGVtYm8gT0xBUCAoY29sdW1uYXIpIjpmYWxzZSwiVGltZXNjYWxlREIgKGNvbXByZXNzaW9uKSI6ZmFsc2UsIlRpbWVzY2FsZURCIjpmYWxzZSwiVW1icmEiOmZhbHNlfSwidHlwZSI6eyJDIjp0cnVlLCJjb2x1bW4tb3JpZW50ZWQiOnRydWUsIlBvc3RncmVTUUwgY29tcGF0aWJsZSI6dHJ1ZSwibWFuYWdlZCI6dHJ1ZSwiZ2NwIjp0cnVlLCJzdGF0ZWxlc3MiOnRydWUsIkphdmEiOnRydWUsIkMrKyI6dHJ1ZSwiTXlTUUwgY29tcGF0aWJsZSI6dHJ1ZSwicm93LW9yaWVudGVkIjp0cnVlLCJDbGlja0hvdXNlIGRlcml2YXRpdmUiOnRydWUsImVtYmVkZGVkIjp0cnVlLCJzZXJ2ZXJsZXNzIjp0cnVlLCJhd3MiOnRydWUsInBhcmFsbGVsIHJlcGxpY2FzIjp0cnVlLCJBenVyZSI6dHJ1ZSwiYW5hbHl0aWNhbCI6dHJ1ZSwiUnVzdCI6dHJ1ZSwic2VhcmNoIjp0cnVlLCJkb2N1bWVudCI6dHJ1ZSwic29tZXdoYXQgUG9zdGdyZVNRTCBjb21wYXRpYmxlIjp0cnVlLCJ0aW1lLXNlcmllcyI6dHJ1ZX0sIm1hY2hpbmUiOnsiMTYgdkNQVSAxMjhHQiI6dHJ1ZSwiOCB2Q1BVIDY0R0IiOnRydWUsInNlcnZlcmxlc3MiOnRydWUsIjE2YWN1Ijp0cnVlLCJjNmEuNHhsYXJnZSwgNTAwZ2IgZ3AyIjp0cnVlLCJMIjp0cnVlLCJNIjp0cnVlLCJTIjp0cnVlLCJYUyI6dHJ1ZSwiYzZhLm1ldGFsLCA1MDBnYiBncDIiOnRydWUsIjE5MkdCIjp0cnVlLCIyNEdCIjp0cnVlLCIzNjBHQiI6dHJ1ZSwiNDhHQiI6dHJ1ZSwiNzIwR0IiOnRydWUsIjk2R0IiOnRydWUsIjE0MzBHQiI6dHJ1ZSwiZGV2Ijp0cnVlLCI3MDhHQiI6dHJ1ZSwiYzVuLjR4bGFyZ2UsIDUwMGdiIGdwMiI6dHJ1ZSwiQW5hbHl0aWNzLTI1NkdCICg2NCB2Q29yZXMsIDI1NiBHQikiOnRydWUsImM1LjR4bGFyZ2UsIDUwMGdiIGdwMiI6dHJ1ZSwiYzZhLjR4bGFyZ2UsIDE1MDBnYiBncDIiOnRydW
UsImNsb3VkIjp0cnVlLCJkYzIuOHhsYXJnZSI6dHJ1ZSwicmEzLjE2eGxhcmdlIjp0cnVlLCJyYTMuNHhsYXJnZSI6dHJ1ZSwicmEzLnhscGx1cyI6dHJ1ZSwiUzIiOnRydWUsIlMyNCI6dHJ1ZSwiMlhMIjp0cnVlLCIzWEwiOnRydWUsIjRYTCI6dHJ1ZSwiWEwiOnRydWUsIkwxIC0gMTZDUFUgMzJHQiI6dHJ1ZSwiYzZhLjR4bGFyZ2UsIDUwMGdiIGdwMyI6dHJ1ZX0sImNsdXN0ZXJfc2l6ZSI6eyIxIjp0cnVlLCIyIjp0cnVlLCI0Ijp0cnVlLCI4Ijp0cnVlLCIxNiI6dHJ1ZSwiMzIiOnRydWUsIjY0Ijp0cnVlLCIxMjgiOnRydWUsInNlcnZlcmxlc3MiOnRydWUsImRlZGljYXRlZCI6dHJ1ZX0sIm1ldHJpYyI6ImhvdCIsInF1ZXJpZXMiOlt0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLH"&gt;performs very well on ClickBench&lt;/a&gt;, the level of end-to-end performance improvement using StringViewArray shows the power of this technique and, of course, is a win for DataFusion and the systems that build upon it.&lt;/p&gt;

&lt;p&gt;StringView is a big project that has received tremendous community support. Specifically, we would like to thank &lt;a href="https://github.com/tustvold"&gt;@tustvold&lt;/a&gt;, &lt;a href="https://github.com/ariesdevil"&gt;@ariesdevil&lt;/a&gt;, &lt;a href="https://github.com/RinChanNOWWW"&gt;@RinChanNOWWW&lt;/a&gt;, &lt;a href="https://github.com/ClSlaid"&gt;@ClSlaid&lt;/a&gt;, &lt;a href="https://github.com/2010YOUY01"&gt;@2010YOUY01&lt;/a&gt;, &lt;a href="https://github.com/chloro-pn"&gt;@chloro-pn&lt;/a&gt;, &lt;a href="https://github.com/a10y"&gt;@a10y&lt;/a&gt;, &lt;a href="https://github.com/Kev1n8"&gt;@Kev1n8&lt;/a&gt;, &lt;a href="https://github.com/Weijun-H"&gt;@Weijun-H&lt;/a&gt;, &lt;a href="https://github.com/PsiACE"&gt;@PsiACE&lt;/a&gt;, &lt;a href="https://github.com/tshauck"&gt;@tshauck&lt;/a&gt;, and &lt;a href="https://github.com/xinlifoobar"&gt;@xinlifoobar&lt;/a&gt; for their valuable contributions!&lt;/p&gt;

&lt;p&gt;As the introduction states, “German Style Strings” is a relatively straightforward research idea that avoids some string copies and accelerates comparisons. However, applying this (great) idea in practice requires a significant investment in careful software engineering. Again, we encourage the research community to continue to help apply research ideas to industrial systems, such as DataFusion, as doing so provides valuable perspectives when evaluating future research questions for the greatest potential impact.&lt;/p&gt;

&lt;hr /&gt;

&lt;ol&gt;
  &lt;li&gt;There are additional optimizations possible in this operation that the community is working on, such as &lt;a href="https://github.com/apache/datafusion/issues/7957"&gt;https://github.com/apache/datafusion/issues/7957&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
</description>
      <pubDate>Tue, 03 Sep 2024 08:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/</guid>
      <category>Developer</category>
      <author>Andrew Lamb, Xiangpeng Hao (InfluxData)</author>
    </item>
    <item>
      <title>Using StringView / German Style Strings to Make Queries Faster: Part 1 - Reading Parquet</title>
      <description>&lt;p&gt;Editor’s Note: This is the first of a &lt;a href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/" title="Using StringView / German Style Strings to Make Queries Faster: Part 2 - String Operations"&gt;two part&lt;/a&gt; blog series.&lt;/p&gt;

&lt;p&gt;This blog describes our experience implementing &lt;a href="https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout"&gt;StringView&lt;/a&gt; in the &lt;a href="https://github.com/apache/arrow-rs"&gt;Rust implementation&lt;/a&gt; of &lt;a href="https://arrow.apache.org/"&gt;Apache Arrow&lt;/a&gt;, and integrating it into &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt;, significantly accelerating string-intensive queries in the &lt;a href="https://benchmark.clickhouse.com/"&gt;ClickBench&lt;/a&gt; benchmark by 20%-200% (Figure 1&lt;sup&gt;1&lt;/sup&gt;).&lt;/p&gt;

&lt;p&gt;Getting significant end-to-end performance improvements was non-trivial. Implementing StringView itself was only a fraction of the effort required. Among other things, we had to optimize UTF-8 validation, implement unintuitive compiler optimizations, tune block sizes, and time GC to realize the &lt;a href="https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/"&gt;FDAP ecosystem&lt;/a&gt;’s benefit. With other members of the open source community, we were able to overcome performance bottlenecks that could have killed the project. We would like to contribute by explaining the challenges and solutions in more detail so that more of the community can learn from our experience.&lt;/p&gt;

&lt;p&gt;StringView is based on a simple idea: avoid some string copies and accelerate comparisons with inlined prefixes. Like most great ideas, it is “obvious” only after &lt;a href="https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf"&gt;someone describes it clearly&lt;/a&gt;. Although the idea is simple, a straightforward implementation actually &lt;em&gt;slows down performance for almost every query&lt;/em&gt;. We must, therefore, apply astute observations and diligent engineering to realize the actual benefits from StringView.&lt;/p&gt;

&lt;p&gt;Although this journey was successful, not all research ideas are as lucky. To accelerate the adoption of research into industry, it is valuable to integrate research prototypes with practical systems. Understanding the nuances of real-world systems makes it more likely that research designs&lt;sup&gt;2&lt;/sup&gt; will lead to practical system improvements.&lt;/p&gt;

&lt;p&gt;StringView support was released as part of arrow-rs &lt;a href="https://crates.io/crates/arrow/52.2.0"&gt;v52.2.0&lt;/a&gt; and DataFusion v41.0.0. You can try it by setting the &lt;code class=" language-markup"&gt;schema_force_string_view&lt;/code&gt; &lt;a href="https://datafusion.apache.org/user-guide/configs.html"&gt;DataFusion configuration option&lt;/a&gt;, and we are &lt;a href="https://github.com/apache/datafusion/issues/11682"&gt;hard at work with the community&lt;/a&gt; to make it the default. We invite everyone to try it out, take advantage of the effort invested so far, and contribute to making it better.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/3c1e9348c4a044f28c4aabdee5ba37ad/8a56e70efc02df72d6b95dc2f7e361ca/unnamed.png" alt="" /&gt;
Figure 1: StringView improves string-intensive ClickBench query performance by 20% - 200%&lt;/p&gt;

&lt;h2 id="section-1-what-is-stringview"&gt;Section 1: What is StringView?&lt;/h2&gt;
&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/6fdc3d7bdc294c0bad9fe260491b8f45/8d837d2573050a6b229abed13e7a23fd/unnamed.png" alt="" /&gt;
Figure 2: Use StringArray and StringViewArray to represent the same string content.&lt;/p&gt;

&lt;p&gt;The concept of inlined strings with prefixes (called “German Strings” &lt;a href="https://x.com/andy_pavlo/status/1813258735965643203"&gt;by Andy Pavlo&lt;/a&gt;, in homage to &lt;a href="https://www.tum.de/"&gt;TUM&lt;/a&gt;, where the &lt;a href="https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf"&gt;Umbra paper that describes&lt;/a&gt; them originated) has been used in many recent database systems (&lt;a href="https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/"&gt;Velox&lt;/a&gt;, &lt;a href="https://pola.rs/posts/polars-string-type/"&gt;Polars&lt;/a&gt;, &lt;a href="https://duckdb.org/2021/12/03/duck-arrow.html"&gt;DuckDB&lt;/a&gt;, &lt;a href="https://cedardb.com/blog/german_strings/"&gt;CedarDB&lt;/a&gt;, etc.) and was introduced to Arrow as a new &lt;a href="https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout"&gt;StringViewArray&lt;/a&gt;&lt;sup&gt;3&lt;/sup&gt; type. Arrow’s original &lt;a href="https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout"&gt;StringArray&lt;/a&gt; is very memory efficient but less effective for certain operations. StringViewArray accelerates string-intensive operations via prefix inlining and a more flexible and compact string representation.&lt;/p&gt;

&lt;p&gt;A StringViewArray consists of three components:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The &lt;em&gt;view&lt;/em&gt; array&lt;/li&gt;
  &lt;li&gt;The buffers&lt;/li&gt;
  &lt;li&gt;The buffer pointers (IDs) that map buffer offsets to their physical locations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each view is 16 bytes long, and its contents differ based on the string’s length (a struct-level sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;string length &amp;lt;= 12 bytes: the first four bytes store the string length, and the remaining 12 bytes store the inlined string.&lt;/li&gt;
  &lt;li&gt;string length &amp;gt; 12 bytes: the string is stored in a separate buffer. The length is again stored in the first 4 bytes, followed by the buffer id (4 bytes), the buffer offset (4 bytes), and the prefix (first 4 bytes) of the string.&lt;/li&gt;
&lt;/ul&gt;
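
&lt;p&gt;Conceptually, the two layouts correspond to the following Rust sketch. The long-string variant mirrors the &lt;code class="language-markup"&gt;ByteView&lt;/code&gt; struct used in arrow-rs (it appears again in Section 2.3); the inline variant and its field names are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;// Both variants occupy exactly 16 bytes.
#[repr(C)]
struct InlineView {
    length: u32,    // string length, at most 12
    data: [u8; 12], // the entire string, stored inline
}

#[repr(C)]
struct ByteView {
    length: u32,       // string length, more than 12
    prefix: u32,       // first 4 bytes of the string
    buffer_index: u32, // which buffer holds the full string bytes
    offset: u32,       // where the string starts in that buffer
}

fn main() {
    assert_eq!(std::mem::size_of::&amp;lt;InlineView&amp;gt;(), 16);
    assert_eq!(std::mem::size_of::&amp;lt;ByteView&amp;gt;(), 16);
}&lt;/code&gt;&lt;/pre&gt;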

&lt;p&gt;Figure 2 shows an example of the same logical content (left) using StringArray (middle) and StringViewArray (right):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The first string – &lt;code class="language-markup"&gt;“Apache DataFusion”&lt;/code&gt; – is 17 bytes long, and both StringArray and StringViewArray store the string’s bytes at the beginning of the buffer. The StringViewArray also inlines the first 4 bytes – &lt;code class="language-markup"&gt;“Apac”&lt;/code&gt; – in the view.&lt;/li&gt;
  &lt;li&gt;The second string, &lt;code class=" language-markup"&gt;“InfluxDB”&lt;/code&gt; is only 8 bytes long, so StringViewArray completely inlines the string content in the view struct while StringArray stores the string in the buffer as well.&lt;/li&gt;
  &lt;li&gt;The third string &lt;code class="language-markup"&gt;“Arrow Rust Impl”&lt;/code&gt; is 15 bytes long and cannot be fully inlined. StringViewArray stores this in the same form as the first string.&lt;/li&gt;
  &lt;li&gt;The last string &lt;code class="language-markup"&gt;“Apache DataFusion”&lt;/code&gt; has the same content as the first string. It’s possible to use StringViewArray to avoid this duplication and reuse the bytes by pointing the view to the previous location.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;StringViewArray provides three opportunities for outperforming StringArray:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Less copying via the offset + buffer format&lt;/li&gt;
  &lt;li&gt;Faster comparisons using the inlined string prefix&lt;/li&gt;
  &lt;li&gt;Reusing repeated string values with the flexible view layout&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The rest of this blog post discusses how to apply these opportunities in real query scenarios to improve performance, what challenges we encountered along the way, and how we solved them.&lt;/p&gt;

&lt;h2 id="section-2-faster-parquet-loading"&gt;Section 2: Faster Parquet Loading&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://parquet.apache.org/"&gt;Apache&lt;/a&gt;&lt;a href="https://parquet.apache.org/"&gt;Parquet&lt;/a&gt; is the de facto format for storing large-scale analytical data commonly stored LakeHouse-style, such as &lt;a href="https://iceberg.apache.org"&gt;Apache Iceberg&lt;/a&gt; and &lt;a href="https://delta.io"&gt;Delta Lake&lt;/a&gt;. Efficiently loading data from Parquet is thus critical to query performance in many important real-world workloads.&lt;/p&gt;

&lt;p&gt;Parquet encodes strings (i.e., &lt;a href="https://docs.rs/parquet/latest/parquet/data_type/struct.ByteArray.html"&gt;byte array&lt;/a&gt;) in a slightly different format than required for the original Arrow StringArray. The string length is encoded inline with the actual string data (as shown in Figure 4, left). As mentioned previously, StringArray requires the data buffer to be contiguous and compact—the strings have to follow one after another. This requirement means that reading Parquet string data into an Arrow StringArray requires copying and consolidating the string bytes to a new buffer and tracking offsets in a separate array. Copying these strings is often wasteful. Typical queries filter out most data immediately after loading, so most of the copied data is quickly discarded.&lt;/p&gt;

&lt;p&gt;On the other hand, reading Parquet data as a StringViewArray can re-use the same data buffer as storing the Parquet pages because StringViewArray does not require strings to be contiguous. For example, in Figure 4, the StringViewArray directly references the buffer with the decoded Parquet page. The string &lt;code class="language-markup"&gt;“Arrow Rust Impl”&lt;/code&gt; is represented by a &lt;code class="language-markup"&gt;view&lt;/code&gt; with offset 37 and length 15 into that buffer.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/cee09f1a659246fa8a1f2ef0249c2a9d/093af0e22a70a0e00686eb303e1dde4d/unnamed.png" alt="" /&gt;
Figure 4: StringViewArray avoids copying by reusing decoded Parquet pages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mini benchmark&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Reusing Parquet buffers is great in theory, but how much does saving a copy actually matter? We can run the following benchmark in arrow-rs to find out:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;cargo bench --bench arrow_reader --features="arrow test_common experimental" "arrow_array_reader/Binary.*Array/plain encoded"&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On our benchmarking machine, loading &lt;em&gt;BinaryViewArray&lt;/em&gt; is almost 2x faster than loading BinaryArray (see the next section for why this isn’t &lt;em&gt;String&lt;/em&gt;ViewArray).&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;arrow_array_reader/BinaryArray/plain encoded                        time:   [315.86 µs **317.47 µs** 319.00 µs]
arrow_array_reader/BinaryViewArray/plain encoded
time:   [162.08 µs **162.20 µs** 162.32 µs]&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can read more on this arrow-rs issue: &lt;a href="https://github.com/apache/arrow-rs/issues/5904"&gt;https://github.com/apache/arrow-rs/issues/5904&lt;/a&gt;&lt;/p&gt;

&lt;h3 id="section-21-from-binary-to-strings"&gt;Section 2.1: From binary to strings&lt;/h3&gt;

&lt;p&gt;You may wonder why we reported performance for BinaryViewArray when this post is about StringViewArray. Surprisingly, our initial implementation of reading StringViewArray from Parquet was much &lt;em&gt;slower&lt;/em&gt; than reading StringArray. Why? TLDR: Although reading StringViewArray copied less data, the initial implementation also spent much more time validating &lt;a href="https://en.wikipedia.org/wiki/UTF-8#:~:text=UTF%2D8%20is%20a%20variable,Unicode%20Standard"&gt;UTF-8&lt;/a&gt; (as shown in Figure 5).&lt;/p&gt;

&lt;p&gt;Strings are stored as byte sequences. When reading data from (potentially untrusted) Parquet files, a Parquet decoder must ensure those byte sequences are valid UTF-8 strings, and most programming languages, including Rust, include highly &lt;a href="https://doc.rust-lang.org/std/str/fn.from_utf8.html"&gt;optimized routines&lt;/a&gt; for doing so.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2c6075e20695476eb5a073e363d6274f/85b87756429a0ede29a397ef8b4faca9/unnamed.png" alt="" /&gt;&lt;/p&gt;

&lt;p&gt;Figure 5: Time to load strings from Parquet. The cost of UTF-8 validation initially eliminated the advantage of reduced copying for StringViewArray.&lt;/p&gt;

&lt;p&gt;A StringArray can be validated in a single call to the UTF-8 validation function as it has a contiguous string buffer. As long as the underlying buffer is UTF-8&lt;sup&gt;4&lt;/sup&gt;, all strings in the array must be UTF-8. The Rust Parquet reader makes a single function call to validate the entire buffer.&lt;/p&gt;

&lt;p&gt;However, validating an arbitrary StringViewArray requires validating each string with a separate call to the validation function, as the underlying buffer may also contain non-string data (for example, the lengths in Parquet pages).&lt;/p&gt;

&lt;p&gt;UTF-8 validation in Rust is highly optimized and favors longer strings (as shown in Figure 6), likely because it leverages SIMD instructions to perform parallel validation. The benefit of a single function call to validate UTF-8 over a function call for each string more than eliminates the advantage of avoiding the copy for StringViewArray.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/ee7cb8a70aaf454f9f5fdb6622e7921e/e77ca6b97f81a08e627825b4b55a2309/unnamed.png" alt="" /&gt;&lt;/p&gt;

&lt;p&gt;Figure 6: UTF-8 validation throughput vs string length—StringArray’s contiguous buffer can be validated much faster than StringViewArray’s buffer.&lt;/p&gt;

&lt;p&gt;Does this mean we should only use StringArray? No! Thankfully, there’s a clever way out. The key observation is that in many real-world datasets, &lt;a href="https://www.vldb.org/pvldb/vol17/p148-zeng.pdf"&gt;99% of strings are shorter than 128 bytes&lt;/a&gt;, meaning the encoded length values are smaller than 128, &lt;strong&gt;in which case the length itself is also valid UTF-8&lt;/strong&gt; (in fact, it is &lt;a href="https://en.wikipedia.org/wiki/ASCII"&gt;ASCII&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;This observation means we can optimize validating UTF-8 strings in Parquet pages by treating the length bytes as part of a single large string as long as the length &lt;em&gt;value&lt;/em&gt; is less than 128. Put another way, prior to this optimization, the length bytes act as string boundaries, which require a UTF-8 validation on each string. After this optimization, only those strings with lengths larger than 128 bytes (less than 1% of the strings in the ClickBench dataset) are string boundaries, significantly increasing the UTF-8 validation chunk size and thus improving performance.&lt;/p&gt;
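
&lt;p&gt;The sketch below illustrates the idea on a plain-encoded Parquet byte stream. This is &lt;em&gt;not&lt;/em&gt; the actual arrow-rs code (linked below), and it omits error handling for malformed input:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;/// Validates a plain-encoded byte stream: a 4-byte little-endian length
/// followed by the string bytes, repeated. Length values below 128 are
/// valid ASCII, so runs of short strings are validated in a single call.
fn validate_utf8(buf: &amp;amp;[u8]) -&amp;gt; bool {
    let mut chunk_start = 0; // start of the current validation chunk
    let mut pos = 0;
    while pos + 4 &amp;lt;= buf.len() {
        let len = u32::from_le_bytes(buf[pos..pos + 4].try_into().unwrap()) as usize;
        if len &amp;gt;= 128 {
            // These 4 length bytes are not ASCII-safe: validate the chunk
            // accumulated so far, then the long string by itself.
            if std::str::from_utf8(&amp;amp;buf[chunk_start..pos]).is_err()
                || std::str::from_utf8(&amp;amp;buf[pos + 4..pos + 4 + len]).is_err()
            {
                return false;
            }
            chunk_start = pos + 4 + len;
        }
        pos += 4 + len;
    }
    // One call validates all remaining lengths and string data together.
    std::str::from_utf8(&amp;amp;buf[chunk_start..pos]).is_ok()
}

fn main() {
    let mut buf = Vec::new();
    for s in ["hi", "arrow"] {
        buf.extend_from_slice(&amp;amp;(s.len() as u32).to_le_bytes());
        buf.extend_from_slice(s.as_bytes());
    }
    assert!(validate_utf8(&amp;amp;buf));
}&lt;/code&gt;&lt;/pre&gt;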

&lt;p&gt;The &lt;a href="https://github.com/apache/arrow-rs/pull/6009/files"&gt;actual implementation&lt;/a&gt; is only nine lines of Rust (with 30 lines of comments). You can find more details in the related arrow-rs issue:  &lt;a href="https://github.com/apache/arrow-rs/issues/5995"&gt;https://github.com/apache/arrow-rs/issues/5995&lt;/a&gt;. As expected, with this optimization, loading StringViewArray is almost 2x faster than loading StringArray.&lt;/p&gt;

&lt;h3 id="section-22-be-careful-about-implicit-copies"&gt;Section 2.2: Be careful about implicit copies&lt;/h3&gt;

&lt;p&gt;After all the work to avoid copying strings when loading from Parquet, performance was still not as good as expected. We tracked the problem to a few implicit data copies that we weren’t aware of, as described in &lt;a href="https://github.com/apache/arrow-rs/issues/6033"&gt;this issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The copies we eventually identified come from the following innocent-looking line of Rust code, where &lt;code class="language-markup"&gt;self.buf&lt;/code&gt; is a &lt;a href="https://en.wikipedia.org/wiki/Reference_counting"&gt;reference counted&lt;/a&gt; pointer that should transform without copying into a buffer for use in StringViewArray.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;let block_id = output.append_block(self.buf.clone().into());&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;However, Rust’s type coercion rules favored a blanket implementation that &lt;em&gt;did&lt;/em&gt; copy data. This implementation is shown in the following code block, where the &lt;code class="language-markup"&gt;impl&amp;lt;T: AsRef&amp;lt;[u8]&amp;gt;&amp;gt;&lt;/code&gt; will accept any type that implements &lt;code class="language-markup"&gt;AsRef&amp;lt;[u8]&amp;gt;&lt;/code&gt; and copies the data to create a new buffer. To avoid copying, users need to explicitly call &lt;code class="language-markup"&gt;from_vec&lt;/code&gt;, which consumes the &lt;code class="language-markup"&gt;Vec&lt;/code&gt; and transforms it into a buffer.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;impl&amp;lt;T: AsRef&amp;lt;[u8]&amp;gt;&amp;gt; From&amp;lt;T&amp;gt; for Buffer {
    fn from(p: T) -&amp;gt; Self {
        // copies data here
	 ...
    }
}
impl Buffer { 
  pub fn from_vec&amp;lt;T&amp;gt;(data: Vec&amp;lt;T&amp;gt;) -&amp;gt; Self {
// zero-copy transformation
...
  }
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Diagnosing this implicit copy was time-consuming as it relied on subtle Rust language semantics. We needed to track every step of the data flow to ensure every copy was necessary. To help other users and prevent future mistakes, we also &lt;a href="https://github.com/apache/arrow-rs/pull/6043"&gt;removed&lt;/a&gt; the implicit API from arrow-rs in favor of an explicit API. Using this approach, we found and fixed several &lt;a href="https://github.com/apache/arrow-rs/pull/6039"&gt;other unintentional copies&lt;/a&gt; in the code base—hopefully, the change will help other &lt;a href="https://github.com/spiraldb/vortex/pull/504"&gt;downstream users&lt;/a&gt; avoid unnecessary copies.&lt;/p&gt;

&lt;h3 id="section-23-help-the-compiler-by-giving-it-more-information"&gt;Section 2.3: Help the compiler by giving it more information&lt;/h3&gt;

&lt;p&gt;The Rust compiler’s automatic optimizations mostly work very well for a wide variety of use cases, but sometimes, it needs additional hints to generate the most efficient code. When profiling the performance of &lt;code class="language-markup"&gt;view&lt;/code&gt; construction, we found, counterintuitively, that constructing &lt;strong&gt;long&lt;/strong&gt; strings was 10x faster than constructing &lt;strong&gt;short&lt;/strong&gt; strings, which made short strings slower on StringViewArray than on StringArray!&lt;/p&gt;

&lt;p&gt;As described in Section 1, StringViewArray treats long and short strings differently. Short strings (&amp;lt;=12 bytes) are inlined directly into the &lt;code class="language-markup"&gt;view&lt;/code&gt; struct, while long strings inline only the first 4 bytes. The code to construct a &lt;code class="language-markup"&gt;view&lt;/code&gt; looks something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;if len &amp;lt;= 12 {
   // Construct 16 byte view for short string
   let mut view_buffer = [0; 16];
   view_buffer[0..4].copy_from_slice(&amp;amp;len.to_le_bytes());
   view_buffer[4..4 + data.len()].copy_from_slice(data);
   ...
} else {      
   // Construct 16 byte view for long string
   ByteView {
       length: len,
       prefix: u32::from_le_bytes(data[0..4].try_into().unwrap()),
       buffer_index: block_id,
       offset,
   }
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It appears that both branches of the code should be fast: they both involve copying at most 16 bytes of data and some memory shift/store operations. How could the branch for short strings be 10x slower?&lt;/p&gt;

&lt;p&gt;Looking at the assembly code using &lt;a href="https://godbolt.org/"&gt;godbolt&lt;/a&gt;, we (with help from &lt;a href="https://github.com/aoli-al"&gt;Ao Li&lt;/a&gt;) found the compiler used CPU &lt;strong&gt;load instructions&lt;/strong&gt; to copy the fixed-size 4 bytes to the &lt;code class="language-markup"&gt;view&lt;/code&gt; for long strings, but it calls a function, &lt;a href="https://doc.rust-lang.org/std/ptr/fn.copy_nonoverlapping.html"&gt;ptr::copy_nonoverlapping&lt;/a&gt;, to copy the inlined bytes to the &lt;code class="language-markup"&gt;view&lt;/code&gt; for short strings. The difference is that long strings have a prefix size (4 bytes) known at compile time, so the compiler directly uses efficient CPU instructions. But, since the size of a short string is unknown to the compiler, it has to call the general-purpose function &lt;code class="language-markup"&gt;ptr::copy_nonoverlapping&lt;/code&gt;. Making a function call is significant, unnecessary overhead compared to a CPU copy instruction.&lt;/p&gt;

&lt;p&gt;However, we know something the compiler doesn’t: the short string size is not arbitrary—it must be between 0 and 12 bytes, and we can leverage this information to avoid the function call. Our solution generates 13 copies of the function using const generics, one for each of the possible inlined lengths. The code looks as follows, and by &lt;a href="https://godbolt.org/z/685YPsd5G"&gt;checking the assembly code&lt;/a&gt;, we confirmed there are no calls to &lt;code class="language-markup"&gt;ptr::copy_nonoverlapping&lt;/code&gt;; only native CPU instructions are used. For more details, see &lt;a href="https://github.com/apache/arrow-rs/issues/6034"&gt;the ticket&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;fn make_inlined_view&amp;lt;const LEN: usize&amp;gt;(data: &amp;amp;[u8]) -&amp;gt; u128 {
     let mut view_buffer = [0; 16];
     view_buffer[0..4].copy_from_slice(&amp;amp;(LEN as u32).to_le_bytes());
     view_buffer[4..4 + LEN].copy_from_slice(&amp;amp;data[..LEN]);
     u128::from_le_bytes(view_buffer)
}
pub fn make_view(data: &amp;amp;[u8], block_id: u32, offset: u32) -&amp;gt; u128 {
     let len = data.len();
     // generate special code for each of the 13 possible lengths
     match len {
         0 =&amp;gt; make\_inlined\_view::&amp;lt;0&amp;gt;(data),
         1 =&amp;gt; make\_inlined\_view::&amp;lt;1&amp;gt;(data),
         2 =&amp;gt; make\_inlined\_view::&amp;lt;2&amp;gt;(data),
         3 =&amp;gt; make\_inlined\_view::&amp;lt;3&amp;gt;(data),
         4 =&amp;gt; make\_inlined\_view::&amp;lt;4&amp;gt;(data),
         5 =&amp;gt; make\_inlined\_view::&amp;lt;5&amp;gt;(data),
         6 =&amp;gt; make\_inlined\_view::&amp;lt;6&amp;gt;(data),
         7 =&amp;gt; make\_inlined\_view::&amp;lt;7&amp;gt;(data),
         8 =&amp;gt; make\_inlined\_view::&amp;lt;8&amp;gt;(data),
         9 =&amp;gt; make\_inlined\_view::&amp;lt;9&amp;gt;(data),
         10 =&amp;gt; make\_inlined\_view::&amp;lt;10&amp;gt;(data),
         11 =&amp;gt; make\_inlined\_view::&amp;lt;11&amp;gt;(data),
         12 =&amp;gt; make\_inlined\_view::&amp;lt;12&amp;gt;(data),
         _ =&amp;gt; {
           // handle long string
}}}&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id="section-24-end-to-end-query-performance"&gt;Section 2.4: End-to-end query performance&lt;/h3&gt;

&lt;p&gt;In the previous sections, we went out of our way to make sure loading StringViewArray is faster than StringArray. Before going further, we wanted to verify if obsessing about reducing copies and function calls has actually improved end-to-end performance in real-life queries. To do this, we evaluated a ClickBench query (Q20) in DataFusion that counts how many URLs contain the word &lt;code class="language-markup"&gt;"google"&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt; SELECT COUNT(*) FROM hits WHERE "URL" LIKE '%google%'; &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is a relatively simple query; most of the time is spent on loading the “URL” column to find matching rows. The query plan looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt; Projection: COUNT(*) [COUNT(*):Int64;N]
  Aggregate: groupBy=[[]], aggr=[[COUNT(*)]] [COUNT(*):Int64;N]
    Filter: hits.URL LIKE Utf8("%google%")
      TableScan: hits &lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We ran the benchmark in the DataFusion repo like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;cargo run --profile release-nonlto --bin dfbench -- clickbench --queries-path benchmarks/queries/clickbench/queries.sql --iterations 3 --query 20 --path benchmarks/data/hits.parquet --string-view&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With StringViewArray we saw a 24% end-to-end performance improvement, as shown in Figure 7. With the &lt;code class="language-markup"&gt;--string-view&lt;/code&gt; argument, the end-to-end query time is 944.3 ms, 869.6 ms, 861.9 ms (three iterations). Without &lt;code class="language-markup"&gt;--string-view&lt;/code&gt;, the end-to-end query time is 1186.1 ms, 1126.1 ms, 1138.3 ms.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/8b513715c2c44d879b78c93587261a0e/384d6245cbbf9a3919f55b1ea251ca04/unnamed.png" alt="" /&gt;
Figure 7: StringView reduces end-to-end query time by 24% on ClickBench Q20.&lt;/p&gt;

&lt;p&gt;We also double-checked with detailed profiling and verified that the time reduction is indeed due to faster Parquet loading.&lt;/p&gt;

&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;In this first blog post, we have described what it took to improve the performance of simply reading strings from Parquet files using StringView. While this resulted in real end-to-end query performance improvements, in our next post, we explore additional optimizations enabled by StringView in DataFusion, along with some of the pitfalls we encountered while implementing them.&lt;/p&gt;

&lt;p&gt;Note: Thanks to InfluxData for sponsoring this work as a summer intern project.&lt;/p&gt;
</description>
      <pubDate>Thu, 22 Aug 2024 08:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/</guid>
      <category>Developer</category>
      <author>Andrew Lamb, Xiangpeng Hao (InfluxData)</author>
    </item>
    <item>
      <title>How Good is Parquet for Wide Tables (Machine Learning Workloads) Really?</title>
      <description>&lt;p&gt;In this blog post, we quantify the metadata overhead of &lt;a href="https://parquet.apache.org/"&gt;Apache Parquet&lt;/a&gt; files for storing thousands of columns, as well as space and decode time using &lt;a href="https://crates.io/crates/parquet"&gt;parquet-rs&lt;/a&gt;, implemented in Rust. We conclude that while technical concerns about Parquet metadata are valid, the actual overhead is smaller than generally recognized. In fact, optimizing writer settings and simple implementation tweaks can reduce overhead by 30-40%. With significant additional implementation optimization, decode speeds could improve by up to 4x.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/3bcf8d6b3394451aa2cf473b098edfdf/8c09495bed20a3cf13ae5472d4c009be/unnamed.png" alt="" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1&lt;/strong&gt;: Metadata decode time for 1000 Parquet Float64 columns using &lt;a href="https://crates.io/crates/parquet"&gt;parquet-rs&lt;/a&gt;. Configuring the writer to omit statistics improves decode performance by 30% (9.1ms → 6.9 ms). Standard software engineering optimization techniques improve the decode performance by another 40% (6.9ms → 4.1ms / 9.1ms → 6.4ms)&lt;/p&gt;

&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Recent assertions have suggested that Parquet is not suitable for wide tables with 1000s of columns, often found in &lt;a href="https://huggingface.co/docs/datasets-server/en/parquet#list-parquet-files"&gt;machine learning workloads&lt;/a&gt;. Proposals for new file formats, such as &lt;a href="https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf"&gt;BtrBlocks&lt;/a&gt;, &lt;a href="https://blog.lancedb.com/lance-v2/"&gt;Lance V2&lt;/a&gt;, and &lt;a href="https://github.com/facebookincubator/nimble"&gt;Nimble&lt;/a&gt;&lt;sup&gt;1&lt;/sup&gt;, often accompany these assertions.&lt;/p&gt;

&lt;p&gt;Usually, &lt;a href="https://www.vldb.org/pvldb/vol17/p148-zeng.pdf"&gt;the stated rationale&lt;/a&gt; is that wide tables have “large” metadata, which takes a “long time” to decode, often longer than reading the data itself. Using &lt;a href="https://thrift.apache.org/"&gt;Apache Thrift&lt;/a&gt; to store the metadata means the &lt;a href="https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4"&gt;entire metadata&lt;/a&gt; payload must be decoded for each file, even when only a small subset of columns is required. It also appears to be common (though incorrect) to equate Parquet (the format) with a specific Parquet implementation (e.g., &lt;a href="https://github.com/apache/parquet-java"&gt;parquet-java&lt;/a&gt;) when evaluating performance.&lt;/p&gt;

&lt;p&gt;Leaving aside the fact that many query systems cache information from the Parquet metadata in a form suited for faster processing, we wanted quantitative information on how much of the purported metadata overhead is due to limitations in the Parquet format vs. how much is due to less optimized implementations or poorly configured settings of Parquet writers.&lt;/p&gt;

&lt;h2 id="background"&gt;Background&lt;/h2&gt;

&lt;p&gt;Parquet files include the metadata required to interpret the file. This metadata also instructs the reader to load only the portion of the file necessary to answer queries. More information on these techniques can be found in &lt;a href="https://www.influxdata.com/blog/querying-parquet-millisecond-latency/"&gt;Querying Parquet with Millisecond Latency&lt;/a&gt;. Typical Parquet files are GBs in size, but many queries read only a small portion, so the metadata is often critical to quickly finding the required data.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/4QBcJxLbh3Wmn57PhMK6QA/593454367e919fea392dc56f65d0d7e9/layout-of-parquet-files.png" alt="layout-of-parquet-files" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2&lt;/strong&gt;: Layout of Parquet files. The metadata is stored in the footer (at the end of the file) and contains the location of pages within the file and optional statistics such as min/max/null counts for each column chunk.&lt;/p&gt;

&lt;p&gt;As shown in Figure 2, the structure of Parquet metadata mirrors that of the Parquet file: It contains entries for each row group, and each entry contains information for each column chunk within that row group. This means that the metadata size is O(row_group * column) and grows linearly with both the number of row groups and the number of columns.&lt;/p&gt;

&lt;p&gt;In addition to the information required to decode each column’s data, such as starting offset and encoding type, the metadata can optionally store min, max, and null counts for each column chunk. Query engines, such as &lt;a href="https://datafusion.apache.org/"&gt;Apache DataFusion&lt;/a&gt; and &lt;a href="https://github.com/duckdb/duckdb"&gt;DuckDB&lt;/a&gt;, can use these statistics to skip decoding row groups and data pages entirely.&lt;/p&gt;

&lt;p&gt;The metadata is encoded in the &lt;a href="https://thrift.apache.org/"&gt;Apache Thrift&lt;/a&gt; format, which is similar to &lt;a href="https://protobuf.dev/"&gt;protobuf&lt;/a&gt;. Thrift uses variable-length encoding to achieve high space efficiency. Still, the variable-length encoding requires Parquet readers to fetch and potentially examine the entire metadata footer before reading any content. For example, it is not possible to jump directly to the location in the metadata required to read a single row group without starting at the beginning.&lt;/p&gt;

&lt;p&gt;Reading Parquet metadata in parquet-rs’s &lt;a href="https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/index.html"&gt;ArrowReader&lt;/a&gt; requires three steps (a minimal sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Load the metadata from storage to memory&lt;/li&gt;
  &lt;li&gt;Decode the Thrift-formatted data into in-memory structures: &lt;a href="https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaData.html"&gt;ParquetMetaData&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Build the Arrow &lt;a href="https://docs.rs/arrow/latest/arrow/datatypes/struct.Schema.html"&gt;Schema&lt;/a&gt; from the Parquet &lt;a href="https://docs.rs/parquet/latest/parquet/schema/types/struct.SchemaDescriptor.html"&gt;SchemaDescriptor&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
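
&lt;p&gt;A minimal sketch of these three steps with parquet-rs (the file name is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -&amp;gt; Result&amp;lt;(), Box&amp;lt;dyn std::error::Error&amp;gt;&amp;gt; {
    // Step 1: open the file so the footer can be loaded from storage
    let file = File::open("wide_table.parquet")?;
    // Steps 2 and 3: decode the Thrift footer into ParquetMetaData
    // and build the Arrow Schema from the Parquet SchemaDescriptor
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    println!("row groups:   {}", builder.metadata().num_row_groups());
    println!("arrow fields: {}", builder.schema().fields().len());
    Ok(())
}&lt;/code&gt;&lt;/pre&gt;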

&lt;p&gt;The time required to load the metadata from storage depends on the storage device and ranges from 100us (local SSD) to 200ms (S3)&lt;sup&gt;2&lt;/sup&gt;. As shown in Figures 4 and 6, decoding from Thrift into Rust structures is by far the most time-consuming activity once the data is in memory. This makes sense as the decoding inflates a tiny compact encoding (Thrift) into point-accessible in-memory Rust structures. Transforming the SchemaDescriptor into Arrow Schema also requires a small amount of CPU time.&lt;/p&gt;

&lt;h2 id="testbed"&gt;Testbed&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;: We experimented with parquet-rs&lt;sup&gt;3&lt;/sup&gt;, a Rust implementation of Parquet, version 51.0.0. We repeat each experiment five times for each file and report the average time of the last four executions to exclude the impact of caching. You can find the benchmark code &lt;a href="https://github.com/XiangpengHao/Parquet-Gym/blob/main/format-study/src/bin/wide_table_bench.rs"&gt;here&lt;/a&gt;. We ran the benchmark on an AMD 7600X processor clocked at 5.4 GHz with a 32 MB L3 cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workload&lt;/strong&gt;: We generated several Parquet files with between 10 and 100,000&lt;sup&gt;4&lt;/sup&gt; Float64 columns, mimicking machine learning workloads. As we focus on the metadata, we simply write the same repeated value multiple times for the data. Each Parquet file contains ten row groups, and because each row group includes all columns, the Parquet metadata encodes 10 * column_count individual ColumnChunk structures. To study the impact of including statistics, we tested three configurations: no statistics, chunk-level statistics, and page-level statistics (the default in parquet-rs).&lt;/p&gt;

&lt;h2 id="results"&gt;Results&lt;/h2&gt;
&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/016af77435144b8d9281820a69f0dd95/894bb421b43eea14020722cdbcb9c4de/unnamed.png" alt="" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 3&lt;/strong&gt;:  Metadata decode time and size for parquet-rs vs the number of Float64 columns in the Parquet file. Note both x and y axes are log scale.&lt;/p&gt;

&lt;p&gt;Figure 3 plots the relationship between metadata size and decode time as the number of columns increases from 10 to 100,000. As expected, the metadata size and decode time are linearly proportional to the number of columns in the Parquet file.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/0edc46b2f622419d9d52dadbc5ccf848/c481f32b333f96738fe28a1939d1538b/unnamed.png" alt="" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 4:&lt;/strong&gt; Metadata decode time and size for parquet-rs for different statistics levels. The metadata decode time chart (left) also illustrates the time breakdown between Thrift decoding and creating the Arrow Schema (see Figure 6 for a more detailed breakdown).
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/dc0bdef48fe0428b8307f0ac85b1675b/6de92725e5562e999f831c93c4284f4e/unnamed.png" alt="" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 5&lt;/strong&gt;: Average per-column decode time and metadata size. The x-axis shows the stats level; the y-axis shows the time and size per column.&lt;/p&gt;

&lt;p&gt;In Figures 4 and 5, we examined the impact of statistics on metadata decode speed and size. Specifically, we configured&lt;sup&gt;5&lt;/sup&gt; the parquet-rs writer in one of three modes (a configuration sketch follows the list):&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;none&lt;/strong&gt;: No statistics (&lt;a href="https://docs.rs/parquet/latest/parquet/file/properties/enum.EnabledStatistics.html#variant.None"&gt;EnabledStatistics::None&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;chunk&lt;/strong&gt;: The writer stores min value, max value, and null count statistics for each column chunk, for each row group (&lt;a href="https://docs.rs/parquet/latest/parquet/file/properties/enum.EnabledStatistics.html#variant.Chunk"&gt;EnabledStatistics::Chunk&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;page&lt;/strong&gt;: (the default setting of parquet-rs). In addition to the statistics written at the chunk level, the writer also writes structures from the &lt;a href="https://github.com/apache/parquet-format/blob/master/PageIndex.md"&gt;Parquet Page Index&lt;/a&gt;, which can &lt;a href="https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes"&gt;speed up query processing&lt;/a&gt;  (&lt;a href="https://docs.rs/parquet/latest/parquet/file/properties/enum.EnabledStatistics.html#variant.Page"&gt;EnabledStatistics::Page&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;
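
&lt;p&gt;For reference, here is a minimal sketch of writing a file with statistics disabled via parquet-rs (the file name and single-column schema are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-java"&gt;use std::{fs::File, sync::Arc};
use arrow::array::{ArrayRef, Float64Array};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{EnabledStatistics, WriterProperties};

fn main() -&amp;gt; Result&amp;lt;(), Box&amp;lt;dyn std::error::Error&amp;gt;&amp;gt; {
    let col: ArrayRef = Arc::new(Float64Array::from(vec![1.0; 1024]));
    let batch = RecordBatch::try_from_iter([("f0", col)])?;

    // EnabledStatistics::None / Chunk / Page select the three modes above
    let props = WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::None)
        .build();

    let file = File::create("no_stats.parquet")?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), Some(props))?;
    writer.write(&amp;amp;batch)?;
    writer.close()?;
    Ok(())
}&lt;/code&gt;&lt;/pre&gt;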

&lt;p&gt;Figure 5 charts these settings’ average per-column decode time and metadata size impact. Note that we expect the impact of disabling statistics for string columns to be even more significant than our float-based measurements, as string statistics values are typically larger.&lt;/p&gt;

&lt;p&gt;Our findings are as follows:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;With no statistics, metadata decodes 30% faster and is 30% smaller than the default level.&lt;/li&gt;
  &lt;li&gt;Page-level statistics only add minor overhead on top of chunk-level stats&lt;sup&gt;6&lt;/sup&gt;.&lt;/li&gt;
  &lt;li&gt;Building the Arrow schema takes negligible time.&lt;/li&gt;
  &lt;li&gt;Decoding Thrift takes twice as long as transforming Thrift structs to parquet-rs structs.&lt;/li&gt;
  &lt;li&gt;With minimal metadata (stats level &lt;strong&gt;none&lt;/strong&gt;), each additional column adds 5us to decode time and 700 bytes to storage requirements.&lt;/li&gt;
  &lt;li&gt;Our measurements are consistent, and the error bars are small.&lt;/li&gt;
  &lt;li&gt;parquet-rs 51.0.0 can decode Parquet metadata at 100MB/s (10ms to decode each megabyte of metadata).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our findings suggest that software optimization efforts focused on improving the efficiency of Thrift decoding and Thrift to parquet-rs struct transformation will directly translate to improving overall metadata decode speed. 
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/73ffebc0d9324ec7952bdb6ca260c8d1/f8c498607b4ac14b6f22f3d38ef2f200/unnamed.png" alt="" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 6&lt;/strong&gt;: Detailed analysis of metadata decode time breakdown.&lt;/p&gt;

&lt;p&gt;Finally, we analyzed decoding using a profiler and plotted the results in Figure 6.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;61% of the time is spent decoding and building &lt;a href="https://docs.rs/parquet/latest/parquet/format/struct.FileMetaData.html"&gt;FileMetaData&lt;/a&gt;, which includes the Parquet schema.&lt;/li&gt;
  &lt;li&gt;31% of the time is spent building &lt;a href="https://docs.rs/parquet/latest/parquet/file/metadata/struct.RowGroupMetaData.html"&gt;RowGroupMetaData&lt;/a&gt;, which transforms decoded Thrift data structures into parquet-rs data structures.&lt;/li&gt;
  &lt;li&gt;7% of the time is spent building an Arrow schema.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/8e0eb8fdac274d288d2ba0df51759073/b599dde6f77ebc13520a3a36f228276b/unnamed.png" alt="" /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 7:&lt;/strong&gt; Simple software engineering optimizations (e.g., better allocator, optimized in-memory layout, and SIMD acceleration) improved the decoding throughput by up to 75%.&lt;/p&gt;

&lt;p&gt;Finally, we spent a few days prototyping simple &lt;a href="https://github.com/apache/arrow-rs/pull/5856"&gt;engineering optimizations&lt;/a&gt; (e.g., better allocator, optimized in-memory layout, and SIMD acceleration) to improve the decoding performance. Figure 7 shows that with even minor code changes (less than 100 lines of code, no change in API), we could improve decode performance by up to 75%. Other community members have also discussed and prototyped several &lt;a href="https://github.com/apache/arrow-rs/issues/5853"&gt;more involved changes&lt;/a&gt;, such as &lt;a href="https://github.com/apache/arrow-rs/issues/5775"&gt;reducing allocations&lt;/a&gt; (~2x improvement) and a more &lt;a href="https://github.com/apache/arrow-rs/issues/5854"&gt;optimized thrift decoder&lt;/a&gt; (another ~2x improvement).&lt;/p&gt;

&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;For workloads where metadata size and decode speed are of utmost concern, configuring the Parquet writer not to write statistics&lt;sup&gt;7&lt;/sup&gt; improves speed and space by 30% with no other software changes.&lt;/p&gt;

&lt;p&gt;While the Rust Parquet implementation is already reasonably fast for metadata decoding, the potential for significant speed improvements is within reach. By applying straightforward software engineering techniques, decoding speed can be enhanced by around a factor of 4. This investment in existing decoders is likely to yield a larger payoff than the creation of entirely new formats.&lt;/p&gt;

&lt;p&gt;Finally, in a more extensive overall system, where it is common to read data from object storage such as S3, we believe that metadata fetch and parsing is unlikely to be a significant bottleneck. Given that first-byte access latencies of 100ms-200ms are expected in object stores, by appropriately interleaving fetch and decode, metadata parsing is likely to be a small part of the overall execution time.&lt;/p&gt;

&lt;h2 id="future-work"&gt;Future work&lt;/h2&gt;

&lt;p&gt;There are several areas that we did not explore that deserve additional attention:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A similar performance comparison for other open source Parquet implementations (e.g. &lt;a href="https://github.com/apache/parquet-java"&gt;parquet-java&lt;/a&gt; and &lt;a href="https://github.com/apache/arrow/tree/main/cpp/src/parquet"&gt;parquet-cpp&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;A similar study of Parquet metadata size and decode speed for String / Binary columns. We expect the benefits from disabling statistics and optimized decoder implementations to be substantially higher for such columns because the values stored in the statistics are significantly larger.&lt;/li&gt;
  &lt;li&gt;A similar study with newly proposed formats like &lt;a href="https://blog.lancedb.com/lance-v2/"&gt;Lance V2&lt;/a&gt; and &lt;a href="https://github.com/facebookincubator/nimble"&gt;Nimble&lt;/a&gt; would help us understand how much better they are at handling large numbers of columns and what other tradeoffs may exist. In particular, these new formats incorporate lightweight metadata/statistics (e.g., smaller, decoupled metadata) and/or allow partial decoding, i.e., decoding only the projected columns rather than the entire metadata, which should permit much faster decode times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We would like to thank Raphael Taylor-Davies, &lt;a href="https://x.com/jhorstmann23"&gt;Jörn Horstmann&lt;/a&gt;, and Paul Dix for their helpful comments on earlier versions of this post.&lt;/p&gt;

&lt;hr /&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;For more, see the &lt;a href="https://lists.apache.org/thread/8xmxc76nd00624qqps6s1qw6lhv1qwv5"&gt;discussion&lt;/a&gt; on the dev@parquet.apache.org mailing list.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;See charts from &lt;a href="https://www.vldb.org/pvldb/vol16/p2769-durner.pdf"&gt;Exploiting Cloud Object Storage for High-Performance Analytics&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Caveat: we are biased, being contributors and maintainers of parquet-rs.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Note that with the default writer settings, our testbed ran out of memory when writing the 100,000 column Parquet file. We found that the issue was resolved by setting the &lt;a href="https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.data_page_row_count_limit"&gt;data_page_row_count_limit&lt;/a&gt; &lt;a href="https://github.com/XiangpengHao/Parquet-Gym/blob/a3f91bb2cc2b5cd11880d13641937529edfae3bf/format-study/src/bin/generator.rs#L89"&gt;to 10,000&lt;/a&gt;. With the default (unlimited) data page row count, we &lt;a href="https://github.com/apache/arrow-rs/issues/5828"&gt;found&lt;/a&gt; the Parquet writer consumed over 80GB of memory. Another common criticism of using Parquet with wide tables is that writers require a large memory buffer; since this may simply be due to the default writer settings, we have started a &lt;a href="https://github.com/apache/arrow-rs/issues/5797"&gt;discussion&lt;/a&gt; about changing the default.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;By setting &lt;a href="https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.statistics_enabled"&gt;WriterProperties::statistics_enabled&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Note that the parquet-rs reader does not create Rust structs from the PageIndex structures by &lt;a href="https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_page_index"&gt;default&lt;/a&gt;, so the decode overhead would likely be higher if we decoded those structures as well.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Though of course this may impact query performance for workloads that would benefit from statistics (e.g., queries with predicates on the affected columns).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
</description>
      <pubDate>Tue, 18 Jun 2024 08:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/how-good-parquet-wide-tables/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/how-good-parquet-wide-tables/</guid>
      <category>Developer</category>
      <author>Xiangpeng Hao, Andrew Lamb (InfluxData)</author>
    </item>
    <item>
      <title>Making Most Recent Value Queries Hundreds of Times Faster</title>
      <description>&lt;p&gt;This post explains how databases optimize queries, which can result in queries running hundreds of times faster. While we focus on one specific query type that is important to InfluxDB 3, the optimization process we describe is the same for any database.&lt;/p&gt;

&lt;h1 id="optimizing-a-query-is-like-playing-with-lego"&gt;Optimizing a query is like playing with Lego&lt;/h1&gt;

&lt;p&gt;You can come up with different structures when playing with the same set of Lego pieces, as shown in Figure 1. While you often use the same basic bricks to build whatever structure you want, there are times when you need a different type of shape (e.g., a tiny star) for a specific project.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2RtVIx96OBI6OvTvsRxNbn/48174cf3d415f81308ffb55676cd35b8/lego-structure.jpg" alt="lego-structure" /&gt;&lt;/p&gt;

&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 1: Two different structures built from the same basic Lego squares and rectangles&lt;/p&gt;

&lt;p&gt;In a database, running a query means running a &lt;strong&gt;query plan,&lt;/strong&gt; a tree of different operators that process and stream data. Each operator is like a Lego brick: depending on how the operators are connected, they compute the same result but with different performance. Much of query optimization involves swapping or moving existing operators around to form a better query plan, but on rare occasions, a new special-case operator is needed to do the job better.&lt;/p&gt;

&lt;p&gt;Let’s walk through an example of optimizing a query by creating a specialized operator and recombining existing operators to form a query plan with superior performance.&lt;/p&gt;

&lt;h1 id="querying-the-most-recent-values"&gt;Querying the most recent value(s)&lt;/h1&gt;

&lt;p&gt;For a time series database, one common use case is managing signal data from many devices. A common question is: “&lt;em&gt;What is the last signal sent by a specified device (e.g., device number 10)?”&lt;/em&gt; The answer to this question (or variations of it) is often used to drive a UI or monitoring dashboard. Using SQL, a query that can answer this question is:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT   …
FROM     signal
WHERE    device = 10
  AND    time BETWEEN now() - interval 'X days' AND now()
ORDER BY time DESC
LIMIT    1;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The filter &lt;code&gt;time BETWEEN now() - interval 'X days' AND now()&lt;/code&gt; narrows down the question a bit: “&lt;em&gt;What is the last signal sent by device number 10 in the last X days?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;While this query is simple, actual queries can be more complicated (for example, “find the average of the last five values”), so our solution must handle these more general queries as well.&lt;/p&gt;

&lt;p&gt;It is also important that these queries return results in milliseconds, because every device owner requests values frequently. One challenge is that the owner does not know when the last signal happened: it could be five minutes ago or several months ago. Thus, the time range &lt;code&gt;X&lt;/code&gt; can be very large and the query runtime long. Unless users take great care writing their queries, a query will read and process substantial data, increasing its response time.&lt;/p&gt;

&lt;p&gt;Unlike traditional relational or time series databases, InfluxDB 3.0 stores data in &lt;a href="https://parquet.apache.org/"&gt;Parquet&lt;/a&gt; files rather than custom file formats and specialized indexes. Our mission was to make this class of queries run in milliseconds regardless of how large the time range X is without introducing special indexes.&lt;/p&gt;

&lt;h2 id="runtimes-before-and-after-improvements"&gt;Runtimes before and after improvements&lt;/h2&gt;

&lt;p&gt;Before explaining our approach, let us look at the results in Figure 2. Blue represents the normalized runtimes of queries over different time ranges before the improvements, and green represents the runtimes after. Queries time out after running for 30 units, so the actual runtimes of queries that reached 30 units were even higher. As the chart shows, our improvements made large-time-range queries run hundreds of times faster and brought the runtimes of all queries down to the level requested by our customers.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/7IfaM65vUEb0rJPNqMInvW/f0a9ccc91b9ca6af6f2dc80c674a47e9/Before-after.jpg" alt="Before-after" width="650" height="auto" /&gt;&lt;/p&gt;

&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 2:  Query runtimes of before and after Improvements&lt;/p&gt;

&lt;p&gt;Let’s move on to how we achieved this.&lt;/p&gt;

&lt;h2 id="query-plan-before-improvements"&gt;Query plan before improvements&lt;/h2&gt;

&lt;p&gt;Figure 3 shows a simplified version of the query plan before the improvements, using a &lt;strong&gt;sort merge algorithm&lt;/strong&gt;. We read a query plan from the bottom up. The input includes four files that four corresponding &lt;strong&gt;scan&lt;/strong&gt; operators read in parallel. Each scan output goes through a corresponding &lt;strong&gt;sort&lt;/strong&gt; operator that orders the data by descending timestamp. The four sorted output streams are sent to a &lt;strong&gt;merge&lt;/strong&gt; operator that combines them into a single sorted stream and stops after producing the number of rows specified by the limit, which is &lt;code&gt;1&lt;/code&gt; in this example. There are many more files in the &lt;code&gt;signal&lt;/code&gt; table, but InfluxDB first prunes unnecessary files based on the query’s filters.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding-top: 30px;" src="//images.ctfassets.net/o7xu9whrs0u9/7KskmsN8d9l2cvMnrmUcqg/90c20bd031cb0851c9f47535b9864689/Query_plan_using_sort_merge_algorithm.png" alt="Query plan using sort merge algorithm" /&gt;&lt;/p&gt;
&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 3: Query plan using sort merge algorithm&lt;/p&gt;

&lt;p&gt;When files overlap, InfluxDB may need to &lt;a href="https://www.influxdata.com/blog/using-deduplication-eventually-consistent-transactions/"&gt;deduplicate&lt;/a&gt; data. Figure 4 shows a more accurate plan that sorts the data of the overlapping files, File 2 and File 3, together and deduplicates it before sending data to the sort and merge operators.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding-top: 30px;" src="//images.ctfassets.net/o7xu9whrs0u9/7FP49Aj7Z2w6eXcXUupC53/a8840af34737b625e0d7d9613ac83b4c/InfluxDB_query_plan_using_sort_merge_algorithm_but_grouping_overlapped_files_first.png" alt="InfluxDB query plan using sort merge algorithm but grouping overlapped files first" /&gt;&lt;/p&gt;
&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 4: InfluxDB query plan using sort merge algorithm but grouping overlapped files first&lt;/p&gt;

&lt;p&gt;The optimization described in the next section only depends on the operators at the top of the plan, and thus, the simplified plan in Figure 5 more clearly illustrates the solution. Note that we omitted many other details of the plan—for example, the Sort operator does not sort the &lt;em&gt;entire&lt;/em&gt; file but simply retains the &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.TopK.html#background"&gt;Top “K” rows&lt;/a&gt; due to the &lt;code&gt;LIMIT 1&lt;/code&gt; in the query.&lt;/p&gt;

&lt;p&gt;Note that the data streams going into the merge operator do not overlap.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding-top: 30px;" src="//images.ctfassets.net/o7xu9whrs0u9/3YzjJQXB2uW2ZJEh6N5T7F/5c62e5bbeb56e3dd0988be2cab18bcd6/Figure-5.png" alt="Figure-5" /&gt;&lt;/p&gt;
&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 5: The top part of the plan that includes non-overlapped data streams to merge operator&lt;/p&gt;

&lt;h2 id="analyzing-the-plan-and-identifying-improvements"&gt;Analyzing the plan and identifying improvements&lt;/h2&gt;

&lt;p&gt;Normally, when merging multiple streams, all inputs must be read before producing any output (as the first row might come from any of the inputs). In the plan above, this implies that we must read and sort all the input streams. However, if we know the time ranges of the streams do not overlap, we can simply read and sort the streams one by one, stopping once we find the required number of rows. Not only is this less work than merging the data, but if the number of required rows is small, it is likely that only a single stream must be read.&lt;/p&gt;

&lt;p&gt;Thankfully, InfluxDB has statistics about the time ranges of the data in each file before reading them, and it groups overlapping files to produce non-overlapping streams, as shown in Figure 5 (a sketch of this grouping follows below). So, we can apply this observation to make a faster query plan without additional indexes or statistics. However, reading streams one by one and stopping when the limit is hit is no longer a &lt;strong&gt;merge&lt;/strong&gt;. We needed a new operator.&lt;/p&gt;
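
&lt;p&gt;The grouping step itself is a simple interval sweep over the files’ time ranges. Below is a minimal sketch of the idea (not InfluxDB’s actual code, and with made-up time ranges echoing Figure 4, where File 2 and File 3 overlap):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-rust"&gt;// Group files whose time ranges overlap so that each resulting group
// covers a time range disjoint from every other group.
#[derive(Debug, Clone, Copy)]
struct TimeRange {
    min: i64,
    max: i64,
}

fn group_overlapping(mut files: Vec&amp;lt;TimeRange&amp;gt;) -&amp;gt; Vec&amp;lt;Vec&amp;lt;TimeRange&amp;gt;&amp;gt; {
    // Sweep the files in start-time order; a file joins the current
    // group while it overlaps the group's running maximum timestamp.
    files.sort_by_key(|f| f.min);
    let mut groups: Vec&amp;lt;Vec&amp;lt;TimeRange&amp;gt;&amp;gt; = Vec::new();
    let mut group_max = i64::MIN;
    for f in files {
        if groups.is_empty() || f.min &amp;gt; group_max {
            groups.push(Vec::new()); // disjoint from all earlier files
        }
        group_max = group_max.max(f.max);
        groups.last_mut().unwrap().push(f);
    }
    groups
}

fn main() {
    let groups = group_overlapping(vec![
        TimeRange { min: 0, max: 9 },   // File 1
        TimeRange { min: 10, max: 25 }, // File 2 (overlaps File 3)
        TimeRange { min: 20, max: 30 }, // File 3
        TimeRange { min: 40, max: 50 }, // File 4
    ]);
    // Three groups: [File 1], [File 2 + File 3], [File 4]
    assert_eq!(groups.len(), 3);
}
&lt;/code&gt;&lt;/pre&gt;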

&lt;h2 id="new-query-plan"&gt;New query plan&lt;/h2&gt;

&lt;p&gt;With the observations above in mind, Figure 6 illustrates the new query plan:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The non-overlapped data streams are sorted by time, descending.&lt;/li&gt;
  &lt;li&gt;A new operator, &lt;strong&gt;ProgressiveEval&lt;/strong&gt;, replaces the &lt;strong&gt;merge&lt;/strong&gt; operator.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The new ProgressiveEval operator pulls data from its input streams sequentially and stops when it reaches the requested limit. The big difference between the &lt;strong&gt;ProgressiveEval&lt;/strong&gt; and &lt;strong&gt;merge&lt;/strong&gt; operators is that the &lt;strong&gt;merge&lt;/strong&gt; operator can only start merging data &lt;strong&gt;after all&lt;/strong&gt; its input &lt;strong&gt;sort&lt;/strong&gt; operators complete, while &lt;strong&gt;ProgressiveEval&lt;/strong&gt; can start pulling data immediately after the &lt;strong&gt;first sort&lt;/strong&gt; operator finishes.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding-top: 30px;" src="//images.ctfassets.net/o7xu9whrs0u9/2HpuKw5qq5D2hmwxBpw9lS/9b247c9b8fecb148c617285f7ae72b23/Figure-6.png" alt="Figure-6" /&gt;&lt;/p&gt;
&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 6: Optimized Query plan reads data progressively, stopping early when the limit is reached&lt;/p&gt;

&lt;p&gt;When the query plan in Figure 6 executes, it only runs the operators shown in Figure 7. InfluxDB has a pull-based executor (&lt;a href="https://docs.rs/datafusion/latest/datafusion/index.html#execution"&gt;based on Apache Arrow DataFusion’s Execution&lt;/a&gt;), which means that when &lt;strong&gt;ProgressiveEval&lt;/strong&gt; starts, it asks the first &lt;strong&gt;sort&lt;/strong&gt;, which in turn asks its inputs for data. The &lt;strong&gt;sort&lt;/strong&gt; then performs the sort and sends the sorted results up to ProgressiveEval.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding-top: 30px;" src="//images.ctfassets.net/o7xu9whrs0u9/pkPxbYrsfCye8tK8TY8MN/ee1c6e11d732414505eab89b87f5b365/Figure-7.png" alt="Figure-7" /&gt;&lt;/p&gt;
&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 7: The &lt;b&gt;needed&lt;/b&gt; execution operators if the latest signal of the specified device is in the latest-time-range file&lt;/p&gt;

&lt;p&gt;Due to the &lt;code&gt;device = 10&lt;/code&gt; predicate, the query filters data while scanning, and we do not know which file contains the latest signal from device number 10. In addition, because it takes time for each &lt;strong&gt;sort&lt;/strong&gt; operator to complete, when &lt;strong&gt;ProgressiveEval&lt;/strong&gt; pulls data from a stream, it also starts executing the &lt;em&gt;next&lt;/em&gt; stream to prefetch data that will be necessary if the first stream doesn’t contain the desired rows.&lt;/p&gt;

&lt;p&gt;Figure 8 shows that when pulling data from Stream 1 of the first &lt;strong&gt;Sort&lt;/strong&gt;, the second &lt;strong&gt;Sort&lt;/strong&gt; executes simultaneously so that data from Stream 2 is ready if Stream 1 does not include the requested data from device number 10. If the data from device number 10 is in Stream 1, &lt;strong&gt;ProgressiveEval&lt;/strong&gt; stops as soon as it hits the limit and cancels Stream 2. If &lt;strong&gt;ProgressiveEval&lt;/strong&gt; pulls data from Stream 2, it also begins pre-executing the &lt;strong&gt;Sort&lt;/strong&gt; of Stream 3, and so on.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding-top: 30px;" src="//images.ctfassets.net/o7xu9whrs0u9/1kcqJqhvYHLRyizKiDF397/bac21e45feb01dc60c681dd0499e2dfa/Figure-8.png" alt="Figure-8" /&gt;&lt;/p&gt;
&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 8: The &lt;b&gt;actual&lt;/b&gt; execution operators if the latest signal of the specified device is in the latest-time-range file&lt;/p&gt;

&lt;h2 id="analyzing-the-benefits-of-the-improvements"&gt;Analyzing the benefits of the improvements&lt;/h2&gt;

&lt;p&gt;Let’s compare the original plan in Figure 5 and the optimized plan in Figure 8:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Returns Results Faster&lt;/strong&gt;: The original plan must scan and sort all files that may contain data, regardless of the number of rows needed, before producing results. Thus, the longer the time range, the more files there are to read, and the slower the original plan is. This explains why our results show the largest improvements for the longest time ranges.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Fewer Resources and Improved Concurrency&lt;/strong&gt;: In addition to producing data more quickly, the optimized plan requires far less memory and CPU—it typically scans and sorts only two files (the most recent one, plus the prefetched next most recent one). This means more queries can run concurrently with the same resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id="type-of-queries-that-benefit-from-this-work"&gt;Type of queries that benefit from this work&lt;/h1&gt;

&lt;p&gt;At the time of writing (March 2024), this optimization only applies to one type of query: &lt;strong&gt;“What are the most/least recent values&lt;/strong&gt; …?” In other words, the SQL of the query must include &lt;code&gt;ORDER BY time DESC/ASC LIMIT n&lt;/code&gt;, where ‘&lt;em&gt;n&lt;/em&gt;’ can be any number and time can be ordered ascending or descending. All other supported SQL queries still work but may not benefit from this optimization. We continue to work on improving them.&lt;/p&gt;

&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;This optimization not only makes most-recent-value queries faster but also reduces resource usage and increases the concurrency level of the system. In general, if a query plan includes a &lt;strong&gt;sort merge&lt;/strong&gt; on &lt;strong&gt;potentially non-overlapping data streams&lt;/strong&gt;, this optimization is applicable. We have found many query plans in this category and are working on improving them.&lt;/p&gt;

&lt;p&gt;We would like to thank Paul Dix for suggesting this design based on the progressive scan behavior of Elastic in the ELK stack.&lt;/p&gt;
</description>
      <pubDate>Mon, 18 Mar 2024 08:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/making-recent-value-queries-hundreds-times-faster/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/making-recent-value-queries-hundreds-times-faster/</guid>
      <category>Developer</category>
      <author>Nga Tran, Andrew Lamb (InfluxData)</author>
    </item>
  </channel>
</rss>
