DataFusion is an in-memory query planning, optimization, and execution framework. DataFusion was created in 2017 and donated to the Apache Arrow project in 2019. DataFusion is written in Rust and takes advantage of Arrow’s in-memory data model for performance and compatibility with other projects.

The long-term goal of DataFusion is to become an embedded query engine that can be used with any analytics application. It aims to provide SQL compatibility, a Pandas-style DataFrame API, the ability to create execution plans via API, and best-in-class query performance across all of these interfaces.
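
To make the three stages of planning, optimization, and execution concrete, here is a toy sketch in plain Python. It is not DataFusion's API or implementation: the `Scan`, `Filter`, and `Project` node classes, the `optimize` and `execute` functions, and the `readings` table are all hypothetical names invented for illustration. The sketch builds a small plan through an API, applies one simple rewrite rule (pushing a filter below a projection, toward the scan), and runs the plan over column-oriented in-memory data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

# Columnar in-memory table: column name -> list of values (hypothetical layout).
Table = Dict[str, List]


@dataclass(frozen=True)
class Scan:
    table: str  # name of the table to read from the catalog


@dataclass(frozen=True)
class Filter:
    child: "Plan"
    column: str
    predicate: Callable[[object], bool]


@dataclass(frozen=True)
class Project:
    child: "Plan"
    columns: Tuple[str, ...]


Plan = Union[Scan, Filter, Project]


def optimize(plan: Plan) -> Plan:
    """Toy rewrite rule: move a Filter below a Project (closer to the scan)."""
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        if plan.column in proj.columns:
            pushed = Filter(proj.child, plan.column, plan.predicate)
            return Project(optimize(pushed), proj.columns)
    if isinstance(plan, Filter):
        return Filter(optimize(plan.child), plan.column, plan.predicate)
    if isinstance(plan, Project):
        return Project(optimize(plan.child), plan.columns)
    return plan  # Scan has nothing to rewrite


def execute(plan: Plan, catalog: Dict[str, Table]) -> Table:
    """Evaluate a plan bottom-up against the in-memory catalog."""
    if isinstance(plan, Scan):
        return catalog[plan.table]
    if isinstance(plan, Filter):
        data = execute(plan.child, catalog)
        keep = [i for i, v in enumerate(data[plan.column]) if plan.predicate(v)]
        return {name: [vals[i] for i in keep] for name, vals in data.items()}
    if isinstance(plan, Project):
        data = execute(plan.child, catalog)
        return {name: data[name] for name in plan.columns}
    raise TypeError(f"unknown plan node: {plan!r}")


# Made-up sample data, for illustration only.
catalog = {
    "readings": {
        "sensor": ["a", "b", "a", "c"],
        "temp": [20.5, 31.0, 35.2, 18.9],
        "site": ["x", "x", "y", "y"],
    }
}

# Roughly: SELECT sensor, temp FROM readings WHERE temp > 30
plan = Filter(Project(Scan("readings"), ("sensor", "temp")), "temp", lambda t: t > 30)
optimized = optimize(plan)
result = execute(optimized, catalog)
print(result)
```

The rewritten plan returns the same rows as the original one; only the order of the operators changes. A production engine applies many such rules and executes over Arrow arrays rather than Python lists.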

Apache DataFusion features

DataFusion provides a number of features out of the box that make it ideal for developers working on analytics based applications. Here are some of the major features currently available through DataFusion:

SQL query planner with support for multiple SQL dialects
Pandas style DataFrame API
Native support for Parquet, CSV, JSON, and Avro files. DataFusion can be extended with file formats via API
Support for object storage like S3 and other object storage services with S3 compatible APIs

DataFusion also has a strong roadmap with plans for adding the following features in the future:

Support for nested data structures like fields, lists, and structs
Query optimizations for group by and aggregate functions
Ability to read from remote file systems with local caching being required

Apache DataFusion use cases

DataFusion is modular by design, which allows it to be embedded in larger applications and extended to fit each application’s specific needs. Here are some common ways that projects use DataFusion:

As a SQL query planner and optimizer whose plans can be mapped to different database query engines like PostgreSQL or MySQL
In ETL data processing pipelines
In analytics tools that want to provide their end users with a DataFrame API or SQL interface
In applications that want to take advantage of the Apache Arrow ecosystem

Apache DataFusion and InfluxDB

InfluxDB’s latest storage engine is built on Apache Arrow and uses Apache DataFusion as its foundational query engine. This provides native SQL querying capabilities in InfluxDB. InfluxData engineers actively contribute to the development of Apache DataFusion.
