What is Apache DataFusion?
DataFusion is an in-memory query planning, optimization, and execution framework. It was created in 2017 and donated to the Apache Arrow project in 2019. It is written in Rust and uses Arrow’s in-memory data model for performance and compatibility with other projects.
The long-term goal of DataFusion is to become an embedded query engine that can be used with any analytics application, offering SQL compatibility, a pandas-style DataFrame API, the ability to create execution plans via API, and best-in-class query performance across all of these APIs.
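To make the planning, optimization, and execution stages concrete, here is a minimal toy sketch in plain Python. It does not use DataFusion or its API: the `Scan`, `Filter`, and `Project` classes and the `optimize` and `execute` functions are hypothetical names invented for illustration. The sketch builds a small logical plan, applies one rewrite rule (moving a filter below a projection), and runs the result over in-memory rows.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical plan nodes for illustration only; not DataFusion types.
@dataclass
class Scan:
    rows: list  # in-memory table: a list of row dicts


@dataclass
class Filter:
    column: str
    predicate: Callable[[object], bool]
    child: object


@dataclass
class Project:
    columns: tuple
    child: object


def optimize(plan):
    """Rewrite Filter(Project(x)) into Project(Filter(x)) so that rows are
    discarded before they are projected."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Project) and plan.column in child.columns:
            pushed = optimize(Filter(plan.column, plan.predicate, child.child))
            return Project(child.columns, pushed)
        return Filter(plan.column, plan.predicate, child)
    if isinstance(plan, Project):
        return Project(plan.columns, optimize(plan.child))
    return plan  # Scan nodes are left unchanged


def execute(plan):
    """Evaluate a plan recursively and return a list of row dicts."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    if isinstance(plan, Filter):
        return [r for r in execute(plan.child) if plan.predicate(r[plan.column])]
    if isinstance(plan, Project):
        return [{c: r[c] for c in plan.columns} for r in execute(plan.child)]
    raise TypeError(f"unknown plan node: {plan!r}")


table = Scan([
    {"host": "a", "cpu": 0.91, "region": "eu"},
    {"host": "b", "cpu": 0.42, "region": "us"},
    {"host": "c", "cpu": 0.77, "region": "eu"},
])

# Planning: describe what to compute without running anything yet.
plan = Filter("cpu", lambda v: v > 0.5, Project(("host", "cpu"), table))

# Optimization: the filter moves below the projection.
optimized = optimize(plan)

# Execution: both plans return the same rows.
result = execute(optimized)
print(result)
```

A real engine such as DataFusion operates on Arrow columnar batches rather than row dictionaries and applies many more rewrite rules; the sketch only shows how the three stages fit together.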
Apache DataFusion features
DataFusion provides a number of features out of the box that make it well suited to developers working on analytics-based applications. Here are some of the major features currently available through DataFusion:
- SQL query planner with support for multiple SQL dialects
- pandas-style DataFrame API
- Native support for Parquet, CSV, JSON, and Avro files; additional file formats can be added via API
- Support for object storage such as S3 and other services with S3-compatible APIs
DataFusion also has a strong roadmap with plans for adding the following features in the future:
- Support for nested data structures like fields, lists, and structs
- Query optimizations for group by and aggregate functions
- Ability to read from remote file systems with local caching being required
Apache DataFusion use cases
DataFusion is modular by design, which allows it to be embedded in larger applications and extended to fit each application’s specific needs. Here are some common ways that projects use DataFusion:
- As a SQL query planner and optimizer whose plans can be mapped to different database query engines, such as PostgreSQL or MySQL
- In ETL data processing pipelines
- In analytics tools that want to provide their end users with a DataFrame API or SQL interface
- In applications that want to take advantage of the Apache Arrow ecosystem
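The first use case, mapping one plan onto different database engines, can be sketched as follows. This is an illustrative assumption about how such a mapping might look, not DataFusion's own API: the `Query` class and `to_sql` function are hypothetical, and the only dialect difference handled is identifier quoting (double quotes in PostgreSQL, backticks in MySQL).

```python
from dataclasses import dataclass


# Hypothetical, engine-neutral description of a simple query.
@dataclass
class Query:
    table: str
    columns: tuple
    limit: int


# Identifier quote character per target engine.
QUOTES = {"postgresql": '"', "mysql": "`"}


def to_sql(query, dialect):
    """Render the same query as SQL text for the given dialect."""
    quote = QUOTES[dialect]

    def ident(name):
        # Escape an embedded quote character by doubling it.
        return quote + name.replace(quote, quote * 2) + quote

    cols = ", ".join(ident(c) for c in query.columns)
    return f"SELECT {cols} FROM {ident(query.table)} LIMIT {query.limit}"


query = Query(table="metrics", columns=("host", "cpu"), limit=10)
print(to_sql(query, "postgresql"))
print(to_sql(query, "mysql"))
```

Keeping the query description separate from the rendering step is what lets one planner serve several back ends; a production system would also need to handle differences in functions, types, and syntax that this sketch ignores.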
Apache DataFusion and InfluxDB
InfluxDB’s latest storage engine is built on Apache Arrow and uses Apache DataFusion as its foundational query engine. This provides native SQL querying capabilities in InfluxDB. InfluxData engineers actively contribute to the development of Apache DataFusion.