Apache Arrow

Apache Arrow was founded in 2016 by developers of numerous open source data projects to bring together the database and data science communities and collaborate on a shared computational technology. It provides a language-agnostic software framework for developing data analytics applications that process columnar data. Its standardized column-oriented memory format can represent both flat and hierarchical data, enabling efficient analytic operations and reducing the cost of working with large data sets.
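As a brief illustration, the following sketch (assuming the pyarrow library is installed) builds an in-memory, column-oriented table that holds both flat columns and a nested, list-valued column; the column names and values are illustrative only.

    import pyarrow as pa

    # Flat columns: 64-bit integers and UTF-8 strings.
    ids = pa.array([1, 2, 3], type=pa.int64())
    names = pa.array(["sensor-a", "sensor-b", "sensor-c"], type=pa.string())

    # A hierarchical column: each row holds a variable-length list of readings.
    readings = pa.array([[0.1, 0.2], [0.3], [0.4, 0.5, 0.6]],
                        type=pa.list_(pa.float64()))

    # Assemble the columns into a single column-oriented table.
    table = pa.table({"id": ids, "name": names, "readings": readings})
    print(table.schema)    # describes the columnar layout
    print(table.num_rows)  # 3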

The Arrow project is split into two parts:

  1. A set of specifications for the in-memory format
  2. Standard libraries for key programming languages

Apache Arrow works with Apache Parquet, Apache Spark, NumPy, PySpark, pandas, and other data processing libraries and includes native libraries in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.
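For example, here is a minimal sketch of Arrow sitting alongside NumPy and pandas in Python; the DataFrame contents are made up purely for illustration.

    import numpy as np
    import pandas as pd
    import pyarrow as pa

    # An ordinary pandas DataFrame backed by NumPy arrays.
    df = pd.DataFrame({"x": np.arange(5), "y": np.linspace(0.0, 1.0, num=5)})

    # Hand the data to Arrow and back again without changing its meaning.
    table = pa.Table.from_pandas(df)
    round_tripped = table.to_pandas()

    print(table.schema)
    print(round_tripped.equals(df))  # True: the round trip preserves the data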


How Apache Arrow defragments Data Access


Advantages of Arrow

  • All systems use the same in-memory format
  • No serialization or deserialization overhead for cross-system communication
  • Interoperable (data exchange between libraries and languages, sketched below)
  • Embeddable (in execution engines, storage layers, etc.)
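To make the interoperability point concrete, the sketch below exchanges a table through Arrow's IPC stream format, the same standardized byte layout that other Arrow implementations (C++, Java, Rust, and so on) can read without converting to a different memory layout; the table contents are illustrative.

    import pyarrow as pa
    import pyarrow.ipc as ipc

    table = pa.table({"id": pa.array([1, 2, 3]),
                      "value": pa.array([10.0, 20.0, 30.0])})

    # Producer side: write the table into the standardized Arrow stream format.
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    payload = sink.getvalue()  # an Arrow buffer that could cross a socket or pipe

    # Consumer side: read the same bytes back; no per-system conversion is needed,
    # because every Arrow implementation shares this memory format.
    reader = ipc.open_stream(payload)
    received = reader.read_all()
    print(received.equals(table))  # True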