Apache Arrow

Apache Arrow was founded in 2016 by developers of numerous open source data projects to bring together the database and data science communities to collaborate on a shared computational technology. It includes a language-agnostic software framework for developing data analytics applications that process columnar data. Its standardized column-oriented memory format is able to represent flat and hierarchical data for efficient analytic operations and reduced costs and is a more efficient approach when working with large sets of data. Columnar data representation can yield better compression and can also speed up certain queries because the compiler and CPU can do more parallel computing. It’s common for analytics systems to use Apache Arrow to process data stored in Apache Parquet files.

The Arrow project is split into 2 parts:

  1. A set of specifications for memory format
  2. Standard libraries for key programming languages

Apache Arrow works with Apache Parquet, Apache Flight SQL, Apache Spark, NumPy, PySpark, pandas, and other data processing libraries and includes native libraries in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust.

How Apache Arrow deframents Data Access

How Apache Arrow defragments Data Access

Advantages with Arrow

  • All systems utilize the same memory format
  • No overhead for cross-system communication
  • Interoperable (data exchange)
  • Embeddable (in execution engines, storage layers, etc.)