How Apache Arrow is Changing the Big Data Ecosystem
Charles Mahler /
Jan 05, 2023
This article was originally published in The New Stack and is reposted here with permission.
Arrow makes analytics workloads more efficient for modern CPU and GPU hardware, which makes working with large data sets easier and less costly.
One of the biggest challenges of working with big data is the performance overhead involved with moving data between different tools and systems as part of your data processing pipeline.
Different programming languages, file formats and network protocols all have different ways of representing the same data in memory. The process of serializing and deserializing the same data into a different representation at potentially each step in a data pipeline makes working with large amounts of data slower and more costly in terms of hardware.
The solution to this problem is to create what could be seen as a lingua franca for data, which tools and programming languages could use as a common standard for transferring and manipulating large amounts of data efficiently. One proposed implementation of this concept that has started to gain major adoption is Apache Arrow.
What is Apache Arrow?
Apache Arrow is an open source project intended to provide a standardized columnar memory format for flat and hierarchical data. Arrow makes analytics workloads more efficient for modern CPU and GPU hardware, which makes working with large data sets easier and less costly.
Apache Arrow went live in 2016 and over time has grown in scope and features, many being formerly independent projects that were integrated into the core Arrow project, like DataFusion and Plasma.
The overall goal for Apache Arrow can be summarized as trying to do for OLAP workloads what ODBC/JDBC did for OLTP workloads, by creating a common interface for different systems working with analytics data.
Benefits of Apache Arrow
The primary benefit of adopting Arrow is performance. With Arrow, you no longer need to serialize and deserialize your data when moving it around between different tools and languages, because they can all use the Arrow format. This is especially useful at scale when you need multiple servers to process data.
Here’s an example of performance gains from Ray, a Python framework for managing distributed computing:
Not only is converting the data to the Arrow format faster than using an alternative for Python like Pickle, but the even bigger performance gains are when it comes to deserialization, which is orders of magnitude faster.
Due to Arrow’s column-based format, processing and manipulating data is also faster because it has been designed for modern CPUs and GPUs, so that data can be processed in parallel and take advantage of things like SIMD (single instruction, multiple data) for vectorized processing.
Arrow also provides for zero-copy reads, so memory requirements are reduced in situations where you want to transform and manipulate the same underlying data in different ways.
Bulk data ingress and egress via Parquet
Arrow integrates well with Apache Parquet, another column-based format for data focused on persistence to disk. Arrow and Parquet combined makes managing the life cycle and movement of data from RAM to disk much easier and more efficient.
Another benefit of Apache Arrow is the ecosystem. More functionality and features are being added over time, and performance is being improved as well. As you will see in upcoming sections, in many cases companies are donating entire projects to Apache Arrow and contributing heavily to the project itself.
Apache Arrow benefits almost all companies because it makes moving data between systems easier. This means that by adding Arrow support to a project, it becomes easier for developers to migrate or adopt that technology as well.
Apache Arrow features
Now, let’s take a look at some of the key features and different components of the Apache Arrow project.
Arrow columnar format
The Arrow columnar format is the core of the project and defines the actual specification for how data should be structured in-memory. From a performance perspective, the key features delivered by this format are:
- Data is able to be read sequentially.
- Constant time random access.
- SIMD and vector processing support.
- Zero copy reads.
There are multiple client libraries for several languages to make it easy to get started with Arrow.
Arrow Flight is an RPC (remote procedure call ) framework added to the project to allow easy transfer of large amounts of data across networks without the overhead of serialization and deserialization. The compression provided by Arrow also means that less bandwidth is consumed compared to less-optimized protocols. Many projects use Arrow Flight to enable distributed computing for analytics and data science workloads.
Arrow Flight SQL
Arrow Flight SQL is an extension of Arrow Flight for interacting directly with SQL databases. While it is still considered experimental, features are being added rapidly. Recently a JDBC driver was added to the project, which means that any database that supports JDBC (Java Database Connectivity) or ODBC (Microsoft Open Database Connectivity) can now communicate with Arrow data through Flight SQL.
DataFusion is a query execution framework donated to Apache Arrow in 2019. DataFusion includes a query optimizer and execution engine with support for SQL and DataFrame APIs. It is commonly used for creating data pipelines, ETL processes and databases.
Projects using Apache Arrow
Many projects are adding integrations with Arrow to make adopting their tool easier or embedding components of Arrow directly into their projects to save themselves from duplicating work.
- InfluxDB IOx — InfluxDB’s new columnar storage engine IOx uses the Arrow format for representing data and moving data to and from Parquet. It also uses DataFusion to add SQL support to InfluxDB.
- Apache Parquet — Parquet is a file format for storing columnar data used by many projects for persistence. Parquet has support for vectorized reads and writes to and from Arrow.
- Dask — Dask is a parallel computing framework that makes it easy to scale Python code horizontally. Dask uses Arrow for accessing Parquet files.
- Ray — Ray is a framework that allows data scientists to process data, train machine learning models, then serve those models in production using a unified tool. Ray relies on Apache Arrow for moving data between components with minimal overhead.
- Pandas — Pandas is one of the most popular data analysis tools in the Python ecosystem. Pandas is able to read data stored in Parquet files by using Apache Arrow behind the scenes.
- TurboDC — TurboDC is a tool based on the ODBC interface that allows data scientists to efficiently access data stored in relational databases via Python. Arrow makes this more efficient by allowing the data to be transferred in batches rather than as single records.
A big trend in many different areas of software development is eliminating lock-in effects by improving interoperability. In the observability and monitoring space we can see this with projects like OpenTelemetry and in the big data ecosystem, we can see a similar effort with projects like Apache Arrow.
Developers who take advantage of Apache Arrow will not only save time by not reinventing the wheel, but will also gain access to the entire ecosystem of tools also using Arrow, which can make adoption by new users easier.