Introduction to Apache Arrow
Anais Dotis-Georgiou
Jan 09, 2023
A look at what Apache Arrow is, how it works, and some of the companies using it as a critical component in their architecture.
Over the past few decades, leveraging big datasets has required businesses to perform increasingly complex analysis. Advancements in query performance, analytics, and data storage are largely a result of greater access to memory. Demand, manufacturing process improvements, and technological advances all contributed to cheaper memory. Lower memory costs spurred the creation of technologies that support in-memory query processing for OLAP (online analytical processing) and data warehousing systems. OLAP is any software that performs fast multidimensional analysis on large volumes of data.
One project that is an example of these technologies is Apache Arrow. In this article, you will learn what Arrow is, its advantages, and how some companies and projects use Arrow.
What is Apache Arrow?
Apache Arrow is a framework for defining in-memory columnar data that every processing engine can use. It aims to be the language-agnostic standard for columnar memory representation to facilitate interoperability. Several open source leaders from projects including Impala, Spark, and Calcite developed it. Among the co-creators is Wes McKinney, creator of Pandas. He wanted to make Pandas interoperable with data processing systems, a problem that Arrow solves.
Apache Arrow technical breakdown
Apache Arrow achieved widespread adoption because it provides efficient columnar memory exchange. It provides zero-copy reads (the CPU does not copy data from one memory area to a second one), which reduces memory requirements and CPU cycles.
Because Arrow has a column-based format, processing and manipulating data is also faster (more on this in a later section). Its builders designed Arrow for modern CPUs and GPUs, so that it can process data in parallel and take advantage of things like single instruction/multiple data (SIMD), vectorized processing, and vectorized querying.
Companies and projects using Apache Arrow
Apache Arrow powers a wide variety of projects for data analytics and storage solutions, including:
Apache Spark is a large-scale, parallel processing data engine that uses Arrow to convert Pandas DataFrames to Spark DataFrames. This enables data scientists to port proof-of-concept (POC) models developed on small datasets over to large datasets.
Apache Parquet is a columnar storage format that’s extremely efficient. Parquet uses Arrow for vectorized reads. Vectorized readers make columnar storage even more efficient by batching multiple rows in a columnar format.
InfluxDB is a time series data platform. The new storage engine uses Arrow to support near unlimited cardinality use cases, to enable querying in multiple query languages (including Flux, InfluxQL, SQL, and more to come), and to offer interoperability with BI and data analytics tools.
Pandas is a data analytics toolkit built on top of Python. Pandas uses Arrow to offer read and write support for Parquet.
Apache Arrow and InfluxDB
InfluxData recently announced the arrival of its new storage engine built on the Apache ecosystem. Specifically, developers wrote the new engine in Rust on top of Apache Arrow, Apache DataFusion, and Apache Parquet. Apache Arrow helps InfluxDB support use cases with near unlimited cardinality (or dimensionality) by providing efficient columnar data exchange. Imagine that we write a handful of rows of time series data, each with field values, optional tag values, and a timestamp, to InfluxDB.
However, the engine stores the data in a columnar format, with one array per column (using generic names for the tag and field columns):

field1: 1i, 2i, 3i, 4i, 1i
field2: null, null, null, true, null
tag1: tagvalue1, tagvalue2, null, tagvalue1, null
tag2: null, null, null, tagvalue3, tagvalue3
tag3: null, null, null, tagvalue4, null
time: timestamp1, timestamp2, timestamp3, timestamp4, timestamp5
Storing data in a columnar format allows the database to group like data together for cheap compression. Specifically, "Apache Arrow defines an inter-process communication (IPC) mechanism to transfer a collection of Arrow columnar arrays (called a 'record batch')," as described in the Arrow FAQ. Additionally, time series data has a distinctive structure: the value of a time series depends on time, and each value tends to correlate with the values that preceded it. These attributes mean that InfluxDB can take greater advantage of record batch compression through dictionary encoding. Dictionary encoding lets InfluxDB avoid storing the duplicate values that occur frequently in time series data. It also enables vectorized query execution using SIMD instructions.
Contributions to Apache Arrow and the commitment to open source
In addition to a free tier of InfluxDB Cloud, InfluxData offers OSS versions of InfluxDB under a permissive MIT license. Open source offerings provide the community with the freedom to build their own solutions on top of the code and the ability to evolve the code, which creates opportunities for real impact. When smart people have access to good tools, they create impactful solutions. However, the true power of open source becomes apparent when developers not only provide open source code but also contribute to popular projects. Cross-organizational collaboration generates some of the most popular open source projects, such as TensorFlow, Kubernetes, Ansible, and Flutter. InfluxDB's storage engineers have contributed substantial work to Apache Arrow.
The InfluxDB IOx team manages the weekly releases of the https://crates.io/crates/arrow and https://crates.io/crates/parquet crates. They also help author and support DataFusion blog posts such as "Apache Arrow DataFusion Project Update" and "June 2022 Rust Apache Arrow and Parquet 16.0.0 Highlights".
Other contributions to Arrow include:
- Performance improvements at all levels.
- Making the Arrow crate safe by default and adding additional error checking.
To take advantage of all the advancements with the new InfluxDB storage engine, sign up here.
If you would like to contact the InfluxDB engine developers, join the InfluxData Community Slack and look for the #influxdb_iox channel.
Check out our other content on the new storage engine to learn more.
I hope this blog post inspires you to explore InfluxDB Cloud. If you need any help, please reach out using our community site or Slack channel. I'd love to hear about what you're trying to achieve and what features you'd like InfluxDB to have.
Finally, if you're developing a cool IoT application on top of InfluxDB, we'd love to hear about it, so make sure to share it on social using #InfluxDB! Or share your story in our community Slack channel and get a free InfluxDB hoodie.