Introduction to Apache Arrow
Anais Dotis-Georgiou
Jan 09, 2023
A look at what Apache Arrow is, how it works, and some of the companies using it as a critical component in their architecture.
Over the past few decades, leveraging big datasets has required businesses to perform increasingly complex analysis. Advancements in query performance, analytics, and data storage are largely a result of greater access to memory. Demand, manufacturing process improvements, and technological advances all contributed to cheaper memory. Lower memory costs spurred the creation of technologies that support in-memory query processing for OLAP (online analytical processing) and data warehousing systems. OLAP is any software that performs fast multidimensional analysis on large volumes of data.
One project that is an example of these technologies is Apache Arrow. In this article, you will learn what Arrow is, its advantages, and how some companies and projects use Arrow.
What is Apache Arrow?
Apache Arrow is a framework for defining in-memory columnar data that every processing engine can use. It aims to be the language-agnostic standard for columnar memory representation to facilitate interoperability. Several open source leaders from projects including Impala, Spark, and Calcite developed it. Among the co-creators is Wes McKinney, creator of Pandas. He wanted to make Pandas interoperable with data processing systems, a problem that Arrow solves.
Apache Arrow technical breakdown
Apache Arrow achieved widespread adoption because it provides efficient columnar memory exchange. It provides zero-copy reads (the CPU does not copy data from one memory area to a second one), which reduces memory requirements and CPU cycles.
Because Arrow has a column-based format, processing and manipulating data is also faster (more on this in a later section). Its builders designed Arrow for modern CPUs and GPUs, so that it can process data in parallel and take advantage of things like single instruction/multiple data (SIMD), vectorized processing, and vectorized querying.
Companies and projects using Apache Arrow
Apache Arrow powers a wide variety of projects for data analytics and storage solutions, including:
Apache Spark is a large-scale, parallel processing data engine that uses Arrow to convert Pandas DataFrames to Spark DataFrames. This enables data scientists to port proof-of-concept (POC) models developed on small datasets over to large datasets.
Apache Parquet is a columnar storage format that’s extremely efficient. Parquet uses Arrow for vectorized reads. Vectorized readers make columnar storage even more efficient by batching multiple rows in a columnar format.
InfluxDB is a time series data platform. The new storage engine uses Arrow to support near unlimited cardinality use cases, to enable querying in multiple query languages (including Flux, InfluxQL, SQL, and more to come), and to offer interoperability with BI and data analytics tools.
Pandas is a data analytics toolkit built on top of Python. Pandas uses Arrow to offer read and write support for Parquet.
Apache Arrow and InfluxDB
InfluxData recently announced the arrival of its new storage engine built on the Apache ecosystem. Specifically, developers wrote the new engine in Rust on top of Apache Arrow, Apache DataFusion, and Apache Parquet. Apache Arrow helps InfluxDB support use cases with near unlimited cardinality (or dimensionality) by providing efficient columnar data exchange. Imagine that we write a handful of rows of time series data, each with field values, optional tag values, and a timestamp, to InfluxDB.
However, the engine stores the data in a columnar format, with one array per column (using generic names for the tag and field columns):

field1: 1i, 2i, 3i, 4i, 1i
field2: null, null, null, true, null
tag1: tagvalue1, tagvalue2, null, tagvalue1, null
tag2: null, null, null, tagvalue3, tagvalue3
tag3: null, null, null, tagvalue4, null
time: timestamp1, timestamp2, timestamp3, timestamp4, timestamp5
Storing data in a columnar format allows the database to group like data together for cheap compression. Specifically, "Apache Arrow defines an inter-process communication (IPC) mechanism to transfer a collection of Arrow columnar arrays (called a 'record batch')," as described in the Arrow FAQ. Additionally, time series data has a distinctive structure: the value of a time series depends on time, and each value tends to correlate with the values that preceded it. These attributes mean that InfluxDB can take greater advantage of record batch compression through dictionary encoding. Dictionary encoding lets InfluxDB avoid storing the duplicate values that occur frequently in time series data. It also enables vectorized query execution using SIMD instructions.
Contributions to Apache Arrow and the commitment to open source
In addition to a free tier of InfluxDB Cloud, InfluxData offers OSS versions of InfluxDB under a permissive MIT license. Open source offerings provide the community with the freedom to build their own solutions on top of the code and the ability to evolve the code, which creates opportunities for real impact. When smart people have access to good tools, they create impactful solutions. However, the true power of open source becomes apparent when developers not only provide open source code but also contribute to popular projects. Cross-organizational collaboration generates some of the most popular open source projects, such as TensorFlow, Kubernetes, Ansible, and Flutter. InfluxDB's storage engineers have contributed substantial work to Apache Arrow.
The InfluxDB IOx team manages the weekly releases of the https://crates.io/crates/arrow and https://crates.io/crates/parquet crates. They also help author and support DataFusion blog posts such as "Apache Arrow DataFusion Project Update" and "June 2022 Rust Apache Arrow and Parquet 16.0.0 Highlights".
Other contributions to Arrow include:
- Performance improvements at all levels.
- Making the Arrow crate safe by default and adding additional error checking.
To take advantage of all the advancements with the new InfluxDB storage engine, sign up here.
If you would like to contact the InfluxDB engine developers, join the InfluxData Community Slack and look for the #influxdb_iox channel.
Check out our other content on the new storage engine to learn more.
I hope this blog post inspires you to explore InfluxDB Cloud. If you need any help, please reach out using our community site or Slack channel. I'd love to hear about what you're trying to achieve and what features you'd like InfluxDB to have.
Finally, if you're developing a cool IoT application on top of InfluxDB, we'd love to hear about it, so make sure to share it on social using #InfluxDB! Or share your story in our community Slack channel and get a free InfluxDB hoodie.