Apache Arrow, a specification for an in-memory columnar data format, and associated projects: Parquet for compressed on-disk data, Flight for highly efficient RPC, and other projects for in-memory query processing will likely shape the future of OLAP and data warehousing systems. This will mostly be driven by the promise of interoperability between projects, paired with massive performance gains for pushing and pulling data in and out of big data systems. With object storage like S3 as the common data lake, OLAP projects need a common data API, which Parquet represents. For data science and query workloads, they need a common RPC that is optimized for pulling many millions of records to do more complex analytical and machine learning tasks.
In this post, I’ll cover each of these areas and why I think the Apache Arrow umbrella of projects represents the common API around which current and future big data, OLAP, and data warehousing projects will collaborate and innovate. I’ll conclude with some thoughts on where these projects are and where things might be going.
Apache Arrow is an in-memory columnar data format. It is designed to take advantage of modern CPU architectures (like SIMD) to achieve fast performance on columnar data. It is ideal for vectorized analytical queries. The Arrow specification gives a standard memory layout for columnar and nested data that can be shared between processes and query processing libraries. It’s the base level building block for working with in-memory (or MMAP’d) data that ties everything together. It’s designed for zero-copy semantics to make moving data around as fast and efficient as possible.
Persistence, bulk data and Parquet
Data is the API. I’m sure others said it before me, but I said it myself back in 2010 when talking about service-oriented design and building loosely coupled systems that interacted through message buses like RabbitMQ and later Kafka. What I meant then was that the data you passed through message queues served as the API for services that integrated with each other through those queues. And like APIs, that data needs to have a common serialization format and its schema should be versioned. More broadly, I was trying to highlight the importance of data interchange, which is as important for data processing and analytic systems as it is for services.
Historically, the primary interface points that OLAP and data warehousing systems shared was some form of SQL (often limited) as a query interface and ODBC as the RPC API. SQL worked well enough on the query side, but ODBC isn’t very efficient for bulk data movement either in or out of these systems (UPDATE: it looks like this is improved by the work with turbodbc, but still not as good as native Arrow can deliver). Bulk data movement quickly becomes a must-have to do machine learning, data science, or to do more complex things outside of SQL’s capabilities. Early on, bulk data movement wasn’t a big concern for interoperability because every OLAP system had to care about its own storage and generally operated as an island. However, as data lakes like Hadoop became common, OLAP systems needed a common format to use other backend storage systems.
The need for a common file persistence format was accelerated by the rise of object storage and S3 like APIs as the shared data lake. Systems needed a way to push and pull data from object storage in bulk. While CSV and JSON files are user-friendly, they’re terribly inefficient in both size and throughput. Apache Parquet emerged out of Cloudera as a common file format for compressed columnar and hierarchical data. Parquet is now ubiquitous as an accepted format for input (and often for output) for data within big data systems like Hive, Drill, Impala, Presto, Spark, Kudu, Redshift, BigQuery, Snowflake, Clickhouse and others.
Development over the last few years have brought more implementations and usability for Parquet with increasing support for reading and writing files in other languages. Currently, it looks like C++, Python (with bindings to the C++ implementation), and Java have first class support in the Arrow project for reading and writing Parquet files. R looks like it has great support for reading,
but I’m not sure on the write side of things (UPDATE: R’s write support is great too as it uses the same C++ library). Ruby looks like it has bindings as well. There’s a partial Rust implementation for reading Parquet files. Adding first class support in the top languages will be a necessary step to cement Parquet’s ubiquity as a persistence format for compressed columnar data. However, the performance gains of Parquet along with the promise of interoperability with other systems will incentivize developers to get this done. Wes McKinney’s post from Oct 2019 comparing Parquet performance is a good review of what the gains look like.
RPC and moving data to where you need it
Now that we have a common format to move compressed data around that is persisted, we need a more efficient format for shipping this data around without the overhead of serialization and de-serialization. This is the goal of Apache Arrow Flight, a framework for fast data transport. It defines a gRPC API that ships Arrow Array data wrapped in FlatBuffers to describe metadata like the schema, dictionaries and the breaks between record batches (collections of Arrow Array data that match the same schema and row count). I can’t describe it better than Wes McKinney’s post I linked above so go read that if you need to know more.
There are a few exciting things about Flight. First, it’s fast. To try things out I wrote a simple test program in Rust to send out flight data to a Python consumer. The schema for the test data I generated looks like this:
let fields: Vec<Field> = vec![ Field::new("host", DataType::Utf8, false), Field::new("region", DataType::Utf8, false), Field::new("usage_user", DataType::Float64, true), Field::new("usage_system", DataType::Float64, true), Field::new("timestamp", DataType::Int64, false), ];
We have a table of data with five fields: two string, two float64, and one int64. I had the program generate 100 record batches each containing 100,000 rows for a total of 10M records. Each record batch represented test data from a single host, coming from one of two regions. I hooked this up to a simple Flight API that would get this data. Here’s an example skeleton of a Flight server written in Rust. I then had a Python script inside a Jupyter Notebook connect to this server on localhost and call the API. The code is incredibly simple:
cn = flight.connect(("localhost", 50051)) data = cn.do_get(flight.Ticket("")) df = data.read_pandas()
In three lines of Python (not including imports) we’ve connected to the Flight server, performed a get, and read that result into a Pandas DataFrame. The empty string I’m passing to flight.Ticket is where you can put a payload (like a JSON blob) for your API server to interpret the request and do a query or something like that. For the purposes of this test I just want to return the data I have in memory.
The get request on my localhost took about 311ms +/- 50ms on a few different runs. The data over the network comes in at around 452MB. The reading of that data into a Pandas Dataframe took about 1s +/- 100ms. So now we have 10 million records ready to work with in our favorite data science toolkit in less than 1.5 seconds. Given that my example data has many repeated strings, we could probably do better by using the Dictionary type in Arrow and DictionaryBatch in Flight. Rust support for that is currently lacking, but I’m working on submitting that myself over the coming week or so.
The second great thing about Flight is the potential integrations with other systems. It can act as a standard RPC across many languages helping in tasks like data science and distributed query execution. While Arrow is an Apache project with open governance, my outsider’s view is that it is driven primarily by the efforts of Wes McKinney and his team at Ursa Labs, a not-for-profit he established to push Arrow forward. Yes, there are others participating and contributing like folks from Dremio and Cloudera, among others, but Ursa’s efforts look like the driving force. Also of note, Hadley Wickham of R data science fame, is an advisor.
I bring Wes’ and Hadley’s involvement up because that means you have two of the biggest contributors in open source data science contributing to Arrow and its projects. This means that Python (Wes is the creator of Pandas) and R (Hadley is the creator of Tidyverse) have first class support for both Arrow and Flight and will evolve with both as they continue to get developed. This makes services that support Arrow and Flight an obvious choice for data scientists everywhere.
Querying fast, in memory
Much more nascent and yet to be fully developed are the efforts to add query functionality in a library that uses Arrow as its format. In March of 2019, Wes and other Arrow contributors wrote up some design requirements for a columnar query execution engine that uses Arrow. This was after Dremio contributed a project called Gandiva to live under the Arrow umbrella. Gandiva is an LLVM-based analytical expression compiler for Arrow.
It’s worth noting that both of these projects are written in C++. This means that like Arrow, they can be made to be embedded in any higher level language of your choice. There’s also a Rust native query engine for Arrow named DataFusion that was contributed last year by Andy Grove. Because it’s written in Rust it can also be made to embed in other higher level languages.
The efforts to build an Arrow native query engine are very early, but I think they close the loop on the set of functionality that the Arrow set of projects provide. Not only can you send large data sets back and forth across services and persist them, you’ll also be able to work with and query them with great performance, in-process in your language of choice.
These query engines also represent an area where more OLAP and Data Warehouse vendors may end up contributing and collaborating to share effort. Running a data processing system at scale is incredibly hard and requires a significant amount of code outside the query engine. This means that vendors will be able to contribute to a liberally licensed query engine without giving away their core intellectual property.
Where things are now and what it means
The Parquet format seems to be the most mature thing from this suite of tools. However, support for reading and writing Parquet files is still in early stages for many languages.
The Arrow in-memory columnar format, while also being around for a little while, is still maturing quickly. Their recent introduction of the Arrow C Data Interface should make it easier for library authors in other languages to interface with Arrow in-process libraries and projects.
The Flight protocol is still evolving and support is very early in various languages (Flight was introduced in October of last year). The query engines are even earlier than that.
Despite all of this, I’m very excited about where things could go with these projects. Together, they represent a set of tools that can be used to build interoperable components for data science, analytics, and working with data at scale. They’re also primitives that could be used to build a new kind of data warehouse and distributed OLAP engine.
The open source big data ecosystem has been dominated over the last 15 years by tools written in Java. Having a base layer of libraries and primitives written in C/C++ and Rust opens things up for a completely new ecosystem of tools. I’m betting that Rust will become a language of choice for building services that make Arrow and its parts accessible over the network and in distributed services. OLAP won’t just be about large scale analytical queries, but also about near real-time queries answered in milliseconds.
Developer tools and technology is a game of constant disruption. Using object storage like S3 as the data lake completely disrupted the Hadoop ecosystem in a short amount of time. Containerization and Kubernetes completely disrupted the orchestration and deployment ecosystem over night. It’s possible that another disruption in the data warehouse/big data ecosystem is coming with the maturity of Arrow and its related projects. It’ll be driven by a desire for increased interoperability and a new set of tools that can be embedded anywhere. Pair that with new design requirements for these systems to run in environments like Kubernetes, and we may be looking at a very different set of leaders in the space in the next five years.