Aggregating Millions of Groups Fast in Apache Arrow DataFusion
TL;DR: Grouped aggregations are a core part of any analytic tool, creating understandable summaries of huge data volumes. Apache Arrow DataFusion’s parallel aggregation capability is 2-3x faster in version 28.0.0 for queries with a large number (10,000 or more) of groups....
Querying Parquet with Millisecond Latency
We believe that querying data in Apache Parquet files directly can achieve storage efficiency and query performance similar to or better than that of most specialized file formats. While it requires significant engineering effort, the benefits of Parquet’s open format and broad ecosystem support...
Rust Object Store Donation
Today we are happy to officially announce that InfluxData has donated a generic object store implementation to the Apache Arrow project. Using this crate, the same code can easily interact with AWS S3, Azure Blob Storage, Google Cloud Storage, local files,...