Aggregating Millions of Groups Fast in Apache Arrow DataFusion
TL;DR: Grouped aggregations are a core part of any analytic tool, creating understandable summaries of huge data volumes. Apache Arrow DataFusion’s parallel aggregation capability is 2-3x faster in version 28.0.0 for queries with a large number (10,000 or more) of groups....
InfluxDB 3.0: System Architecture
InfluxDB 3.0 (previously known as InfluxDB IOx) is a (cloud) scalable database that offers high performance for both data loading and querying, and focuses on time series use cases. This article describes the system architecture of the database. Figure 1 shows...
Metrics, Logs and Traces: More Similar Than They Appear?
This article was originally published in The New Stack and is reposted here with permission. Metrics, logs and traces require different approaches for storage and querying, making it a challenge to use a single solution. But InfluxDB is working to consolidate them into one....
Querying Parquet with Millisecond Latency
We believe that querying data in Apache Parquet files directly can achieve similar or better storage efficiency and query performance than most specialized file formats. While it requires significant engineering effort, the benefits of Parquet’s open format and broad ecosystem support...
Rust Object Store Donation
Today we are happy to officially announce that InfluxData has donated a generic object store implementation to the Apache Arrow project. Using this crate, the same code can easily interact with AWS S3, Azure Blob Storage, Google Cloud Storage, local files,...
Using Rustlang's Async Tokio Runtime for CPU-Bound Tasks
This article was originally published in The New Stack on January 14, 2022 and is being republished here with permission. Despite the term async and its association with asynchronous network I/O, this blog post argues that the Tokio runtime at the heart of...