How InfluxData and Dremio Leverage the Apache Ecosystem
By Anais Dotis-Georgiou / Sep 11, 2023 / InfluxDB
InfluxData and Dremio have always been at the forefront of embracing open source solutions to enhance their product offerings. This post discusses how both companies currently leverage the Apache Ecosystem and describes the downstream impact these powerful technologies have on their offerings.
InfluxData created and maintains InfluxDB, a time series platform. Users leverage InfluxDB 3.0 for a variety of use cases and industries including IoT monitoring, DevOps monitoring, FinTech, AgTech, manufacturing, and more.
Dremio is an open-source, data-as-a-service platform developed to simplify and accelerate data analytics. It enables business users to curate and analyze their data for business intelligence (BI) and other use cases in a self-service manner.
Advantages of the Apache Foundation
Both companies are part of the Apache Software Foundation (ASF), which provides significant benefits to the broader technology and open source communities.
Open source software: ASF is one of the largest hubs for open source software. Open source projects allow developers to modify, distribute, and study the source code, which can lead to more rapid innovation, more secure and higher-quality code, and a greater degree of customization.
Wide range of projects: ASF hosts hundreds of projects that span various domains. This diverse range of projects enables developers and organizations to find tools that fit their specific needs.
Community-driven development: Community volunteers develop Apache projects. This collaborative development model can lead to more comprehensive and innovative software solutions, as developers from around the world can contribute different perspectives and skills.
Stable and reliable: Software under the Apache license tends to be more stable and reliable because of the collaborative nature of its development and thorough review processes.
Commercial-friendly license: The Apache License is considered one of the most commercially friendly licenses available. It allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software under the terms of the license, even for commercial purposes. Furthermore, the license does not require a derivative work to be distributed under the same license.
Promotes standards: Apache projects often become the de facto standards in their respective domains, promoting interoperability and reducing fragmentation in the software industry.
Mentorship and incubator programs: ASF operates an Incubator project that helps new projects develop into successful, Apache Top-Level Projects. This provides mentorship and resources to developers looking to start a new project, which further aids innovation and growth in the open source world.
The Apache Software Foundation plays a significant role in promoting open source software, fostering innovation, and setting industry standards, all of which benefit the entire software development community, from individual developers to businesses like InfluxData and Dremio.
InfluxDB 3.0 and the Apache ecosystem
With the release of InfluxDB 3.0, InfluxData rebuilt the core of its database using the Apache Arrow ecosystem. This enabled the company to enhance performance and capabilities by integrating several key, powerful technologies into its platform, including:
Apache Arrow is a framework for defining in-memory columnar data. Arrow enables InfluxDB to virtually eliminate data cardinality limits, which in turn lets InfluxDB write all types of time series data, including metrics, logs, traces, and events. Highly performant analytics depend on efficient in-memory columnar storage, and Arrow's memory optimizations and efficient data exchange are what allow InfluxDB to deliver best-in-class analytics performance. If you're interested in seeing how InfluxDB can handle all of your time series data, check out the following post: OpenTelemetry Tutorial: Collect Traces, Logs & Metrics with InfluxDB 3.0, Jaeger & Grafana.
Parquet is a durable, column-oriented file format. It offers interoperability with almost all modern machine learning (ML) and analytics tools. DataFusion provides both a SQL and a DataFrame API for building logical query plans, optimizing queries, and executing them against Parquet files, and InfluxDB 3.0 supports SQL queries across this data ecosystem (Arrow, Parquet, and Flight). Parquet enables interoperability because many of the most popular languages, such as C++, Python, and Java, have first-class support in the Arrow project for reading and writing Parquet files. I'm personally extremely excited about the Python support. In the future, you'll be able to query Parquet files directly from InfluxDB and convert them into Pandas DataFrames, and vice versa. This means that InfluxDB interoperates with other tools that support Parquet, like Tableau, Power BI, Athena, Snowflake, Databricks, Spark, and more. Currently, you can use the InfluxDB 3.0 Python Client Library to query InfluxDB 3.0 and return Pandas DataFrames.
Arrow Flight is a “new general-purpose client-server framework to simplify high performance transport of large datasets over network interfaces.” It enables users to query InfluxDB 3.0 with SQL, thereby lowering the barrier to adoption and letting developers focus on gaining insights from their data rather than writing queries. For examples of how to query InfluxDB 3.0 directly with the Arrow Flight SQL Client, see this repo.
DataFusion is an “extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.” DataFusion is used to execute logical query plans, optimize queries, and serve as an execution engine capable of parallelization using threads. Many different projects and products use DataFusion, and it has a thriving community of contributors who add broad SQL support and sophisticated query optimizations. DataFusion supports both a Postgres-compatible SQL API and a DataFrame API. This means the new InfluxDB engine will support a large community of users from broader ecosystems that use SQL (and eventually Pandas DataFrames).
Dremio and the Apache ecosystem
Dremio is an advanced data lakehouse platform known for its scalability and capacity for direct querying across diverse data sources. It combines various open source technologies such as Apache Arrow, Apache Parquet, Apache Calcite, and Apache Iceberg to facilitate efficient and speedy data analysis.
Dremio’s SQL interface
Dremio employs Apache Calcite for its SQL interface to parse and optimize SQL queries. Apache Calcite, a dynamic open-source framework, is tasked with SQL parsing, planning, and query optimization. This integration allows Dremio to offer a highly adaptable and efficient SQL interface to data connected to Dremio.
Dremio’s caching layer
Further, Apache Parquet’s effective storage and quick access patterns bolster Dremio’s caching layer. Parquet’s columnar data storage permits superior compression and efficient I/O, making it the ideal format for Dremio’s caching layer.
Dremio’s data processing
The standard columnar in-memory format of Apache Arrow functions as Dremio’s internal data representation. This mechanism ensures rapid and seamless data transportation between systems without requiring serialization or deserialization. Arrow’s optimized data organization, coupled with its SIMD instructions and memory mapping, empowers Dremio to leverage modern CPU capacities, thereby improving data processing efficiency and performance.
The high-speed client-server framework of Apache Arrow Flight facilitates swift data transport for Dremio. By circumventing data serialization or deserialization, Arrow Flight prevents excess data copy, thereby enhancing Dremio’s processing speed. Arrow Flight’s multi-language support and security features make it a reliable and versatile tool for data exchange across various environments.
Moreover, Gandiva, an execution kernel developed for Arrow by Dremio, enables just-in-time compilation of optimized assembly code for rapid execution of low-level operations, like sorting and filtering. Dremio incorporated this feature into its SQL engine to supercharge performance for analytical tasks.
The synergistic effect of Apache Arrow, Arrow Flight, and Gandiva forms the core of Dremio’s data processing efficiency, reinforcing its reputation as a powerful open lakehouse platform.
Dremio’s data reflections features
Dremio’s Data Reflections leverage Apache Iceberg tables to create optimized data representations. These Data Reflections, similar to a mix of indexes and materialized views, are utilized to accelerate queries. Reflections are appropriately partitioned and sorted to match query access patterns, which allows for efficient data scanning and partition pruning. In essence, the amalgamation of Apache Iceberg and Dremio’s Sonar query engine significantly enhances query performance for large data sets, using principles of partitioning, pruning, and sorting.