Resources for Tasks in InfluxDB 3.0
Anais Dotis-Georgiou
Feb 05, 2024
If you’re an InfluxDB v2 user, you might be wondering what happened to the task engine in InfluxDB 3.0. The answer is that we removed it in order to support broader interoperability with other task tools. V3 enables users to leverage any existing ETL tool rather than being locked into the limited capabilities of the Flux task engine.
Additionally, InfluxDB 3.0 prioritizes query and write performance to enable you to query, transform, and write large datasets with confidence and ease. However, having more choices requires more initial decision-making. In this post, we’ll highlight some third-party ETL tools and describe the advantages of each. This isn’t designed to be an exhaustive comparison of every existing ETL tool. Rather, I’ll focus on tools that we have existing examples for.
Note: All of these approaches and tools use the InfluxDB v3 Python client library. This client library contains methods for querying and writing Pandas and Polars DataFrames, which simplifies ETL processes and gives users access to the many Python libraries available for that workload.
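As a rough sketch of what a task can look like with the client library, the snippet below queries a day of raw data into a Pandas DataFrame, averages it into hourly windows, and writes the result back. The measurement names (`sensor_raw`, `sensor_1h`), the `temperature` field, and the `downsample` and `run_task` helpers are assumptions made for illustration, so adapt them to your own schema.

```python
import pandas as pd


def downsample(df: pd.DataFrame, rule: str = "1h") -> pd.DataFrame:
    """Average every numeric column over fixed time windows."""
    return (
        df.set_index("time")
        .resample(rule)
        .mean(numeric_only=True)
        .dropna(how="all")  # drop windows that contained no rows
        .reset_index()
    )


def run_task(host: str, token: str, database: str) -> None:
    """Query raw rows, downsample them, and write the result back."""
    # Imported lazily so downsample() can be used without the client installed.
    from influxdb_client_3 import InfluxDBClient3

    client = InfluxDBClient3(host=host, token=token, database=database)
    # "sensor_raw" and "sensor_1h" are hypothetical measurement names.
    raw = client.query(
        query=(
            "SELECT time, temperature FROM sensor_raw "
            "WHERE time >= now() - INTERVAL '1 day'"
        ),
        language="sql",
        mode="pandas",  # return the result as a Pandas DataFrame
    )
    client.write(
        downsample(raw),
        data_frame_measurement_name="sensor_1h",
        data_frame_timestamp_column="time",
    )
    client.close()
```

Keeping the transform in its own function means any of the tools below can schedule `run_task` while the logic stays easy to test on a local DataFrame.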
Quix is a complete solution for building, deploying, and monitoring event-streaming applications using Kafka and Python. Quix is designed specifically for processing time series data and comes in both cloud and on-prem offerings. Its UI simplifies the processing, building, and maintenance of event streaming and ETL processes.
Some advantages of Quix include:
- Plugins for querying and writing data from/to your InfluxDB v3 instance and integrating InfluxDB v3 into your Quix pipeline.
- Quix can orchestrate any container.
Some resources for getting started with Quix and InfluxDB 3.0 include:
- Quix Community Plugins for InfluxDB: Build Your Own Streaming Task Engine: A blog post breaking down what each InfluxDB v3 plugin for Quix does, providing a use case you can try out yourself.
- Simplify Stream Processing with Python, Quix, and InfluxDB: An on-demand webinar that explains how to quickly deploy production-ready applications in Quix with InfluxDB for real-time analytics.
- Saving the Holidays with Quix and InfluxDB: The OpenTelemetry Anomaly Detection Story: An on-demand webinar about crafting a scalable time series data pipeline with OpenTelemetry, Quix, and InfluxDB. The demo also covers performing anomaly detection on OTEL data.
- quix-anomaly-detection-example: A repo that contains an example of how to use Quix with InfluxDB to perform anomaly detection.
Mage is an open source data pipeline tool for transforming and integrating data. In essence, it’s an alternative to Apache Airflow. It also includes a UI that simplifies the ETL creation process. Mage clearly documents how to deploy on AWS, Azure, DigitalOcean, and GCP with Terraform and Helm charts.
To summarize, some of the advantages of using Mage include:
- Mage is open source.
- Mage has the following features:
  - Orchestration: schedule and manage data pipelines for observability
  - Notebook editor: interactive Python, SQL, and R editor for coding data pipelines
  - Data integration: synchronize data from 3rd-party sources with your internal destinations
  - Streaming: ingest and transform real-time data
  - dbt: build, run, and manage your dbt models with Mage
Some resources for getting started with Mage and InfluxDB 3.0 include:
- Mage.ai for Tasks with InfluxDB: A blog post highlighting how to set up a simple downsampling task with Mage and InfluxDB v3.
- Mage for Anomaly detection with InfluxDB and Half-space Trees: A blog post on performing anomaly detection with Mage and InfluxDB v3.
- ETL Made Easy: Best Practices for Using InfluxDB and Mage.ai: An on-demand webinar on best practices for using Mage as an ETL tool with InfluxDB—includes a demo on anomaly detection with Mage and InfluxDB v3.
- Mage Documentation
- Mage_Demo: A containerized repo highlighting the anomaly detection use case.
AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS). With Fargate, you can run containers without the need to provision, configure, or scale virtual machine clusters. It also enables flexible resource management and configuration. This allows you to fine-tune container performance, making Fargate ideal for complex data processing.
To summarize, some of the advantages of using AWS Fargate include:
- Serverless Simplicity: Fargate abstracts the underlying infrastructure, allowing developers to deploy containers without worrying about provisioning, scaling, or managing EC2 instances.
- Cost Efficiency: Fargate charges users based on the resources consumed by the containers, providing cost savings by eliminating the need to maintain idle EC2 instances.
Some resources for getting started with AWS Fargate and InfluxDB 3.0 include:
- ricks-downsampler: A repo that contains a containerized downsampler complete with scheduling options and some monitoring.
- Saving AWS Costs by using Fargate Scheduling: A blog post that compares the costs associated with workloads of different sizes and frequency.
Function as a Service (FaaS) tools are event-driven, serverless computing platforms. Examples include AWS Lambda, Google Cloud Functions, and Azure Functions. Some advantages and considerations when using FaaS tooling are:
- They let developers run code without provisioning or managing servers.
- They include automatic scale-up.
- They allow users to focus on developing complex analytics and data science logic. However, there is less granular control over the computing environment.
- If a task is intermittent or has a variable load, FaaS won’t charge for idle compute resources. If the workload is consistent, Fargate (see above) might be the more cost-effective option. Similarly, Fargate might be optimal if your task has a long execution time (e.g., greater than 10 minutes).
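To make the cost trade-off in the last point concrete, here is a small back-of-the-envelope calculation. The two rates are made-up numbers chosen for illustration, not actual AWS pricing, and the function names are hypothetical. Plug in the current rates for your provider and region before drawing conclusions.

```python
# Hypothetical rates for illustration only; NOT real cloud pricing.
FAAS_PRICE_PER_GB_SECOND = 0.00002   # billed only while the function runs
CONTAINER_PRICE_PER_HOUR = 0.05      # billed for every hour the container is up


def per_invocation_cost(runs_per_day, seconds_per_run, memory_gb, price_per_gb_second):
    """Daily cost when you pay only for the time the code is executing."""
    return runs_per_day * seconds_per_run * memory_gb * price_per_gb_second


def always_on_cost(hours_per_day, price_per_hour):
    """Daily cost of a container that stays up whether or not it is busy."""
    return hours_per_day * price_per_hour


# Intermittent task: once an hour, 30 seconds per run, 1 GB of memory.
hourly_job = per_invocation_cost(24, 30, 1.0, FAAS_PRICE_PER_GB_SECOND)

# Near-continuous task: every minute, 55 seconds per run, 1 GB of memory.
busy_job = per_invocation_cost(1440, 55, 1.0, FAAS_PRICE_PER_GB_SECOND)

container = always_on_cost(24, CONTAINER_PRICE_PER_HOUR)

print(f"hourly job on FaaS:   ${hourly_job:.4f}/day")
print(f"busy job on FaaS:     ${busy_job:.4f}/day")
print(f"always-on container:  ${container:.4f}/day")
```

With these assumed rates, the hourly job costs roughly a cent per day on FaaS against $1.20 for the always-on container, while the near-continuous job costs about $1.58 per day on FaaS and is cheaper to run in the container.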
InfluxData has yet to publish proofs of concept (PoCs) with FaaS tooling. Whichever platform you choose, you’ll want to use the InfluxDB v3 Python client library to query, transform, and write your data. Here are some resources for getting started with the Python client library:
- InfluxDB 3.0 Python Client
- Client Library Deep Dive: Python (Part 1): Part one of a two-part blog on the features of the InfluxDB 3.0 Python client library.
- Client Library Deep Dive: Python (Part 2): Part two of a two-part blog on the features of the InfluxDB 3.0 Python client library.
- Writing Polars: README.md on how to write Polars DataFrames. To query and return a Polars DataFrame, use the following code:
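import polars as pl
from influxdb_client_3 import InfluxDBClient3

# Placeholder credentials; substitute your own host and token.
client = InfluxDBClient3(host="YOUR_HOST", token="YOUR_TOKEN")
sql = 'SELECT * FROM caught LIMIT 10'
# mode='all' returns the full result as a PyArrow table.
table = client.query(database="pokemon-codex", query=sql, language='sql', mode='all')
# Convert the Arrow table to a Polars DataFrame.
df = pl.from_arrow(table)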
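To show how the same client calls might fit into a FaaS tool, here is a minimal sketch of a downsampling function written in the AWS Lambda handler style (`handler(event, context)`). The environment variable names, the `sensor_raw` and `sensor_1m` measurements, the `temperature` field, and the `lookback_minutes` event key are all assumptions for illustration. The optional `client` argument is not part of the Lambda signature; it is only there so the function can be exercised locally with a stand-in client.

```python
import os

SOURCE_MEASUREMENT = "sensor_raw"  # hypothetical raw measurement
TARGET_MEASUREMENT = "sensor_1m"   # hypothetical downsampled measurement


def build_query(lookback_minutes: int) -> str:
    """SQL that averages the field into 1-minute bins over the lookback window."""
    return (
        "SELECT date_bin(INTERVAL '1 minute', time) AS time, "
        "avg(temperature) AS temperature "
        f"FROM {SOURCE_MEASUREMENT} "
        f"WHERE time >= now() - INTERVAL '{lookback_minutes} minutes' "
        "GROUP BY 1 ORDER BY 1"
    )


def handler(event, context, client=None):
    """Query, downsample in SQL, and write the result back to InfluxDB."""
    if client is None:
        # Imported lazily so the handler can be tested without the client installed.
        from influxdb_client_3 import InfluxDBClient3

        client = InfluxDBClient3(
            host=os.environ["INFLUX_HOST"],        # assumed variable names
            token=os.environ["INFLUX_TOKEN"],
            database=os.environ["INFLUX_DATABASE"],
        )

    # int() also guards against non-numeric input ending up in the SQL string.
    lookback = int(event.get("lookback_minutes", 15))
    df = client.query(query=build_query(lookback), language="sql", mode="pandas")

    if len(df) > 0:
        client.write(
            df,
            data_frame_measurement_name=TARGET_MEASUREMENT,
            data_frame_timestamp_column="time",
        )
    return {"rows_written": len(df)}
```

Pushing the aggregation into the SQL query keeps the function's memory footprint and run time small, which matters when you pay per invocation.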
I hope this post helps you jump-start migrating your tasks to InfluxDB 3.0 and take advantage of its increased interoperability with ETL and data-pipelining tools. Get started with InfluxDB Cloud 3.0 here. If you need help, please reach out via our community site or Slack channel.