Multi-Data Source Flux Introduced in InfluxDB 2 Alpha 14

As our co-founder Paul wrote extensively in InfluxDB 2.0's Alpha Release announcement, we believe Flux will play an important role in combining time series data with many different data sources. Flux's first official multi-data store step is the sql.from() function, released in InfluxDB 2.0 Alpha 14. This addition is significant given the ever-increasing number of data stores and the need to work across them.

It's a multi-data store world

Years ago, as a consultant for Wily Technology during the explosive early days of the APM industry, I dealt with hundreds of applications in varying states of distress. Time and again, I would see applications ignore the nature of their particular data, which invariably led to slowdowns and errors from data stores abused in ways they simply were not suited for.

As different application datasets scale, they take on characteristics that make them suited to specific kinds of data stores. Working with, storing, and processing each dataset with tools and storage mechanisms built specifically for it results in well-built systems that are maintainable and scale with growth.

Here are some examples of different datasets and data store types built for them:

  • Relational data, such as an e-commerce store's inventory, customer orders, and customer records, along with the relationships among them, is best handled by an RDBMS built to model those relationships.
  • Asset information for IoT devices is finding great success with specialized data stores like IBM's Maximo and SAP's Intelligent Asset Management, which expose asset metadata via integration APIs.
  • Dependency data, such as business processes, integrated applications, and underlying network topology, is best handled by a [graph database](https://www.influxdata.com/graph-database/).
  • Time series data, such as the millions of streaming data points from stacks, sensors, and systems, is best handled by a time series database.

In the distant past, some argued that using different data stores put too much strain on programmers, who would have to learn different data tools. Before today's widespread and uniform use of APIs, that may have been true. But nowadays, developers are quite comfortable using data source drivers and APIs to pull data from different places and mold it together.

Relying on developers to shape and crunch data works great for purpose-built custom software solving specific problems. But there are never enough developers to go around, and general-purpose data analytics and visualization tools have proven their value to many analysts and data scientists. Single-data-source analytics tools work at times, but analysts often want to merge data from different data stores and frequently find that pulling those sources together is a monumental task.

Flexibly merging data sources and analyzing them across time is one of Flux's primary use cases. Open source Flux is meant to empower every query and visualization tool to bring together related data sets and generate insights using a common, powerful, and unified language.

Using time series data with data from SQL data stores

Flux's first implementation of a function that can extract data from non-InfluxDB data stores is the sql.from() function in InfluxDB 2.0 Alpha 14. Let's dive into a specific IoT use case to explore this function.
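At its simplest, sql.from() takes a driver name, a data source name (connection string), and a SQL query, and returns the query result as Flux tables. Here is a minimal sketch; the connection string and the table and column names are illustrative placeholders, not a real deployment:

import "sql"

// Query a relational store directly from Flux.
// The DSN and schema below are hypothetical examples.
sql.from(
  driverName: "mysql",
  dataSourceName: "username:password@tcp(localhost:3306)/dbname",
  query: "SELECT device_id, device_type FROM devices"
)

Run on its own, this script simply yields the relational rows as a Flux table, ready to be filtered, shaped, or joined like any other Flux data.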

IoT asset data store

Recently, I have had several deep-dive sessions with customers using InfluxDB in IoT use cases. These IoT practitioners have tens of thousands of devices and sensors streaming everything they capture into InfluxDB, and dealing with all that data presents some unique challenges.

Without a way to connect different data stores, many IoT practitioners duplicate all their metadata as tags on their time series metrics so they can perform metadata-based time series analysis such as:

“What is the error rate reported over the last day for each device type?”

Of course, prior to sql.from(), the only way to have InfluxDB answer that question was to add the device type as a tag to every time series metric coming from the sensors.
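To make that concrete, a tag-based query answering the question above might look like the sketch below, assuming a hypothetical bucket named "sensor_metrics", a "device_status" measurement with an "errors" field, and device_type baked in as a tag on every point:

// Error totals per device type over the last day.
// Bucket, measurement, and field names are hypothetical.
from(bucket: "sensor_metrics")
  |> range(start: -1d)
  |> filter(fn: (r) => r._measurement == "device_status" and r._field == "errors")
  |> group(columns: ["device_type"])
  |> sum()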

This works well enough but struggles when metadata changes over time. For example, if an IoT device maker initially tags time series metrics with a certain security classification and that classification later changes, a query on the new classification returns only new data, because older points still carry the old value in the tag.

A better solution is a clean separation of data: keep metadata in a relational store and time series data in InfluxDB. The key to making a two-data-store solution work is a Flux language that can seamlessly query both time series data and relational data.

Fishtank health service vendor

To flesh out an example, let's suppose there is a vendor that provides a service for automatically monitoring and adjusting the water health of customers' fish tanks. This vendor leases thousands of IoT devices that monitor many aspects of fish tank water and stream their data to InfluxDB.

In this example, instead of tagging the streaming metrics with all the metadata, the vendor tags every metric with only the unique ID of the device the data originates from. That ID ties the device's metrics to sensor metadata such as the device type, the purchaser's customer ID, the installation street, city, and state, the device revision number, the device name, and so on. Our fish tank vendor has implemented a homegrown asset tracker in a MySQL database to keep track of all this relational information.

Many of this vendor's customers are large corporations with dozens of fish tanks. Whenever a customer calls, it's very important for the customer success representative to know the pH balance of that customer's fish tanks over the last hour and where the tanks are located. Here is what a Flux script might look like for finding one customer's fish tank pH balances:
import "sql"

phBalances = from(bucket: "PH_Balances")
  |> range(start: now-60)
  |> filter(fn: (r) => r._measurement == "water_sensor" )

deviceIds = sql.from(driverName: "mysql", dataSourceName: "username:password@tcp(host)/dbname", query: "SELECT customer_name, customer_location, device_id AS device FROM DeviceTable WHERE customer_name = 'InfluxData'")

join(tables: {t1: phBalances, t2: deviceIds}, on: ["device"])
	|> aggregateWindow(every: v.windowPeriod, fn: mean)
 	 |> yield(name: "mean")
The result of this script is a single table in which the customer name and location columns come from MySQL, the pH measurements come from InfluxDB, and the device ID column used for the join exists in both data stores.

Several things are worth noting from this simple example:

  • First, this is a very naive example, because the first query, which builds the phBalances table, pulls the entire matching data set from InfluxDB. Data is implicitly dropped when the two tables join on device ID, but until the join completes, the whole data set sits in query memory, and it is easy to imagine tens of millions of data points overwhelming all available memory.
  • Second, there is no real need for a join per se. What we are actually trying to do is filter the pH balance data set down to the subset of metrics we are interested in. The Flux team has several initiatives underway to simplify these kinds of queries so that a SQL result can be used directly in a filter, as sketched below.
  • Lastly, the example Flux script is all low-level SQL and Flux, and it requires users to piece together fairly complete prior knowledge of both data stores. Flux's long-term plan is to deliver libraries and tooling that ease these low-level knowledge prerequisites and provide a more approachable experience; for example, ORM tooling or data store visualizations to help users identify, merge, and shape their data and analysis.
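To illustrate that second point, later Flux releases added stream extraction functions such as findColumn(), which, together with the contains() helper, let a SQL result feed a filter directly instead of a join. Here is a hedged sketch of that direction, reusing the hypothetical schema above:

import "sql"

// Extract the customer's device IDs from MySQL as a plain array
deviceList = sql.from(driverName: "mysql", dataSourceName: "username:password@tcp(host)/dbname", query: "SELECT device_id AS device FROM DeviceTable WHERE customer_name = 'InfluxData'")
  |> findColumn(fn: (key) => true, column: "device")

// Filter pH readings to just those devices; no join required
from(bucket: "PH_Balances")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "water_sensor" and contains(value: r.device, set: deviceList))

This still materializes the ID list in memory, but it avoids holding the entire joined data set there, and it reads much closer to the intent: filter time series by relational metadata.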

The multi-data store road ahead

Flux has entered an exciting new chapter as a multi-data store, time-series-focused programming language with the debut of sql.from(). This initial SQL release enables pulling data from MySQL and Postgres; the Postgres flavor differs only in driver name and connection string, as the sketch after this paragraph shows. Beyond that, we have a long list of data stores we would like to pull data from and write data to. The list of possible connectors on the Flux team's radar is long and dynamic. High on the list are SQLite, MS SQL Server, connectors for our open source partners on Google Cloud and, of course, Google's own data sources such as BigQuery. But we want to hear from you — what top data sources do you want Flux to connect to? Head over to Flux's community forum and let us know!
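For reference, here is that Postgres variant; as with the earlier examples, the connection string and schema are illustrative placeholders:

import "sql"

// Same function, different driver; only the DSN format changes.
sql.from(
  driverName: "postgres",
  dataSourceName: "postgresql://username:password@localhost:5432/dbname",
  query: "SELECT device_id, device_type FROM devices"
)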