InfluxData Blog - Zoe Steinkamp

Chasing the Skies: Monitoring Flights with InfluxDB

Anais Dotis-Georgiou, Zoe Steinkamp (InfluxData) — Tue, 04 Jun 2024 08:00:00 +0000

We’re in an era where every move is monitored, where fans track every step of Taylor Swift’s jet-setting lifestyle. While some may dream of spotting celebrities at 30,000 feet, aviation enthusiasts and tech professionals see a broader horizon for flight traffic and monitoring.

Enter InfluxDB, the robust time series database transforming how we monitor flights in real-time. Whether it’s keeping tabs on a pop star’s private plane or ensuring commercial flights run on schedule, InfluxDB offers unprecedented insights into aviation data.

In this blog post, we’ll learn how to leverage FlightAware and InfluxDB Cloud 3 to monitor private/GA flight and airport delays. Then, we’ll use Grafana to build a flight monitoring dashboard. Find the corresponding repo here.

Requirements

To run this example, you’ll need the following:

You’ll also need to gather authentication credentials from your InfluxDB and FlightAware accounts. Follow these docs to learn more about how to get these credentials:

Bucket
Token
InfluxDB URL
FlightAware API Key

Please note that InfluxDB Cloud 3 offers a free tier, but you’ll need to pay for a Standard FlightAware tier to access that API. However, you can also use the test JSON files and the test Python script in the repo to try this example out before paying for live data from FlightAware.

Using the InfluxDB v3 Python Client Library to get flight data

At the core of this example is a Python script, flightAware.py. It continuously monitors and collects data about flights within specific geographic boundaries using the FlightAware AeroAPI and InfluxDB, fetching live flight data every 5 minutes. Here’s a breakdown of its key components:

Environment and Libraries: First, we import all libraries and modules, such as os, requests, pandas, and influxdb_client_3. We also import datetime, timezone, and a custom secret module for the secure handling of sensitive information like API keys.
Configure Variables: Next, we set up variables for database (db), organization (org), and API (url) configuration. We also retrieve authentication tokens (token) and API keys (apikey) using environment variables and a custom secret.py file. This example assumes you created a secret.py and stored your credentials there.
Configure the InfluxDB v3 Python Client: We initialize an InfluxDBClient3 object with the specified host URL, token, and organization details for data storage.

python
influxdbClient = InfluxDBClient3(host=url, token=token, org=org)

API Configuration and Headers: Using the API key, define the FlightAware AeroAPI endpoint and authentication header. Specify the parameters for the API request, including a geographic bounding box for flight tracking and a limit on the number of pages returned.
Time Conversion Function: Include a function convert_to_utc to ensure all timestamps are in UTC. This facilitates easier data management and visualization in Grafana.
Main Loop for Data Collection: This executes a perpetual loop that sends requests to the FlightAware AeroAPI to retrieve flight data within the specified parameters. It also checks if the response is successful and extracts flight details, specifically ignoring flights without a declared destination. For each valid flight, it constructs a detailed dictionary including flight identifiers, timing, geographical coordinates, and other pertinent flight data. The script also handles waypoints, extracting only the first and last for simplicity.
Data Handling and Storage: Here, we merge all collected flight data into a single dictionary and convert it into a Pandas DataFrame. Then, we write this DataFrame to InfluxDB, specifying measurement and tagging configurations.

python
for d in flight_data:
    merged_data.update(d)
flight_df = pd.DataFrame([merged_data])
flight_df['timestamp'] = flight_df['last_position']
influxdbClient._write_api.write(
    bucket=db,
    record=flight_df,
    data_frame_measurement_name='flight',
    data_frame_tag_columns=['ident', 'fa_flight_id'],
    data_frame_timestamp_column='timestamp',
)
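The time conversion step described above can be sketched as follows. This is not the repo’s implementation: only the name convert_to_utc comes from the walkthrough, while the ISO 8601 input format and the choice to treat offset-free timestamps as UTC are assumptions made for illustration.

```python
from datetime import datetime, timezone


def convert_to_utc(timestamp: str) -> str:
    """Normalize an ISO 8601 timestamp string to UTC (illustrative sketch)."""
    # Older Python versions reject a trailing "Z" in fromisoformat(),
    # so rewrite it as an explicit +00:00 offset first.
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    # Assumption: a timestamp without an offset is already in UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


zulu = convert_to_utc("2024-06-04T08:00:00Z")
pacific = convert_to_utc("2024-06-04T01:00:00-07:00")
print(zulu, pacific)
```

Both calls describe the same instant, so they produce the same UTC string, which keeps timestamps from different sources aligned on a single Grafana time axis.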

Visualizing flight data in Grafana

Now you can import the FlightAware Grafana Dashboard JSON to build a Flight Monitoring Dashboard in Grafana. For more information on importing a dashboard from JSON in Grafana please see this documentation. To query data from an InfluxDB Cloud 3 free-tier account, leverage the Grafana InfluxDB v3 Data Source.

Here we can see the planes in our specific area, as visualized by the map on the top left panel in this dashboard. For this example, we focus on Las Vegas. We can also look at aircraft types, destination cities, and origin cities (top right panels). We also monitor stats like average altitude, number of flights in the air, and average ground speed. We also have an Airplane Ground speed time series visualization to see how planes land and take off.

I don’t know much about different aircraft, but I imagine larger airplanes have a steeper landing and take-off speed. Finally, we also visualize the flights by major airlines, with Southwest Airlines having the most flights in Las Vegas at that time.

Additional resources and conclusion

We hope this tutorial helps you get started visualizing flight data with InfluxDB. We also encourage you to take a look at the following related resources to learn more about how to leverage InfluxDB with Grafana and Python:

Client Library Deep Dive: Python (Part 1)
Client Library Deep Dive: Python (Part 2)
InfluxDB 3 Python Client Update: Adding Polars Support
Grafana Unleashes Official InfluxDB V3 Data Source: A Quick-start Guide to Configuration and Usage
Alerting with Grafana and InfluxDB Cloud Serverless

Note: Another cool thing about this demo is that the DevRel team at InfluxData can run it entirely with GitHub Actions, so if you see InfluxData at a conference booth running this demo, it’s likely we’re running it remotely.

As always, get started with InfluxDB Cloud 3 here. If you need help, please contact our community site or Slack channel.

Best Practices for Collecting and Querying Data from Multiple Sources

Zoe Steinkamp (InfluxData) — Mon, 14 Aug 2023 07:35:00 +0000

This article was originally published in The New Stack and is reposted here with permission.

In today’s data-driven world, the ability to collect and query data from multiple sources has become an important consideration. With the rise of IoT, cloud computing and distributed systems, organizations face the challenge of handling diverse data streams effectively. It’s common to have multiple databases/data storage options for that data. For many large companies, the days of storing everything in a single database are in the past.

It is crucial to implement best practices for efficient data collection and querying to maximize your datastores’ potential. This includes optimizing data ingestion pipelines, designing appropriate schema structures and utilizing advanced querying techniques. On top of this, you need data stores that are flexible when querying data back out and are compatible with other data stores.

By adhering to these best practices, organizations can unlock the true value of their data and gain actionable insights to drive business growth and innovation. This is where InfluxDB, a powerful time series database, comes into play. InfluxDB provides a robust solution for managing and analyzing time-stamped data, allowing organizations to make informed decisions based on real-time insights.

Understanding different data sources

When it comes to data collection, it is crucial to explore different data sources and understand their unique characteristics. This involves identifying the types of data available, their formats and the potential challenges associated with each source. After identifying the data sources, selecting the appropriate data ingestion methods becomes essential. This involves leveraging APIs, utilizing Telegraf plugins or implementing batch writes, depending on the specific requirements and constraints of the data sources.

It’s very important to keep in mind data space and speed. For example, we find with IoT data these are top concerns. Ensuring data integrity and consistency throughout the collection process is of utmost importance. So, too, is having backup plans for data loss, stream corruption and storage at the edge. This involves implementing robust mechanisms to handle errors, to handle duplicate or missing data and to validate the accuracy of the collected data. Additionally, implementing proper data tagging and organization strategies results in efficient data management and retrieval. By tagging data with relevant metadata and organizing it in a structured manner, it becomes easier to search, filter and analyze effectively.

It’s helpful to note here that most data storage solutions come with their own recommendations for how to begin collecting data into the system. For InfluxDB, we always suggest Telegraf, our open source data ingestion agent. For language-specific needs, we suggest our client libraries written in Go, Java, Python, C# and JavaScript. The important takeaway here is to go with recommended and well-documented tools. While it might be tempting to use a tool you are already familiar with, if it’s not recommended you might be missing out on those mechanisms for handling problems.

Effective data modeling

Effective data modeling is a crucial aspect of building robust and scalable data systems. It involves understanding the structure and relationships of data entities and designing schemas that facilitate efficient data storage, retrieval and analysis. A well-designed data model provides clarity, consistency and integrity to the data, ensuring its accuracy and reliability. The most important piece when dealing with multiple data sources is determining your “connector”, or your data piece that connects your data together.

For example, let’s look at a generator that has two separate datasets: one in a SQL database storing the unit’s stats and one in the InfluxDB database that has real-time data about the battery capacity. You might need to identify a faulty generator and its owner based on these two data sets. It might seem like common sense that you would have some kind of shared ID between these two data sets. But when you are first modeling your data, the concern is less about being able to combine data sets and more about the main data use case and removing unnecessary data. The other question is: how unique is your connector and how easy will it be to store? For this example, the real-time battery storage might not have easy access to a serial number. That might need to be a hardcoded value added to all data collected from the generator.

Furthermore, as data evolves over time and variations occur, it becomes essential to employ strategies to handle these changes effectively. This may involve techniques such as versioning, migration scripts or implementing dynamic schema designs to accommodate new data attributes or modify existing ones.

For example, if our generator adds new data sets, it’s important that we add our original connector to that new data. But what if you are dealing with an existing data set? Then it gets trickier. You might have to go back and retroactively implement your connector. In this example, the app where people register their generator and view their battery information might require them to manually enter their serial number. This allows you to tag them as the owner, and you can run analysis on their device from a distance to determine if it’s within normal range.

Obviously this is a very simple example, but many companies and industries use this concept. The idea of data living in a vacuum is starting to disappear as many stakeholders expect to access multiple data sources and have an easy way to combine the data sets. So let’s start to dive into how to combine data sets once you have them. Let’s continue from our previous example with InfluxDB and a SQL database, a common use case for combining data.

When it comes to querying your data, and especially when it comes to combining data sets, there are a couple of recommended tools to accomplish this task. First is SQL, which is broadly used to query many data sources, including InfluxDB. And when it comes to data manipulation and analysis, a second tool, Pandas, is useful for flexible and efficient data processing. Pandas is a Python library that is agnostic to the data it accepts, as long as it’s within a Pandas DataFrame. Many data sources document how to convert their data streams into a Pandas DataFrame because it is such a popular tool.

The following code is an example of a SQL query in InfluxDB, which returns the average battery level over the past week for this specific device (via serial number):
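A query of that shape might look like the sketch below, wrapped in a small Python helper so the serial number can be supplied by the app. The table name generator_data, the battery_level field and the serial_number tag are assumed names for illustration, not taken from a real schema.

```python
def average_battery_query(serial_number: str) -> str:
    """Build a SQL query for one generator's average battery level over the past week."""
    # Simple guard for this sketch: accept only alphanumeric serial numbers
    # before placing the value inside the query string.
    if not serial_number.isalnum():
        raise ValueError(f"unexpected serial number: {serial_number!r}")
    return (
        "SELECT AVG(battery_level) AS avg_battery_level "
        "FROM generator_data "  # hypothetical table name
        f"WHERE serial_number = '{serial_number}' "
        "AND time >= now() - INTERVAL '7 days'"
    )


query = average_battery_query("SN12345")
print(query)
```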

This query would happen on the app side. When a user logs in and registers their generator’s serial number, that enables you to store the data with a serial number tag to use for filtering. For the readability of this query, it’s easier to imagine all generator data goes into one large database. In reality, it’s more likely that each serial number would have its own data store, especially if you wanted to offer customers the chance to “Store your data longer for a fee”, which is a common offer for some businesses and use cases, like residential solar panels.

Now, this is just one query, but an app developer would likely write several such queries to cover averages for the day and week, and to account for battery usage, battery levels and most recent values, etc. Ultimately, they hope to end up with between 10 and 20 values that they can show to the end user. You can find a list of all these functions for InfluxDB here.

Once they have these values they can combine all those data points with their SQL database that houses customer data, things like name, address, etc. They can use the InfluxDB Python client library to combine their two datasets in Pandas.

This is an example of what that join would look like in the end. When it comes to joining, Pandas has a few options. In this example, I’m using an inner join, which keeps only the rows whose connector value appears in both data sets. You would probably need to rename some columns, but overall this join results in a combined data frame that you can then convert as needed for use.
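As a minimal sketch of such a join, the snippet below merges two small hand-made data frames on a shared serial_number column. The column names and values are invented for illustration. In practice, one frame would come from the InfluxDB query results and the other from the SQL customer database.

```python
import pandas as pd

# Hypothetical aggregated values queried from InfluxDB, one row per generator.
battery_df = pd.DataFrame({
    "serial_number": ["A100", "A200", "A300"],
    "avg_battery_level": [87.5, 42.0, 91.2],
})

# Hypothetical customer records from the SQL database.
customer_df = pd.DataFrame({
    "serial_number": ["A100", "A200", "A400"],
    "owner_name": ["Ada", "Grace", "Linus"],
})

# The shared serial number is the "connector" between the two data sets.
combined_df = pd.merge(battery_df, customer_df, on="serial_number", how="inner")
print(combined_df)
```

An inner join keeps only the serial numbers present in both frames, so A300 and A400 drop out here. Passing how="outer" instead would keep every row from both sides and fill the missing values with NaN.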

You can imagine how data scientists might use these tools to run anomaly detection on the datasets to identify faulty devices and alert customers to the degradation and needed repairs. If there is a charge for storing data, users can also combine this data with a financial data set to confirm which customers pay for extended storage time and possibly receive extra information. Even in this simple example, there are many stakeholders, and at scale the number of people who need to access and use multiple data sets only expands.

Key takeaways

With so much data in the world, the notion of storing everything in a single database or data store may seem tempting. (To be clear, you may want to store all of the same type of data in a single database, e.g., time series data.) While it can be a viable solution at a small scale, the reality is that both small- and large-scale companies can benefit from the cost savings, efficiency improvements and enhanced user experiences that arise from utilizing multiple data sources. As the industries evolve, engineers must adapt and become proficient in working with multiple data stores, and the ability to seamlessly collect and query data from diverse sources becomes increasingly important. Embracing this approach enables organizations to leverage the full potential of their data and empowers engineers to navigate the ever-expanding landscape of data management with ease.

Build A Plant Monitoring Tool With IoT: A Beginner-Friendly Tutorial

Zoe Steinkamp (InfluxData) — Fri, 02 Jun 2023 07:00:00 +0000

This article was originally published in The New Stack and is reposted here with permission.

Creating an Internet of Things (IoT) app to monitor a house plant is a pragmatic starting place to learn about data that changes over time. It’s useful for anyone who loves the idea of indoor gardening but forgets to check their plants regularly.

This project is accessible to everyone from students working on a science fair project to botanists monitoring exotic plant nurseries. There are many ways to monitor a house plant, but this is the high-tech way — with sensors and advanced software systems.

For this project, we’ll use InfluxDB, a time series platform that specializes in storing sequential data as it appears over time. This is useful when comparing data, creating alerts for specific thresholds and monitoring events in the physical and virtual worlds alike.

The full list of supplies, a schematic drawing for your breadboard (with microcontroller of choice) and the source code are all available from my git repository if you want to follow along for your own edification or desperately need to keep a houseplant alive.

The architecture

An IoT sensor tracks my plant’s health metrics at timed intervals. We classify the data it collects as time series data because it includes a timestamp, which appears as the first column in storage. The IoT sensors generate data about the plant’s health. Then I use Telegraf, an open source collection agent, and the Python client library to collect that data and send it to the storage layer, InfluxDB. The Plotly graphing library provides the data visualization. I coded the project in Python and it uses the Flask library for routing.

Getting started

Instruments:

A plant to monitor
A Particle Boron or similar microcontroller
At least one IoT sensor for your plant
A breadboard
Jump wires

I use four sensors to generate the following five data points:

Air temperature
Humidity
Light
Soil temperature
Soil moisture

Microcontroller

I use the Boron microcontroller and set the device up through the company website. To receive data from the microcontroller itself, I connect it to my laptop via a USB cable. Microcontroller setup depends on which microcontroller you selected for use. Your microcontroller might provide other connection options, including Bluetooth or small server access over TCP/IP.

Follow the instructions provided by your microcontroller’s manufacturer until you receive input from your sensors. You don’t need to make sense of the data yet; just make sure your microcontroller is sending raw data.

InfluxDB

Sign in to InfluxDB. Create a bucket, which is where InfluxDB stores data. We will connect to InfluxDB via an API. The next step is to create the required credentials and tokens (https://docs.influxdata.com/influxdb/v2.7/security/tokens/).

Code

Writing data into InfluxDB is straightforward and starts with the client library. I use InfluxDB’s Python client library for this project. The code below is an example of how you can write code to send raw data from your microcontroller’s sensors to InfluxDB.

import influxdb_client

def write_to_influx(self, data):
    # Build a point in the "sensor_data" measurement, tagged by user and device
    p = (influxdb_client.Point("sensor_data")
         .tag("user", data["user"])
         .tag("device_id", data["device"])
         .field(data["sensor_name"], int(data["value"])))
    self.write_api.write(bucket=self.cloud_bucket, org=self.cloud_org, record=p)
    print(p, flush=True)

Querying data

Queries return tables similar to the one below.

Before I can query my data, I need to initialize the Flight SQL client.

from flightsql import FlightSQLClient

Followed by:

# This is our Flight SQL client setup; it's how we will query from IOx.
# We need to remove the https:// from our host.
host = host.split("://")[1]
self.flight_client = FlightSQLClient(host=host,
                                     token=token,
                                     metadata={'bucket-name': bucket})

self.cloud_bucket = bucket
self.cloud_org = org

Below is a basic SQL query to retrieve data from InfluxDB.

SELECT {sensor_name}, time FROM sensor_data WHERE time > (NOW() - INTERVAL '2 HOURS') AND device_id='{deviceID}'

Before we can retrieve and read the data, we have to convert it to PyArrow format. The code below is a function that includes the query and connection to Flight SQL to retrieve the data.

from pandas import DataFrame

def querydata(self, sensor_name, deviceID) -> DataFrame:
    # Run the SQL query through the Flight SQL client
    query = self.flight_client.execute(f"SELECT {sensor_name}, time FROM sensor_data WHERE time > (NOW() - INTERVAL '2 HOURS') AND device_id='{deviceID}'")

    # Create reader to consume result
    reader = self.flight_client.do_get(query.endpoints[0].ticket)

    # Read all data into a pyarrow.Table
    table = reader.read_all()
    print(table)

    # Convert to Pandas DataFrame
    df = table.to_pandas()
    df = df.sort_values(by="time")
    print(df)
    return df

You can call the previous query and substitute the variables for your selections, including bucket, sensor and device. The returned result allows you to graph your incoming data. The return df statement hands our data back in a data frame format.
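Because the query text is assembled from the sensor name and device ID, it may be worth checking those values before substituting them. The sketch below is one possible approach rather than part of the project: build_sensor_query and ALLOWED_SENSORS are hypothetical names, and the allowed set simply mirrors the five sensor columns used in this tutorial.

```python
# Column names cannot be passed as bound parameters, so compare them
# against a fixed set of known sensor fields instead.
ALLOWED_SENSORS = {
    "air_temperature", "humidity", "light", "soil_temperature", "soil_moisture",
}


def build_sensor_query(sensor_name: str, device_id: str) -> str:
    """Return the two-hour sensor query, rejecting unexpected inputs."""
    if sensor_name not in ALLOWED_SENSORS:
        raise ValueError(f"unknown sensor: {sensor_name!r}")
    # Assumption: device IDs contain only letters, digits, "-" and "_".
    if not device_id.replace("-", "").replace("_", "").isalnum():
        raise ValueError(f"unexpected device id: {device_id!r}")
    return (
        f"SELECT {sensor_name}, time FROM sensor_data "
        f"WHERE time > (NOW() - INTERVAL '2 HOURS') AND device_id='{device_id}'"
    )


sql = build_sensor_query("soil_moisture", "boron-01")
print(sql)
```

The resulting string is the same query shown earlier and could be handed to the Flight SQL client’s execute call in place of the inline f-string.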

Data frames

Pandas DataFrames are two-dimensional data structures that enable fast data analysis and processing. We convert our data to a DataFrame to make it easier to work with in Python. There are a few other data output options to choose from if you prefer a different style.

@app.callback(Output("store", "data"), [Input("button", "n_clicks")])
def generate_graphs(n):
    # Generate graphs based upon pandas data frame.
    df = influx.querydata("soil_temperature", graph_default["deviceID"])
    soil_temp_graph = px.line(df, x="time", y="soil_temperature", title="Soil Temperature")

    df = influx.querydata("air_temperature", graph_default["deviceID"])
    air_temp_graph = px.line(df, x="time", y="air_temperature", title="Air Temperature")

    df = influx.querydata("humidity", graph_default["deviceID"])
    humidity_graph = px.line(df, x="time", y="humidity", title="humidity")

    df = influx.querydata("soil_moisture", graph_default["deviceID"])
    soil_moisture = px.line(df, x="time", y="soil_moisture", title="Soil Moisture")

    df = influx.querydata("light", graph_default["deviceID"])
    light_graph = px.line(df, x="time", y="light", title="light")

The graphing library expects you to return a dataframe for visualization. This is the end result of querying for the data points. The images below are hard-coded graphs that illustrate the data points. Different tabs display different graphs and track separate metrics. This is just a small portion of the project’s capabilities.

Conclusion

Check out my presentation centered around Plant Buddy for a more in-depth discussion on this project and the InfluxDB ecosystem at large. Our community page has other great examples of exciting projects. Now get started with InfluxDB and build something cool!

Best Practices to Build IoT Analytics

Zoe Steinkamp (InfluxData) — Mon, 01 May 2023 07:00:00 +0000

This article was originally published in The New Stack and is reposted here with permission.

Selecting the tools that best fit your IoT data and workloads at the outset will make your job easier and faster in the long run.

Today, Internet of Things (IoT) data or sensor data is all around us. Industry analysts project the number of connected devices worldwide to be a total of 30.9 billion units by 2025, up from 12.7 billion units in 2021.

When it comes to IoT data, keep in mind that it has special characteristics, which means we have to plan how to store and manage it to maintain the bottom line. Making the wrong choice on factors like storage and tooling can complicate data analysis and lead to increased costs.

A single IoT sensor sends, on average, a data point per second. That totals over 80,000 data points in a single day. And some sensors generate data every nanosecond, which significantly increases that daily total.

Most IoT use cases don’t just rely on a single sensor either. If you have several hundred sensors, all generating data at these rates, then we’re talking about a lot of data. You could have millions of data points in a single day to analyze, so you need to ensure that your system can handle time series workloads of this size. Otherwise, if your storage is inefficient, your queries are slow to return, and if you don’t configure your analysis and visualization tools for this type of data, then you’re in for a bad time.
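The arithmetic behind these volumes is easy to check. The sketch below assumes one data point per second per sensor, as stated above, and picks 300 sensors as an arbitrary stand-in for “several hundred”.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def points_per_day(sensor_count: int, points_per_second: float = 1.0) -> int:
    """Estimate how many data points a group of sensors produces per day."""
    return int(sensor_count * points_per_second * SECONDS_PER_DAY)


single_sensor = points_per_day(1)   # one sensor at one point per second
small_fleet = points_per_day(300)   # assumed fleet of 300 sensors

print(f"{single_sensor:,} points/day for one sensor")
print(f"{small_fleet:,} points/day for 300 sensors")
```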

In this article, I will go over six best practices to build efficient and scalable IoT analytics.

1. Start your storage right

Virtually all IoT data is time series data. Therefore, consider storing your IoT data in a time series database because, as a purpose-built solution for the unique demands of time series workloads, it provides the best performance. The shape of IoT data generally contains the same four components. The first is simply the name of what you’re tracking. We can call that a measurement, and that may be temperature, pressure, device state or anything else. Next are tags. You may want to use tags to add context to your data. Think about tags like metadata for the actual values you’re collecting. The values themselves, which are typically numeric but don’t have to be, we can call fields. And the last component is a timestamp that indicates when the measurement occurred.
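To make the four components concrete, the sketch below renders a measurement, tags, fields and a timestamp as one line of InfluxDB line protocol, the text format InfluxDB accepts for writes. It is a simplified illustration: to_line_protocol is a hypothetical helper, the sample values are invented, and it skips the escaping rules for spaces and commas that a real client library handles.

```python
def format_field_value(value) -> str:
    """Render one field value using line protocol type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'  # strings are double-quoted


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Combine measurement, tags, fields and timestamp into one line."""
    tag_part = "".join(f",{key}={tags[key]}" for key in sorted(tags))
    field_part = ",".join(
        f"{key}={format_field_value(val)}" for key, val in fields.items()
    )
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


line = to_line_protocol(
    "temperature",                                # measurement: what is tracked
    {"room": "kitchen", "device_id": "boron-1"},  # tags: context for the values
    {"value": 21.5},                              # fields: the values themselves
    1685689200000000000,                          # timestamp in nanoseconds
)
print(line)
```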

Knowing the shape and structure of our data makes it easier to work with when it’s in the database. So what is a time series database? It’s a database designed to store these data values (like metrics, events, logs and traces) and query them based on time. Compare this to a non-time series database, where you could query on an ID, a value type or a combination of the two. In a time series database, we query based entirely on time. As a result, you can easily see data from the past hour, the past 24 hours and any other interval for which you have data. A popular time series database is InfluxDB, which is available in both cloud and open source.

2. High-volume ingestion

Time series data workloads tend to be large, fast and constant. That means you need an efficient method to get your data into your database. For that we can look at a tool like Telegraf, an open source ingestion agent meant to run as a cron job to collect time series metrics. It has more than 300 plugins available for popular time series data sources, including IoT devices and more general plugins like execd, which you can use with a variety of data sources.

Depending on the database you choose to work with, other data ingest options may include client libraries, which allow you to write data using a language of your choice. For instance, Python is a common option for this type of tool. It’s important that these client libraries come from your database source so you know they can handle the ingest stream.

3. Cleaning the data

You have three options when it comes to cleaning your data: You can clean it before you store it, after it’s in your database or inside your analytics tools. Cleaning up data before storage can be as simple as having full control over the data you send to storage and dropping data you deem unnecessary. Oftentimes, however, the data you receive is proprietary, and you do not get to choose which values you receive.

For example, my light sensor sends extra device tags that I don’t need, and occasionally, if a light source is suddenly lost, it sends strange, erroneous values, like 0. For those cases, I need to clean up my data after storing it. In a database like InfluxDB, I can easily store my raw data in one data bucket and my cleaned data in another. Then I can use the clean data bucket to feed my analytics tools. There’s no need to worry about cleaning data in the tools, where the changes wouldn’t necessarily replicate back to the database. If you wait until the data hits your analytics tools to clean it, that can use more resources and affect performance.
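A cleaning pass like the one described for the light sensor could be sketched as below. The point layout (a dict with time, tags and value keys) and the tag names are assumptions for illustration, and the sketch treats a reading of 0 as erroneous, which only makes sense if 0 is never a valid value for your sensor.

```python
def clean_light_readings(points, keep_tags=("device_id",)):
    """Drop erroneous zero readings and strip tags that are not needed."""
    cleaned = []
    for point in points:
        if point["value"] == 0:  # assumed to be a sensor glitch
            continue
        tags = {k: v for k, v in point["tags"].items() if k in keep_tags}
        cleaned.append({"time": point["time"], "tags": tags, "value": point["value"]})
    return cleaned


raw_points = [
    {"time": 1, "tags": {"device_id": "d1", "firmware": "1.2"}, "value": 310},
    {"time": 2, "tags": {"device_id": "d1", "firmware": "1.2"}, "value": 0},
    {"time": 3, "tags": {"device_id": "d1", "firmware": "1.2"}, "value": 305},
]

clean_points = clean_light_readings(raw_points)
print(clean_points)
```

The raw list stays untouched, which mirrors the idea of keeping raw data in one bucket and writing the cleaned result to another.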

4. The power of downsampling

Cleaning and downsampling data are not the same. Downsampling is aggregating the data based on time. For example, dropping a device ID from your measurement is cleaning, while deriving the mean value for the last five minutes is downsampling. Downsampling is a powerful tool in that, like cleaning data, it can save you storage costs and make the data easier and faster to work with.

In some cases, you can downsample before storing it in its permanent database, for example, if you know that you don’t need the fine-grained data from your IoT sensors. You can also use downsampling to compare data patterns, like finding the average temperature across the hours of the day on different days or devices. The most common use for downsampling is to aggregate old data.

You monitor your IoT devices in real time, but what do you do with old data once new data arrives? Downsampling takes high-granularity data and makes it less granular by applying means, averages and other operations. This preserves the shape of your historical data so you can still do historical comparisons and anomaly detection while reducing storage space.
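Downsampling by time window can be sketched in a few lines of plain Python. The example below groups readings into five-minute windows and keeps the mean of each. The function name and the (epoch seconds, value) tuple layout are assumptions for illustration. In practice, the database’s own aggregation features would usually do this work.

```python
from collections import defaultdict
from statistics import mean


def downsample_mean(points, window_seconds=300):
    """Aggregate (epoch_seconds, value) pairs into per-window means."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each timestamp to the start of its window.
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


readings = [(0, 10.0), (60, 20.0), (299, 30.0), (300, 40.0), (540, 60.0)]
downsampled = downsample_mean(readings)
print(downsampled)
```

Five raw readings collapse into two rows, one per five-minute window, while the overall shape of the series is preserved.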

5. Real-time monitoring

When it comes to analyzing your data, you can either compare it to historical data to find anomalies, or you can set parameters. Regardless of your monitoring style, it’s important to do so in real time so that you can use the incoming data to make quick decisions and take fast action. The primary approaches for real-time monitoring include using a built-in option in your database, real-time monitoring tools or a combo of the two.
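The two monitoring styles mentioned above can be illustrated with a short sketch. The function names, the limits and the z-score threshold of 3 are arbitrary choices for illustration. A production setup would more likely rely on the alerting features of the database or monitoring tool.

```python
from statistics import mean, stdev


def outside_limits(value: float, low: float, high: float) -> bool:
    """Parameter-based check: flag values outside fixed bounds."""
    return value < low or value > high


def is_anomaly(value: float, history: list, z_threshold: float = 3.0) -> bool:
    """History-based check: flag values far from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


history = [20.0, 21.0, 19.0, 20.0, 20.0]
print(outside_limits(35.0, low=10.0, high=30.0))
print(is_anomaly(30.0, history))
print(is_anomaly(20.5, history))
```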

Regardless of the approach you choose, it’s critical for queries to have quick response times and minimal lag because the longer it takes for your data to reach your tools, the less real time it becomes. Telegraf offers output plugins to various real-time monitoring solutions. Telegraf is configured to work with time series data and is optimized for InfluxDB. So if you want to optimize data transport, you might want to consider that combination.

6. Historical aggregation and cold storage

When your data is no longer relevant in real time, it’s common to continue to use it for historical data analysis. You might also want to store older data, whether raw or downsampled, in more efficient cold storage or a data lake. As great as a time series database is for ingesting and working with real-time data, it also needs to be a great place to store your data long term.

Some replication across locations is almost inevitable, but the more you can prevent that, the better, outside of backups, of course. In the near future, InfluxDB will offer a dedicated cold storage solution. In the meantime, you can always use Telegraf output plugins to send your data to other cold storage solutions.

When working with IoT data, it’s important to use the right tools, from storage to analytics to visualization. Selecting the tools that best fit your IoT data and workloads at the outset will make your job easier and faster in the long run.

Cleaning and Interpreting Time Series Metrics with InfluxDB

Zoe Steinkamp (InfluxData) — Fri, 31 Mar 2023 07:00:00 +0000

This article was originally published in The New Stack and is reposted here with permission.

A look at how to use Flux for data cleansing and analytics through the browser and via Visual Studio.

Time series data is data you want to analyze and monitor over time. For example, you might want to know the water levels over the course of the day for a plant, or how much sunlight it receives and when. This is a simple but easy-to-understand example. Obviously on a larger scale the stakes can be higher. You could be monitoring server infrastructure in a data center or pressure of a machine on a factory floor.

These are times when failure and real-time reactions can be extremely important to avoid an emergency. Time series data is commonly metrics, normally from IoT devices or server infrastructure.

Metrics normally arrive in a constant stream, for example one value every second, but sometimes they arrive at irregular intervals. Raw time series metrics can benefit from cleanup and normalization before they are exposed for broader use and storage. When dealing with large amounts of time series metrics, it can be helpful to standardize the ways in which others can search through that data for specific time frames using easy-to-understand tags. There are many types of time series metrics, but for this blog post, we will focus on metrics from our internal storage engine, provided by one of our site reliability engineers (SREs).

For this tutorial, I will use InfluxDB’s time series data platform. The core of InfluxDB is a highly performant time series database that can process millions of data points per second, but it also comes with data collectors and scripting languages. This technical session focuses on using Flux, a data-processing and querying language used by InfluxDB. Flux has many of the capabilities of a query language like SQL, but it also comes with built-in analysis and data science capabilities. Later, we will also use Flux to create alerts and downsampling tasks. In the future we will also include SQL integration, which will allow for a new way to query your data.

Flux is already built into InfluxDB, so there is no need for extra installation. This post shows how to leverage Flux for data cleansing and analytics through the browser and via Visual Studio. You can check out the Visual Studio extension here for more details. You can also use the command line or the cloud UI to interact with your data.

We will start with a simple Flux query to get more familiar with how Flux is written and how it works. Your bucket is your database name. Each bucket can be customized for the data it accepts and the length of time it retains that data. First notice the range() function. Since this language is for time series, logically we need to query for data from a time range. In this example our range is about 20 seconds (from the start and stop times). You could choose to have no range, but that will return all the data in your bucket.

This is data from our measurement called “node_points_total.” It has a set of counters, which increment every time a point gets written. If it’s a good point, the “ok” counter gets incremented; if it’s a “bad” point, the counter for the corresponding status gets incremented. Here we’re calculating the percentage of points that were successfully written. The top of the query filters down to the specific node and host we want to monitor. We search for points with four statuses (ok, denied, error and dropped). Then we pivot the data and calculate the percentage that were ok with the map() function. The map() function generates a “percentage good” number, which will allow you to say “99.98% of all points written were OK.”
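The query described above is not reproduced in this text. As a rough sketch of what such a query might look like, here it is held in a JavaScript template string so it could later be sent through a client library. The bucket, node, host and tag names and the timestamps are assumptions for illustration, and the `percentOk` helper only mirrors the arithmetic that `map()` performs on one pivoted row.

```javascript
// Hypothetical Flux query: bucket, node, host and tag names are illustrative,
// and the start/stop times are placeholders that are 20 seconds apart.
const percentOkQuery = `
from(bucket: "storage-metrics")
  |> range(start: 2023-03-01T00:00:00Z, stop: 2023-03-01T00:00:20Z)
  |> filter(fn: (r) => r._measurement == "node_points_total")
  |> filter(fn: (r) => r.node == "node-1" and r.host == "host-1")
  |> filter(fn: (r) => r.status == "ok" or r.status == "denied" or r.status == "error" or r.status == "dropped")
  |> pivot(rowKey: ["_time"], columnKey: ["status"], valueColumn: "_value")
  |> map(fn: (r) => ({r with percent_ok:
      float(v: r.ok) / float(v: r.ok + r.denied + r.error + r.dropped) * 100.0}))
`;

// The same calculation in plain JavaScript, for a single pivoted row.
function percentOk({ ok, denied, error, dropped }) {
  const total = ok + denied + error + dropped;
  // Avoid dividing by zero when no points were written at all.
  return total === 0 ? NaN : (ok / total) * 100;
}

console.log(percentOk({ ok: 9998, denied: 1, error: 1, dropped: 0 }).toFixed(2)); // prints "99.98"
```

The sketch assumes all four status columns are present after the pivot; a real query may need to handle a status that never occurred in the window.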

This is the result from the query being run. As you can see, we are visualizing the result.

The next example is going to use a larger range, four days of data. That’s a lot! We are using an aggregateWindow() that will give us the sum of our values for every hour (we will return 96 results, one for each hour of the past four days). This is not only a faster query, but if we were to look at the data result in a table, it would also be easier to read only 96 results instead of 2 million!
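As an illustration of the hourly aggregation described above, the sketch below shows one way the query might be written, again as a JavaScript template string. The bucket name and the `status` tag are assumptions, and `windowCount` is a hypothetical helper that just reproduces the "96 results" arithmetic.

```javascript
// Hypothetical Flux query: the bucket name and status tag are illustrative.
const hourlySumQuery = `
from(bucket: "storage-metrics")
  |> range(start: -4d)
  |> filter(fn: (r) => r._measurement == "node_points_total" and r.status == "ok")
  |> aggregateWindow(every: 1h, fn: sum, createEmpty: false)
`;

// Number of windows per series, assuming the range is aligned to window boundaries.
function windowCount(rangeHours, everyHours) {
  return Math.ceil(rangeHours / everyHours);
}

console.log(windowCount(4 * 24, 1)); // prints 96
```

A relative range such as `-4d` usually does not start exactly on the hour, so a real result can include a partial window at each edge.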

This is the last query graphed. Here we see a small drop to 99.88% “ok” writes over the span of a few days. Our service-level agreement (SLA) for cloud is 99.9% monthly availability. This is an example of a specific day that had a small incident. Overall, we still meet our SLA goals, but it is good to notice even minor dips. This allows our SRE team to take action when needed.

The following Flux is a simplified version of what we call downsampling. Downsampling is the process of reducing raw high-precision data to lower-precision aggregates. We combine the aggregateWindow() function with the to() function to write the downsampled data to a new bucket. In this example, we get the mean value for every 10-minute interval of data and store that data point in a new bucket.

This can be helpful for three problems:

Making it easier to run analysis on a smaller data set and gain high-level insights from historical data.
Cleaning up erroneous data.
Storing a smaller data set for the long term and reducing your overall instance size while retaining the overall shape of your data.
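The downsampling step described above could be sketched as follows. This is a minimal illustration rather than the original task: the bucket names, the one-hour lookback and the `downsampleQuery` helper are all assumptions.

```javascript
// Build a Flux downsampling query as a string.
// Only pass trusted values: the names are interpolated without escaping.
function downsampleQuery({ source, destination, every = "10m" }) {
  return `
from(bucket: "${source}")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "node_points_total")
  |> aggregateWindow(every: ${every}, fn: mean, createEmpty: false)
  |> to(bucket: "${destination}")
`;
}

// Hypothetical bucket names.
console.log(downsampleQuery({ source: "storage-metrics", destination: "storage-metrics-10m" }));
```

Giving the destination bucket a longer retention period than the source is one common way to keep the smaller data set around for the long term.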

Finally we will take a look at alerting on our data set. This first part is all about filtering down to the field we want to monitor. We specifically want to filter for the “ok” status.

We are using the quantile function to determine when a value is outside the 95th percentile (95p). The compression parameter of the quantile function controls how detailed the data is when determining the 95p. The larger the compression, the longer the calculation takes. You can find more information on how to set up the quantile function in the docs.
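To make the quantile step more concrete, here is a sketch of how the filter and the 95th percentile might be expressed. The bucket name, time range and compression value are assumptions. The `nearestRankQuantile` helper is an exact nearest-rank calculation in plain JavaScript, included only to show what a quantile is; it is not the t-digest estimate that Flux computes with `estimate_tdigest`.

```javascript
// Hypothetical Flux query: bucket, range and compression value are illustrative.
const p95Query = `
from(bucket: "storage-metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "node_points_total" and r.status == "ok")
  |> quantile(q: 0.95, method: "estimate_tdigest", compression: 1000.0)
`;

// Exact nearest-rank quantile: the smallest value with at least q of the data at or below it.
function nearestRankQuantile(values, q) {
  if (values.length === 0) return undefined;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}

console.log(nearestRankQuantile([40, 10, 30, 20], 0.5)); // prints 20
```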

This code checks that value and sets the type and level to “critical” or “info.” From there we also create a Slack message that sends an alert if the status type is “critical.” We have an alert function here that is being called, but we don’t need to go in depth on it. The bottom line is you can use Flux to build these alert tasks and receive alerts if your data is out of a defined boundary or is doing OK. To learn more about creating alerts and notifications with Flux, take a look at the following documentation.
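The alerting logic above is written in Flux in the original task, which is not reproduced here. Purely to illustrate the decision it makes, the sketch below expresses the same idea in JavaScript. The function names, the "above the threshold is critical" rule and the message wording are all assumptions; the only external detail relied on is that a Slack incoming webhook accepts a JSON body with a `text` field.

```javascript
// Hypothetical rule: a value above the threshold is "critical", anything else is "info".
function checkLevel(value, threshold) {
  return value > threshold ? "critical" : "info";
}

// Build a Slack incoming-webhook payload, but only for critical results.
function slackPayload(level, value, threshold) {
  if (level !== "critical") return null;
  return { text: `node_points_total is ${value}, outside the 95p threshold of ${threshold}` };
}

const level = checkLevel(120, 100);
console.log(level, slackPayload(level, 120, 100));
```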

These are just a few of the simpler capabilities for cleaning and interpreting your time series metrics in InfluxDB. We have a large amount of Flux documentation and examples you can reference depending on your use case and needs. We love to see what people are building in our open source community and look forward to connecting with you in our community forums and Slack channel to see what amazing projects you build!

Data Visualizations with InfluxDB: Integrating plotly.js

Zoe Steinkamp (InfluxData) — Sat, 29 Jan 2022 13:31:43 -0700

One of the great features of the InfluxData cloud platform is that it comes out of the box with all the tools you need to quickly read and write your data to the database. Here, we’ll walk through creating data visualizations with InfluxDB and plotly.js, a JavaScript graphing library built on top of d3.js and stack.gl. (If you’re instead looking for a tutorial on visualizing time series data with Chart.js and InfluxDB, click the link.)

Getting started

Before we start visualizing, we need to set up an instance of InfluxDB on our local machines. When you create your InfluxDB account, you will want to create a bucket to store your time series data in and query from. Once you have your bucket, go to the /client-libraries/javascript-node page to grab a token that is valid to read from and write to your bucket. You will also find examples of how to install, set up, and query on this page.

Querying data from InfluxDB

When you start your project, you will want to run these commands. They initialize the Node project, install the InfluxDB client, and install Express.

mkdir influx-node-app && cd influx-node-app
npm init -y
npm i @influxdata/influxdb-client
npm install express --save

Once you’re all set up, and InfluxDB is running, you can set up a file to query the database and grab some data. I’m going to do this with Node/Express using the influx-js client library. For this example, I’ll be querying for data that Telegraf is already collecting for my computer. If you have Telegraf installed and running, you should be able to do the same. I would expect your package.json file to look like this, keeping in mind that I have my main file as app.js:

In my app.js file, I have the following. As you can see, you will need to add your org email and your secret token. The token should live in a token file that you don’t push to GitHub publicly. Also pay attention to your InfluxDB URL; mine is set to us-west:
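The connection settings mentioned above might be gathered like this. This is a sketch under stated assumptions: it reads from environment variables instead of the untracked token file the article uses, the variable names are made up, and the default URL is just one us-west InfluxDB Cloud region that you should replace with your own.

```javascript
// Assumption: settings come from environment variables (hypothetical names)
// rather than from a token file kept out of version control.
const influxConfig = {
  url: process.env.INFLUX_URL || "https://us-west-2-1.aws.cloud2.influxdata.com",
  org: process.env.INFLUX_ORG || "you@example.com",
  token: process.env.INFLUX_TOKEN || "",
};

// Report which settings are still empty so the app can fail early.
function missingSettings(config) {
  return Object.keys(config).filter((key) => !config[key]);
}

console.log("missing settings:", missingSettings(influxConfig));

// With @influxdata/influxdb-client installed, the values would be used as:
//   const { InfluxDB } = require("@influxdata/influxdb-client");
//   const queryApi = new InfluxDB({ url: influxConfig.url, token: influxConfig.token })
//     .getQueryApi(influxConfig.org);
```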

Finally in the rest of my app.js file:

This is my app.get function to grab my total CPU usage. I am using a Flux query with a few filter functions to get the exact data I’m looking to display. I am filtering with my Flux query because it is easier to filter the data as it comes in than to filter it in our application. My Flux query was built using the query builder in InfluxDB Cloud. This made it easy to build my simple query, and by switching to the script editor I could copy the Flux script.
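A route like the one described above could be sketched as follows. The bucket name, time range, field choice and route path are assumptions; the measurement, field and tag names follow Telegraf's standard cpu input. The handler takes the query API as an argument so the sketch runs without Express or the client library installed, and it relies only on the client's `collectRows()` method.

```javascript
// Hypothetical Flux query for total CPU usage as collected by Telegraf's cpu input.
const cpuQuery = `
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> filter(fn: (r) => r._field == "usage_user")
  |> filter(fn: (r) => r.cpu == "cpu-total")
`;

// queryApi is expected to expose collectRows(query), as the client's QueryApi does.
function makeCpuHandler(queryApi) {
  return async (req, res) => {
    try {
      // Send every row of the result back to the browser as JSON.
      res.json(await queryApi.collectRows(cpuQuery));
    } catch (err) {
      res.status(500).json({ error: String(err) });
    }
  };
}

// In app.js this might be registered as (hypothetical path):
//   app.get("/cpu/usage", makeCpuHandler(queryApi));
```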

From there, the data is called into our script.js file.

Here I have two fetch functions that call two specific data sources with two separate Flux queries. We unpack the data, looking specifically for the time and value to use as the x and y values when graphed. We return trace data, an object holding the arrays of x and y values for one line of the graph. Finally, once we have received all the data from InfluxDB and prepared our graph object, we call Plotly to draw the graph.
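The unpacking step can be illustrated with a small helper. This is a sketch: `toTrace` is a hypothetical name, and it assumes each row carries the `_time` and `_value` columns that Flux results normally include.

```javascript
// Turn an array of result rows into a Plotly line trace.
function toTrace(rows, name) {
  return {
    x: rows.map((row) => row._time),  // timestamps along the x axis
    y: rows.map((row) => row._value), // measured values along the y axis
    type: "scatter",
    mode: "lines",
    name,
  };
}

// Made-up sample rows in the shape the server is assumed to return.
const sampleRows = [
  { _time: "2022-01-29T00:00:00Z", _value: 12.5 },
  { _time: "2022-01-29T00:00:10Z", _value: 14.1 },
];

console.log(toTrace(sampleRows, "cpu usage"));
```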

If you add more fetch calls, it’s important to add them to our promise so that it waits until all the data has been returned before making the Plotly graphs. Finally, we make our Plotly line graph with two lines.
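One way to wait for every request before drawing is `Promise.all`, sketched below. The endpoint paths and the `loadAll` helper are assumptions, and the fetching function is passed in so the sketch runs outside a browser.

```javascript
// Hypothetical endpoints; the real route names live in app.js.
const endpoints = ["/cpu/usage", "/mem/usage"];

// Resolve only after every request has resolved; reject as soon as any one fails.
function loadAll(fetchRows, urls) {
  return Promise.all(urls.map((url) => fetchRows(url)));
}

// In the browser this might be wired up as follows, where rowsToTrace stands for
// whatever helper turns rows into a Plotly trace and "graph" is the target div id:
//   loadAll((url) => fetch(url).then((res) => res.json()), endpoints)
//     .then((series) => Plotly.newPlot("graph", series.map(rowsToTrace)));
```

Adding a third data source then only means adding another entry to the list of endpoints.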

Setting up HTML and CSS

In the public folder you will find the index.html, styles.css, and script.js files. We will be using these files to display the Plotly graphs. In our index.html you will see where we append the graph from script.js:

Aside from incorporating jQuery and your script file, you’ll need to incorporate the Plotly CDN. Alternatively, you can install the plotly.js library as an npm module. For more information on that, please visit the “Getting Started” guide on the plotly.js website.

Let’s add the following CSS to our styles.css file for a bit of flair:

If you restart your server at this point and navigate to localhost:3000, you should see something similar to this:

And voila! You’ve got yourself a custom time series visualization, using InfluxDB and plotly.js! You can check out the source code on GitHub or feel free to send me an email at zsteinkamp@influxdb.com.
