InfluxData Blog - Katy Farmer

Monitoring & Alerting in InfluxDB Cloud 2.0

Katy Farmer (InfluxData) — Fri, 20 Sep 2019 06:13:27 -0700

We’re here to talk about monitoring and alerting in InfluxDB Cloud 2.0. We’re trying to make learning from your data easy, and in that spirit, we’re going to walk through setting up alerts and notifications in InfluxDB Cloud 2.0. It only takes a few minutes, but first, let’s talk about the fundamentals.

What are checks?

Basically, a check is a conditional. For example, if my donut level is 0, assign CRIT. If my battery level is below 25%, assign WARN. A check is formed by building a query that will assign a status (e.g., “CRIT”, “INFO”, etc.) based on specific conditions.
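To make the "check is a conditional" idea concrete, here is a minimal Python sketch. It is not how InfluxDB Cloud implements checks; the function names and the WARN threshold for donuts are made up for illustration.

```python
def donut_check(donut_level: int) -> str:
    """Assign a status to a donut count (the WARN threshold is hypothetical)."""
    if donut_level == 0:
        return "CRIT"
    if donut_level < 6:
        return "WARN"
    return "OK"


def battery_check(battery_percent: float) -> str:
    """Assign WARN when the battery level drops below 25%."""
    if battery_percent < 25:
        return "WARN"
    return "OK"


print(donut_check(0))     # CRIT
print(battery_check(20))  # WARN
```

Each function takes a value from the data and returns a status level, which is all a check really does.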

What are notifications?

Notifications are the way we find out about the checks our system is running. We receive notifications by configuring a notification endpoint (like Slack). When I receive a Slack message saying “Donut levels are CRITICAL”, that is a notification.

The process

Checks and notifications form alerting for InfluxDB Cloud 2.0, an essential part of the monitoring workflow. Now that we’ve passed the vocabulary test, let’s practice setting up checks and notifications. The following instructions assume you already have a data collector set up. If you don’t have a data collector yet, read this blog to get started. Once our system is successfully collecting data, it’s time to set up alerting.

Use the “Set up alerting” graphic on your home tab to get started. This is a shortcut to the Monitoring & Alerting tab in the left-side navigation.

Create checks

InfluxDB Cloud knows we need a little help getting started, so go ahead and click the happy little Create a Check button.

The page we see should look a lot like the Data Explorer; we’re using the same logic to build a query for our check.

The first step is to build the query that will run inside the check. In the example above, we’re querying the average CPU usage of a particular host. Once you have your query, click 2. Check to finish the check.

First, name the check. Just like variable names, go with something descriptive. The rule in the screenshot is called “CPU Usage”. I might name my other checks something equally simple like “Donut Inventory” or “Shark Tank Water Level”.

Or Shark Tank Donut Inventory?

We also schedule the check here. If I’m gathering CPU data at 10-second intervals (like in the example shown), I might want to schedule a check for every 30 seconds to make sure I’m not missing very much data. If I’m monitoring my donut inventory, I probably only have to check every hour, as my donut intake is limited.

Now, we set the message we want to attach to the check. The message template can use string interpolation, so the text we see above, Check: ${ r._check_name } is: ${ r._level }, would evaluate to something like Check: CPU Usage is: WARN. Because we can use any of the columns from our query, we can be as specific as we want with this message.
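If string interpolation is new to you, this small Python sketch mimics the idea. The render function and the sample row are assumptions for illustration; InfluxDB does the real substitution for us.

```python
import re


def render(template: str, row: dict) -> str:
    """Replace each ${ r.column } placeholder with the row's value for that column."""
    def lookup(match):
        return str(row[match.group(1)])

    return re.sub(r"\$\{\s*r\.(\w+)\s*\}", lookup, template)


# Hypothetical row produced by a check.
row = {"_check_name": "CPU Usage", "_level": "WARN", "host": "server01"}

print(render("Check: ${ r._check_name } is: ${ r._level }", row))
# Check: CPU Usage is: WARN
```

Any column in the row can be referenced the same way, for example ${ r.host }.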

The last step is to set the conditions. Let’s take a closer look at the menu.

One of the best things about this setup is that we can adjust all of the values and conditions to see how often that condition is met in the current data. This helps us make sure that the status messages match the reality of our data. If my donut level is always CRIT, eventually I’ll stop taking it as seriously.

After you’ve set the query and the check, click that green checkmark in the top right because you’re done!

All checks will show in the Monitoring & Alerting tab with a toggle switch so that it’s easy to enable/disable.

If you’re like me, you want a little assurance that your check is working. Hover over your check and select View History.

This will display a list of all results of the check, regardless of the status level.

Now we know that our check is working, and we can move on to setting up notifications.

Create notifications

From the same Monitoring & Alerting tab, you’ll see two more columns for Notification Endpoints & Notification Rules.

Notification endpoints

If you’re using the free tier of Cloud 2.0, then Slack is the only Notification Endpoint available right now. That’s alright because Slack is super easy to set up. Make sure you have Slack incoming webhooks enabled. All we have to do is provide our Slack webhook URL and name the endpoint.

Notification rules

We’ve done most of the work. Let’s get a notification already! There’s just one last step: creating the notification rule. This step highlights the difference between checks and notifications.

We could be running checks every 15 seconds in our system, but that doesn’t mean we want a notification every time. In the Notification Rule, we can determine the interval at which the notification rule should evaluate whether to actually send a notification. In the example above, my notification rule runs every 5 minutes and only sends me a Slack message if the status is CRIT. The rest of the data from the check is available to me in the history (and the _monitoring bucket), but I only want to be notified when I’m out of donuts, or when my CPU usage is so high that my server is about to crash.
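Here is a rough Python sketch of what a notification rule decides each time it runs. The Status class, the minute-based timestamps, and statuses_to_notify are all hypothetical; they only illustrate "look at the last interval, keep the CRIT statuses".

```python
from dataclasses import dataclass


@dataclass
class Status:
    check_name: str
    level: str   # "OK", "INFO", "WARN", or "CRIT"
    minute: int  # hypothetical timestamp: minutes since we started watching


def statuses_to_notify(statuses, now_minute, every_minutes=5, level="CRIT"):
    """Return statuses from the last rule interval that match the given level."""
    window_start = now_minute - every_minutes
    return [
        s for s in statuses
        if window_start < s.minute <= now_minute and s.level == level
    ]


history = [
    Status("Donut Inventory", "OK", 1),
    Status("Donut Inventory", "WARN", 3),
    Status("Donut Inventory", "CRIT", 4),
    Status("Donut Inventory", "CRIT", 9),
]

to_send = statuses_to_notify(history, now_minute=5)
print(len(to_send))  # 1
```

The check history still holds all four statuses; the rule only picks out the one worth a Slack message.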

The message template works exactly the same as the status message template in our check. We can interpolate here so that we can provide as much data as possible.

Summary

We did it! Beginning to end, it only took me a few minutes to get a Slack notification about my perilous CPU usage. Set up your own alerts using InfluxDB Cloud 2.0 and tell us what you think!

Telegraf Configuration in InfluxDB 2.0

Katy Farmer (InfluxData) — Fri, 06 Sep 2019 07:00:02 -0700

Welcome to the time series world of tomorrow, InfluxData 2.0. We’re getting started with 2.0 by explaining the fundamentals. Today, we’re talking about what happens behind the scenes when you configure Telegraf as a data collector.

Collecting data is the first step in creating useful information, which makes it a vital step in any of our systems.

All vital steps should have rainbow connectors, in my opinion

Configuring Telegraf

When we click on “Configure a Data Collector”, we’re transported (through the magic of JavaScript) to the ‘Load Data’ page, where we can configure Telegraf.

There’s a whole tab for Telegraf because a lot of people rely on multiple Telegraf configurations to power their data collection. From here we select “Create Configuration”, which gives us a few options for services to monitor that are prebuilt for us. I like to start with System, as it’s easy to set up and verifies that Telegraf is working. We can also specify which bucket we want Telegraf to write to, so we can create a bucket just for these system metrics if we want to. If you don’t see what you want to monitor listed, you can still use Telegraf to gather those metrics, but you won’t be able to use the configurator to do it. We’ve described how to configure Telegraf manually for use with InfluxDB 2.0 here.

On the next step, name your configuration. Again, I usually go for something descriptive like “katy’s system” or “my precious”. If that’s not enough, you can also add a description.

Now that we’ve got the navigation out of the way, we’re on to the juicy part of the Telegraf configuration. On the final step, we have 3 tasks to complete.

Install Telegraf (must be at least Telegraf 1.9.2). Telegraf remains separate from the InfluxData 2.0 platform so that it can be easily installed and deployed in all of the places we want to monitor. Need help with Telegraf installation? Check out this guide.
Configure the API token. InfluxDB 2.0 uses token-based authentication, so we need to set an environment variable to store our token. Storing this token gives Telegraf permission to access the InfluxDB API so we can write the data we're collecting into InfluxDB. I use bash, so I saved my token in my .bashrc file.
We only have one task left between us and data collection. We need to start the Telegraf service using the right config file. Remember when we chose to monitor "System"? That determined what our Telegraf config file looks like.

On the initial Telegraf setup, we can test our connection to make sure data is being written into our bucket. I love this feature because I love confirmation that things have gone correctly, both on my end and the platform.

Don’t worry if you move on from this list and need it later — we can always see the setup instructions again by clicking the link next to the configuration in the Telegraf tab.

A deeper dive

I love an intuitive UI, and the front-end team at Influx has done a smashing job leading us through configuring Telegraf. So what’s happening behind the scenes? For Telegraf to start sending metrics to InfluxDB, it needs two things: an input plugin and an output plugin. When we configure Telegraf from the UI, it enables the InfluxDB output plugin automatically, so we don’t have to specify the output (we can always change the output plugin later if we need to).

The output portion of the config file should look like this:

[[outputs.influxdb_v2]]
  ## The URLs of the InfluxDB cluster nodes.
  ##
  ## Multiple URLs can be specified for a single cluster, only ONE of the
  ## urls will be written to each interval.
  ## urls exp: http://127.0.0.1:9999
  urls = ["http://localhost:9999"]

  ## Token for authentication.
  token = "$INFLUX_TOKEN"

Although the token is not technically part of the output, we still need it for a successful Telegraf connection.
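The "$INFLUX_TOKEN" value is a reference to the environment variable we set earlier. Telegraf resolves it on its own; the Python sketch below only mimics the idea with the standard library, and the token value is a made-up placeholder.

```python
import os

# Hypothetical token value, set here only so the example runs on its own.
os.environ["INFLUX_TOKEN"] = "my-example-token"

config_line = 'token = "$INFLUX_TOKEN"'

# Replace the $NAME reference with the value found in the environment.
resolved_line = os.path.expandvars(config_line)
print(resolved_line)  # token = "my-example-token"
```

Keeping the token in the environment means the config file itself never contains the secret.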

When we chose “System” as the environment to monitor, a Telegraf config file was generated for us that had a group of input plugins enabled. This includes cpu, disk, diskio, mem, net, processes, swap, and system. All of these help to build a complete picture of the system they are running on. There is no configuration required for this set of input plugins, which means they start collecting metrics right away.

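Do you want to know what the rest of the Telegraf config looks like? I’m sure you do. The first two settings (organization and bucket) still belong to the [[outputs.influxdb_v2]] section shown above; the input plugins follow them.

  ## Organization is the name of the organization you wish to write to; must exist.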
  organization = "DevRel-Katy"

  ## Destination bucket to write into.
  bucket = "test"
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states.
  report_active = false
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]
  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "overlay", "aufs", "squashfs"]
[[inputs.diskio]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

If we had chosen Redis, for example, we would have to provide some information before it can generate a working config file. If we click on the list of plugins, we can then add any server information or authentication we need to get things up and running.

Running things locally since 1987

When we run the command in step 3 (telegraf --config http://localhost:9999/***/***), we’re running the Telegraf service and telling it where to find the config file for the right input and output (in our case, system and InfluxDB, respectively). The config file itself lives within the platform (in BoltDB), which is why the configuration instructions conveniently give us the file’s address. We can also view the actual toml file by clicking the name of the Telegraf config (in the Telegraf tab) and even download it for safekeeping or reuse.

If we make changes to our Telegraf config after we’ve started collecting data, we have to restart the Telegraf service for the changes to take effect. If you can’t remember the instructions for running it, you can visit the Telegraf plugins list to see the instructions again.

Time to awesome

Telegraf remains separate from the InfluxData 2.0 platform so that we can keep using it in all of the places we find useful; after all, it’s nice to have an open source agent to rely on. The goal in 2.0 is to make sure that we’re maximizing developer happiness, which means that integrating Telegraf with the platform should be as easy as 1, 2, 3.

See what I did there?

Red Flags of High Cardinality in Databases

Katy Farmer (InfluxData) — Tue, 03 Sep 2019 07:00:26 -0700

High cardinality describes data with a large number of distinct values, for instance when every line item has its own unique ID number, description, or email address. A column with many repeated values would be described as having low cardinality. Not everyone calculates cardinality in the same way, so it’s important to check how your database computes it before relying on a number.

Let’s face facts: cardinality is not an easy concept. There’s a reason we see many different definitions for it across the world wide web — just like bursting into song in public, context is important. Not everyone wants to be in a musical, and not everyone computes cardinality in the same way. Regardless of how difficult it is to understand, some of us still have to work with and around cardinality, especially if, like me, you work in the land of databases. Let’s untangle the basics of cardinality.

What is cardinality?

On the most basic level, cardinality is the number of unique sets of data in a database. For example, I have two dogs — Bear and Freddie — and at any given time, they are doing one of three things: sleeping, barking, or chewing.

In Figure 1, we have a list of dogs and their associated status. In this case, there are six unique combinations in my pet data: 2 dogs each associated with 3 statuses. The cardinality of this data set is 6 (2 dogs x 3 statuses).

When we put data into a database, we create relationships between different aspects of data. With a cardinality of 6, my dogs and their activities have pretty straightforward relationships. Let’s add a little more context. It would look something like this:

We still have two dogs and three possible statuses, but we have 5 rooms in the house.

2 dogs x 3 statuses x 5 locations = 30

This is an overestimation of the cardinality because each dog isn’t associated with each status and the math to be more precise is a little (a lot?) more complicated, but it’s a close estimation for our small example.
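The difference between the estimate and the real number is easy to see in a few lines of Python. The room names and the observations are invented for this sketch; the point is that the product is an upper bound, while the actual cardinality counts only the combinations that show up in the data.

```python
from itertools import product

dogs = ["Bear", "Freddie"]
statuses = ["sleeping", "barking", "chewing"]
rooms = ["kitchen", "living room", "bedroom", "office", "hallway"]  # hypothetical names

# Upper bound: every dog in every status in every room.
upper_bound = len(list(product(dogs, statuses, rooms)))

# Hypothetical observations actually written to the database.
observations = [
    ("Bear", "sleeping", "bedroom"),
    ("Bear", "sleeping", "bedroom"),  # a repeat adds no new combination
    ("Bear", "chewing", "living room"),
    ("Freddie", "barking", "kitchen"),
    ("Freddie", "sleeping", "office"),
]
observed = len(set(observations))

print(upper_bound)  # 30
print(observed)     # 4
```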

To put it in the most straightforward terms possible, cardinality in databases really comes down to two things: the relationships between tables and the cardinality of the data itself. Data cardinality is the one most relevant to query performance. As stated, it is a measure of how many unique values are present in a column.

More importantly, we can see how adding more unique attributes increases the total number of unique combinations. If I went by the shelter and adopted another dog today, the cardinality would jump to 45 (3 dogs x 3 statuses x 5 locations).

The unlisted status: frolicking

Now imagine that instead of dogs, we’re tracking thousands of satellites all around the world, each sending back a status, location, sensor data, and a timestamp. We could easily reach a cardinality in the millions, or even a billion.

Another example of this would take the form of a credit card company with two tables. The first shows a person who has a credit card, and the second shows the card individually. If a person can only have one credit card, obviously this would be a standard one-to-one relationship. If that person is allowed to sign up for multiple cards, it would be a one-to-many relationship because they would be connected to many different entries on the other table.

Do I need to worry about cardinality?

Most of the time, we don’t have to compute cardinality ourselves — and that’s a good thing! There are lots of articles about how cardinality is computed if you’re into that sort of math (it’s called set theory!), and those calculations are built into how databases work. We don’t need to spend time calculating cardinality, but we do need to be aware of the relationships we build between data because eventually, cardinality can affect the performance and stability of our database.

The more complex our data is, the more expensive it is to write, store, and retrieve it from our database. There are two easy steps to find out if the cardinality is an issue in your database:

Find out what is considered high cardinality for your database. Go to the community forums and docs!
Use your database's tools to find out the cardinality of your data (here's an example of how to find cardinality in InfluxDB)

What is the cardinality of a set?

The most widely used and accepted data cardinality definition involves how many values are in a set. Within the larger context of databases, this refers to the total number of unique values in a table column as compared to the number of rows in the same table. Note that for the purposes of this discussion, repeated values are not something to concern yourself with.

It’s equally important to understand that, in everyday conversation, cardinality is rarely quoted as a number; it’s not like you’re looking at a value on a scale of 1 to 10. To keep things as straightforward as possible, people simply talk about “low” or “high” cardinality. Low cardinality refers to a column that has a lot of repeated values, like status flags, Boolean values, or gender. In contrast, high cardinality refers to a column that has a large number of distinct values, such as ID numbers, user names, or email addresses.
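As a rough illustration of "low" versus "high", the Python sketch below counts distinct values per column and compares that count to the number of rows. The rows and the column_cardinality helper are assumptions for this example, not a database API.

```python
# Hypothetical table rows.
rows = [
    {"user_id": 101, "email": "bear@example.com", "active": True},
    {"user_id": 102, "email": "freddie@example.com", "active": True},
    {"user_id": 103, "email": "oinky@example.com", "active": False},
    {"user_id": 104, "email": "katy@example.com", "active": True},
    {"user_id": 105, "email": "donut@example.com", "active": True},
]


def column_cardinality(rows, column):
    """Return (distinct value count, distinct values / row count) for a column."""
    distinct = {row[column] for row in rows}
    return len(distinct), len(distinct) / len(rows)


print(column_cardinality(rows, "email"))   # (5, 1.0)  every value is unique: high
print(column_cardinality(rows, "active"))  # (2, 0.4)  lots of repeats: low
```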

All of this is important to know, as cardinality ultimately influences the query execution plan of the database. Different plans may be used to try to unlock the best performance depending on whether high or low cardinality is present.

How to find the cardinality of a set

For as complicated as the topic of cardinality is, thankfully the process of how to find cardinality really couldn’t be more straightforward. It’s also one that you can repeat for any finite set of elements that you’re working with.

All you need to do is count the number of distinct values in the set; that count is the set’s cardinal number. Then, you can use data cardinality and other processes to further define the relationships between those values in the set, but that’s largely a different matter altogether.

Note that the order of values that appear in the set does not impact the cardinality in any way. They can be arranged in literally any order and it wouldn’t impact the cardinality of the set at all. Likewise, it’s important to understand that two different sets may have identical cardinality but that doesn’t mean they’re equal. They can have the same number of values and still be different if they don’t have identical values present between them. It all depends heavily on the databases in question and the information that you’re currently working with.
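Python's built-in sets behave the same way, so the points above can be checked directly. The values are just examples.

```python
a = {"sleeping", "barking", "chewing"}
b = {"chewing", "sleeping", "barking"}  # same values, different order
c = {"red", "green", "blue"}            # different values, same count

print(len(a), len(b), len(c))  # 3 3 3
print(a == b)                  # True: order does not matter
print(a == c)                  # False: same cardinality, different sets

# Repeated values are counted once.
print(len(set(["barking", "barking", "sleeping"])))  # 2
```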

What do I do if I have high cardinality?

It’s worth saying that even if you have high cardinality, you might not need to do anything. Having high cardinality data isn’t a bad thing, and knowing that our data is complex can help us find issues specifically tied to this. If you have performance or stability issues in your database, then it’s worth trying to lower the cardinality to fix those problems.

The first question you can answer is: do you need every unique value that you’re storing? For example, you might be able to insert data every minute instead of every 5 seconds without losing the patterns in your data. Another option is to expire data after a specified window of time to keep the dataset smaller.
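A quick back-of-the-envelope calculation shows how much the write interval matters. This Python sketch only does the arithmetic for a single series over one day; points_per_day is a made-up helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_day(interval_seconds: int) -> int:
    """Number of points one series writes per day at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


every_5_seconds = points_per_day(5)
every_minute = points_per_day(60)

print(every_5_seconds)                  # 17280
print(every_minute)                     # 1440
print(every_5_seconds // every_minute)  # 12
```

Writing every minute instead of every 5 seconds stores twelve times fewer points per series, before any data is expired.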

If neither of those is an option, and your data is always going to be complex, make sure you’re using a database that is made for high cardinality data.

Summary

We’ve been talking about cardinality generally, but there’s one more factor to think about: the way data is organized in the database can affect cardinality. The problem is that the way data is organized changes depending on the database, which makes it hard to cover all of the ways it can help.

Hopefully, this explanation is enough to get you started learning about cardinality. Yes, it’s complicated, but it’s not unknowable. We don’t have to be database architects to understand the concept or why it matters. Cardinality is a way to measure the complexity of our data so that we can better understand the relationships between different aspects of data. This helps us build smarter relationships and design more stable systems. Go out into the internet and read more about cardinality!

What is the Difference between Metrics and Events?

Katy Farmer (InfluxData) — Thu, 14 Mar 2019 08:00:04 -0700

We gather all types of data from our systems when we adopt monitoring technologies and tools. We might, for example, want to see application metrics, database logs and network traffic side-by-side. We don’t always talk about the differences in these types of data, so today we’re covering a question I get asked most often: what is the difference between metrics and events?

See if you can spot the differences!

Technical differences

Metrics and events are two different types of time series data: regular and irregular, respectively. Regular data (metrics) are evenly distributed across time and can be used for processes like forecasting. Irregular data (events) are unpredictable, and while they still occur in temporal order, the intervals between events are inconsistent, which means that using them for forecasting or averaging could lead to unreliable results.

The basic difference is that metrics occur at regular intervals and events don’t. Imagine I’m monitoring my personal website. I want to track the response codes to make sure the site is available, so I collect them at frequent intervals. I could then query those response code metrics to figure out what percentage of the time my site was down (because it was too popular). But I also want to know when a user clicks on an ad. I don’t know when or if this click will happen, so collecting at a regular interval doesn’t make sense. If I have 12 clicks for the past year, the average will be one click a month, even though they could all have happened in October (the peak of my popularity).

It’s time series meets gossip!

In order to use event data for forecasting or averages, it has to be transformed into regular data. If you’re interested in modeling time series data, I recommend reading this blog on shaping and analyzing your data. If you’re using InfluxDB, you can see an example of working with irregular time series data here.
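Using the ad-click example, here is a small Python sketch of one way to turn irregular events into a regular series: count the events per month and fill the empty months with zero. The click dates are invented, and this is only one of several possible transformations.

```python
from collections import Counter
from datetime import date

# Hypothetical events: 12 ad clicks, all in October.
clicks = [date(2018, 10, day) for day in range(1, 13)]

# The naive yearly average hides when the clicks happened.
average_per_month = len(clicks) / 12

# Regular series: one count per month, with zero for months without clicks.
clicks_by_month = Counter(click.month for click in clicks)
monthly_counts = [clicks_by_month.get(month, 0) for month in range(1, 13)]

print(average_per_month)  # 1.0
print(monthly_counts)     # [0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 0, 0]
```

The monthly counts are evenly spaced in time, so they behave like a metric and can be averaged or charted without pretending the clicks were spread out.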

Because metrics and events are different types of data, this changes how the database can efficiently store and compress the data being ingested (e.g. different compression algorithms might be needed for different types of data). This is why at InfluxData, we emphasize the ability to track both metrics and events: not every system can do both, and not every system is optimized for both. Ideally, our database does its job and we don’t have to worry about the ways it handles data. We can send metrics and events into InfluxDB without knowing or caring about how the database differentiates between the two.

Practical differences

The way we can interact with data changes depending on whether it’s regular or irregular, so sometimes we do need to know whether the data we’re collecting are metrics or events. For example, metrics can be used for aggregates since we have data that is evenly spaced across time. We don’t want to use irregular data to find aggregates because they won’t be distributed across time evenly, and they’ll return some useless results.

Monitoring metrics and events

I want to keep track of my piggy bank closely. Right now, there’s only one metric I care about: total funds. Anyone can put money into my piggy bank, so I want to report the total funds at a one-minute interval. This means that every minute, my database will receive a data point with the timestamp and the amount of total funds in my piggy bank.

Her name is Oinky and she is a fine accountant.

Now, I want to track specific events for my piggy bank: deposits and withdrawals. When a deposit occurs, my database will receive a data point with the “deposit” tag, the timestamp and the amount of the deposit. Similarly, when a withdrawal occurs, my database will receive a data point with the “withdrawal” tag, the timestamp and the amount of the withdrawal.

This very simple dataset makes sure that the total funds reported by my piggy bank match the total deposits and withdrawals. This is the same way my parents balanced their checkbook, and the same way I used to close out the cash register during my retail career.
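The checkbook idea fits in a few lines of Python. The amounts, the opening balance, and the balance_from_events helper are hypothetical; amounts are kept in cents so the arithmetic stays exact.

```python
# Hypothetical events: (tag, amount in cents).
events = [
    ("deposit", 500),
    ("deposit", 250),
    ("withdrawal", 125),
]


def balance_from_events(opening_balance: int, events) -> int:
    """Replay deposits and withdrawals on top of an opening balance."""
    balance = opening_balance
    for tag, amount in events:
        if tag == "deposit":
            balance += amount
        elif tag == "withdrawal":
            balance -= amount
    return balance


reported_total_funds = 1625  # hypothetical latest value of the total funds metric
expected = balance_from_events(1000, events)

print(expected)                          # 1625
print(expected == reported_total_funds)  # True
```

If the two numbers ever disagree, either an event went missing or Oinky has some explaining to do.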

Imagine now that this is the same basic idea behind online banking. We could add more metadata to add detail to the events, like attaching a user ID to a deposit or withdrawal.

Conclusion

Metrics and events are complementary. The ability to monitor both is more necessary than ever, and it shouldn’t take a data scientist to be able to do it (data scientists are pretty cool, though).

Original artwork by Katy Farmer

Flux Windowing and Aggregation

Katy Farmer (InfluxData) — Wed, 09 Jan 2019 09:00:18 -0700

Today, we’re talking about queries. Specifically, we’re talking about Flux queries, the new language being developed at InfluxData. You can read about why we decided to write Flux and check out the technical preview of Flux.

If you’re an InfluxDB user, you’re probably using InfluxQL to write your queries, and you can keep writing it as long as you want. However, if you’re looking for an alternative or you’ve hit some of the boundaries of InfluxQL, it’s time to start learning Flux.

My finest work

I’ve been learning InfluxQL for the past year, so I wasn’t necessarily excited to add another DSL to my mental load. After all, we pride ourselves on developer happiness at Influx, so I want to make sure that learning Flux is a) simple and b) worth the time it takes to learn.

I’m going to test that out by learning how to window and aggregate data, which I already know how to do in InfluxQL (we’ll go over the full InfluxQL query at the end).

The Data

Because I am very hip and very cool, I’m going to use a script that routes cryptocurrency exchange information into InfluxDB. Like most people, I want to keep up on the trends in crypto so I know whether I should regret not buying any.

The Problem

We can get a lot of information about cryptocurrencies, but I only care about a few things:

What did the seller ask for?
What did the buyer want to pay?
At what price did the trade occur?

Essentially, I want to know how much the actual trading price differs from expectations. If the trade price is significantly less than the seller asked, then the whole crypto nonsense is done and I can move on. If the trade price is significantly higher than the buyer wanted, then of course I understand the crypto-craze; I’ve always said that crypto is the future.

We need to figure out the best way to represent the results we want. Let’s shape the data with Flux.

The Solution

Here is a sample of the data I’m collecting in the “Raw Data” view in Chronograf:

What’s important to know about this data? The top row displays the data type, which is always good to know, but especially useful in this raw data view for debugging. The second row, “group”, signifies which columns are part of the group key: the list of columns for which every row in a table has the same value (more about that later).

There are also a few Flux-specific columns to understand. Within the table schema, there are four universal columns: “_time” (the timestamp of the record), “_start” (the inclusive lower time bound of all records), “_stop” (the exclusive upper time bound of all records), and “_value” (the value of the record). In our case, the start time is 48 hours ago (now - 48 hours) and the stop time is now.

The other columns represent the schema I have defined for cryptocurrency data: “exchange” and “symbol” are tags while “_field” and “_measurement” represent the specific field and measurement of that record.

To start writing my query, I read this Flux guide, which explains the concepts of buckets, pipe forward operator, and tables. Essentially, every Flux query needs three things: a data source, a time range, and data filters.

Based on the guide, I started with this query:

from(bucket: "crypto/autogen")
  |> range(start: -48h)
  |> filter(fn: (r) => r._measurement == "prices" and (r._field == "last" or r._field == "lowestAsk" or r._field == "highestBid"))

The first line is my data source, the second is the time range, and everything that follows are data filters.

This query asks my database, “crypto”, for specific fields associated with the measurement “prices”: the last successful trade price (“last”), the lowest price a seller was willing to accept (“lowestAsk”), and the highest price a buyer would pay (“highestBid”) for the last 48 hours.

This is a good start, but the results were overwhelming. The resulting data looks the same as the sample above—but Chronograf displays a warning: “Large response truncated to first 103K rows.”

When it comes to visualizing data, we don’t want to do unnecessary work. There are so many cryptocurrencies on the market that even limiting the data to 48 hours was still too much to parse, and definitely too much to visualize. Truth be told, I don’t care about all of the currencies—as much as I love the idea of Dogecoin, Bitcoin is the trendsetter.

Let’s look at the same data, but only for Bitcoin.

from(bucket: "crypto/autogen")
  |> range(start: -48h)
  |> filter(fn: (r) => r._measurement == "prices" and (r._field == "last" or r._field == "lowestAsk" or r._field == "highestBid"))
  |> filter(fn: (r) => r.symbol == "USDT_BTC")

These results are much more manageable. Now we can more easily examine the results. Browsing through the raw data, we can see there are three separate tables (0-2) returned by this query: one for each series.

Table 0 represents the unique combination of the symbol (Bitcoin), the exchange (Poloniex), the measurement (prices) and the field (highestBid).

Table 1 represents the unique combination of the symbol (Bitcoin), the exchange (Poloniex), the measurement (prices) and the field (last).

Table 2 represents the unique combination of the symbol (Bitcoin), the exchange (Poloniex), the measurement (prices) and the field (lowestAsk).

In this case, only the field is changing, so that limits the number of series. Here is what the results look like in a line graph in Chronograf.
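To picture why we get exactly three tables, here is a Python sketch that groups records by the same four columns. It is only an analogy for how Flux splits results into tables; the prices are invented and the exchange name is written as a plain lowercase string for the example.

```python
from collections import defaultdict

# Hypothetical records; the values are made up.
records = [
    {"_measurement": "prices", "_field": "highestBid", "symbol": "USDT_BTC", "exchange": "poloniex", "_value": 3798.0},
    {"_measurement": "prices", "_field": "last", "symbol": "USDT_BTC", "exchange": "poloniex", "_value": 3800.0},
    {"_measurement": "prices", "_field": "lowestAsk", "symbol": "USDT_BTC", "exchange": "poloniex", "_value": 3802.0},
    {"_measurement": "prices", "_field": "last", "symbol": "USDT_BTC", "exchange": "poloniex", "_value": 3805.0},
]

GROUP_KEY = ("_measurement", "_field", "symbol", "exchange")

# One table per unique combination of group key values.
tables = defaultdict(list)
for record in records:
    key = tuple(record[column] for column in GROUP_KEY)
    tables[key].append(record)

print(len(tables))  # 3
```

Only the field differs between the keys, so four records land in three tables.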

Windowing Data

We don’t necessarily need to see every data point represented in our results, especially if we’re interested in trends and changes over time. In this case, we want to window and aggregate our data, meaning that we want to split our output by some window of time and then find the average for that window. First, I hopped over to the window() function in the official Flux docs, then added the last line of this Flux query.

from(bucket: "crypto/autogen")
  |> range(start: -48h)
  |> filter(fn: (r) => r._measurement == "prices" and (r._field == "last" or r._field == "lowestAsk" or r._field == "highestBid"))
  |> filter(fn: (r) => r.symbol == "USDT_BTC")
  |> window(every: 1h)

Let’s see what that changed about our raw data.

First, although we can’t see it in the screenshots above, the “_start” and “_stop” columns have changed to reflect the time bounds. If I hover over “_start” and “_stop”, respectively, I see the following time stamps.

2019-01-03T18:00:00Z 2019-01-03T19:00:00Z

Whereas before our time range started 48 hours in the past and stopped now (not now, but you know, now()), our start and stop times have changed to reflect the time window we specified in our query.

Following the table column, we’ve got tables 0-23. If you’re thinking that number should be higher, you’re right, but we’re missing some data due to outages. Each table represents an hour window for each unique series (just like the combinations listed under the tables above).
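To make the "one table per window" idea concrete, here is a minimal JavaScript sketch of what windowing does to a single series. It is only an illustration, not how Flux is implemented: the windowRows helper and the sample prices are made up for this example, and it ignores details such as clipping windows to the query's range.

```javascript
// Illustrative sketch only: this is not how Flux is implemented.
const HOUR_MS = 60 * 60 * 1000;

// Split one table (an array of rows) into one table per window of `everyMs`.
function windowRows(rows, everyMs) {
  const tables = new Map();
  for (const row of rows) {
    const time = Date.parse(row._time);
    // Align the window start to a multiple of the window size.
    const start = Math.floor(time / everyMs) * everyMs;
    if (!tables.has(start)) tables.set(start, []);
    tables.get(start).push({
      ...row,
      _start: new Date(start).toISOString(),
      _stop: new Date(start + everyMs).toISOString(),
    });
  }
  return [...tables.values()];
}

// Made-up prices for a single series (field "last").
const lastPrices = [
  { _time: "2019-01-03T18:05:00Z", _field: "last", _value: 3800 },
  { _time: "2019-01-03T18:45:00Z", _field: "last", _value: 3810 },
  { _time: "2019-01-03T19:10:00Z", _field: "last", _value: 3795 },
];

const hourlyTables = windowRows(lastPrices, HOUR_MS);
console.log(hourlyTables.map((table) => table.length));
```

The two rows from the 18:00 hour land in one table and the 19:10 row lands in another, each stamped with the bounds of its own window.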

Now look at the difference in our line graph in Chronograf.

Each color in the graph represents a distinct window of time; in our case, each color represents a one-hour window. This visualization in particular helped me to understand the Flux output. Flux returns tables of data that are based on the time windows—although we see them visualized side-by-side, data points in different colored sections of the graph live in separate tables entirely because they are separate time series. Remember, each time series in a graph gets its own unique color, and the colors highlight the start and stop times of each series.

Aggregating Data

Aggregate functions are slightly different in Flux when combined with windowing. When we apply a function like mean() to our current query, the function is applied to each 1-hour window, reducing each of the returned tables to a single row containing the mean value. The number of tables returned remains the same, but the tables only contain one row with the aggregate.

from(bucket: "crypto/autogen")
  |> range(start: -48h)
  |> filter(fn: (r) => r._measurement == "prices" and (r._field == "last" or r._field == "lowestAsk" or r._field == "highestBid"))
  |> filter(fn: (r) => r.symbol == "USDT_BTC")
  |> window(every: 1h)
  |> mean()

Our raw data view shows that each table is a single row of data for each field in each hour, which is good because that means that mean() did what we expected. The columns in our table are different: we no longer have a “_time” column because there’s no individual timestamp for this summary of data.

One area I had to research was the default value for mean(). When we don’t specify what we are averaging, it defaults to “_value”. Every Flux table includes a “_value” column, and when we perform an aggregate like mean, we transform the values listed there.
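The same idea can be sketched in JavaScript: each windowed table collapses to one row, and the column being averaged defaults to “_value”. The meanPerTable helper and the prices below are hypothetical and only mirror the behavior described above; they are not Flux internals.

```javascript
// Illustrative sketch only: reduce every (non-empty) table to a single row
// holding the mean of one column, which defaults to "_value".
function meanPerTable(tables, column = "_value") {
  return tables.map((table) => {
    const total = table.reduce((sum, row) => sum + row[column], 0);
    const { _start, _stop, _field } = table[0];
    // The summary row has no _time: it describes the whole window.
    return [{ _start, _stop, _field, [column]: total / table.length }];
  });
}

// Made-up windowed tables for the "last" field.
const windowedTables = [
  [
    { _start: "2019-01-03T18:00:00Z", _stop: "2019-01-03T19:00:00Z", _time: "2019-01-03T18:05:00Z", _field: "last", _value: 3800 },
    { _start: "2019-01-03T18:00:00Z", _stop: "2019-01-03T19:00:00Z", _time: "2019-01-03T18:45:00Z", _field: "last", _value: 3810 },
  ],
  [
    { _start: "2019-01-03T19:00:00Z", _stop: "2019-01-03T20:00:00Z", _time: "2019-01-03T19:10:00Z", _field: "last", _value: 3795 },
  ],
];

const hourlyMeans = meanPerTable(windowedTables);
console.log(hourlyMeans.map((table) => table[0]._value));
```

The number of tables is unchanged, but each one now holds a single row.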

Now that we have all of the averages we need, we want our data back in fewer tables. In Flux, we can unwindow our data to combine relevant values by using the group() function.

from(bucket: "crypto/autogen")
  |> range(start: -48h)
  |> filter(fn: (r) => r._measurement == "prices" and (r._field == "last" or r._field == "lowestAsk" or r._field == "highestBid"))
  |> filter(fn: (r) => r.symbol == "USDT_BTC")
  |> window(every: 1h)
  |> mean()
  |> group(columns: ["_time", "_start", "_stop", "_value"], mode: "except")

We’re back to 3 tables (0-2) that give us the hourly averages for the completed trade prices, the lowest asks from sellers, and the highest bids from buyers.

This call to group() transforms our data in a few important ways. Remember when I said we would get back to group keys? This is it! By grouping data, we define the group keys. When we group by “_field” and “_measurement” (the columns left when we exclude “_time”, “_start”, “_stop”, and “_value”), we define the group key as [_field, _measurement]. In the screenshot above, both “_field” and “_measurement” list true in the group row. In practice, this means that the result of our query outputs a table for every unique combination of “_field” and “_measurement”.

This process of grouping the data and defining the group keys represents unwindowing our data.
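As a rough JavaScript analogy for what a group key does, the sketch below buckets rows by the values of the key columns, so every unique combination becomes its own table. The groupRows helper and the sample rows are assumptions made for illustration; Flux's own implementation is not shown here.

```javascript
// Illustrative sketch only: put rows that share the same values for the
// group key columns into the same table.
function groupRows(rows, keyColumns) {
  const tables = new Map();
  for (const row of rows) {
    const groupKey = JSON.stringify(keyColumns.map((column) => row[column]));
    if (!tables.has(groupKey)) tables.set(groupKey, []);
    tables.get(groupKey).push(row);
  }
  return [...tables.values()];
}

// Made-up hourly means: three fields, two hours each.
const hourlyMeanRows = [
  { _measurement: "prices", _field: "highestBid", _start: "2019-01-03T18:00:00Z", _value: 3799 },
  { _measurement: "prices", _field: "last", _start: "2019-01-03T18:00:00Z", _value: 3805 },
  { _measurement: "prices", _field: "lowestAsk", _start: "2019-01-03T18:00:00Z", _value: 3811 },
  { _measurement: "prices", _field: "highestBid", _start: "2019-01-03T19:00:00Z", _value: 3790 },
  { _measurement: "prices", _field: "last", _start: "2019-01-03T19:00:00Z", _value: 3795 },
  { _measurement: "prices", _field: "lowestAsk", _start: "2019-01-03T19:00:00Z", _value: 3801 },
];

const regrouped = groupRows(hourlyMeanRows, ["_field", "_measurement"]);
console.log(regrouped.length);
```

Six single-row window tables collapse into three tables, one per field, which matches the shape we get back from group().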

A note about group(): the syntax of the group() function has changed since the community started using Flux. The original syntax of group() looks like this:

|> group(by: ["_field", "_measurement"])

Here is what the regrouped data looks like in our Chronograf visualization.

This Chronograf visualization gives us an easy way to see the trends in the Bitcoin market for the past two days. If I had more data, I could do a full historical analysis to truly understand if I missed out on my crypto-opportunity.

A Better Query

The Flux query we’ve written does exactly what I want, but it’s a little long. When I investigated further, I found a helper function that obscures some of the more confusing logic.

from(bucket: "crypto/autogen")
  |> range(start: -48h)
  |> filter(fn: (r) => r._measurement == "prices" and (r._field == "last" or r._field == "lowestAsk" or r._field == "highestBid"))
  |> filter(fn: (r) => r.symbol == "USDT_BTC")
  |> aggregateWindow(every: 1h, fn: mean)

The aggregateWindow() function includes windowing and un-windowing the data so that we don’t have to worry about the format of the data while we’re writing our query. I love a good helper function, especially when it’s clear from the name and syntax what it’s doing. The data returned looks exactly the same, and we saved ourselves a few lines.

Summary

We’ve written the Flux query that we want. Now let’s compare it to the same query written in InfluxQL.

SELECT 
mean("highestBid") AS "mean_highestBid", 
mean("last") AS "mean_last", 
mean("lowestAsk") AS "mean_lowestAsk" 
FROM 
"crypto"."autogen"."prices" 
WHERE 
time > now() - 48h 
AND 
"symbol"='USDT_BTC' 
GROUP BY 
time(1h) 
FILL(none)

This InfluxQL query is still readable, especially if you’re SQL-fluent. Given that our query isn’t too complex, the queries end up pretty similar in Flux and InfluxQL. I do particularly like using the relative time ranges in Flux because it’s more concise without losing readability.

Flux - Likes

Readability is an important part of the programming languages I choose, and Flux gets an A in this. I liked being able to parse a query, even if I’m not a Flux expert. The functional nature of Flux lets me keep transforming the returned data as much as I need, which is especially convenient given the large datasets we’re dealing with.

Also, because the technical preview is already out, I can ask other users on the community site about the problems and solutions they’ve run into.

Flux - Dislikes

There’s really only one dislike for me, and that was understanding the shape of the data being returned. It took me a while to understand how the tables in the returned data were being formed, and why they were coming back as tables. That being said, I don’t think this is a flaw in Flux, but rather in the conceptual overviews I read beforehand. Understanding the table structure is vital to writing good Flux queries, so we need to make sure there are more resources available to explain the concepts.

Conclusion

We did it! We wrote a useful Flux query that allows us to compare the asking and selling prices of Bitcoin. The good news is that I think it’s okay that I don’t have any cryptocurrency funds. Probably. Unless the value increases again. While we may not have come to any meaningful conclusions about cryptocurrency, using Flux to explore my data allowed me to easily make meaningful comparisons.

Me and my hoard of imaginary money

How Predefined Dashboards in InfluxData's Chronograf Make Metrics Simple

Katy Farmer (InfluxData) — Thu, 13 Dec 2018 09:00:24 -0700

I talk about metrics a lot. Usually, I’m muttering about the importance of monitoring while I watch YouTube clips of cats not making jumps. Now, we’re going to talk about one way we can access those metrics easily. Just as I’m an animal lover who sometimes wants to see cats fail, I’m a command line lover who sometimes needs to visualize data.

The Problem

Visualizing data is a particularly hard problem, but we don’t want to have to think about that when we use visualization tools. We want our data available as quickly and easily as possible. Given that, I was particularly excited to see the latest release of Chronograf (1.7.3), which includes improved onboarding that gave me just what I was looking for.

The Experience

I’m using MySQL as the database for a few of my Rails apps, and I want a dashboard to visit when things inevitably go wrong (after all, these apps were developed by past Katy, and she can’t be trusted). There are some specific metrics I care about regarding my database, and I don’t necessarily want to write a query when things are starting to go wrong.

Collecting Metrics with Telegraf

Metrics about the database live in the internal performance database, which for MySQL is the performance_schema database (and the more human-readable sys database). Now, we could query this database to find what we’re looking for, but we can expedite this process by using the Telegraf MySQL plugin to send those metrics straight to InfluxDB. If you’re thinking that it sounds silly to send metrics from one database to another, you’re right, but not if our plan is to keep that data long-term or build useful visualizations from it. Keep in mind that what we want is a dashboard to view when things are behaving dangerously.

Building the Dashboard

We’ve done the hard part, which was pretty easy: Telegraf is sending our MySQL metrics to InfluxDB. Now let’s do the even easier part: building the dashboard in Chronograf.

After installing Chronograf (1.7.3 or higher), we get a warm welcome, which is new as of the 1.7.x line.

Through these onboarding steps, we can configure our InfluxDB settings or leave them as the default for now.

When we reach the “Dashboards” section, we can see that there are suggested dashboards for our Source; in this case, both MySQL and System (which collects things like local CPU usage, memory, etc.) are suggested. And we didn’t have to do anything besides send those metrics in via Telegraf. So, let’s set up both! Who knows what we (okay, I) might break?

All that’s left in the onboarding is to set up Kapacitor, but we can leave the defaults in place for now and continue onward.

We’re done setting up Chronograf. We clicked, like, five times, so hopefully, we’re not so tired we can’t click just a few more times. All we have to do is visit the Dashboards tab to see our predefined dashboards in action.

Both of the dashboards we selected during setup are listed (MySQL and System), but let’s make sure they actually work.

Look at all of these beautiful metrics. We can see the number of MySQL connections, queries per second, bytes sent and received per second, and lots more. All I had to do was leverage the metrics I was already collecting with Telegraf.

Let’s check on the System dashboard.

As usual, my local machine needs to be restarted but is otherwise working like a champ.

Summary

It’s easier than ever to set up predefined dashboards in Chronograf, which I’ve been waiting for. While the feature existed before, it was a bit clunky and out of the way. Now, I can do what I’m best at: clicking through defaults. There are also predefined dashboards for metrics from Kubernetes, Redis, Apache and more so we can monitor the services that matter to us with as little work as possible. Set up the newest version of Chronograf with InfluxDB and tell me how it works for you—I’ll be watching clips of dogs missing food in slo-mo.

How Database Indexes Really Work

Katy Farmer (InfluxData) — Mon, 08 Oct 2018 09:00:06 -0700

I have a growing love of databases that leads me to ask a lot of questions about how they work, and my recent obsession is database indexes.

Previously, I knew that if I wanted a particular column or field to be faster than the rest of them, I indexed it. That was as much as my brain could handle when I was first learning to code, but growing as a developer means expanding my knowledge of the fundamentals, like what exactly a database index is and why it exists.

There are lots of ways to grow as a developer

What is a Database Index?

Imagine we’re walking through San Francisco looking for our friend Chris. We know she lives in San Francisco, but we don’t know her address. While each building has a unique address that allows us to find it easily, we need a way to tie the person we’re looking for, Chris, with her unique address. In this case, we have to knock on every door in San Francisco until we find her. That’s not very efficient (and probably not a good idea). But if we had a directory tying Chris to her address, 123 Nunya Lane, we could walk straight there.

This is the same principle as a database index. A database index is a type of data structure, like an array or a hash. It’s just one way we can organize data. In this example, we would have an index of names that pointed to addresses.
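Here is a small JavaScript sketch of the directory idea. It compares knocking on every door (a linear scan) with looking a name up in an index. The names and addresses other than Chris's are invented, and a Map is just one possible index structure; real databases often use other structures such as B-trees.

```javascript
// Illustrative sketch only: a tiny "table" of residents.
const residents = [
  { name: "Alex", address: "1 Example St" },
  { name: "Sam", address: "2 Example St" },
  { name: "Chris", address: "123 Nunya Lane" },
  { name: "Jo", address: "4 Example St" },
];

// Without an index: check every record until we find a match.
function findByScan(records, name) {
  let steps = 0;
  for (const record of records) {
    steps += 1;
    if (record.name === name) return { record, steps };
  }
  return { record: undefined, steps };
}

// With an index: a separate structure that maps a name to its record.
const nameIndex = new Map(residents.map((record) => [record.name, record]));

const scanned = findByScan(residents, "Chris");
const indexed = nameIndex.get("Chris");
console.log(scanned.steps, indexed.address);
```

Both approaches find the same address, but the scan's cost grows with the number of records, while the index lookup does not need to visit every record.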

You never know who will answer the door in SF

So why do we need a data structure inside of our database, which is, you know, a big data structure?

Why Do Databases Need Indexes?

We keep so much stuff in our databases—literally anything we think we might need later, from user credentials all the way to the latitude and longitude of the pizza being delivered to our house. Without an index, the database is stuck knocking on every door in San Francisco, or searching through every record in a linear fashion. Sometimes, this works just fine. Then again, some databases store hundreds of millions of records, so searching linearly could take ten steps or 525,600 or 300 million. We have to consider the potential number of steps. Without indexes, the database can end up with extremely slow queries as it searches each record for a match, which can then cause a buildup of waiting queries. Latency and overall response time would increase, and anyone waiting on the results of those queries either has to get a hobby, or more likely, use a different application.

What Should Be Indexed?

When deciding to add indexes to our database, we need to consider our data. Indexing every column or field can also have negative effects. If we create ten indexes, writing a single record to the database turns into 11 writes: one to the database, and one to each of the indexes (assuming that record includes all of the indexed columns/fields). As a guiding principle, we want to index the data that is looked up most frequently. The cost of writing to the index is offset by the improved performance of a significant number of our database queries.
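The write cost can be sketched with a toy table that counts its own writes. The IndexedTable class and the column names are hypothetical, and the counter is a simplification of what a real database does on disk.

```javascript
// Illustrative sketch only: a table that keeps one Map per indexed column
// and counts every write it performs.
class IndexedTable {
  constructor(indexedColumns) {
    this.rows = [];
    this.indexes = new Map(indexedColumns.map((column) => [column, new Map()]));
    this.writeCount = 0;
  }

  insert(row) {
    // One write for the record itself.
    this.rows.push(row);
    this.writeCount += 1;
    // Plus one write for every index whose column appears in the record.
    for (const [column, index] of this.indexes) {
      if (!(column in row)) continue;
      const matches = index.get(row[column]) || [];
      matches.push(row);
      index.set(row[column], matches);
      this.writeCount += 1;
    }
  }
}

const columns = Array.from({ length: 10 }, (_, i) => `col${i}`);
const table = new IndexedTable(columns);
const fullRow = Object.fromEntries(columns.map((column, i) => [column, i]));
table.insert(fullRow);
console.log(table.writeCount);
```

With ten indexes, a record that includes all ten indexed columns costs eleven writes, while a record with only one indexed column costs two.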

What is the Difference Between Relational Database Indexes and NoSQL Database Indexes?

In relational databases, indexes are created by column. We can choose any column or even a combination of columns to create our index.

NoSQL databases can have indexes, too! There is far less convention in the world of non-relational databases because of the variety of databases, but most of them have excellent docs on how to index data. I like InfluxDB’s indexing in which data inserted as a tag is indexed and data inserted as a field is not, so I don’t have to think about the index more than that if I don’t want to.

Summary

Database indexes are a fundamental part of understanding how our databases spend time and resources, and creating them encourages us to understand more about our applications and the data they produce. I feel smarter already.

Eventual Consistency: Anti-Entropy

Katy Farmer (InfluxData) — Tue, 21 Aug 2018 13:16:51 -0700

In this blog series, we’re going to explore eventual consistency, a term that can be hard to define without having all the right vocabulary. This is the consistency model used by many distributed systems, including InfluxDB Enterprise edition. There are two concepts required to understand eventual consistency: the hinted handoff queue and anti-entropy, both of which deserve special attention.

Note:

Part I of this series goes into depth on the concept of eventual consistency and why it matters in distributed computing. You can read Part I here for a refresher.

Part II

What is anti-entropy?

If you read up on the Hinted Handoff queue in Part I of this series, you already know how the Hinted Handoff queue can save data during a data node outage and help you ensure eventual consistency, but there are a lot of ways for things to go wrong in distributed systems. Despite our best efforts, there are still ways to lose data, and we want to minimize this whenever possible. Enter the second half of maintaining eventual consistency: anti-entropy (AE).

If we’re against entropy, we should know a little bit about what it is. According to the internet and my science-minded friends, entropy is defined by the second law of thermodynamics. Basically, ordered systems tend toward a higher state of entropy over time; therefore, the higher the entropy, the greater the disorder. We are against disorder in our time series data, hence anti-entropy.

Delicious Physics

Forget any intimidation factor the word itself has: anti-entropy is simply a service we can run in InfluxDB Enterprise to check for inconsistencies. We know that when we ask for information from a distributed system, the answer we receive may not be the same every time. Because of the wide variety of ways in which “drift” can be introduced, we need a hero that can identify and repair underlying data discrepancies. AE can be that hero.

Example 1

Let’s bring back our classic cluster: InfluxDB Enterprise with 2 data nodes and a database with a replication factor = 2.

The system is healthy and happy, sending data along to be stored and replicated. This is the happily ever after for our data, but sometimes we have to work to make that happen. Distributed systems change frequently, and dealing with that change is often what disturbs consistency in the first place.

One of the most common changes in a system is hardware, so let’s explore one path where the new and improved AE can make a difference. Let’s say Node 2 has some bad hardware. Maybe it’s defective or just old, but it gives up the ghost in the middle of the night (because of course it will be in the middle of the night).

Just a little defective

When Node 2 goes offline, any new writes are sent to the HHQ, where they wait for Node 2 to become available again. Reads get directed to Node 1, which has all of the same data as Node 2 (because of our RF = 2).

This is the origin story of our hero, Anti-Entropy, which was developed as a solution to all of the edge cases we could think of, and hopefully, lots that we haven’t.

In our example, our first priority is getting Node 2 back online so that it can resume its rightful place in the system reading and writing data. We can use the ‘replace-node’ command in InfluxDB Enterprise to rejoin Node 2 with its new hardware.

Here, AE checks the combination of the replication factor and shard distribution to see if all the shards that should exist are appropriately replicated. Since Node 2 has a new, fast, empty, and defect-free SSD, all the shards that exist on Node 1 are copied to Node 2 and any data waiting in the HHQ is quickly drained. Our AE hero has ensured both nodes will return the same information and the appropriate number of replicas exist. Huzzah!

Example 2

But the HHQ can’t keep holding data forever; it has some practical limits. In InfluxDB Enterprise, its maximum size defaults to 10GB, meaning that if the queue exceeds 10GB, the oldest points get dropped to make room for newer data. Alternatively, if data sits in the HHQ for too long (the default in InfluxDB Enterprise is 168 hours), it will be dropped. The HHQ is meant for temporary outages and fixes that can be quickly addressed, so it shouldn’t fill indefinitely. It addresses the most common scenarios, but the HHQ can only bear so much of the burden.
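The two limits can be pictured with a toy queue that drops its oldest entries when it grows too large or when entries get too old. This is a sketch under stated assumptions, not InfluxDB Enterprise's actual implementation: the BoundedHandoffQueue class, its parameters, and the tiny limits are all invented for illustration.

```javascript
// Illustrative sketch only: a queue with a size cap and a maximum age.
class BoundedHandoffQueue {
  constructor({ maxBytes, maxAgeMs }) {
    this.maxBytes = maxBytes;
    this.maxAgeMs = maxAgeMs;
    this.entries = []; // oldest entry first
    this.totalBytes = 0;
  }

  enqueue(point, bytes, nowMs) {
    this.entries.push({ point, bytes, enqueuedAt: nowMs });
    this.totalBytes += bytes;
    this.evict(nowMs);
  }

  // Drop the oldest entries while they are too old or the queue is too big.
  evict(nowMs) {
    while (
      this.entries.length > 0 &&
      (nowMs - this.entries[0].enqueuedAt > this.maxAgeMs ||
        this.totalBytes > this.maxBytes)
    ) {
      this.totalBytes -= this.entries.shift().bytes;
    }
  }
}

// Tiny limits so the behavior is easy to follow.
const handoff = new BoundedHandoffQueue({ maxBytes: 30, maxAgeMs: 1000 });
handoff.enqueue("a", 10, 0);
handoff.enqueue("b", 10, 100);
handoff.enqueue("c", 10, 200);
handoff.enqueue("d", 10, 300); // over the size cap, so "a" is dropped
console.log(handoff.entries.map((entry) => entry.point));
```

Either limit costs us the oldest data first, which is why a long, unnoticed outage leads to drift between nodes.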

In scenarios that have longer “failures”, there is more room for data drift between the two nodes that we want to be identical. If a node is down and goes undetected for an extended period of time, the HHQ could exceed its storage, time, concurrency, or rate limits, in which case the data it was meant to forward on vanishes into oblivion. Not ideal. Of course, there are a large number of potential edge cases; the goal of the HHQ and AE service is to provide a way to ensure eventual consistency with minimal effort from humans.

In other systems, once Node 2 disappears, it becomes the user’s responsibility to make sure that node is repaired and brought back into a consistent state, probably by manually identifying and copying data. Let’s be real: who has time for that? We have jobs to do and waffles to eat.

Or both!

Starting in InfluxDB Enterprise 1.5, AE examines each node in the cluster to see if it has all the shards the meta store says it should have. The difference is that if any shards are missing, AE automatically copies them from another node that has the data. Starting with InfluxDB Enterprise 1.6, the AE service can be instructed to review the consistency of data contained within shards across the nodes. If any inconsistencies are found, AE can then repair those inconsistencies.
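The shard-level check can be sketched as a comparison between what the meta store expects and what each node actually holds. Everything here is a simplification made for illustration: the repairMissingShards helper, the node and shard names, and the in-memory Maps are assumptions, not the AE service's real data structures.

```javascript
// Illustrative sketch only: copy any shard a node should own but does not
// have from another node that still holds a copy.
function repairMissingShards(expectedOwners, nodes) {
  const copied = [];
  const unrepairable = [];
  for (const [shardId, owners] of Object.entries(expectedOwners)) {
    // Any node that holds the shard can act as the source.
    const source = Object.keys(nodes).find((name) => nodes[name].has(shardId));
    for (const owner of owners) {
      if (nodes[owner].has(shardId)) continue;
      if (source === undefined) {
        // No copy left anywhere, so there is nothing to repair from.
        unrepairable.push({ shardId, owner });
        continue;
      }
      nodes[owner].set(shardId, [...nodes[source].get(shardId)]);
      copied.push({ shardId, from: source, to: owner });
    }
  }
  return { copied, unrepairable };
}

const clusterNodes = {
  node1: new Map([["shard1", [1, 2, 3]], ["shard2", [4, 5]]]),
  node2: new Map(), // replaced hardware with an empty disk
};
// With a replication factor of 2, both nodes should own both shards.
const metaStore = { shard1: ["node1", "node2"], shard2: ["node1", "node2"] };

const repairReport = repairMissingShards(metaStore, clusterNodes);
console.log(repairReport.copied.length, repairReport.unrepairable.length);
```

The unrepairable list shows why at least one surviving copy matters: with no source, there is nothing to copy.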

In our second example, the AE service would compare Nodes 1 and 2 against a digest built from the shards on the data nodes. It would then report that Node 2 was missing information and use that same digest to find out which information Node 2 was supposed to have. Finally, it would copy information from the good shard on Node 1 to fill in the gaps on Node 2. Bam! Eventual consistency.
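A digest comparison can be sketched as follows. This is only a rough illustration of the idea: the digest format, the FNV-1a style checksum, and the syncShards helper are assumptions, and the real AE service's digest is not described here. The sketch exchanges entries that one copy is missing and merely reports entries that exist on both sides with different values, since how those are resolved is outside what this post covers.

```javascript
// Illustrative sketch only: an FNV-1a style checksum over a string.
function checksum(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// A digest maps each entry key to a checksum of its value.
function buildDigest(shard) {
  return new Map([...shard].map(([key, value]) => [key, checksum(String(value))]));
}

// Compare two copies of a shard, exchange missing entries, report conflicts.
function syncShards(shardA, shardB) {
  const digestA = buildDigest(shardA);
  const digestB = buildDigest(shardB);
  const conflicts = [];
  for (const [key, sum] of digestA) {
    if (!digestB.has(key)) shardB.set(key, shardA.get(key));
    else if (digestB.get(key) !== sum) conflicts.push(key);
  }
  for (const key of digestB.keys()) {
    if (!digestA.has(key)) shardA.set(key, shardB.get(key));
  }
  return conflicts;
}

// Made-up shard copies: Node 2 missed two writes while it was down.
const node1Shard = new Map([["t1", 1], ["t2", 2], ["t3", 3]]);
const node2Shard = new Map([["t1", 1]]);
const shardConflicts = syncShards(node1Shard, node2Shard);
console.log(node2Shard.size, shardConflicts.length);
```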

In more basic terms, the AE service now identifies missing or inconsistent shards and repairs them. This is self-healing at its best. Instead of worrying about the current state of our cluster, we can investigate what caused the failure (in this case, we might have been sleeping or eating waffles, although it’s not always so straightforward).

There are some important things to know about AE. AE can only perform its heroism when there is at least one copy of the shard still available. In our example, we have an RF of 2, so we can rely on Node 1 for a healthy shard to copy. If Node 2 has a partial copy of that shard, those shards are compared and any missing data is then exchanged between the nodes to ensure that a consistent answer is returned.

If a user chooses to have an RF of 1, they are choosing to save on storage, but missing out on high availability and subject to a more limited query volume. It also means that AE won’t be able to make repairs because there’s no source of truth left once the data is inconsistent.

Another caveat is that AE will not compare or repair hot shards, meaning that the shard can’t have active writes. Hot shards are more prone to change, and at any given moment, arrival of new data affects AE’s digest comparison. When a shard becomes cold or inactive, the data isn’t changing, and the AE service can more accurately compare the digest.

Summary

Eventual consistency is a model that promises high availability, and if our data is available all of the time, it needs to be accurate all of the time. Like any good superhero duo, the HHQ and AE are better together, fighting crimes of data inconsistency in the background so that we can trust our data and get on with the things that matter to us (i.e., waffles).

Waffles are very important to me

Measuring Success in Game Development

Katy Farmer (InfluxData) — Tue, 31 Jul 2018 12:32:20 -0700

I’m still on my personal mission to explore and understand the value of metrics as a fundamental part of tech. We’ve entered an age of instrumentation, and I feel like I’ve been left behind. I’ve seen metrics used to monitor complex systems, analyze behavior (including that of software, hardware and humans) and track the changes in machine learning algorithms. Sensor data can tell me when there’s too much carbon monoxide in the air, or if my online store crashes on Black Friday. Metrics can help keep factory workers safe and track my grocery delivery. Those are all pretty dang impressive, and it’s easy enough to see the value when other people explain their metrics to me.

I decided that the easiest way for me to explore the value of metrics in my own work was to instrument an application that had value to me. Enter “Paper Pirates”: a browser-based, Node.js game I built that has no purpose but to be fun (find the repo here). There is no winning “Paper Pirates”—the only winning move is not to play.

I chose a web game for a few reasons: 1) I love games. They’re so much better than reality. 2) Game development is really hard. Even in a simple game like this, there are a lot of opportunities for glitches and unexplained behavior.

Games are a goldmine of metrics, and there are plenty of stats I can gather to make sure my app is running as expected. This is the heart of DevOps, but I want to explore a slightly different angle. In this article, I’m going to focus on the ways that metrics helped me tune the game performance and improve gameplay. Let’s go!

Gathering metrics from “Paper Pirates” is easy; I used the InfluxDB Node.js client. Essentially, I listen for certain events, and when they occur, I send a value of 1 to InfluxDB. In this case, I don’t need a specific value—I just need to know that the event occurred and when.

All of the interactions with InfluxDB live in index.js. Looking at the code, you’ll see I’m gathering metrics to help build leaderboards, measure missile accuracy and overall player details. For now, we’re going to focus on this block:

socket.on("enemyfire", results => {
    client.writeMeasurement("events", [
      {
        tags: { sessionID, gameID, enemyID: results.enemyID, event: "enemyMissile" },
        fields: {
          enemyFires: 1,
        }
      }
    ])
  })
});

One of the most important parts of a game is player interaction with the world, which is, quite honestly, one of the last things that occurred to me—until I played the game the first time. This is how the enemy ships fire:

export const enemiesFire = () => (dispatch, getState) => {
  const { enemies } = getState();

  _.each(enemies, enemy => {
    const roll = _.random(4000);

    if (roll <= 10) {
      Client.emit("enemyfire", { enemyID: enemy.id })
      dispatch({ type: "ENEMYFIRE", payload: enemy });
    }
  });
};

There’s a lot going on here. Essentially, we’re dealing with the probability that an enemy will fire. A random integer is chosen between 0 and 4000, and an enemy fires a missile only when that number is 10 or less. When I wrote this the first time, I had no idea what numbers would work, so I guessed. We refer to the results of this first experiment as The Void.
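The odds behind that roll can be worked out directly. The sketch below assumes, as lodash documents, that _.random(upper) returns an integer from 0 to upper inclusive. The fireProbability and expectedFiresPerSecond helpers are made up for this example, and the tick rate is a hypothetical parameter because the post doesn't say how often enemiesFire runs.

```javascript
// Illustrative sketch only: chance that a single roll makes an enemy fire.
// There are (upper + 1) possible rolls and (threshold + 1) of them fire.
function fireProbability(upper, threshold) {
  return (threshold + 1) / (upper + 1);
}

// Expected fires per second for a given number of enemies.
// ticksPerSecond is an assumption: how many times per second the roll
// happens for each enemy.
function expectedFiresPerSecond({ upper, threshold, enemies, ticksPerSecond }) {
  return fireProbability(upper, threshold) * enemies * ticksPerSecond;
}

const tunedOdds = fireProbability(4000, 10);
const firstTryOdds = fireProbability(1000, 10);
console.log(tunedOdds.toFixed(5), firstTryOdds.toFixed(5));
console.log(
  expectedFiresPerSecond({ upper: 4000, threshold: 10, enemies: 4, ticksPerSecond: 60 })
);
```

Raising the upper bound from 1000 to 4000 makes each roll roughly four times less likely to fire, which is the knob being tuned here.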

The place where you’re alone with your mistakes

The enemies were firing so often that just generating the missiles caused everything else on the canvas to lag. Then the player died immediately as a boatload of missiles slammed into them.

This led me to change the upper bound passed to _.random a few times until I found something that didn’t kill the browser. However, one of the toughest parts of game development is figuring out what feels right. Things like how responsive the controls are, how quickly the sprites react or even how precise the collision detection is all add up to a player enjoying (or not) the experience. So finding the right rate of enemy fire was important to me, even if it was hard.

Enter metrics! Revisiting the first code snippet, I marked each enemy missile as an event that increments a count being sent to my instance of InfluxDB. This allows me to track meaningful aggregates correlated with the random numbers I’ve selected in my code.

In this way, metrics drove the improvement of gameplay in “Paper Pirates” and made me a more informed developer. The first time I wrote the random firing, it looked like this:

_.each(enemies, enemy => {
    const roll = _.random(1000);

    if (roll <= 10) {
      Client.emit("enemyfire", { enemyID: enemy.id })
      dispatch({ type: "ENEMYFIRE", payload: enemy });
    }
});

Tracking the events, I determined that the reason it destroyed the entire game was that it was firing about 3,500 times in 10 seconds. “Paper Pirates” just couldn’t handle that kind of load. As I played the game and tracked the fires, I determined that about one enemy fire per second felt best to me and made it possible for the player to survive for longer than a minute.

The finest vessel on the sea

Another reason I needed aggregates is that there are multiple enemies on screen. The game is currently limited to four enemies at a time, but they all fire at slightly different rates, which, in the simplest terms, means I have no idea what’s going to happen. Gathering the number of enemy fire events eases my personal mental burden of trying to figure it out. It also allows me to perform all kinds of aggregates on it, although I’m mainly concerned with the average.
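The kind of aggregate described here can be sketched in plain JavaScript: count fire events per second, then average across the time span. In the game, this work would be done by a query against InfluxDB; the helpers and timestamps below are invented purely to show the arithmetic.

```javascript
// Illustrative sketch only: count events in each one-second bucket.
function firesPerSecond(timestampsMs) {
  const counts = new Map();
  for (const timestamp of timestampsMs) {
    const second = Math.floor(timestamp / 1000);
    counts.set(second, (counts.get(second) || 0) + 1);
  }
  return counts;
}

// Average fires per second across the whole span, including quiet seconds.
function averageFireRate(timestampsMs) {
  if (timestampsMs.length === 0) return 0;
  const seconds = timestampsMs.map((timestamp) => Math.floor(timestamp / 1000));
  const span = Math.max(...seconds) - Math.min(...seconds) + 1;
  return timestampsMs.length / span;
}

// Made-up event times in milliseconds, from all enemies combined.
const fireEvents = [0, 200, 900, 1500, 3100];
const perSecond = firesPerSecond(fireEvents);
const averageRate = averageFireRate(fireEvents);
console.log([...perSecond.entries()], averageRate);
```

Because the events from every enemy go into the same count, we get one number to tune against without reasoning about each enemy's rolls.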

Understanding the value of metrics isn’t going to be the same for everyone. Yes, I want to know that “Paper Pirates” is up and running and metrics and events can help with that. But I also want to know that when it is up and running, it’s also the best it can be.

Being able to fine-tune gameplay is one part of creating a game that people actually want to play, and measuring how that gameplay changes over time as my code changes is a way that I can become a better game developer.