InfluxData Blog - Dave Patton

Chronograf Dashboard Definitions

Dave Patton (InfluxData) — Tue, 07 Nov 2017 04:00:11 -0700

If you have used Chronograf, you have seen how easy it is to create graphs and dashboards. And in fact your colleagues have probably come over to your laptop to marvel at the awesomeness of your dashboards and asked you how they too can share in the awesomeness of your Chronograf dashboard. And maybe your answer was “wow I don’t know how to share my awesome dashboard with you.” Well, worry no more. In this article, I am going to show you how to download your dashboard and how others can upload your dashboard to their instance of Chronograf.

In talking to customers, a common question I get is what sort of things should we be looking at when monitoring our InfluxDB Enterprise cluster? Our fantastic support team probably hears that question more than they care to. So they have created a list of common queries that will help monitor and troubleshoot your cluster. I have listed those queries at the bottom of this article. In addition to the queries, it would be great to have a dashboard that was always running these queries.

So I have created my dashboard, I’m happy and it looks great:

So how do you get a copy of my dashboard? Well, Chronograf has a great REST API. If you want to take a look at what’s available, just go to:

http://[chronoserver]:8888/docs

In order to get the dashboard, there are a few steps to follow:

First, find the id of the dashboard.

To do this, you will have to list out all the dashboards and find the id of the dashboard in question. (I know this is not ideal, and we are looking to make this easier.) To do that, make a GET request to:

http://[chronoserver]:8888/chronograf/v1/dashboards

This will return a JSON object whose “dashboards” array holds all the dashboard definitions you have:

{
    "dashboards": [
        {
            "id": 2,
            "cells": [ … cell definitions …],
            "templates": [],
            "name": "My Awesome Dashboard",
            "links": {
                "self": "/chronograf/v1/dashboards/2",
                "cells": "/chronograf/v1/dashboards/2/cells",
                "templates": "/chronograf/v1/dashboards/2/templates"
            }
        },
        {
            "id": 3,
            "cells": [ … cell definitions … ],
            "templates": [],
            "name": "InfluxDB Monitor",
            "links": {
                "self": "/chronograf/v1/dashboards/3",
                "cells": "/chronograf/v1/dashboards/3/cells",
                "templates": "/chronograf/v1/dashboards/3/templates"
            }
        }
    ]
}
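Scanning that list by eye gets tedious once you have more than a handful of dashboards. Here is a minimal Python sketch that picks out the id by name. It assumes the response has the shape shown above (a top-level “dashboards” array whose entries carry “id” and “name”); the function name find_dashboard_id and the trimmed sample payload are mine, not part of Chronograf.

```python
import json


def find_dashboard_id(payload, name):
    """Return the id of the first dashboard called `name`, or None.

    Hypothetical helper: `payload` is the JSON text returned by the
    dashboards listing, assumed to look like the output shown above.
    """
    doc = json.loads(payload)
    for dashboard in doc.get("dashboards", []):
        if dashboard.get("name") == name:
            return dashboard.get("id")
    return None


# Trimmed-down sample shaped like the list response (cell definitions omitted)
sample = """
{"dashboards": [
  {"id": 2, "name": "My Awesome Dashboard", "cells": [], "templates": []},
  {"id": 3, "name": "InfluxDB Monitor", "cells": [], "templates": []}
]}
"""

print(find_dashboard_id(sample, "My Awesome Dashboard"))  # prints 2
```

In practice you would feed it the body of the GET request instead of the inline sample.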

In this case, I can see that “My Awesome Dashboard” has the id of ‘2’. Now let’s get the dashboard. You could select and copy it from the above output, but I have found that to be error-prone, and we don’t want to spend time debugging our JSON for a missing curly brace. The JSON for our dashboard in this case is available at:

http://[chronoserver]:8888/chronograf/v1/dashboards/2

You can either paste the URL into a browser and save the JSON, or fetch it from the command line:

$ curl -X GET http://localhost:8888/chronograf/v1/dashboards/2 > MyAwesomeDashboard.json
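Before passing the file around, it can be worth checking that it really is a single, well-formed JSON document (stray HTTP headers or a missing brace will make the upload fail). A rough Python sketch follows; the check_dashboard_json helper is hypothetical, and the keys it looks for (“name” and “cells”) are simply the ones visible in the listing earlier.

```python
import json


def check_dashboard_json(text):
    """Return a list of problems found in an exported dashboard; empty means OK.

    Hypothetical helper: it only checks that the text parses as JSON and that
    the keys seen in the listing output ("name", "cells") are present.
    """
    try:
        doc = json.loads(text)
    except ValueError as err:
        return ["not valid JSON: %s" % err]
    if not isinstance(doc, dict):
        return ["expected a JSON object at the top level"]
    return ["missing key: %s" % key for key in ("name", "cells") if key not in doc]
```

You would call it with the file contents, for example check_dashboard_json(open("MyAwesomeDashboard.json").read()), and only send the file if the returned list is empty.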

Now send that file to your buddy. When they get it, they can upload it to their Chronograf server with the following from the command line:

$ curl -i -X POST -H "Content-Type: application/json" \
http://[chronoserver]:8888/chronograf/v1/dashboards \
-d @/path/to/MyAwesomeDashboard.json

And voila. Now your buddy has a copy of your dashboard to use. Simple, quick and easy. Now go write some code.

Influx Vagrant Boxes

Dave Patton (InfluxData) — Fri, 03 Nov 2017 04:00:16 -0700

One of the great things about InfluxDB and the TICK Stack as a whole is its ease of use. InfluxData provides downloads for a variety of operating systems and architectures and even an official Docker image. But what if I just want to spin up a quick TICK Stack to test something out like a TICKscript or a new Telegraf plugin I am building (hint at future blog article)? Enter Vagrant. For those of you who have never used Vagrant, it is a tool for building and managing virtual machine environments in a single workflow. It was created by HashiCorp and more can be read about it at: https://www.vagrantup.com

In order to spin up a Vagrant TICK Stack, you will need the following:

A Vagrantfile
A bootstrap script
Any files you want to test such as config files or your great new plugin.

The Vagrantfile

The Vagrantfile defines the parameters of our VM such as memory, mount points, network etc. I have provided links below to a GitHub project to fully deploy TICK on Vagrant. Although this blog post is not a tutorial on Vagrant, there are some things to note. We are using CentOS 7 as the OS for our Vagrant box. You can change this to Ubuntu if you like by changing:

config.vm.box = "centos/7"

to

config.vm.box = "ubuntu/xenial64"

Keep in mind that the bootstrap script described below uses yum and RPM packages, so it would need matching changes.

We have also defined a private network for the VM:

config.vm.network "private_network", ip: "192.168.70.101"

And a shared/synced folder, which will allow us to share files from our host to our VM. In this case, we are sharing the data folder on our local machine, which will be mounted on our VM at /vagrant:

config.vm.synced_folder "data/", "/vagrant", type: "virtualbox"

Lastly, we are allocating 4GB of memory. This can be modified by changing the following line:

vb.memory = "4096"

Bootstrap Script

Our bootstrap script is just a bash script that installs the TICK Stack and sets up our environment. One thing you will need to change is the version of each TICK Stack component you want to install. Set these at the top of the file:

TELEGRAF_VERSION=telegraf-1.3.5-1.x86_64.rpm
INFLUX_VERSION=influxdb-1.3.2.x86_64.rpm
CHRONO_VERSION=chronograf-1.3.6.1.x86_64.rpm
KAPACITOR_VERSION=kapacitor-1.3.1.x86_64.rpm

The rest of this script downloads, installs and starts each component for you. It will also move any configuration files you might have in the data dir to /etc/[component]. This way you can use custom config files. As an example for Telegraf, we have the following:

# Install Telegraf
wget -nv -O $TELEGRAF_VERSION https://dl.influxdata.com/telegraf/releases/$TELEGRAF_VERSION
yum localinstall -y $TELEGRAF_VERSION
# Use a custom telegraf.conf from the shared folder if one was provided
if [ -f /vagrant/telegraf/telegraf.conf ]; then
    echo "Found telegraf.conf.  Installing."
    mv /vagrant/telegraf/telegraf.conf /etc/telegraf
fi
systemctl start telegraf
Lastly, we are also installing NodeJS. I use Node a lot for quickly developing other tools I might need such as a Telegraf Traffic Generator or an HTTPListener. If you don’t like Node, feel free to comment this out.

Using It

Once you have your Vagrantfile and bootstrap script the way you want them, there are two ways to bring up your VM. The first is to simply use:

$ vagrant up

The second way is to use the wrapper scripts in the project:

$ ./up.sh

This will bring the box up and create an initial snapshot of the VM called “initial”. If, after you have been doing things, you want to reset your VM to its initial state, use:

$ ./restore.sh

Once the VM has started, you can start using and testing your TICK Stack. Chronograf should already be running, so open up a browser and start using that, or you can ssh into the VM by using:

$ vagrant ssh

Conclusion

As you can see, it’s pretty easy to use Vagrant to spin up a full TICK Stack on a single VM. I think once you start using this you will find it very useful for any development or testing purposes. All the scripts and files I mention in this article are available on GitHub at: https://github.com/dp1140a/InfluxSandbox

Now start coding.

TICKscript Templates

Dave Patton (InfluxData) — Wed, 01 Nov 2017 04:00:03 -0700

Kapacitor is an integral piece of the InfluxData platform, and in fact as the platform continues to develop, we are looking to Kapacitor to do more and more in terms of data processing and workloads. At the core of these workloads are alerts and downsampling or aggregations. Customers often ask me what the best way is to create TICKscripts for these two types of workloads. So I thought I would take a moment here and provide you with some TICKscript templates you can use as starting points for an alert and a downsample.

Alerts

Alerts are at the core of what Kapacitor does and are the most common type of workload for Kapacitor in the wild. I’m going to let you in on a secret for creating great alert scripts: Use Chronograf to stub them out. Chronograf has the ability to visually create alerts.

In the image above, I have created an alert on CPU usage and set a threshold of 80%. When the alert triggers, I want it to make an HTTP POST. Pretty simple alert.

Once you have created your alert, you can view and copy the actual TICKscript. Go back to the Alert Rules page and select the “Edit TICKscript” button for your alert.

Prior versions will just show the script contents, but if you are using the latest version of Chronograf (at least 1.3.10) this will open up the new TICKscript editor, which you can use to start writing more TICK code.

Downsamples

The second most common workload for Kapacitor is using it to downsample or aggregate data. In the Kapacitor docs, there is a guide for using Kapacitor as a Continuous Query Engine, but it’s not exactly clear that this is the same thing as aggregation and downsampling. So let’s discuss it here as well. In general, our recommendation is to run aggregation or downsampling TICKscripts as batch scripts. The script snippet below will run the query every five minutes and grab the last five minutes’ worth of CPU data from our Telegraf database. It will then take the mean of that data and store it back into the DB. Assuming that Telegraf is reporting every 30 seconds, this means we are downsampling from 30-second resolution to 5-minute resolution data:

batch
   |query('SELECT mean(usage_idle) as usage_idle FROM "telegraf"."autogen".cpu')
      .period(5m)
      .every(5m)
      .groupBy(*)
   |influxDBOut()
      .database('telegraf')
      .retentionPolicy('autogen')
      .measurement('cpu_idle_5minute')
      .precision('s')

What if I wanted to downsample from 30 second to, say, an hour? In that case, your script should look like:

batch
   |query('SELECT mean(usage_idle) as usage_idle FROM "telegraf"."autogen".cpu')
      .period(1h)
      .every(1h)
      .groupBy(*)
   |influxDBOut()
      .database('telegraf')
      .retentionPolicy('autogen')
      .measurement('cpu_idle_1hour')
      .precision('s')

You can see that the general rule for downsampling is that the period() should equal every().
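To see why that rule matters, here is a small Python sketch of a simplified model of batch scheduling: run n fires at n * every and looks back period seconds. The function names are made up for illustration and this is not Kapacitor code. It shows that windows line up end to end only when the two values match.

```python
def batch_windows(period, every, runs):
    """Return the (start, stop) range in seconds covered by each of `runs` queries.

    Simplified model (an assumption, not Kapacitor internals): run n fires
    at n * every and looks back `period` seconds.
    """
    return [(n * every - period, n * every) for n in range(1, runs + 1)]


def coverage(period, every, runs=3):
    """Classify how consecutive windows relate: contiguous, gaps, or overlap."""
    wins = batch_windows(period, every, runs)
    # Compare each window's start with the previous window's stop
    diffs = [cur[0] - prev[1] for prev, cur in zip(wins, wins[1:])]
    if all(d == 0 for d in diffs):
        return "contiguous"
    return "gaps" if diffs[0] > 0 else "overlap"


def points_per_window(period, report_interval):
    """How many raw points fall into one window, per series."""
    return period // report_interval


print(coverage(300, 300))          # period == every
print(coverage(300, 600))          # every is longer than period
print(coverage(600, 300))          # period is longer than every
print(points_per_window(300, 30))  # 30-second data rolled up to 5 minutes
```

With period shorter than every, some raw data is never aggregated; with period longer than every, the same raw points are counted in more than one output point.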

Conclusion

We have now given you some templates to use for creating TICKscripts for the two most common Kapacitor use cases: alerts and downsampling. Of course, there is a lot more you can do with Kapacitor. As a best practice, when developing a lot of TICKscripts, I would highly recommend that you set up a GitHub repo and/or a wiki where folks can get these templates and use them as a guide for different scripts. It’s also helpful for dashboard definitions, but that is another article.

Now go write some code.

Multiple Data Center Replication with InfluxDB

Dave Patton (InfluxData) — Fri, 29 Sep 2017 04:00:06 -0700

Introduction

Disaster Recovery and multi-datacenter replication of InfluxDB and Kapacitor are two frequently asked-about topics. In this post, I cover some of the suggested patterns for accomplishing this. Additionally, I will discuss the pros and cons of each approach and how they can be combined.

In general, there are two patterns that can be used for multi-datacenter replication of data into InfluxDB. The first is to replicate data upon ingest into InfluxDB to the second datacenter cluster. The second pattern is to replicate data from one cluster to another cluster on the backend.

Replication on Ingest

The first set of patterns to discuss is replication of data upon ingest to InfluxDB. This is probably the easiest to set up and the most useful for all new data that is coming into our cluster from external sources. Most of these patterns rely in some form on Telegraf.

Telegraf is the Swiss Army Knife of the TICK stack and can be used as an agent, a data aggregator or to help set up data ingest pipelines. Telegraf uses input and output plugins. There are more than 100 plugins as of the writing of this post. For a full discussion of Telegraf, please refer to the Telegraf docs.

We recommend the use of Telegraf in almost all deployments of Influx. At the very least, Telegraf can be used to batch all write requests to the database, which is something you should always be sure to do.

Telegraf

Figure 1. Telegraf Replication

The first pattern to discuss is using Telegraf by itself to replicate all data upon ingest to both clusters. To set this up, we specify the URL of both clusters in the Telegraf config file. We are going to make use of the InfluxDB Telegraf output plugin. In the outputs.influxdb section, we would set the following:

## Cluster 1
[[outputs.influxdb]]
  urls = ["http://cluster1:8086"] # URL of the cluster load balancer
  database = "telegraf" # Name of the DB you want to write to
  retention_policy = "myRetentionPolicy"
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"
  timeout = "5s"
  content_encoding = "gzip"

## Cluster 2
[[outputs.influxdb]]
  urls = ["http://cluster2:8086"] # URL of the cluster load balancer
  database = "telegraf" # Name of the DB you want to write to
  retention_policy = "myRetentionPolicy"
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"
  timeout = "5s"
  content_encoding = "gzip"

This config specifies that we want to write to two separate InfluxDB clusters. One thing to note is that the URL you specify should either be the URL of a load balancer in front of the cluster or a list of the URLs of each data node in the cluster. In the latter case, our config would look like this:

## Cluster 1
[[outputs.influxdb]]
  urls = ["http://Cluster1DataNode1:8086",
          "http://Cluster1DataNode2:8086"] # URLs of the cluster data nodes
  database = "telegraf" # Name of the DB you want to write to
  retention_policy = "myRetentionPolicy"
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"
  timeout = "5s"
  content_encoding = "gzip"

## Cluster 2
[[outputs.influxdb]]
  urls = ["http://Cluster2DataNode1:8086",
          "http://Cluster2DataNode2:8086"] # URLs of the cluster data nodes
  database = "telegraf" # Name of the DB you want to write to
  retention_policy = "myRetentionPolicy"
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"
  timeout = "5s"
  content_encoding = "gzip"

With a list of DataNode URLs, Telegraf will write each batch to one of the URLs in the list NOT to all of them. The DB will handle replicating each batch to the requisite datanodes.

This pattern is the easiest to set up. In the case of a network partition between Telegraf and the clusters, Telegraf will retry the failed writes. Additionally, it will store data to be written in an in-memory buffer. The size of this buffer can be configured in the agent section of the Telegraf config:

[agent]
   metric_buffer_limit = 123456789 # Buffer size, in number of metrics (per output)

When setting this value, you want to make sure that the buffer is big enough to hold data for a typical outage. A good rule of thumb is to set the buffer to hold an hour’s worth of data in case of failure. Obviously, if you are writing a lot of data, this may not be feasible, so set it accordingly. This buffer is not a durable write queue; if Telegraf fails or is shut down, the buffer is gone since it is in memory. So, what do we do if we need a durable write queue?
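As a back-of-the-envelope aid for that rule of thumb, here is a short Python sketch. It assumes the limit is expressed as a count of metrics, which is how the Telegraf documentation describes metric_buffer_limit; the function name and the example figures (1,000 metrics per flush, a 10 second flush interval) are made up for illustration.

```python
import math


def buffer_limit(metrics_per_flush, flush_interval_s, outage_s=3600):
    """Metrics that pile up while an output is unreachable for `outage_s` seconds.

    Hypothetical helper: multiplies the metrics produced per flush by the
    number of flushes that would have happened during the outage.
    """
    flushes = math.ceil(outage_s / flush_interval_s)
    return metrics_per_flush * flushes


# Example: 1,000 metrics every 10 seconds, sized for a one-hour outage
print(buffer_limit(1000, 10))  # prints 360000
```

Since the buffer lives in memory, it is worth weighing the resulting number against the RAM available to the Telegraf process.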

Kafka and Telegraf

Figure 2. Telegraf Replication with Kafka

In this pattern, I have Kafka in front of our Telegraf Instances. Kafka will provide a durable write queue for all our data as it comes into the cluster. Additionally, this will add some flexibility on what we can do with all our data coming in. For instance, we might also want to send all our data to long term storage in something like S3 or send it to other analytical platforms for other types of analysis. It is assumed you will already have Kafka installed. In this pattern, our Telegraf config will look like the following:

# Read metrics from Kafka topic(s)
[[inputs.kafka_consumer]]
  ## topic(s) to consume
  topics = ["telegraf"]
  brokers = ["kafkaBrokerHost:9092"]
  ## the name of the consumer group
  consumer_group = "telegraf_metrics_consumers"

  ## Offset (must be either "oldest" or "newest")
  offset = "oldest"

  ## Data format to consume.
  data_format = "influx"

  ## Maximum length of a message to consume, in bytes (default 0/unlimited);
  ## larger messages are dropped
  max_message_len = 65536

## Cluster 1
[[outputs.influxdb]]
  urls = ["http://clusterLB:8086"] # URL of the cluster load balancer
  database = "telegraf" # Name of the DB you want to write to
  retention_policy = "myRetentionPolicy"
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"
  timeout = "5s"
  content_encoding = "gzip"

The above example shows that we are using the Kafka Consumer Input Plugin. The topic should be whatever the correct Kafka topic is for our data. The brokers list can be one or more Kafka brokers. On each call to Kafka, Telegraf will send the request to only one of the brokers. The use of a consumer group is optional, but if you have a large volume of data to pull from Kafka, you can set up multiple Telegraf instances, each pulling from the same consumer group. This will allow you to pull more data without getting duplicate data from Kafka, as the consumer group will keep track of the topic offsets for each consumer client.
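The reason a consumer group avoids duplicates is that each topic partition is handed to exactly one consumer in the group. The Python sketch below illustrates that idea with a plain round-robin split; it is a simplification (Kafka’s real assignment strategies are more involved), and the instance names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer, round-robin.

    Simplified illustration of consumer-group behaviour, not Kafka's
    actual assignment algorithm.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Six partitions of the "telegraf" topic shared by two Telegraf instances
plan = assign_partitions(list(range(6)), ["telegraf-a", "telegraf-b"])
print(plan)
```

Every partition shows up under exactly one instance, so adding Telegraf instances spreads the load without any message being read twice within the group.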

Replication After Ingest

The second set of patterns to discuss are those where replication of data occurs after we have ingested data. There are two general buckets that these patterns can fall into. The first bucket is what I call a ‘pass-through’ where we are really just making a copy of the data we have ingested and sending it to another instance or cluster of InfluxDB. The second bucket is for derived data or data that is the output of something like a Kapacitor job, a Continuous Query, or the result set of an InfluxQL query.

Figure 3. Replication with Subscriptions

Subscriptions were originally built for sending data to Kapacitor for stream-based TICK scripts. But here’s a little-known fact about subscriptions: They can send data anywhere you want over HTTP or UDP. This makes them really handy. When you set a subscription in InfluxDB, all it does is forward all input data that matches the database.retentionpolicy you specified.

One important thing to note is that the subscriptions on each cluster should not use the same combination of database name and retention policy. If they do, each cluster forwards the other’s writes straight back, setting up an infinite loop that will eventually tear a hole in the space-time continuum, or at least crash both your clusters.

The destination for your subscription can be one or more entries. Additionally, you can specify that the subscription send it to ‘ANY’ or ‘ALL’ of those destinations. What this means is that in addition to cross-cluster replication you could also have it send data to another system for other types of analysis or to long-term storage of raw data.

-- On Cluster 1
CREATE SUBSCRIPTION "mySubscription" ON "myDB"."myRetentionPolicy" DESTINATIONS ALL 'http://cluster2LoadBalancer:port'

-- On Cluster 2
CREATE SUBSCRIPTION "mySubscription" ON "myDB"."myOtherRetentionPolicy" DESTINATIONS ALL 'http://cluster1LoadBalancer:port'
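To make the loop warning concrete, here is a toy Python model of two clusters forwarding a single write to each other. It assumes a forwarded write keeps its database and retention policy; the function name and the hop cap are invented for the illustration and nothing here talks to InfluxDB.

```python
def forward_count(sub_a, sub_b, max_hops=10):
    """Count how many times one write bounces between two clusters.

    sub_a and sub_b are the (database, retention_policy) pairs watched by
    the subscription on cluster A and cluster B. Toy model only: the write
    first lands on cluster A in the pair that A's subscription watches.
    """
    subs = [sub_a, sub_b]
    target = sub_a
    cluster = 0
    hops = 0
    while hops < max_hops and subs[cluster] == target:
        hops += 1              # this cluster forwards the write to the other one
        cluster = 1 - cluster
    return hops


# Different retention policies: the write is forwarded once and stops
print(forward_count(("myDB", "myRetentionPolicy"), ("myDB", "myOtherRetentionPolicy")))
# Same pair on both sides: the write keeps bouncing until the cap is hit
print(forward_count(("myDB", "myRetentionPolicy"), ("myDB", "myRetentionPolicy")))
```

The cap only exists so the example terminates; real clusters have no such limit.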

Kapacitor

Figure 4: Replication with Kapacitor

Another take on backend replication that is very similar to subscriptions is to use Kapacitor. This pattern would be more applicable for data that is the output of a TICK script where we are either creating new data, decorating or transforming existing data. For example, this is the pattern to use for aggregations or roll-ups of data from one cluster to another. Let’s say I had the following TICK script that was performing a batch rollup of data to a 5 minute resolution:

var data = batch
    |query('SELECT median(usage_idle) as usage_idle FROM "telegraf"."autogen"."cpu"')
        .period(5m)
        .every(5m)
        .groupBy(*)

We assign this to the variable data in our script. Once that is set, we branch the output to both clusters with the following:

data
    |influxDBOut()
        .cluster('localCluster')
        .database('telegraf')
        .retentionPolicy('rollup_5m')
        .measurement('median_cpu_idle')
        .precision('s')

data
    |influxDBOut()
        .cluster('remoteCluster')
        .database('telegraf')
        .retentionPolicy('rollup_5m')
        .measurement('median_cpu_idle')
        .precision('s')

Notice the cluster name? That is not the URL of the cluster; rather, it is the name given to that cluster in our kapacitor.conf file. More details on the kapacitor.conf file can be found in the Kapacitor documentation.

# Multiple InfluxDB configurations can be defined.
# Exactly one must be marked as the default.
# Each one will be given a name and can be referenced 
# in batch queries and InfluxDBOut nodes.
[[influxdb]]
# Connect to an InfluxDB cluster
# Kapacitor can subscribe, query and write to this cluster.
# Using InfluxDB is not required and can be disabled.
enabled = true
default = true
name = "localCluster"
urls = ["http://cluster1LoadBalancer:8086"]
username = ""
password = ""
timeout = 0
[[influxdb]]
# Connect to an InfluxDB cluster
# Kapacitor can subscribe, query and write to this cluster.
# Using InfluxDB is not required and can be disabled.
enabled = true
default = false
name = "remoteCluster"
urls = ["http://cluster2LoadBalancer:8086"]
username = ""
password = ""
timeout = 0
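Because the script refers to clusters by name, a typo or a difference in capitalization between the TICKscript and kapacitor.conf is easy to miss. The Python sketch below cross-checks the two with regular expressions. It assumes names have to match exactly, the helper name unknown_clusters is hypothetical, and the sample strings are cut-down stand-ins for the real files.

```python
import re


def unknown_clusters(tickscript, conf):
    """Return cluster names used in the TICKscript but not defined in the conf.

    Hypothetical helper based on simple pattern matching, not a real
    TICKscript or TOML parser.
    """
    used = set(re.findall(r"\.cluster\('([^']+)'\)", tickscript))
    defined = set(re.findall(r'^\s*name\s*=\s*"([^"]+)"', conf, flags=re.MULTILINE))
    return sorted(used - defined)


script = """
data
    |influxDBOut()
        .cluster('localCluster')
data
    |influxDBOut()
        .cluster('remoteCluster')
"""

conf = """
[[influxdb]]
name = "localcluster"
username = ""
[[influxdb]]
name = "remoteCluster"
username = ""
"""

print(unknown_clusters(script, conf))  # prints ['localCluster']
```

An empty list means every cluster named in the script has a matching [[influxdb]] section.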

Backup and Restore

The last pattern to discuss is the good old backup and restore function. It should be thought of as sort of a “sneaker-net” option as it does not operate in near real time. As you may have guessed, this uses the backup and restore commands of InfluxDB Enterprise. Currently InfluxDB Enterprise supports backup of all the data in your cluster, a single database and retention policy or a single shard.

The backup command creates a full copy of both the data and the metastore, creating a sort of snapshot of when the backup was taken. There are two types of backups that can be taken: a full backup or an incremental backup. When using this pattern, you should probably automate it with a cron job that runs a backup script every hour or day. That script might look something like this:

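#!/bin/bash

BKUPDIR="[path to backup dir]"
BKUPFILE="Backup_$(date +"%Y%m%d%H%M").tar.gz"
# Incremental is the default backup mode, so no extra flag is needed
influxd-ctl -bind [metahost]:8091 backup -db [db-name] "$BKUPDIR"
# Archive the contents of the backup directory, then copy the archive to the second cluster
tar -cvzf "$BKUPFILE" -C "$BKUPDIR" .
scp -i credentialFile "$BKUPFILE" root@cluster2DataNode:path/to/save/file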

The first time you use the backup command, you should probably do a full backup. That said, incremental backups are the default, and if there are no existing incremental backups, the system will do a full backup first.

On the destination cluster, you might want to run a script in the background that will watch your drop folder and perform the restore when the backup file is sent over. That script might look something like this:

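#!/bin/bash

MONITORDIR="/path/to/the/dir/to/monitor/"
# close_write fires once a copied file has been completely written
inotifywait -m -r -e close_write --format '%w%f' "${MONITORDIR}" | while read -r NEWFILE
do
    # influxd-ctl restore expects a backup directory, so unpack the archive first
    # (assumes the archive holds the backup files at its top level)
    RESTOREDIR=$(mktemp -d)
    tar -xzf "${NEWFILE}" -C "${RESTOREDIR}"
    influxd-ctl restore "${RESTOREDIR}"
done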

The Nuclear Option

Although this is not a proper cluster-to-cluster replication pattern, I feel it is still worth some discussion. What is the “nuclear option”? This is the last resort option if the whole world, or rather your clusters and all your data, have gone really bad. I should admit that I am a fan of keeping raw data for as long as possible. Doing this gives us more flexibility in our systems and how we might want to analyze things at a future point in time. Implementing this should be done with one of the replication on ingest patterns, where we would tap off our input stream and store all our raw data in long-term storage like S3 or Glacier. Doing this gives us the ability to rebuild our system from source.

Hence the nuclear option. If we have all our source data and need to rebuild our cluster from scratch, including all derivative data, we can now do this by replaying our source data. Obviously, there are still some considerations around exactly how much source data you have and the time needed to replay it, but hopefully you see the benefits.
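To give a feel for what a replay involves, here is a rough Python sketch that splits stored raw data into write-sized batches. It assumes the raw data was archived as InfluxDB line protocol, one point per line; the helper name line_protocol_batches and the default batch size of 5,000 are illustrative choices, not something prescribed by this post.

```python
def line_protocol_batches(lines, batch_size=5000):
    """Yield newline-joined request bodies of at most `batch_size` points each.

    Hypothetical helper: `lines` is any iterable of line-protocol strings,
    for example an open file from long-term storage.
    """
    batch = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        batch.append(line)
        if len(batch) == batch_size:
            yield "\n".join(batch)
            batch = []
    if batch:
        yield "\n".join(batch)


# Example with three archived points and a batch size of 2
raw = [
    "cpu,host=a usage_idle=91.2 1509530400000000000",
    "cpu,host=a usage_idle=90.7 1509530430000000000",
    "cpu,host=a usage_idle=92.1 1509530460000000000",
]
bodies = list(line_protocol_batches(raw, batch_size=2))
print(len(bodies))  # prints 2
```

Each body would then be written to the rebuilt cluster, or pushed back through whichever ingest pattern you were using, so that derived data gets regenerated along the way.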

Conclusion

In this post, I have discussed several patterns for replicating data between two different clusters of InfluxDB Enterprise. These are only the most basic patterns, and I think there are probably at least 20 more that exist. I wanted to present a few patterns that you could use as a starting point. The point of this post was not to present an exhaustive list or to say that this method or that method is the best one to use. In reality, there is no single best method or pattern. The best pattern for you is the one that meets your business objectives and fits within your organization’s infrastructure, processes and practices. Now go forth and replicate.