InfluxData Blog - Margo Schaedel

Visualizing Time Series Data with Dygraphs

Margo Schaedel (InfluxData) — Wed, 03 Oct 2018 09:00:49 -0700

Overview

This post walks through how to visualize dynamically updating time series data stored in InfluxDB (a time series database) using Dygraphs, a JavaScript graphing library. If you prefer a different visualization library, check out our other graphing integration posts, which cover plotly.js, Rickshaw, and Highcharts. You can also build out a dashboard in our very own Chronograf, which is designed exclusively for InfluxDB.

Prep and Setup

To begin with, we’ll need some sample data to display on screen. For this example, I’ll be using the data generated from a separate tutorial written by DevRel Anais Dotis-Georgiou on using the Telegraf exec or tail plugins to collect Bitcoin price and volume data and see it trend over time. I’ll then query for the data in InfluxDB periodically using the HTTP API on the frontend. Let’s get started!

Depending on whether you want to pull in Dygraphs as a script file into your index.html file or import the npm module, you can find all the relevant instructions here. I added several script tags into my index.html file for ease of reference in this case:

<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8">
    <title>Dygraphs Sample</title>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/dygraph/2.1.0/dygraph.min.css" />
    <link rel="stylesheet" type="text/css" href="styles.css">
  </head>
  <body>
    <div id="div_g"></div>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/dygraph/2.1.0/dygraph.min.js"></script>
    <script type="text/javascript" src="script.js"></script>
  </body>
</html>

Querying InfluxDB

Ensure your local instance of InfluxDB is running (you can set up all the components of the TICK Stack locally or spin up the stack the sandbox way). Then confirm that Telegraf is collecting Bitcoin stats by running SELECT "price" FROM "exec"."autogen"."price" WHERE time > now() - 12h in your Influx shell, which you can open with the influx command. With time series data, you always want to scope your queries, so rather than running an unbounded SELECT * query, we limit our results here by selecting only the price field and restricting the time range to the last 12 hours.

You should receive at least one result when running this query, depending on how long your Telegraf instance has been running and collecting stats via one of the plugins from the tutorial. Alternatively, you can navigate to your local Chronograf instance and verify that you’re successfully collecting data via the Data Explorer page, which has an automatic query builder.
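The same scoping advice carries over to the HTTP API used in the next section. Here is a minimal sketch, not part of the original sample code, of how a time-bounded query URL might be assembled. The helper name buildQueryUrl and its parameters are hypothetical; the /query endpoint and the db and q parameters are the ones this post uses.

```javascript
// Hypothetical helper: builds a time-bounded InfluxDB /query URL.
// The function name and parameter names are illustrative assumptions.
const buildQueryUrl = (baseUrl, db, measurement, field, window) => {
  // Scope the query by field and by time instead of running SELECT *.
  const query = `SELECT "${field}" FROM "${measurement}" WHERE time > now() - ${window}`;
  // encodeURIComponent escapes the spaces, quotes, and ">" for use in a URL.
  return `${baseUrl}/query?db=${encodeURIComponent(db)}&q=${encodeURIComponent(query)}`;
};

const url = buildQueryUrl('http://localhost:8086', 'exec', 'price', 'price', '12h');
console.log(url);
```

Encoding the query this way avoids relying on the browser to tolerate raw quotes and spaces in the URL.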

Fetching the Data from InfluxDB

In your script file, you’ll want to fetch the data from InfluxDB using the HTTP API, like so:

const fetchData = () => {
  // URL-encode the InfluxQL query so the quotes and spaces are sent safely.
  const query = encodeURIComponent('SELECT "price" FROM "price"');
  return fetch(`http://localhost:8086/query?db=exec&q=${query}`)
    .then( response => {
      if (!response.ok) {
        // Stop here rather than trying to parse an error response as data.
        throw new Error(`InfluxDB responded with status ${response.status}`);
      }
      return response.json();
    })
    .then( parsedResponse => {
      // Each value is [timestamp, price]; Dygraphs expects [Date, number] rows.
      return parsedResponse.results[0].series[0].values.map(
        elem => [new Date(elem[0]), elem[1]]
      );
    })
    .catch( error => console.log(error) );
}
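To see the reshaping step in isolation, here is a small sketch that runs without a database. The helper name toDygraphRows and the sample payload are made up for illustration; the payload only mimics the results/series/values nesting that fetchData reads, and the prices in it are not real data.

```javascript
// Hypothetical helper: converts a parsed /query response into the
// [Date, value] rows that the graph consumes.
const toDygraphRows = (parsedResponse) => {
  const result = (parsedResponse.results || [])[0] || {};
  const series = (result.series || [])[0];
  // Assumption: a query that matches no points comes back without a
  // "series" key, so fall back to an empty array instead of throwing.
  if (!series) {
    return [];
  }
  return series.values.map(([timestamp, value]) => [new Date(timestamp), value]);
};

// Made-up sample payload, for illustration only.
const sample = {
  results: [{
    statement_id: 0,
    series: [{
      name: 'price',
      columns: ['time', 'price'],
      values: [
        ['2018-10-01T00:00:00Z', 6500.5],
        ['2018-10-01T00:05:00Z', 6510.25],
      ],
    }],
  }],
};

const rows = toDygraphRows(sample);
console.log(rows.length, rows[0][0].toISOString(), rows[0][1]);
```

Keeping the transformation in a pure function like this makes it easy to check the row format before wiring it up to a live fetch.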

Constructing the Graph

We can construct the graph using the Dygraphs constructor function as follows:

const drawGraph = () => {
  let g;
  // fetchData already returns a promise, so we can chain on it directly.
  fetchData()
    .then( data => {
      g = new Dygraph(
        document.getElementById("div_g"),
        data,
        {
          drawPoints: true,
          title: 'Bitcoin Pricing',
          titleHeight: 32,
          ylabel: 'Price (USD)',
          xlabel: 'Date',
          strokeWidth: 1.5,
          labels: ['Date', 'Price'],
        });
    });

  // Refresh the graph every five minutes (300,000 ms).
  window.setInterval( () => {
    fetchData()
      .then( data => {
        // Skip the update if the graph isn't ready yet or the fetch failed.
        if (g && data) {
          g.updateOptions( { 'file': data } );
        }
      });
  }, 300000);
}

What’s happening in the drawGraph function is that after fetching the data from InfluxDB, we create a new Dygraph, by targeting the element within which to render the graph, add the data array, and add in our options object as the third argument. In order to dynamically update the graph over time, we add a setInterval method to fetch new data every five minutes (unfortunately, any calls more often than that require a paid subscription to the Alpha Vantage API for Bitcoin pricing) and use the updateOptions method to bring in new data.

Summary

If you’ve made it this far, I applaud you. Feel free to check out the source code for a little side-by-side comparison. Additionally, Dygraphs has a gallery of demos available if you want to experiment with a myriad of styles. We want to hear all about your creations! Look for us on Twitter: @mschae16 or @influxDB.

Metrics to Monitor in Your PostgreSQL Database

Margo Schaedel (InfluxData) — Thu, 16 Aug 2018 14:45:41 -0700

Overview

Last month I wrote a guide on how to monitor your PostgreSQL database using Telegraf and InfluxDB, and though I was able to cover a walkthrough of how to monitor PostgreSQL, I didn’t have a chance to cover what exactly you should be looking at when tracking the health of your database. There are several key metrics you’ll definitely want to keep track of when it comes to database performance, and they’re not all database-specific. For example, this blog post on MySQL database metrics gives a great introduction and overview to get you started in the monitoring scene.

PostgreSQL’s statistics collector automatically gathers a substantial number of statistics about its own activity. In the previous post we saw that the Telegraf plugin for PostgreSQL pulls data from two of these built-in views: pg_stat_database and pg_stat_bgwriter. If you want to pull in data from additional views, you should definitely check out this extended Telegraf plugin. In this post, we’ll take a more thorough look at the significance of these stats as an indicator of your PostgreSQL database health.

The pg_stat_database View

The pg_stat_database view records information concerning each database within a given cluster, including the database id (datid); number of backends actively connected to the database (numbackends); commits and rollbacks; disk blocks read and shared buffer cache hits; rows fetched, inserted, updated, and deleted; conflicts and deadlocks; temporary files created; and duration times spent reading and writing data.

The pg_stat_bgwriter View

The pg_stat_bgwriter view supplies information about the checkpoint process in order to determine how much load is being placed on the database while it’s updating or replicating files. The variables cover the number of total checkpoints occurring across all databases in the cluster—both scheduled and requested checkpoints—in addition to the amount of time spent in checkpoint processing. The buffers_checkpoint, buffers_clean, and buffers_backend indicate how the buffers were written to disk.

The Basics - Resource Utilization

In order for anything to be written, updated, and queried within PostgreSQL, the database needs to have adequate resources with which to achieve these tasks successfully. PostgreSQL, like other databases out there, relies heavily on various system resources such as CPU, network bandwidth, disk space/disk utilization, and RAM. Therefore, having insight into these system metrics and others like disk IOPS, swap space, and network errors can generally provide a good indication of the health of your overall database.

A few other metrics you may want to keep tabs on that PostgreSQL collects information on include connections, shared buffer usage, and disk usage. Tracking variables like numbackends in relation to max_connections (the pg_settings view) can draw attention to possible issues with slower queries and applications having to create new connections in order to carry out requests rather than using already active connections. You would rather keep a small pool of connections alive than have to constantly start up new ones and terminate idle ones.
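To make the connection check concrete, here is a minimal sketch that compares numbackends against max_connections. The function name, the input shape, and the sample numbers are all made up for illustration; in practice you would feed it values collected by Telegraf.

```javascript
// Hypothetical helper: fraction of max_connections currently in use.
// numbackendsPerDatabase holds one numbackends value per database.
const connectionUtilization = (numbackendsPerDatabase, maxConnections) => {
  const total = numbackendsPerDatabase.reduce((sum, n) => sum + n, 0);
  return total / maxConnections;
};

// Made-up sample: three databases with 12, 30, and 3 active backends.
const utilization = connectionUtilization([12, 30, 3], 100);
console.log(`Connections in use: ${(utilization * 100).toFixed(1)}%`);
```

A value that creeps toward 1 is a hint to look at slow queries or connection pooling before the server starts refusing new connections.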

Keeping an eye on shared buffer usage can be significant for reading or updating data. The shared buffer cache is where PostgreSQL will check first when executing a request, and if the block is not found there, it will then need to grab the data from disk, after which the data will be cached in the database’s shared buffer cache and possibly the OS cache. This allows for subsequent querying of that data without needing to access it on disk. However, the downside to this is that some data could end up cached in several places at once. Keep an eye on blks_hit and blks_read, which represent shared buffer hits and blocks read from disk, but also keep in mind that data sometimes gets saved in the OS cache, which PostgreSQL doesn’t report on.
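One common way to read blks_hit and blks_read together is as a hit ratio. The sketch below is an illustration rather than anything the Telegraf plugin computes for you; the function name and sample numbers are hypothetical.

```javascript
// Hypothetical helper: share of block requests served from the shared
// buffer cache. Reads served from the OS cache still count as blks_read,
// so this can understate how often data actually came from memory.
const cacheHitRatio = (blksHit, blksRead) => {
  const total = blksHit + blksRead;
  // Avoid dividing by zero on a database with no activity yet.
  return total === 0 ? 0 : blksHit / total;
};

// Made-up sample values.
const hitRatio = cacheHitRatio(9900, 100);
console.log(`Shared buffer hit ratio: ${(hitRatio * 100).toFixed(1)}%`);
```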

Lastly, gathering information about the database’s disk usage (see pg_table_size or pg_indexes_size) can help to illuminate possible problems with query performance. There is a direct relationship between the two—as tables and indexes increase in size, queries will inevitably take longer, resulting in a need to allocate more disk space. A sudden rise in table or index size can also hint at problems with the VACUUM process (the process of cleaning up and removing dead rows—read more on that below).

Read/Write Throughput

Monitoring read and write query throughput helps to ascertain that your applications are able to both add data to the database and access it as well. Issues arising in this area can often lead to problems in other parts of the database, especially with regards to replication and reliability. In order to ensure availability, it’s not a bad idea to keep an eye on your reads and writes.

Take a look at tup_returned, the number of rows read or scanned versus tup_fetched, the number of rows fetched that contained data necessary to execute the query successfully. These two variables should consistently stay pretty close in number, which would point to the database carrying out read queries efficiently, since it wouldn’t be scanning through extra rows to satisfy the query requirements. Additionally, you may want to track temp_files and temp_bytes, since PostgreSQL sometimes has to write data temporarily to disk in order to successfully execute various queries (if there is not enough memory available). High numbers in this area indicate a potentially increasing number of resource-draining queries.

You’ll also want to make sure your write performance is up to snuff, so keeping tabs on tup_inserted, tup_updated, and tup_deleted is crucial. High rates of updated and deleted rows could lead to a higher number of dead rows (n_dead_tup in the pg_stat_user_tables view), which is another metric to keep tabs on. Having a huge number of dead rows (rows that have already been deleted and are waiting to be cleaned out) indicates something may be wrong with the clean-up process—in PostgreSQL, this process is known as the VACUUM process. Essentially, its job is to remove dead rows from tables and indexes in order to make the space available for new row insertions. As a side note, the VACUUM process should be run on a routine basis to allow for continued query efficiency and to update PostgreSQL’s internal statistics regularly. Remember that high amounts of dead rows (essentially wasted space) can definitely slow down your queries in the long-term.
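As a rough illustration of how you might keep tabs on dead rows, the sketch below turns n_dead_tup into a per-table ratio. It assumes you also collect n_live_tup, a column from the same pg_stat_user_tables view that this post does not otherwise discuss; the function name and the sample numbers are made up.

```javascript
// Hypothetical helper: fraction of a table's rows that are dead.
const deadRowRatio = (nLiveTup, nDeadTup) => {
  const total = nLiveTup + nDeadTup;
  // An empty table has no dead rows to worry about.
  return total === 0 ? 0 : nDeadTup / total;
};

// Made-up sample: 9,000 live rows and 1,000 dead rows.
const deadRatio = deadRowRatio(9000, 1000);
console.log(`Dead rows: ${(deadRatio * 100).toFixed(1)}% of the table`);
```

A ratio that keeps climbing between checks suggests VACUUM is not keeping up with updates and deletes on that table.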

If you encounter high rates of change in both read and write throughput, it makes sense to check whether delays are occurring because of locks (see the pg_locks view) on tables or rows that are currently experiencing or awaiting updates. Related to this is the presence of any deadlocks in the database, which occur when two or more transactions each hold a lock that another one needs, so none of them can proceed. It’s best to avoid deadlocks altogether, if possible, by ensuring that locks are acquired in a consistent order each time.

Reliability

If your data is pretty important to you, then you’re probably keeping multiple copies of it (so you don’t lose it all in the event of a crash) and you want it to be highly available at all times. This is where the pg_stat_bgwriter view can make a big difference. It tracks a number of checkpoint metrics.

Checkpoints are periodic moments in the transaction process that ensure data files have been updated up to that moment on disk. If that sounds confusing, think of how word processors periodically auto-save the files you’re working on and if your program were to crash, upon reboot you’re brought back to that previous auto-saved version. Checkpoints operate similarly with respect to recorded and updated data files. Generally, the process of flushing the updated data to disk can cause significant I/O load, and as a result, checkpoint activity is spaced out in order to avoid a loss in performance. This means that a single checkpoint must complete before the next one can start.

Compare the following two variables: checkpoints_req and checkpoints_timed. The first shows the number of checkpoints requested while the latter represents the number of checkpoints scheduled. It’s preferable to have more checkpoints scheduled than requested; the other way around could point to your checkpoints not being able to keep up with the rate of data updates and indicate heavy load on the database.
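The comparison above can be expressed as a single number. This is only a sketch with a hypothetical function name and made-up sample values; what counts as "too many" requested checkpoints depends on your workload.

```javascript
// Hypothetical helper: share of all checkpoints that were requested
// rather than scheduled.
const requestedCheckpointShare = (checkpointsReq, checkpointsTimed) => {
  const total = checkpointsReq + checkpointsTimed;
  return total === 0 ? 0 : checkpointsReq / total;
};

// Made-up sample: 5 requested checkpoints versus 95 scheduled ones.
const requestedShare = requestedCheckpointShare(5, 95);
console.log(`Requested checkpoints: ${(requestedShare * 100).toFixed(1)}%`);
```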

The pg_stat_bgwriter view also shows metrics on how PostgreSQL flushes data in memory (buffers) to disk. It can do this in three different ways:

buffers_backend - via backends
buffers_clean - via the background writer
buffers_checkpoint - via the checkpoint process

Ideally you want most of the flushes happening via the checkpoint process, but sometimes the background writer steps in to help lighten the I/O load that often occurs in the checkpoint process. An increase in buffers written directly by backends could mean a write-intensive load that is creating buffers so fast the checkpoint process can’t keep up. Ultimately, it’s in your best interest to keep an eye on these three.
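To watch all three at once, you could break the buffer writes down by who performed them. The helper below is a hypothetical sketch with made-up numbers; it takes the three counters as they are named in pg_stat_bgwriter.

```javascript
// Hypothetical helper: fraction of buffer writes done by each path.
const bufferWriteShares = ({ buffers_checkpoint, buffers_clean, buffers_backend }) => {
  const total = buffers_checkpoint + buffers_clean + buffers_backend;
  // With no writes recorded yet, report zero for every path.
  const share = (count) => (total === 0 ? 0 : count / total);
  return {
    checkpoint: share(buffers_checkpoint),
    backgroundWriter: share(buffers_clean),
    backends: share(buffers_backend),
  };
};

// Made-up sample values.
const shares = bufferWriteShares({
  buffers_checkpoint: 700,
  buffers_clean: 200,
  buffers_backend: 100,
});
console.log(shares);
```

A growing backends share over time is the pattern described above: writes arriving faster than the checkpoint process can flush them.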

Summary

Hopefully all this information can be combined with the previous tutorial to make it super easy for you to monitor your PostgreSQL databases using Telegraf and InfluxDB. Feel free to reach out to us on Twitter @InfluxDB and @mschae16 with any questions or comments or you can check out our community forum to see what other InfluxData users are building.

Monitoring Your PostgreSQL Database with Telegraf and InfluxDB

Margo Schaedel (InfluxData) — Fri, 20 Jul 2018 10:40:24 -0700

Overview

This tutorial will specifically cover the process of setting up Telegraf and InfluxDB to monitor PostgreSQL. For any newcomers to the scene, PostgreSQL (or just Postgres for short) is a really popular open source, object-relational database system that was originally spearheaded by developers at UC Berkeley back in 1986. It has important features like multi-version concurrency control and write-ahead logging that help to ensure data reliability. If you’re not too familiar with PostgreSQL, I’d recommend starting with their beginner’s tutorial.

Recognizing the importance of tracking and monitoring performance and throughput of databases, the makers of PostgreSQL added a statistics collector that automatically amasses information about its own database activity. You essentially have all these great metrics right out of the box. So let’s capitalize on that, expose all those metrics to Telegraf and send them on over to InfluxDB.

What You'll Need

I’m using a local installation of InfluxDB, Telegraf, and Chronograf for this tutorial; the “Getting Started” guides for each of those projects are great and easy to walk through. You’ll also need PostgreSQL on your machine and if you don’t happen to have any sample applications and databases lying around, you can fork/clone this repo down to follow along—it’s just a small Node/Express app that stores color palettes in PostgreSQL—be sure to follow the README on how to get the app working.

Editing Your Telegraf Config

To start with, the Telegraf GitHub page offers a number of input and output plugins to suit a variety of use cases—one of those includes the PostgreSQL input plugin. If we configure this plugin correctly in our Telegraf configuration file, we should automatically start seeing metrics being sent over to our default telegraf.autogen database within InfluxDB.

Let’s try it out.

Navigate to your Telegraf config file and find the [[inputs.postgresql]] section. If you’re on macOS and used Homebrew to install InfluxDB and Telegraf, the path /usr/local/etc/telegraf.conf should get you to the default config file. Otherwise, refer to the Telegraf docs for the location on your platform.

# # Read metrics from one or many postgresql servers
# [[inputs.postgresql]]
#   ## specify address via a url matching:
#   ##   postgres://[pqgotest[:password]]@localhost[/dbname]\
#   ##       ?sslmode=[disable|verify-ca|verify-full]
#   ## or a simple string:
#   ##   host=localhost user=pqotest password=... sslmode=... dbname=app_production
#   ##
#   ## All connection parameters are optional.
#   ##
#   ## Without the dbname parameter, the driver will default to a database
#   ## with the same name as the user. This dbname is just for instantiating a
#   ## connection with the server and doesn't restrict the databases we are trying
#   ## to grab metrics for.
#   ##
#   address = "host=localhost user=postgres sslmode=disable"
#   ## A custom name for the database that will be used as the "server" tag in the
#   ## measurement output. If not specified, a default one generated from
#   ## the connection address is used.
#   # outputaddress = "db01"
#
#   ## connection configuration.
#   ## maxlifetime - specify the maximum lifetime of a connection.
#   ## default is forever (0s)
#   max_lifetime = "0s"
#
#   ## A  list of databases to explicitly ignore.  If not specified, metrics for all
#   ## databases are gathered.  Do NOT use with the 'databases' option.
#   # ignored_databases = ["postgres", "template0", "template1"]
#
#   ## A list of databases to pull metrics about. If not specified, metrics for all
#   ## databases are gathered.  Do NOT use with the 'ignored_databases' option.
#   # databases = ["app_production", "testing"]

This is what the config file looks like out of the box. As you can see, the instructions are fairly simple. You need to specify the address so Telegraf can talk to your PostgreSQL server. Within that address you can optionally specify other parameters, such as a username and password, set sslmode to enable or disable SSL, and connect to a specific database if you wish.

If you want a custom name for the server tag in your InfluxDB database, you can specify it in outputaddress. The max_lifetime setting dictates how long a connection may remain open. Finally, you can list databases either to ignore (ignored_databases) or to collect metrics for exclusively (databases). You can use only one of these two options, not both.

This plugin makes it easy to pull metrics from the built-in pg_stat_database and pg_stat_bgwriter views within PostgreSQL. Check out the docs to see exactly what metrics are pulled. Let’s change the address value to a string listing our host as localhost, like so:

address = "host=localhost"

The only other thing to ensure is that your data output will be sent to InfluxDB. If you scroll down to the outputs.influxdb section, you can edit the url to include InfluxDB’s default port 8086:

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  ## The full HTTP or UDP URL for your InfluxDB instance.
  ##
  ## Multiple urls can be specified as part of the same cluster,
  ## this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://localhost:8089"] # UDP endpoint example
  urls = ["http://localhost:8086"] # required
  ## The target database for metrics (telegraf will create it if not exists).
  database = "telegraf" # required

  ## Name of existing retention policy to write to.  Empty string writes to
  ## the default retention policy.
  retention_policy = ""
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "5s"
  # username = "telegraf"
  # password = "metricsmetricsmetricsmetrics"
  ## Set the user agent for HTTP POSTs (can be useful for log differentiation)
  # user_agent = "telegraf"
  ## Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
  # udp_payload = 512

  ## Optional SSL Config
  # ssl_ca = "/etc/telegraf/ca.pem"
  # ssl_cert = "/etc/telegraf/cert.pem"
  # ssl_key = "/etc/telegraf/key.pem"
  ## Use SSL but skip chain & host verification
  # insecure_skip_verify = false

  ## HTTP Proxy Config
  # http_proxy = "http://corporate.proxy:3128"

  ## Optional HTTP headers
  # http_headers = {"X-Special-Header" = "Special-Value"}

  ## Compress each HTTP request payload using GZIP.
  # content_encoding = "gzip"

Restart Telegraf and Chronograf, navigate to Chronograf’s default port (8888) and in the Data Explorer section of the menu, you should see a measurement called postgresql under the default telegraf.autogen database. You should also see a plethora of metrics in the field column, including blk_read_time, blk_write_time, buffers_clean, datid, deadlocks, tup_inserted, and tup_deleted, just to name a few. To read up on what each of those fields means exactly, check out this reference page.

Alternatively, you can query the data from InfluxDB using the CLI. In your terminal, type influx to access the Influx shell. The SHOW DATABASES command lists your databases; USE [databasename] followed by SHOW MEASUREMENTS lists the measurement names associated with that particular database. Then you can run various query statements, such as:

SELECT mean("xact_commit") AS "mean_xact_commit" FROM "telegraf"."autogen"."postgresql" WHERE time > now() - 5m AND "db"='palette_picker'

SELECT * FROM "telegraf"."autogen"."postgresql" WHERE time > now() - 1m AND "db"='palette_picker'

Try it out and see for yourself! If you get too query-happy and need to kill a query at any time, just run KILL QUERY [qid]; you can find the qid using the SHOW QUERIES command.

Monitoring PostgreSQL in Production

If you want to keep tabs on your PostgreSQL databases while in production, it’s easy-peasy. Just update the Telegraf config file with the correct address information. I’ve updated the address in my Telegraf config file below to monitor PostgreSQL from my Heroku instance of this same sample app (Palette Picker). I was able to find all these credentials on my Heroku dashboard page. Check it out:

address = "host=ec2-204-236-239-225.compute-1.amazonaws.com user=username password=password dbname=databasename"

(The username, password, and dbname have been changed here for security purposes)

Pretty simple, right?
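If you manage several environments, you might assemble that address string from its parts rather than editing it by hand. The sketch below is purely illustrative: buildAddress is a hypothetical helper, the sample values are placeholders, and it assumes none of the values contain spaces or quotes (which would need escaping).

```javascript
// Hypothetical helper: joins connection parameters into the
// "key=value key=value" form used for the address setting.
const buildAddress = (params) =>
  Object.entries(params)
    // Skip parameters that were left unset.
    .filter(([, value]) => value !== undefined && value !== '')
    .map(([key, value]) => `${key}=${value}`)
    .join(' ');

// Placeholder values, matching the local example from earlier in the post.
const address = buildAddress({
  host: 'localhost',
  user: 'postgres',
  password: undefined,
  sslmode: 'disable',
});
console.log(`address = "${address}"`);
```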

Next Steps

Hopefully this guide has helped show just how easy it is to monitor your PostgreSQL databases using Telegraf and InfluxDB. Next post, we’ll talk about some of the key metrics to keep an eye on when evaluating the health of your Postgres database. Feel free to reach out to us on Twitter @influxDB and @mschae16 with any questions or comments!

Simplifying InfluxDB: Retention Policy Best Practices

Margo Schaedel (InfluxData) — Wed, 20 Jun 2018 09:00:42 -0700

Retention policies can often be tricky even at the best of times but when you’re dealing with time series data, setting up the appropriate retention policy to automatically expire (delete) unnecessary data can save you loads of time in the long run. This post will walk through some general guidelines on creating the best retention policy for your use case with InfluxDB.

Wait...What's a Retention Policy?

Data doesn't remain useful forever.

Before we start talking about best practices around retention policies, it’s important to understand just what they are. Although its name is somewhat explanatory, an InfluxDB retention policy is defined in the documentation as:

The part of InfluxDB’s data structure that describes for how long InfluxDB keeps data (duration), how many copies of those data are stored in the cluster (replication factor), and the time range covered by shard groups (shard group duration). RPs are unique per database and along with the measurement and tag set define a series.

When you create a database, InfluxDB automatically creates a retention policy called autogen with an infinite duration, a replication factor set to one, and a shard group duration set to seven days.

A retention policy dictates how long data will be kept and stored. Because time series data tends to pile up really quickly, you’re definitely going to want to discard or downsample data from InfluxDB once it’s no longer as useful. If you need further convincing, just check out these blog posts:

General Guidelines

There are a few key things to consider when you’re setting up your database’s retention policy. First and foremost, you’ll need to consider how long your use case requires that you retain the data. Do you need it for a week? A month? A year? This decision directly determines the duration you set for your retention policy and isn’t really negotiable.

But wait - you’re not done yet. Another integral part of setting up a retention policy involves designating the shard group duration for all data that will follow this retention policy. This is where things get tricky. Since shards really represent the core physical part of the database, tuning the shard group duration to just the right setting can really maximize performance and so, it’s important to get it right.

Setting the duration on the higher side will result in larger collections of data within each shard. This could cause problems when querying the database. For example, if you’re querying the database for a shorter time window than the shard group time span, the database may need to decode longer blocks of data in order to read a subset of the time range of the shard and that process will require greater effort and time.

On the other hand, if you set the shard group duration on the shorter side, the result is a greater number of shard groups. Due to Time Series Indexing, each shard will have some extra overhead in the form of this index and metadata, so having thousands of shards with little data on each is by no means efficient.

It can sometimes be difficult to determine the right setting for your shard group duration.

My recommendation is to be like Goldilocks and try them all out until you hit the perfect spot!

Okay, all joking aside - we at InfluxData recommend setting the shard group duration as follows:

The shard group duration should be twice your longest typical query's time range - yep, that means you'll need to think about what kinds of queries you'll be running on InfluxDB.
The shard group duration should be set so that each shard group ends up with at least 100,000 points per group - you want more data per shard, but not too much data.
The shard group duration should be set so that each shard group has at least 1,000 points per series.
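One way to turn these three guidelines into a starting number is sketched below. This is my own reading, not an official formula: it treats each guideline as a lower bound and takes the largest. The function name, parameter names, and sample workload are all hypothetical.

```javascript
// Hypothetical helper: smallest shard group duration (in seconds) that
// satisfies all three guidelines, treating each one as a lower bound.
const minShardGroupDurationSeconds = ({
  longestQuerySeconds,           // time range of your longest typical query
  pointsPerSecond,               // write rate across all series
  secondsBetweenPointsPerSeries, // reporting interval of a single series
}) => {
  const byQueryRange = 2 * longestQuerySeconds;
  const byPointsPerGroup = 100000 / pointsPerSecond;
  const byPointsPerSeries = 1000 * secondsBetweenPointsPerSeries;
  return Math.max(byQueryRange, byPointsPerGroup, byPointsPerSeries);
};

// Made-up workload: 6-hour queries, 10 points/s overall, and each
// series reporting every 10 seconds.
const seconds = minShardGroupDurationSeconds({
  longestQuerySeconds: 6 * 60 * 60,
  pointsPerSecond: 10,
  secondsBetweenPointsPerSeries: 10,
});
console.log(`Start with a shard group duration of about ${seconds / 3600}h`);
```

For this made-up workload the query-range guideline dominates. Treat the result as a first guess to tune from, in the spirit of Goldilocks, rather than a final answer.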

Summary

If you’re new to using InfluxDB, setting up your database schema and retention policies can sometimes feel like a daunting task, especially in more exceptional cases like working with very large clusters (InfluxDB Enterprise) or with very long or short retention periods. You’ll definitely want to spend some time tweaking retention duration and shard group duration until you find the right fit. After all, it took Goldilocks three tries, right? Once you find that setting that’s just right, tweet us @InfluxDB and @mschae16 and tell us all about it!

Simplifying InfluxDB: Shards and Retention Policies

Margo Schaedel (InfluxData) — Tue, 05 Jun 2018 08:00:10 -0700

I recently did a webinar on an Introduction to InfluxDB and Telegraf and in preparing for it, I came to the woeful realization that there are still a number of concepts about InfluxDB which remain quite mysterious to me. Now, if you’re anything like me, databases and data storage don’t come naturally. (If they do, it never hurts to have a bit of review). I thought I had a pretty thorough understanding of InfluxDB as a time series data store, but now I see that there’s a lot more to it than meets the eye. Coincidentally, the inner workings of InfluxDB are pretty mysterious to some of our community as well and thus this blog post. In this guide (of sorts), we’ll try to make sense of some of the more enigmatic concepts around InfluxDB - specifically with regards to retention policies, shard groups, and shards. We’ll look at what they are and how they’re related to one another.

It's about time we paid attention to TSDBs.

Before We Jump In

If you’re new to the Time Series Database world, or if this is the first time you’re reading about InfluxDB, you may want to do a little light reading and gain some contextual knowledge. Here are some helpful resources to get you up to speed:

Retention Policies

Let’s tackle retention policies first. Time series data by nature begins to pile up pretty quickly and it can be helpful to discard old data after it’s no longer useful. Retention policies offer a simple and effective way to achieve this. It amounts to what is essentially an expiration date on your data. Once the data is “expired” it will automatically be dropped from the database, an action commonly referred to as retention policy enforcement. When it comes time to drop that data however, InfluxDB doesn’t just drop one data point at a time; it drops an entire shard group.

Retention policies drop entire groups of data, not just a single data point.

Shard Groups

A shard group is a container for shards, which in turn contain the actual time series data (but more on that in a minute). Every shard group has a corresponding retention policy and any shards within a single shard group adhere to the same retention policy. Additionally, every shard group has a shard group duration, which dictates the window of time each shard group spans (the time interval). The time interval can be specified when configuring the retention policy. If nothing is specified, the shard group duration defaults to 7 days.
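To show where the shard group duration is specified, here is a small sketch that assembles an InfluxQL CREATE RETENTION POLICY statement. The helper name, the policy name one_year, and the database name mydb are hypothetical; you would run the resulting statement in the Influx shell.

```javascript
// Hypothetical helper: builds a CREATE RETENTION POLICY statement.
// SHARD DURATION is optional; if it is left out, InfluxDB picks the
// shard group duration itself.
const createRetentionPolicyStatement = ({ name, database, duration, replication = 1, shardDuration }) => {
  let statement = `CREATE RETENTION POLICY "${name}" ON "${database}" DURATION ${duration} REPLICATION ${replication}`;
  if (shardDuration) {
    statement += ` SHARD DURATION ${shardDuration}`;
  }
  return statement;
};

// Placeholder names and durations, for illustration only.
const statement = createRetentionPolicyStatement({
  name: 'one_year',
  database: 'mydb',
  duration: '52w',
  shardDuration: '1w',
});
console.log(statement);
```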

Shards

When we think about a typical Time Series Database, the sheer volume of time series data that is stored and queried within merits an alternative approach to the categorization of that data. This is where shards come in. Shards are ideal containers for time series data. Sharding the data within InfluxDB allows for a highly scalable approach for boosting throughput and overall performance, especially considering that the data in a Time Series Database will in all likelihood grow over time.

Shards contain temporal blocks of data and are mapped to an underlying storage engine database. The InfluxDB storage engine is called TSM or Time-Structured Merge Tree and is remarkably similar to an LSM Tree. The TSM files are what contains the encoded and compressed time series data, organized within shards.

All shards belong to a single shard group, and their time intervals fall within the shard group’s time interval. It’s quite possible to have a single shard per shard group, as we see in the open-source version of InfluxDB, or multiple shards per shard group as often occurs in a multi-node cluster.

Looping Back to RPs

Looping back to retention policies for a moment, let’s take a closer look at how things fit together. When you create a database in InfluxDB, a default retention policy called autogen is automatically created for that database. If you choose not to modify the default policy, its duration is set to infinite. In this case, the shard group duration will default to 7 days. This means that your data will be stored in 1 week time windows. With an infinite duration, no shard group ever passes an expiration date, so retention policy enforcement never drops anything and the retention policy is effectively disabled. On the other hand, the minimum time you can set your retention policy to is one hour.

Another way to think about it is that a retention policy is like a bucket for shard groups to live in. Once the retention policy expiration date kicks in, you throw out the shard group that has the interval of time that doesn’t pass the retention policy expiry date. So even as time passes, you’ll still have the same amount of data available to you - it will just shift in time. For example, if I set my retention policy to one year, I’ll always have a year’s worth of data available to me (once I hit that first year mark).

As you can see, shard groups (and by association, shards) are closely related to retention policies; if a retention policy has data, it will have at least one shard group. Every data point, which is a measurement consisting of any number of values and tags associated with a particular point in time, must be associated with a database and a retention policy. It’s important to remember here that a database can have more than one retention policy and that all retention policies are unique per database.

OSS vs. Enterprise

Things get a little hairy when we start looking at shards, shard groups, and retention policies in InfluxDB Enterprise as compared to the open-source version of InfluxDB. If we’re using the open-source version, we’ve only got a single node instance of InfluxDB, and this means we don’t need to worry about replicating our data because that feature isn’t available. So the shard group ends up having only one shard within it, effectively making them the same thing (another way to think about it is that the shard becomes redundant). This is because you don’t need to spread the data evenly across multiple nodes—you’ve only got one node! When the retention policy kicks in, you drop the whole shard group.

With InfluxDB Enterprise, on the other hand, you can have multiple node instances of InfluxDB. If you want to know more about this clustering capability, I recommend reading this blog, which covers the basics. Having more than one node in a cluster is the reason shard groups exist. We needed a way to spread the data evenly across multiple nodes, while still belonging to the appropriate database, retention policy, and time interval. In Enterprise, a shard group can have (and usually does have) a set of shards within it that all share the same time span. Each shard in the shard group would contain a different subset of time series.

We also see replication factor come into play with the Enterprise version of InfluxDB. The replication factor represents the number of copies you want to make of the data. You can specify the replication factor in the database retention policy. Two copies of the same shard do not end up on the same node; each copy lives on a separate node. That way, if one node goes down, you still have a backup on another node.

Seeing It in Action

To help this sink in, let’s consider all of this with a few examples:

For the open-source version, remember we’ve only got one node instance, so a shard group would have only one shard within, like so:

Data Points
---------------
series_a t0
series_a t4

series_b t2
series_b t6

series_c t3
series_c t8

series_d t7
series_d t9

Shard Group Z (t0 - t10)
-------------
Shard 1 (series_a, series_b, series_c, series_d)

From the simplified example above, you see we have a shard group (Z) with a time span from t0 to t10 and several series subsets (a, b, c, and d). Because we don’t have to worry about distribution here (spreading the data evenly across various nodes), all the series are contained within one shard (Shard 1).

For the Enterprise version, we can have more than one node, so things get a little more complicated. If we had a two-node cluster, for example, with a replication factor of 1:

Data Points
---------------
series_a t0
series_a t4

series_b t2
series_b t6

series_c t3
series_c t8

series_d t7
series_d t9

Shard Group Z (t0 - t10)
-------------
Shard 1 (series_a, series_c) (Node A)
Shard 2 (series_b, series_d) (Node B)

You can see we still have shard group Z with a time span from t0 to t10, but this shard group contains two shards. Because the replication factor is only 1 (i.e. only 1 copy of the data), distribution takes priority and so half the data is stored on Node A and the other half is stored on Node B. This evenly spreads the data across the two nodes and lessens the possibility of performance issues. However, if we increase the replication factor to 2, replication takes precedence over distribution and the outcome looks quite similar to the open-source example. See below:

Data Points
---------------
series_a t0
series_a t4

series_b t2
series_b t6

series_c t3
series_c t8

series_d t7
series_d t9

Shard Group Z (t0 - t10)
-------------
Shard 1 (series_a, series_b, series_c, series_d) (Node A, Node B)

Now we’re back to one shard within one shard group (Z), but it exists on both nodes, due to the replication factor. Let’s add retention policy to the mix now.
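The three layouts above can be reproduced with a toy model. This is only a sketch under two assumptions that are not confirmed by the examples themselves: that the number of shards in a shard group is the node count divided by the replication factor (rounded down), and that a series is assigned to a shard by hashing its name. The hash here is a deliberately simple stand-in, and `planShardGroup` is a made-up name.

```javascript
// Illustrative only: a toy model of how series might be spread across shards.
// Assumptions: shards per shard group = floor(nodes / replicationFactor), and
// a series is assigned to a shard by a simple hash of its name.
function toyHash(text) {
  let sum = 0;
  for (const ch of text) sum += ch.charCodeAt(0);
  return sum;
}

function planShardGroup(nodes, replicationFactor, seriesNames) {
  const shardCount = Math.max(1, Math.floor(nodes.length / replicationFactor));
  const shards = [];
  for (let i = 0; i < shardCount; i++) {
    // Each shard is owned by `replicationFactor` distinct nodes.
    const owners = nodes.slice(i * replicationFactor, (i + 1) * replicationFactor);
    shards.push({ id: i + 1, owners, series: [] });
  }
  for (const name of seriesNames) {
    shards[toyHash(name) % shardCount].series.push(name);
  }
  return shards;
}

const series = ['series_a', 'series_b', 'series_c', 'series_d'];
console.log(JSON.stringify(planShardGroup(['Node A', 'Node B'], 1, series)));
console.log(JSON.stringify(planShardGroup(['Node A', 'Node B'], 2, series)));
```

With a replication factor of 1 the series split across two shards, one per node; with a replication factor of 2 there is a single shard holding every series, owned by both nodes.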

Let’s say we’ve got our database all set up with a retention policy of 1 day (24hrs) and our shard group duration set to the recommended 1 hour time interval. If this is the OSS version of InfluxDB, the shard group will contain one shard. That shard will house all series for the 1 hour time span similar to what we saw in our first example:

Shard Group Z (t0 - t60)
-------------
Shard 1 (series_a, series_b, series_c, series_d)

Of course for every hour in the day, a new shard group will be created spanning 60 minutes and the number of shard groups will continue to increase until we hit the 25th hour (after 1 full day passes). When the retention policy is enforced, we will see that the initial shard group has passed the expiration point, and so the entire shard group will be dropped. This will continue on the hour, every hour. So at any given time, we will have precisely 1 day’s worth of data.
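The hourly cycle described above can be simulated in a few lines. This is a simplified model rather than InfluxDB’s actual enforcement logic: it assumes a shard group is dropped as soon as its end time is older than the retention window, and that the check runs once per simulated hour.

```javascript
// A small simulation of hourly shard groups under a 24-hour retention policy.
// Assumption: a shard group is dropped once its END time is older than
// (now - retention duration); real enforcement runs on a periodic check.
const HOUR = 60 * 60 * 1000;

function simulate(hoursElapsed, retentionMs, shardDurationMs) {
  let groups = [];
  for (let h = 0; h < hoursElapsed; h++) {
    const now = h * HOUR;
    // Create the shard group covering the current hour if it does not exist yet.
    const start = Math.floor(now / shardDurationMs) * shardDurationMs;
    if (!groups.some(g => g.start === start)) {
      groups.push({ start, end: start + shardDurationMs });
    }
    // Enforce retention: keep only groups that still overlap the retention window.
    groups = groups.filter(g => g.end > now - retentionMs);
  }
  return groups;
}

const remaining = simulate(72, 24 * HOUR, HOUR);
console.log(remaining.length);
```

In this model the count settles at 25 shard groups after the first day: 24 full hours of data plus the group for the hour currently being written.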

Making Sense of It All

To summarize:

An InfluxDB instance can have 1 or more databases.
Each of those databases can have 1 or more retention policies.
You can specify the retention interval, shard group duration, and replication factor in your retention policy.
Each retention policy can have 1 or more shard groups (as long as there's data).
Each shard group can have 1 or more shards (always 1 shard for the OSS version).
Shards contain the actual data.
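As a quick illustration of where those three settings go, the sketch below builds an InfluxQL CREATE RETENTION POLICY statement as a string. The helper `createRetentionPolicy` and the policy and database names are placeholders chosen for this example; you would run the resulting statement in the Influx shell or send it through a client library.

```javascript
// Builds an InfluxQL CREATE RETENTION POLICY statement as a string.
// The policy and database names used below are placeholders.
function createRetentionPolicy({ name, database, duration, replication = 1, shardDuration, isDefault = false }) {
  let statement = `CREATE RETENTION POLICY "${name}" ON "${database}" ` +
    `DURATION ${duration} REPLICATION ${replication}`;
  if (shardDuration) statement += ` SHARD DURATION ${shardDuration}`;
  if (isDefault) statement += ' DEFAULT';
  return statement;
}

console.log(createRetentionPolicy({
  name: 'one_day',
  database: 'ocean_tides',
  duration: '1d',
  shardDuration: '1h',
}));
```

Leaving out `shardDuration` omits the SHARD DURATION clause, in which case InfluxDB chooses the shard group duration itself.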

I hope this post has helped to clear things up a little, but if you’re still feeling confused (trust me, I know the feeling well), please reach out to us on Twitter @influxDB and @mschae16 and we can try to answer all your questions. Or check out our awesome community site where everyone comes together to help each other out with debugging and making sense of the magical and oft-times mysterious InfluxData platform.

Visualizing Your Time Series Data with the Highcharts Library

Margo Schaedel (InfluxData) — Thu, 19 Apr 2018 13:00:12 -0700

There have been a couple of posts in the past on visualizing your time series data using different charting libraries such as this integration with plotly.js or this one on the Rickshaw library. Today we’re going to take a look at the charting library, Highcharts—another great tool for your data visualization needs. Of course, if you don’t want to pull in external graphing libraries, you can always check out Grafana or Chronograf. Grafana easily integrates with InfluxDB, and Chronograf was built out specifically to be used with InfluxDB.

<figcaption> Our famed InfluxDB I’iwi</figcaption>

Before we start throwing those graphs on the page though, you’ll need to ensure you have an instance of InfluxDB up and running. You can get all the components of the TICK Stack set up locally or spin up the stack in our handy sandbox mode.

I recently published a beginner’s guide on the Node-influx client library as an option for integrating with InfluxDB without necessarily having to use Telegraf to collect your data. This visualization is built out using the same ocean tide data from that post. You can clone the repo down here if you want to check out the end product.

First Steps

Pulling in the library is our first step. I added the following script tag to the head section of the index.html file.

<script src="https://code.highcharts.com/highcharts.js"></script>

To the body of the index.html file, you’ll need a container div with an id of ‘container’ so we can later target that in the script file, like so:

<div id="container"></div>

The Highcharts graph will be rendered within this container.

In our server file we’ve already set up an endpoint to query the data from our ocean tides database (see below) so we’ll need to fetch the data in our script file and set it into our Highcharts constructor function.

app.get('/api/v1/tide/:place', (request, response) => {
  const { place } = request.params;
  influx.query(`
    select * from tide
    where location =~ /(?i)(${place})/
  `)
  .then( result => response.status(200).json(result) )
  .catch( error => response.status(500).json({ error }) );
});

In the script file, I wrote a simple fetch function that retrieves the data based on the location name passed in.

const fetchData = (place) => {
  return fetch(`/api/v1/tide/${place}`)
    .then(res => {
      if (res.status !== 200) {
        console.log(res);
      }
      return res;
    })
    .then(res => res.json())
    .catch(error => console.log(error));
}

To fetch all the data for the four different locations, I used Promise.all() and then mutated the results to fit into the required format referenced in the Highcharts documentation.

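return Promise.all([
    fetchData('hilo'),
    fetchData('hanalei'),
    fetchData('honolulu'),
    fetchData('kahului')
  ])
  .then(parsedRes => {
    // Reshape each location's rows into the { name, data } format Highcharts expects.
    const mutatedArray = parsedRes.map( arr => {
      return Object.assign({}, {
        name: arr[0].location,
        data: arr.map( obj => Object.assign({}, {
          x: (moment(obj.time).unix())*1000, // Highcharts wants milliseconds
          y: obj.height
        }))
      });
    });
    // mutatedArray is only in scope inside this callback, so the chart
    // has to be built here (or the array returned from it).
  })
  .catch(error => console.log(error));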
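If you would rather not pull in moment just for the timestamp conversion, the reshaping step can be written as a small standalone function. This is a sketch that assumes each row has `time`, `location`, and `height` properties (as the code above suggests); the function name `toSeries` and the sample values are made up for illustration.

```javascript
// Converts rows shaped like { time, location, height } into the
// { name, data: [{ x, y }] } objects that Highcharts accepts for a series.
// Empty result sets are skipped so that rows[0] is always defined.
function toSeries(resultSets) {
  return resultSets
    .filter(rows => Array.isArray(rows) && rows.length > 0)
    .map(rows => ({
      name: rows[0].location,
      data: rows.map(row => ({
        x: new Date(row.time).getTime(), // milliseconds since the epoch
        y: row.height,
      })),
    }));
}

// Made-up sample values for illustration.
const sample = [
  [
    { time: '2018-04-19T00:00:00Z', location: 'hilo', height: 1.2 },
    { time: '2018-04-19T01:00:00Z', location: 'hilo', height: 1.5 },
  ],
  [],
];
console.log(JSON.stringify(toSeries(sample)));
```

Filtering out empty result sets means a location with no data simply does not appear in the chart instead of throwing an error.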

Now that we have our data ready to go, we can construct our graph.

Highcharts.chart('container', {
            colors: ['#508991', '#175456', '#09BC8A', '#78CAD2'],
            chart: {
              backgroundColor: {
                  linearGradient: [0, 600, 0, 0],
                  stops: [
                    [0, 'rgb(255, 255, 255)'],
                    [1, 'rgb(161, 210, 206)']
                  ]
              },
              type: 'spline'
            },
            title: {
              text: 'Hawaii Ocean Tides',
              style: {
                'color': '#175456',
              }
            },
            xAxis: {
              type: 'datetime'
            },
            yAxis: {
              title: {
                text: 'Height (ft)'
              }
            },
            plotOptions: {
              series: {
                turboThreshold: 2000,
              }
            },
            series: mutatedArray
          });

There’s definitely a lot going on here. The Highcharts library comes with the method chart() which accepts two arguments: the target element within which to render the chart and an options object within which you can specify various properties such as style, title, legend, series, type, plotOptions, and so on. Let’s go through each of the options one by one.

colors: [array] - The colors property accepts an array of hex codes which will represent the default color scheme for the chart. If all colors are used up, any new colors needed will result in the array being looped through again.
chart: {object} - The chart property accepts an object with various additional properties including type, zoomType, animation, events, description and a number of style properties. In this instance, I've given the background a linear gradient and designated the type as spline.
title: {object} - This represents the chart's main title and can be additionally given a style object to jazz things up a bit.
xAxis: {object} - In this scenario, because I'm using time series data, I know the x-axis will always be time so I can designate the type as 'datetime' and the scale will automatically adjust to the appropriate time unit. However, there are numerous other options here including styling, labels, custom tick placement, and logarithmic or linear type.
yAxis: {object} - Similar to the xAxis property, the y-axis takes an object and has access to a number of options to customize the design and style of the chart's y-axis. I've only specified y-axis title in this case, and deferred to Highcharts automatic tick placement.
plotOptions: {object} - The plotOptions property is a wrapper object for config objects for each series type. The config objects for each series can also be overridden for an individual series item as given in the series array. Here I've used the plotOptions.series property to override the default turboThreshold of 1000 and change it to 2000. This allows for charting a greater number of data points (over the default of 1000). According to the docs, config options for the series are accessed at three different levels. If you want to target all series in a chart, you would use the plotOptions.series object. For series of a specific type, you would access the plotOptions of that type. For instance, to target the plotOptions for a chart type of 'line' you would access the plotOptions.line object. Lastly, options for a specific series are given in the series property (see next bullet point).
series: [array] or {object} - This is where you'll pass in your data. You can additionally define the type for the data to be passed in, give it a name, and define additional plotOptions for it.
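The three levels described under plotOptions can be easier to see side by side. In the sketch below, the options object uses real Highcharts option names, but `resolveOption` is a made-up helper that only mimics the precedence described above (individual series first, then the series type, then all series); it is not a Highcharts function.

```javascript
// A sketch of the three levels at which a series option can be set.
// resolveOption is a made-up helper that mimics the described precedence.
const options = {
  plotOptions: {
    series: { turboThreshold: 2000 },   // applies to every series
    spline: { lineWidth: 3 },           // applies to series of type 'spline'
  },
  series: [
    { name: 'hilo', type: 'spline', data: [] },
    { name: 'hanalei', type: 'spline', lineWidth: 1, data: [] }, // per-series override
  ],
};

function resolveOption(chartOptions, seriesIndex, key) {
  const item = chartOptions.series[seriesIndex];
  const byType = (chartOptions.plotOptions || {})[item.type] || {};
  const forAll = (chartOptions.plotOptions || {}).series || {};
  if (key in item) return item[key];
  if (key in byType) return byType[key];
  return forAll[key];
}

console.log(resolveOption(options, 0, 'lineWidth'));      // from plotOptions.spline
console.log(resolveOption(options, 1, 'lineWidth'));      // from the series item itself
console.log(resolveOption(options, 0, 'turboThreshold')); // from plotOptions.series
```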

Check out the result!

<figcaption> How wavy! (Get it? - You know, because of the ocean… and tides.)</figcaption>

This information really just covers the tip of the iceberg. The possibilities seem endless in terms of what you can create using the Highcharts graphing library. Why not take a look at their documentation or demos and let us know all about your new creations with InfluxDB and Highcharts? Questions and comments? You can always reach out to us on Twitter: @mschae16 or @influxDB. Happy charting!

Getting Started with the Node-Influx Client Library

Margo Schaedel (InfluxData) — Tue, 17 Apr 2018 13:20:39 -0700

Embark on a new journey with node-influx!

When in doubt, start at the beginning: an adage that applies to any learning journey, including getting started with the node-influx client library. Let’s take a look at one InfluxDB client library in particular: node-influx, an InfluxDB client for JavaScript users. This client library features a simple API for most InfluxDB operations and is fully supported in Node and the browser, all without needing any extra dependencies.

There’s a great tutorial for the node-influx library available online as well as some handy documentation, which I recommend reading through beforehand. Here, we will just cover a few of the basics.

What You'll Need

For this tutorial, I’ll be running a local installation of InfluxDB; you can learn how to get that up and running here. You’ll also need Node installed. If Node.js is not your cup of tea, there are plenty of other client libraries to work with and several guides on using InfluxDB with other languages available, such as these posts on Python and Ruby.

Set the Scene

How to get slotted

Let us imagine for a minute you have an inexplicable love for surfing. You find yourself in Hawaii on a journey following in Duke’s footsteps and you’re trying to find the best surf spot. And the best time at which to surf said amazing spot. Makes sense to take a look at the tides, right? Well, according to our trusty friend Wikipedia, ocean tides are a great example of time series data. They ebb and flow over time (yes, I know I’m laying it on rather thick here). So let’s practice putting some sample tide data into InfluxDB using the node-influx library and see what happens.

First things first, we need to install the node-influx library in the application folder where it will be used.

$ npm install --save influx

This adds the node-influx library to our node_modules; we also need to require the library into our server file, like so:

const Influx = require('influx');

We’ll use the following constructor function to connect to a single InfluxDB instance and specify our connection options.

const influx = new Influx.InfluxDB({
  host: 'localhost',
  database: 'ocean_tides',
  schema: [
    {
      measurement: 'tide',
      fields: { height: Influx.FieldType.FLOAT },
      tags: ['unit', 'location']
    }
  ]
});

There are a few different options available here:

You could connect to a single host by passing the DSN as a string into the constructor argument, like so:

const influx = new Influx.InfluxDB('http://user:password@host:8086/database')

You could also pass in a full set of config details and specify properties such as username, password, database, host, port, and schema - that's what we did above.

If you have multiple Influx nodes to connect to, you can pass in a cluster config. For example:

const client = new Influx.InfluxDB({
  database: 'my_database',
  username: 'duke_kahanamoku',
  password: 'aloha',
  // Each entry describes one node in the cluster.
  hosts: [
    { host: 'db1.example.com' },
    { host: 'db2.example.com' },
  ],
  schema: [
    {
      measurement: 'tide',
      fields: { height: Influx.FieldType.FLOAT },
      tags: ['unit', 'location']
    }
  ]
})

It’s worth noting here that within your schema design, you will need to designate the FieldType for your field values using Influx.FieldType - they can be strings, integers, floats, or booleans.

Checking The Database

We can use influx.getDatabaseNames() to first check if our database already exists. If it doesn’t, we can then use influx.createDatabase() to create our database. See below:

influx.getDatabaseNames()
  .then(names => {
    if (!names.includes('ocean_tides')) {
      return influx.createDatabase('ocean_tides');
    }
  })
  .then(() => {
    app.listen(app.get('port'), () => {
      console.log(`Listening on ${app.get('port')}.`);
    });
    writeDataToInflux(hanalei);
    writeDataToInflux(hilo);
    writeDataToInflux(honolulu);
    writeDataToInflux(kahului);
  })
  .catch(error => console.log({ error }));

We are first grabbing all the databases available from our connected Influx instance, and then cycling through the returned array to see if any of the names match up with ‘ocean_tides’. If none do, then we create a new database with that name. The callback from that then writes our data into the database.

Writing Data to InfluxDB

Using influx.writePoints(), we can write our data points into the database.

influx.writePoints([
      {
        measurement: 'tide',
        tags: {
          unit: locationObj.rawtide.tideInfo[0].units,
          location: locationObj.rawtide.tideInfo[0].tideSite,
        },
        fields: { height: tidePoint.height },
        timestamp: tidePoint.epoch,
      }
    ], {
      database: 'ocean_tides',
      precision: 's',
    })
    .catch(error => {
      console.error(`Error saving data to InfluxDB! ${error.stack}`)
    });

To keep things simple, I just pulled in a few sample data files, looped through them by location, and wrote each data point to InfluxDB under the measurement name tide with location and unit tags (both are strings). There is only one field here, height, and I send in a timestamp as well, although that is not technically required (it’s more accurate though). You can specify additional options such as the database to write to, the time precision, and the retention policy.
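The looping step might look something like the sketch below, which turns one location’s sample data into the array of point objects passed to writePoints(). The `tideHeights` property name is an assumption about the sample file layout (only `units`, `tideSite`, `height`, and `epoch` appear in the snippet above), and the sample values are made up.

```javascript
// Turns one location's sample data into the array of point objects that
// writePoints() expects. The `tideHeights` property name is an assumption
// about the sample file layout; adjust it to match your data.
function toPoints(locationObj) {
  const info = locationObj.rawtide.tideInfo[0];
  return info.tideHeights.map(tidePoint => ({
    measurement: 'tide',
    tags: { unit: info.units, location: info.tideSite },
    fields: { height: tidePoint.height },
    timestamp: tidePoint.epoch, // seconds, to match precision: 's'
  }));
}

// Made-up sample values for illustration.
const hilo = {
  rawtide: {
    tideInfo: [{
      units: 'feet',
      tideSite: 'Hilo',
      tideHeights: [
        { epoch: 1523923200, height: 1.1 },
        { epoch: 1523926800, height: 1.4 },
      ],
    }],
  },
};
console.log(toPoints(hilo));
```

Building the whole array first means a single writePoints() call per location, rather than one call per data point.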

Querying the Database

We’ve learned how to write data into the database; now we need to know how to query for that data. It’s simple - we can use influx.query() and pass in our InfluxQL statement to retrieve the data we want.

influx.query(`
    select * from tide
    where location =~ /(?i)(${place})/
  `)
  .then( result => response.status(200).json(result) )
  .catch( error => response.status(500).json({ error }) );

Here we are querying the database for any data from measurement tide where location contains the place name passed in (using a regular expression). If you’ve stored a lot of data, it’s a good idea to also limit your query to a certain time span. You can additionally pass in an options object (database, retention policy, and time precision) to the influx.query() method.
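Here is one way the time limit could be added, as a sketch rather than a drop-in replacement. It also escapes regex metacharacters in the place name, since that value comes straight from the request URL. The helper names and the 12-hour default are arbitrary choices for this example.

```javascript
// Builds the tide query with two safeguards: regex metacharacters in the
// user-supplied place name are escaped, and results are limited to a recent
// time span. The 12-hour default is an arbitrary choice for illustration.
function escapeForInfluxRegex(text) {
  // Escape regex metacharacters plus "/" (the InfluxQL regex delimiter).
  return text.replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&');
}

function buildTideQuery(place, hours = 12) {
  const safeHours = Math.max(1, Math.floor(Number(hours)) || 1);
  return `select * from tide ` +
    `where location =~ /(?i)(${escapeForInfluxRegex(place)})/ ` +
    `and time > now() - ${safeHours}h`;
}

console.log(buildTideQuery('hilo'));
console.log(buildTideQuery('hi.*lo/', 24));
```

The returned string can be passed to influx.query() exactly as before.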

Conclusion

That covers all the basics for the node-influx client library. Have a scan over the docs and let us know if there are other use cases you’d like to hear about! I’ve also posted all this code in a repository on GitHub if you want to try it out for yourself. Questions and comments? Reach out to us on Twitter: @mschae16 or @influxDB. Now go forth and find that monster wave. Surf’s up!

Batch Processing vs. Stream Processing: What's the Difference?

Margo Schaedel (InfluxData) — Thu, 29 Mar 2018 09:30:05 -0700

If you’ve read DevRel Katy Farmer’s stellar post, Kapacitor and Continuous Queries: How To Decide Which Tool You Need, then you know that when our community talks, we listen. So, in alignment with that view and in honor of our very own Kapacitor Koala, let’s tackle another common community issue that has come to our attention: when should we use batch processing versus stream processing in our Kapacitor tasks?

<figcaption> Our famous Kapacitor Koala</figcaption>

Now, if you’ve no idea what Kapacitor is, I recommend doing a little light reading on it here and here just to get you up to speed. Kapacitor, the final component of our TICK Stack, offers several capabilities such as data transformation, downsampling, and alerting. Kapacitor uses its own DSL, called TICKscript, which allows you to define tasks that are then executed on your data; essentially, it’s processing your data for you.

Here’s where it gets tricky though: how do you choose whether to process your data as a batch task or streaming task?

Batch Tasks

Let’s discuss batch tasks first. A batch is a collection of data points that have been grouped together within a specific time interval. Another term often used for this is a window of data. When running a batch task, Kapacitor queries InfluxDB periodically, thereby avoiding having to buffer much of your data in RAM. There are several cases where batch processing is the way to go:

Performing aggregate functions such as finding the mean, maximum, or minimum of a set interval of data.
Cases where alerting doesn't need to run on every single data point (since state changes will probably not happen that often). You don't want to be inundated with alerts!
Downsampling your data, which takes a large collection of data points and retains only the most significant data (so you can still view overall trends in the data).
Cases where a little extra latency won't severely impact your operation.
Cases with a super-high throughput InfluxDB instance since Kapacitor cannot process data as quickly as it can be written to InfluxDB (this occurs more frequently with InfluxDB Enterprise clusters).
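To picture what a batch task does with a window of data, here is a plain JavaScript sketch (not TICKscript) that groups points into fixed time windows and computes the mean of each one. The point shape `{ time, value }` and the function name are assumptions made for this example.

```javascript
// Groups points into fixed time windows and computes the mean of each window,
// which is the kind of aggregate a batch task would produce.
// This is plain JavaScript for illustration, not TICKscript.
function meanPerWindow(points, windowMs) {
  const windows = new Map();
  for (const { time, value } of points) {
    const start = Math.floor(time / windowMs) * windowMs;
    const bucket = windows.get(start) || { sum: 0, count: 0 };
    bucket.sum += value;
    bucket.count += 1;
    windows.set(start, bucket);
  }
  return [...windows.entries()]
    .sort(([a], [b]) => a - b)
    .map(([start, { sum, count }]) => ({ start, mean: sum / count }));
}

const points = [
  { time: 0, value: 10 },
  { time: 30000, value: 20 },
  { time: 60000, value: 40 },
];
console.log(meanPerWindow(points, 60000));
```

Three raw points become two aggregated ones here, which is also the basic idea behind downsampling.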

Stream Tasks

On the other side, we have stream tasks. Stream tasks create subscriptions to InfluxDB so that every data point written to InfluxDB is also written to Kapacitor. One should note though that stream tasks use a high percentage of available memory, so memory availability is a key factor to take into consideration. Here’s where stream processing is most ideal:

If you want to transform each individual data point in real time (technically, this could also be run with a batch process but there's latency to consider).
Cases where lowest possible latency is paramount to the operation. If alerts need to be triggered immediately, for example, running a stream task will ensure the least possible delay.
Cases in which InfluxDB is handling high volume query load and you may want to alleviate some of the query pressure from InfluxDB.
Stream tasks understand time by the data's timestamps; there are no race conditions for when exactly a given point will make it into a window or not. With batch tasks, on the other hand, it is possible for a data point to arrive late and be left out of its relevant window.
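By contrast, a stream task sees every point as it arrives. The sketch below, again plain JavaScript rather than TICKscript, handles points one at a time and only raises an alert when the state changes, so a long run of high values does not flood you with notifications. The threshold and names are arbitrary.

```javascript
// Processes points one at a time, as a stream task would, and emits an alert
// only when the state changes (OK -> CRITICAL or back).
// Plain JavaScript for illustration, not TICKscript; the threshold is arbitrary.
function createStreamAlerter(threshold, onAlert) {
  let state = 'OK';
  return function handlePoint(point) {
    const next = point.value > threshold ? 'CRITICAL' : 'OK';
    if (next !== state) {
      state = next;
      onAlert({ time: point.time, state });
    }
  };
}

const alerts = [];
const handle = createStreamAlerter(80, alert => alerts.push(alert));
[
  { time: 1, value: 50 },
  { time: 2, value: 90 },
  { time: 3, value: 95 },
  { time: 4, value: 60 },
].forEach(handle);
console.log(alerts);
```

Because each point is evaluated the moment it arrives, the alert at time 2 fires without waiting for a window to close.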

Another advantage some might see with writing stream tasks is the ease of use in having to define the task using only Kapacitor’s TICKscript, without having to delve into writing queries for InfluxDB. If you are comfortable with writing both, however, it’s probably going to be in your best interest to go with batch processing most of the time since it uses a lot less memory. An additional factor to consider is that Kapacitor is not limited to use only with InfluxDB. For example, if you want to send data straight from Telegraf over to Kapacitor, that will have to be done as a streaming task.

Key Takeaways

Batch tasks query InfluxDB periodically, use limited memory, but can place additional query load on InfluxDB.
Batch tasks are best used for performing aggregate functions on your data, downsampling, and processing large temporal windows of data.
Stream tasks subscribe to writes from InfluxDB, placing additional write load on Kapacitor, but can reduce query load on InfluxDB.
Stream tasks are best used for cases where low latency is integral to the operation.

When our community talks, we listen.

We’d love to hear how your batch and stream tasks are going! Send us your comments, questions, issues, and blog ideas on our community site and feel free to reach out to us on Twitter:

@InfluxDB @mschae16

Instrumenting Your Node/Express Application: Viewing Your Data

Margo Schaedel (InfluxData) — Wed, 14 Mar 2018 07:30:58 -0700

This post is the follow-up to Instrumenting Your Node/Express Application. Here we will begin to explore some of the data that is being stored in InfluxDB and build out a dashboard in Chronograf. If you haven’t had a chance yet to begin instrumenting your Node.js applications, I recommend taking a look at my previous post to provide some context.

When I last left off, we had some data being collected and stored in InfluxDB, as we could see from querying the database:

Of course, it’s not entirely helpful just to see rows upon rows of numbers. It would be more sensible to view the data in a graph or table so we can more easily see trends in the data, and better yet to build out a full dashboard, so we can view all relevant data simultaneously. Using Chronograf to visualize our data offers just that. If you haven’t installed Chronograf yet, here’s a nifty guide that will get you up and running with all the different components of the TICK Stack—Telegraf, InfluxDB, Kapacitor, and Chronograf.

For this section, I’ll be using the instrumented version of the Node.js application, AmazonBay, which you can clone down from GitHub here. It’s using this Node Metrics library to send data via Telegraf into InfluxDB. Once you’ve cloned it down and set everything up, ensure your server is running with node server.js so that Telegraf can start collecting metrics.

Let’s start Chronograf and navigate to our dashboards section, where we can start building out a proper dashboard to gain some insight into the data we’re collecting. You’ll need to create a dashboard first and name it— mine is “Instrumented-AmazonBay” for the sake of convenience.

<figcaption> Let’s create a new dashboard</figcaption>

As we start visualizing, let’s take a moment to consider what metrics we’re collecting and why.

CPU Usage

It’s generally a good idea to keep track of an application’s CPU usage over time. Although Node.js apps typically consume a minimal amount of CPU, having this data on-hand affords visibility into the health of your application, by highlighting instances that deviate from the norm. Having the capability to ascertain what, if any, operations are causing high CPU usage is certainly a step towards understanding the performance of your application.

In the case of AmazonBay, we are monitoring the CPU percentage (a value between 0-1) of our process (the percentage of CPU used by the application) and our system (the percentage of CPU used by the system as a whole). We can chart both as seen below:

<figcaption> CPU Usage of both Process and System</figcaption>

I built the query through the Chronograf UI, but edited it to change the percentage to a value between 0 and 100, like so:

SELECT (mean("process")*100) AS "mean_process" FROM "telegraf"."autogen"."cpu_percentage" WHERE time > :dashboardTime: GROUP BY :interval: FILL(null)

Event Loop Latency

Because of its nonblocking, single-threaded nature, Node.js can handle a multitude of events quickly and asynchronously. The event loop is responsible for this, so it pays to recognize and pinpoint any latency within the event loop that could be degrading application performance. Latency compounds with each cycle of the event loop and could eventually slow the app to a crawl. If the server sees an increase in load, for example, this can lead to an increase in tasks per event loop cycle, which results in longer response times for the end user. Collecting data on these latencies can help you decide whether to scale up the number of processes running the application and return performance levels to equilibrium.

For our measurement of event loop latency, we have access to the min, max, and average latency times in milliseconds. There are several visualization options available in Chronograf, including line and stacked graphs, step-plots, bar graphs, and gauges, all available when you switch from the Queries section to the Visualizations section.

<figcaption> Various Visualization Types are available in Chronograf</figcaption>

Below you can see event loop latency depicted in various visualizations:

<figcaption> Minimum, Average, and Maximum Sampled Event Loop Latency (in milliseconds)</figcaption>

Garbage Collection, Heap Usage, and Memory Leaks

Memory leaks are an oft-cited complaint by Node.js developers, as it is usually tricky to determine the point of causation. They occur when objects are referenced for too long, when variables are stored past their point of use. Recognizing their existence early on is integral to monitoring the health of your application, and can be achieved by tracking the app’s heap usage (a segment of memory allocated for storing objects, strings and closures) and/or its garbage collection (the process of freeing up unused memory) rates. For instance, a steady growth in heap usage will eventually max out at the default heap limit of roughly 1.5GB imposed by Node.js and cause a service crash and restart on the process. Similarly, you can look for patterns within garbage collection rates, for as extraneous objects accumulate within memory, the time spent in the garbage collection process likewise increases. Of course, once you’ve found yourself with a memory leak, it’s a rather tedious process trying to pinpoint the root cause, usually involving comparing differently timed heap snapshots of your application to see what has changed between the two.
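If you want a quick look at heap usage outside of a metrics library, Node’s built-in process.memoryUsage() reports it directly. The sketch below samples it and applies a deliberately crude check for samples that only ever grow; both helper names are made up, and a real leak investigation would still need heap snapshots.

```javascript
// Samples heap usage with Node's built-in process.memoryUsage() and checks
// whether a series of samples only ever grows, which can hint at a leak.
// The "never shrinks" rule is a deliberately crude heuristic.
function heapUsedMb() {
  return process.memoryUsage().heapUsed / 1000000;
}

function isSteadilyGrowing(samples) {
  if (samples.length < 2) return false;
  return samples.every((value, i) => i === 0 || value > samples[i - 1]);
}

console.log(`Heap used: ${heapUsedMb().toFixed(1)} MB`);
console.log(isSteadilyGrowing([40, 42, 47, 55]));
console.log(isSteadilyGrowing([40, 42, 39, 41]));
```

A healthy app usually shows a sawtooth pattern as garbage collection reclaims memory, so a dip between samples is a good sign.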

We will monitor both the heap usage and the garbage collection rates in this instance. See below:

<figcaption> GC Cycle Duration and Heap Usage (MB)</figcaption>

For heap usage in particular, I altered the query to display in megabytes rather than the default (bytes):

SELECT ("used"/1000000) FROM "telegraf"."autogen"."gc" WHERE time > :dashboardTime:

HTTP Requests

The duration of HTTP requests is an important metric, especially because it most directly involves the end user. As users have become more impatient than ever, slow response times can seriously hurt the success of an application. Monitoring the duration of these requests tells you whether users are able to interact with the application quickly and efficiently. The faster things are, the higher user satisfaction will be, plain and simple.
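One way to capture this metric, sketched below, is to wrap a standard `(req, res)` handler and report the elapsed time once the response finishes. The `withTiming` wrapper and the shape of the record passed to the callback are hypothetical names chosen for this example; how the record is shipped to InfluxDB is left out.

```javascript
// Sketch only: `withTiming` and the record shape are illustrative assumptions.

// Wrap an http-style handler and report how long each response took.
function withTiming(handler, record) {
  return function timedHandler(req, res) {
    const start = process.hrtime.bigint(); // monotonic clock, nanoseconds
    // 'finish' fires once the response has been handed off to the OS.
    res.once('finish', () => {
      const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
      record({
        method: req.method,
        url: req.url,
        status: res.statusCode,
        durationMs,
      });
    });
    return handler(req, res);
  };
}
```

With Node's built-in `http` module this could be used as `http.createServer(withTiming(myHandler, console.log))`, where `myHandler` stands in for your own request handler. Keeping the URL in each record is what makes it possible to group durations by URL later.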

Here, you’ll see I built out a stacked graph visualizing mean HTTP request/response duration in milliseconds, grouped by different URLs:

<figcaption> HTTP Request/Response Duration (ms)</figcaption>

Database Queries

In this particular application, the inventory and order history are stored in a PostgreSQL relational database, which the application has to query at various points. This falls under the category of an external dependency: any system with which your application interacts. There are others besides databases (third-party APIs, web services, legacy systems), and although we cannot necessarily change the code running within these services directly, these dependencies are nevertheless important to the success of the application and therefore worth tracking, if only to be able to differentiate between problems arising within the application and problems outside it. However your application communicates with these dependencies, internal or external, the time spent waiting for a response can impact the performance of your application and your customer experience. Measuring and optimizing these response times can help resolve these bottlenecks.
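Timing a call to an external dependency can be sketched with a small wrapper around any promise-returning function. The name `timeQuery`, the `label` argument, and the record shape are assumptions for this example, and the wrapper is not tied to a particular Postgres client.

```javascript
// Sketch only: `timeQuery` and the record shape are illustrative assumptions.

// Run an async operation and report its duration, whether it succeeds or fails.
async function timeQuery(label, runQuery, record) {
  const start = process.hrtime.bigint(); // monotonic clock, nanoseconds
  try {
    return await runQuery();
  } finally {
    const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
    record({ label, durationMs });
  }
}
```

Any query function that returns a promise can be passed as `runQuery`, for example `timeQuery('order_history', () => runOrderHistoryQuery(), console.log)`, where `runOrderHistoryQuery` is a placeholder for your own database call. Recording the duration in `finally` means slow failures are measured as well as slow successes.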

We’ve tracked the duration of our queries to the Postgres database and are depicting them in a line/stat graph like so:

<figcaption> Postgres Query Duration (ms)</figcaption>

Summary

Once you pull everything together, you have a full dashboard at your disposal monitoring the health of your Node.js application:

<figcaption> Success!</figcaption>

That just about sums it up for this post. I’d love to hear how you’re instrumenting your Node.js applications, and how you’re visualizing your metrics and events! Thanks for coming along on this journey and feel free to reach out to me via margo@influxdata.com, or on Twitter with any questions and/or comments. Happy dashboarding!