StatsD is a simple protocol for sending application metrics via UDP. These metrics can be sent to a Telegraf instance, where they are aggregated and eventually flushed to InfluxDB or other output sinks that you have configured.
There are many good reasons to use StatsD, and many good blog posts about why it’s great. StatsD is simple, has a tiny footprint, and has a fantastic ecosystem of client libraries. It can’t crash your application and has become the standard for large-scale metric collection.
How it Works
Telegraf is an agent written in Go. It accepts StatsD protocol metrics over UDP, then periodically forwards them to InfluxDB.
UDP is often called the “fire and forget” protocol. UDP does not wait for a response, meaning that your application can continue doing its work regardless of the status of the StatsD server. Additionally, it is connectionless, so there is very little overhead introduced into your app.
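To make the fire-and-forget behaviour concrete, here is a minimal Python sketch that sends one StatsD line as a UDP datagram using only the standard library. The function name send_metric and its default host and port are assumptions chosen for illustration (8125 is the usual StatsD port); this is not how any particular client library is implemented.

```python
import socket


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one StatsD line as a single UDP datagram.

    Returns True if the datagram was handed to the OS, False on a local
    socket error. No reply is ever read, so the call never waits on the
    StatsD server.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("utf-8"), (host, port))
        return True
    except OSError:
        # Fire and forget: a metrics failure should never break the app.
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    # Behaves the same whether or not anything is listening on the port.
    send_metric("mycounter:10|c")
```

Because no connection is set up and no acknowledgement is expected, the caller cannot tell whether the server actually received the metric. That is the trade-off that keeps the overhead so low.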
First, you need to have Telegraf installed. Here I’ll just download the standalone Linux binary:
$ wget http://get.influxdb.org/telegraf/telegraf_linux_amd64_0.2.0.tar.gz
$ tar -xvf telegraf_linux_amd64_0.2.0.tar.gz
$ mv telegraf_linux_amd64 /usr/bin/telegraf && chmod +x /usr/bin/telegraf
Next, you will need to set up Telegraf with the statsd plugin. The -sample-config option tells Telegraf to output a config file. The -filter and -outputfilter options tell Telegraf which plugins (StatsD) and outputs (InfluxDB) to configure:
$ telegraf -sample-config -filter statsd -outputfilter influxdb > tele.conf
$ telegraf -config tele.conf
The config file (tele.conf) will assume that your InfluxDB instance is running on localhost, and will need to be edited if that’s not the case. There are also many configuration options for the StatsD server, but I won’t go into all of them in this guide; see the documentation for more details.
Sending StatsD Metrics
By default, Telegraf will begin listening on port 8125 for UDP packets. StatsD metrics can be sent to it using echo and netcat:
$ echo "mycounter:10|c" | nc -C -w 1 -u localhost 8125
Or using your favorite client library.
Introducing Influx StatsD
Since InfluxDB supports tags, our StatsD implementation does too! Adding tags to a StatsD metric is similar to how they appear in line-protocol.
This means that you can tag your StatsD metrics like below:
user.logins,service=payroll,region=us-west:1|c
This particular metric increments the user.logins counter by 1, with tags for the service and region we are using.
That’s it! Simply add a comma-separated list of tags in key=value format.
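If you are building metric lines by hand, the tag format can be sketched in a few lines of Python. The helper name format_metric is hypothetical and exists only for this illustration; it appends each tag to the bucket name as ,key=value before the usual :value|type suffix.

```python
def format_metric(name, value, metric_type, tags=None):
    """Build a StatsD line with Influx-style tags appended to the bucket."""
    bucket = name
    for key, val in (tags or {}).items():
        # Each tag is appended as ",key=value", in insertion order.
        bucket += ",{}={}".format(key, val)
    return "{}:{}|{}".format(bucket, value, metric_type)


line = format_metric(
    "user.logins", 1, "c", {"service": "payroll", "region": "us-west"}
)
print(line)  # user.logins,service=payroll,region=us-west:1|c
```

The sketch does no escaping, so it assumes tag keys and values contain no commas, colons, pipes, or spaces.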
For those of you using a StatsD client, this extra bit can be added to the bucket name. I’ll use the Python client as an example:
>>> import statsd
>>> c = statsd.StatsClient('localhost', 8125)
>>> c.incr('user.logins,service=payroll,region=us-west')  # Increment counter
Once flushed, the metric will be available in InfluxDB in all its tagged glory.
> SELECT * FROM statsd_user_logins
name: statsd_user_logins
------------------------
time                  host    metric_type  region   service  value
2015-10-27T03:26:40Z  tyrion  counter      us-west  payroll  1
2015-10-27T03:26:50Z  tyrion  counter      us-west  payroll  1
Telegraf supports all of the standard StatsD metrics, which are detailed below.
Counters
A simple counter. Sending logins_total:1|c and then logins_total:15|c increments the logins_total metric by 1 and then by 15. Counters can be always-increasing, or you can opt to have them cleared with each flush using the delete_counters config option.
A sample rate, as in logins_total:1|c|@0.1, tells the StatsD server that this counter is only being sent 1/10th of the time. In this example, the logins_total metric will be incremented by 10.
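As a rough illustration of the counter behaviour described here, the following Python sketch accumulates counter values, scales sampled values by the inverse of their sample rate, and optionally clears on flush. The class CounterCache and its methods are hypothetical names for this example; it is a simplified model, not Telegraf's actual implementation.

```python
from collections import defaultdict


class CounterCache:
    """Toy model of counter aggregation between flushes."""

    def __init__(self):
        self.totals = defaultdict(float)

    def add(self, name, value, sample_rate=1.0):
        # A value sent at rate r stands in for roughly 1/r real events.
        self.totals[name] += value / sample_rate

    def flush(self, delete_counters=False):
        # Return a snapshot; clear only if delete_counters is requested.
        snapshot = dict(self.totals)
        if delete_counters:
            self.totals.clear()
        return snapshot


cache = CounterCache()
cache.add("logins_total", 1)        # logins_total:1|c
cache.add("logins_total", 15)       # logins_total:15|c
cache.add("logins_total", 1, 0.1)   # logins_total:1|c|@0.1 counts as 10
print(cache.flush())  # {'logins_total': 26.0}
```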
Gauges
Gauges are changed with each subsequent value sent. The value that makes it to InfluxDB will be the last recorded value. Gauges will remain at the same value until a new value is sent. You can opt to have them cleared with each flush using the delete_gauges config option.
Adding a sign (+ or -) to the value changes the gauge by that amount rather than overwriting it.
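The difference between overwriting a gauge and adjusting it can be sketched in Python as follows. The function apply_gauge is a hypothetical helper, and the values stand for a made-up gauge such as current_users; the sketch only models the rule that a leading + or - means "change by this amount".

```python
def apply_gauge(current, raw_value):
    """Return the new gauge value after receiving raw_value (a string).

    A leading "+" or "-" adjusts the current value; anything else
    replaces it outright.
    """
    if raw_value[0] in "+-":
        return current + float(raw_value)
    return float(raw_value)


gauge = 0.0
gauge = apply_gauge(gauge, "105")   # current_users:105|g  sets 105
gauge = apply_gauge(gauge, "+10")   # current_users:+10|g  gives 115
gauge = apply_gauge(gauge, "-20")   # current_users:-20|g  gives 95
print(gauge)  # 95.0
```

One consequence of this rule is that a gauge cannot be set directly to a negative number, because the minus sign is read as a decrement.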
Timings & Histograms
Timings are meant to track how long something took. They are an invaluable tool for tracking application performance.
When Telegraf receives timing metrics, it aggregates them and writes summary statistics, including percentiles, to InfluxDB. More details on each statistic can be found in the documentation.
Timings, like counters, can also be sampled. A sample rate of 0.1 (the |@0.1 suffix) lets Telegraf know that the timing was only taken once every 10 runs.
Telegraf aggregates stats as they arrive, and limits the number of timings cached to keep its memory footprint low. By default, Telegraf will keep track of 1000 timings per stat when calculating percentiles. This can be adjusted using the percentile_limit config option.
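To show why a cap on cached timings keeps memory bounded, here is a simplified Python sketch of a running aggregate. The class TimingStats is hypothetical, and replacing a random cached value once the limit is reached is an assumption made for this example, not a statement about how Telegraf does it. Count, mean, and bounds stay exact; only the percentile becomes an estimate once the cache is full.

```python
import random


class TimingStats:
    """Running stats for one timing, with a bounded percentile cache."""

    def __init__(self, percentile_limit=1000):
        self.limit = percentile_limit
        self.samples = []      # at most `limit` values kept for percentiles
        self.count = 0
        self.total = 0.0
        self.lower = None
        self.upper = None

    def add(self, value):
        self.count += 1
        self.total += value
        self.lower = value if self.lower is None else min(self.lower, value)
        self.upper = value if self.upper is None else max(self.upper, value)
        if len(self.samples) < self.limit:
            self.samples.append(value)
        else:
            # Cache is full: overwrite a random slot so memory stays bounded.
            self.samples[random.randrange(self.limit)] = value

    def mean(self):
        # Assumes at least one value has been added.
        return self.total / self.count

    def percentile(self, p):
        # Index into the sorted cache; assumes at least one value was added.
        ordered = sorted(self.samples)
        index = int(len(ordered) * p / 100.0)
        return ordered[min(index, len(ordered) - 1)]


stats = TimingStats()
for ms in range(1, 101):
    stats.add(ms)
print(stats.lower, stats.upper, stats.mean(), stats.percentile(90))
# 1 100 50.5 91
```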
Sets
Sets can be used to count unique occurrences. For example, sending unique.users:100|s increments the unique.users metric by 1; after that, it will not be incremented again no matter how many times the value 100 is sent.
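The set semantics map directly onto a Python set, as in the short sketch below. The class SetCache is a hypothetical name for this illustration; it simply tracks the distinct values seen for each metric and reports how many there are.

```python
class SetCache:
    """Toy model of StatsD sets: count distinct values per metric."""

    def __init__(self):
        self.members = {}

    def add(self, name, value):
        # Adding a value that is already present leaves the set unchanged.
        self.members.setdefault(name, set()).add(value)

    def count(self, name):
        return len(self.members.get(name, ()))


sets = SetCache()
for _ in range(3):
    sets.add("unique.users", "100")   # unique.users:100|s sent three times
print(sets.count("unique.users"))  # 1
sets.add("unique.users", "101")
print(sets.count("unique.users"))  # 2
```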
In the coming months, we plan to support additional extensions of the standard StatsD protocol, including the ability to specify multiple fields within a measurement. This will come after InfluxDB 0.9.5 ships, where multiple fields will have no performance hit.
Also, let us know what you’d like to see by opening an issue on GitHub!