Parsing Logs into InfluxDB using Telegraf’s logparser Plugin

Cameron Sparr — June 23, 2016

InfluxData is excited to announce that it now supports the ability stream log event data to InfluxDB with out-of-the-box patterns for popular servers like Nginx and Apache. Log monitoring support is the latest addition to Telegraf’s already impressive list of 90+ input and output data plug-ins. In addition to standard format logs, custom log parsing and patterns based on the popular “grok” patterns are also supported.

What is Telegraf? Telegraf is a metric collection daemon that can collect metrics from a wide array of inputs and write them into a wide array of outputs. It is plugin-driven for both collection and output of data so it is easily extendable. It is also written in Go, which means that it is a compiled and standalone binary that can be executed on any system with no need for external dependencies, no npm, pip, gem, or other package management tools required.

With Telegraf you can not only parse logs to create metrics and store them as time-series data, InfluxDB also has the ability to store raw log lines along-side the metrics. We should caution that this new capability is not intended to be a replacement for full text search tools, rather for aiding in root cause analysis during specific windows of triage. So, if you are investigating performance metrics within a band of time, you can search via regex for any log errors that occurred in that same band. All the data is conveniently in one place, making it fast to lookup and correlate data.

The logparser plugin can be utilized in Telegraf to structure any log data into queryable and searchable metrics. In this guide we will use one of Telegraf’s built-in patterns to parse logs that are in the “Combined Log Format”, which is a common setup in Apache and Nginx access logs. It is an extension of the Common Log Format with the addition of two quoted strings for “Referrer” and “User-Agent”:

127.0.0.1 ident auth [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://localhost:8083/" "Chrome/51.0.2704.84"

In Nginx, configuring your server to output logs like this would look something like this in `/etc/nginx/nginx.conf`:

log_format vhosts '$host $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"';

access_log /var/log/nginx/access.log vhosts;

And like this in Apache (see here for more details):

"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""

Step 1 — Install Telegraf

Telegraf packages are available for many different platforms from the InfluxData downloads page:

# Install on Ubuntu:
wget https://dl.influxdata.com/telegraf/releases/telegraf_1.0.0_amd64.deb
sudo dpkg -i telegraf_1.0.0_amd64.deb
service telegraf stop

Step 2 — Running Through a Test Metric

First create a temp log file:

touch /tmp/test.log

Next setup a simplified `telegraf.conf` file by running the following `echo` command.

echo '
[[inputs.logparser]]
  ## file(s) to tail:
  files = ["/tmp/test.log"]
  from_beginning = false
  name_override = "test_metric"
  ## For parsing logstash-style "grok" patterns:
  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT}"]

[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["stdout"]' >> /tmp/telegraf.conf

The file tells Telegraf to run using the `logparser` input plugin and the `file` output plugin. The `logparser` input will tail the `/tmp/test.log` file, parsing and outputting metrics with the name `test_metric`. The output will be written using the `file` output plugin to `stdout`.

Now we will use that config file to run Telegraf:

telegraf --config /tmp/telegraf.conf

While Telegraf is running, we will echo the example log line to the `/tmp/test.log` file.

bash
echo '127.0.0.1 ident auth [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://localhost:8083/" "Chrome/51.0.2704.84"' >> /tmp/test.log

In the Telegraf output, you should see the above log line get parsed into an InfluxDB line-protocol metric and printed to `STDOUT`.

test_metric,host=localhost,http_method=GET agent="Chrome/51.0.2704.84",auth="auth",client_ip="127.0.0.1",http_status_code=200i,http_version=1,ident="ident",referrer="http://localhost:8083/",request="/apache_pb.gif",response_bytes=2326i 971211336000000000

Parsing Access Logs and Writing Real Data into InfluxDB

The next steps will build on the previous test data and parse real access log data into InfluxDB.

These steps will deal with parsing an Nginx log file. If you don’t have one handy, you can download an example here.

Step 3 — Install InfluxDB

InfluxDB packages are available for many different platforms from the downloads page.

# Install on Ubuntu:
wget https://dl.influxdata.com/influxdb/releases/influxdb_1.0.0_amd64.deb
sudo dpkg -i influxdb_1.0.0_amd64.deb
service influxdb start

Step 4 — Configure Telegraf

First we will rename the default Telegraf config file:

mv /etc/telegraf/telegraf.conf /etc/telegraf/telegraf.conf.orig

Now write a new config file:

echo '
[[inputs.logparser]]
  ## files to tail.
  files = ["/var/log/nginx/access.log"]
  ## Read file from beginning.
  from_beginning = true
  ## Override the default measurement name, which would be "logparser_grok"
  name_override = "nginx_access_log"
  ## For parsing logstash-style "grok" patterns:
  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT}"]

[[outputs.influxdb]]
  ## The full HTTP or UDP endpoint URL for your InfluxDB instance.
  urls = ["http://localhost:8086"] # required
  ## The target database for metrics (telegraf will create it if not exists).
  database = "telegraf" # required
  ## Write timeout (for the InfluxDB client), formatted as a string.
  timeout = "5s"
' >> /etc/telegraf/telegraf.conf

This time our config file will assume that you are running an InfluxDB instance locally. We will also be tailing an nginx log file at `/var/log/nginx/access.log`. If you don’t have an Nginx access log handy, you can download an example here.

Step 5 — Parsing the Nginx Access Log

Now we can start the Telegraf service daemon:

`service telegraf start`

We won’t see metrics getting written to `STDOUT`, but we should see log lines indicating that metrics are getting written to InfluxDB.

The Telegraf service log is located at `/var/log/telegraf/telegraf.log`.

2016/06/16 15:32:50 Output [influxdb] buffer fullness: 600 / 10000 metrics. Total gathered metrics: 600. Total dropped metrics: 0.
2016/06/16 15:32:50 Output [influxdb] wrote batch of 600 metrics in 5.809µs

Step 6 — Exploring the Data in InfluxDB

When we installed InfluxDB, we also installed the Influx CLI client for connecting to and querying the database. We can launch this by simply running `influx` and using the `telegraf` database.

$ influx
Connected to http://localhost:8086 version 1.0.0
InfluxDB shell version: 1.0.0
> USE telegraf
Using database telegraf
> precision rfc3339
>

First let’s query our parsed data and see how many metrics we’ve got:

> SELECT count(resp_code) FROM nginx_access_log
name: nginx_access_log
----------------------
time			           count
1970-01-01T00:00:00Z	           600

Perfect! This is the number of parsable metrics in the example log file.

Now let’s see how many errors we’ve returned using the `count()` function:

> SELECT count(resp_code) FROM nginx_access_log WHERE resp_code >= 400
name: nginx_access_log
----------------------
time			           count
1970-01-01T00:00:00Z	           71

71 errors is quite a bit for only 600 requests in total! Let’s take a peek at a few of those errors using `*` to grab all fields and `LIMIT` to limit the number returned.

> SELECT * FROM nginx_access_log WHERE resp_code >= 400 LIMIT 5
name: nginx_access_log
----------------------
time				agent			auth	client_ip	host	http_version	ident	referrer	request	resp_bytes	resp_code	verb
2011-08-22T01:02:01.005Z	okhttp/2.4.0		-	192.168.1.2	tars	1.1		-	-		/huh	0		404	GET
2011-08-22T01:02:01.008Z	Python-urllib/2.7	-	127.0.0.1	tars	1.1		-	-		/huh	0		404	GET
2011-08-22T01:02:02.005Z	Firefox/5.0		-	127.0.0.1	tars	1.1		-	-		/huh	0		404	GET
2011-08-22T01:02:03.006Z	Firefox/5.0		-	127.0.0.1	tars	1.1		-	-		/huh	0		404	GET
2011-08-22T01:02:04.009Z	Firefox/5.0		-	127.0.0.1	tars	1.1		-	-		/huh	0		404	GET

Looks like we’ve got some 404s, if we look closely we can see that each of the 404s are coming from requests on the `/huh` endpoint, let’s see if every 404 is on that endpoint, for this we will use the `distinct()` function.

> SELECT distinct(request) FROM nginx_access_log WHERE resp_code >= 400
name: nginx_access_log
----------------------
time			          distinct
1970-01-01T00:00:00Z	          /huh
1970-01-01T00:00:00Z	          /foo

Looks like there might be a few people getting 404 from the `/foo` endpoint as well, so let’s see how many:

> SELECT count(resp_code) FROM nginx_access_log WHERE resp_code >= 400 AND request = '/foo'
name: nginx_access_log
----------------------
time			           count
1970-01-01T00:00:00Z	           1

Great! Since there’s only one, this means that the main issue is users querying the `/huh` endpoint, and now we can deal with finding out where that errant endpoint is coming from.

Customizing the logparser Plugin

Now that we’ve covered how to parse data from standard “Combined Log Format”, let’s look into parsing logs that might have additional fields. Apache and Nginx allow for a high level of customization to their log output, and likewise so does the logparser.

Let’s say that our Apache admin added 2 fields on to the end of each log line:

  • the ID of the request and
  • the response time in microseconds.

While the response time is likely very useful from an analytics perspective, the `ID` is probably not as useful, so let’s come up with a `grok` pattern to deal with this.

Say the log lines look like this (see the full file here):

127.0.0.1 - - [22/Aug/2011:10:02:59 +0900] "GET /huh HTTP/1.1" 404 0 "-" "Firefox/5.0" f18064bf-a4c5-4fd9-9cf8-37aff27382c5 3648

Which is the Combined Log Format with the addition of a UUID and integer at the end. So, how do we go about parsing this?

The Telegraf grok logparser

In order to parse a custom log file we must create a custom “grok” pattern for Telegraf. The logparser plugin uses a slightly modified version of logstash “grok” patterns, with the format:

`%{<capture_syntax>[:<semantic_name>][:<modifier>]}`

Telegraf has many of it’s own built-in patterns, as well as supporting logstash’s built-in patterns. The best way to get acquainted with grok patterns is to read the logstash docs, which are available here.

Setting up a Custom Log Pattern

Telegraf allows for specifying custom patterns in a separate file, or directly in the config file. Let’s change our config file to include our custom pattern.

service telegraf stop
vim /etc/telegraf/telegraf.conf

telegraf.conf:

[[inputs.logparser]]
  ## files to tail.
  files = ["/var/log/myapp/access.log"]
  ## Read file from beginning.
  from_beginning = true
  ## Override the default measurement name, which would be "logparser_grok"
  name_override = "my_log"
  ## For parsing logstash-style "grok" patterns:
  [inputs.logparser.grok]
    patterns = ["%{CUSTOM_LOG}"]
    custom_patterns = '''
      CUSTOM_LOG %{COMBINED_LOG_FORMAT} %{UUID} %{NUMBER:response_time_us:int}
    '''

[[outputs.influxdb]]
  ## The full HTTP or UDP endpoint URL for your InfluxDB instance.
  urls = ["http://localhost:8086"] # required
  ## The target database for metrics (telegraf will create it if not exists).
  database = "telegraf" # required
  ## Write timeout (for the InfluxDB client), formatted as a string.
  timeout = "5s"

Notice that we need to add the `CUSTOM_LOG` pattern, as well as telling Telegraf to use the `%{CUSTOM_LOG}` pattern when parsing.

Let’s go over the fields in this pattern:

  • `CUSTOM_LOG` is the name of the pattern.
  • `%{COMBINED_LOG_FORMAT}` is the name of the built-in pattern.
  • `%{UUID}` Uses the logstash built-in UUID pattern to recognize this portion of the log file, but does not store it.
  • `%{NUMBER:response_time_us:int}` Uses the logstash built-in NUMBER pattern. It names this field
  • `response_time_us` and casts it to an `int`.

Parsing the Log & Analyzing the Data

Now that our config file is setup, we can run the Telegraf service again (but first make sure that you have a file at `/var/log/myapp/access.log`, you can grab the example here.)

service telegraf start

And once the metrics are written we can enter the `influx` CLI again:

$ influx
Connected to http://localhost:8086 version 1.0.0
InfluxDB shell version: 1.0.0
> USE telegraf
Using database telegraf
> precision rfc3339

And now that we have access to the response times, let’s see what our average & 95th percentile response times are:

> SELECT mean(response_time_us),percentile(response_time_us,95) FROM my_log
name: my_log
------------
time				mean		percentile
1970-01-01T00:00:00Z	        912411.36	10078023

Yikes, seems a bit high, so let’s peek at some of the metrics that have a very large response time:

> SELECT * FROM my_log WHERE response_time_us > 10078023 LIMIT 5
name: my_log
------------
time				agent			auth	client_ip	host	http_version	ident	referrer	request		resp_bytes	resp_code	response_time_us	verb
2011-08-22T01:02:01.008Z	Firefox/5.0		userb	127.0.0.1	tars	1.1		-	-		/large.jpg	10485760	200		10181259		GET
2011-08-22T01:02:04.004Z	Firefox/5.0		usera	127.0.0.1	tars	1.1		-	-		/large.jpg	10485760	200		10110514		GET
2011-08-22T01:02:06.004Z	Python-urllib/2.7	-	127.0.0.1	tars	1.1		-	-		/large.jpg	10485760	200		10132756		GET
2011-08-22T01:02:10.002Z	Firefox/5.0		userb	192.168.1.2	tars	1.1		-	-		/large.jpg	10485760	200		10101485		GET
2011-08-22T01:02:11.006Z	Firefox/5.0		-	127.0.0.1	tars	1.1		-	-		/large.jpg	10485760	200		10144209		GET

Aha! These are serving up a very large image file. Let’s ignore these and view what our response time is in cases with smaller requests:

> SELECT mean(response_time_us),percentile(response_time_us,95) FROM my_log WHERE resp_bytes < 4096
name: my_log
------------
time				mean				percentile
1970-01-01T00:00:00Z	4003.456043956044	7633

>

Great! That looks a bit better.

What’s next?