Logging Improvements for InfluxDB 1.5.0

Navigate to:

When it comes to troubleshooting issues, log files are a high-valued asset. If things go wrong, you’ll almost always be asked to “send the logs”. InfluxDB 1.5 comes with a number of improvements to logging in an effort to simplify the task of analyzing this data.

Logging

InfluxDB generates a lot of log output that chronicles many aspects of its internal operation. This latest update has revamped how the database logs key information and the format of the log output. The primary goal of these changes is to enable tooling which can efficiently parse and analyze the log data and reduce the time required to diagnose issues.

Structured Logging

InfluxDB 1.5 can generate structured logs in either logfmt or JSON lines. This feature is found in a new section of the configuration file titled [logging]. The default log format is auto, which determines the output based on whether stderr refers to a terminal (TTY) or not. If stderr is a TTY, a less verbose “console” format is selected; otherwise, the output will be logfmt. If you would prefer consistent output, we would encourage you to explicitly set the format to logfmt.

The value of structured logging is greatly improved when specific elements of a log event can be extracted easily. With that in mind, the existing logging code was reviewed to ensure notable data was moved to separate keys. The following example shows a few startup events related to opening files in version 1.3. Note some entries include the path and duration embedded in the message:

[I] 2018-03-03T00:46:48Z /Users/stuartcarnie/.influxdb/data/db/autogen/510/000000001-000000001.tsm (#0) opened in 280.401827ms engine=tsm1 service=filestore
[I] 2018-03-03T00:46:48Z reading file /Users/stuartcarnie/.influxdb/wal/db/autogen/510/_00001.wal, size 15276 engine=tsm1 service=cacheloader
[I] 2018-03-03T00:46:54Z /Users/stuartcarnie/.influxdb/data/db/autogen/510 opened in 5.709152422s service=store

As of 1.5, the same events using logfmt now look like:

ts=2018-03-03T00:48:57.246371Z lvl=info msg="Opened file" log_id=06bLAgTG000 engine=tsm1 service=filestore path=/Users/stuartcarnie/.influxdb/data/db/autogen/510/000000001-000000001.tsm id=0 duration=61.736ms
ts=2018-03-03T00:48:57.246590Z lvl=info msg="Reading file" log_id=06bLAgTG000 engine=tsm1 service=cacheloader path=/Users/stuartcarnie/.influxdb/wal/db/autogen/510/_00001.wal size=15276
ts=2018-03-03T00:49:01.023313Z lvl=info msg="Opened shard" log_id=06bLAgTG000 service=store trace_id=06bLAgv0000 op_name=tsdb_open path=/Users/stuartcarnie/.influxdb/data/db/autogen/510 duration=3841.275ms

Take note that path and duration are now separate keys. Using a tool like lcut, we can select specific keys (tsmsgpath and duration) to reformat the output:

2018-03-03T00:48:57.246371Z  Opened   file   /Users/stuartcarnie/.influxdb/data/db/autogen/510/000000001-000000001.tsm  61.736ms
2018-03-03T00:48:57.246590Z  Reading  file   /Users/stuartcarnie/.influxdb/wal/db/autogen/510/_00001.wal
2018-03-03T00:49:01.023313Z  Opened   shard  /Users/stuartcarnie/.influxdb/data/db/autogen/510                          3841.275ms

Collating Log Events

During the lifetime of an InfluxDB process, operations such as compactions run continuously and generate multiple events as they advance. To further complicate matters, when these operations run concurrently, the events from each will be interleaved together. Determining the outcome of a specific compaction is practically impossible, as there is no way to associate log events for each independently. To address this issue, InfluxDB 1.5 adds the keys listed in the following table:

key comment
trace_id A unique value associated with each log event for a single run of an operation
op_name A searchable identifier, such as  tsm1.compact_group assigned to each operation type
op_event Given a value of start or end to indicate whether this is the first or last event of the operation
op_elapsed The time an operation took to complete, in milliseconds. This key is always included with the op_event=end log event

To demonstrate how these keys might be used, we’ll use the following (abridged) file named “influxd.log”, which includes at least two compactions running concurrently.

msg="TSM compaction (start)" trace_id=06avQESl000 op_name=tsm1_compact_group op_event=start
msg="Beginning compaction" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_files_n=2
msg="Compacting file" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/influxdb/data/db1/rp/10/000000859-000000002.tsm
msg="Compacting file" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/influxdb/data/db1/rp/10/000000861-000000001.tsm
msg="TSM compaction (start)" trace_id=06avQEZW000 op_name=tsm1_compact_group op_event=start
msg="Beginning compaction" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_files_n=2
msg="invalid subscription token" service=subscriber
msg="Post http://kapacitor-rw:9092/write?consistency=&db=_internal&precision=ns&rp=monitor: dial tcp: lookup kapacitor-rw on 10.0.0.1: server misbehaving" service=subscriber
msg="Compacting file" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/influxdb/data/db2/rp/12/000000027-000000002.tsm
msg="Compacting file" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/influxdb/data/db2/rp/12/000000029-000000001.tsm
msg="Compacted file" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/influxdb/data/db2/rp/12/000000029-000000002.tsm.tmp
msg="Finished compacting files" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_files_n=1
msg="TSM compaction (end)" trace_id=06avQEZW000 op_name=tsm1_compact_group op_event=end op_elapsed=56.907ms
msg="Compacted file" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/influxdb/data/db1/rp/10/000000861-000000002.tsm.tmp
msg="Finished compacting files" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_files_n=1
msg="TSM compaction (end)" trace_id=06avQESl000 op_name=tsm1_compact_group op_event=end op_elapsed=157.739ms

Compactions are identified by op_name=tsm1_compact_group, so to summarize them, we might use the following command to output the trace id and elapsed time:

$ fgrep 'tsm1_compact_group' influxd.log | fgrep 'op_event=end' | lcut trace_id op_elapsed

which can be read as:

Find all the lines containing the text tsm1_compact_group and op_event=end and display the trace_id and op_elapsed keys

and would produce the following output:

06avQEZW000	56.907ms
06avQESl000	157.739ms

From here it is easy to filter the logs for trace 06avQESl000 using the following:

$ fgrep '06avQESl000' influxd.log | lcut msg tsm1_file
TSM compaction (start)
Beginning compaction
Compacting file	/influxdb/data/db1/rp/10/000000859-000000002.tsm
Compacting file	/influxdb/data/db1/rp/10/000000861-000000001.tsm
Compacted file	/influxdb/data/db1/rp/10/000000861-000000002.tsm.tmp
Finished compacting files
TSM compaction (end)

We can ask more complex questions, such as:

What are the top 10 slowest continuous queries?

using a command like:

$ fgrep 'continuous_querier_execute' influxd.log | fgrep 'op_event=end' | lcut trace_id op_elapsed | sort -r -h -k2 | head -10
06eXrSJG000	15007.940ms
06d7Ow3W000	15007.646ms
06axkRVG000	15007.222ms
06ay9170000	15007.118ms
06c9tbwG000	15006.701ms
06dUcXhG000	15006.533ms
06ekMi40000	15006.158ms
06c5FH7l000	15006.145ms
06bDHhkG000	15006.012ms
06a~ioYG000	15005.988ms

In summary, structured logging will enable us to analyze log data more efficiently with off-the-shelf tooling.

HTTP Access Logs

InfluxDB has long supported the ability to output HTTP request traffic in Common Log Format. This feature is enabled by setting the log-enabled option to true in the [http] section of the InfluxDB configuration.

Prior to 1.5, all log output was sent to stderr and looked something like the following:

[I] 2018-03-02T19:59:58Z compacted level 1 4 files into 1 files in 130.832391ms engine=tsm1
[I] 2018-03-02T20:00:09Z SELECT count(v0) FROM db.autogen.m0 WHERE time >= '2018-02-27T01:00:00Z' AND time < '2018-02-27T02:00:00Z' GROUP BY * service=query
[httpd] ::1 - - [02/Mar/2018:13:00:09 -0700] "GET /query?db=db&q=select+count%28v0%29+from+m0+where+time+%3E%3D+%272018-02-27T01%3A00%3A00Z%27+and+time+%3C+%272018-02-27T02%3A00%3A00Z%27+group+by+%2A HTTP/1.1" 200 0 "-" "curl/7.54.0" 4f39378e-1e54-11e8-8001-000000000000 726
[I] 2018-03-02T20:00:57Z retention policy shard deletion check commencing service=retention

Intermingling the log event streams required additional work to separate them before we can begin any analysis. The latest release adds an additional configuration option to write the access log to a separate file. For example, the following configuration will write the access log to a file located at /var/log/influxd/access.log

[http]
  # ...
  log-enabled = true
  access-log-path = "/var/log/influxd/access.log"
  # ...

When the access log is written to a file, the [http] prefix is stripped such that the file can be parsed by standard HTTP log analysis and monitoring tools without further processing. For example, using lnav, an admin can open an active log file and display the data using a number of different visualizations, including error rates and histograms as is demonstrated in the following asciinema.