Logging Improvements for InfluxDB 1.5.0
By Stuart Carnie / Mar 06, 2018 / InfluxData, InfluxDB, Developer
When it comes to troubleshooting issues, log files are a high-value asset. If things go wrong, you’ll almost always be asked to “send the logs”. InfluxDB 1.5 comes with a number of logging improvements that simplify the task of analyzing this data.
InfluxDB generates a lot of log output that chronicles many aspects of its internal operation. This latest update has revamped how the database logs key information and the format of the log output. The primary goal of these changes is to enable tooling which can efficiently parse and analyze the log data and reduce the time required to diagnose issues.
InfluxDB 1.5 can generate structured logs in either logfmt or JSON lines. This feature is controlled by a new section of the configuration file titled [logging]. The default log format is auto, which chooses the output based on whether stderr refers to a terminal (TTY). If stderr is a TTY, a less verbose “console” format is selected; otherwise, the output will be logfmt. If you would prefer consistent output, we encourage you to explicitly set the format to logfmt or json.
The value of structured logging is greatly improved when specific elements of a log event can be extracted easily. With that in mind, the existing logging code was reviewed to ensure notable data was moved to separate keys. The following example shows a few startup events related to opening files in version 1.3. Note some entries include the path and duration embedded in the message:
[I] 2018-03-03T00:46:48Z /Users/stuartcarnie/.influxdb/data/db/autogen/510/000000001-000000001.tsm (#0) opened in 280.401827ms engine=tsm1 service=filestore
[I] 2018-03-03T00:46:48Z reading file /Users/stuartcarnie/.influxdb/wal/db/autogen/510/_00001.wal, size 15276 engine=tsm1 service=cacheloader
[I] 2018-03-03T00:46:54Z /Users/stuartcarnie/.influxdb/data/db/autogen/510 opened in 5.709152422s service=store
As of 1.5, the same events using logfmt now look like:
ts=2018-03-03T00:48:57.246371Z lvl=info msg="Opened file" log_id=06bLAgTG000 engine=tsm1 service=filestore path=/Users/stuartcarnie/.influxdb/data/db/autogen/510/000000001-000000001.tsm id=0 duration=61.736ms
ts=2018-03-03T00:48:57.246590Z lvl=info msg="Reading file" log_id=06bLAgTG000 engine=tsm1 service=cacheloader path=/Users/stuartcarnie/.influxdb/wal/db/autogen/510/_00001.wal size=15276
ts=2018-03-03T00:49:01.023313Z lvl=info msg="Opened shard" log_id=06bLAgTG000 service=store trace_id=06bLAgv0000 op_name=tsdb_open path=/Users/stuartcarnie/.influxdb/data/db/autogen/510 duration=3841.275ms
Take note that path and duration are now separate keys. Using a tool like lcut, we can select specific keys (ts, msg, path, and duration) to reformat the output:
2018-03-03T00:48:57.246371Z Opened file /Users/stuartcarnie/.influxdb/data/db/autogen/510/000000001-000000001.tsm 61.736ms
2018-03-03T00:48:57.246590Z Reading file /Users/stuartcarnie/.influxdb/wal/db/autogen/510/_00001.wal
2018-03-03T00:49:01.023313Z Opened shard /Users/stuartcarnie/.influxdb/data/db/autogen/510 3841.275ms
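The lcut-style key selection above is easy to approximate in a few lines of Python. This is an illustrative sketch, not the lcut tool itself; the parser below handles only the simple key=value and key="quoted value" forms seen in these examples, not the full logfmt grammar:

```python
import re

# Matches key=value pairs, where the value may be bare or double-quoted.
PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

def parse_logfmt(line):
    """Parse a logfmt line into a dict of key -> value (quotes stripped)."""
    return {key: value.strip('"') for key, value in PAIR_RE.findall(line)}

def lcut(lines, keys):
    """Select the given keys from each logfmt line, lcut-style."""
    for line in lines:
        event = parse_logfmt(line)
        yield " ".join(event[k] for k in keys if k in event)

log = [
    'ts=2018-03-03T00:48:57.246371Z lvl=info msg="Opened file" duration=61.736ms',
    'ts=2018-03-03T00:48:57.246590Z lvl=info msg="Reading file" size=15276',
]
for row in lcut(log, ["ts", "msg", "duration"]):
    print(row)
```

Note that keys absent from an event (like duration on the second line) are simply skipped, which matches how the reformatted output above drops the duration column for the "Reading file" event.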
Collating Log Events
During the lifetime of an InfluxDB process, operations such as compactions run continuously and generate multiple events as they advance. To further complicate matters, when these operations run concurrently, the events from each will be interleaved together. Determining the outcome of a specific compaction is practically impossible, as there is no way to associate log events for each independently. To address this issue, InfluxDB 1.5 adds the keys listed in the following table:
| Key | Description |
| --- | --- |
| trace_id | A unique value associated with each log event for a single run of an operation |
| op_name | A searchable identifier, such as tsm1_compact_group |
| op_event | Given a value of start or end, marks the beginning or completion of an operation |
| op_elapsed | The time an operation took to complete, in milliseconds. This key is always included with the op_event=end event |
To demonstrate how these keys might be used, we’ll use the following (abridged) file named “influxd.log”, which includes at least two compactions running concurrently.
msg="TSM compaction (start)" trace_id=06avQESl000 op_name=tsm1_compact_group op_event=start
msg="Beginning compaction" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_files_n=2
msg="Compacting file" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/influxdb/data/db1/rp/10/000000859-000000002.tsm
msg="Compacting file" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/influxdb/data/db1/rp/10/000000861-000000001.tsm
msg="TSM compaction (start)" trace_id=06avQEZW000 op_name=tsm1_compact_group op_event=start
msg="Beginning compaction" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_files_n=2
msg="invalid subscription token" service=subscriber
msg="Post http://kapacitor-rw:9092/write?consistency=&db=_internal&precision=ns&rp=monitor: dial tcp: lookup kapacitor-rw on 10.0.0.1: server misbehaving" service=subscriber
msg="Compacting file" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/influxdb/data/db2/rp/12/000000027-000000002.tsm
msg="Compacting file" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/influxdb/data/db2/rp/12/000000029-000000001.tsm
msg="Compacted file" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/influxdb/data/db2/rp/12/000000029-000000002.tsm.tmp
msg="Finished compacting files" trace_id=06avQEZW000 op_name=tsm1_compact_group tsm1_files_n=1
msg="TSM compaction (end)" trace_id=06avQEZW000 op_name=tsm1_compact_group op_event=end op_elapsed=56.907ms
msg="Compacted file" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/influxdb/data/db1/rp/10/000000861-000000002.tsm.tmp
msg="Finished compacting files" trace_id=06avQESl000 op_name=tsm1_compact_group tsm1_files_n=1
msg="TSM compaction (end)" trace_id=06avQESl000 op_name=tsm1_compact_group op_event=end op_elapsed=157.739ms
Compactions are identified by op_name=tsm1_compact_group, so to summarize them, we might use the following command to output the trace id and elapsed time:
$ fgrep 'tsm1_compact_group' influxd.log | fgrep 'op_event=end' | lcut trace_id op_elapsed
which can be read as: find all the lines containing the text tsm1_compact_group and op_event=end, then display the trace_id and op_elapsed keys. It would produce the following output:
06avQEZW000 56.907ms
06avQESl000 157.739ms
From here it is easy to filter the logs for trace 06avQESl000 using the following:
$ fgrep '06avQESl000' influxd.log | lcut msg tsm1_file
TSM compaction (start)
Beginning compaction
Compacting file /influxdb/data/db1/rp/10/000000859-000000002.tsm
Compacting file /influxdb/data/db1/rp/10/000000861-000000001.tsm
Compacted file /influxdb/data/db1/rp/10/000000861-000000002.tsm.tmp
Finished compacting files
TSM compaction (end)
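The same collation can be sketched in Python: group every event by its trace_id so each operation's history reads in order. The events below are hypothetical pre-parsed dictionaries standing in for the parsed lines of influxd.log:

```python
from collections import defaultdict

# Hypothetical pre-parsed events; in practice these would come from a
# logfmt parser applied to influxd.log.
events = [
    {"msg": "TSM compaction (start)", "trace_id": "06avQESl000", "op_event": "start"},
    {"msg": "TSM compaction (start)", "trace_id": "06avQEZW000", "op_event": "start"},
    {"msg": "invalid subscription token"},  # no trace_id: not part of an operation
    {"msg": "TSM compaction (end)", "trace_id": "06avQEZW000", "op_event": "end",
     "op_elapsed": "56.907ms"},
    {"msg": "TSM compaction (end)", "trace_id": "06avQESl000", "op_event": "end",
     "op_elapsed": "157.739ms"},
]

def collate(events):
    """Group events by trace_id; events without one are not part of an operation."""
    traces = defaultdict(list)
    for event in events:
        if "trace_id" in event:
            traces[event["trace_id"]].append(event)
    return traces

for trace_id, trace in collate(events).items():
    print(trace_id, len(trace), "events, elapsed:", trace[-1].get("op_elapsed", "?"))
```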
We can ask more complex questions, such as “What are the top 10 slowest continuous queries?”, using a command like:
$ fgrep 'continuous_querier_execute' influxd.log | fgrep 'op_event=end' | lcut trace_id op_elapsed | sort -r -h -k2 | head -10
06eXrSJG000 15007.940ms
06d7Ow3W000 15007.646ms
06axkRVG000 15007.222ms
06ay9170000 15007.118ms
06c9tbwG000 15006.701ms
06dUcXhG000 15006.533ms
06ekMi40000 15006.158ms
06c5FH7l000 15006.145ms
06bDHhkG000 15006.012ms
06a~ioYG000 15005.988ms
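The same “slowest operations” query can be sketched in Python, assuming rows of (trace_id, op_elapsed) pairs have already been extracted from the log:

```python
def top_slowest(rows, n=10):
    """Sort (trace_id, op_elapsed) rows by elapsed time, descending."""
    def ms(value):
        # op_elapsed values look like "157.739ms"; strip the suffix and parse.
        return float(value.rstrip("ms"))
    return sorted(rows, key=lambda r: ms(r[1]), reverse=True)[:n]

rows = [("06avQEZW000", "56.907ms"), ("06avQESl000", "157.739ms")]
for trace_id, elapsed in top_slowest(rows, n=2):
    print(trace_id, elapsed)
```

Parsing op_elapsed numerically avoids the lexical-sorting pitfalls a plain text sort can run into with values like 9.1ms versus 10.2ms.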
In summary, structured logging will enable us to analyze log data more efficiently with off-the-shelf tooling.
HTTP Access Logs
InfluxDB has long supported the ability to output HTTP request traffic in Common Log Format. This feature is enabled by setting the log-enabled option to true in the [http] section of the InfluxDB configuration.
Prior to 1.5, all log output was sent to stderr and looked something like the following:
[I] 2018-03-02T19:59:58Z compacted level 1 4 files into 1 files in 130.832391ms engine=tsm1
[I] 2018-03-02T20:00:09Z SELECT count(v0) FROM db.autogen.m0 WHERE time >= '2018-02-27T01:00:00Z' AND time < '2018-02-27T02:00:00Z' GROUP BY * service=query
[httpd] ::1 - - [02/Mar/2018:13:00:09 -0700] "GET /query?db=db&q=select+count%28v0%29+from+m0+where+time+%3E%3D+%272018-02-27T01%3A00%3A00Z%27+and+time+%3C+%272018-02-27T02%3A00%3A00Z%27+group+by+%2A HTTP/1.1" 200 0 "-" "curl/7.54.0" 4f39378e-1e54-11e8-8001-000000000000 726
[I] 2018-03-02T20:00:57Z retention policy shard deletion check commencing service=retention
Intermingled log event streams require additional work to separate before any analysis can begin. The latest release adds a configuration option to write the access log to a separate file. For example, the following configuration writes the access log to a file located at /var/log/influxd/access.log:
[http]
  # ...
  log-enabled = true
  access-log-path = "/var/log/influxd/access.log"
  # ...
When the access log is written to a file, the [httpd] prefix is stripped so that the file can be parsed by standard HTTP log analysis and monitoring tools without further processing. For example, using lnav, an admin can open an active log file and display the data using a number of different visualizations, including error rates and histograms, as demonstrated in the following asciinema.
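As a quick illustration of what the stripped prefix enables, a Common Log Format line like the one shown earlier can be matched with a standard regular expression. This is a simplified sketch that covers only the core CLF fields, not the extra request-id and timing columns InfluxDB appends:

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

line = ('::1 - - [02/Mar/2018:13:00:09 -0700] '
        '"GET /query?db=db HTTP/1.1" 200 0')
m = CLF_RE.match(line)
print(m.group("status"), m.group("request"))
```

Because the access log file contains only lines in this shape, a parser like this (or a tool like lnav) can consume it directly, with no need to first filter out the database's other log events.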