InfluxData Blog - Mark Rushakoff

Reproducing a Flaky Test in Go

Mark Rushakoff (InfluxData) — Thu, 15 Aug 2019 10:26:33 -0700

Oftentimes, a test that occasionally fails in CI can be reproduced locally with a simple go test -count=N ./path/to/flaky/package… but once in a while, it just doesn’t repro locally. Or maybe the full test suite for that package takes too long, and the test fails so rarely that you need a more precise way of zeroing in on the bad test.

It’s much better to confidently fix the test failure than to make a good guess and hope for the best. If you have to try and fix it again on another day, you will waste time regathering context you had the first time around.

In this document I will detail the approaches that I have found effective for methodically reproducing a test failure, drawn from the many hours I’ve spent tracking down flaky tests.

But first, let’s look at the common patterns that result in flaky tests.

Flaky test categories

Ultimately, flaky tests are nondeterministic tests. In my experience, the root cause of the flakiness generally falls into one of two categories.

Nondeterministic data

With flaky tests, there are two interesting types of nondeterministic data: data that is itself consistent but is accessed nondeterministically, and data that is generated randomly but is accessed deterministically. Of course, those two types are not mutually exclusive.

Nondeterministic access

A test that doesn’t access its data in the same way every time is fine, so long as the test’s assertions account for the data being in a different order.

Most Go developers are aware that map iteration order is random. Iterating a map to populate a slice, and then asserting against that slice’s order — especially when the map only has two or three elements — is a test pattern that will pass surprisingly often. Combine that with Go’s test caching and the human habit of blindly re-running a failed CI job, and you will have a perfectly annoying test failure that rarely pops up.
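One way to make such a test order-independent is to sort the keys before asserting on them. The following is a minimal sketch; the helper name sortedKeys and the sample map are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKeys returns the keys of m in ascending order, so that a test can
// assert on the resulting slice without depending on map iteration order.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	m := map[string]int{"b": 2, "a": 1, "c": 3}
	fmt.Println(sortedKeys(m)) // prints [a b c] on every run
}
```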

Other than map iteration, goroutines performing concurrent work may finish in an arbitrary order. Perhaps you have two goroutines, one reading a very small file and one reading a very large file, and each sending some result on the same channel. Most of the time, the small file’s goroutine will finish and send its result first; if your test assumes that is always the case, then the test will sometimes fail.
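A test can avoid that assumption by collecting every result first and only then comparing them in a fixed order. This sketch uses a hypothetical collectSorted helper and fake "read" results rather than real file I/O.

```go
package main

import (
	"fmt"
	"sort"
)

// collectSorted starts one goroutine per input, gathers every result from a
// shared channel, and sorts them so the caller never depends on arrival order.
func collectSorted(inputs []string) []string {
	results := make(chan string, len(inputs))
	for _, in := range inputs {
		go func(s string) {
			results <- "read:" + s // stand-in for the real work
		}(in)
	}

	out := make([]string, 0, len(inputs))
	for range inputs {
		out = append(out, <-results)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Whichever goroutine finishes first, the output order is the same.
	fmt.Println(collectSorted([]string{"small.txt", "large.txt"}))
}
```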

Nondeterministic generation

Making good use of random data in tests is an art unto itself. go-fuzz is an excellent tool to discover bugs related to handling of arbitrary input. Using random values in tests is a lightweight way to potentially discover bugs in a similar way, but the downside is that you will learn about the bug only when the test occasionally fails. For that reason, it’s important that you can easily get at the input that caused the failure.

The single flaky test I tracked down, perhaps six years ago, that sticks out most in my mind involved deserializing a randomly populated YAML file. We would infrequently see this test fail with a mysterious failure message, and running it again would always pass. We were randomly generating a string of hex characters for a particular value. Most of the time, the YAML would look like key: a1b2c3 and would be interpreted as a string…but every once in a while it would pick a sequence of all decimal digits, then a single letter E, and then the rest decimal digits. We didn’t surround the value with quotes, so the parser would interpret key: 12345e12 as a floating-point number instead of a string!

When using randomly generated values, make sure it’s easy to recover the input that caused the failure. Usually you can include the value in a call to t.Fatalf.
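As a rough illustration of the YAML story above, the sketch below generates a random hex string and builds a failure message that names both the seed and the value. The helper names are hypothetical, and strconv.ParseFloat is only a stand-in for a YAML parser, which has its own rules for what counts as a number. In a real test, the returned error would be passed to t.Fatal.

```go
package main

import (
	"fmt"
	"math/rand"
	"strconv"
)

// randomHex returns n random hex characters drawn from r.
func randomHex(r *rand.Rand, n int) string {
	const digits = "0123456789abcdef"
	b := make([]byte, n)
	for i := range b {
		b[i] = digits[r.Intn(len(digits))]
	}
	return string(b)
}

// looksNumeric reports whether s parses as a floating-point number.
// Assumption: ParseFloat approximates how an unquoted value might be read.
func looksNumeric(s string) bool {
	_, err := strconv.ParseFloat(s, 64)
	return err == nil
}

// checkValue returns an error that includes the seed and the offending input,
// so a rare failure can be replayed without guessing.
func checkValue(seed int64, v string) error {
	if looksNumeric(v) {
		return fmt.Errorf("seed %d: value %q parses as a number, not a string", seed, v)
	}
	return nil
}

func main() {
	seed := int64(42) // hypothetical; log whatever seed the test actually used
	r := rand.New(rand.NewSource(seed))
	v := randomHex(r, 8)
	fmt.Println(v, checkValue(seed, v))
	fmt.Println(checkValue(seed, "12345e12"))
}
```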

If it’s a more complicated test involving many files each containing some random content, I would at least put all the files in the same temporary directory. That way, if I need to reproduce a failure to inspect the files, I can just comment out the call to os.RemoveAll and add a t.Log(dirName) to know where to explore to see the bad input. If you’re already working on locally reproducing an intermittent failure anyway, I don’t see anything wrong with making some temporary edits to the test function.

Timing-based tests

In my experience, tests that are sensitive to timing tend to cause flaky failures more frequently than the previously mentioned logic bugs around data order.

Usually it goes like this: your test starts a goroutine to do an operation that should complete quickly, perhaps in tens of milliseconds. You pick a reasonable timeout value: “If I don’t see a result in 100ms, the test has failed.” You run that test in a loop for 15 minutes on your machine and it passes every time. Then after merging that change to master, the test fails at least once the first day it’s running on CI. Someone bumps up the timeout to one second, and it still manages to fail a couple times per week. Now what?

If you have a way to test a synchronous API instead of an asynchronous one — avoiding sensitivity to time — that is generally the best solution.

If you must test asynchronously, be sure to poll rather than do a single long sleep. Don’t do this:

go longOperation.Start()

// Bad: this will always eat up 5 seconds in test, even if the operation completes instantly.
time.Sleep(5 * time.Second)

if !longOperation.IsDone() {
  t.Fatal("didn't see result in time")
}

res := longOperation.Result()
if res != expected {
  t.Fatalf("expected %v, got %v", expected, res)
}

Instead, for APIs that let you check whether a call will block, I often use a pattern like:

go longOperation.Start()

deadline := time.Now().Add(5 * time.Second)
for {
  if time.Now().After(deadline) {
    t.Fatal("didn't see result in time")
  }

  if !longOperation.IsDone() {
    time.Sleep(100 * time.Millisecond)
    continue
  }

  res := longOperation.Result()
  if res != expected {
    t.Fatalf("expected %v, got %v", expected, res)
  }
  break // The result has been checked; stop polling.
}
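If several tests need the same polling logic, it can be pulled into a small helper. This is a sketch only; waitFor is a hypothetical name, and the atomic flag stands in for longOperation.IsDone().

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitFor polls cond every interval until it returns true or timeout elapses.
// It reports whether cond became true in time.
func waitFor(cond func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	var done int32
	go func() {
		time.Sleep(20 * time.Millisecond) // simulated work
		atomic.StoreInt32(&done, 1)
	}()

	ok := waitFor(func() bool { return atomic.LoadInt32(&done) == 1 },
		time.Second, 5*time.Millisecond)
	fmt.Println("finished in time:", ok)
}
```

In a test, a false return value would lead to t.Fatal("didn't see result in time").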

If the API is blocking but accepts a cancelable context, you should use a reasonable timeout so that the test will fail more quickly than the default 10 minute timeout:

go longOperation.Start()

ctx, cancel := context.WithTimeout(context.Background(), 5 * time.Second)
defer cancel()

res, err := longOperation.WaitForResult(ctx)
if err != nil {
  t.Fatal(err)
}
if res != expected {
  t.Fatalf("expected %v, got %v", expected, res)
}

If the API blocks but does not accept a context, you can write a helper function to run the method and fail if it doesn’t complete in a given timeout (left as an exercise for the reader).
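One possible shape for such a helper is sketched below. The name runWithTimeout and the int result type are assumptions for illustration. Note that if the call never returns, its goroutine is leaked, which is usually acceptable in a test that is about to fail anyway.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("operation did not complete before the timeout")

// runWithTimeout calls fn in its own goroutine and waits up to timeout for it
// to return. If the timeout fires first, it returns errTimeout.
func runWithTimeout(fn func() int, timeout time.Duration) (int, error) {
	done := make(chan int, 1) // buffered so a late fn can still send and exit
	go func() { done <- fn() }()

	select {
	case v := <-done:
		return v, nil
	case <-time.After(timeout):
		return 0, errTimeout
	}
}

func main() {
	v, err := runWithTimeout(func() int { return 42 }, time.Second)
	fmt.Println(v, err)

	_, err = runWithTimeout(func() int {
		time.Sleep(time.Second) // simulated slow blocking call
		return 0
	}, 10*time.Millisecond)
	fmt.Println(err)
}
```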

Reproducing a flaky test on your workstation

You’ve seen the test fail a couple times on CI. It usually passes if you re-run the CI job, but instead of kicking the can down the road, let’s fix it now. Before you can confidently say you’ve fixed the test, you need to confidently reproduce the test locally.

Set up your test loop

First, focus on the exact package that failed. That is, you want to run go test ./mypkg, not go test ./....

Then, use -run to focus on the exact test that failed. Usually I would just copy and paste the test name that’s failing, e.g. go test -run=TestFoo ./mypkg. However, note that the -run flag accepts a regular expression, so if your test name is also the prefix of another test, you can ensure you run only the exact match by anchoring the name, like go test -run='^TestFoo$' ./mypkg.
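The effect of anchoring can be seen with the regexp package directly. This sketch only models matching against top-level test names (it ignores how go test treats subtests), and the test names are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// selected returns the names that an unanchored regular expression match
// against pattern would pick out.
func selected(pattern string, names []string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, n := range names {
		if re.MatchString(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{"TestFoo", "TestFooBar", "TestBar"}
	fmt.Println(selected("TestFoo", names))   // prints [TestFoo TestFooBar]
	fmt.Println(selected("^TestFoo$", names)) // prints [TestFoo]
}
```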

If you run that more than once, you will surely notice that recent versions of Go cache the test results. Obviously we don’t want that while we are trying to reproduce a flaky test. You might use -count=1 if you were just running the full test suite without caching, but you should pick a larger number. That number will vary by the exact test; my personal preference is a count that completes in around 10 seconds or so. Let’s say we’ve settled on 100 — your command now looks like go test -run=TestFoo -count=100 ./mypkg.

The -count flag will run the specified number of iterations regardless of how many runs fail. Most of the time, you don’t get any extra information from multiple failures in a single run. For that reason, I prefer to use the -failfast flag so that the test process stops after the first failed run.

Now, we can throw this in a very simple bash loop: while go test -run=TestFoo -count=100 -failfast ./mypkg; do date; done. You could put anything inside the body of the loop, but I like to see the date go by in the output so I can see that it hasn’t deadlocked. (And if the test failure you’re trying to reproduce is itself a deadlock, that loop which completes in about 10 seconds pairs well with -timeout=20s so you don’t have to wait around 10 minutes to see a stack trace.)

At this point, if you can reproduce the failure within a minute or less, you’re in good shape to fix the test. If it takes much longer than that to reproduce, you can shave off a bit more time by compiling the test package before the loop. Under the hood, go test will compile and run the test package, so we can avoid some repeated work by compiling it ourselves. When we execute our compiled test package directly, we need to use the test. prefix on the test-specific flags, like so: go test -c ./mypkg && while ./mypkg.test -test.run=TestFoo -test.count=100 -test.failfast; do date; done.

When that test loop still doesn't reproduce the flaky test

Use the data race detector

Sometimes the test failure is a hard time-based deadline, like waiting one second for something to happen. Using the flag -race to compile the test with the race detector enabled will generally slow down execution, possibly enough to reproduce the failure. And if you happen to detect any new data races along the way, that’s great too.

Stop focusing on the test

This is more helpful for data-ordering flakiness than it is for timing-based flakiness.

In very rare circumstances, the flaky test’s failure has to do with pollution from another test. You might drop the -run flag altogether and run that loop for a while to see if the test fails. Then you can gradually skip more and more of the passing tests until you identify which one(s) are causing the pollution for the flaky test. Adding a t.Skip call works, but for this kind of temporary change, I usually rename TestBar to xTestBar so that Go’s test detection just stops noticing the test altogether and I don’t have to see the SKIP lines in the verbose output.

Throttle the CPU usage of the process

Some flaky tests only seem to show up in a resource-contended environment like your CI server, and never on an otherwise idle system like your workstation. For reproducing those kinds of failures, we have had good luck with cpulimit, which should be available in your standard package manager (e.g. brew install cpulimit or apt install cpulimit).

There’s nothing magical about cpulimit. If you’ve ever pressed control-Z in your terminal to stop a running process and restored it with fg, then you’re basically familiar with how cpulimit works: it repeatedly sends SIGSTOP to pause the process and SIGCONT to restore it again, effectively limiting how much CPU time the process is allowed to use. For most use cases, this is a close enough approximation of a limited CPU; and the cpulimit tool has the benefit of being cross-platform and easy to use.

The interesting flags for cpulimit are:

-l/--limit, to control how much CPU is available
-i/--include-children to apply to subprocesses of the target process
-z/--lazy to exit if the target process dies

Choosing an appropriate value for --limit is mostly trial and error. Too low of a value may cause excessive run durations and unintended timeouts, but too high of a value won’t necessarily reproduce the issue. Keep in mind that the maximum value is 100 times the number of available cores on the system, so -l 100 doesn’t represent 100% of available CPU, but rather 100% of one core.
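The arithmetic behind that last point is small enough to sketch. The function names here are hypothetical; the only fact used is that the limit is expressed as a percentage of one core.

```go
package main

import (
	"fmt"
	"runtime"
)

// maxLimit is the largest meaningful --limit value for a machine with numCPU cores.
func maxLimit(numCPU int) int {
	return 100 * numCPU
}

// shareOfMachine converts a --limit value into a fraction of the whole machine.
func shareOfMachine(limit, numCPU int) float64 {
	return float64(limit) / float64(maxLimit(numCPU))
}

func main() {
	n := runtime.NumCPU()
	fmt.Printf("cores=%d max limit=%d\n", n, maxLimit(n))
	fmt.Printf("-l 100 is %.0f%% of this machine\n", 100*shareOfMachine(100, n))
}
```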

Then, you typically want to run your compiled test package under cpulimit, like cpulimit -l 50 -i -z ./mypkg.test -test.run=TestFoo -test.count=100. As mentioned above, prefer to build your test package out of band with go test -c without using cpulimit. If you were to run cpulimit -l 50 go test ./mypkg -run=TestFoo -count=100 — note the missing -i flag — you would limit go test but the mypkg.test subprocess would run without limit.

Adjust the concurrency of the Go runtime

You can set the GOMAXPROCS environment variable as explained in the runtime package documentation:

The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit.

When attempting to reproduce a flaky test, this setting is typically useful only when you already suspect the flakiness to be related to concurrency.

There are generally only three interesting settings for GOMAXPROCS:

One, which will allow only one goroutine to execute at any moment in time
The default setting, which is the number of logical CPUs available
Double (or more) the number of CPUs on your system, to try to introduce more contention for shared resources

In my experience, flaky tests that seem to be affected by GOMAXPROCS tend to never reproduce for one extreme, occasionally reproduce at default, and often reproduce at the other extreme. It depends on the particular failure mode; some tests may be flaky when there is too much contention, and others may be flaky when there aren’t enough goroutines processing work.
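To make the three settings concrete, this sketch prints the test loop command from earlier with each value. The helper name is made up, and runtime.NumCPU() is used as an approximation of the default.

```go
package main

import (
	"fmt"
	"runtime"
)

// interestingSettings returns the three GOMAXPROCS values discussed above:
// one, the CPU count, and double the CPU count.
func interestingSettings(numCPU int) []int {
	return []int{1, numCPU, 2 * numCPU}
}

func main() {
	// Passing 0 queries the current value without changing it.
	fmt.Println("current GOMAXPROCS:", runtime.GOMAXPROCS(0))

	for _, n := range interestingSettings(runtime.NumCPU()) {
		fmt.Printf("GOMAXPROCS=%d go test -run='^TestFoo$' -count=100 -failfast ./mypkg\n", n)
	}
}
```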

Conclusion

Reproducing flaky tests is part science and part art. I hope that these tips help you next time you’re dealing with a test that keeps failing and blocking CI at inopportune times.

Announcing Grade: A Tool to Track Your Go Benchmarks in InfluxDB

Mark Rushakoff (InfluxData) — Thu, 28 Sep 2017 04:00:06 -0700

Have you written Go benchmarks? How often do you run them?

Many Go developers will write and run a benchmark when working on critical code, and then maybe run the benchmark again when modifying that area of the code, to decide whether the change is likely to affect performance. If you follow that use pattern, benchcmp is an excellent utility to compare benchmark output, but if you want to run your benchmarks in CI and track their performance over time with InfluxDB, grade is the tool for you.

To use grade, first you need the output from a benchmark run. For example, here are the results from running go test -bench=. -run=^$ -benchmem ./models/... 2>/dev/null against the InfluxDB v1.0.2 tag:

PASS
BenchmarkMarshal-2                  	  500000	      2901 ns/op	     560 B/op	      13 allocs/op
BenchmarkParsePointNoTags-2         	 2000000	       733 ns/op	  31.36 MB/s	     208 B/op	       4 allocs/op
BenchmarkParsePointWithPrecisionN-2 	 2000000	       627 ns/op	  36.68 MB/s	     208 B/op	       4 allocs/op
BenchmarkParsePointWithPrecisionU-2 	 2000000	       636 ns/op	  36.15 MB/s	     208 B/op	       4 allocs/op
BenchmarkParsePointsTagsSorted2-2   	 2000000	       947 ns/op	  53.85 MB/s	     240 B/op	       4 allocs/op
BenchmarkParsePointsTagsSorted5-2   	 1000000	      1189 ns/op	  69.75 MB/s	     272 B/op	       4 allocs/op
BenchmarkParsePointsTagsSorted10-2  	 1000000	      1624 ns/op	  88.05 MB/s	     320 B/op	       4 allocs/op
BenchmarkParsePointsTagsUnSorted2-2 	 1000000	      1167 ns/op	  43.69 MB/s	     272 B/op	       5 allocs/op
BenchmarkParsePointsTagsUnSorted5-2 	 1000000	      1627 ns/op	  50.99 MB/s	     336 B/op	       5 allocs/op
BenchmarkParsePointsTagsUnSorted10-2	  500000	      2733 ns/op	  52.32 MB/s	     448 B/op	       5 allocs/op
BenchmarkParseKey-2                 	 1000000	      2361 ns/op	    1030 B/op	      24 allocs/op
ok  	github.com/influxdata/influxdb/models	19.809s

(One detail that is easy to overlook about this output is that the -2 suffix on all the names indicates the test was run with GOMAXPROCS set to 2.)

I ran this on an EC2 c4.large instance, under Go 1.6.2, which is what we used to build InfluxDB at that time. If I had this output stored as models-1.0.2.txt, I could run:

grade \
  -influxurl '' \
  -goversion "$(go version | cut -d' ' -f3-)" \
  -hardwareid c4.large \
  -revision v1.0.2 \
  -timestamp "$(cd $GOPATH/src/github.com/influxdata/influxdb && git log v1.0.2 -1 --format=%ct)" \
  < models-1.0.2.txt

Line-by-line, the options are:

-influxurl set to an empty string so that I can print the line protocol of what would be sent to a real host
-goversion set to the output of go version, without the string prefix go version
-hardwareid set to c4.large, so that when querying the data I understand what hardware ran the benchmarks
-revision set to the tag of the commit being tested (but I could have just as well used the SHA of the commit)
-timestamp set to the Unix timestamp of the commit being tested.

The above command would produce the following line protocol:

go,goversion=go1.6.2\ linux/amd64,hwid=c4.large,name=ParsePointNoTags,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=208i,allocs_per_op=4i,mb_per_s=31.36,n=2000000i,ns_per_op=733,revision="v1.0.2" 1475695157000000000
go,goversion=go1.6.2\ linux/amd64,hwid=c4.large,name=ParsePointWithPrecisionN,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=208i,allocs_per_op=4i,mb_per_s=36.68,n=2000000i,ns_per_op=627,revision="v1.0.2" 1475695157000000000
go,goversion=go1.6.2\ linux/amd64,hwid=c4.large,name=ParsePointWithPrecisionU,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=208i,allocs_per_op=4i,mb_per_s=36.15,n=2000000i,ns_per_op=636,revision="v1.0.2" 1475695157000000000
go,goversion=go1.6.2\ linux/amd64,hwid=c4.large,name=ParsePointsTagsSorted2,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=240i,allocs_per_op=4i,mb_per_s=53.85,n=2000000i,ns_per_op=947,revision="v1.0.2" 1475695157000000000
go,goversion=go1.6.2\ linux/amd64,hwid=c4.large,name=ParsePointsTagsSorted5,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=272i,allocs_per_op=4i,mb_per_s=69.75,n=1000000i,ns_per_op=1189,revision="v1.0.2" 1475695157000000000
go,goversion=go1.6.2\ linux/amd64,hwid=c4.large,name=ParsePointsTagsSorted10,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=320i,allocs_per_op=4i,mb_per_s=88.05,n=1000000i,ns_per_op=1624,revision="v1.0.2" 1475695157000000000
go,goversion=go1.6.2\ linux/amd64,hwid=c4.large,name=ParsePointsTagsUnSorted2,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=272i,allocs_per_op=5i,mb_per_s=43.69,n=1000000i,ns_per_op=1167,revision="v1.0.2" 1475695157000000000
go,goversion=go1.6.2\ linux/amd64,hwid=c4.large,name=ParsePointsTagsUnSorted5,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=336i,allocs_per_op=5i,mb_per_s=50.99,n=1000000i,ns_per_op=1627,revision="v1.0.2" 1475695157000000000
go,goversion=go1.6.2\ linux/amd64,hwid=c4.large,name=ParsePointsTagsUnSorted10,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=448i,allocs_per_op=5i,mb_per_s=52.32,n=500000i,ns_per_op=2733,revision="v1.0.2" 1475695157000000000
go,goversion=go1.6.2\ linux/amd64,hwid=c4.large,name=ParseKey,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=1030i,allocs_per_op=24i,n=1000000i,ns_per_op=2361,revision="v1.0.2" 1475695157000000000
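To see how one line of benchmark output relates to the tags and fields above, here is a rough sketch of the mapping. It is not grade's actual implementation: the parseLine function, the result type, and the parsing rules are assumptions, and only the field names are taken from the line protocol shown.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// result holds the pieces of one benchmark line.
type result struct {
	Name   string             // benchmark name without the "Benchmark" prefix
	Procs  int                // GOMAXPROCS suffix, or 1 if absent
	N      int64              // iteration count
	Fields map[string]float64 // keyed by line protocol field name
}

// unitToField maps benchmark units to the field names used above.
var unitToField = map[string]string{
	"ns/op":     "ns_per_op",
	"MB/s":      "mb_per_s",
	"B/op":      "alloced_bytes_per_op",
	"allocs/op": "allocs_per_op",
}

func parseLine(line string) (result, error) {
	f := strings.Fields(line)
	if len(f) < 4 || len(f)%2 != 0 || !strings.HasPrefix(f[0], "Benchmark") {
		return result{}, fmt.Errorf("not a benchmark line: %q", line)
	}
	name, procs := strings.TrimPrefix(f[0], "Benchmark"), 1
	if i := strings.LastIndex(name, "-"); i >= 0 {
		if p, err := strconv.Atoi(name[i+1:]); err == nil {
			name, procs = name[:i], p
		}
	}
	n, err := strconv.ParseInt(f[1], 10, 64)
	if err != nil {
		return result{}, err
	}
	r := result{Name: name, Procs: procs, N: n, Fields: map[string]float64{}}
	for i := 2; i+1 < len(f); i += 2 { // remaining fields are (value, unit) pairs
		v, err := strconv.ParseFloat(f[i], 64)
		if err != nil {
			return result{}, err
		}
		if key, ok := unitToField[f[i+1]]; ok {
			r.Fields[key] = v
		}
	}
	return r, nil
}

func main() {
	r, err := parseLine("BenchmarkParsePointNoTags-2 \t 2000000\t 733 ns/op\t 31.36 MB/s\t 208 B/op\t 4 allocs/op")
	fmt.Println(r.Name, r.Procs, r.N, r.Fields, err)

	// The -timestamp value is in Unix seconds; line protocol uses nanoseconds.
	ts := time.Unix(1475695157, 0).UTC()
	fmt.Println(ts.Format(time.RFC3339), ts.UnixNano()) // 2016-10-05T19:19:17Z 1475695157000000000
}
```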

And if we were to repeat that process for the other tags v1.1.5, v1.2.4, and v1.3.5, we would produce line protocol like:

go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=Marshal,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=416i,allocs_per_op=4i,n=1000000i,ns_per_op=1260,revision="v1.1.5" 1493408827000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=NewPoint,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=3424i,allocs_per_op=28i,n=200000i,ns_per_op=6387,revision="v1.1.5" 1493408827000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointNoTags5000,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=1644800i,allocs_per_op=5002i,mb_per_s=52.81,n=1000i,ns_per_op=2272374,revision="v1.1.5" 1493408827000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointNoTags,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=336i,allocs_per_op=3i,mb_per_s=36.65,n=2000000i,ns_per_op=627,revision="v1.1.5" 1493408827000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointWithPrecisionN,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=336i,allocs_per_op=3i,mb_per_s=44.47,n=3000000i,ns_per_op=517,revision="v1.1.5" 1493408827000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointWithPrecisionU,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=336i,allocs_per_op=3i,mb_per_s=44.51,n=3000000i,ns_per_op=516,revision="v1.1.5" 1493408827000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointsTagsSorted2,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=368i,allocs_per_op=3i,mb_per_s=62.39,n=2000000i,ns_per_op=817,revision="v1.1.5" 1493408827000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointsTagsSorted5,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=400i,allocs_per_op=3i,mb_per_s=80.09,n=1000000i,ns_per_op=1036,revision="v1.1.5" 1493408827000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointsTagsSorted10,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=448i,allocs_per_op=3i,mb_per_s=98.62,n=1000000i,ns_per_op=1449,revision="v1.1.5" 1493408827000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointsTagsUnSorted2,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=400i,allocs_per_op=4i,mb_per_s=51.44,n=2000000i,ns_per_op=991,revision="v1.1.5" 1493408827000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointsTagsUnSorted5,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=464i,allocs_per_op=4i,mb_per_s=58.78,n=1000000i,ns_per_op=1412,revision="v1.1.5" 1493408827000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointsTagsUnSorted10,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=576i,allocs_per_op=4i,mb_per_s=59.45,n=1000000i,ns_per_op=2405,revision="v1.1.5" 1493408827000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParseKey,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=611i,allocs_per_op=3i,n=2000000i,ns_per_op=705,revision="v1.1.5" 1493408827000000000

go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=Marshal,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=416i,allocs_per_op=4i,n=1000000i,ns_per_op=1259,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=NewPoint,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=3048i,allocs_per_op=22i,n=300000i,ns_per_op=5168,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointNoTags5000,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=1404800i,allocs_per_op=5002i,mb_per_s=50.18,n=1000i,ns_per_op=2391554,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointNoTags,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=288i,allocs_per_op=3i,mb_per_s=37.14,n=2000000i,ns_per_op=619,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointWithPrecisionN,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=288i,allocs_per_op=3i,mb_per_s=45.16,n=3000000i,ns_per_op=509,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointWithPrecisionU,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=288i,allocs_per_op=3i,mb_per_s=44.57,n=3000000i,ns_per_op=516,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointsTagsSorted2,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=320i,allocs_per_op=3i,mb_per_s=63.47,n=2000000i,ns_per_op=803,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointsTagsSorted5,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=352i,allocs_per_op=3i,mb_per_s=79.19,n=1000000i,ns_per_op=1048,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointsTagsSorted10,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=400i,allocs_per_op=3i,mb_per_s=97.4,n=1000000i,ns_per_op=1468,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointsTagsUnSorted2,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=352i,allocs_per_op=4i,mb_per_s=51.63,n=2000000i,ns_per_op=987,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointsTagsUnSorted5,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=416i,allocs_per_op=4i,mb_per_s=58.78,n=1000000i,ns_per_op=1411,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParsePointsTagsUnSorted10,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=528i,allocs_per_op=4i,mb_per_s=59.81,n=1000000i,ns_per_op=2390,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=ParseKey,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=611i,allocs_per_op=3i,n=2000000i,ns_per_op=700,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=EscapeStringField_Plain,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=16i,allocs_per_op=1i,n=20000000i,ns_per_op=67.5,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=EscapeString_Quotes,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=48i,allocs_per_op=3i,n=10000000i,ns_per_op=169,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=EscapeString_Backslashes,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=80i,allocs_per_op=3i,n=10000000i,ns_per_op=196,revision="v1.2.4" 1494272869000000000
go,goversion=go1.7.4\ linux/amd64,hwid=c4.large,name=EscapeString_QuotesAndBackslashes,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=160i,allocs_per_op=5i,n=3000000i,ns_per_op=412,revision="v1.2.4" 1494272869000000000

go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=Marshal,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=256i,allocs_per_op=2i,n=1000000i,ns_per_op=1043,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=NewPoint,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=2888i,allocs_per_op=20i,n=300000i,ns_per_op=4945,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=NewPointFromBinary,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=240i,allocs_per_op=1i,n=3000000i,ns_per_op=456,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=ParsePointNoTags5000,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=1404800i,allocs_per_op=5002i,mb_per_s=46.66,n=500i,ns_per_op=2571739,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=ParsePointNoTags,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=288i,allocs_per_op=3i,mb_per_s=35.96,n=2000000i,ns_per_op=639,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=ParsePointWithPrecisionN,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=288i,allocs_per_op=3i,mb_per_s=43.28,n=3000000i,ns_per_op=531,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=ParsePointWithPrecisionU,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=288i,allocs_per_op=3i,mb_per_s=42.4,n=3000000i,ns_per_op=542,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=ParsePointsTagsSorted2,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=320i,allocs_per_op=3i,mb_per_s=61.32,n=2000000i,ns_per_op=831,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=ParsePointsTagsSorted5,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=352i,allocs_per_op=3i,mb_per_s=77.86,n=1000000i,ns_per_op=1066,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=ParsePointsTagsSorted10,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=400i,allocs_per_op=3i,mb_per_s=97.54,n=1000000i,ns_per_op=1466,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=ParsePointsTagsUnSorted2,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=352i,allocs_per_op=4i,mb_per_s=49.99,n=1000000i,ns_per_op=1020,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=ParsePointsTagsUnSorted5,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=416i,allocs_per_op=4i,mb_per_s=58.28,n=1000000i,ns_per_op=1424,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=ParsePointsTagsUnSorted10,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=528i,allocs_per_op=4i,mb_per_s=58.9,n=500000i,ns_per_op=2427,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=ParseKey,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=611i,allocs_per_op=3i,n=2000000i,ns_per_op=782,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=EscapeStringField_Plain,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=16i,allocs_per_op=1i,n=20000000i,ns_per_op=66.3,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=EscapeString_Quotes,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=48i,allocs_per_op=3i,n=10000000i,ns_per_op=169,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=EscapeString_Backslashes,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=80i,allocs_per_op=3i,n=10000000i,ns_per_op=192,revision="v1.3.5" 1504050409000000000
go,goversion=go1.8.3\ linux/amd64,hwid=c4.large,name=EscapeString_QuotesAndBackslashes,pkg=github.com/influxdata/influxdb/models,procs=2 alloced_bytes_per_op=160i,allocs_per_op=5i,n=3000000i,ns_per_op=412,revision="v1.3.5" 1504050409000000000

To get output similar to the Go benchmark output, go to the influx CLI and execute: select revision, ns_per_op, mb_per_s, alloced_bytes_per_op, allocs_per_op from go group by "name".

You will see tabular data like:

name: go
tags: name=ParseKey
time                 revision ns_per_op mb_per_s alloced_bytes_per_op allocs_per_op
----                 -------- --------- -------- -------------------- -------------
2016-10-05T19:19:17Z v1.0.2   2361               1030                 24
2017-04-28T19:47:07Z v1.1.5   705                611                  3
2017-05-08T19:47:49Z v1.2.4   700                611                  3
2017-08-29T23:46:49Z v1.3.5   782                611                  3

name: go
tags: name=ParsePointNoTags
time                 revision ns_per_op mb_per_s alloced_bytes_per_op allocs_per_op
----                 -------- --------- -------- -------------------- -------------
2016-10-05T19:19:17Z v1.0.2   733       31.36    208                  4
2017-04-28T19:47:07Z v1.1.5   627       36.65    336                  3
2017-05-08T19:47:49Z v1.2.4   619       37.14    288                  3
2017-08-29T23:46:49Z v1.3.5   639       35.96    288                  3

That’s how easy it is to use grade. The decisions you’ll need to make if you use grade are:

Am I going to run benchmarks against every commit, one commit per day or per week, only against tags, or something else?
What hardware is going to execute my benchmarks, and what operating system am I going to test?
Am I going to run with different values of GOMAXPROCS or just the default value on my benchmark runner?
Is a 1-second sample long enough, or will I use the -benchtime flag for a longer duration?

And the last decision you’ll need to make is how you’ll act on the data you’re collecting. In the future, we will share TICK scripts to use with Kapacitor so that you can get an alert if a new benchmark indicates a performance decrease.

InfluxDB and the /debug/vars Endpoint

Mark Rushakoff (InfluxData) — Mon, 03 Apr 2017 15:43:46 -0700

Like many Go programs with an HTTP server, InfluxDB exposes some diagnostic information over the /debug/vars endpoint. Because the information we expose there is simply JSON, it should be very straightforward to expose InfluxDB’s diagnostics to other custom utilities that you might want to use to monitor your InfluxDB instance.

Go makes it trivial to add a /debug/vars endpoint to your program via the expvar package:

Package expvar provides a standardized interface to public variables, such as operation counters in servers. It exposes these variables via HTTP at /debug/vars in JSON format.
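As a rough illustration of the standard package, here is a minimal sketch that publishes one counter and reads it back through the /debug/vars handler. The counter name requests, the /hello handler, and the helpers newMux and get are made up for this example. It uses httptest so that it runs and exits without binding a port; a real server would call http.ListenAndServe instead, and if you serve http.DefaultServeMux, importing expvar registers /debug/vars for you.

```go
package main

import (
	"expvar"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requests is a hypothetical operation counter. expvar.NewInt creates it
// and publishes it under the key "requests".
var requests = expvar.NewInt("requests")

// handleHello is an illustrative handler that bumps the counter on each call.
func handleHello(w http.ResponseWriter, r *http.Request) {
	requests.Add(1)
	fmt.Fprintln(w, "hello")
}

// newMux wires the application handler and the expvar handler onto a
// fresh mux. expvar.Handler returns the same handler that the package
// registers on http.DefaultServeMux at import time.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/hello", handleHello)
	mux.Handle("/debug/vars", expvar.Handler())
	return mux
}

// get performs an in-memory GET against h and returns the response body.
func get(h http.Handler, path string) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", path, nil))
	return rec.Body.String()
}

func main() {
	mux := newMux()
	get(mux, "/hello")

	// The JSON document contains cmdline, memstats, and our requests counter.
	fmt.Println(get(mux, "/debug/vars"))
}
```

Note that there is no call to take requests back out of the published set, which is the limitation discussed next.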

expvar works great for most use cases. InfluxDB started out using expvar directly, but there was, and still is, a missing feature that caused us to stop using the standard expvar package: it does not allow you to remove published variables. There were at least a couple of issues filed about stats being reported for entities that were deleted. A low signal-to-noise ratio is never helpful when you're trying to debug an active problem.

In InfluxDB PR 6964 (which was first present in the 1.0 release) we moved away from using expvar directly, in favor of a custom implementation with expvar-compatible output.

Kapacitor took a slightly different approach: forking the standard library expvar package to add a Delete method on the Map type.

/debug/vars and the TICK Stack

InfluxDB's expvar Format

The output of InfluxDB’s /debug/vars endpoint is one JSON object that roughly corresponds with the contents of the _internal database. The _internal database stores the values every 10 seconds by default, but the /debug/vars endpoint, like the SHOW STATS query, gives you an instant-in-time view of those stats.

First, there are two key-value pairs that match the standard expvar output:

cmdline is an array of strings representing the command line arguments used to invoke the process. If you're running influxd as a service on Linux, the value for cmdline will look something like: ["/usr/bin/influxd","-config","/etc/influxdb/influxdb.conf"].

memstats is a JSON object corresponding to the Go runtime.MemStats struct.

The remaining entries in the /debug/vars output have keys representing the details of the object being measured, with values in "InfluxDB expvar format". The "InfluxDB expvar format" is an object with this structure:

name: a string describing what's being measured, i.e. the corresponding InfluxDB measurement
tags: an object with string keys and values, corresponding to the tags to use for the fields
values: an object whose keys and values correspond with fields for the measurement
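To make the structure above concrete, here is a sketch of how a Go client might model and decode it. The Stat type, the parseStats helper, and the sample document (including the example:widgets key and its numbers) are all invented for illustration; they are not taken from InfluxDB's source or from real server output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Stat mirrors the three fields of the "InfluxDB expvar format".
type Stat struct {
	Name   string                 `json:"name"`   // measurement name
	Tags   map[string]string      `json:"tags"`   // tags for the fields
	Values map[string]interface{} `json:"values"` // fields of the measurement
}

// sample is a made-up /debug/vars document, trimmed to one formatted entry.
const sample = `{
  "cmdline": ["/usr/bin/influxd", "-config", "/etc/influxdb/influxdb.conf"],
  "memstats": {"Alloc": 123456},
  "example:widgets": {
    "name": "example",
    "tags": {"kind": "widget"},
    "values": {"built": 42}
  }
}`

// parseStats decodes a /debug/vars document and returns only the
// InfluxDB-formatted entries, skipping the standard cmdline and memstats keys.
func parseStats(doc []byte) (map[string]Stat, error) {
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(doc, &raw); err != nil {
		return nil, err
	}
	stats := make(map[string]Stat)
	for key, msg := range raw {
		if key == "cmdline" || key == "memstats" {
			continue
		}
		var s Stat
		if err := json.Unmarshal(msg, &s); err != nil {
			return nil, fmt.Errorf("decoding %q: %v", key, err)
		}
		stats[key] = s
	}
	return stats, nil
}

func main() {
	stats, err := parseStats([]byte(sample))
	if err != nil {
		panic(err)
	}
	for key, s := range stats {
		fmt.Printf("%s: measurement=%s tags=%v values=%v\n", key, s.Name, s.Tags, s.Values)
	}
}
```

Decoding into json.RawMessage first keeps the standard expvar entries, which have a different shape, from tripping up the Stat decoder.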

Kapacitor's /kapacitor/v1/debug/vars Endpoint

Kapacitor uses a mix of “standard” expvar format with simple top-level key-value pairs, and InfluxDB-formatted expvars for received writes underneath the kapacitor key.

Telegraf's InfluxDB Input Plugin

Telegraf’s InfluxDB input plugin supports ingesting InfluxDB-formatted expvars from a remote HTTP endpoint. It also handles memstats as a special case. For a single instance of InfluxDB, this yields only information that is redundant with the _internal database, but it is a straightforward approach to monitoring multiple InfluxDB instances in one central location.

Unlike InfluxDB and Kapacitor, Telegraf does not (currently) expose a /debug/vars HTTP endpoint.

Emitting InfluxDB-Formatted expvars in Your Own Application

I haven’t seen any standalone projects that allow you to emit InfluxDB-formatted expvars, but it doesn’t take much code to do that yourself if you understand the format. The entire file of our first, pure-expvar implementation is less than 50 lines, and the corresponding code to serve expvars over HTTP is only about 11 lines.
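For a sense of how little code that can be, here is one possible sketch built on the standard expvar package; it is not InfluxDB's own implementation. The counter widgetsBuilt, the key example:widgets, and the measurement and tag names are hypothetical.

```go
package main

import (
	"expvar"
	"fmt"
	"sync/atomic"
)

// widgetsBuilt is a hypothetical application counter.
var widgetsBuilt int64

// influxStat builds one entry with the name, tags, and values keys.
// expvar calls it each time the variable is rendered, so the values
// are always current.
func influxStat() interface{} {
	return map[string]interface{}{
		"name": "example",
		"tags": map[string]string{"kind": "widget"},
		"values": map[string]interface{}{
			"built": atomic.LoadInt64(&widgetsBuilt),
		},
	}
}

// render returns the JSON that /debug/vars would show for our entry.
func render() string {
	return expvar.Get("example:widgets").String()
}

func init() {
	// expvar.Func marshals the function's return value to JSON on demand.
	expvar.Publish("example:widgets", expvar.Func(influxStat))
}

func main() {
	atomic.AddInt64(&widgetsBuilt, 3)
	fmt.Println(render())
	// prints {"name":"example","tags":{"kind":"widget"},"values":{"built":3}}
}
```

Because the entry is published through the standard registry, it shows up in /debug/vars next to cmdline and memstats. The trade-off is the one described earlier: a variable published this way cannot be removed again.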

Have you implemented code to emit InfluxDB-formatted expvars, in Go or in any other language? Share your solutions over on this thread on community.influxdata.com.