Reproducing a Flaky Test in Go

Oftentimes, a test that occasionally fails in CI can be reproduced locally with a simple go test -count=N ./path/to/flaky/package… but once in a while, it just doesn’t repro locally. Or maybe the full test suite for that package takes too long and the test fails so rarely that you need a more precise way of zeroing in on the bad test.

It’s much better to confidently fix the test failure than to make a good guess and hope for the best. If you have to try to fix it again on another day, you will waste time regathering the context you had the first time around.

In this document I will detail the approaches I have found effective for methodically reproducing a test failure, drawn from the many hours I’ve spent tracking down flaky tests.

But first, let’s look at the common patterns that result in flaky tests.

Flaky test categories

Ultimately, flaky tests are nondeterministic tests. In my experience, the root cause of the flakiness generally falls into one of two categories.

Nondeterministic data

With flaky tests, there are two interesting types of nondeterministic data: data that is itself consistent but is accessed nondeterministically, and data that is generated randomly but is accessed deterministically. Of course, those two types are not mutually exclusive.

Nondeterministic access

A test that doesn’t access its data in the same way every time is fine, so long as the test’s assertions account for the data being in a different order.

Most Go developers are aware that map iteration order is random. Iterating a map to populate a slice, and then asserting against that slice’s order — especially when the map only has two or three elements — is a test pattern that will pass surprisingly often. Combine that with Go’s test caching and the human habit of blindly re-running a failed CI job, and you will have a perfectly annoying test failure that rarely pops up.
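
To make that concrete, here is a minimal sketch of the anti-pattern; the test data and names are invented for illustration:

users := map[string]bool{"alice": true, "bob": true}

var names []string
for name := range users {
  names = append(names, name)
}

// Bad: map iteration order is not guaranteed, so this assertion only
// passes when the runtime happens to visit "alice" first.
if names[0] != "alice" {
  t.Fatalf("expected alice first, got %v", names)
}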

Other than map iteration, goroutines performing concurrent work may finish in an arbitrary order. Perhaps you have two goroutines, one reading a very small file and one reading a very large file, and each sending some result on the same channel. Most of the time, the small file’s goroutine will finish and send its result first; if your test assumes that is always the case, then the test will sometimes fail.
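
A sketch of that assumption, where readFile and smallContents are hypothetical placeholders for whatever the test actually reads and expects:

results := make(chan string, 2)
go func() { results <- readFile("small.txt") }()
go func() { results <- readFile("huge.txt") }()

// Bad: the goroutines can finish in either order, so the small file's
// contents are not guaranteed to arrive first.
if got := <-results; got != smallContents {
  t.Fatalf("expected small file contents first, got %q", got)
}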

Nondeterministic generation

Making good use of random data in tests is an art unto itself. go-fuzz is an excellent tool for discovering bugs related to the handling of arbitrary input. Using random values in tests is a lightweight way to potentially discover bugs in a similar way, but the downside is that you only learn about the bug when the test happens to fail, which may be rare. For that reason, it’s important that you can easily get at the input that caused the failure.

The flaky test that sticks out most in my mind, one I tracked down perhaps six years ago, involved deserializing a randomly populated YAML file. We would infrequently see this test fail with a mysterious failure message, and running it again would always pass. We were randomly generating a string of hex characters for a particular value. Most of the time, the YAML would look like key: a1b2c3 and would be interpreted as a string…but every once in a while it would pick a sequence of decimal digits, then a single letter e, then more decimal digits. We didn’t surround the value with quotes, so the parser would interpret key: 12345e12 as a floating-point number instead of a string!

When using randomly generated values, make sure it’s easy to recover the input that caused the failure. Usually you can include the value in a call to t.Fatalf.
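
For example, with a hypothetical randomHex generator and a parseValue function under test:

input := randomHex(8) // hypothetical: returns 8 random hex characters

got, err := parseValue(input)
if err != nil || got != input {
  // Include the generated input so a rare failure can be reproduced.
  t.Fatalf("parsing input %q: got %q, err: %v", input, got, err)
}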

If it’s a more complicated test involving many files each containing some random content, I would at least put all the files in the same temporary directory. That way, if I need to reproduce a failure and inspect the files, I can just comment out the call to os.RemoveAll and add a t.Log(dirName) so I know where to look for the bad input. If you’re already working on locally reproducing an intermittent failure anyway, I don’t see anything wrong with making some temporary edits to the test function.
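
As a sketch of that setup, using os.MkdirTemp and a deferred cleanup:

dir, err := os.MkdirTemp("", "flakytest")
if err != nil {
  t.Fatal(err)
}
defer os.RemoveAll(dir) // comment out temporarily to keep the files around
t.Log(dir)              // so a failing run tells you where to look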

Timing-based tests

In my experience, tests that are sensitive to timing tend to cause flaky failures more frequently than the previously mentioned logic bugs around data order.

Usually it goes like this: your test starts a goroutine to do an operation that should complete quickly, perhaps in tens of milliseconds. You pick a reasonable timeout value — “If I don’t see a result in 100ms, the test has failed.” You run that test in a loop for 15 minutes on your machine and it passes every time. Then after merging that change to master, the test fails at least once the first day it’s running on CI. Someone bumps the timeout up to one second, and it still manages to fail a couple of times per week. Now what?

If you have a way to test a synchronous API instead of an asynchronous one — avoiding sensitivity to time — that is generally the best solution.

If you must test asynchronously, be sure to poll rather than do a single long sleep. Don’t do this:

go longOperation.Start()

// Bad: this will always eat up 5 seconds in test, even if the operation completes instantly.
time.Sleep(5 * time.Second)

if !longOperation.IsDone() {
  t.Fatal("didn't see result in time")
}

res := longOperation.Result()
if res != expected {
  t.Fatalf("expected %v, got %v", expected, res)
}

Instead, for APIs that let you check whether a call will block, I often use a pattern like:

go longOperation.Start()

deadline := time.Now().Add(5 * time.Second)
for {
  if time.Now().After(deadline) {
    t.Fatal("didn't see result in time")
  }

  if !longOperation.IsDone() {
    time.Sleep(100 * time.Millisecond)
    continue
  }

  res := longOperation.Result()
  if res != expected {
    t.Fatalf("expected %v, got %v", expected, res)
  }
  break // success: stop polling
}

If the API is blocking but accepts a cancelable context, you should use a reasonable timeout so that the test will fail more quickly than go test’s default 10-minute timeout:

go longOperation.Start()

ctx, cancel := context.WithTimeout(context.Background(), 5 * time.Second)
defer cancel()

res, err := longOperation.WaitForResult(ctx)
if err != nil {
  t.Fatal(err)
}
if res != expected {
  t.Fatalf("expected %v, got %v", expected, res)
}

If the API blocks but does not accept a context, you can write a small helper that runs the call in a goroutine and fails the test if it doesn’t complete within a given timeout.
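
One possible shape for such a helper is sketched below; the name and result type are assumptions, so adapt the signature to the API under test:

// waitOrFail runs fn in a goroutine and fails the test if it hasn't
// returned within timeout. This is a sketch; adjust the result type to
// match the API under test.
func waitOrFail(t *testing.T, timeout time.Duration, fn func() string) string {
  t.Helper()

  done := make(chan string, 1)
  go func() { done <- fn() }()

  select {
  case res := <-done:
    return res
  case <-time.After(timeout):
    t.Fatal("didn't see result in time")
    return "" // unreachable: t.Fatal stops the test
  }
}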

Reproducing a flaky test on your workstation

You’ve seen the test fail a couple times on CI. It usually passes if you re-run the CI job, but instead of kicking the can down the road, let’s fix it now. Before you can confidently say you’ve fixed the test, you need to confidently reproduce the test locally.

Set up your test loop

First, focus on the exact package that failed. That is, you want to run go test ./mypkg, not go test ./....

Then, use -run to focus on the exact test that failed. Usually I would just copy and paste the name of the failing test, e.g. go test -run=TestFoo ./mypkg. However, note that the -run flag accepts a regular expression, so if your test name is also the prefix of another test name, you can ensure you run only the exact match by anchoring the pattern: go test -run='^TestFoo$' ./mypkg.

If you run that more than once, you will surely notice that recent versions of Go cache the test results. Obviously we don’t want that while we are trying to reproduce a flaky test. You might use -count=1 if you just wanted to run the full test suite without caching, but for reproducing a flake you should pick a larger number. That number will vary by test; my personal preference is a count that completes in around 10 seconds. Let’s say we’ve settled on 100 — your command now looks like go test -run=TestFoo -count=100 ./mypkg.

The -count flag will run the specified number of iterations regardless of how many runs fail. Most of the time, you don’t get any extra information from multiple failures in a single run. For that reason, I prefer to use the -failfast flag so that the test process stops after the first failed run.

Now, we can throw this in a very simple bash loop: while go test -run=TestFoo -count=100 -failfast ./mypkg; do date; done. You could put anything inside the body of the loop, but I like to see the date go by in the output so I can tell that the loop hasn’t deadlocked. (And if the test failure you’re trying to reproduce is itself a deadlock, a loop that completes in about 10 seconds pairs well with -timeout=20s so you don’t have to wait around 10 minutes to see a stack trace.)

At this point, if you can reproduce the failure within a minute or less, you’re in good shape to fix the test. If it takes much longer than that to reproduce, you can shave off a bit more time by compiling the test package before the loop. Under the hood, go test will compile and run the test package, so we can avoid some repeated work by compiling it ourselves. When we execute our compiled test package directly, we need to use the test. prefix on the test-specific flags, like so: go test -c ./mypkg && while ./mypkg.test -test.run=TestFoo -test.count=100 -test.failfast; do date; done.

When that test loop still doesn't reproduce the flaky test

Use the data race detector

Sometimes the test failure involves a hard time-based deadline, like waiting one second for something to happen. Using the -race flag to compile the test with the race detector enabled will generally slow down execution, possibly enough to reproduce the failure. And if you happen to detect any new data races along the way, that’s great too.
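
For example, the earlier loop becomes while go test -race -run=TestFoo -count=100 -failfast ./mypkg; do date; done, or pass -race to go test -c if you’re running the compiled test binary.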

Stop focusing on the test

This is more helpful for data-ordering flakiness than it is for timing-based flakiness.

In very rare circumstances, the flaky test’s failure has to do with pollution from another test. You might drop the -run flag altogether and run that loop for a while to see if the test fails. Then you can gradually skip more and more of the passing tests until you identify which one(s) are causing the pollution for the flaky test. Adding a t.Skip call works, but for this kind of temporary change, I usually rename TestBar to xTestBar so that Go’s test detection just stops noticing the test altogether and I don’t have to see the SKIP lines in the verbose output.

Throttle the CPU usage of the process

Some flaky tests only seem to show up in a resource-contended environment like your CI server, and never on an otherwise idle system like your workstation. For reproducing those kinds of failures, we have had good luck with cpulimit, which should be available in your standard package manager (e.g. brew install cpulimit or apt install cpulimit).

There’s nothing magical about cpulimit. If you’ve ever pressed control-Z in your terminal to stop a running process and restored it with fg, then you’re basically familiar with how cpulimit works: it repeatedly sends SIGSTOP to pause the process and SIGCONT to restore it again, effectively limiting how much CPU time the process is allowed to use. For most use cases, this is a close enough approximation of a limited CPU; and the cpulimit tool has the benefit of being cross-platform and easy to use.

The interesting flags for cpulimit are:

  • -l/--limit, to control how much CPU is available
  • -i/--include-children to apply to subprocesses of the target process
  • -z/--lazy to exit if the target process dies

Choosing an appropriate value for --limit is mostly trial and error. Too low a value may cause excessive run durations and unintended timeouts, while too high a value won’t necessarily reproduce the issue. Keep in mind that the maximum value is 100 times the number of available cores on the system, so -l 100 doesn’t represent 100% of available CPU, but rather 100% of one core.

Then, you typically want to run your compiled test package under cpulimit, like cpulimit -l 50 -i -z ./mypkg.test -test.run=TestFoo -test.count=100. As mentioned above, prefer to build your test package out of band with go test -c without using cpulimit. If you were to run cpulimit -l 50 go test ./mypkg -run=TestFoo -count=100 — note the missing -i flag — you would limit go test but the mypkg.test subprocess would run without limit.

Adjust the concurrency of the Go runtime

You can set the GOMAXPROCS environment variable as explained in the runtime package documentation:

The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit.

When attempting to reproduce a flaky test, this setting is typically useful only when you already suspect the flakiness to be related to concurrency.

There are generally only three interesting settings for GOMAXPROCS:

  • One, which will allow only one goroutine to execute at any moment in time
  • The default setting, which is the number of logical CPUs available
  • Double (or more) the number of CPUs on your system, to try to introduce more contention for shared resources

In my experience, flaky tests that seem to be affected by GOMAXPROCS tend to never reproduce for one extreme, occasionally reproduce at default, and often reproduce at the other extreme. It depends on the particular failure mode; some tests may be flaky when there is too much contention, and others may be flaky when there aren’t enough goroutines processing work.
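
For example, with the compiled test binary from earlier: GOMAXPROCS=1 ./mypkg.test -test.run=TestFoo -test.count=100 -test.failfast, then the same command with the default value, and again with GOMAXPROCS set to double your core count.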

Conclusion

Reproducing flaky tests is part science and part art. I hope that these tips help you next time you’re dealing with a test that keeps failing and blocking CI at inopportune times.