Fixing Flaky Tests

A Data Driven Approach to Triaging Race Conditions

Aug. 2, 2022, 2:09 p.m.

The Test Platform team at OpenShift recently had a flaky end-to-end test that was failing around 20% of the time. The test would only fail on our CI servers and always succeeded when running locally. It was a complex test verifying the correct functionality of our pod-scaler, which is a Kubernetes workload admission controller.

To verify the correctness of this complex component, it is necessary to set up a Prometheus server primed with metrics, as well as the producer component, which uses those metrics to fill the pod-scaler cache with data for mock workloads. Once the data is in place, the actual admission consumer can be started and fed mock workloads in order to verify its responses.

Due to its complex nature, we have a number of helper functions set up to run all of these components. However, some of these functions contain a bit of synchronization code to make sure that everything is ready when the actual test begins:

// Block until Prometheus reports that it is ready at its /-/ready endpoint.
prometheus.Ready(t, func(o *testhelper.ReadyOptions) {
    o.ReadyURL = prometheusHost + "/-/ready"
    o.WaitFor = 5
})
// ...and then wait a little longer for good measure.
time.Sleep(5 * time.Second)

Even though the "Ready" function waits for the Prometheus server to report that it is ready we still find it necessary to wait for an extra 5 seconds.

When I first noticed the flakiness in the e2e test, I assumed there was some synchronization problem that needed to be resolved to make the test pass consistently. Noting the above, I figured that simply waiting for some additional time after starting the admission consumer would clear up the flakes. I created the PR with the 5-second sleep and waited for the branch to build and be tested on the CI server. Unfortunately, building the code and running through all of the e2e tests takes around 45 minutes, and it took around 5 retries before I received another failing run. This meant that not only did my fix not work, but I was no closer to solving the flakiness.

I slogged through a couple more rounds of ineffective solutions, each followed by a lengthy build-test cycle, before taking a step back and really digging into the root of the issue. Since I couldn't get the failure to occur locally, I decided that the only reasonable thing to do was to add a bunch of logging, like the following, to see where failing and passing runs diverged:

logger.Infof("Initializing admission webhook server with %d loaders.", len(loaders))
...
logger.Debugf("recommendation exists for: %s", pod.Spec.InitContainers[i].Name)
...
logger.Debugf("Digesting %d identifiers.", len(data.DataByMetaData))
...
metaLogger.Trace("merged all fingerprints")
...
logger.Debug("Finished digesting new data.")

Thanks to this logging, I was eventually able to discover that on the failing runs there was missing data for the memory consumption metric (one of the six metrics the pod-scaler cares about). This matched nicely with the result of the failed runs, as the memory mutation was missing from the response. Continuing to dig through the logs led to the discovery that the

logger.Debug("Ready to serve.")

message never appeared in the failing runs. On top of that, no matter how long I waited for the server to report as ready, it never would on these failing runs.

I was eventually able to discover that the problem was due to a race condition: occasionally, the producer would add the data to the cache before the admission consumer had subscribed to that particular metric. Since the producer was only configured to produce the data once for this e2e test, the consumer would never find the data and would never report as ready. I made a simple fix so that the data would not be marked as loaded until there was at least one subscriber for it, and then forced a reload of the data after all of the subscriptions had been made.
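
To make that concrete, here is a heavily simplified, hypothetical sketch of the idea behind the fix; the cache type, its methods, and the metric names are invented for illustration and are not the real pod-scaler code. Data that arrives before anyone has subscribed is kept but not marked as loaded, and a reload after all subscriptions are registered re-publishes it:

package main

import (
    "fmt"
    "sync"
)

// cache is a hypothetical stand-in for the pod-scaler cache: metric data only
// counts as "loaded" once at least one subscriber exists to receive it.
type cache struct {
    mu          sync.Mutex
    subscribers map[string][]chan string // metric name -> subscriber channels
    data        map[string]string        // metric name -> latest produced data
    loaded      map[string]bool          // metric name -> delivered to a subscriber?
}

func newCache() *cache {
    return &cache{
        subscribers: map[string][]chan string{},
        data:        map[string]string{},
        loaded:      map[string]bool{},
    }
}

// subscribe registers interest in a metric and returns a channel for updates.
func (c *cache) subscribe(metric string) <-chan string {
    c.mu.Lock()
    defer c.mu.Unlock()
    ch := make(chan string, 1)
    c.subscribers[metric] = append(c.subscribers[metric], ch)
    return ch
}

// publish stores data for a metric, but only marks it loaded if someone is
// already subscribed; an early publish is kept rather than silently lost.
func (c *cache) publish(metric, value string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.data[metric] = value
    if len(c.subscribers[metric]) == 0 {
        return // nobody listening yet: keep the data, don't mark it loaded
    }
    for _, ch := range c.subscribers[metric] {
        select {
        case ch <- value:
        default:
        }
    }
    c.loaded[metric] = true
}

// reload re-publishes anything that was produced before subscribers existed.
func (c *cache) reload() {
    c.mu.Lock()
    pending := map[string]string{}
    for metric, value := range c.data {
        if !c.loaded[metric] {
            pending[metric] = value
        }
    }
    c.mu.Unlock()
    for metric, value := range pending {
        c.publish(metric, value)
    }
}

func main() {
    c := newCache()
    c.publish("memory", "128Mi") // producer races ahead of the consumer
    ch := c.subscribe("memory")  // consumer subscribes too late...
    c.reload()                   // ...so force a reload once subscriptions exist
    fmt.Println(<-ch)            // prints "128Mi" instead of hanging forever
}

With an arrangement along these lines, the consumer's readiness no longer depends on winning the race against the producer: whatever was produced early is simply replayed once the subscriptions exist.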

The solution itself was actually not that complex, but submitting one band-aid fix after another and waiting through the long feedback loop each time made it an incredibly time-intensive pursuit. It is imperative to gather data about the root cause of a complex race condition like this, using any means available, before attempting to fix it, even if the only means available is a bunch of debug-level log statements.
