The stability builds execute a suite of automated web-based user interface tests against mocked back-end test data. They run to a schedule and, on a good night, we see six successful builds:
The builds do not run sequentially. At 1am and 4am we trigger two builds in close succession. These execute in parallel so that we use more of our Selenium Grid; running at higher concurrency can give early warning of problems caused by load and thread contention.
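In essence, the parallel runs open several remote browser sessions against the grid at the same time. Here is a minimal sketch of that idea in Python, assuming a hypothetical Selenium Grid hub address and application URL (our real suite is driven by the build pipeline rather than a standalone script):

```python
# Minimal sketch: open several remote sessions on a Selenium Grid in parallel.
# The hub address, application URL and worker count are illustrative only.
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

GRID_URL = "http://grid.example.com:4444/wd/hub"  # hypothetical grid hub


def run_smoke_check(check_name: str) -> str:
    """Start one remote browser session on the grid and load the app under test."""
    driver = webdriver.Remote(command_executor=GRID_URL, options=Options())
    try:
        driver.get("https://app.example.com")  # placeholder application URL
        return f"{check_name}: loaded '{driver.title}'"
    finally:
        driver.quit()


if __name__ == "__main__":
    # Running several sessions at once exercises more grid nodes in parallel,
    # which is where load and thread-contention problems tend to surface first.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for result in pool.map(run_smoke_check, [f"check-{i}" for i in range(4)]):
            print(result)
```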
When things are not going well, we rarely see six failed builds. As problems emerge, the stability test result trend starts to look like this:
In a suite of over 250 tests, there might be a handful of failures. The number of failing tests, and the specific tests that fail, will often vary between builds. Sometimes there is an obvious pattern, e.g. tests that use an image picker dialog. Sometimes there appears to be no common link.
Why don't we catch these problems during the day?
These tests are part of a build pipeline that includes a large unit test suite. In the build that is run during the day, the test result trend is skewed by unit test failures. The developers are actively working on the code and using our continuous integration for fast feedback.
Once the unit tests are successful, intermittent issues in the user interface tests are often resolved in a subsequent build without code changes. This means that the development team are not blocked: once the build executes successfully, they can merge their code.
The overnight stability build is a collective conscience for everyone who works on the product. When the build status changes, a notification is pushed into the shared chat channel:
Each morning someone will look at the failed builds, then share a short comment about their investigation in a thread of conversation spawned from the original notification message. The team decide whether additional investigation is warranted and how the problem might be addressed.
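The notification described above only fires on a change of state, which keeps the channel quiet while the build stays green. The mechanics amount to something like this sketch in Python, assuming a Slack-style incoming webhook; the webhook URL, status strings and build URL are placeholders:

```python
# Minimal sketch: post to the shared channel only when the build status flips.
# Assumes a Slack-style incoming webhook; all URLs and statuses are placeholders.
import json
import urllib.request

WEBHOOK_URL = "https://hooks.example.com/services/T000/B000/XXXX"  # placeholder


def notify_if_changed(previous_status: str, current_status: str, build_url: str) -> None:
    """Send a chat message only when the build status changes state."""
    if previous_status == current_status:
        return  # no state change, stay quiet
    message = f"Stability build is now {current_status} (was {previous_status}): {build_url}"
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)


if __name__ == "__main__":
    # Example: a passing build after a failure produces exactly one notification.
    notify_if_changed("FAILURE", "SUCCESS", "https://ci.example.com/job/stability/42/")
```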
It can be difficult to prioritise technical debt tasks in test automation. The stability build makes problems visible quickly, to a wide audience. It is rare that these failures are systematically neglected. We know from experience that ignoring the problems has a negative impact on the cycle time of our development teams. When it becomes part of the culture to repeatedly trigger a build in order to get a clean set of test results, everything slows down and people become frustrated.
If your user interface tests are integrated into your pipeline, you may find value in adopting a similar approach to stability. We see benefits in detecting problems early, raising awareness of automation across a broad audience, and creating shared ownership of issue resolution.