Tuesday 30 January 2018

A stability strategy for test automation

As part of the continuous integration strategy for one of our products, we run stability builds each night. The purpose is to detect changes in the product or the tests that cause intermittent issues, which can be obscured during the day. Stability builds give us test results against a consistent code base during a period when our test environments are not under heavy load.

The stability builds execute a suite of web-based user interface automation against mocked back-end test data. They run to a schedule and, on a good night, we see six successful builds:

The builds do not run sequentially. At 1am and 4am we trigger two builds in close succession. These execute in parallel so that we use more of our Selenium Grid, which can give early warning of problems caused by load and thread contention.

When things are not going well, we rarely see six failed builds. As problems emerge, the stability test result trend starts to look like this:

In a suite of over 250 tests, there might be a handful of failures. The number of failing tests, and the specific tests that fail, will often vary between builds. Sometimes there is an obvious pattern, e.g. tests that use an image picker dialog. Sometimes there appears to be no common link.
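One way to hunt for a common link is to aggregate the failing test names across a night's builds and see which tests fail most often. A minimal sketch of that idea — the build report shape and test names here are invented for illustration, not our real report format:

```javascript
// Count how often each test fails across a night's stability builds,
// to surface patterns in intermittent failures.
// The `builds` structure below is a hypothetical example.
function countFailures(builds) {
  const counts = {};
  for (const build of builds) {
    for (const test of build.failedTests) {
      counts[test] = (counts[test] || 0) + 1;
    }
  }
  // Sort so the most frequently failing tests appear first.
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}

const nightly = [
  { build: 1, failedTests: ['ImagePickerDialog', 'SaveDraft'] },
  { build: 2, failedTests: ['ImagePickerDialog'] },
  { build: 3, failedTests: [] },
];
console.log(countFailures(nightly));
// [ [ 'ImagePickerDialog', 2 ], [ 'SaveDraft', 1 ] ]
```

A test that fails in most builds points at a stable underlying cause; one that fails once in six is more likely load or timing related.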

Why don't we catch these problems during the day?

These tests are part of a build pipeline that includes a large unit test suite. In the build that is run during the day, the test result trend is skewed by unit test failures. The developers are actively working on the code and using our continuous integration for fast feedback.

Once the unit tests are successful, intermittent issues in the user interface tests are often resolved in a subsequent build without code changes. This means that the development team are not blocked: once the build executes successfully, they can merge their code.

The overnight stability build is a collective conscience for everyone who works on the product. When the build status changes state, a notification is pushed into the shared chat channel:
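The "changes state" detail matters: a notification fires only when a build's result differs from the previous run, not on every failure, which keeps the channel quiet while a known problem persists. A minimal sketch of that check — the function and status values are illustrative, not our actual CI tooling:

```javascript
// Decide whether a chat notification should be pushed:
// only fire when the build result differs from the previous build's result.
function shouldNotify(previousStatus, currentStatus) {
  return previousStatus !== currentStatus;
}

shouldNotify('SUCCESS', 'FAILURE'); // true  -> build just broke, notify
shouldNotify('FAILURE', 'FAILURE'); // false -> still broken, stay quiet
shouldNotify('FAILURE', 'SUCCESS'); // true  -> recovered, notify
```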

Each morning someone will look at the failed builds, then share a short comment about their investigation in a thread of conversation spawned from the original notification message. The team decide whether additional investigation is warranted and how the problem might be addressed.

It can be difficult to prioritise technical debt tasks in test automation. The stability build makes problems visible quickly, to a wide audience. It is rare that these failures are systematically neglected. We know from experience that ignoring the problems has a negative impact on cycle time of our development teams. When it becomes part of the culture to repeatedly trigger a build in order to get a clean set of test results, everything slows down and people become frustrated.

If your user interface tests are integrated into your pipeline, you may find value in adopting a similar approach to stability. We see benefits in early detection, raising awareness of automation across a broad audience, and creating shared ownership of issue resolution.


  1. "These tests are part of a build pipeline that includes a large unit test suite. In the build that is run during the day, the test result trend is skewed by unit test failures. The developers are actively working on the code and using our continuous integration for fast feedback."

    So this assumes that the developers have fixed the defects found during unit testing on that day?

    1. The build has to pass on their pull request prior to merge back into master. It includes both the unit tests and UI tests in a simple pipeline. If a unit test is broken, the build will fail within 10 minutes. If a UI test fails, the feedback is received within 30 minutes. In either situation, the development team (developers and testers) work to fix the problems that have been caused by their change.

      Intermittent issues in the UI tests can make it to master as they will not reliably fail in the pull request build. The stability job helps us to catch this type of problem and resolve it as technical debt.
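That fail-fast ordering — unit tests first for quick feedback, UI tests only once they pass — can be sketched as a simple staged pipeline. The stage names and timings in the comments are illustrative, not a real CI configuration:

```javascript
// Sketch of a pull request gate: run stages in order and stop
// at the first failure, so cheap stages give feedback first.
function runPipeline(stages) {
  for (const stage of stages) {
    if (!stage.run()) {
      return { status: 'FAILED', failedStage: stage.name };
    }
  }
  return { status: 'SUCCESS', failedStage: null };
}

const result = runPipeline([
  { name: 'unit-tests', run: () => true },  // ~10 minute feedback
  { name: 'ui-tests', run: () => false },   // ~30 minute feedback
]);
// result: { status: 'FAILED', failedStage: 'ui-tests' }
```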

  2. Hello Katrina, thanks for sharing your strategy to stabilize test automation - a huge topic in many projects. I have two questions on the mocked back-end test data:

    - Are the same tests also used against a fully integrated system, and do you double-check if a test fails against the mocked data but passes against the real back-end? This could be a hint that the data should be updated. Or the situation could happen vice versa ... so you have a false positive test.

    - Is the mocked back-end implemented by yourselves, or are you using a framework or commercial software?

    Best, Sven

    1. We rarely hit issues with our mocking approach, though we do not run this exact automation against the real services. Our services are versioned and, when changes are made, they are released against a new version of the service. If there is a change driven through another product, then this product wouldn't pick up the change until they were developing against the same version of the service. They would update the tests as their team worked on the next iteration of the service that included any previous changes. This approach means that we don't hit situations where the services change without the product, or vice versa. (I hope that makes sense).

      We have hit issues with using the mock data for a dual purpose. The implementation is our own framework in Node JS. We use the same data for both test automation and our product demo, which can create conflicts.
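      For readers wondering what a mocked back-end can look like in practice, the core of one is often just canned fixture data keyed by route. A minimal sketch in the same spirit - the routes and payloads here are invented for illustration, not the actual framework described above:

      ```javascript
      // A minimal mock back-end: fixture data keyed by request path.
      // Routes and payloads are hypothetical examples.
      const fixtures = {
        '/api/users': [{ id: 1, name: 'Test User' }],
        '/api/projects': [{ id: 42, title: 'Stability Build' }],
      };

      // Return the canned response for a path, or a 404-style
      // result when no fixture exists for that route.
      function mockResponse(path) {
        if (path in fixtures) {
          return { status: 200, body: fixtures[path] };
        }
        return { status: 404, body: { error: 'no fixture for ' + path } };
      }
      ```

      Wrapping a lookup like this in an HTTP server gives the UI tests a fast, deterministic back-end, which is exactly why the stability builds can attribute intermittent failures to the product or the tests rather than to flaky data.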