The boundaries of performance testing
Before beginning the journey towards continuous delivery, the Expedia performance testing team had two conversations with the wider organisation to decide:

- When is the application ready for performance testing?
- What performance testing results mean the application is ready for production?
These questions generated healthy discussion, as opinions varied widely. The agreed answers established the boundaries of focus for Diana's team as they changed their approach to performance testing.
Looking at their existing approach, the performance testers felt that most of their time was being lost in the slow feedback loop between development and performance testing. These two activities are so far removed from one another in a traditional development lifecycle, with several phases of testing happening between them, that raising and remedying performance problems takes a long time. They felt that where performance testing delayed the release to production, it was often due to the time being spent in these interactions.
The performance team decided to introduce performance testing of individual components in a continuous integration pipeline to help reduce the number of problems being discovered during integrated performance testing later in the lifecycle. Diana observed that as the team established this continuous integration pipeline they started to work "more with developers than testers".
Creating a performance pipeline
The "functional folk" had already built a pipeline that compiled the web application, ran unit tests, deployed to a functional test environment, ran basic regression tests and created a release build. Diana's team decided to create a separate performance pipeline that logically followed on from the functional pipeline.

The performance pipeline took the release build from the functional pipeline and deployed it to an environment, then discovered the version of the same application that was currently in production and deployed it to a different environment with the same hardware specification. A two-hour performance test was then run in parallel against both versions of the application and a comparative report generated.
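The flow of that comparative pipeline can be sketched as follows. This is a minimal illustration, not Expedia's actual tooling: every function name, environment name and metric value here is an assumption standing in for real deployment and load-testing infrastructure.

```python
"""Sketch of a comparative performance pipeline: deploy the candidate
build and the current production version side by side, load test both,
and produce a comparative report. All names and numbers are illustrative."""

def discover_production_version():
    # Placeholder: in reality this would query a deployment registry.
    return "app-1.4.2"

def deploy(build, env):
    # Placeholder for deployment to an environment with a fixed
    # hardware specification; returns an environment handle.
    return {"build": build, "env": env}

def run_performance_test(environment):
    # Placeholder for a two-hour load test; canned p95 latencies
    # stand in for measured results.
    canned = {"perf-candidate": 210.0, "perf-baseline": 200.0}
    return {"p95_ms": canned[environment["env"]]}

def comparative_report(baseline, candidate):
    delta = candidate["p95_ms"] - baseline["p95_ms"]
    return {"p95_delta_ms": delta, "regression": delta > 0}

candidate_env = deploy("app-1.5.0-rc1", env="perf-candidate")
baseline_env = deploy(discover_production_version(), env="perf-baseline")

# The real pipeline ran both two-hour tests in parallel; sequential here.
report = comparative_report(run_performance_test(baseline_env),
                            run_performance_test(candidate_env))
print(report)  # {'p95_delta_ms': 10.0, 'regression': True}
```

Comparing the candidate against the production build on identical hardware, rather than against a fixed target, means the report measures relative change, which is robust to environment differences between test and production.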
The team was extremely fortunate to have access to a large pool of production-like servers on which to run their performance tests, so being able to run tests in parallel and at scale wasn't an issue. When asked whether the same approach would work on smaller test environments, Diana felt that it would, so long as the production traffic profile was appropriately scaled to the test environment's hardware specification.
The two-hourly builds generated a lot of information, in fact too much to be useful. The performance team decided to save the data from the two-hour performance tests and then run automated analysis that detected trending degradation across the combined results of three builds.
The thresholds at which to report performance decay were set in consultation with the business and were high enough to alleviate the risk of false positives in the report caused by developers simply adding functionality to the application. Diana noted that it was ultimately a business decision whether to release, and that where a new piece of functionality caused a performance degradation that resulted in failing performance tests the business could still opt to release it.
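The idea of flagging only sustained degradation above an agreed threshold can be sketched in a few lines. The metric, threshold value and function name below are illustrative assumptions; the point is that requiring a trend across three builds filters out one-off noise from a single run.

```python
"""Sketch of trend-based degradation detection across three consecutive
builds, with a threshold agreed with the business. Numbers are illustrative."""

def trending_degradation(p95_history, threshold_pct=10.0):
    """Flag a regression only when the 95th-percentile response time
    worsens in each of the last three builds AND the overall increase
    exceeds the agreed threshold percentage."""
    if len(p95_history) < 3:
        return False
    a, b, c = p95_history[-3:]
    sustained = a < b < c                    # worse in every build
    increase_pct = (c - a) / a * 100.0
    return sustained and increase_pct > threshold_pct

# A single noisy build does not trip the alarm...
print(trending_degradation([200, 230, 205]))   # False
# ...but three successively worse builds beyond the threshold does.
print(trending_degradation([200, 215, 225]))   # True
```

Setting `threshold_pct` in consultation with the business reflects the point Diana made: the tooling reports degradation, but whether a degradation is acceptable remains a business decision.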
Even with trend-based analysis, it took some balancing to get the email notifications from these two-hourly builds right. They were initially sent only to people within the development team who subscribed. As the performance analysis was improved and refined, the notifications became increasingly relevant and started to be delivered to more people.
Performance testing the pieces
The performance pipeline was designed to test the deployed web application in isolation rather than in an integrated environment. It made use of a number of stubs so that any degradation detected would likely relate to changes in the application rather than instability in third-party systems.

In addition to testing the web application, the performance team created continuous integration pipelines for the database and services layer underneath the UI. Many of these lower-level performance tests used self-service Jenkins jobs that the developers could use to spin up a cloud instance suited to the size of the component, deploy the component in isolation, run a performance test, tear down the environment and provide a report.
Diana also mentioned A/B performance testing where the team would deploy a build with a feature flag switched on, and the same build with the same feature flag switched off, then run a parallel performance test against each to determine whether the flag caused any significant performance problems.
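The A/B comparison Diana described can be sketched as a simple function: run the same build with the flag on and off, and report the relative impact. The flag name, significance threshold and canned figures below are illustrative assumptions.

```python
"""Sketch of A/B performance testing with a feature flag: the same
build is tested twice, once with the flag on and once off, and the two
runs are compared. Names and figures are illustrative."""

def flag_impact(run_test, build, flag, threshold_pct=5.0):
    """Return the relative p95 latency impact of enabling `flag`."""
    off = run_test(build, flags={flag: False})
    on = run_test(build, flags={flag: True})
    impact_pct = (on - off) / off * 100.0
    return {"impact_pct": impact_pct,
            "significant": abs(impact_pct) > threshold_pct}

# Canned p95 latencies stand in for real parallel test runs.
def fake_run(build, flags):
    return 212.0 if flags["new_search"] else 200.0

print(flag_impact(fake_run, "app-1.5.0", "new_search"))
```

Because both runs use the identical build and environment, any difference can be attributed to the flagged feature itself rather than to other changes in the release.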
Integrated performance
The performance testing team retained their traditional integration performance tests as part of the release to production, but with the presence of earlier performance testing in the development process this became more of a formality. Fewer problems were discovered late in the release process.
Diana estimated that these changes represented about two years of work for two to three people. She commented that it was relatively easy to set up the tests and pipelines, but difficult to automate analysis of the results.
Ultimately Diana was part of taking the Expedia release process from monthly to twice per week. I imagine that their journey towards continuous delivery continues!