This growth is starting to create some interesting problems in the execution of our test automation. In particular for our web-based retail banking application, which is a relatively young product that has had test automation embedded in the development approach since the very beginning.
Alongside a comprehensive unit test suite, we've been using Selenium WebDriver to execute tests against Firefox. We call these tests our "automated acceptance suite" (AAS) or "node tests", which is a reference to the mock server technology that these tests execute against.
In the beginning the application was small and the node tests that ran alongside it were quick. As the product has grown we've added more tests, so they take longer to execute. When the fast feedback provided by our automation was no longer fast enough, we switched our tests from single thread to parallel execution.
In the beginning there was just a single development team and the node tests ran every time that a change was made. As the number of teams has grown the number of changes being made has increased, so the tests are being executed more frequently. When our build queues started to exceed reasonable lengths, we switched from a dedicated continuous integration hardware to docker containers that increased the number of builds we could execute in parallel.
Our solution to problems introduced by growth has been to do more things at once.
To get the tests to run faster we switched the test implementation to parallel execution.
To get the build queues to be shorter we switched the infrastructure to parallel execution.
These were good solutions for us. But now we're coming to the point where we can't do any more things at once with what we have. To illustrate, compare what was running on our build server against what is running there now:
In the beginning we had dedicated hardware. It ran a node server to return mock responses, a web server for our product, and the tests that opened a single Firefox window to execute against.
In our current state we have four active docker containers. Each runs a node server, a web server, and the tests that open four Firefox windows to execute against.
In our current state we're hitting the limits of what our infrastructure can do. This is manifesting in two types of problem that are causing a lot of frustration as they fundamentally impact on two key measures for the usefulness of automation: speed and stability.
Our current state can be slow, particularly when there are four builds executing at once and the hardware is fully loaded. Our overnight build time is approximately 30 minutes. By contrast, when a build executes during business hours it takes approximately 50 minutes.
I find it easiest to explain why this happens using an analogy. Imagine a horse towing a cart with four large pumpkins in it. The horse can trot down the street quite happily, relatively unencumbered by its load. Now imagine the same horse towing a cart with 28 large pumpkins in it. The horse can still move the cart, but it won't be able to travel at the same pace that it did with a lighter load. It may trudge rather than trot.
Our overnight build is carried by the lightly loaded horse as it may be the only build active on our hardware. Our build during business hours is carried by the heavily-laden horse as many builds run at once. The time taken to complete a build alters accordingly.
The instability we've seen comes partly from this variable speed. There's a particular case where we look for a success notification that is only displayed for a fixed duration. When the timing to complete the action that triggers this notification is variable, it becomes frustrating to verify.
But we've also had stability problems with the four Firefox browsers running on a single display. Some failures are caused by tests running in parallel that fight for focus e.g. attempting to confirm a payment via a modal dialog. Others are attributed to two different tests that simultaneously attempt to hover and click the mouse e.g. editing an account image. When these clashes occur, one of the tests involved will usually fail.
Our operations team ran some diagnostics on the existing hardware to determine what made it slow. They identified which processes were chewing up the most system resources or the largest pumpkins on the cart. It turned out that there was a clear single culprit: Firefox.
Enter Selenium Grid.
Selenium Grid enables a distributed test execution environment. What this means in our case is that we can move all of the Firefox instances out of our docker containers. This will significantly lighten the load on our existing continuous integration infrastructure:
In the proposed future state, our tests will trigger to the Selenium Grid Hub on our cloud-based infrastructure. The hub will have connectivity to a pool of Selenium Grid Nodes. Instead of having multiple Firefox windows open on a single display, we're provisioning each node in a dedicated container with a single browser.
Each grid node will know where it was triggered from, as the browser will still open the web application that is running on the existing docker architecture. This does mean that we are introducing network latency into each of our WebDriver interactions, so they'll be slower than on local hardware. But the distributed architecture should give us enough advantages that we still end up with a faster solution overall.
Our hope is that this proposed future will address our existing speed and stability issues. Increasing the system resource available through the introduction of hardware should help us to get consistent build times, regardless of the time of day. And having each Firefox browser in its own dedicated container should avoid any display contention.
We have a working prototype for the proposed future state and early signs are promising. I'm looking forward to turning the vision into reality and hope that it will bring benefits that we are searching for.