We've also seen a noticeable jump in the number of pull requests that are successfully merged back to our master branch each day. This gives some weight to the idea that our rate of application code change was previously impeded by our test infrastructure.
The increase in volume occasionally causes a problem when two feature branches are merged back to master in quick succession. Our tests fail on the second build of the master branch post-merge.
To illustrate, imagine that there are two open pull requests for two feature branches: orange and purple. We can trigger multiple pull request (PR) builds in parallel, so the two delivery teams who are behind these feature branches can receive feedback about their code simultaneously.
When a PR build passes successfully and the code has been through peer review, it can be merged back to the master branch. Each time the master branch changes it triggers the same test suite that executes for a pull request.
We do not trigger multiple builds against master in parallel. If two pull requests are merged in quick succession the first will build immediately and the second will trigger a build that waits for the first to complete before executing. Sometimes the second build will fail.
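Our CI tool handles this queueing for us, but as a rough sketch of the behaviour described above, the Python below runs the two pull request builds concurrently while serialising builds of master behind a lock, so the second merge's build waits for the first to finish. The branch names and the simulated test suite are invented for illustration.

```python
import threading
import time

master_build_lock = threading.Lock()  # only one master build runs at a time


def run_test_suite(description):
    # Placeholder for the real test suite; sleep stands in for execution time.
    print(f"Building {description}...")
    time.sleep(1)
    print(f"Finished {description}")


def build_pull_request(branch):
    # PR builds may run in parallel with anything else.
    run_test_suite(branch)


def build_master(merge_description):
    # Master builds queue behind the lock: the second waits for the first.
    with master_build_lock:
        run_test_suite(f"master ({merge_description})")


if __name__ == "__main__":
    # Two feature branches build in parallel...
    pr_builds = [threading.Thread(target=build_pull_request, args=(branch,))
                 for branch in ("feature/orange", "feature/purple")]
    for t in pr_builds:
        t.start()
    for t in pr_builds:
        t.join()

    # ...then both are merged in quick succession; the second master build
    # is the first to contain orange and purple together.
    master_builds = [threading.Thread(target=build_master, args=(merge,))
                     for merge in ("after merging orange",
                                   "after merging orange and purple")]
    for t in master_builds:
        t.start()
    for t in master_builds:
        t.join()
```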
1. Failing tests after multiple PR merges to master
As the person who had driven sweeping changes to our test infrastructure, I assumed the first time this happened that the test automation was somehow faulty. The real issue was that the code changes in orange and purple, while not in conflict with each other at a source code level, caused unexpected problems when put together. The failing tests reflected this.
We hadn't seen this problem previously because our pull requests were rarely merged in such quick succession. They were widely spaced, which meant that when a developer pulled from master into their branch at the beginning of the merge process, this type of failure was discovered and resolved there.
I raised this as a topic of conversation during Lean Coffee at CAST2016 to find out how other teams move quickly with continuous integration. Those present offered up some possible options to resolve the problem as I described it.
Trunk based development
Google and Facebook move a lot faster than my organisation. Someone suggested that I research these companies to learn about their branching and merging strategy. I duly found Google's vs Facebook's Trunk Based Development by Paul Hammant and was slightly surprised to see a relevant visualisation at the very top of the article:
2. Google's vs Facebook's Trunk Based Development by Paul Hammant
It seems that, to move very quickly with a large number of people contributing to a code base, trunk-based development is preferred. As the previous diagram illustrates, we currently use a mainline approach with feature branches, which creates more opportunities for merge conflicts.
I had assumed that all possible solutions to these tests failing on master would be testing-focused. However, a switch to trunk-based development would be a significant change to our practices for every person writing code. I think this solution is too big for the problem.
Sequential build
Someone else suggested that perhaps we were just going faster than we should be. If we weren't running any build requests in parallel, and instead triggered everything sequentially, would there still be a problem?

I don't think that switching to sequential builds would fix our issue, as the step to trigger the merge is a manual one. A pull request might have successfully passed tests but be waiting on peer review from other developers. In the event that no changes are required by reviewers, the pull request could be merged to master at a time that still creates conflict:
3. Sequential PR build with rapid merge timing
Making the pull request builds sequential would slow our feedback loop to the delivery teams with no certain benefit.
Staged Build
Another suggestion was to look at introducing an interim step in our branching strategy. Instead of merging feature branches straight to master, we'd have a staging zone that might work something like this:
4. Introducing a staging area
The staging branch would use sequential builds. If the tests pass there, the change goes to master. If they fail, it doesn't. The theory is that master is always passing.
Where this solution gets a little vague is how the staging branch might automatically roll back a merge. I'm not sure whether it's possible to automatically back changes off a branch based on a test result from continuous integration. If it were possible, why wouldn't we just do this with master instead of introducing an interim step?
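For what it's worth, if such a rollback were automated, I imagine it could look something like the sketch below. This is hypothetical: it assumes a git repository with staging and master branches and a run_tests.sh script standing in for the real test suite, none of which comes from our actual pipeline.

```python
import subprocess


def git(*args):
    # Run a git command and raise if it fails.
    subprocess.run(["git", *args], check=True)


def tests_pass():
    # Placeholder: stands in for triggering the CI test suite and
    # waiting for its result.
    return subprocess.run(["./run_tests.sh"]).returncode == 0


def gate_merge(feature_branch):
    """Merge a feature branch into staging; promote to master only if tests pass."""
    git("checkout", "staging")
    git("merge", "--no-ff", feature_branch)

    if tests_pass():
        # Staging is green: fast-forward master to the same commit.
        git("checkout", "master")
        git("merge", "--ff-only", "staging")
    else:
        # Staging is red: back the merge off so the next candidate starts clean.
        git("checkout", "staging")
        git("reset", "--hard", "master")


if __name__ == "__main__":
    gate_merge("feature/orange")
```

The fast-forward-only merge is what would keep master pointing at a commit that has already passed the staging build, but the same reset could just as easily be applied directly to master, which is the question above.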
I'm relatively sure that the person who suggested this hadn't seen such an approach work in practice.
Do Nothing
After querying the cost of the problem that we're experiencing, the last suggestion that I received was to do nothing. This is the easiest suggestion to implement but one that I find challenging. It feels like I'm leaving a problem unresolved.

However, I know that the build can't always pass successfully. Test automation that is meaningful should fail sometimes and provide information about potential problems in the software. I'm coming to terms with the idea that perhaps the failures we see post-merge are valuable, even though they have become more prevalent since we picked up our pace.
While frustrating, the failures are revealing dependencies between teams that might have been hidden. They also encourage collaboration as people from across the product work together on rapid solutions once the master branch is broken.
While I still feel like there must be a better way, for now it's likely that we will do nothing.