A short series on Test Automation dilemmas, principles and tips

The Test Automation Engineer dilemma

Your company is growing. You have more customers. You add more features and support more use cases. You can’t afford breakages and downtime anymore, so you decide to improve the robustness and quality of your testing. In parallel, supporting your growth means hiring more developers, who need to ship fast, with early feedback and limited context switching, while sharing a finite pool of test and deploy resources. Covering more versus integrating and deploying faster: this is the dilemma Test Automation Engineers often face.

Here’s the good news, and the premise of this series of articles: you don’t actually have to choose. Years of DORA research (a program seeking to “understand the capabilities that drive software delivery and operations performance”) across tens of thousands of teams found that speed and stability are not a tradeoff: they move together. The highest-performing teams ship both faster and more reliably. They get there by choosing the right strategy.

How choosing the wrong strategy impacts your business

Both extremes are costly:

Running only fast tests means coverage gaps reach production, eroding confidence and losing customers.
Running all tests on every Pull Request (PR) before landing is infeasible at scale with finite resources. It frustrates developers and severely limits the release cadence. As Martin Fowler puts it in his excellent intro to Continuous Integration, “the whole point of Continuous Integration is to provide rapid feedback. Nothing sucks the blood of Continuous Integration more than a build that takes a long time.”

It’s tempting to brute-force this problem: just spend more money running more tests on more machines, all the time. This works for a small team and app, but quickly hits its limits when the full validation suite runs for hours (or even minutes) and rarely passes on the first try because of flaky tests. In 2016, Google analysed its own CI and reported that “almost 16% of our tests have some level of flakiness” (roughly 1 in 6). At that scale, a large suite almost never comes back green on the first attempt and needs reruns.
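To see why suite size alone makes a green first run unlikely, here is a back-of-the-envelope sketch. It assumes every test flakes independently with the same per-run failure rate, which is a simplification. The 0.1% rate and the suite sizes are hypothetical numbers chosen for illustration. They are not derived from Google's figure, which counts tests showing any flakiness rather than how often a given run fails.

```python
def first_try_green_probability(num_tests: int, flake_rate: float) -> float:
    """Chance that every test passes on the first run.

    Assumes each test fails spuriously with probability `flake_rate`,
    independently of the others (a simplifying assumption).
    """
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    if not 0.0 <= flake_rate <= 1.0:
        raise ValueError("flake_rate must be between 0 and 1")
    return (1.0 - flake_rate) ** num_tests


if __name__ == "__main__":
    # Hypothetical suite sizes and a hypothetical 0.1% per-run flake rate.
    for num_tests in (100, 1_000, 10_000):
        p = first_try_green_probability(num_tests, flake_rate=0.001)
        print(f"{num_tests:>6} tests: {p:.2%} chance of a green first run")
```

Under these assumptions, a 100-test suite is green on the first try about 90% of the time, a 1,000-test suite about 37% of the time, and a 10,000-test suite well under 1% of the time. Adding machines makes the suite finish sooner, but it does not change these odds.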

When brute-forcing is no longer an option, the real question becomes: when should you test and signal failure, in what order, and how?

What this series covers

This is the introductory post of an upcoming series. Each part will analyse a single lever and how the wider industry has approached the same problem.

Two CI strategies to keep main green: block bad changes, or land fast and chase failures: should you gate changes with a merge queue before they reach main, or let them land and rely on bisection and rollback to catch what slips through?
Smart Validation: test only what a change can affect: instead of running everything, run the subset a change could actually break.
Batch landing & bisection: keep main green at scale: when many PRs land at once, how merge queues test them together, pinpoint the culprit, and keep the mainline shippable.

Each part stands on its own, but they compound: choosing where to draw the line between true head and green head gives you the structure, smart validation cuts the cost of validating each change, and batch landing keeps the whole thing green as your team grows.
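As a taste of the bisection idea, here is a minimal sketch of pinpointing a culprit in a failed batch. It is not how any particular merge queue implements it. It assumes the mainline is green before the batch, that exactly one change breaks the build, and that results are deterministic (no flakes). The function name `find_first_culprit`, the `passes` callback and the PR identifiers are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def find_first_culprit(
    batch: Sequence[str],
    passes: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Return the change that first breaks the batch, or None if it is green.

    `passes(changes)` is a hypothetical callback that builds and tests the
    mainline with `changes` applied in order. Assumes the empty prefix
    passes and that once a prefix fails, every longer prefix fails too.
    """
    if not batch or passes(batch):
        return None
    lo, hi = 0, len(batch)  # invariant: batch[:lo] passes, batch[:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(batch[:mid]):
            lo = mid
        else:
            hi = mid
    return batch[hi - 1]


if __name__ == "__main__":
    # Hypothetical batch where "pr-103" is the breaking change.
    batch = ["pr-101", "pr-102", "pr-103", "pr-104"]
    culprit = find_first_culprit(batch, lambda changes: "pr-103" not in changes)
    print(culprit)  # pr-103
```

With a batch of n changes, this needs about log2(n) extra test runs to isolate the culprit, instead of testing each change on its own. Real systems also have to cope with flaky results and multiple culprits, which this sketch ignores.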

Part 1 lands next week. Subscribe if you’d like it in your inbox.
