Fix Build Errors ASAP

When a software build breaks because of a failed compilation, code analysis, integration, or test, all team members should swarm to resolve the issue as quickly as possible, and no new enhancements or maintenance changes should be introduced until the build again passes all quality gates.  This pulling of the Andon Cord (described under History) typically happens when a team member does not validate changes against a recent pull from the trunk and then pushes bad code into the production code base.  Swarming is essential when employing Trunk-Based Development.

It is important to note that we do not work around the problem or schedule a fix for when we have more time, even though that is what usually happens.  The problem must be resolved as quickly as possible so that the same failure does not recur in later changes and new failures are not layered on top of it.  A fast fix also keeps the error from progressing downstream, where the cost and effort to resolve it would be much greater, not to mention the Technical Debt it would add.

History

The Andon Cord, which originated in Japanese manufacturing in the early twentieth century, is used to signal potential defects on a production line.  Toyota places Andon Cords at every workstation and trains every worker and manager to pull the cord when problems arise.  The cord might be pulled because of a defective or missing part, or because work is taking longer than expected.  When this happens, a leader is alerted and immediately works to resolve the problem.  If the problem is not resolved within a specified amount of time (usually a matter of seconds), the production line is stopped, and the entire organization is mobilized to resolve the problem.

Methods in Software Development

Adopting the concept of the Andon Cord in software development gives fast feedback to everyone in the value stream (especially the person who caused the failure) and promotes a swarming response: the team solves the problem quickly and adds no extensions or other changes until it is fixed.  While you probably won’t have an actual cord to pull, the metaphorical Andon Cord would typically be triggered automatically when a predefined quality gate fails, for example on a compile error, a code analysis error, an integration failure, or the failure of any unit, component, integration, or system test.
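
As a rough illustration of the automatic trigger described above, here is a minimal Python sketch.  It is not tied to any particular CI product: the names Gate, run_gates, and Trunk are hypothetical, and each gate is assumed to be a simple function that returns True when it passes.  The first failing gate pulls the cord, and the trunk then accepts only changes that are marked as fixes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Gate:
    """One quality gate (hypothetical type), e.g. "compile" or "unit tests"."""
    name: str
    check: Callable[[], bool]  # returns True when the gate passes


def run_gates(gates: List[Gate]) -> Optional[str]:
    """Run gates in order; return the first failing gate's name, or None."""
    for gate in gates:
        try:
            passed = gate.check()
        except Exception:
            passed = False  # a gate that crashes counts as a failure
        if not passed:
            return gate.name
    return None


class Trunk:
    """Hypothetical trunk that refuses new work while the cord is pulled."""

    def __init__(self) -> None:
        self.cord_pulled_by: Optional[str] = None  # name of the failed gate

    def record_build(self, gates: List[Gate]) -> None:
        # A passing build resets the cord; a failing one pulls it.
        self.cord_pulled_by = run_gates(gates)

    def accepts_change(self, is_fix: bool = False) -> bool:
        # While the build is broken, only fixes are allowed in.
        return self.cord_pulled_by is None or is_fix


if __name__ == "__main__":
    trunk = Trunk()
    trunk.record_build([
        Gate("compile", lambda: True),
        Gate("unit tests", lambda: False),  # simulated failure
    ])
    print("cord pulled by:", trunk.cord_pulled_by)
    print("feature allowed:", trunk.accepts_change())
    print("fix allowed:", trunk.accepts_change(is_fix=True))
```

In a real pipeline the check functions would wrap whatever compile, analysis, and test steps the team already runs; the point of the sketch is only that one failed gate blocks everything except the fix.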

Regardless of how it is triggered, it is critical that the failure be visible and obvious (to the point of being annoying) to everyone in the value stream.  If you can’t have a giant, flashing light with sirens triggered by your CI servers, at the very least every person should be notified via text, e-mail, phone call, or better yet, all of the above.  This alert must initiate an immediate response by every person required to solve the problem.
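
The "all of the above" alerting idea can be sketched the same way.  The Python below assumes each channel (text, e-mail, and so on) is just a function taking a recipient and a message.  The function broadcast_failure and the channel names are hypothetical, and no real messaging API is used.  Every person is alerted on every channel, and one broken channel does not stop the others.

```python
from typing import Callable, Dict, List, Tuple

# A channel is any function that delivers a message to one recipient.
Notifier = Callable[[str, str], None]


def broadcast_failure(
    failed_gate: str,
    team: List[str],
    channels: Dict[str, Notifier],
) -> List[Tuple[str, str]]:
    """Alert every team member on every channel.

    Returns the (person, channel) pairs that were delivered successfully.
    """
    message = f"BUILD BROKEN: '{failed_gate}' failed. Stop new work and swarm."
    delivered: List[Tuple[str, str]] = []
    for person in team:
        for channel_name, send in channels.items():
            try:
                send(person, message)
                delivered.append((person, channel_name))
            except Exception:
                # One failing channel must not silence the rest.
                continue
    return delivered


if __name__ == "__main__":
    # Stand-in channels that only print; real ones are an assumption.
    channels = {
        "text": lambda who, msg: print(f"[text to {who}] {msg}"),
        "email": lambda who, msg: print(f"[email to {who}] {msg}"),
    }
    broadcast_failure("unit tests", ["alice", "bob"], channels)
```

Alerting the whole team rather than only the person who pushed the change matches the collective ownership described in the next section.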

Collective Ownership

So…  Do you think team members would be more careful about pushing code to the trunk if this happened every time they broke the build?  Yeah, you bet they would.  But remember, it’s not a team member breaking the build, it’s the team breaking the build.  Therefore, the entire team is accountable and responsible for fixing the build (as quickly as possible) any time it breaks.