10 November 2021

Within each normal-sized team, limit the choice of alternatives for any class of technology to three. These are: the current sensible default, the one we're experimenting with as a trial, and the one that we hate and want to retire.

The conversation goes like this: We want to introduce a new messaging technology. How many do we have already in place? Oh, we have three in active use, including one that's considered legacy that we're partway through migrating off, and one that we experimented with previously but that didn't gain traction. Ok, so we're at our limit now. If we want to add another messaging tech then we have two choices. Either migrate all of our apps off the legacy tech, or properly rid ourselves of the failed experiment. This is quite closely related to the idea of capping the number of Innovation Tokens in use within your teams.
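That conversation can be sketched as a small check. This is only an illustration: the technology names, the status labels, and the function names are hypothetical, and the only things taken from the text are the limit of three and the three roles a technology can play.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    DEFAULT = "current sensible default"
    TRIAL = "experimenting with as a trial"
    RETIRING = "want to retire"


@dataclass(frozen=True)
class Technology:
    name: str
    status: Status


MAX_ALTERNATIVES = 3  # the limit described in the text


def can_adopt(in_use: list[Technology]) -> bool:
    """True if there is room for one more technology in this class."""
    return len(in_use) < MAX_ALTERNATIVES


def candidates_to_remove(in_use: list[Technology]) -> list[str]:
    """Whatever isn't the default has to go before something new comes in."""
    return [t.name for t in in_use if t.status is not Status.DEFAULT]


# Hypothetical messaging technologies, named only for illustration.
messaging = [
    Technology("queue-a", Status.DEFAULT),
    Technology("queue-b", Status.TRIAL),     # the experiment that didn't gain traction
    Technology("queue-c", Status.RETIRING),  # legacy, migration in progress
]

if not can_adopt(messaging):
    print("At the limit. Finish removing one of:", candidates_to_remove(messaging))
```

The point of the sketch is that the check is trivial; the work is in the migration that makes room.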

At a team level these kinds of limits are relatively easy to maintain and discuss and act upon, because we have common priorities and ways of working and high trust, high bandwidth communication. At the scope of the whole organisation the challenge is similar, but getting alignment takes a lot longer and doing actual migration and consolidation work can take a long time - so we sometimes have to allow for more variation in technology. We also use different techniques to discuss and communicate the status of our preferred technologies.

An approach we use at MYOB to engage our whole organisation in broader decisions about technology is publishing our own MYOB Technology Radar, following the format of the Thoughtworks Technology Radar. Building our own radar involves taking input from all of our verticals and teams, and making a clear statement on which technologies we encourage teams to adopt or trial and, more importantly, which ones to keep clear of.


28 January 2021

When people think of code reviews, they usually think in terms of an explicit step in a development team's workflow. These days the Pre-Integration Review, carried out on a Pull Request, is the most common mechanism for a code review, to the point that many people witlessly consider that not using pull requests removes all opportunities for doing code review. Such a narrow view of code reviews doesn't just ignore a host of explicit mechanisms for review, it more importantly neglects probably the most powerful code review technique - that of perpetual refinement done by the entire team.

One of the most pervasive perspectives in software is the notion that it's something we build and complete - hence the endless metaphor of building construction and architecture. Yet the key property of software is that it is soft, and can be as easily modified after it's released as it was when initially composed in the programmer's editor. That's why Erik Dörnenburg wisely argues that architecture is a poor metaphor and would be better replaced by town planning. Valuable software is usually in a constant state of change, as we add features from a better understanding of the value it can bring. But the opportunity is not just to add new features, but also to refine that software - incorporating the lessons the team steadily learns about how best that software can enable these changes.

With the right environment, I can look at a bit of code written six months ago, see some problems with how it's written, and quickly fix them. This may be because this code was flawed when it was written, or because changes in the code base since then led to the code no longer being quite right. Whichever the cause, the important thing is to fix problems as soon as they start getting in our way. As soon as I have an understanding about the code that wasn't immediately apparent from reading it, I have the responsibility to (as Ward Cunningham so wonderfully said) take that understanding out of my head and put it into the code. That way the next reader won't have to work so hard.

This process of refinement is exactly the same as what happens in a code review, but it's triggered each time the code is looked at rather than when the code is added to the codebase. This was, for me, a crucial insight. After all, many problems that code reviews seek to remedy are problems that only become problems when the code is read in the future. There's a strong argument for not worrying about them until then. After all, just like adding a large apartment complex changes traffic patterns, we may have altered the context of the code six months later, altering the kind of fix that code needs. It also involves more people: in this scheme every developer that reads the code is a reviewer, and one that's able to review based on their actual use of the code rather than on some general, but often hazily-justified, guidelines.

A way to think about the validity of a practice is by thinking about what happens if it's a monopoly. What if the only code review mechanism we have is the iteration from later programmers? One consequence is that the review attention gets concentrated on the areas of code that are read more often - which is mostly the areas that ought to get the attention. One concern is that code that's never read will never get reviewed - but mostly that's fine. A team with good testing practices can be confident that the code works, performance tests can identify performance issues. Given that, if the code never needs to be looked at again, we don't need to spend effort on making it comprehensible. I'd expect such cases to be vanishingly rare, but it's an informative thought experiment.

But most ≠ all. One obvious exception here is security issues. Code can work just fine for years until an attacker finds an exploit; at that point we'll lament its lack of review. This is an example of high-impact but rare safety concerns which deserve special scrutiny. However, that doesn't mean we shouldn't make conscious use of refinement as a code review mechanism. Instead it means we should be aware of rare-high-impact concerns and adjust our workflow to watch for that kind of specific problem to the degree that it's needed in our circumstances. Threat analysis should alert us to the modules that need additional attention and the kinds of risks they face. Targeted code reviews might be scheduled for security concerns; these can run more effectively because they are focused on a specific kind of problem.

In order to do this perpetual code refinement we require other practices. If I'm going to change code I need to have confidence that it won't break existing functionality, so I need something like Self Testing Code. I need to know that it won't cause big merge conflicts for others, so I need Continuous Integration. We all need to be good at refactoring so we can change code effectively. Since this relies on many developers being expected to modify any part of the code base, we are best off with collective (or at least weak) code ownership. But given a team that has these skills, they can rely on using their regular refinement as a substantial part of their code review strategy.

If nothing else, I think it's important that we put more thought into the role of refinement as code review. One of the dangers of focusing solely on Pre-Integration Reviews is that it can lead teams to neglect how change works in a code base. If I have a pristine mainline, and ensure that every commit merged into that mainline is pristine - can I be sure that the codebase is still pristine after six months? I'd argue that I can't, because the changes mean a good decision about some code six months ago is no longer a good decision now. Refining the code allows us to evaluate old code against this changing usage, allowing us to sustain its health.


Ben Noble, Chris Ford, Evan Bottcher, Ian Cartwright, Jeremy Huiskamp, Ken Mugrage, Mario Giampietri, Martha Rohte, Omar Bashir, Peter Gillard-Moss, and Simon Brunning commented on drafts of this post on our internal mailing list.


28 January 2021

Pull Requests are a mechanism popularized by GitHub, used to help facilitate merging of work, particularly in the context of open-source projects. A contributor works on their contribution in a fork (clone) of the central repository. Once their contribution is finished they create a pull request to notify the owner of the central repository that their work is ready to be merged into the mainline. Tooling supports and encourages code review of the contribution before accepting the request. Pull requests have become widely used in software development, but critics are concerned by the addition of integration friction which can prevent continuous integration.

Pull requests essentially provide convenient tooling for a development workflow that existed in many open-source projects, particularly those using a distributed source-control system (such as git). This workflow begins with a contributor creating a new logical branch, either by starting a new branch in the central repository, cloning into a personal repository, or both. The contributor then works on that branch, typically in the style of a Feature Branch, pulling any updates from Mainline into their branch. When they are done they communicate with the maintainer of the central repository indicating that they are done, together with a reference to their commits. This reference could be the URL of a branch that needs to be integrated, or a set of patches in an email.

Once the maintainer gets the message, she can then examine the commits to decide if they are ready to go into mainline. If not, she can then suggest changes to the contributor, who then has opportunity to adjust their submission. Once all is ok, the maintainer can then merge, either with a regular merge/rebase or applying the patches from the final email.

GitHub's pull request mechanism makes this flow much easier. It keeps track of the clones through its fork mechanism, and automatically creates a message thread to discuss the pull request, together with behavior to handle the various steps in the review workflow. These conveniences were a major part of what made GitHub successful and led to "pull request" becoming a fundamental part of the developer's lexicon.

So that's how pull requests work, but should we use them, and if so how? To answer that question, I like to step back from the mechanism and think about how it works in the context of a source code management workflow. To help me think about that, I wrote down a series of patterns for managing source code branching. I find understanding these (specifically the Base and Integration patterns) clarifies the role of pull requests.

In terms of these patterns, pull requests are a mechanism designed to implement a combination of Feature Branching and Pre-Integration Reviews. Thus to assess the usefulness of pull requests we first need to consider how applicable those patterns are to our situation. Like most patterns, they are sometimes valuable, and sometimes a pain in the neck - we have to examine them based on our specific context. Feature Branching is a good way of packaging together a logical contribution so that it can be assessed, accepted, or deferred as a single unit. This makes a lot of sense when contributors are not trusted to commit directly to mainline. But Feature Branching comes at a cost, which is that it usually limits the frequency of integration, leading to complicated merges and deterring refactoring. Pre-Integration Reviews provide a clear place to do code review at the cost of a significant increase in integration friction. [1]

That's a drastic summary of the situation (I need a lot more words to explain this further in the feature branching article), but it boils down to the fact that the value of these patterns, and thus the value of pull requests, rests mostly on the social structure of the team. Some teams work better with pull requests, some teams would find pull requests a severe drag on their effectiveness. I suspect that since pull requests are so popular, a lot of teams are using them by default when they would do better without them.

While pull requests are built for Feature Branches, teams can use them within a Continuous Integration environment. To do this they need to ensure that pull requests are small enough, and the team responsive enough, to follow the CI rule of thumb that everybody does Mainline Integration at least daily. (And I should remind everyone that Mainline Integration is more than just merging the current mainline into the feature branch). Using the ship/show/ask classification can be an effective way to integrate pull requests into a more CI-friendly workflow.

The wide usage of pull requests has encouraged a wider use of code review, since pull requests provide a clear point for Pre-Integration Review, together with tooling that encourages it. Code review is a Good Thing, but we must remember that a pull request isn't the only mechanism we can use for it. Many teams find great value in the continuous review afforded by Pair Programming. To avoid reducing integration frequency we can carry out post-integration code review in several ways. A formal process can record a review for each commit, or a tech lead can examine risky commits every couple of days. Perhaps the most powerful form of code review is one that's frequently ignored. A team that takes the attitude that the codebase is a fluid system, one that can be steadily refined with repeated iteration, carries out Refinement Code Review every time a developer looks at existing code. I often hear people say that pull requests are necessary because without them you can't do code reviews - that's rubbish. Pre-integration code review is just one way to do code reviews, and for many teams it isn't the best choice.


Chris Ford, Dan Mutton, Jeremy Huiskamp, Kief Morris, Pramod Sadalage, and Ryan Boucher commented on drafts of this post on our internal mailing list.


1: A colleague of mine recently calculated the time a client spent waiting for pull requests that had no comments (true of 91% of them). Total time waiting in 2020 for 7000 PRs was 130,000 hours. This figure included time elapsed over nights and weekends.


18 November 2020

A computational notebook is an environment for writing a prose document that allows the author to embed code which can be easily executed with the results also incorporated into the document. It's a platform particularly well-suited for data science work. Such environments include Jupyter Notebook, R Markdown, Mathematica, and Emacs's org-mode.

When I'm exploring some data, it's useful to keep my notes close together with the code that performs the exploration. I like to try some code, look at the results, and note down any observations I have from that execution. A computational notebook allows me to combine these together easily in a single document.

Here's an example of this, looking at some analysis of my Google Analytics data. I'm doing this in R Studio, which uses the R Markdown format.

The example output here is a graph, as notebooks are well suited for plotting various charts. But it's just as useful to embed various data manipulations in the code and display the data in the document as a table.

I first encountered a computational notebook in the late 1980's with Mathematica. I remember wishing I'd had access to such a tool during my university degree, but didn't use a computational notebook again until recent years, with the rise of their use in data science circles. The notebook software I hear most about is Jupyter Notebook, which is popular in the Python community, but as I do my data munging with R I tend to use R Markdown, usually within R Studio. I also use a rather more niche notebook, org-mode, which is part of Emacs.

The code embedded in Mathematica is its own programming language, designed for expressing mathematics. Although Jupyter began in the Python world, it supports a wide range of programming languages, as does R Markdown. Mathematica is a commercial tool, but Jupyter and R Markdown are open source. Jupyter stores its files in JSON, while R Markdown uses markdown files with some special markup for the code blocks. Using a text format for the documents allows them to be stored in regular version control tools, and using a markup language makes diffing easier. Using a markup language also allows the possibility of editing the documents in other editors, but they need to have a suitable environment for executing the code blocks.

Computational notebooks are useful when exploring a problem, such as trying various forms of analysis on a dataset. The document acts as a record of what's been tried and all the observations the researcher makes as they try things. By keeping the code and results together the writer can see exactly what they did and what results that generated. This coupling of code and results is a form of Illustrative Programming, making the environment appealing to lay programmers. One thing to be wary of, however, is if any external environmental factors change the result - such as the contents of a database. If the dataset isn't too large it can be exported and kept in the version control system, but often its size is prohibitive.

Notebooks are also useful for preparing reports, usually by generating a document in PDF, HTML, or other formats. If I want to report to an author on the traffic for their article, I take the last such report, change the subject URL, rerun all the code, and tweak any prose commentary I think is appropriate. If I were sufficiently motivated I could auto-generate such reports every few months. I like that such reports can easily include the code used to generate the results, so readers can accurately understand the logic behind the figures they see.

Notebooks shouldn't be used, however, as a component of a production system. The notebook structure - with its casual mix of IO, calculation, and UI - is there to encourage interactivity, but works against the modularity needed for code that is used as part of a broader code base. It's best to think of notebooks as a way of exploring logic. Once you've found a path, that logic should be replicated into a library designed for production use.
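As a rough sketch of what that replication might look like, here is a calculation pulled out of a notebook cell into a plain function, written in Python rather than the R I'd use myself. The function name, the row format, and the field names `path` and `views` are all assumptions made up for the illustration.

```python
from collections import Counter


def top_pages(rows: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Pure calculation: no file reads and no plotting, so it can be
    imported by production code and unit tested on its own."""
    views = Counter()
    for row in rows:
        views[row["path"]] += int(row["views"])
    return views.most_common(n)


# In a notebook the IO, the calculation, and the display tend to share a cell.
# Here the caller owns the IO and the display; the library owns the logic.
rows = [
    {"path": "/a", "views": "3"},
    {"path": "/b", "views": "5"},
    {"path": "/a", "views": "4"},
]
print(top_pages(rows, n=1))
```

The notebook can still import and call such a function, which keeps the exploration and the production code from drifting apart.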


29 April 2020

Software development teams find life can be much easier if they integrate their work as often as they can. They also find it valuable to release frequently into production. But teams don't want to expose half-developed features to their users. A useful technique to deal with this tension is to build all the back-end code and integrate it, but not build the user-interface. The feature can be integrated and tested, but the UI is held back until the end when, like a keystone, it's added to complete the feature, revealing it to the users.

A simple example of this technique might be to give a customer the option of a rush order. Such an order needs to be priced, depending on where the customer lives and what delivery companies operate there. The nature of the goods involved affects the picking approach used in the warehouse. Certain customers may qualify to have rush orders available to them, which may also depend on the delivery location, the time of year, and the kind of goods ordered.

All in all that's a fair bit of business logic, particularly since it will involve gnarly integration with various warehousing, catalog, and customer service systems. Doing this could take several weeks, while other features need to be released every few days. But as far as the user is concerned, a rush order is just a check-box on the order form.

To build this using the check-box as the keystone, the team does development work on the underlying business logic and interfaces to internal systems over the course of several production releases. The user is unaware of all this latent code. Only with the last step does the keystone check-box need to be made visible, which can be done in a relatively short time. This way all latent code can be integrated and be part of the system going into production, reducing the problems that come with a long-lived feature branch.
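A minimal sketch of the rush order example, in Python. Everything here is hypothetical: the pricing rule stands in for the real delivery-company lookups, and the form is reduced to a list of HTML fragments. What it tries to show is that the latent logic is deployed and callable, while the one line that would reveal it stays out until the final release.

```python
from dataclasses import dataclass


@dataclass
class Order:
    items: list[str]
    postcode: str
    rush: bool = False


def rush_surcharge(order: Order) -> float:
    """Latent business logic: in production, but unreachable by users for now."""
    # Hypothetical pricing rule, invented for this illustration.
    return 15.0 if order.postcode.startswith("9") else 25.0


def price(order: Order) -> float:
    base = 10.0 * len(order.items)
    return base + (rush_surcharge(order) if order.rush else 0.0)


def render_order_form() -> str:
    """The order form, without the keystone check-box."""
    fields = ['<input name="postcode">']
    # Keystone: uncommenting this one line is what reveals the feature.
    # fields.append('<input type="checkbox" name="rush">')
    return "\n".join(fields)


print(price(Order(["book"], "90210", rush=True)))  # the latent path already works
```

Since no form can submit `rush=True` yet, users never reach `rush_surcharge`, but tests and other back-end code can.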

The latent code does need to be tested to the same degree of confidence that it would be if it were active. This can be done providing the architecture of the system is set up so that most testing isn't done through the user interface. Unit Tests and other lower layers of the Test Pyramid should be easy to run this way. Even Broad Stack Tests can be run providing there is a mechanism to make them Subcutaneous Tests. In some cases there will be a significant amount of behavior within the UI itself, but this can also be tested if the design allows the visible UI to be a Humble Object.
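One way this might look, as a sketch in Python with hypothetical class names: the view is a Humble Object that only shows what it is told, so the behavior that decides whether the rush option appears can be tested just beneath the UI, before any check-box exists.

```python
class RushOrderService:
    """Hypothetical application-layer entry point that a UI would call."""

    def __init__(self, eligible_postcodes: set[str]):
        self._eligible = eligible_postcodes

    def is_available(self, postcode: str) -> bool:
        return postcode in self._eligible


class RushCheckboxView:
    """Humble Object: holds no logic, only records what it is told to show."""

    def __init__(self) -> None:
        self.visible = False

    def set_visible(self, visible: bool) -> None:
        self.visible = visible


class RushCheckboxPresenter:
    """Holds the UI behavior, so it can be tested without a real UI."""

    def __init__(self, service: RushOrderService, view: RushCheckboxView):
        self._service = service
        self._view = view

    def postcode_changed(self, postcode: str) -> None:
        self._view.set_visible(self._service.is_available(postcode))


def test_rush_option_shown_only_where_available() -> None:
    view = RushCheckboxView()
    presenter = RushCheckboxPresenter(RushOrderService({"3000"}), view)
    presenter.postcode_changed("3000")
    assert view.visible
    presenter.postcode_changed("9999")
    assert not view.visible


test_rush_option_shown_only_where_available()
```

In a real application the view would wrap a widget or a template, but the test wouldn't need to change.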

Not all applications are built in such a way that they can be extensively tested in a subcutaneous manner - but the effort required to do this is worthwhile even without the capability to use a keystone. Tests running through the UI are always more trouble to set up, even with the best tools to automate the process. Moving more tests to subcutaneous and lower level tests, especially unit tests, can dramatically speed up Deployment Pipelines and enable Continuous Delivery.

Of course, most UIs will be more than a check-box, although often they aren't that much more work to keystone. In a web app, a complex feature will often be an independent web page that can be built and tested in full, and the keystone is merely a link. A desktop app may have several screens where the keystone is the menu-item to make them visible.

That said, there are cases when the UI can't be packaged into a simple keystone. When that's the case then it's time to use Feature Toggles. Even in this case, however, thinking of a keystone can be useful by ensuring that the feature toggle only applies to the UI. This avoids scattering lots of toggle points through the back end code, reduces the complexity of applying the toggle, allows the use of simple toggle mechanisms, and makes it easier to remove when the time comes.
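Here's a sketch of a toggle confined to the UI, in Python. The environment variable `RUSH_ORDERS_UI` and the field names are assumptions made up for the example; any simple toggle mechanism would do.

```python
import os
from collections.abc import Mapping


def rush_orders_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical toggle, read from an environment variable."""
    return env.get("RUSH_ORDERS_UI", "off") == "on"


def order_form_fields(env: Mapping[str, str] = os.environ) -> list[str]:
    fields = ["items", "postcode", "payment"]
    if rush_orders_enabled(env):  # the only toggle point in the system
        fields += ["rush", "rush_delivery_window"]
    return fields


# The back-end code never consults the toggle. It just handles
# whatever fields the form was able to send.
print(order_form_fields({"RUSH_ORDERS_UI": "on"}))
```

With a single toggle point, removing the toggle later means deleting one conditional.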

There is a general danger with developing a UI last, in that the back-end code may be designed in a way that doesn't work with the UI once it's built, or the UI isn't given the attention it needs until late, leading to a lack of iteration and a poor user experience. For those reasons a keystone approach works best within an overall approach that encourages building a product through thin vertical slices that lead to releasing small but fully working features rapidly.

I've used the example of a user-interface here, but of course the same approach can be used with any other interface, such as an API. By building the consumer's interface last, and keeping it simple, we can build and integrate even large features in small chunks.

Dark Launching is a variation where the new feature is called once it's built, but no results are shown to the user. This is done to measure the impact on the back-end systems, which is useful for some changes. Once all is good, we can add the keystone.
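A rough sketch of a dark launch in Python. Both quote functions are hypothetical stand-ins, and a plain list plays the part of whatever metrics system is really in use. The new code runs on every request so its load can be observed, but its result is thrown away and its failures are kept away from the user.

```python
import time


def current_shipping_quote(postcode: str) -> float:
    """The existing behavior, which is what the user sees."""
    return 10.0


def new_rush_quote(postcode: str) -> float:
    """The dark-launched feature (hypothetical)."""
    return 25.0


def quote(postcode: str, timings: list) -> float:
    shown = current_shipping_quote(postcode)
    try:
        started = time.perf_counter()
        new_rush_quote(postcode)  # called for its load; the result is discarded
        timings.append(time.perf_counter() - started)
    except Exception:
        # A failure on the dark path must never reach the user.
        timings.append(None)
    return shown


timings: list = []
print(quote("3000", timings), len(timings))
```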


I first came across the metaphor of a keystone for this technique in the second edition of Kent Beck's Extreme Programming Explained. Pete Hodgson, Brandon Duff, and Stefan Smith reminded me that I'd forgotten this.

Dave Farley, Paul Hammant, and Pete Hodgson commented on drafts of this post.


11 February 2020

Imagine a team writing software for a shopping website. If we look at the team's output, we might consider how many new features they produced in the last quarter, or a cross-functional measure such as a reduction in page load time. An outcome measure, however, would consider increased sales revenue, or a reduced number of support calls for the product. Focusing on outcomes, rather than output, favors building features that do more to improve the effectiveness of the software's users and customers.

As with any professional activity, those of us involved in software development want to learn what makes us more effective. This is true of an individual developer trying to improve her own performance, for managers looking to improve teams within an organization, or a maven like me trying to raise the game of the entire industry. One of the things that makes this difficult is that there's no clear way to measure the productivity of a software team. And this measurement question gets further complicated by whether we base effectiveness on output or outcome.

I've always been of the opinion that outcome is what we should concentrate on. If a team delivers lots of functionality - whether we measure it in lines of code, function points, or stories - that functionality doesn't matter if it doesn't help the user improve their activity. Lots of unused features are wasted effort; indeed, worse than that, they bloat the code base, making it harder to add new features in the future. Such a software development team needs to care about the usefulness of the new functionality. They improve as they deliver fewer features, but of greater utility.

One argument I've heard against using outcome-based observations is that it's harder to come up with repeatable measures for outcomes than it is for output. I find this point difficult to fathom. Measuring pure output for software is famously difficult. Lines of code would be a useless measure even if they weren't so easily gamed. There's poor replicability with Function Points or Story Points - different people will give the same things different point scores. Compared to this, we are very good at measuring financial outcomes. Of course, many outcome observations are more tricky to make - consider customer satisfaction - but I don't see any of them as more difficult than software functionality.

Just calling something an “outcome”, of course, doesn't make something the right thing to focus on, and there is certainly a skill to picking the right outcomes to observe. One handy notion is that of Seiden, who says that an outcome should be a change in behavior of a user, employee, or customer that drives a good thing for the organization. He makes a distinction between “outcomes”, which are behavioral changes that are easier to observe, and “impacts” which are broader effects upon the organization. In developing EDGE, Highsmith, Luu, and Robinson advise that outcomes about customer value (reliability of a dishwasher) should be given more weight than outcomes about business value (warranty repair costs).

A consequential concern about using outcome observations is that it's harder to apportion them to a software development team. Consider a customer team that uses software to help them track the quality of goods in their supply chain. If we assess them by how many rejects there are by the final consumer, how much of that is due to the software, how much due to the quality control procedures developed by quality analysts, and how much due to a separate initiative to improve the quality of raw materials? This difficulty of apportionment is a huge hurdle if we want to compare different software teams, perhaps in order to judge whether using Clojure has helped teams be more effective. Similarly there is the case that the developers work well and deliver excellent and valuable software to track quality, but the quality control procedures are no good. Consequently rejects don't go down and the initiative is seen as a failure, despite the developers doing a great job on their part.

But the problems of apportionment shouldn't be taken as a reason to observe the wrong thing. The common phrase says "you get what you measure"; in this case it's more like "you get what you try to measure". If you focus appraisal of success on output, then everyone is thinking about how to increase the output. So even if it's tricky to determine how a team's work affects outcome, the fact that people are instead thinking about outcomes and how to improve them is worth more than any effort to compare teams' proficiency in producing the wrong things.

Further Reading

Seiden provides a nice framework for thinking of outcomes, one that's informed by experiences with non-profits who have a similarly tricky job of evaluating the impact of their work.

My colleagues developed EDGE as an operating model for transforming businesses to work in the digital world. Focusing on outcomes is a core part of their philosophy.

Focusing on outcomes naturally leads to favoring Outcome Oriented teams.


My fellow pioneers in the early days of Extreme Programming were very aware of the faults of assessing software development in terms of output. I remember Ron Jeffries and I arguing at an early agile conference workshop that any measures of a team's effectiveness should focus on outcome rather than output - although we did not use those words yet. That thinking is also reflected in my post Cannot Measure Productivity.

I recall starting to hear my colleagues at Thoughtworks talking about a distinction between outcome and output in the 2000s, leading Daniel Terhorst-North to suggest that outcome over features should be a fifth agile value. This preference for outcomes is a regular theme in Thoughtworks-birthed books such as Lean Enterprise, EDGE, and the Digital Transformation Game Plan.

Alexander Steinhart, Alexandra Mogul, Andy Birds, Dale Peakall, Dean Eyre, Gabriel Sixel, Jeff Mangan, Job Rwebembera, Kief Morris, Linus Karsai, Mariela Barzallo, Peter Gillard-Moss, Steven Wilhelm, Vanessa Towers, Vikrant Kardam, and Xiao Ran discussed drafts of this post on our internal mailing list. Peter Gillard-Moss led me to the Seiden book and other work from the non-profit world.


18 November 2019

Exploratory testing is a style of testing that emphasizes a rapid cycle of learning, test design, and test execution. Rather than trying to verify that the software conforms to a pre-written test script, exploratory testing explores the characteristics of the software, raising discoveries that will then be classified as reasonable behavior or failures.

The exploratory testing mindset is a contrast to that of scripted testing. In scripted testing, test designers create a script of tests, where each manipulation of the software is written down, together with the expected behavior of the software. These scripts are executed separately, usually many times, and usually by different actors than those who wrote them. If any test demonstrates behavior that doesn't match the expected behavior designed by the test, then we consider this a failure.

For a long time scripted tests were usually executed by testers, and you'd see lots of relatively junior folks in cubicles clicking through screens following the script and checking the result. In large part due to the influence of communities like Extreme Programming, there's been a shift to automating scripted testing. This allows the tests to be executed faster, and eliminates the human error involved in evaluating the expected behavior. I've long been a firm advocate of automated testing like this, and have seen great success with its use drastically reducing bugs.

But even the most determined automated testers realize that there are fundamental limitations with the technique, which are limitations of any form of scripted testing. Scripted testing can only verify what is in the script, catching only conditions that are known about. Such tests can be a fine net that catches any bugs that try to get through it, but how do we know that the net covers all it ought to?
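To make the limitation concrete, here is a small automated script in Python. The function under test and the cases in the script are hypothetical. The script passes in full, yet it has nothing to say about inputs nobody thought to write down.

```python
def parse_quantity(text: str) -> int:
    """Hypothetical function under test: read an order quantity from a form field."""
    return int(text)


# The script: each manipulation, together with the expected behavior.
SCRIPT = [("1", 1), ("12", 12), ("007", 7)]


def run_script() -> list[str]:
    """Run every scripted case and return a description of each failure."""
    failures = []
    for given, expected in SCRIPT:
        actual = parse_quantity(given)
        if actual != expected:
            failures.append(f"{given!r}: expected {expected}, got {actual}")
    return failures


print(run_script())  # an empty list: the net has caught nothing

# Outside the net: a negative quantity is accepted without complaint.
print(parse_quantity("-3"))
```

Whether a negative quantity is a failure or reasonable behavior is exactly the kind of question the script can't raise, because no one wrote it down.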

Exploratory testing seeks to test the boundaries of the net, finding new behaviors that aren't in any of the scripts. Often it will find new failures that can be added to the scripts, sometimes it exposes behaviors that are benign, even welcome, but not thought of before.

Exploratory testing is a much more fluid and informal process than scripted testing, but it still requires discipline to be done well. A good way to do this is to carry out exploratory testing in time-boxed sessions. These sessions focus on a particular aspect of the software. A charter that identifies the target of the session and what information you hope to find is a fine mechanism to provide this focus.

Elisabeth Hendrickson is one of the most articulate exponents of exploratory testing, and her book is the first choice to dig for more information on how to do this well.

Such a charter can act as focus, but shouldn't attempt to define details of what will happen in the session. Exploratory testing involves trying things, learning more about what the software does, applying that learning to generate questions and hypotheses, and generating new tests in the moment to gather more information. Often this will spur questions outside the bounds of the charter, that can be explored in later sessions.

Exploratory testing requires skilled and curious testers, who are comfortable with learning about the software and coming up with new test designs during a session. They also need to be observant, on the lookout for any behavior that might seem odd, and worth further investigation. Often, however, they don't have to be full-time testers. Some teams like to have the whole team carry out exploratory testing, perhaps in pairs or in a single mob.

Exploratory testing should be a regular activity occurring throughout the software development process. Sadly it's hard to find any guidelines on how much should be done within a project. I'd suggest starting with a one hour session every couple of weeks and see what kinds of information the sessions unearth. Some teams like to arrange half-an-hour or so of exploratory testing whenever they complete a story.

If you find bugs are getting through to production, that's a sign that there are gaps in the testing regimen. It's worth looking at any bug that escapes to production and thinking about what measures could be taken to either prevent the bug from getting there, or detecting it rapidly when in production. This analysis will help you decide whether you need more exploratory testing. Bear in mind that it will take time to build up the skill to do exploratory testing well, if you haven't done much exploratory testing before.

I would consider it a red flag if a team isn't doing exploratory testing at all - even if their automated testing was excellent. Even the best automated testing is inherently scripted testing - and that alone is not good enough.


Almost all I know about Exploratory Testing comes from Elisabeth Hendrickson's fine book, which is also where I pinched the net metaphor from.

Aida Manna, Alex Fraser, Bharath Kumar Hemachandran, Chris Ford, Claire Sudbery, Daniel Mondria, David Corrales, David Cullen, David Salazar Villegas, Lina Zubyte, and Philip Peter discussed drafts of this article on our internal mailing list.