CanaryRelease

tags: delivery · lean

Canary release is a technique to reduce the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users before rolling it out to the entire infrastructure and making it available to everybody.

Similar to a BlueGreenDeployment, you start by deploying the new version of your software to a subset of your infrastructure, to which no users are routed.

When you are happy with the new version, you can start routing a few selected users to it. There are different strategies to choose which users will see the new version: a simple strategy is to use a random sample; some companies choose to release the new version to their internal users and employees before releasing to the world; another more sophisticated approach is to choose users based on their profile and other demographics.
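As a minimal sketch of how such routing might work (all names here are hypothetical - in practice this logic usually lives in a load balancer or routing tier), hashing the user id into a stable bucket gives you a random sample while ensuring each user consistently sees the same version:

  import java.util.Set;

  class CanaryRouter {
    private final Set<String> internalUsers; // employees see the canary first
    private double canaryPercentage;         // 0-100, ramped up as confidence grows

    CanaryRouter(Set<String> internalUsers, double initialPercentage) {
      this.internalUsers = internalUsers;
      this.canaryPercentage = initialPercentage;
    }

    boolean routeToCanary(String userId) {
      if (internalUsers.contains(userId)) return true;    // internal-users strategy
      int bucket = Math.floorMod(userId.hashCode(), 100); // stable bucket from 0 to 99
      return bucket < canaryPercentage;                   // random-sample strategy
    }

    void rampUpTo(double percentage) {
      this.canaryPercentage = percentage;                 // route more users over time
    }
  }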

As you gain more confidence in the new version, you can start releasing it to more servers in your infrastructure and routing more users to it. A good practice for rolling out the new version is to repurpose your existing infrastructure using PhoenixServers or to provision new infrastructure and decommission the old one using ImmutableServers.

Canary release is an application of ParallelChange, where the migrate phase lasts until all the users have been routed to the new version. At that point, you can decommission the old infrastructure. If you find any problems with the new version, the rollback strategy is simply to reroute users back to the old version until you have fixed the problem.

A benefit of using canary releases is the ability to do capacity testing of the new version in a production environment with a safe rollback strategy if issues are found. By slowly ramping up the load, you can monitor and capture metrics about how the new version impacts the production environment. This is an alternative approach to creating an entirely separate capacity testing environment, because the environment will be as production-like as it can be.

Although the name for this technique might not be familiar [1], the practice of canary releasing has been adopted for some time. Sometimes it is referred to as a phased rollout or an incremental rollout.

In large, distributed scenarios, instead of using a router to decide which users will be redirected to the new version, it is also common to use different partitioning strategies. For example: if you have geographically distributed users, you can roll out the new version to a region or a specific location first; if you have multiple brands, you can roll out to a single brand first, and so on. Facebook chooses to use a strategy with multiple canaries, the first one being visible only to their internal employees and having all the FeatureToggles turned on so they can detect problems with new features early.

Canary releases can be used as a way to implement A/B testing due to similarities in the technical implementation. However, it is preferable to avoid conflating these two concerns: while canary releases are a good way to detect problems and regressions, A/B testing is a way to test a hypothesis using variant implementations. If you monitor business metrics to detect regressions with a canary [2], also using it for A/B testing could interfere with the results. On a more practical note, it can take days to gather enough data to demonstrate statistical significance from an A/B test, while you would want a canary rollout to complete in minutes or hours.

One drawback of using canary releases is that you have to manage multiple versions of your software at once. You can even decide to have more than two versions running in production at the same time; however, it is best to keep the number of concurrent versions to a minimum.

Another scenario where using canary releases is hard is when you distribute software that is installed on users' computers or mobile devices. In this case, you have less control over when the upgrade to the new version happens. If the distributed software communicates with a backend, you can use ParallelChange to support both versions and monitor which client versions are being used. Once the usage numbers fall to a certain level, you can then contract the backend to only support the new version.
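As a hypothetical sketch of what that backend support might look like (the version parameter and handler names are invented for illustration), the backend dispatches on the client's declared version and counts usage so you know when it is safe to contract:

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.atomic.AtomicLong;

  class VersionedBackend {
    private final Map<String, AtomicLong> usageByVersion = new ConcurrentHashMap<>();

    String handle(String clientVersion, String payload) {
      // track which client versions are still out in the field
      usageByVersion.computeIfAbsent(clientVersion, v -> new AtomicLong()).incrementAndGet();
      return "1".equals(clientVersion) ? handleV1(payload) : handleV2(payload);
    }

    private String handleV1(String payload) { return "old:" + payload; } // removed at contract
    private String handleV2(String payload) { return "new:" + payload; }

    long usageOf(String version) {
      AtomicLong count = usageByVersion.get(version);
      return count == null ? 0 : count.get();
    }
  }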

Managing database changes also requires attention when doing canary releases. Again, using ParallelChange is a technique to mitigate this problem. It allows the database to support both versions of the application during the rollout phase.

Further Reading

Canary release is described by Jez Humble and Dave Farley in the book Continuous Delivery.

In this talk, Chuck Rossi describes Facebook's release process and their use of canary releases in more detail.

Acknowledgements

Thanks to many ThoughtWorks colleagues for their feedback: Jez Humble, Rohith Rajagopal, Charles Haynes, Andrew Maddison, Mark Taylor, Sunit Parekh, and Sam Newman.

Notes

1: The name for this technique originates from miners who would carry a canary in a cage down the coal mines. If toxic gases leaked into the mine, it would kill the canary before killing the miners. A canary release provides a similar form of early warning for potential problems before impacting your entire production infrastructure or user base.

2: The technique of monitoring business metrics and automatically rolling back a release on a statistically significant regression is known as a cluster immune system and was pioneered by IMVU. They describe this and other practices in their Continuous Deployment approach in this blog post.


ParallelChange

tags: evolutionary design · API design · refactoring

Making a change to an interface that impacts all its consumers requires two thinking modes: implementing the change itself, and then updating all its usages. This can be hard when you try to do both at the same time, especially if the change is on a PublishedInterface with multiple or external clients.

Parallel change, also known as expand and contract, is a pattern to implement backward-incompatible changes to an interface in a safe manner, by breaking the change into three distinct phases: expand, migrate, and contract.

To understand the pattern, let's use an example of a simple Grid class that stores and provides information about its cells using a pair of x and y integer coordinates. Cells are stored internally in a two-dimensional array, and clients can use the addCell(), fetchCell() and isEmpty() methods to interact with the grid.

  class Grid {
    private Cell[][] cells;
    …

    public void addCell(int x, int y, Cell cell) {
      cells[x][y] = cell;
    }

    public Cell fetchCell(int x, int y) {
      return cells[x][y];
    }

    public boolean isEmpty(int x, int y) {
      return cells[x][y] == null;
    }
  }
  

As part of refactoring, we detect that x and y are a DataClump and decide to introduce a new Coordinate class. However, this will be a backward-incompatible change for clients of the Grid class. Instead of changing all the methods and the internal data structure at once, we decide to apply the parallel change pattern.

In the expand phase you augment the interface to support both the old and the new versions. In our example, we introduce a new Map<Coordinate, Cell> data structure and the new methods that can receive Coordinate instances without changing the existing code.

  class Grid {
    private Cell[][] cells;
    private Map<Coordinate, Cell> newCells;
    …

    public void addCell(int x, int y, Cell cell) {
      cells[x][y] = cell;
    }

    public void addCell(Coordinate coordinate, Cell cell) {
      newCells.put(coordinate, cell);
    }

    public Cell fetchCell(int x, int y) {
      return cells[x][y];
    }

    public Cell fetchCell(Coordinate coordinate) {
      return newCells.get(coordinate);
    }

    public boolean isEmpty(int x, int y) {
      return cells[x][y] == null;
    }

    public boolean isEmpty(Coordinate coordinate) {
      return !newCells.containsKey(coordinate);
    }
  }
  

Existing clients will continue to consume the old version, and the new changes can be introduced incrementally without affecting them.

During the migrate phase you update all clients using the old version to the new version. This can be done incrementally and, in the case of external clients, this will be the longest phase.
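In the Grid example, migrating a client is a mechanical swap from the coordinate-pair methods to the Coordinate-based ones (assuming a Coordinate(x, y) constructor):

  // A client before migration, on the old version of the interface
  grid.addCell(3, 5, cell);
  if (grid.isEmpty(4, 5)) { … }

  // The same client after migration, on the new version
  grid.addCell(new Coordinate(3, 5), cell);
  if (grid.isEmpty(new Coordinate(4, 5))) { … }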

Once all usages have been migrated to the new version, you perform the contract phase to remove the old version and change the interface so that it only supports the new version.

In our example, since the internal two-dimensional array is no longer used once the old methods have been deleted, we can safely remove that data structure and rename newCells back to cells.

  class Grid {
    private Map<Coordinate, Cell> cells;
    …

    public void addCell(Coordinate coordinate, Cell cell) {
      cells.put(coordinate, cell);
    }

    public Cell fetchCell(Coordinate coordinate) {
      return cells.get(coordinate);
    }

    public boolean isEmpty(Coordinate coordinate) {
      return !cells.containsKey(coordinate);
    }
  }
  

This pattern is particularly useful when practicing ContinuousDelivery because it allows your code to be released in any of these three phases. It also lowers the risk of change by allowing you to migrate clients and to test the new version incrementally.

Even when you have control over all usages of the interface, following this pattern is still useful because it prevents you from spreading breakage across the entire codebase all at once. The migrate phase can be short, but it is an alternative to leaning on the compiler to find all the usages that need to be fixed.

Some example applications of this pattern are:

During the migrate phase, a FeatureToggle can be used to control which version of the interface is used. A feature toggle on the client side allows it to be forward-compatible with the new version of the supplier, which decouples the release of the supplier from the client, as sketched below.

When implementing BranchByAbstraction, parallel change is a good way to introduce the abstraction layer between the clients and the supplier. It is also an alternative way to perform a large-scale change without introducing the abstraction layer as a seam for replacement on the supplier side. However, when you have a large number of clients, using branch by abstraction is a better strategy to narrow the surface of change and reduce confusion during the migrate phase.
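To illustrate the first application above, here is a minimal sketch of a client-side feature toggle over the Grid interface (the toggle wiring is hypothetical; a real one would come from your toggle configuration):

  class GridClient {
    private final Grid grid;
    private final boolean useCoordinateApi; // the feature toggle

    GridClient(Grid grid, boolean useCoordinateApi) {
      this.grid = grid;
      this.useCoordinateApi = useCoordinateApi;
    }

    void placeCell(int x, int y, Cell cell) {
      if (useCoordinateApi) {
        grid.addCell(new Coordinate(x, y), cell); // new version of the interface
      } else {
        grid.addCell(x, y, cell);                 // old version, removed at contract
      }
    }
  }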

The downside of using parallel change is that during the migrate phase the supplier has to support two different versions, and clients could get confused about which version is new versus old. If the contract phase is not executed, you might end up in a worse state than you started in; you therefore need discipline to finish the transition successfully. Deprecation notices, documentation, or TODO comments might help inform clients and other developers working on the same codebase about which version is in the process of being replaced.

Further Reading

Industrial Logic's refactoring album documents and demonstrates an example of performing a parallel change.

Acknowledgements

This technique was first documented as a refactoring strategy by Joshua Kerievsky in 2006 and presented in his talk The Limited Red Society at the Lean Software and Systems Conference in 2010.

Thanks to Joshua Kerievsky for giving feedback on the first draft of this post. Also thanks to many ThoughtWorks colleagues for their feedback: Greg Dutcher, Badrinath Janakiraman, Praful Todkar, Rick Carragher, Filipe Esperandio, Jason Yip, Tushar Madhukar, Pete Hodgson, and Kief Morris.


UnitTest

tags: testing · extreme programming

Unit testing is often talked about in software development, and is a term that I've been familiar with during my whole time writing programs. Like most software development terminology, however, it's very ill-defined, and I see that confusion often occurs when people think it's more tightly defined than it actually is.

Although I'd done plenty of unit testing before, my definitive exposure was when I started working with Kent Beck and used the Xunit family of unit testing tools. (Indeed I sometimes think a good term for this style of testing might be "xunit testing.") Unit testing also became a signature activity of ExtremeProgramming (XP), and led quickly to TestDrivenDevelopment.

There were definitional concerns about XP's use of unit testing right from the early days. I have a distinct memory of a discussion on a usenet discussion group where we XPers were berated by a testing expert for misusing the term "unit test." We asked him for his definition and he replied with something like "in the morning of my training course I cover 24 different definitions of unit test."

Despite the variations, there are some common elements. Firstly there is a notion that unit tests are low-level, focusing on a small part of the software system. Secondly unit tests are usually written these days by the programmers themselves using their regular tools - the only difference being the use of some sort of unit testing framework [1]. Thirdly unit tests are expected to be significantly faster than other kinds of tests.
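As a minimal illustration, here is what such a test might look like with JUnit (Money is a hypothetical class under test, not part of this article): small in scope, written in the programmer's own language, and fast enough to run constantly.

  import org.junit.jupiter.api.Test;
  import static org.junit.jupiter.api.Assertions.assertEquals;

  class MoneyTest {
    @Test
    void addingMoneyOfTheSameCurrencySumsTheAmounts() {
      Money five = new Money(5, "USD");  // Money is assumed for the example
      Money seven = new Money(7, "USD");
      assertEquals(new Money(12, "USD"), five.add(seven));
    }
  }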

So there are some common elements, but there are also differences. One difference is what people consider to be a unit. Object-oriented design tends to treat a class as the unit, while procedural or functional approaches might consider a single function to be a unit. But really it's a situational thing - the team decides what makes sense to be a unit for the purposes of their understanding of the system and its testing. Although I start with the notion of the unit being a class, I often take a bunch of closely related classes and treat them as a single unit. Rarely, I might take a subset of methods in a class as a unit. How you define it doesn't really matter.


Isolation

A more important distinction is whether the unit you're testing should be isolated from its collaborators. Imagine you're testing an order class's price method. The price method needs to invoke some functions on the product and customer classes. If you follow the principle of collaborator isolation you don't want to use the real product or customer classes here, because a fault in the customer class would cause the order class's tests to fail. Instead you use TestDoubles for the collaborators.
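A sketch of what that isolation might look like (Order, Product, and Customer are hypothetical types invented for this example): the test hands the order a stubbed product and customer, so only the order's pricing logic is under test.

  interface Product  { double unitPrice(); }
  interface Customer { double discountRate(); }

  class Order {
    private final Product product;
    private final Customer customer;
    private final int quantity;

    Order(Product product, Customer customer, int quantity) {
      this.product = product;
      this.customer = customer;
      this.quantity = quantity;
    }

    double price() {
      return product.unitPrice() * quantity * (1 - customer.discountRate());
    }
  }

  class OrderTest {
    @org.junit.jupiter.api.Test
    void priceAppliesTheCustomerDiscount() {
      Product stubProduct = () -> 10.0;   // test double with a fixed unit price
      Customer stubCustomer = () -> 0.1;  // test double with a fixed 10% discount
      Order order = new Order(stubProduct, stubCustomer, 3);
      org.junit.jupiter.api.Assertions.assertEquals(27.0, order.price(), 0.001);
    }
  }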

But not all unit testers use this isolation. Indeed, when xunit testing began in the 90s we made no attempt to isolate unless communicating with the collaborators was awkward (such as a remote credit card verification system). We didn't find it difficult to track down the actual fault, even if it caused neighboring tests to fail. So we felt isolation wasn't an issue in practice.

Indeed this lack of isolation was one of the reasons we were criticized for our use of the term "unit testing". I think that the term "unit testing" is appropriate because these tests are tests of the behavior of a single unit. We write the tests assuming everything other than that unit is working correctly.

As xunit testing became more popular in the 2000s, the notion of isolation came back, at least for some people. We saw the rise of Mock Objects and frameworks to support mocking. Two schools of xunit testing developed, which I call the classic and mockist styles. Classic xunit testers don't worry about isolation but mockists do. Today I know and respect xunit testers of both styles (personally I've stayed with the classic style).

Even a classic tester like myself uses test doubles when there's an awkward collaboration. They are invaluable to remove non-determinism when talking to remote services. Indeed some classicist xunit testers also argue that any collaboration with external resources, such as a database or filesystem, should use doubles. Partly this is due to non-determinism risk, partly due to speed. While I think this is a useful guideline, I don't treat using doubles for external resources as an absolute rule. If talking to the resource is stable and fast enough for you then there's no reason not to do it in your unit tests.


Speed

The common properties of unit tests — small scope, done by the programmer herself, and fast — mean that they can be run very frequently when programming. Indeed this is one of the key characteristics of SelfTestingCode. In this situation programmers run unit tests after any change to the code. I may run unit tests several times a minute, any time I have code that's worth compiling. I do this because should I accidentally break something, I want to know right away. If I've introduced the defect with my last change it's much easier for me to spot the bug because I don't have far to look.

When you run unit tests so frequently, you may not run all the unit tests. Usually you only need to run those tests that are operating over the part of the code you're currently working on. As usual, you trade off the depth of testing with how long it takes to run the test suite. I'll call this suite the compile suite, since it's what I run whenever I think of compiling - even in an interpreted language like Ruby.

If you are using Continuous Integration you should run a test suite as part of it. It's common for this suite, which I call the commit suite, to include all the unit tests. It may also include a few BroadStackTests. As a programmer you should run this commit suite several times a day, certainly before any shared commit to version control, but also at any other time you have the opportunity - when you take a break, or have to go to a meeting. The faster the commit suite is, the more often you can run it. [2]

Different people have different standards for the speed of unit tests and of their test suites. David Heinemeier Hansson is happy with a compile suite that takes a few seconds and a commit suite that takes a few minutes. Gary Bernhardt finds that unbearably slow, insisting on a compile suite of around 300ms, and Dan Bodart doesn't want his commit suite to be more than ten seconds.

I don't think there's an absolute answer here. Personally I don't notice a difference between a compile suite that's sub-second or a few seconds. I like Kent Beck's rule of thumb that the commit suite should run in no more than ten minutes. But the real point is that your test suites should run fast enough that you're not discouraged from running them frequently enough. And frequently enough is so that when they detect a bug there's a sufficiently small amount of work to look through that you can find it quickly.

Notes

1: I say "these days" because this is certainly something that has changed due to XP. In the turn-of-the-century debates, XPers were strongly criticized for this as the common view was that programmers should never test their own code. Some shops had specialized unit testers whose entire job would be to write unit tests for code written earlier by developers. The reasons for this included: people having a conceptual blindness to testing their own code, programmers not being good testers, and the value of an adversarial relationship between developers and testers. The XPer view was that programmers could learn to be effective testers, at least at the unit level, and that if you involved a separate group the feedback loop that tests gave you would be hopelessly slow. Xunit played an essential role here: it was designed specifically to minimize the friction for programmers writing tests.

2: If you have tests that are useful, but take longer than you want the commit suite to run, then you should build a DeploymentPipeline and put the slower tests in a later stage of the pipeline.


SelfTestingCode

tags: agile · delivery · testing · extreme programming · clean code · continuous integration · refactoring

Self-Testing Code is the name I used in Refactoring to refer to the practice of writing comprehensive automated tests in conjunction with the functional software. When done well this allows you to invoke a single command that executes the tests - and you are confident that these tests will illuminate any bugs hiding in your code.

I first ran into the thought at an OOPSLA conference, listening to "Bedarra" Dave Thomas say that every object should be able to test itself. I suddenly had the vision of typing a command and having my whole software system do a self-test, much in the way that you used to see hardware memory tests when booting. Soon I was exploring this approach in my own projects and being very happy with the benefits. A couple of years later I did some work with Kent Beck and discovered he did the same thing, but in a much more sophisticated way than I did. This was shortly before Kent (and Erich Gamma) produced JUnit - a tool that became the underpinning of much of the thinking and practice of self-testing code (and its sister: TestDrivenDevelopment).

You have self-testing code when you can run a series of automated tests against the code base and be confident that, should the tests pass, your code is free of any substantial defects. One way I think of it is that as well as building your software system, you simultaneously build a bug detector that's able to detect any faults inside the system. Should anyone in the team accidentally introduce a bug, the detector goes off. By running the test suite frequently, at least several times a day, you're able to detect such bugs soon after they are introduced, so you can just look in the recent changes, which makes it much easier to find them. No programming episode is complete without working code and the tests to keep it working. Our attitude is to assume that any non-trivial code without tests is broken.

Self-testing code is a key part of Continuous Integration, indeed I say that you aren't really doing continuous integration unless you have self-testing code. As a pillar of Continuous Integration, it is also a necessary part of Continuous Delivery.

One obvious benefit of self-testing code is that it can drastically reduce the number of bugs that get into production software. At the heart of this is building up a testing culture where developers are naturally thinking about writing code and tests together.

But the biggest benefit isn't about merely avoiding production bugs, it's about the confidence that you get to make changes to the system. Old codebases are often terrifying places, where developers fear to change working code. Even fixing a bug can be dangerous, because you can create more bugs than you fix. In such circumstances, not only is it horribly slow to add more features, you also end up afraid to refactor the system, thus increasing TechnicalDebt and getting into a steadily worsening spiral where every change makes people more fearful of more change.

With self-testing code, it's a different picture. Here people are confident that fixing small problems to clean the code can be done safely, because should you make a mistake (or rather, when you make a mistake) the bug detector will go off and you can quickly recover and continue. With that safety net, you can spend time keeping the code in good shape, and end up in a virtuous spiral where you get steadily faster at adding new features.

These kinds of benefits are often talked about with respect to TestDrivenDevelopment (TDD), but it's useful to separate the concepts of TDD and self-testing code. I think of TDD as a particular practice whose benefits include producing self-testing code. It's a great way to do it, and TDD is a technique I'm a big fan of. But you can also produce self-testing code by writing tests after writing code - although you can't consider your work to be done until you have the tests (and they pass). The important point of self-testing code is that you have the tests, not how you got to them.

Increasingly these days we're seeing another dimension to self-testing, with more emphasis put on monitoring in production. Continuous Delivery allows you to quickly deploy new versions of software into production. In this situation teams put more effort into spotting bugs once in production and rapidly fixing them by either deploying a new fixed version or rolling back to the last-known-good version.

This entry was originally published (in a much smaller form) on May 5th 2005.

ReportingDatabase

tags: database · application architecture

Most EnterpriseApplications store persistent data with a database. This database supports operational updates of the application's state, and also various reports used for decision support and analysis. The operational needs and the reporting needs are, however, often quite different - with different schema requirements and different data access patterns. When this happens it's often a wise idea to separate the reporting needs into a reporting database, which takes a copy of the essential operational data but represents it in a different schema.

Such a reporting database is a completely different database to the operational database. It may be a completely different database product, using PolyglotPersistence. It should be designed around the reporting needs.

A reporting database has a number of advantages: it keeps the reporting load off the operational database, its schema can be designed - and denormalized - around the reporting needs, and it can hold derived data that would be awkward to compute in the operational schema.

The downside to a reporting database is that its data has to be kept up to date. The easiest case is when you do something like use an overnight run to populate the reporting database. This often works quite well, since many reporting needs work perfectly well with yesterday's data. If you need more timely data, you can use a messaging system so that any changes to the operational database are forwarded to the reporting database. This is more complicated, but the data can be kept fresher. Often most reports can use slightly stale data, and you can produce special-case reports for things that really need up-to-the-second data [1].
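As an illustrative sketch of the messaging approach (all the names here are invented), a handler subscribes to change events from the operational side and writes them, already denormalized, into the reporting schema:

  // Hypothetical event published whenever an order changes the operational database
  class OrderPlacedEvent {
    final String orderId, customerName, region;
    final double total;

    OrderPlacedEvent(String orderId, String customerName, String region, double total) {
      this.orderId = orderId;
      this.customerName = customerName;
      this.region = region;
      this.total = total;
    }
  }

  interface ReportingStore { // assumed wrapper over the reporting database
    void insertSalesFact(String orderId, String customerName, String region, double total);
  }

  class ReportingUpdater {
    private final ReportingStore reportingStore;

    ReportingUpdater(ReportingStore reportingStore) {
      this.reportingStore = reportingStore;
    }

    // Called by the messaging system for each operational change.
    void onOrderPlaced(OrderPlacedEvent event) {
      // Customer and region are copied onto the fact row, denormalized
      // so reports need no joins back to the operational schema.
      reportingStore.insertSalesFact(event.orderId, event.customerName,
                                     event.region, event.total);
    }
  }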

A variation on this is to use views. This encapsulates the operational data and allows you to denormalize. It doesn't, however, allow you to separate the operational load from the reporting load. More seriously, you are limited to what views can derive, and you can't take advantage of derivations that are written in an in-memory programming environment.

A reporting database fits well when you have a lot of domain logic in a domain model or other in-memory code. The domain logic can be used to process updates to the operational data, but also to calculate derived data with which to enrich the reporting database.

I originally wrote this entry on April 2nd 2004. I took advantage of its ten-year anniversary to update the text.

Notes

1: These days the desire seems to be for near-real time analytics. I'm skeptical of the value of this. Often when analyzing data trends you don't need to react right away, and your thinking improves when you give it time for a proper mulling. Reacting too quickly leads to a form of information hysteresis, where you react badly to data that's changing too rapidly to get a proper picture of what's going on.


EnterpriseApplication

tags: application integration · application architecture

In the early part of this century, I worked on my book Patterns of Enterprise Application Architecture. One of the problems I had when writing the book was how to title it, or rather what to call the kinds of software systems that I was writing about. I've always been conscious that my experience of software development has always been focused on one particular form of software - things like health care records, foreign exchange trading, payroll, and lease accounting. These are very different to embedded software inside printers, games, flight control software, or telephone switches. I needed a name to describe these kinds of systems and settled on the term "enterprise application".

As I so often have to say, there is no formal definition for this term. However there are some characteristics that enterprise applications have in common.

Enterprise applications usually have a lot of persistent data, usually managed by some kind of database management system. Usually this database is relational, but increasingly we're seeing NoSQL alternatives. This data will usually be longer lasting and more valuable than the applications that process it.

This data is accessed and manipulated concurrently. The numbers vary a lot, in-house applications may have a few tens of users, but customer-facing web applications can easily have tens of thousands. Despite high levels of concurrency, many enterprise application developers don't think much about critical regions, race conditions and other elements of classic concurrent programming. Instead they build their thinking on top of transactions managed by databases or specialized transaction management tools.

With so much data, enterprise applications have a lot of user interface screens to handle it. Usually the same data is manipulated in different ways in different contexts. Users vary from regular to occasional users, so the interfaces need to match different levels of familiarity. There is also a significant amount of offline (batch) processing that is easily forgotten.

Even if you are building a brand new enterprise application, you don't do so in isolation. Instead you'll need to integrate with other enterprise applications. These systems are built by a wide range of teams, some from vendors who sell to many customers, others built internally just for your organization. These applications will have been written over many decades in a host of different technologies, some of which you'll have to ask your mother about. There are many integration mechanisms to deal with - file exchange, shared databases, messaging middleware. Every so often there will be an attempt to rationalize all this communication technology, but they never entirely succeed, leaving behind more complexity in their wake.

Even when different applications access the same data, there is considerable conceptual dissonance between them: a customer may mean something quite different to the sales organization than it does to technical support. The same-sounding entity has different fields in different contexts, or, worse, fields with the same name yet different meanings.

And then there's so-called "business logic". When you are writing an operating system you strive to keep the whole thing logical, and strive to discover and implement simplifications to keep the software straightforward and reliable. But business rules are given to you as they stand, and if you want to change them you need sixty-seven meetings and three vice-presidents retiring. They are usually a haphazard array of strange conditions that interact in surprising ways. Their insanity derives from a good reason: each one is a case where a salesman could close a particular deal by offering some special one-off condition. Do this a thousand times and you have the complex business "illogic" that lies at the heart of many enterprise applications.

Enterprise applications can be large or small. Often discussion focuses on large, complex applications, but there is also a challenge for smaller applications that need to be built quickly. Big systems make a lot of noise when they go wrong, yet the cumulative effect of small systems can have a surprising effect on an enterprise's health.

Coming up with names for things is always tricky. You need to use a minimum number of words, and want them to trigger the right connotations in the readers' minds, so that you don't have to constantly remind them what the definition means. On the whole I've been reasonably happy with my choice, but since I finished the book the word enterprise has taken on connotations which don't quite fit my usage.

One problem that's emerged since the book is that "enterprise" now usually means a large, well-established company. People think of G.E. or Siemens rather than Facebook, Etsy, or a company of a hundred people producing custom T-shirts. But according to my definition above, even small start-ups rely on software that I would call an enterprise application. So even though the Ruby on Rails community has ended up using enterprise as an insult, I would call Ruby on Rails a framework for building enterprise applications and BaseCamp a classic example of an enterprise application. (Just don't tell DHH I said so or he'll turn me into a hood ornament.)

These connotations around "enterprise" have made me muse about whether we need a different term. When I was writing P of EAA my working title was "Information Systems Architecture", but we felt that "information systems" had its own undesirable connotations of elder technologies. I guess I could go really retro and use "data processing", but on the whole "enterprise application" still seems a better term than anything else I could come up with.

This post is adapted from the definition of Enterprise Application in the introduction of P of EAA.


CircuitBreaker

tags: delivery · application architecture

It's common for software systems to make remote calls to software running in different processes, probably on different machines across a network. One of the big differences between in-memory calls and remote calls is that remote calls can fail, or hang without a response until some timeout limit is reached. What's worse, if you have many callers of an unresponsive supplier, you can run out of critical resources, leading to cascading failures across multiple systems. In his excellent book Release It!, Michael Nygard popularized the Circuit Breaker pattern to prevent this kind of catastrophic cascade.

The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually you'll also want some kind of monitor alert if the circuit breaker trips.

Here's a simple example of this behavior in Ruby, protecting against timeouts.

I set up the breaker with a block (Lambda) which is the protected call.

cb = CircuitBreaker.new {|arg| @supplier.func arg}

The breaker stores the block, initializes various parameters (for thresholds, timeouts, and monitoring), and resets the breaker into its closed state.

class CircuitBreaker...
  attr_accessor :invocation_timeout, :failure_threshold, :monitor
  def initialize &block
    @circuit = block
    @invocation_timeout = 0.01
    @failure_threshold = 5
    @monitor = acquire_monitor
    reset
  end

Calling the circuit breaker will call the underlying block if the circuit is closed, but return an error if it's open.

# client code
    aCircuitBreaker.call(5)


class CircuitBreaker...
  def call args
    case state
    when :closed
      begin
        do_call args
      rescue Timeout::Error
        record_failure
        raise $!
      end
    when :open then raise CircuitBreaker::Open
    else raise "Unreachable Code"
    end
  end
  def do_call args
    result = Timeout::timeout(@invocation_timeout) do
      @circuit.call args
    end
    reset
    return result
  end

Should we get a timeout, we increment the failure counter; successful calls reset it back to zero.

class CircuitBreaker...
  def record_failure
    @failure_count += 1
    @monitor.alert(:open_circuit) if :open == state
  end
  def reset
    @failure_count = 0
    @monitor.alert :reset_circuit
  end

I determine the state of the breaker by comparing the failure count to the threshold.

class CircuitBreaker...
  def state
     (@failure_count >= @failure_threshold) ? :open : :closed
  end

This simple circuit breaker avoids making the protected call when the circuit is open, but would need an external intervention to reset it when things are well again. This is a reasonable approach with electrical circuit breakers in buildings, but for software circuit breakers we can have the breaker itself detect if the underlying calls are working again. We can implement this self-resetting behavior by trying the protected call again after a suitable interval, and resetting the breaker should it succeed.

Creating this kind of breaker means adding a threshold for trying the reset and setting up a variable to hold the time of the last error.

class ResetCircuitBreaker...
  def initialize &block
    @circuit = block
    @invocation_timeout = 0.01
    @failure_threshold = 5
    @monitor = BreakerMonitor.new
    @reset_timeout = 0.1
    reset
  end
  def reset
    @failure_count = 0
    @last_failure_time = nil
    @monitor.alert :reset_circuit
  end

There is now a third state present - half open - meaning the circuit is ready to make a real call as a trial to see if the problem is fixed.

class ResetCircuitBreaker...
  def state
    case
    when (@failure_count >= @failure_threshold) && 
        (Time.now - @last_failure_time) > @reset_timeout
      :half_open
    when (@failure_count >= @failure_threshold)
      :open
    else
      :closed
    end
  end

A call in the half-open state results in a trial call, which will either reset the breaker if successful or restart the timeout if not.

class ResetCircuitBreaker...
  def call args
    case state
    when :closed, :half_open
      begin
        do_call args
      rescue Timeout::Error
        record_failure
        raise $!
      end
    when :open
      raise CircuitBreaker::Open
    else
      raise "Unreachable"
    end
  end
  def record_failure
    @failure_count += 1
    @monitor.alert(:open_circuit) if :open == state
    @last_failure_time = Time.now
  end

This example is a simple explanatory one; in practice circuit breakers provide a good bit more features and parameterization. Often they will protect against a range of errors that the protected call could raise, such as network connection failures. Not all errors should trip the circuit; some should reflect normal failures and be dealt with as part of regular logic.

With lots of traffic, you can have problems with many calls just waiting for the initial timeout. Since remote calls are often slow, it's often a good idea to put each call on a different thread using a future or promise to handle the results when they come back. By drawing these threads from a thread pool, you can arrange for the circuit to break when the thread pool is exhausted.

The example shows a simple way to trip the breaker — a count that resets on a successful call. A more sophisticated approach might look at frequency of errors, tripping once you get, say, a 50% failure rate. You might also have different thresholds for different errors, such as a threshold of 10 for timeouts but 3 for connection failures.

The example I've shown is a circuit breaker for synchronous calls, but circuit breakers are also useful for asynchronous communications. A common technique here is to put all requests on a queue, which the supplier consumes at its speed - a useful technique to avoid overloading servers. In this case the circuit breaks when the queue fills up.

On their own, circuit breakers help reduce resources tied up in operations which are likely to fail. You avoid waiting on timeouts for the client, and a broken circuit avoids putting load on a struggling server. I talk here about remote calls, which are a common case for circuit breakers, but they can be used in any situation where you want to protect parts of a system from failures in other parts.

Circuit breakers are a valuable place for monitoring. Any change in breaker state should be logged and breakers should reveal details of their state for deeper monitoring. Breaker behavior is often a good source of warnings about deeper troubles in the environment. Operations staff should be able to trip or reset breakers.

Breakers on their own are valuable, but clients using them need to react to breaker failures. As with any remote invocation you need to consider what to do in case of failure. Does it fail the operation you're carrying out, or are there workarounds you can do? A credit card authorization could be put on a queue to deal with later, failure to get some data may be mitigated by showing some stale data that's good enough to display.

Further Reading

The Netflix tech blog contains a lot of useful information on improving the reliability of systems with lots of services. Their post on Dependency Command talks about using circuit breakers and a thread pool limit.

Netflix have open-sourced Hystrix, a sophisticated tool for dealing with latency and fault tolerance for distributed systems. It includes an implementation of the circuit breaker pattern with the thread pool limit.

There are other open-source implementations of the circuit breaker pattern in Ruby, Java, Grails Plugin, C#, AspectJ, and Scala.