CatastrophicFailover design 7 March 2005

(Update at end)

One of the oft-advertised features of modern application servers is that they provide failover in a cluster. Clustering improves the reliability of your application: if one of your servers goes down, you have some more up to serve your customers. Failover can add even more reliability: if a server goes down in the middle of an interaction, the cluster can move that interaction to another server.

However, this can be a problem.

A request may do something that causes a server to crash, perhaps by unwittingly exposing a bug in the server software. So when the failover kicks in, the deadly request gets moved to another server, which it can then bring down in turn. Get the timing just right and by the time the first server has rebooted, it will be ready to receive that request again.

(In case you're wondering, this is a true story.)

So if you see your servers repeatedly going down, an errant transaction could well be the cause. To prevent this, you need a check to ensure that you don't migrate a request that's already been through a couple of failovers. It's good to fail over, but you don't want your farm to do it too often.
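
Here's a minimal sketch of what such a check might look like, in Python. Everything in it is made up for illustration: the `Request`, `Server`, and `dispatch` names, the "poison" payload standing in for a request that trips a server bug, and the limit of two failovers. A real application server would do this inside its own clustering machinery, so treat this as a picture of the idea rather than anyone's actual implementation.

```python
from dataclasses import dataclass

MAX_FAILOVERS = 2  # assumed threshold, standing in for "a couple of failovers"


class ServerCrash(Exception):
    """Raised by a hypothetical server when a request brings it down."""


@dataclass
class Request:
    payload: str
    failovers: int = 0  # how many times this request has been migrated


class Server:
    def __init__(self, name):
        self.name = name
        self.up = True

    def handle(self, request):
        # Stand-in for a latent server bug triggered by one specific input.
        if request.payload == "poison":
            self.up = False
            raise ServerCrash(self.name)
        return f"{self.name} handled {request.payload}"


def dispatch(request, servers, max_failovers=MAX_FAILOVERS):
    """Try servers in turn, but stop migrating after max_failovers."""
    for server in servers:
        if not server.up:
            continue
        try:
            return server.handle(request)
        except ServerCrash:
            if request.failovers >= max_failovers:
                # Give up rather than feed the request to yet another server.
                return f"rejected {request.payload} after {request.failovers} failovers"
            request.failovers += 1  # migrate to the next server
    return "no servers available"


farm = [Server(name) for name in "abcd"]
print(dispatch(Request("poison"), farm))  # rejected poison after 2 failovers
print([s.name for s in farm if s.up])     # ['d']
```

With four servers in the farm, the deadly request takes down the first three (the original attempt plus two failovers) and is then rejected, leaving the fourth server up to handle everyone else. Without the counter it would have worked its way through the whole farm.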

Update: Christopher Baus pointed out that this problem suggests you should deliberately use different equipment on your cluster. So if you're running a Java application, consider using a mix of different app servers, operating systems, and hardware. A mix is more complex to manage, of course, but greatly reduces the chance of this problem happening.

