R is for Reliability

37signals was in the news last year for the wrong reasons: they experienced multiple outages in a short time period. From their own site: “The virtual machines had severely degraded network and CPU performance causing them to grind to a halt.” Sounds unfortunate, but these things just happen, right?

Well, there’s more to it than that. Though accidents happen, reliability isn’t luck. The severe problems they had were really due to a series of choices they’ve made over the years, one of which was to take their hosting in-house. Hosting some low-usage internal company document server is one thing, but if the core of their business relies on those servers, trying to minimize hosting cost at the possible expense of reliability might be a cost-saving way to shoot themselves in the foot. But enough picking on 37signals; they are no different from many of us doing dev work. We all tend to think of the development as the “real” work and everything else as something that’s not a big deal, something we maybe don’t need to pay attention to, something we don’t need to pay so much for.

Reliability is not simple. It’s easy enough to get something working, but what if people come to rely on the application on that server being up at all times? What if getting people to depend on you always being available is part of your business strategy? People wanting high uptime throw around “nines” of uptime, as in 3-nines (99.9% uptime) or 5-nines (99.999% uptime). Think about that: 5-nines means you get only about five minutes a year for your service to be down, and that year includes Thanksgiving, Black Friday, 4am on any random day, etc. For really high uptime, by the time you get the alert you’ve already burned through much of the downtime your SLA allows.
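To make the arithmetic behind the “nines” concrete, here is a minimal sketch that converts an uptime target into a yearly downtime budget. The function name is my own, and it assumes a 365-day year with downtime counted across the whole year (no excluded maintenance windows, which a real SLA may or may not allow).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted by an uptime target of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001 of the year may be down
    return MINUTES_PER_YEAR * unavailability


# Print the yearly budget for 2 through 5 nines.
for n in range(2, 6):
    uptime_pct = 100 * (1 - 10 ** -n)
    budget = allowed_downtime_minutes(n)
    print(f"{n} nines ({uptime_pct:.3f}% uptime): {budget:.1f} minutes/year")
```

Three nines works out to a bit under nine hours a year; five nines is a little over five minutes, which is less time than it takes most people to wake up and find their laptop.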
 
To get that kind of reliability, your infrastructure has got to be ready for even the rare cases. What if the power goes out? What if it stays out for 4 days? What if internet is down? What if the internet provider’s pipe goes down? What if a server’s fan goes bad and that server’s fan type isn’t sold in the area? What if the area is flooded by a freak storm? What if the server room catches on fire? What if someone trips over a cord? What if all the support people going to lunch together get hit by a car? What if your so-far-been-reliable data supplier with the 99.999% SLA has a big failure of their own? What if there’s a bad code deploy... or a bad system update?

Maybe these questions seem paranoid; perhaps these things seem silly and too unlikely. And honestly, any one of these possible problems is a little too rare to worry much about on its own. What you’re concerned about isn’t single risks on their own, it’s your entire system-level risk profile: how likely it is that anything of any kind will happen. Calculating that risk, and then bringing it down, means looking at the small things in service of understanding the big picture. There’s a pretty good chance at least one of these sorts of things will happen this year, or maybe next year. The thing is, you don’t know which one it’ll be, so for high reliability, you have to be ready for a lot.
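A rough way to see why the system-level picture matters: if each risk is unlikely on its own, the chance that at least one of them happens is still one minus the chance that none of them do. The sketch below assumes the risks are independent, which real failures often are not, and every probability in it is invented purely for illustration.

```python
from math import prod

# Hypothetical chance of each event happening in a given year.
# These numbers are made up for illustration, not measured.
yearly_risk = {
    "extended power outage": 0.02,
    "internet provider failure": 0.05,
    "hardware part unavailable": 0.03,
    "flood or fire": 0.01,
    "data supplier failure": 0.02,
    "bad deploy or system update": 0.10,
}


def chance_of_any(probabilities) -> float:
    """Probability that at least one event occurs, assuming independence."""
    return 1 - prod(1 - p for p in probabilities)


combined = chance_of_any(yearly_risk.values())
worst_single = max(yearly_risk.values())
print(f"Most likely single risk: {worst_single:.0%}")
print(f"Chance that something goes wrong this year: {combined:.0%}")
```

The combined figure always comes out higher than the largest single entry, and it keeps climbing as you add more “too unlikely to bother with” items to the list.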

What does “ready” mean? Well, for the internet, it may mean paying to have multiple internet providers. It may mean replacing servers while they are still running well to ensure parts are available. Generators, along with fuel and staff who know how to work them, are a must. Many of these scenarios require having several complete systems ready to take over at any sign of trouble, located in other places in the country, already spun up, with staff. It may mean having more duplicate systems so that OS updates and patches can be brought online with the ability to fall back at any sign of trouble. It may mean spending much more on software development to build in the ability for the software to handle a multitude of seemingly unlikely scenarios, be it slowdowns, flaky connections, incorrect data, 20x the normal load, or dependencies failing.

Certainly, there may be some risks that one might not want to prepare for, either because they are too unlikely, because it's too expensive to prepare, or because it's a risk that changes the whole picture. Take a space alien invasion of Austin, TX for instance. This could be a severe risk for my company's projects; however, it's a risk that we are OK not mitigating, as 1) to the best of my knowledge a space alien invasion is pretty unlikely, 2) fully-office-ready alienproof bunkers are really expensive, and 3) we all might have other things to worry about at that point.
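Back on the more earthly side of “ready”: as one small example of software built to tolerate flaky connections and failing dependencies, here is a minimal retry-with-backoff sketch. The helper name and its parameters are hypothetical, and this is only one of many defensive patterns, not a complete answer to any of the scenarios above.

```python
import random
import time


def call_with_retries(operation, attempts=4, base_delay=0.5,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Call operation(); on a retryable error, back off and try again.

    The wait doubles after each failed attempt, with random jitter added
    so that many clients do not all retry at the same instant.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

You would wrap a call to an unreliable dependency, for example `call_with_retries(fetch_prices)` where `fetch_prices` is whatever function talks to your data supplier. Note what this does not cover: a dependency that stays down, or one that returns wrong data quickly. Those need their own handling, which is part of why this kind of readiness costs real development time.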

How to handle each type of potential failure is a choice; a choice you may not realize you are making, but a choice nonetheless, and we’ve all made them on all of the projects we’ve worked on, though we’ve probably not thought about it for most of them. All those lines of protective code you didn’t write when playing around on a toy project? You could have written them but you didn’t (probably a good idea in that case). On the other hand, for a development shop working on a project that requires high reliability, it makes sense to build reliability into the software itself, and it makes sense for the dev shop to handle that internally as part of their core competency. However, for aspects of reliability that are outside of the dev shop’s core skillset, the company faces a choice: 1) make all of the minutiae another core competency, devoting resources accordingly (i.e. you are no longer just a dev shop, you have specializations in development, disaster mitigation, recovery and server/hosting), 2) outsource those areas to someone for whom it is a core competency, or 3) accept that reliability just isn’t going to be what one might like, and set client expectations accordingly. The right choice here is going to depend on company size, resources, and desire for focus. More important than the actual choice made is to make it purposefully and with full understanding; otherwise, there is a fourth possibility: promising a lot more than you’re ready to deliver.

I should mention that a really excellent book on the subject of reliability is Release It! by Michael Nygard. If you’ve ever made a promise of uptime and not thought much about what you were promising, this book will give you a lot to think about.

 


Posted 01-08-2011 6:59 PM by Anne Epstein


Site Copyright © 2007 CodeBetter.Com
Content Copyright Individual Bloggers

 
