The 9’s of availability

As an IT Project Manager, it’s important for you to understand availability, how customers view it, and frankly what’s realistic.

You’ll hear people say “I want five 9s” or “My app needs four 9s” of availability. In my experience, most people don’t really know what this means. 5 9’s is the in-flight-magazine gold standard for availability. Everyone thinks their app needs 5 9’s, but do they understand what that implies? Let’s take a look at the math:

Nines of Availability: hours / minutes / seconds of downtime per year (based on a 365-day year)

  • 2 9’s (99%) = up to 87.6h / 5256.0m / 315360.0 seconds of downtime per year.
  • 3 9’s (99.9%) = up to 8.76h / 525.6m / 31536.0 seconds of downtime per year.
  • 4 9’s (99.99%) = up to 0.876h / 52.56m / 3153.6 seconds of downtime per year.
  • 5 9’s (99.999%) = up to 0.0876h / 5.26m / 315.36 seconds of downtime per year.
  • 6 9’s (99.9999%) = up to 0.00876h / 0.53m / 31.536 seconds of downtime per year.
  • 7 9’s (99.99999%) = up to 0.000876h / 0.05m / 3.1536 seconds of downtime per year.
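If you want to check these numbers yourself, here is a minimal sketch in Python. It assumes a 365-day year (8,760 hours), which is what the list above uses; the function name downtime_budget is just something I made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def downtime_budget(nines: int) -> dict:
    """Return the allowed downtime per year for a given number of nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    hours = HOURS_PER_YEAR * unavailability
    return {
        "availability_pct": 100 * (1 - unavailability),
        "hours": hours,
        "minutes": hours * 60,
        "seconds": hours * 3600,
    }


# Print one row per level, from two nines up to seven nines.
for n in range(2, 8):
    b = downtime_budget(n)
    digits = n - 2  # decimal places needed to show the percentage
    print(
        f"{n} 9's ({b['availability_pct']:.{digits}f}%): "
        f"{b['hours']:.6f}h / {b['minutes']:.4f}m / {b['seconds']:.4f}s"
    )
```

Each extra nine divides the downtime budget by ten, which is why the numbers get small so quickly.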

 

5 9’s means less than 5 1/2 minutes of downtime per year! Now, how long does it take a server just to boot once? Take a high-end server, load it up with boards, memory, and CPUs, hang a lot of storage on it, and the POST alone can take 10+ minutes. Just one reboot in a year and 5 9’s are out the window.
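To put a number on that single reboot, here is a rough sketch that turns minutes of downtime into an availability figure and checks it against a target. It assumes a 365-day year and that the reboot is the only outage all year; achieved_availability and meets_target are hypothetical helper names.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600 minutes, assuming a non-leap year


def achieved_availability(downtime_minutes: float) -> float:
    """Fraction of the year the service was up, given total downtime."""
    return 1 - downtime_minutes / MINUTES_PER_YEAR


def meets_target(downtime_minutes: float, nines: int) -> bool:
    """True if the downtime fits inside the yearly budget for that many nines."""
    budget_minutes = MINUTES_PER_YEAR * 10 ** -nines
    return downtime_minutes <= budget_minutes


reboot_minutes = 10  # the 10-minute POST from the paragraph above
print(f"Availability: {achieved_availability(reboot_minutes) * 100:.4f}%")
print("Meets four 9s:", meets_target(reboot_minutes, 4))
print("Meets five 9s:", meets_target(reboot_minutes, 5))
```

One 10-minute reboot works out to roughly 99.998% for the year: comfortably inside four 9s, but already past the five 9s budget.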

Five nines of availability for a single component is an incredibly hard goal to attain. Five nines of availability for a service is possible, but it depends on an almost perfect architecture, super reliable methods of failure detection and failover, and very tight controls on change and access.

Even if an app and server combo somehow got 100% uptime, a network outage, a human error, or even a storm that takes out building power will drop your availability below 5 9’s.   
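Here is a small sketch of why the weakest links dominate. It assumes the service needs every component up at the same time and that the components fail independently, in which case the overall availability is simply the product of the individual ones. The component names and percentages below are made-up illustration values, not measurements.

```python
from math import prod


def serial_availability(components: dict) -> float:
    """Availability of a service that needs ALL components up at once.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(components.values())


# Hypothetical numbers: a perfect app/server combo behind a
# four-nines network and four-nines building power.
stack = {
    "app_and_server": 1.0,
    "network": 0.9999,
    "building_power": 0.9999,
}

overall = serial_availability(stack)
print(f"Overall availability: {overall * 100:.4f}%")
print("Meets five 9s:", overall >= 0.99999)
```

Even with a flawless app and server, the combined figure can never be better than the worst dependency, and in this sketch it lands below four 9s.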

Many companies will define services in terms of criticality – mission critical, business critical, business operational, and so on.  You need to understand your company’s definitions and see if and how they relate to the 9’s of availability.  Often these definitions talk more about RTO and RPO for the service.

  • RTO – Recovery Time Objective – The length of time a service can be down before it impacts business functions.  This could be thought of as a link to the 9’s of availability, but typically an RTO talks only about the acceptable duration of a single outage, not the number of allowable outages over time.
  • RPO – Recovery Point Objective – The amount of data loss that can be tolerated before business is impacted.  For instance, you might be able to lose the previous 4 hours of data entered before the system went down.  The business will have to re-enter that data once the system is operational again, which might be acceptable for a payroll system.  On the other hand, the business might not be able to tolerate ANY data loss, and all records must be fully recovered.  That might be the case for your ERP system.
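To see why an RTO by itself doesn’t translate into a number of 9’s, here is a rough sketch that counts how many outages, each lasting the full RTO, would fit inside a yearly availability budget. The 4-hour RTO is a hypothetical value, the year is assumed to be 365 days, and outages_within_budget is a made-up helper name.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def outages_within_budget(rto_hours: float, nines: int) -> int:
    """How many outages of full RTO length fit in the yearly downtime budget."""
    budget_hours = HOURS_PER_YEAR * 10 ** -nines
    return math.floor(budget_hours / rto_hours)


rto_hours = 4  # hypothetical RTO for illustration
for n in (2, 3, 4, 5):
    count = outages_within_budget(rto_hours, n)
    print(f"{n} 9's: room for {count} outage(s) of {rto_hours}h each per year")
```

With a 4-hour RTO, even one outage that uses the whole recovery window is more than the four or five 9’s budget allows, so the two numbers really do answer different questions.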

Again, none of this really ties to the business buzzword bingo of the 9’s of availability.  This is not to say we should not strive for such things, just make sure you understand what’s being asked of you and what your org can commit to.

At this point, if you don’t have Toto’s “99” stuck in your head, I’m vastly disappointed in you.

~ by brianherman on May 21, 2008.
