Calculating uptime is easy, right? Just take the number of minutes your server was available (up) and divide it by the number of minutes within a given time period (say a month).
Uptime % = (# of Minutes Available / # of Minutes in Month) x 100
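In code, that formula is a one-liner. A minimal Python sketch (the 43 minutes of downtime is just an illustrative number):

```python
def uptime_percent(minutes_up: float, minutes_total: float) -> float:
    """Available minutes over total minutes, times 100."""
    return minutes_up / minutes_total * 100

# A 30-day month has 43,200 minutes; 43 minutes of downtime
# leaves 43,157 minutes of availability -- just about three nines.
print(f"{uptime_percent(43_157, 43_200):.3f}%")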
The world knows this as the 9’s chart (fig. 1.1), and marketing teams across the globe continue to find new ways of touting the stability of networks and infrastructures by attaching two nines, three nines, four nines, five nines, or the new top of the heap, six nines. While the math is fairly straightforward, the real world of uptime is not so simple. Let’s look at a few reasons why such an easy calculation has become less and less meaningful.
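To put those marketing labels in concrete terms, here is a quick sketch of the downtime budget each level of nines allows in a 30-day month:

```python
def allowed_downtime_minutes(nines: int, minutes_in_month: float = 43_200) -> float:
    """Monthly downtime budget for a given number of nines (3 nines = 99.9%)."""
    availability = 1 - 10 ** -nines   # e.g. 3 nines -> 0.999
    return minutes_in_month * (1 - availability)

for n in range(2, 7):
    print(f"{n} nines: {allowed_downtime_minutes(n):.4f} minutes/month")
```

Two nines buys you a generous 432 minutes a month; six nines leaves you well under 3 seconds.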
First of all (and this isn’t exactly a new issue), a server being “up” doesn’t inherently mean that it is doing what it is supposed to be doing. Servers have roles to play within an infrastructure, and responding to an ICMP or SNMP packet is rarely a server’s sole purpose. We have all encountered a server that is pingable or completely accessible via RDP, yet is not serving up the application it is supposed to be delivering. Logging into that machine and checking its “uptime” can therefore be deceiving.
When this happens, ICMP or even basic SNMP monitoring provides a false sense of security, all while your customers are cursing your existence and searching for alternatives. Furthermore, with many architectures being deployed in high availability scenarios, is a single server’s uptime really all that relevant? Dropping a single web server from a farm requires attention to restore full functionality, but it is hardly the end of the world. End-of-the-world angst should be reserved for when sites like Reddit, Vine or Imgur go down.
So what’s a better alternative? Well, the official answer can be found in the ITSM (IT Service Management) philosophy, fortified with the practices and policies contained within ITIL. However, I am not here to lecture anyone on those teachings. I would rather strip out the rhetoric of “why” and get to “what.”
To properly ensure you are not on the receiving end of a verbal tirade from customers and executives, you should alter your approach to monitoring. Server level monitoring remains important for both real-time information and alerts. It is also important for forensic information should something go wrong. What is needed in addition to that is defining a “service,” and then monitoring that more holistic “entity.” A service is usually a compilation of several servers all designed to deliver a specific “product.”
Take email as an example. Current best-practice email server configurations are typically set up using a minimum of three servers: CAS servers, front-end servers and mailbox stores, all working together to achieve the desired result. Rather than just pinging the servers or polling them for availability via SNMP, the better metric is an “end-to-end” email test.
You should be working with your monitoring software provider to have an email sent into your environment every x minutes, with an auto-reply set to shoot it back. This way you can track true mail flow as well as potential latency issues. After all, nobody cares that the CAS server is “up” if email isn’t being delivered.
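The round-trip logic of that check can be sketched in a few lines. This is a minimal illustration, not any vendor’s implementation: the `send_probe` and `reply_arrived` hooks are hypothetical stand-ins for whatever your monitoring tool (or smtplib/imaplib glue) actually does.

```python
import time
from typing import Callable

def email_round_trip(send_probe: Callable[[str], None],
                     reply_arrived: Callable[[str], bool],
                     timeout_s: float = 300,
                     poll_every_s: float = 5) -> float:
    """Send a probe message, then poll for the auto-reply.

    Returns round-trip latency in seconds, or raises TimeoutError if the
    reply never shows up, which is your "mail flow is broken" alert.
    """
    token = f"probe-{time.time()}"          # unique subject line to match on
    start = time.monotonic()
    send_probe(token)
    while time.monotonic() - start < timeout_s:
        if reply_arrived(token):
            return time.monotonic() - start
        time.sleep(poll_every_s)
    raise TimeoutError(f"no auto-reply for {token} within {timeout_s}s")
```

The latency figure is what surfaces the “slow but technically up” cases that a plain ping would never show.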
The same goes for web farms or applications involving transactions (like ecommerce or SaaS): watching just the individual pieces of the puzzle, in hopes of catching the one that breaks, is far from a fail-safe approach to completing the picture.
There are several variations of web transaction-level monitoring. One simple approach is creating a dummy page on your site that validates connectivity all the way through to the database (Fig 1.2). While simple to do, the problems here can be security and an inability to mirror the real-world experience. The preferred method is to set up the more complex (but more telling) web transaction: a process that mimics end-user behavior all the way through committing a shopping cart purchase (if applicable).
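A bare-bones version of the dummy-page approach can be sketched as follows. The `HealthHandler` here is a stand-in local server purely so the example is self-contained; a real check would hit your site’s actual health URL, and that page would verify the database round-trip rather than just returning a static “OK.”

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Stand-in for a dummy /health page on the real site."""
    def do_GET(self):
        body = b"OK"            # a real page would report DB connectivity here
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):   # keep the demo quiet
        pass

def check_service(url: str, timeout_s: float = 5) -> tuple[bool, float]:
    """Return (healthy, latency in seconds) for one synthetic check."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = resp.status == 200 and resp.read() == b"OK"
    except OSError:
        ok = False
    return ok, time.monotonic() - start

server = HTTPServer(("127.0.0.1", 0), HealthHandler)   # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
healthy, latency = check_service(f"http://127.0.0.1:{server.server_port}/health")
server.shutdown()
print(healthy)
```

A full transaction monitor extends the same idea: instead of one GET, it scripts the login, cart and checkout steps and fails the check if any step breaks or runs slow.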
So without the additional monitoring configurations, you might not be getting the whole story, and you might be inviting someone to ruin your day with a flurry of complaints. “But I’ll still have my SLA from my provider to fall back on and receive financial compensation for my heartache, right?” Maybe, and that is a big maybe at that. It also brings me to reason number two as to why you should re-evaluate your thoughts on uptime.
I stated the math is simple on calculating uptime, and in the example provided, it is. So why would different companies have different interpretations of such an easy concept? Simple: financial penalties (SLAs). When credits are at stake, the definition of uptime needs to be both agreeable to the consumer and in the best interest of the provider. That’s where we get the variables. Individual component failure is not a question of “if” but rather “when.” Likely downplayed during a sales cycle, the unfortunate reality of all things digital is that it will break at some point. This is why providers go to great lengths to build redundancies into their infrastructure.
But what if you don’t take advantage of that? Keeping them anonymous, many providers emphatically state that an outage is not an outage unless ALL of your servers are completely inaccessible across an entire region. Furthermore, when it comes to storage availability, I/O performance needs to be at zero, versus perhaps performing poorly due to excessive latency.
Those are some major caveats. On top of that, all outages caused as a direct result of any provider maintenance activities are not covered. So where does that leave you? Helpless? Not entirely. But it does force your hand on building out a highly available infrastructure if you want to have any financial restitution for provider issues.
If you do protect yourself via proper design, the aforementioned service monitoring becomes even more vital, since your individual servers will be spread across multiple networks and geographic areas. Chasing a single-instance alert on a specific server might not be time well spent; focusing on the “entity,” or service, is a far better way to spend your admin cycles.
So the next time you ask (or are asked) about uptime, make sure all the variables are discussed. Then determine if that measurement is going to match what you need for your business and personal sanity. Having chased SLA credits through a maze of semantics, I can assure you that knowing the definitions is the only map out of the labyrinth. All the 9’s in the world won’t help if you are down and don’t know it — and aren’t covered for it.