What exactly is an Outage?
Before you can decide on the right RTOs (Recovery Time Objectives) and RPOs (Recovery Point Objectives) for your business, and select the right DR (Disaster Recovery) and BC (Business Continuity) services, you need to understand what constitutes an “outage”. It’s not as obvious as it sounds.
To understand the concept clearly, let’s break it down into four stages:
- Awareness
- Resolution
- Failover
- Recovery
An outage includes all four stages, and the total time to deal with an outage is the time it takes to get through every one of them.
Recovery is usually the shortest part of an outage, and, as a result, you’ll often hear vendors focus almost exclusively on their short recovery times. Fine, as far as it goes. But ignoring the other three stages – and failing to take them into account – will get you into hot water. When you’re making promises to your boss about how short any potential outages will be, it pays to be realistic.
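To make the arithmetic concrete, here is a minimal sketch that adds up the four stages of one hypothetical outage. The awareness and resolution figures are the rough estimates used in this article (about an hour, and at least half an hour); the failover and recovery figures are invented placeholders, as are the names `STAGES`, `total_outage` and `share_of_outage`. Substitute your own measurements.

```python
from datetime import timedelta

# Stage durations for one hypothetical outage.
# Awareness and resolution use the rough estimates from this article;
# failover and recovery are made-up placeholders for illustration only.
STAGES = {
    "awareness": timedelta(minutes=60),
    "resolution": timedelta(minutes=30),
    "failover": timedelta(minutes=15),  # assumption, not a measured value
    "recovery": timedelta(minutes=5),   # assumption, not a measured value
}


def total_outage(stages):
    """Return the end-to-end outage time: the sum of every stage."""
    return sum(stages.values(), timedelta())


def share_of_outage(stages, name):
    """Return the fraction of the total outage spent in one stage."""
    return stages[name] / total_outage(stages)


if __name__ == "__main__":
    print(f"Total outage: {total_outage(STAGES)}")
    for name, duration in STAGES.items():
        print(f"{name:>10}: {duration} ({share_of_outage(STAGES, name):.0%})")
```

With these illustrative numbers, recovery is only a small slice of the total, which is why a quoted recovery time on its own says little about how long your users will actually be down.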
Awareness
The first stage is usually the longest in the whole process: figuring out that you actually have an outage.
IT often finds out about a problem when users start calling to complain that they can’t work. At this point, the outage has been underway for some time and is already having an impact on the business.
Understanding what is really going on – is it an outage or user error, for example – can often take an hour or more. Once you’ve confirmed that you’re dealing with an outage, you can move immediately to Stage 2.
Resolution
Now you must triage the system and make some decisions. Is it something you can fix quickly, or do you need to fail over to backup systems?
Sometimes what looks like a system outage is highly localized. For example, a virus-ridden laptop might cause problems for one person, or even a group of people, that look like a problem at the server level. This type of local problem is still an outage as far as your people are concerned, of course, but you can usually deal with it quickly and without affecting the rest of the organization.
It takes at least another half hour to confirm that you’re looking at a system-level issue that requires a failover, which is where NEWCOM comes in.