While there are many aspects of a quality Business Continuity Plan (BCP), I’d like to focus on failback. Failback is the term for moving your IT environment back to primary production. It’s essential to know your failback options because failover is not a permanent state. Here are a few issues that should be considered.
First, your company likely has to adhere to compliance regulations. Review any relevant policies when designing your failover site. PCI DSS, for example, no longer allows a customer to run applications in a Disaster Recovery (DR) environment that doesn’t have an identical level of security. This affects the failback strategy as you may have to add security measures to the DR site (like file integrity scanning and security event log management) in order to resume processing cardholder data.
The next issue is the application stack. Many software products, such as Microsoft’s SQL Server, have licensing limits that must be addressed prior to running production workloads for an extended time.
The third issue is capacity. Suppose your BCP was designed to operate at 20% of your production capacity. The longer you are in your failover site, the greater the risk due to a lack of resources.
Finally, there are limitations if you are leveraging a third-party company to assist in your data backup and disaster recovery plan. Most restrict how long you can occupy the space they provide before you must transition your environment back to its original site or convert your contract with the vendor to make the site permanent.
In addition, consider:
- Time needed to acquire replacement hardware.
- The method used to fail over to your DR site. This will typically be the same process followed to fail back.
- Costs (hardware, software, facility, and failover declaration fees) associated with turning your DR site into a permanent site.
- Your recovery time objective (RTO) for failback, that is, how long the failback method will take. This can sometimes be more painful than the actual failover event.
Failover Site Selection and Failback Capability
Your choice of a failover site can also affect the capabilities of your failback, whether it entails restoring existing infrastructure, buying new infrastructure, or moving to a production cloud.
Using colocation with failback to existing infrastructure is relatively inexpensive but can be labor- and time-intensive. Restoring large amounts of data (more than 5 or 10 terabytes) can take days when restoring it from tape.
The effectiveness of this strategy diminishes further if new infrastructure is required. Capital expenditure costs run high, as do labor costs. Buying and configuring new hardware and simultaneously restoring 5 to 10 terabytes of data, all within the recovery window, is going to be a serious challenge.
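To get a feel for why a multi-terabyte restore can take days, here is a rough back-of-the-envelope sketch. The 50 MB/s sustained throughput is purely an assumed figure for illustration, not a measured tape or network rate, and the function names are hypothetical. Plug in the rate you actually observe during a restore test.

```python
def restore_hours(data_tb: float, throughput_mb_per_s: float) -> float:
    """Estimate the hours needed to restore data_tb terabytes at a sustained rate."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    total_mb = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = total_mb / throughput_mb_per_s
    return seconds / 3600


def fits_window(data_tb: float, throughput_mb_per_s: float, window_hours: float) -> bool:
    """Return True if the estimated restore time fits inside the recovery window."""
    return restore_hours(data_tb, throughput_mb_per_s) <= window_hours


if __name__ == "__main__":
    # Assumed sustained throughput of 50 MB/s (illustrative only).
    for size_tb in (5, 10):
        hours = restore_hours(size_tb, throughput_mb_per_s=50)
        print(f"{size_tb} TB: about {hours:.1f} hours ({hours / 24:.1f} days)")
```

The estimate covers data transfer only. Time spent acquiring, racking, and configuring hardware comes on top of it, which is why the new-infrastructure scenario is so much harder to fit into a recovery window.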
A public cloud service offers an easy option for storing data during a DR event. With low front-end costs, it’s great for small businesses with limited IT staff or that lack the in-house services to protect data. But if you’re a mid-sized company facing a disaster, prepare yourself.
Back-end costs can run high. Retrieving data is going to be expensive, given the scope and scale of what you likely have stored there. You will be paying for every gigabyte you take out.
When one of my customers was a startup company, they pushed copies of their data to the cloud and would occasionally pull that data back. As time progressed, their data set grew from under 5 TB to 50 TB and then to 100 TB. Now pulling that data set has become a very costly expense just to test their failover.
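The way retrieval costs scale with the data set can be sketched with simple arithmetic. The per-gigabyte price below is a hypothetical placeholder, not any provider's actual rate, and real pricing is often tiered, so treat this as a rough illustration and substitute the figures from your own contract.

```python
# Hypothetical retrieval price in dollars per gigabyte (an assumption for illustration).
ASSUMED_PRICE_PER_GB = 0.09


def retrieval_cost(data_tb: float, price_per_gb: float = ASSUMED_PRICE_PER_GB) -> float:
    """Estimate the cost of pulling data_tb terabytes out of the cloud at a flat rate."""
    if data_tb < 0 or price_per_gb < 0:
        raise ValueError("size and price must be non-negative")
    data_gb = data_tb * 1000  # decimal units: 1 TB = 1,000 GB
    return round(data_gb * price_per_gb, 2)


if __name__ == "__main__":
    # The data set sizes mentioned in the example above.
    for size_tb in (5, 50, 100):
        print(f"{size_tb} TB: roughly ${retrieval_cost(size_tb):,.2f} per full pull")
```

At a flat rate the cost grows in direct proportion to the data set, so a failover test that was affordable at 5 TB costs about twenty times as much at 100 TB.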
With a DRaaS solution, you fail over to a predesignated DR cloud. Failover is smooth because your DRaaS provider assisted you in designing the solution and testing it to ensure any issues were worked out.
DRaaS solutions typically leverage continuous data protection (CDP) technologies that offer low levels of data loss by replicating production data. Mission-critical production can operate throughout the recovery period as server security configurations and network services are duplicated at the DR site.
The failback options for DRaaS are the same as for colocation and public cloud services: restoring existing infrastructure, buying new infrastructure, or moving to a production cloud. In any of these cases, the labor involved for failback falls on the service provider, saving you time when you need it most.
The failback event itself is simpler too. It requires only three steps:
- Recover your infrastructure (or keep production in the cloud).
- Reload your hypervisor.
- Install one virtual machine (VM).
From here, your DRaaS provider should take over the replication and eventual failback into your production environment.
Additional benefits include:
- Minimized downtime, ensuring your company is protected from financial losses.
- Protection for your company from risk of compliance penalties.
- Stronger security against breaches. A company’s DR site often holds all of its data, yet all too often the DR servers are neglected and missing critical security patches. With a DRaaS solution, the DR VMs stay in sync and are typically powered off, so the data is not even accessible until the time of declaration.
However, not all DRaaS providers are created equal. Some have production capacity, while others do not. Ask in advance.
It’s just as important to consider your failback strategy as your failover strategy. The issues (how long your methodology will take, the labor involved, the ease of transport, the costs of returning to or finding a new production site) should not be left to resolve during the recovery period. That’s when company resources will be tight and your energy diminished.