Recovering Right: How to Improve at IT Disaster Recovery

Written by Richard Long | Sep 18, 2024 7:50:57 PM

Far from relieving organizations of the responsibility of recovering their IT systems, today’s cloud-based and hybrid environments make it more important than ever that companies know how to bring their systems back up in the event of an outage. In today’s post we’ll look at some of the things organizations need to know and do in order to be able to “recover right.”

Related on MHA Consulting: The Cloud Is Not a Magic Kingdom: Misconceptions About Cloud-Based IT/DR

The Cloud Is Not a Plan

A common misconception today is that the shift from company-owned data centers toward cloud-based environments means companies can quit worrying about IT disaster recovery (IT/DR). (IT/DR is the part of business continuity that deals with restoring computing systems, applications, and data following a disruption.) We often hear people say, “We’re in the cloud, so we don’t have to have an IT/DR plan. The cloud is our plan.”

This assumption is misguided. Many organizations today use a hybrid infrastructure, so if a company cannot recover its own part of the environment, full recovery is impossible.

In addition, Software as a Service (SaaS) and hosted solutions tend to involve numerous connections with company resources including point-to-point connections. As a result, the company might need to perform various actions to allow recovery to occur. An unprepared company will not know what actions to take or be able to take them, delaying or preventing recovery, with all the attendant impacts on production, revenue, and reputation.

If anything, contemporary computing environments are more complicated than ever. This requires greater diligence on the part of company staff responsible for keeping them running and greater sophistication in IT/DR recovery planning.

Moreover, cloud-services providers are themselves susceptible to outages and failed recoveries. Any company that places its well-being entirely in the hands of such a provider is making itself a hostage to fortune.

Two Additional Misconceptions

Two more things are worth mentioning in this context. First, it is a mistake to assume that recovering from a DR event is similar to day-to-day troubleshooting or recovering from an outage to a single component or application. There is an order-of-magnitude difference between the two. Just because you have experience with the latter does not mean you are prepared to handle the former.

Second, there is a common tendency for organizations to assume that a large-scale outage, should one occur, would likely be caused by a major public disaster such as a big storm, and that in light of this the public will respond with patience and understanding. In fact, many if not most outages are caused by a mistake on the part of someone in the company, such as introducing a bug into an environment or making an error when implementing a change. If the company’s ability to serve its customers goes down because of such a mistake, and the company does not have an adequate recovery plan, the people who are paying for and depending on its services are likely to be anything but understanding.

For these reasons, it is essential for every organization to get serious about IT disaster recovery planning.

The Recovery Plan vs. the Plan Document

One of the first things a company has to understand in order to improve at IT/DR is that the terms “IT/DR recovery plan” and “IT/DR recovery plan documentation” are not synonymous. You can have a plan without having any documentation (just like a family can have an agreed-upon plan for escaping their house in the event of a fire without anything being written down).

If it’s a choice between one or the other, it’s much better to have a plan (an idea of what you would do to achieve a recovery) than the documentation (a written description of the plan). The written plan is secondary, though it has many benefits and may be needed to pass an audit by an agency or customer. Many clients call us up and ask us to help them write an IT/DR recovery plan document, skipping the most important phase: developing the plan.

Let’s look at each of these entities in detail. The plan includes the solutions, communications, preparations, and actions needed to recover the technical environment in the event of an outage, enabling it to resume supporting business functions. The plan considers the overall strategy and high-level order of system, technology, and application recovery. It also encompasses testing and validation and making sure that everyone involved knows what they should do and how to do it.
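One way to work out the high-level order of recovery is to list what each system depends on and let a topological sort produce a sequence in which nothing is started before the things it needs. The following is a minimal sketch in Python, not a prescribed method. The system names and dependencies are hypothetical, chosen only for illustration; a real environment would supply its own.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical example: each system maps to the systems that must be
# recovered before it. These names are assumptions, not a real environment.
DEPENDENCIES = {
    "network": set(),
    "identity": {"network"},
    "database": {"network"},
    "app_server": {"identity", "database"},
    "customer_portal": {"app_server"},
}


def recovery_order(dependencies):
    """Return the systems in an order where each one comes after
    everything it depends on. Raises graphlib.CycleError if the
    dependencies are circular."""
    return list(TopologicalSorter(dependencies).static_order())


order = recovery_order(DEPENDENCIES)
for step, system in enumerate(order, start=1):
    print(f"{step}. {system}")
```

Systems with no dependency on each other (here, identity and database) may come out in either order, which is a useful prompt to decide deliberately which one matters more to the business. A circular dependency raises an error, which is worth discovering during planning rather than during an outage.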

The documentation encompasses, in a typical situation, multiple written documents containing the information necessary to support the recovery. Often, each technology or environment will have an individual technical recovery document. Both the high-level and individual plans should be formal, reviewed documents that are updated regularly (e.g., annually), as well as when significant changes are made to the environment. (Tip: One of the best ways to develop documentation is during exercises, by having someone take notes and make screenshots as others perform the recovery.)

Essential Elements of an IT/DR Plan

Now that we’ve established the difference between the recovery plan and the plan documentation, we can look at the elements that should be present in both. These items need to be worked out as part of the plan and codified in the documentation.

The first type of content to include is institutional information, specifically anything outside the norm from an institutional perspective. This might include a non-standard configuration or settings unique to the business that need to be adjusted before things are turned on. Information about the integration between systems should also be included, along with architecture information, personnel needed, and validation steps (the steps that have to be taken to make sure needed services are truly functioning correctly).

Remember that in crafting the written plan, you don’t need to write down every last detail. Just provide the information that a competent professional would need to do the job, and don’t forget to write your documents in checklist form, keeping items succinct and putting explanatory information in the appendix.
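For teams that keep their recovery documents in a structured form, the elements described above can be captured as a small record and rendered as a checklist. The sketch below is one possible shape, written in Python. The field names, the system name, and every item in the example are hypothetical and should be replaced with whatever a given environment actually requires.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryChecklist:
    """One technical recovery document, kept deliberately succinct.
    All field names here are illustrative assumptions."""
    system: str
    personnel: list = field(default_factory=list)
    nonstandard_settings: list = field(default_factory=list)
    integrations: list = field(default_factory=list)
    validation_steps: list = field(default_factory=list)

    def render(self) -> str:
        """Return the document as a plain-text checklist."""
        lines = [f"Recovery checklist: {self.system}"]
        sections = [
            ("Personnel needed", self.personnel),
            ("Non-standard settings to adjust before startup",
             self.nonstandard_settings),
            ("Integrations to confirm", self.integrations),
            ("Validation steps", self.validation_steps),
        ]
        for title, items in sections:
            if not items:
                continue  # leave empty sections out to keep the list short
            lines.append(f"{title}:")
            lines.extend(f"  [ ] {item}" for item in items)
        return "\n".join(lines)


# Hypothetical example values, for illustration only.
checklist = RecoveryChecklist(
    system="order-db",
    personnel=["Database administrator", "Network engineer"],
    nonstandard_settings=["Raise connection limit before enabling traffic"],
    integrations=["Point-to-point link to payment provider"],
    validation_steps=["Place a test order and confirm it is stored"],
)
text = checklist.render()
print(text)
```

Longer explanations would live in an appendix rather than in the checklist items themselves, in keeping with the advice above.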

Getting Better at IT Disaster Recovery

In the modern IT landscape, the misconception that cloud-based environments eliminate the need for companies to do their own IT disaster recovery planning can be costly. Factors such as hybrid setups, Software as a Service connections, and the vulnerability of cloud-services providers make it essential for responsible organizations to develop comprehensive IT/DR plans.

A first step in getting better at IT disaster recovery is to recognize that there is a difference between the recovery plan and the plan documentation, with the former being the more important. Ultimately, the IT/DR plan and the plan document should address such critical elements as non-standard configurations, integration details, personnel requirements, and validation steps, with the written plan being formulated in a clear, concise checklist style with explanatory information relegated to the appendices.