Many organizations lack a clear, shared understanding of when the metaphorical switch is flipped to start the recovery time objective (RTO) countdown timer. There are two options, either of which can work provided the organization takes a few key considerations into account.
A common source of confusion at many organizations is when the countdown for the organization’s RTOs begins.
An RTO is a time window within which, in the event of an outage, a critical business process or application needs to be returned to a fully productive state in order to prevent an unacceptable level of harm to the organization (as previously determined by a business impact analysis).
Note that the problem under discussion is mostly an issue with highly critical processes that have very short RTOs, such as four hours or less. (This discussion also pertains to outages resulting from major events, not day-to-day availability issues.)
Typically, some people will assume the countdown for the RTO begins at the time of the outage. Others will operate on the understanding it doesn’t begin until a disaster or recovery event is declared.
One consequence of this confusion is worry and frustration among people who incorrectly think the organization is at risk of missing or has missed an RTO.
A more fundamental problem arises when a lack of clarity about the company's preferred approach leads to RTOs that don't allow for the decision-making time senior management needs. (More on this below.)
As stated previously, this discussion mostly pertains to highly critical processes and apps with short RTOs.
However, within that group there is a subset of processes and apps, usually very small, that are so critical they can never be down (or for which missing the RTO by even a few minutes would cause significant harm). These stand apart from the current discussion because they should already be architected for high availability.
This post is about functions that have fairly short RTOs but do not require immediate recovery.
Most organizations choose to have their RTO countdown begin at the time a recovery or disaster is formally declared. Such a declaration can be made within minutes or take over an hour.
This can be considered the standard approach.
A lengthy delay might occur before the recovery is declared because crisis teams and management need time to investigate the outage and decide whether it's worthwhile to perform a recovery, which is a demanding and expensive undertaking.
The other possible approach is to have the RTO countdown start automatically at the time of the outage.
Organizations that use this method will still need time to analyze the outage and decide whether to mount a full recovery.
However, with this approach, the time consumed by investigation and decision-making eats up part of the RTO window, leaving that much less time to recover the app or process.
This is a less common approach but some organizations might have good reasons for doing it this way.
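The difference between the two approaches comes down to simple arithmetic, which the following minimal Python sketch illustrates. The function names, the timestamps, and the four-hour RTO are all made up for illustration; they are not drawn from any particular organization or tool.

```python
from datetime import datetime, timedelta


def recovery_deadline(outage_at, declared_at, rto, start_at_declaration=True):
    # Hypothetical helper: the clock starts at the declaration (standard
    # approach) or at the outage itself (alternative approach).
    clock_start = declared_at if start_at_declaration else outage_at
    return clock_start + rto


def working_time_left(outage_at, declared_at, rto, start_at_declaration=True):
    # Time remaining for the actual recovery work once the decision is made.
    deadline = recovery_deadline(outage_at, declared_at, rto, start_at_declaration)
    return deadline - declared_at


# Made-up scenario: outage at 09:00, recovery declared at 10:15, four-hour RTO.
outage = datetime(2024, 1, 15, 9, 0)
declared = datetime(2024, 1, 15, 10, 15)
rto = timedelta(hours=4)

for label, at_declaration in [("declaration", True), ("outage", False)]:
    deadline = recovery_deadline(outage, declared, rto, at_declaration)
    left = working_time_left(outage, declared, rto, at_declaration)
    print(f"Clock starts at {label}: deadline {deadline:%H:%M}, working time {left}")
```

In this invented scenario, starting the clock at the declaration leaves the full four hours for recovery work, while starting it at the outage leaves only two hours and 45 minutes, because the 75 minutes spent investigating and deciding come out of the RTO window.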
Both methods can work provided the organization takes the following points into account:
By taking these items into account, an organization can achieve success no matter when it decides to start the metaphorical countdown timer for its RTOs.
At many organizations, confusion reigns regarding when, in the event of an outage, the RTO countdown timer begins. This confusion can have consequences ranging from unnecessary turmoil to unrealistic and badly missed RTOs.
There are two possible approaches to deciding when to initiate the RTO countdown. The standard approach is for the timer to start when a recovery is formally declared or approved. The alternative, in which the countdown starts automatically at the time of the outage, might work for some organizations.
Either approach can work provided a handful of key considerations are taken into account. These include referencing the chosen approach when setting the individual RTOs and making sure it is clearly communicated throughout the organization.