Let’s Get Functional: Learning from the CrowdStrike Outage

Written by Richard Long | Aug 5, 2024 7:57:00 AM

The global CrowdStrike outage last month is a reminder that disruptions can happen anytime, anywhere. It is also a golden opportunity for business continuity professionals to refocus their attention on the most important thing in BC: ensuring that the enterprise is functionally recoverable.

Seeing the Blue Screen of Death

CrowdStrike is one of the world’s leading cybersecurity firms, the trusted choice of banks, airlines, health care providers, government agencies, and other organizations around the world. On July 19, the company issued a flawed software update that caused an estimated 8.5 million Windows systems to crash and display the blue screen of death. (CrowdStrike wasn’t the first organization to be tripped up by human error, and it won’t be the last.)

The error forced the machines into a boot loop, making them unusable, and resulted in canceled flights, interrupted broadcasts, nonfunctioning checkout terminals, and other problems in Europe, the U.S., and elsewhere. The issue was mostly resolved over the following days, but current estimates of the damage done are around $10 billion.

Widespread relief that the outage wasn’t caused by a cyberattack was mixed with a feeling of irony that it was brought about by a vendor companies had hired to keep their systems safe.

Remembering What Matters

Often when people narrowly avoid a serious accident, they reassess their priorities and focus their attention on the things that really matter. From the BC practitioner’s point of view, the CrowdStrike outage could potentially do some good if it leads us to refocus our attention on what really matters in business continuity.

It can be summed up in two words: functional recovery. Can the organization’s most time-sensitive business processes and systems be quickly restored in the event of an outage? Do you know what those processes and systems and their dependencies are? That is what really counts in BC. Everything else is window dressing.

The Components of Functional Recovery

The importance of the message given above emerged from my conversations with numerous MHA Consulting clients in the days following the outage. Those conversations and the reflections they sparked surfaced the following specific points:

Many organizations are completely unprepared for an event. Many others have programs that are overly complicated, with too much effort going into aspects of BC that are secondary (e.g., documentation). At the same time, the thing that really does matter—verifiable functional recoverability—is neglected. Few companies hit the Goldilocks spot of being just right.

Activity does not equal preparedness. Just because an organization spends a lot of time and money on BC does not mean it is functionally recoverable.

Objectivity is in short supply. Many organizations flatter themselves that their programs are better than they are.

Preparing for specific events is not worthwhile. Who could have predicted that the vaunted CrowdStrike would be the source of a worldwide system outage? It’s better to prepare for different types of impact, such as loss of facilities, systems, or human resources.

For functional recovery, the important thing is to know what your four or five key business processes are, which servers support them, and the order in which you need to bring the servers back up (i.e., the order of operations). You also need to have quick access to your key contact information.
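As a rough illustration of what an order of operations can look like when written down, here is a minimal Python sketch. The server names and the dependency map are hypothetical, invented for this example, and it assumes you have already recorded which servers must be up before each other server can start. It uses the standard library's graphlib module (Python 3.9 or later) to derive a valid restart sequence and to group servers into waves that could be restored in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map (illustrative names only): each server maps to
# the set of servers that must already be running before it can be restored.
DEPENDS_ON = {
    "dns-01": set(),
    "auth-01": {"dns-01"},
    "db-orders-01": {"dns-01"},
    "app-orders-01": {"auth-01", "db-orders-01"},
    "web-checkout-01": {"app-orders-01"},
}


def recovery_order(depends_on):
    """Return one valid restart sequence: every server follows its prerequisites."""
    return list(TopologicalSorter(depends_on).static_order())


def recovery_waves(depends_on):
    """Group servers into waves; servers in the same wave can be restored in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the map contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose prerequisites are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(recovery_waves(DEPENDS_ON), start=1):
        print(f"Wave {number}: {', '.join(wave)}")
```

A list like this is only useful if it has been checked against a real recovery exercise. The sketch shows the shape of the information, not a substitute for practicing the restore.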

Another essential for functional recovery is technology mapping. You need to know which applications support your critical processes and which servers and technologies support those applications. Then you can focus on recovering those servers (and only those servers) in the immediate wake of an outage.

The same goes for databases and their servers. It’s common for multiple databases to run on one server. There are some databases you’ll need right away; you need to know which those are so you can restore their servers immediately. The others can wait until you have more time.
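The mapping described in the last two points can be kept in a very simple form. The following Python sketch is one possible way to do it, under stated assumptions: every process, application, database, and server name is hypothetical, and the four lookup tables stand in for whatever inventory your organization actually maintains. Given a list of critical processes, it returns the servers to restore first and the ones that can wait.

```python
# All names below are hypothetical, for illustration only.
# Which applications support each business process.
PROCESS_APPS = {
    "order fulfillment": ["order-entry", "warehouse-picking"],
    "customer support": ["ticketing"],
}

# Which servers host each application.
APP_SERVERS = {
    "order-entry": ["app-01"],
    "warehouse-picking": ["app-02"],
    "ticketing": ["app-03"],
    "intranet-wiki": ["app-04"],
}

# Which databases each application needs.
APP_DATABASES = {
    "order-entry": ["orders_db"],
    "warehouse-picking": ["inventory_db"],
    "ticketing": ["tickets_db"],
    "intranet-wiki": ["wiki_db"],
}

# Which server each database runs on (several databases may share one server).
DATABASE_SERVER = {
    "orders_db": "dbsrv-01",
    "inventory_db": "dbsrv-01",
    "tickets_db": "dbsrv-02",
    "wiki_db": "dbsrv-03",
}


def servers_for(processes):
    """Return the set of application and database servers the given processes need."""
    servers = set()
    for process in processes:
        for app in PROCESS_APPS[process]:
            servers.update(APP_SERVERS[app])
            for database in APP_DATABASES.get(app, []):
                servers.add(DATABASE_SERVER[database])
    return servers


def all_servers():
    """Return every server known to the inventory."""
    servers = set(DATABASE_SERVER.values())
    for hosts in APP_SERVERS.values():
        servers.update(hosts)
    return servers


if __name__ == "__main__":
    restore_first = servers_for(["order fulfillment"])
    restore_later = all_servers() - restore_first
    print("Restore first:", sorted(restore_first))
    print("Can wait:", sorted(restore_later))
```

In this made-up inventory, two critical databases share one database server, so restoring that single server covers both, while the servers behind the ticketing system and the wiki can wait.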

BIAs, risk assessments, and other forms of documentation are great, but if that’s all you have—if you haven’t successfully practiced recovering your key business processes—then you’re not ready.

Similarly, tabletop exercises have their place, but if you haven’t verified that you can functionally recover, then you are missing the most important piece of the puzzle.

It’s great if you have a list of vendors and customers, but can you actually put your hands on the list in a hurry? It’s wonderful to have a list of open orders, but can you fill those orders during a system outage? These are the sorts of capabilities that make up functional recovery. These are the things that really count.

We see a lot of companies make optimistic assumptions about what they will be able to do in an event—and never put those assumptions to the test with realistic exercises. Then an outage comes along and they run into one unexpected issue after another, delaying recovery by hours.

Functional recovery is more important than smooth crisis management. Having a crisis team identified, getting everyone on a Zoom call, assessing the situation, and determining the impact are important. But quick functional recovery is even more valuable.

Many companies express a greater interest in such bells and whistles as automatic updates and nice dashboards than they do in the nuts and bolts of functional recovery. This is putting the cart before the horse.

Critical lists and information (such as the order of operations) should be kept in multiple forms of highly available storage, including off-site storage or even printed copies.

The bottom line is simple and straightforward. The most important aspect of any organization’s BC program is for it to develop functional recoverability and be able to verify it through exercises. The things that go into that should be the BC office’s primary focus. Everything else is nice but secondary.

A Dramatic Reminder

The CrowdStrike outage serves as a dramatic reminder that even the most trusted systems can fail, emphasizing the need for robust business continuity planning. It’s crucial for organizations to prioritize functional recovery, ensuring that critical processes can be quickly restored during disruptions.

While comprehensive documentation and crisis management are important, they must not overshadow the necessity of practical, verifiable recovery capabilities. By focusing on what truly matters, businesses can better navigate unexpected challenges, minimizing downtime and limiting impacts.
