The CrowdStrike Outage: Key Lessons for Resilient IT Operations

The Bocada Team | August 20, 2024

In a matter of hours on Friday, July 19, 2024, hundreds of the world’s largest corporations came screeching to a halt. People around the world began to wonder: “Is this ransomware again? Or (heaven forbid) a nuclear strike?”

As it turns out, all it took to cripple much of the global IT infrastructure was a routine software update from a market-leading cybersecurity software company.

IT operators around the world were forced to scramble over the next few days to manually patch affected machines, leaving widespread outages in the meantime. All told, the economic damage will likely surpass $10 billion globally, with much of it uninsured because this wasn’t a malicious act or cybercrime.

In this piece, we’ll examine some of the critical lessons to be learned for technology providers and for the IT operations teams tasked with keeping systems operational.

 

First, What Happened

CrowdStrike, a leading cybersecurity software provider, pushed a faulty software update (see root cause analysis) for its Falcon product that caused an estimated 8.5 million Windows computers to crash with blue screen of death (BSOD) errors and become inoperable.

While CrowdStrike quickly provided a manual fix for the update, the sheer volume of affected computers meant widespread system outages that lasted for several days (and weeks in some cases) in countries around the world.

CrowdStrike’s stock price fell sharply in the aftermath, and the company is facing customer demands for compensation and likely future lawsuits. The damage to its affected enterprise customers has also been considerable.

Delta, for example, experienced more than 7,000 flight cancellations and is now facing 176,000 reimbursement/refund requests and total estimated losses of between $350 million and $500 million.

Given the extent of the economic loss this outage has caused both CrowdStrike and its customers, it is important to learn from the incident so that the impact of future occurrences can be mitigated.

 

Lessons for IT Teams

That a single vendor’s buggy software update was capable of so much disruption serves as a powerful reminder about IT resilience.

Here are a few relevant considerations and tips:

  • It’s not just cyberattacks that can compromise your systems. With so much organizational focus (and budget) dedicated to preventing ransomware and other cyber threats, this major incident is a reminder that cyberattacks aren’t the only risk vector threatening an organization’s business continuity and data resilience.
  • Manage third-party software vulnerabilities. Is your third-party software set to “auto-update”? Has your third-party software been vetted (and is there a plan to continuously monitor it for vulnerabilities)? Is your organization using unsupported or EOL/EOSL (end of life / end of service life) software? (Learn why using EOL/EOSL software is a bad idea.) Prioritize and remediate potential vulnerabilities in your third-party software stack.
  • Avoid vendor lock-in. With many IT organizations moving to consolidate onto one or a few providers for key functions such as cybersecurity, backup & recovery, cloud computing, and IT monitoring, the CrowdStrike outage might give pause for reconsideration. By diversifying across multiple products/platforms, IT organizations can diversify their risk and future-proof against product-specific failures.
  • Ensure backup & recovery excellence. With the broad array of risk vectors threatening modern organizations, backup & recovery excellence becomes even more important to data resilience. Automate your backup monitoring and backup failure remediation with a proven backup monitoring tool such as Bocada to ensure compliance with your organization’s RPO/RTO (recovery point objective / recovery time objective) mandates.
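
To make the RPO point above concrete, here is a minimal sketch of the kind of check a backup monitoring process might automate: flag every system whose most recent successful backup is older than the recovery point objective. The function name, the host names, and the 24-hour RPO are hypothetical values chosen for illustration; they are not taken from Bocada or any other product.

```python
from datetime import datetime, timedelta, timezone


def rpo_violations(last_success, rpo, now):
    """Return (sorted) the systems whose newest successful backup is
    missing or older than the RPO."""
    return sorted(
        name
        for name, finished in last_success.items()
        if finished is None or now - finished > rpo
    )


# Hypothetical inventory: system name -> time of last successful backup.
now = datetime(2024, 7, 19, 12, 0, tzinfo=timezone.utc)
last_success = {
    "db-01": now - timedelta(hours=2),      # within the RPO
    "files-01": now - timedelta(hours=30),  # stale
    "mail-01": None,                        # never backed up successfully
}

violations = rpo_violations(last_success, timedelta(hours=24), now)
print(violations)  # ['files-01', 'mail-01']
```

A system with no successful backup at all is treated as a violation, since it cannot meet any recovery point objective.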

Lessons for Software Providers

Like other providers in the cybersecurity space, CrowdStrike is responsible for addressing new vulnerabilities and threats as they emerge. Speed is of the essence in their line of work. It is likely that this need to be highly responsive to new risks/vulnerabilities has created tradeoffs or shortcuts in the processes CrowdStrike uses to deploy its software updates.

Here are a few things software providers (not just in the security space) must consider:

  • Balance speed & QA. As powerful as the motivation to publish a software update on time can be, it’s important to ensure proper testing and QA, even at the risk of missing a deadline. For example, deploying an update on an internal staging environment before rolling it out to production environments can ensure an update is working as intended before it’s deployed to customers.
  • Implement rolling updates. In many organizations, software updates are provided to customers in batches (on a rolling basis). When combined with active monitoring and user feedback loops, this gives providers the ability to halt and roll back a faulty update before the damage becomes widespread.
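
The rolling-update idea above can be sketched in a few lines. This is an illustrative outline only, not a description of how CrowdStrike or any other vendor deploys updates: the deploy, is_healthy, and rollback callbacks, the batch fractions, and the 1% failure threshold are all assumptions made for the example.

```python
def staged_rollout(hosts, batch_fractions, deploy, is_healthy, rollback,
                   max_failure_rate=0.01):
    """Deploy to growing batches of hosts; halt and roll back if a batch
    exceeds the allowed failure rate."""
    remaining = list(hosts)
    total = len(remaining)
    updated = []
    for fraction in batch_fractions:
        # Each fraction is the cumulative share of hosts that should be
        # updated once this batch has finished.
        target = min(total, max(1, round(total * fraction)))
        count = max(0, target - len(updated))
        batch, remaining = remaining[:count], remaining[count:]
        for host in batch:
            deploy(host)
        updated.extend(batch)
        failures = [h for h in batch if not is_healthy(h)]
        if batch and len(failures) / len(batch) > max_failure_rate:
            for host in updated:
                rollback(host)
            return {"status": "halted", "updated": updated, "failed": failures}
    return {"status": "complete", "updated": updated, "failed": []}


# Hypothetical fleet of 100 hosts and a simulated update that breaks every host.
hosts = [f"host-{i:03d}" for i in range(100)]
fractions = (0.01, 0.10, 0.50, 1.0)
deployed, rolled_back = [], []

result = staged_rollout(
    hosts, fractions,
    deploy=deployed.append,
    is_healthy=lambda host: False,
    rollback=rolled_back.append,
)
print(result["status"], len(deployed))  # halted 1
```

In this simulated run the faulty update reaches only one host out of 100 before the rollout stops, which is the point of starting with a small first batch.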

 

It’s All About Resilience

As organizations and IT teams navigate an increasingly complex and globalized IT landscape, resilience must be at the forefront of their strategies.

For IT teams, this means taking a holistic view of risks, focusing not just on cyber threats but also on operational vulnerabilities. Diversifying vendors, rigorously managing software updates (and migrating off EOL/EOSL software), and ensuring robust backup and recovery operations are critical to mitigating such outages in the future.

Ultimately, resilience is about preparation and adaptability. By learning from the CrowdStrike outage, both IT teams and software providers can build stronger, more resilient operations that withstand the unexpected, safeguarding business continuity and protecting against potentially devastating economic losses.