Dear Reader,
Please accept my apologies for skipping two newsletters. On top of a high workload, my wife and I bought a plot for our new home in St. Gallen. No, it is not the posh one in Switzerland, but the serene one in Austria. It lies in the Gesäuse National Park, a scenic and quiet part of the Eastern Alps. Dozens of mountain hikes will start at our new home. For the last two months, we have been busy designing our house and garden. The big move will be in summer or autumn 2025.
Although I could use our two-year search for a plot as a good example of applying agile principles, I’ll make a hard cut here and move on to the top tech and business story of the last month: the CrowdStrike update disaster. CrowdStrike, its customers and Microsoft behaved like dilettantes.
CrowdStrike didn’t test configurations.
CrowdStrike rolled out an untested update to all computers at once.
Microsoft allows third-party software to run in kernel mode - without an automatic fallback.
Customers like Delta Air Lines run mission-critical systems without (enough) contingency systems.
At least, these companies gave us a master class in how not to do OTA updates.
Enjoy reading,
Burkhard
CrowdStrike: How Not to Do OTA Updates
What happened?
On July 19, more than 8.5 million Windows computers crashed and showed the half-forgotten blue screen of death. Thousands of flights were canceled. Doctors couldn’t access their patients’ files, and surgeries had to be postponed. Banks, retailers and manufacturers were also hit. A few emergency systems were disrupted. The New York Times article What We Know About the Global Microsoft Outage gives a good overview of the events of July 19.
The damages are staggering:
The massive CrowdStrike outage that affected millions of Microsoft devices is predicted to cost U.S. Fortune 500 companies $5.4 billion in total direct financial loss, with an average loss of $44 million per Fortune 500 company, according to new data from cloud monitoring and insurance firm Parametrix.
Mark Haranas (CRN): CrowdStrike-Microsoft Outage To Cost $44M Per Fortune 500 Company
Delta Air Lines was hit much harder by the CrowdStrike outage than other airlines.
The CEO of Delta Air Lines says the massive CrowdStrike outage […] has cost the airline as much as $500 million.
[…] The airline was forced to cancel more than 5,000 flights
[…] [The outage] caused massive disruption to Delta’s crew-tracking system, a mission critical tool used to pair pilots and flight attendants with flights.
[…] To help get back online, the company had to manually reset 40,000 servers.
Jason Breslow (NPR): Delta’s CEO says the CrowdStrike outage cost the airline $500 million in 5 days
Of course, Delta is planning to sue CrowdStrike. The companies’ lawyers are already tasting blood. Legal quarrels tend to reveal many interesting facts.
But while many carriers recovered within a day or two, Delta struggled to restore its operations. The airline canceled about 5,000 flights, about 37 percent of its schedule, over four days, according to FlightAware, a service that monitors air travel. About three in four of the airline’s remaining flights were delayed.
Lauren Hirsch and Niraj Chokshi (The New York Times): CrowdStrike Hits Back in Heated Spat With Delta Over Global Tech Outage
The obvious question is: Why did Delta struggle so much more than its peers to get a mission-critical system back to normal operation? I’ll come back to this later.
System administrators had to manually boot these 8.5 million crashed Windows computers into Safe Mode. Then, they had to delete a single configuration file of CrowdStrike’s cyber-security software Falcon and restart the computer. After these simple steps, the computer worked again (see Akshay Aryan’s LinkedIn post Technical Breakdown: Crowdstrike's Update and the Worldwide BSOD Crisis).
If fixing one computer took 5 minutes, fixing 8.5 million computers would take nearly 30,000 full days! Assuming a miserly hourly rate of $25, we are looking at labor costs of almost $18 million - for deleting a single file!
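If you want to double-check these numbers, here is a tiny back-of-the-envelope calculation in C. The 5-minute fix time and the $25 hourly rate are the assumptions from the paragraph above; everything else follows from them.

```c
/* Back-of-the-envelope cost of manually fixing every crashed machine. */
#include <stdio.h>

int main(void)
{
    const double computers       = 8.5e6;  /* crashed Windows machines       */
    const double minutes_per_fix = 5.0;    /* assumed manual fix time        */
    const double hourly_rate     = 25.0;   /* assumed sysadmin rate in USD   */

    const double hours = computers * minutes_per_fix / 60.0;  /* ~708,000 h        */
    const double days  = hours / 24.0;                        /* ~29,500 full days */
    const double cost  = hours * hourly_rate;                 /* ~17.7 million USD */

    printf("%.0f hours, %.0f full days, %.0f USD in labor\n", hours, days, cost);
    return 0;
}
```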
And all of this happened because CrowdStrike botched the seemingly innocuous update of a configuration file for its cyber-security software Falcon. But hang on! CrowdStrike is not the only culprit. Microsoft and CrowdStrike’s customers like Delta Air Lines are equally responsible for this disaster.
Why did this happen?
In its Preliminary Post Incident Review, CrowdStrike explains the root cause of the outage in great detail. They perform two types of releases: software releases and configuration releases, which they call Sensor Content and Rapid Response Content, respectively. The two types differ in a crucial point.
The [Sensor Content] release process begins with automated testing, both prior to and after merging into our code base. This includes unit testing, integration testing, performance testing and stress testing. This culminates in a staged sensor rollout process that starts with dogfooding internally at CrowdStrike, followed by early adopters. It is then made generally available to customers. Customers then have the option of selecting which parts of their fleet should install the latest sensor release (‘N’), or one version older (‘N-1’) or two versions older (‘N-2’) through Sensor Update Policies.
CrowdStrike, Preliminary Post Incident Review (emphasis mine)
This is exactly how you release software these days: thorough testing and a staged rollout process. Note that CrowdStrike recommends that its customers perform a staged rollout to their computers as well. Customers should never install a new software release on all of their computers at once. I don’t know whether customers are forced to follow a staged rollout. But they certainly should be.
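To make the staged rollout and the N / N-1 / N-2 policies a bit more concrete, here is a minimal sketch of how such a sensor update policy could look from the customer’s side. The enum names, version strings and the sensor_version_for() helper are illustrative assumptions, not CrowdStrike’s actual implementation.

```c
/* A minimal sketch of a sensor update policy: which released sensor
 * version a part of the fleet is allowed to install. */
#include <stdio.h>

/* Released sensor versions, newest (N) first. The numbers are made up. */
static const char *const releases[] = { "7.16" /* N */, "7.15" /* N-1 */, "7.14" /* N-2 */ };

typedef enum { POLICY_N = 0, POLICY_N_MINUS_1 = 1, POLICY_N_MINUS_2 = 2 } update_policy_t;

static const char *sensor_version_for(update_policy_t policy)
{
    return releases[policy];
}

int main(void)
{
    /* A cautious fleet: a small test group runs N, production stays one version behind. */
    printf("test group runs sensor %s\n", sensor_version_for(POLICY_N));
    printf("production runs sensor %s\n", sensor_version_for(POLICY_N_MINUS_1));
    return 0;
}
```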
Rapid Response Content provides visibility and detections on the sensor without requiring sensor code changes. This capability is used by threat detection engineers to gather telemetry, identify indicators of adversary behavior and perform detections and preventions.
[…] Rapid Response Content is delivered as content configuration updates to the Falcon sensor.
CrowdStrike, Preliminary Post Incident Review (emphasis mine)
Rapid Response Content gives CrowdStrike a way to avoid full software releases. Unfortunately, CrowdStrike also avoids thorough testing and a staged rollout process for these releases. They push configuration updates to all computers at once. They treat configurations not as code but as “content”, as dumb data.
However, configurations are parsed and interpreted by the software. The Falcon software uses configurations to detect new threats and send warnings to system administrators. In short, configurations are code. The crash was caused by a mundane out-of-bounds array access.
[From CrowdStrike’s Root Cause Analysis (pdf):] "At the next IPC notification from the operating system, the new IPC Template Instances [- that is, the configurations -] were evaluated, specifying a comparison against the 21st input value. The Content Interpreter expected only 20 values. Therefore, the attempt to access the 21st value produced an out-of-bounds memory read beyond the end of the input data array and resulted in a system crash."
Ravie Lakshmanan (The Hacker News): CrowdStrike Reveals Root Cause of Global System Outages
CrowdStrike’s tests missed the out-of-bounds access, and the Falcon software performs no runtime array bounds checks - although it runs in Windows kernel mode, where it can bring down the whole system, and although it protects mission-critical systems like Delta’s crew-tracking system.
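To show how cheap such a check would have been, here is a minimal sketch of the failure mode described in the root cause analysis: a template asks for the 21st input value while the interpreter only received 20. The function names and the INPUT_COUNT constant are made up for illustration; they are not taken from the Falcon code base.

```c
/* Unchecked vs. bounds-checked access to the interpreter's input values. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define INPUT_COUNT 20   /* the interpreter received 20 input values */

/* Unchecked: index 20 (the 21st value) reads past the end of the array.
 * In user space this is undefined behavior; in kernel mode it crashes the
 * whole machine. */
int read_value_unchecked(const int *inputs, size_t index)
{
    return inputs[index];
}

/* Checked: validate the index against the actual input count and reject
 * the malformed template instead of crashing. */
bool read_value_checked(const int *inputs, size_t count, size_t index, int *out)
{
    if (index >= count)
        return false;
    *out = inputs[index];
    return true;
}

int main(void)
{
    int inputs[INPUT_COUNT] = { 0 };
    int value = 0;

    /* The faulty template asked for index 20; read_value_unchecked(inputs, 20)
     * would be the crash, read_value_checked() simply rejects it. */
    if (!read_value_checked(inputs, INPUT_COUNT, 20, &value))
        printf("template rejected: index 20 is out of bounds\n");
    return 0;
}
```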
This raises quite a few interesting questions. Did CrowdStrike do a proper risk analysis of the damage their software might cause? If yes, did the company’s leaders communicate clearly to the software developers that quality is the prime goal? Did the leaders listen to developers telling them about these risks?
How can we avoid such disasters?
Predictably, CrowdStrike, Microsoft and Delta are blaming each other for the disaster. But none of them is the aggrieved party. The travellers stranded at airports, the patients whose surgeries were postponed, the people who couldn’t withdraw money from their accounts and many other ordinary people bore the real damage. These companies should be deeply ashamed. They did a lousy job. They knew that an outage of this scale was a disaster waiting to happen.
CrowdStrike, Microsoft and their customers are jointly responsible for this disaster, and they should work out together how to avoid such disasters in the future. Let us look at the three main villains and learn some lessons for updating embedded systems over the air.
CrowdStrike
CrowdStrike must treat configurations as code and put them through the same rigorous and thorough testing as any other code. They must double or even triple up on testing, as their software runs in Windows kernel mode and can take down the whole system. The damage can be enormous, especially when mission-critical systems go down.
They must also release configuration updates through the same staged rollout process as software. They should force their customers to use staged rollouts, too. Rollouts to all devices at once shouldn’t be possible at all. Fleet management servers like Mender or Memfault support staged rollouts, but they don’t enforce them. Never allow rollouts to all devices at once - your customers will thank you later.
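Here is a minimal sketch of what an enforced staged rollout could look like: every release, including a configuration update, goes out in waves, and the rollout halts automatically when a wave’s crash rate exceeds a threshold. The wave sizes, the threshold and the deploy_wave() and crash_rate() hooks are assumptions for illustration; fleet management servers offer phased rollouts, but this is not Mender’s or Memfault’s API.

```c
/* A staged rollout in waves that halts itself when devices start crashing. */
#include <stdbool.h>
#include <stdio.h>

static const double wave_percent[] = { 1.0, 10.0, 50.0, 100.0 };
#define WAVE_COUNT (sizeof wave_percent / sizeof wave_percent[0])
#define MAX_CRASH_RATE 0.01   /* halt if more than 1% of a wave crashes */

/* Hypothetical hooks into the fleet management backend. */
static void deploy_wave(double percent) { printf("deploying to %.0f%% of the fleet\n", percent); }
static double crash_rate(void)          { return 0.0; /* crash telemetry of the current wave */ }

static bool staged_rollout(void)
{
    for (size_t i = 0; i < WAVE_COUNT; ++i) {
        deploy_wave(wave_percent[i]);
        if (crash_rate() > MAX_CRASH_RATE) {
            printf("halting rollout after wave %zu\n", i + 1);
            return false;   /* roll back; the release never reaches 100% */
        }
    }
    return true;
}

int main(void)
{
    return staged_rollout() ? 0 : 1;
}
```

The point is the branch that was missing in real life: on July 19, there was no wave smaller than “everyone”, so there was nothing left to halt.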
Microsoft
I was surprised to learn that you must manually boot Windows computers into Safe Mode when the Windows kernel crashes repeatedly. I was surprised because I implemented automatic booting into an alternate Linux system earlier this year.
My embedded device uses an A/B update strategy. Hence, it always has two complete copies of the system: one copy contains the current system, and the other copy contains an older system. One copy could also be a small recovery system like Windows’ Safe Mode.
With boot counting enabled, the bootloader counts how many times the kernel or the init process fails to start. When the boot limit - the number of tolerated crashes - is reached, the device automatically boots into the alternate system copy or into the recovery system. If, after an update, the device boots up properly and passes some self tests, it can disable boot counting until the next update.
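Here is a minimal sketch of such boot counting, loosely modelled on what bootloaders like U-Boot offer. The names and the in-memory variables are simplifications for illustration; a real bootloader keeps the counter and the active slot in storage that survives a crash, such as an EEPROM or a raw flash partition.

```c
/* Boot counting with automatic fallback to the alternate A/B slot. */
#include <stdio.h>

#define BOOT_LIMIT 3   /* tolerated failed boots before falling back */

/* Persistent bootloader state; in a real device this lives outside the rootfs. */
static unsigned boot_count;
static char active_slot = 'A';

/* Called by the bootloader on every boot attempt. */
static char select_boot_slot(void)
{
    ++boot_count;   /* only a successful boot resets this counter */
    if (boot_count > BOOT_LIMIT) {
        active_slot = (active_slot == 'A') ? 'B' : 'A';   /* fall back to the other copy */
        boot_count = 0;
        printf("boot limit reached, falling back to slot %c\n", active_slot);
    }
    return active_slot;
}

/* Called by the booted system once it has passed its self tests. */
static void mark_boot_successful(void)
{
    boot_count = 0;
}

int main(void)
{
    for (int i = 0; i < 5; ++i)   /* simulate a kernel that keeps crashing */
        printf("booting slot %c\n", select_boot_slot());
    mark_boot_successful();       /* the fallback slot came up fine */
    return 0;
}
```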
When the device boots automatically into the alternate full system copy, users can work normally with the slightly older system. Once there is a new update available with the kernel crash fixed, the device tries another update. This is all fully automatic. No human intervention needed.
When the device boots automatically into the recovery system, it could wait until a fix becomes available as an update, apply the fix (e.g., deleting the problematic configuration file) and reboot into the full system. Then, users can continue their work. This is also automatic, but users can’t work until the fix becomes available. The outage is longer than with the A/B strategy. Of course, the recovery system allows system administrators to fix the problem manually. But that should be a last resort.
I am sure that Microsoft could implement such a recovery strategy on Wintel computers. It is not rocket science. I don’t understand why they didn’t do this long ago.
Microsoft could also learn from Apple, which doesn’t let cyber-security vendors run their software in kernel mode.
Apple MacOS was not affected by Friday’s crash, as it runs Apple Endpoint Security Framework, an API that anti-virus providers use to obtain telemetry information from the core MacOS operating system. This means that they do not need to have their code running within the core MacOS at Ring Zero, which is where the Windows version of CrowdStrike’s Falcon needed to run.
Cliff Saran (ComputerWeekly.com): Why is CrowdStrike allowed to run in the Windows kernel?
Sadly, Microsoft prefers to blame the European Commission (EC) - everyone’s favourite scapegoat. The EC allegedly forces Microsoft to give third-party anti-virus vendors the same kernel-level access as Microsoft’s own security software, because otherwise Microsoft would have an unfair competitive advantage over those vendors. What a lame excuse!
By the way, Saran wrongly claims “that Linux servers experienced a similar issue in April with CrowdStrike”. TheRegister’s Simon Sharwood initially made the same mistake but corrected it at the end of his article. That kernel crash was caused by a bug in the Berkeley Packet Filter (BPF) running in kernel space and used by the CrowdStrike software running in user space.
Delta Air Lines
“We have no choice,” [Ed Bastian, Delta’s CEO] said about potential action against CrowdStrike. “We’re not looking to wipe them out, but we’re looking to make certain that we get compensated however they decide to for what they cost us. Half a billion dollars in five days.”
Jason Breslow (NPR): Delta’s CEO says the CrowdStrike outage cost the airline $500 million in 5 days
Delta certainly sees itself as the victim. But is it? I don’t think so. Microsoft doesn’t think so either and wrote the following in a letter to Delta.
“Our preliminary review suggests that Delta, unlike its competitors, apparently has not modernized its I.T. infrastructure, either for the benefit of its customers or for its pilots and flight attendants,” Microsoft said in the letter.
Niraj Chokshi (The New York Times): Microsoft Says Delta Was Largely Responsible for Flight Cancellations
The article also reveals that IBM and other service companies are responsible for Delta’s crew-tracking system that went down so spectacularly. Delta might have to sue a couple more companies. Or it could man up and accept its own responsibility for the disaster.
Given the more than 5,000 flight cancellations, the crew-tracking system is clearly mission-critical to Delta. Why didn’t Delta have enough contingency systems? Why didn’t it implement a staged rollout?
CrowdStrike’s customers should have enough spares for their mission-critical systems. If some of the active systems fail, the spares automatically take over with minimal downtime. When updating their systems, customers update a subset of the spares first and switch over to them. If those crash, they switch back to the original systems. Otherwise, all is good and they can update the original systems as well. Then, they repeat the process with a larger number of spares, and so on, until all spare and original systems are updated.
This is nothing but the A/B update strategy I suggested for Microsoft above, coupled with a staged rollout. Delta could also run CrowdStrike’s software on one half of its systems and another vendor’s cyber-security software on the other half.
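Here is a minimal sketch of that spare-based update flow. The pool names and the update_pool(), health_check() and switch_traffic_to() helpers are assumptions for illustration, not Delta’s or anyone else’s actual setup.

```c
/* Update the spares first, switch over only if they are healthy. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { POOL_ACTIVE, POOL_SPARE } pool_t;

static const char *pool_name(pool_t p)  { return p == POOL_ACTIVE ? "active" : "spare"; }
static void update_pool(pool_t p)       { printf("updating %s pool\n", pool_name(p)); }
static bool health_check(pool_t p)      { (void)p; return true; /* real checks would run here */ }
static void switch_traffic_to(pool_t p) { printf("traffic now served by %s pool\n", pool_name(p)); }

int main(void)
{
    update_pool(POOL_SPARE);          /* never touch the active pool first */
    if (!health_check(POOL_SPARE)) {
        printf("update rejected, active pool keeps serving\n");
        return 1;
    }
    switch_traffic_to(POOL_SPARE);    /* the spares become the new active set */
    update_pool(POOL_ACTIVE);         /* now the old active set can be updated */
    if (health_check(POOL_ACTIVE))
        printf("fleet fully updated without downtime\n");
    return 0;
}
```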
There are numerous approaches to achieve more resilience against computer outages. And the best thing is: all these approaches are well known. The other airlines, which were hit much less severely than Delta, are proof. They did their homework; Delta did not. And make no mistake: Delta’s leadership is responsible for ensuring that its own IT department and its IT service providers implement these approaches.