Here's What Really Caused 8.5 Million Computers to Crash
How one security product crippled the world because of bad programming
On July 19th, 2024, the largest cyber event in history hit the world. It affected 8.5 million Windows machines used for finance, healthcare, and other sectors.
The event was caused by the cybersecurity software, CrowdStrike Falcon. I mean, you couldn't make this up.
The software designed to protect against attacks was what caused the problem.
Although 8.5 million computers is less than 1% of Windows machines worldwide, the last major cyber event, in 2017, affected about 300,000 computers. So this is a massive step up: roughly 28 times as many machines.
But what exactly happened?
What is CrowdStrike Falcon?
You've most likely heard of antivirus software like Norton or McAfee.
Falcon is like that, but it focuses on protecting a large network instead of a single computer. This is known as endpoint security.
It's cloud-based, but each machine in the network (endpoint) installs a small piece of software called a Falcon Sensor.
Once installed, the sensor constantly monitors and sends information to Falcon's servers.
These servers analyze all data from sensors using machine learning and threat intelligence. Basically, CrowdStrike uses its vast knowledge of cyber threats and hackers to check if a machine is infected.
To add to that, CrowdStrike has a team of researchers checking for new threats and adding their findings back to Falcon's database.
The sensors themselves have a 'detection engine' that also uses machine learning. This analyzes files and system processes.
Security engineers using Falcon have access to a web interface. This shows all monitored endpoints and alerts them if any threats are detected.
Sounds impressive.
If I owned a large organization with hundreds or thousands of machines to protect, I'd definitely buy Falcon.
But what's interesting about the sensors is that they operate at the kernel level. This is something few other programs do.
The sensors have complete access to a machine, meaning they can check all system activities, including hardware info such as memory and disk usage.
Sidenote: The Kernel
The kernel is a program in the operating system. It sits between hardware and software, managing their communication.
If your browser needs more memory, it doesn't need to know the type or amount of memory available; it just asks the kernel.
The kernel also stops software from accessing core system functions. These include CPU control, system configuration, and power management.
Third-party software installed at the kernel level, such as a driver, runs with the same privileges as the kernel itself, so it can access core system functions.
Such software can also be installed and updated from the internet in the background, without the user knowing.
If the kernel encounters a critical error that it doesn't know how to handle, known as a kernel panic (Windows calls it a bug check or stop error), the whole system halts. On Windows, this shows up as the blue screen.
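To see why a failure at the kernel level is so much worse than an ordinary program crash, here is a minimal Python sketch. It only simulates a crash in a separate, ordinary (user-level) process; nothing here touches the kernel, and the function name is made up for illustration.

```python
import subprocess
import sys


def run_crashing_child() -> int:
    """Start a separate Python process that fails, and return its exit code."""
    child = subprocess.run(
        [sys.executable, "-c", "raise RuntimeError('simulated crash')"],
        capture_output=True,  # keep the child's error output off our screen
    )
    return child.returncode


exit_code = run_crashing_child()
# This line is reached only because the parent process kept running.
survived = True
print(f"child exited with code {exit_code}; parent still running: {survived}")
```

The parent keeps running because the operating system isolates ordinary programs from each other. Code running at the kernel level gets no such isolation, so one bad memory access there can take down the whole machine.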
As well as being able to receive data from sensors, Falcon servers can also send data to sensors (such as updates).
In this case, a malformed configuration update was sent to all Windows sensors, which caused those machines to stop working.
Because the issue was at the kernel level, it prevented Windows from starting correctly.
Who would have thought that such a small change would cause such a big problem?
The Configuration File
As mentioned before, the cause of the mass outage was a configuration update sent to all Falcon sensors for Windows.
Configuration updates are placed inside ‘channel files.’ In this case, the affected file was channel file 291, whose name starts with C-00000291 and ends in .sys. These files are located in CrowdStrike's directory inside the Windows system drivers folder (C:\Windows\System32\drivers\CrowdStrike).
This specific update changed how Falcon analyzes 'named pipes' in Windows, which are a way for different processes, or programs, to communicate with each other, even if they are on completely different machines.
Sidenote: Named Pipes
Imagine you have two small programs, one for adding two numbers (add) and one for doubling a number (double).
If you wanted to add a number and then double the result of that addition, you could do it like this:
add 1 2 | double # result 6 (3 x 2)
The | character is called a pipe, and it's used to connect the output of one program to the input of another.
This connection is temporary, so it closes when the program finishes. But if you want a more permanent connection, you could use a named pipe.
This is a special file that acts as a communication channel between two programs. So using a named pipe for the previous example would look like this:
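mkfifo myNamedPipe # create the named pipe (Unix-style shell)
add 1 2 > myNamedPipe & # run in the background; the write waits until a reader connects
double < myNamedPipe # result 6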
This is a very simple example, but the same concept applies to most of the programs on your computer.
For example, a browser extension wanting to communicate with a local password manager could use a named pipe.
CrowdStrike's team discovered that hackers could abuse named pipes as an attack technique, so the configuration update was sent to detect this.
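If you want to try the add and double example for real, here is a minimal Python sketch. It assumes a Unix-like system (os.mkfifo is not available on Windows, where named pipes use a different API), and the add and double functions simply stand in for the two imaginary programs above.

```python
import os
import tempfile
import threading


def add(a: int, b: int, pipe_path: str) -> None:
    """Writer: computes a + b and sends the result through the named pipe."""
    # Opening a named pipe for writing waits until a reader opens the other end.
    with open(pipe_path, "w") as pipe:
        pipe.write(f"{a + b}\n")


def double(pipe_path: str) -> int:
    """Reader: takes one number from the named pipe and doubles it."""
    with open(pipe_path) as pipe:
        return int(pipe.readline()) * 2


def run_demo() -> int:
    with tempfile.TemporaryDirectory() as tmp:
        pipe_path = os.path.join(tmp, "myNamedPipe")
        os.mkfifo(pipe_path)  # create the named pipe as a special file
        writer = threading.Thread(target=add, args=(1, 2, pipe_path))
        writer.start()  # the writer runs alongside the reader
        result = double(pipe_path)
        writer.join()
        return result


if __name__ == "__main__":
    print(run_demo())
```

The writer runs in its own thread because each side of a named pipe waits for the other to connect. Running both steps one after the other in a single thread would hang forever.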
What's interesting is that this named-pipe detection feature was first released at the end of February 2024 and was stress-tested at the beginning of March.
That means it was present in customers' sensors but was not used until after it had been stress-tested.
It's a confusing way to release something, but it wasn't breaking Windows when it was first released.
The July 19th update added new detection rules on top of the feature that was tested in March. The main feature was already stress-tested, and earlier updates of the same kind had gone out without problems, so the team saw no need to heavily test this specific change.
Although light testing (automated validation) was conducted, a bug in that testing system prevented the problem in the update from being caught.
It was only after the release that they noticed channel file 291 was causing the sensor to read memory outside the range it was allowed to access (an out-of-bounds read).
The kernel has no way to recover from that kind of error, so Windows crashed.
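To make that kind of bad memory access more concrete, here is a small Python sketch. It is a hypothetical illustration, not CrowdStrike's code: the rule contents and both function names are made up. Python catches the bad read and raises an error, whereas low-level kernel code has no such safety net.

```python
def read_field(fields: list[str], index: int) -> str:
    """Unchecked read: assumes the data always has enough fields."""
    return fields[index]


def read_field_checked(fields: list[str], index: int, default: str = "") -> str:
    """Checked read: validates the index before touching the data."""
    if 0 <= index < len(fields):
        return fields[index]
    return default


# A made-up rule that supplies 3 values while the reader asks for a 4th.
rule = ["pipe_name", "process", "action"]

try:
    read_field(rule, 3)
    outcome = "read succeeded"
except IndexError:
    outcome = "out-of-bounds read detected"

print(outcome)
print(repr(read_field_checked(rule, 3)))
```

The checked version costs one extra comparison per read, which is the sort of guard that lets a program reject bad input instead of crashing on it.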
Some speculated that the crash was caused by an update file full of null bytes, meaning the file contained nothing but zeros.
But CrowdStrike has responded, saying that's just how Windows works.
When a program creates a new file, Windows first fills the new file with null bytes (a bunch of zeros) before adding data to that file.
This means the crash could have happened while a new file was still being created.
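If you are curious how someone would check whether a file really is "full of zeros," here is a minimal Python sketch. The file names are placeholders created just for the demo, and the function name is my own.

```python
import os
import tempfile


def is_all_null_bytes(path: str, chunk_size: int = 64 * 1024) -> bool:
    """Return True if the file is non-empty and contains only zero bytes."""
    saw_data = False
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            saw_data = True
            if any(chunk):  # any non-zero byte means real data is present
                return False
    return saw_data


def demo() -> tuple[bool, bool]:
    """Check one zero-filled file and one file that contains real data."""
    with tempfile.TemporaryDirectory() as tmp:
        zeros = os.path.join(tmp, "zeros.bin")
        data = os.path.join(tmp, "data.bin")
        with open(zeros, "wb") as f:
            f.write(b"\x00" * 4096)
        with open(data, "wb") as f:
            f.write(b"\x00" * 100 + b"config")
        return is_all_null_bytes(zeros), is_all_null_bytes(data)


if __name__ == "__main__":
    print(demo())
```

Reading in chunks keeps memory use small, so the same check works on large files too.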
What Happens Now?
Thankfully, it took CrowdStrike only 78 minutes to release a fix.
This means affected machines should automatically download and install the update after a good old restart.
But if the machine crashes again after being restarted, the fix would need to be applied the hard way.
Someone would need to download the fix onto a USB drive from a working computer,
then plug that drive into the broken machine before turning it on. When the machine does turn on, a menu should appear, allowing the fix to be selected.
CrowdStrike has outlined steps they will take to prevent this from happening again, which include:
improved developer testing
improved stress testing
an improved light testing stack
and improved error handling so that future errors do not cause crashes
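The last item on that list, error handling that fails safely, can be sketched roughly as follows. This is a hypothetical Python illustration, not CrowdStrike's code: the JSON format, the field names, and the functions parse_update and apply_update are all assumptions chosen to keep the example readable.

```python
import json

# Assumed "last known good" configuration that is already running.
LAST_KNOWN_GOOD = {"version": 1, "rules": ["watch_named_pipes"]}
REQUIRED_KEYS = {"version", "rules"}


def parse_update(raw: bytes) -> dict:
    """Parse and validate an update; raise ValueError on anything malformed."""
    try:
        config = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError(f"unreadable update: {exc}") from exc
    if not isinstance(config, dict) or not REQUIRED_KEYS.issubset(config):
        raise ValueError("update is missing required fields")
    if not isinstance(config["rules"], list):
        raise ValueError("'rules' must be a list")
    return config


def apply_update(raw: bytes, current: dict) -> dict:
    """Return the new config if it is valid, otherwise keep the current one."""
    try:
        return parse_update(raw)
    except ValueError:
        return current  # reject the bad update instead of crashing


if __name__ == "__main__":
    print(apply_update(b"\x00" * 16, LAST_KNOWN_GOOD))
    print(apply_update(b'{"version": 2, "rules": []}', LAST_KNOWN_GOOD))
```

The design choice here is that a bad update is treated as a normal, expected situation: the program keeps the configuration it already trusts and carries on.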
Wrapping Things Up
I truly sympathize with those affected by this outage. It caused missed or canceled flights, delayed mail, and failed 911 calls.
It goes to show just how much damage the code we write can cause when it's not properly reviewed and tested.
I wish I could say this will never happen again, but we're human; we make mistakes, so it’s only a matter of time.
But hopefully not on this scale.
Anyway, I hope you enjoyed this article.
If you would like more of these, then be sure to subscribe.