this post was submitted on 19 Jul 2024
830 points (98.5% liked)

Technology


…according to a Twitter post by the Chief Information Security Officer of Grand Canyon Education.

So, does anyone else find it odd that the file that caused everything CrowdStrike to freak out, C-00000291-00000000-00000032.sys, was 42KB of blank/null values, while the replacement file, C-00000291-00000000-00000033.sys, was 35KB and looked like a normal, if obfuscated, sys/.conf file?

Also, apparently CrowdStrike had at least 5 hours to work on the problem between the time it was discovered and the time it was fixed.

[–] [email protected] 22 points 5 months ago (5 children)

How can all of those zeroes cause a major OS crash?

[–] [email protected] 212 points 5 months ago (9 children)

If I send you on stage at the Olympic Games opening ceremony with a sealed envelope

And I say "This contains your script, just open it and read it"

And then when you open it, the script is blank

You're gonna freak out

[–] [email protected] 42 points 5 months ago (9 children)

Ah, makes sense. I guess a driver would completely freak out if that file gave no instructions and was just like "..."

[–] [email protected] 17 points 5 months ago (5 children)

Except "freak out" could have various manifestations.

In this case it was "burn down the venue".

It should have been "I'm sorry, there's been an issue, let's move on to the next speaker"

[–] [email protected] 17 points 5 months ago (1 children)

Except since it was antivirus software, the system is basically told "I must be running for you to finish booting", which does make sense, as it means the antivirus can watch the system before any malicious code can get its hooks into things.

[–] [email protected] 9 points 5 months ago

I don't think the kernel could continue like that. The driver runs in kernel mode and took a null pointer exception. The kernel can't know how badly it's been screwed by that, the only feasible option is to BSOD.

The driver itself is where the error handling should take place. First off it ought to have static checks to prove it can't have trivial memory errors like this. Secondly, if a configuration file fails to load, it should make a determination about whether it's safe to continue or halt the system to prevent a potential exploit. You know, instead of shitting its pants and letting Windows handle it.
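The "continue or halt" determination described above can be sketched in a few lines. This is only an illustration in Python, not how the actual driver works: the names FailurePolicy, ConfigError, parse_config and load_or_decide are made up for the example, and the sanity check (reject empty or all-null content) is an assumption, since the real file format isn't described in this thread.

```python
from enum import Enum


class FailurePolicy(Enum):
    FAIL_OPEN = "keep running on the previous config"
    FAIL_CLOSED = "halt deliberately"


class ConfigError(Exception):
    """Raised when a config blob fails basic sanity checks."""


def parse_config(blob: bytes) -> bytes:
    # Hypothetical sanity checks; the real format is unknown here.
    if not blob:
        raise ConfigError("config is empty")
    if not any(blob):
        raise ConfigError("config is all null bytes")
    return blob


def load_or_decide(blob: bytes, policy: FailurePolicy) -> str:
    """Load the config, or make an explicit decision when it is bad."""
    try:
        parse_config(blob)
    except ConfigError as err:
        if policy is FailurePolicy.FAIL_CLOSED:
            return f"halt: {err}"
        return f"continue with previous config: {err}"
    return "loaded"


print(load_or_decide(bytes(42 * 1024), FailurePolicy.FAIL_OPEN))
```

Either branch is a deliberate choice made by the driver's own code, which is the point of the comment: the failure is handled where it happens instead of surfacing as a crash.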

[–] [email protected] 12 points 5 months ago (1 children)

In this case it was "burn down the venue".

It was more like "barricade the doors until a swat team sniper gets a clear shot at you".

[–] [email protected] 11 points 5 months ago

Hmmmm.

More like standing there and loudly shitting your pants and spreading it around the stage.

[–] [email protected] 5 points 5 months ago

The envelope contains a barrel of diesel and a lit flare

[–] [email protected] 4 points 5 months ago

Computers have social anxiety.

[–] [email protected] 2 points 5 months ago (1 children)

You're right of course and that should be on Microsoft to better implement their driver loading. But yes.

[–] [email protected] 8 points 5 months ago

The driver is in kernel mode. If it crashes, the kernel has no idea if any internal structures have been left in an inconsistent state. If it doesn't halt then it has the potential to cause all sorts of damage.

[–] [email protected] 13 points 5 months ago

Great layman's explanation.

[–] [email protected] 10 points 5 months ago (2 children)

Maybe. But I'd like to think I'd just say something clever like, "says here that this year the pommel horse will be replaced by yours truly!"

[–] [email protected] 17 points 5 months ago (1 children)

Problem is that software cannot deal with unexpected situations like a human brain can. Computers do exactly what a programmer tells them to do, nothing more, nothing less. So if a situation arises that the programmer hasn't written code for, then there will be a crash.

[–] [email protected] 2 points 5 months ago (3 children)

Poorly written code can't.

In this case:

  1. Load config data
  2. If data is valid:
    1. Use config data
  3. If data is invalid:
    1. Crash entire OS

Is just poor code.

[–] [email protected] 14 points 5 months ago (4 children)

When talking about the driver level, you can't always just proceed to the next thing when an error happens.

Imagine if you went in for open heart surgery but the doctor forgot to put in the new valve while he was in there. He can't just stitch you up and tell you to get on with it, you'll be bleeding away inside.

In this specific case we're talking about security for business devices and critical infrastructure. If a security driver is compromised, in a lot of cases it may legitimately be better for the computer to not run at all, because a security compromise could mean it's open season for hackers on your sensitive device. We've seen hospitals held to ransom, we've seen customer data swiped from major businesses. A day of downtime is arguably better than those outcomes.

The real answer here is CrowdStrike needs a more reliable CI/CD pipeline. A failure of this magnitude is inexcusable and represents a major systemic failure in their development process. But the OS crashing as a result of that systemic failure may actually be the most reasonable desirable outcome compared to any other possible outcome.

[–] [email protected] 5 points 5 months ago

This error isn't an intentional crash because of a security risk, though that could happen. It's a null pointer exception, which suggests there were no static or runtime checks in place to prevent it or handle it more gracefully. This was presumably a bug in the driver for a long time, then a faulty config file came along and triggered the crashes. Better static analysis and testing of the kernel driver is one aspect, how these live config updates are deployed and monitored is another.

[–] [email protected] 2 points 5 months ago

But the OS crashing as a result of that systemic failure may actually be the most reasonable desirable outcome compared to any other possible outcome.

In which case this should've been documented behaviour and probably configurable.

[–] [email protected] 11 points 5 months ago (6 children)

If AV suddenly stops working, it could mean the AV is compromised. A BSOD is a desirable outcome in that case. Booting a compromised system anyway is bad code.

[–] [email protected] 10 points 5 months ago (3 children)

I agree that the code is probably poor but I doubt it was a conscious decision to crash the OS.

The code is probably just:

  1. Load config data
  2. Do something with data

And step 2 fails unexpectedly because the data is garbage and was never checked for validity.
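To make the "load, then use without checking" pattern concrete, here is a small Python sketch. The file layout (a 4-byte magic value followed by a 4-byte rule count) is invented for the example, as are MAGIC, HEADER and both function names; nothing here reflects the real file format. The unchecked version trusts the header and trips over an all-zero file, while the checked version rejects it before touching the contents.

```python
import struct
from typing import Optional

# Hypothetical layout: 4-byte magic, then a 4-byte little-endian rule count.
MAGIC = b"CFG1"
HEADER = struct.Struct("<4sI")


def unchecked_first_rule(blob: bytes) -> int:
    """Step 2 without validation: trusts whatever the header says."""
    _magic, count = HEADER.unpack_from(blob)
    rules = blob[HEADER.size:HEADER.size + count]
    return rules[0]  # raises IndexError when count is 0


def checked_first_rule(blob: bytes) -> Optional[int]:
    """Same lookup, but every assumption is verified first."""
    if len(blob) < HEADER.size:
        return None
    magic, count = HEADER.unpack_from(blob)
    if magic != MAGIC or count == 0 or HEADER.size + count > len(blob):
        return None
    return blob[HEADER.size]


zeros = bytes(42 * 1024)
print(checked_first_rule(zeros))  # None: rejected before use
try:
    unchecked_first_rule(zeros)
except IndexError as err:
    print("unchecked parse failed:", err)
```

In Python the failure is a catchable IndexError. In kernel-mode native code the equivalent mistake is an invalid memory access, which is why the consequences there are so much worse.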

[–] [email protected] 3 points 5 months ago

You can still catch the error at runtime and do something appropriate. That might be to say this update might have been tampered with and refuse to boot, but more likely it'd be to send an error report back to the developers that an unexpected condition is being hit and continue without loading that one faulty definition file.
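A rough Python sketch of the "skip the faulty file and report it" option follows. The function names, the file names and the validity check are all hypothetical, and a real implementation would send the reports somewhere rather than just collecting them in a list.

```python
def validate_definition(blob: bytes) -> None:
    # Hypothetical check: reject empty or all-null content.
    if not blob or not any(blob):
        raise ValueError("definition is empty or all null bytes")


def load_definitions(files):
    """Load every valid definition; collect a report for each bad one."""
    loaded = []
    error_reports = []
    for name, blob in files.items():
        try:
            validate_definition(blob)
        except ValueError as err:
            # Record the problem and carry on with the remaining files.
            error_reports.append(f"{name}: {err}")
            continue
        loaded.append(name)
    return loaded, error_reports


files = {"good.sys": b"\x01rules", "bad.sys": bytes(1024)}
loaded, error_reports = load_definitions(files)
print(loaded)
print(error_reports)
```

One bad file costs the protection that file would have provided, but the rest still load and the developers find out about the problem.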

[–] [email protected] 7 points 5 months ago (1 children)

I'm gonna take from this that we should have AI doing disaster recovery on all deployments. Tech CEOs have been hyping AI up so much, what could possibly go wrong?

[–] Couldbealeotard 7 points 5 months ago

What are the chances that CrowdStrike started using AI to do their update deployments, and they just won't admit it?

[–] [email protected] 9 points 5 months ago

Nice analogy, except you'd check the script before you tried to use it. Computers are really good at crc/hash checking files to verify their integrity, and that's exactly what a privileged process like antivirus should do with every source of information.
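For what it's worth, the hash check being described is only a few lines with Python's standard hashlib. This is a minimal sketch: the is_intact name is made up, and it assumes the expected SHA-256 digest reaches the client through some trusted channel, which this thread says nothing about.

```python
import hashlib
import hmac


def is_intact(blob: bytes, expected_sha256_hex: str) -> bool:
    """Return True if blob hashes to the digest published alongside it."""
    actual = hashlib.sha256(blob).hexdigest()
    # compare_digest avoids leaking where the two strings differ.
    return hmac.compare_digest(actual, expected_sha256_hex)


published = b"rule data"
digest = hashlib.sha256(published).hexdigest()

print(is_intact(published, digest))              # True
print(is_intact(bytes(len(published)), digest))  # False
```

Note that a checksum only proves the file matches what was published. If the file was already all zeroes when the digest was computed, this check would pass, so it complements content validation and doesn't replace it.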

[–] Cocodapuf 9 points 5 months ago

I'm nominating this for the "best metaphor of the day" award.

Well done!

[–] [email protected] 8 points 5 months ago* (last edited 5 months ago)

The funny bit is, I'm sure more than a few people at CrowdStrike are preparing 3 envelopes right now.

[–] crystalmerchant 7 points 5 months ago

This guy ELI5s

[–] MajinBlayze 58 points 5 months ago (2 children)

Because it's supposed to be something else

[–] [email protected] 35 points 5 months ago (1 children)

At least a few 1's I imagine.

[–] Iheartcheese 29 points 5 months ago (2 children)
[–] [email protected] 15 points 5 months ago

Society isn’t ready for that

[–] [email protected] 16 points 5 months ago

Well, you see, the front fell off.

[–] [email protected] 11 points 5 months ago* (last edited 5 months ago) (2 children)

The file is used to store values to use as denominators in some divisions later in the process. Being all zeros caused a division-by-zero error. Pretty rookie mistake, you should do IFERROR(;0) when using divisions to avoid that.

[–] [email protected] 17 points 5 months ago (1 children)

I disagree. I'd rather things crash than silently succeed or change the computation. They should have done better input and output validation, and gracefully failed into a recoverable state that sends a message to an admin to correct. A divide by zero doesn't crash a system, it's a recoverable error they should 100% detect and handle, not sweep under the rug.
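Taking the division example at face value (it's this thread's illustration, not a confirmed root cause), "detect and handle" might look something like the Python sketch below. InvalidInput, ratio and process are names invented for the example, and the admin_log list stands in for whatever would actually notify an admin.

```python
class InvalidInput(Exception):
    """Raised when an input would make the computation meaningless."""


def ratio(numerator: float, denominator: float) -> float:
    # Validate explicitly instead of substituting a made-up result.
    if denominator == 0:
        raise InvalidInput("denominator is zero")
    return numerator / denominator


def process(values, admin_log):
    """Compute each ratio; log bad inputs for an admin to correct."""
    results = []
    for numerator, denominator in values:
        try:
            results.append(ratio(numerator, denominator))
        except InvalidInput as err:
            admin_log.append(f"skipped ({numerator}, {denominator}): {err}")
    return results


log = []
print(process([(6, 3), (1, 0)], log))
print(log)
```

The bad input never turns into a silent 0 in the results. It is left out and recorded, so the system stays in a recoverable state and someone can fix the data.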

[–] [email protected] 11 points 5 months ago (2 children)

Life pro tip: if you're a Python programmer you should use try: func() except: continue every time you run a function, that way you would never have errors in your code.

[–] [email protected] 5 points 5 months ago

IFERROR(;0)

Maybe they should use a more appropriate development tool for their critical security platform than Excel.

[–] [email protected] 9 points 5 months ago (1 children)

Well, the file shouldn't be zeroes

[–] [email protected] 8 points 5 months ago

The front of the file fell off
