With cosmic-ray soft errors occurring quite often, how come my non-ECC RAM/SSD doesn’t lead to blue screens at all?



Statistics are your friend here.

In 1996 IBM estimated 1 soft error per month per 256 MB of RAM. If you’ve got 8 GB of RAM, that works out to 32 errors per month, or a single soft error per 22.5 hours of run time. Most of your RAM isn’t doing anything important most of the time. If you assume that 70% is used for precaching data that’ll get thrown out, then you’re down to a single soft error in a section that matters every 75 hours (22.5 / 0.3). Now what are the chances that, even in the section that matters, the error lands somewhere you’ll read from before it gets overwritten again?
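To make the arithmetic above easy to check, here is a rough sketch in Python. It assumes a 30-day month, and the function and parameter names are made up for illustration. The 70% precache share is just the guess used in this answer, not a measured figure.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month (720 hours)


def hours_between_errors(ram_mb, errors_per_month_per_256mb=1.0,
                         fraction_that_matters=1.0):
    """Expected hours between soft errors that land in RAM we care about."""
    blocks = ram_mb / 256  # number of 256 MB chunks of RAM
    errors_per_month = (blocks * errors_per_month_per_256mb
                        * fraction_that_matters)
    return HOURS_PER_MONTH / errors_per_month


# 8 GB of RAM, counting every bit flip anywhere in memory
all_ram = hours_between_errors(8 * 1024)

# Same machine, but only the 30% of RAM that is not throwaway precache
important = hours_between_errors(8 * 1024, fraction_that_matters=0.30)

print(f"any error:            every {all_ram:.1f} hours")
print(f"error in the 30% used: every {important:.1f} hours")
```

With these inputs the first figure comes out at 22.5 hours and the second at about 75 hours. Changing `fraction_that_matters` shows how quickly the interval stretches as less of the RAM is holding anything critical.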

The end result of all this is that while you may get 1 soft error per day, it is extremely unlikely to hit a critical section of RAM, mostly because most things in RAM aren’t critical to the OS’s operation. You’d have to time and target your bit flip extremely well to crash the OS with cosmic rays.

Now that isn’t to say it doesn’t happen. I guarantee someone in the world has had an unexpected crash of some system due to a bit flip from cosmic rays, but the chances of it happening to your system while you’re using it are quite low.

a) The error rate is low, perhaps one bit error per month per 256 megabytes of RAM.

b) Not all errors are bad. If you have an error in code that doesn’t get executed, or in stack space that’s not in use (so it’s written again before it’s read), no harm, no foul.

c) Even bad errors don’t need to cause a BSOD. Turn an “a” into a “c” on some help screen and, even if the user sees it, it’s a typo, not a processing fault. Similarly, even changing an executable instruction (from an ADD to a SUB) might give the wrong answer but not cause the program to die.

d) Even if an error occurs, the OS tries hard to clean up or terminate the bad process without crashing the whole OS. OS code also takes up a smaller share of memory than it once did, since memory sizes have grown.
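Point c) is easy to see at the byte level. The snippet below is a small illustration in Python (the helper name `flip_bit` is made up for this example): in ASCII, “a” and “c” differ by exactly one bit, so a single flipped bit is enough to produce that typo.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted, like a single-bit soft error."""
    return value ^ (1 << bit)


original = ord("a")              # 0x61 = 0b01100001
flipped = flip_bit(original, 1)  # invert bit 1 (the bit worth 2)

print(f"{chr(original)} = {original:08b}")
print(f"{chr(flipped)} = {flipped:08b}")
```

This prints “a” as 01100001 and “c” as 01100011. Nothing about the flip is detectable by the program itself. The text on screen is simply wrong by one letter.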

BTW, I doubt you have a non-ECC SSD; error correction is pretty standard in SSD controllers.