Do single event upsets ever affect normal computing?


I just read about [single event upsets](https://en.wikipedia.org/wiki/Single-event_upset) and it’s pretty fascinating. One thing that got me was that a speedrunner of Super Mario 64 experienced a single event upset.

So that leads me to believe that commercial electronics and regular CPUs and GPUs must have a chance to experience these single event upsets. When I research it, there’s only discussion on how it affects space electronics and FPGAs. But there’s gotta be a chance it affects my normal laptop, right? Why would FPGAs be more susceptible to SEUs than CPUs?

If I’m writing a Python script and I set a boolean to False, what’s the probability it gets set to True instead? If I’m logging into a website, what’s the possibility that the server side misinterprets my input? If it can affect an N64 in someone’s living room, there’s gotta be a non-zero chance, right?

In: Engineering

7 Answers

Anonymous 0 Comments

Computers will often do a thing where they check whether any circuits aren’t working: they’ll take a “vote” on what the answer to a calculation is, and if one circuit gets it wrong, it’s outvoted and ignored from then on, which stops a faulty circuit from corrupting later results. It’s an oversimplification, but the idea is true.
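
A toy sketch of that voting idea in Python (real hardware does this with redundant circuits, often called triple modular redundancy, not with software; the `add` function here is just a stand-in for any computation):

```python
def add(a, b):
    return a + b

def vote(results):
    # Return whichever value the majority of the copies agree on.
    return max(set(results), key=results.count)

copies = [add(2, 2), add(2, 2), add(2, 2)]
copies[1] = 5           # pretend an SEU corrupted one copy's result
print(vote(copies))     # 4 -- the corrupted copy is simply outvoted
```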

Anonymous 0 Comments

For a Python script that chance is vanishingly small, as a Boolean is not a single bit. Modern programming languages don’t typically express things in such primitive representations anymore. Moreover, computers have techniques for handling these errors. They are a non-issue for the vast majority of computational needs.
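
For a sense of scale, here’s a quick check you can run yourself; the exact sizes vary by CPython version and platform, but a bool is a whole object, not a lone bit:

```python
import sys

# A Python bool is a full object with a type pointer, reference count, etc.,
# so a single flipped bit in RAM is unlikely to cleanly turn False into True.
print(sys.getsizeof(True))   # e.g. 28 bytes on a 64-bit CPython build
print(sys.getsizeof(False))  # e.g. 24 bytes -- hundreds of bits either way
```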

Anonymous 0 Comments

FPGAs are basically on the very edge of physics, grinding out maximally fast electronic responses that can only be beaten by designing a whole damn custom chip (an ASIC). There is no room for error checking.

Indeed, if two FPGAs are communicating at different clock speeds, the time it takes for one “pulse” of info to fire has a period of instability. Reading the data at that point in the pulse gives you a 50/50 shot at getting the right data. This isn’t even the space laser bs yet; this is something that happens in almost every device, and the solutions that reduce the impact slow both systems down.

Once you’re literally counting individual bits ASAP and calculating results off of that single data read, you gotta account for anything, and I mean ANYTHING.
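
If you want a feel for the clock-domain point above, here’s a toy simulation with entirely made-up numbers; it just models “sample the line near an edge and you latch a coin flip”:

```python
import random

PERIOD = 10.0    # made-up sender clock period, arbitrary units
SETTLING = 0.3   # made-up window around each edge where the line hasn't settled

def intended(t):
    """The square wave the sender is actually driving."""
    return 1 if (t % PERIOD) < PERIOD / 2 else 0

def sampled(t):
    """What a receiver latches if it samples the line at time t."""
    phase = t % PERIOD
    if min(phase, abs(phase - PERIOD / 2), PERIOD - phase) < SETTLING:
        return random.choice([0, 1])    # caught mid-transition: 50/50
    return intended(t)

trials = 100_000
wrong = 0
for _ in range(trials):
    t = random.uniform(0, PERIOD)       # receiver samples at unrelated times
    if sampled(t) != intended(t):
        wrong += 1
print(f"{wrong / trials:.2%} of asynchronous reads came back wrong")
```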

Anonymous 0 Comments

SEUs (or “soft errors”) are a non-zero hazard in modern computing. There are a lot of variables, but in general the issue becomes more of a risk on smaller-geometry (higher density) circuits. If your computer has ECC memory, it can deal with single upsets. Your microprocessor and GPU generally can’t. (Well, they could, but it would make everything more expensive so they don’t.) They typically don’t even use ECC on their cache memory arrays.
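
As a rough sketch of how ECC deals with a single flipped bit, here’s the classic Hamming(7,4) code in Python; real ECC DIMMs use wider SECDED codes over 64-bit words, but the principle is the same:

```python
def encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]      # codeword positions 1..7

def decode(c):
    """Return the 4 data bits, correcting a single-bit error if present."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]           # checks positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]           # checks positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]           # checks positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3          # position of the flipped bit, or 0
    if syndrome:
        c = c.copy()
        c[syndrome - 1] ^= 1                 # flip it back
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
stored = encode(word)
stored[5] ^= 1                               # simulate an SEU flipping one stored bit
print(decode(stored) == word)                # True: the upset was corrected
```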

The biggest hazards come from cosmic rays and from alpha radiation given off by the materials used to package the integrated circuits. The atmosphere helps quite a bit with cosmic radiation, but not much else practical does, unless you like putting a lot of lead or rock around your computer.

Fortunately, your error rate at sea-level with low-alpha packaging materials is fairly low. Using your computer on an airplane flight makes your chances go way up, ~10-30x the last time I did the calculations. Even going to high terrestrial altitudes makes a significant difference.

FPGAs are not intrinsically more susceptible. In fact, space applications often use them in older technologies to help mitigate the risk. You can also instantiate voting systems for computations, but that increases costs.

Run a standard PC long enough, and it will eventually have an SEU. It can cause a hang, a crash, or data corruption. (Or nothing at all, but those don’t really count.) I can’t cite a number, since there are a lot of variables. But for a typical PC, I’d say that it won’t go a full year (24/7) without a good chance of an event.
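
To make the arithmetic concrete (the rate below is an assumed placeholder, not a measured figure), the expected count is just rate × bits × time:

```python
# Purely illustrative: real soft-error rates vary enormously with process,
# altitude, and packaging, as noted above.
ASSUMED_FIT_PER_MBIT = 1        # upsets per billion device-hours per Mbit (made up)
dram_mbit = 16 * 1024 * 8       # 16 GB of DRAM expressed in megabits
hours_per_year = 24 * 365

upsets_per_year = ASSUMED_FIT_PER_MBIT * dram_mbit / 1e9 * hours_per_year
print(f"~{upsets_per_year:.1f} expected bit flips per year under these assumptions")
```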

Source: *I’m a former semiconductor reliability engineer who did a fair bit of SER work.*

Anonymous 0 Comments

> CPUs and GPUs must have a chance to experience these single event upsets

Yes, CPUs, GPUs, and all electronics can experience SEUs. It’s most common in the memory, since memory often takes up the bulk of the area on these chips.

> Why would FPGAs be more susceptible to SEUs than CPUs?

They aren’t inherently more susceptible. It’s just that SEUs are very common in space, so using ordinary CPUs and GPUs is error-prone: they just aren’t designed to handle lots of SEUs, so you need redundancy. At the same time, it’s usually not financially worth it to make a custom chip with all that redundancy built in, so people opt for FPGAs. You can fill the FPGA up with multiple copies of the same logic and have them ‘vote’, so an error in one copy is nullified.

> If I’m writing a Python script and I set a boolean to False, what’s the probability it gets set to True instead?

The chances of either of those happening on Earth are really, really small, but not zero, as you said with the case of the Super Mario 64 speedrunner. A Nintendo 64 wouldn’t have had any protection against that. But in modern systems, if an SEU hits, say, the system’s main memory, error-correcting codes would catch and fix it, further reducing the chances that a user would even know an SEU happened.

Anonymous 0 Comments

Just to add some perspective here: a single-bit error would, in most circumstances, mean that a single pixel out of the 2,073,600 pixels in an image that flashed up on your screen for a sixtieth of a second is a slightly wrong shade, where the difference is only barely distinguishable to the human eye in ideal circumstances.
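
To put a rough number on “slightly wrong shade” (a toy example; how visible the change is depends on which bit happens to flip):

```python
# Flip one bit in an 8-bit colour channel. A low-order bit barely changes the
# shade; a high-order bit changes it a lot, but either way it's one channel of
# one pixel for one frame.
value = 0b10110100              # some channel value (180 out of 255)
print(value ^ 0b00000001)       # 181 -- an indistinguishable shade
print(value ^ 0b10000000)       # 52  -- a visible change, gone on the next frame
```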

Anonymous 0 Comments

>When I research it, there’s only discussion on how it affects space electronics and FPGAs.

Space has two reasons to deal with it. The first is obviously that your equipment is costly and crucial, so errors can be devastating. The other is cosmic radiation, which gets more intense once you leave the protective shield of Earth’s atmosphere, so there is simply more radiation and thus a higher-than-usual chance of an upset happening in the first place.

Also, even True and False in Python are likely not just 1-bit values, so flipping one bit might not invert them. And in terms of the N64: the earlier the console, the more likely the designers tried to squeeze out as much performance as possible with the least material, so it’s much more likely that at some point they went “fingers crossed, that won’t break most of the time.” After all, that’s literally the kind of problem you can solve with a reset button or something like that: not because it fixes the cause, but because the glitch is unlikely to happen again.