software failures

I understand that machines fail for numerous physical reasons. However, I’ve never understood how computers or programs that were working fine all along can suddenly crash or break down if there are no moving parts and the code hasn’t been recently patched or updated. This has bugged me for over 25 years, and it finally occurred to me that I should ask about it.

10 Answers

Anonymous 0 Comments

> I’ve never understood how computers or programs that were working fine all along can suddenly crash or break down if there are no moving parts

That’s the thing, though. There *are* moving parts.

Sometimes the moving parts are physical, ~~like an old processor losing performance over time,~~ (edit: apparently this isn’t real under normal usage and maintenance) or thermal paste drying up and conducting heat less efficiently, or cat hair clogging up your PC’s air intake filter ~~because you never clean it what is wrong with you~~. These can cause your hardware to run hotter and become less powerful, which can affect your software.

But sometimes the moving parts aren’t physical. A piece of software might work fine with a certain amount of resources, but changes to other programs might either increase the amount of resources used when the program is run (think anti-virus or anti-cheat software for games), or might change the amount of resources available to the program without interacting with the program itself (think another program running in the background that hogs all the RAM), or might reduce the number of resources or types of things the program is allowed to request (think changes to the operating system).

All these can cause a program to slow down or even crash despite the program itself not changing.

Anonymous 0 Comments

But there are moving parts: electrons. In fact, there are a lot of moving parts, easily in the billions. Imagine them as water running through a system of channels and gates that have to open at just the right time to create the perfect combination of open/close across the whole system so it does exactly what it’s supposed to do. What’s more, the gates are controlled by the very water they are guiding through the channels.

So what software does is create a set of rules that define what the gates do under what circumstances. But there are also billions of gates, and by extension an even greater number of combinations they can be in… and as soon as an open/close combination is reached that is not defined in the rules, what is going to happen?

Without a rule, the water goes where it wants. In the best case, this will result in a position that is covered by the rules again. But how probable is that with the myriad of possible combinations?

Now with computers, there is a whole hierarchy of systems in place that will prevent the whole thing coming to a stop: a program may be terminated, but you won’t have it delete your hard drive. But the point is that computers are a thing of mind-blowing complexity deep down.

Anonymous 0 Comments

We pretty much all fail at reliability in software. People, even practitioners, simply don’t make it a priority. The incentives are difficult at best.

There are ways to improve this and almost nobody has even heard of them.

Anonymous 0 Comments

These failures usually don’t just appear. They have always been there; it’s just that no one has ever created the very specific circumstances that make them happen.
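To make that concrete, here is a minimal sketch of a bug that sits unnoticed until one specific input shows up. The helper name `same_day_next_year` is made up for illustration; the only real API used is Python’s standard `datetime.date`.

```python
from datetime import date


def same_day_next_year(d: date) -> date:
    # Hypothetical helper: looks correct and works for almost every date,
    # but silently assumes the same day exists in the following year.
    return d.replace(year=d.year + 1)


# Works for an ordinary date.
print(same_day_next_year(date(2023, 6, 15)))

# Fails only on 29 February, because 29 February 2025 does not exist.
try:
    same_day_next_year(date(2024, 2, 29))
except ValueError as err:
    print("latent bug triggered:", err)
```

Nothing about the code changed between the first call and the second. The flaw was there from day one, and it only needed the right calendar date to show itself.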

Anonymous 0 Comments

Software fails because software is essentially trains of thought frozen in time.

It’s like when you make a plan, but the world has another. Or when you write a guide for something, and then people return with the most unthinkable problems.

Anonymous 0 Comments

Like others have pointed out, sometimes the circumstances for it are just very, very specific. One real-world example is called integer overflow.

For background, numbers in computers are most often stored using a method called two’s complement. I’ll try to keep it simple, but basically, it’s like a number line: it starts at 0, and then the negative numbers come after the positives (0 1 2 3 -4 -3 -2 -1). This is only 3 bits’ worth of data, and older computers usually used 32.
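The eight values on that number line are exactly what fits in 3 bits. As a rough sketch (the function name `as_signed` is just for illustration), this is how each bit pattern maps to a signed value under two’s complement:

```python
BITS = 3  # eight patterns, matching the eight values on the number line above


def as_signed(pattern: int, bits: int = BITS) -> int:
    """Interpret an unsigned bit pattern as a two's complement value."""
    sign_bit = 1 << (bits - 1)
    # If the top bit is set, the pattern stands for a negative number.
    return pattern - (1 << bits) if pattern & sign_bit else pattern


for pattern in range(1 << BITS):
    print(f"{pattern:0{BITS}b} -> {as_signed(pattern)}")
```

Counting up from 011 (3) to 100 lands on -4, which is the jump from the largest positive number to the most negative one.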

So, for example, you might be counting the number of milliseconds your server has been running for. With a signed 32-bit number, that won’t overflow for about 24 days. So let’s say you plan to restart your server every week so the count resets. It might take years before the server ever has to run for long enough for the count to overflow into negatives.

There are ways to account for this, like using a bigger number, not allowing negatives, or resetting your count occasionally. Usually this kind of problem happens when they never expected the program to need to run that long, or it’s circumstances you just didn’t anticipate during planning.
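Here is a small sketch of the millisecond counter described above. Python’s own integers never overflow, so the helper `wrap_int32` (a name invented for this example) imitates what a fixed-width signed 32-bit counter would do.

```python
INT32_MAX = 2**31 - 1
MS_PER_DAY = 24 * 60 * 60 * 1000


def wrap_int32(value: int) -> int:
    """Reduce an integer to the signed 32-bit range, imitating fixed-width wraparound."""
    return (value + 2**31) % 2**32 - 2**31


# How long a signed 32-bit millisecond counter lasts: a bit under 25 days.
print(INT32_MAX / MS_PER_DAY)

# A weekly restart keeps the counter far from the limit, so nothing goes wrong.
one_week = wrap_int32(7 * MS_PER_DAY)
print(one_week)

# One millisecond past the limit, the uptime suddenly becomes negative.
uptime_ms = wrap_int32(INT32_MAX + 1)
print(uptime_ms)

# The "bigger number" fix: a signed 64-bit counter lasts hundreds of millions of years.
print((2**63 - 1) / (MS_PER_DAY * 365))
```

The code that compares or subtracts these uptimes is the part that actually breaks: it was written assuming the number only ever goes up.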

Other people have already said other good things about how and why, but hopefully this concrete example can help you conceptualize how it can happen and how it can fail to show up until years later.

Anonymous 0 Comments

As many pointed out, software failures are caused by flaws in the logic that was written into code. These flaws are only triggered in very specific, ultra-rare circumstances.

Anonymous 0 Comments

There could be a dependency on another piece of software (operating system, third-party code) that got updated.

Anonymous 0 Comments

To add on to the other answers, it is possible for software that is working correctly, running on an operating system that is operating correctly, all running on hardware that does not otherwise have any problems, to suddenly fail.

It’s rare, but because of the very very very small size of the individual electronic components on an integrated circuit, they can experience interference from cosmic rays. This can cause things like a single bit to flip the wrong way.

It’s possible to engineer software to be tolerant of single bit failures, but it isn’t cheap to do so and is almost never worth the investment for common software products.

Where it is worth the investment is for things like avionics boxes that control weapons release on aircraft. There are government regulations for things like this where a command to launch a weapon cannot be a single bit in a message. It would have to be at least two bits, they have to be opposite, and they have to be on different message words (words are usually 8, 16, or 32 bits in size) and they can’t be adjacent. There are other reasons for having such regulations, but the general idea is that a single isolated bit failure can’t cause a weapon to inadvertently launch (there are usually a lot of other things that prevent weapons from launching, but each step in the process gets this kind of treatment).
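A toy sketch of that idea, not any real avionics format: the word size, the bit positions, and every name below are made up for illustration. The command only counts as "release" when one bit is 1 in the first word and another bit is 0 in the second word.

```python
WORD_BITS = 16                      # assumed word size for this sketch
ARM_WORD, ARM_BIT = 0, 3            # hypothetical position, must be 1 to release
CONFIRM_WORD, CONFIRM_BIT = 1, 9    # hypothetical position, must be 0 to release


def get_bit(words, word, bit):
    """Read a single bit out of a list of message words."""
    return (words[word] >> bit) & 1


def is_release_command(words):
    # Both bits must agree, with opposite polarity, in different words.
    return (get_bit(words, ARM_WORD, ARM_BIT) == 1
            and get_bit(words, CONFIRM_WORD, CONFIRM_BIT) == 0)


def flip_bit(words, word, bit):
    """Return a copy of the message with one bit flipped, like a cosmic-ray hit."""
    flipped = list(words)
    flipped[word] ^= 1 << bit
    return flipped


SAFE = [0, 1 << CONFIRM_BIT]        # arm bit is 0, confirm bit is 1
RELEASE = [1 << ARM_BIT, 0]         # arm bit is 1, confirm bit is 0

# Try every possible single-bit flip of the safe message.
single_flips = [flip_bit(SAFE, w, b)
                for w in range(len(SAFE))
                for b in range(WORD_BITS)]
print(any(is_release_command(m) for m in single_flips))

# A message stuck at all zeros or all ones is not a release command either.
print(is_release_command([0, 0]), is_release_command([0xFFFF, 0xFFFF]))
```

In this sketch no single flipped bit turns the safe message into a release command, and a plausible reason for requiring opposite values is that a message stuck at all zeros or all ones can’t pass the check either.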

Anonymous 0 Comments

Failures happen when something unexpected happens. Computers follow instructions perfectly. But what happens if something outside the program’s control happens? Someone enters invalid data, a file gets corrupted, the computer runs out of memory, etc. Programs can’t test for every unexpected occurrence.

It’s like going through a maze when you have the directions through it. What if one of the passages is blocked? Then what? You get lost.
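A tiny sketch of the "blocked passage" case, with made-up function names: the first version follows its directions perfectly and has no plan for input nobody expected, while the second decides ahead of time what to do.

```python
def average(values):
    # Follows the directions exactly, but assumes the list is never empty.
    return sum(values) / len(values)


def safe_average(values):
    # One possible guard: decide up front what an empty input should mean.
    if not values:
        return None
    return sum(values) / len(values)


print(average([2, 4, 6]))

try:
    average([])  # the unexpected input: nothing to average
except ZeroDivisionError as err:
    print("crashed:", err)

print(safe_average([]))
```

Returning `None` is just one choice here. The point is that someone had to think of the empty case before it happened, and that is exactly what gets missed.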