Are computers infallible?

We all know that computers can “crash”.

But can anyone provide a really good explanation of why memory corruption occurs? And are computer calculations infallible?

Allow me to share with you a little story. When I was studying thermodynamics at university, a very strange thing happened: a relatively old computer essentially made a mistake!

What happened was this. We were in the laboratory, measuring a quantity for an experiment and then tabulating the results in an Excel spreadsheet with the aid of an old computer.

But one of the cells refused to play nicely. It didn’t produce the appropriate result. It was way off. Not just a little off. It was way, way off. It was so far off we couldn’t help but notice it.

The Excel spreadsheet had failed. Either the software or the hardware had failed. Something had clearly failed. How can the same mathematical algorithm generate one inconsistent result among many rows of similar cells?

We investigated further. We looked at the individual cells but the equations in each one were all identical. We looked at the references to the other cells and they were all correct.

I remember looking at conditional formatting, number formats and a load of other things. Nothing. Everything should have worked. And none of us were Excel novices. We were advanced users!

So being conscientious students with lab-work to do, eventually we just had to let the incident go and get on with the assigned experiment.

We had to acknowledge that this time, this one time, the computer had made the mistake, not the human. We are told that this is impossible. We are taught that this is impossible.

It was bizarre. Computers do not, cannot, make mistakes. One plus one always equals two. But this time, there were two other witnesses! One witness was my friend and lab partner. The other witness was the thermodynamics professor himself! We were all equally baffled and perplexed. What had happened?

What’s interesting is the professor’s response. Do you know how the professor dealt with this strange situation? He said that “God must have sneezed”. And he said it more than once, half-jokingly, as you might in front of two young science students upon having no decent explanation.

The truth is, we never found the real reason for this odd behaviour. But I personally prefer to deal with that incident by saying that the CPU was struck by a gamma ray emitted by an external source (perhaps a stray cosmic ray originating from outer space), leading to the unexpected result. One rogue particle leading to one rogue answer. Who knows?
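
To get a feel for how one rogue particle could give an answer that is way, way off rather than just a little off, here is a small Python sketch. It is only an illustration under assumptions: I don't know how that spreadsheet stored its numbers, so I'm assuming a 64-bit IEEE 754 floating-point value, and the function name `flip_bit` and the reading of 21.5 are made up for the example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit encoding inverted.

    Bit 0 is the least significant mantissa bit, bits 52 to 62 hold
    the exponent, and bit 63 is the sign.
    """
    if not 0 <= bit <= 63:
        raise ValueError("bit must be between 0 and 63")
    # Reinterpret the float as a 64-bit unsigned integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the chosen bit, then reinterpret the integer as a float.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


reading = 21.5  # hypothetical lab value, not from the real experiment

print(flip_bit(reading, 0))   # lowest mantissa bit: almost unchanged
print(flip_bit(reading, 52))  # lowest exponent bit: halved to 10.75
print(flip_bit(reading, 62))  # highest exponent bit: collapses to nearly zero
print(flip_bit(reading, 63))  # sign bit: becomes -21.5
```

Which bit gets hit matters enormously. A flip low in the mantissa would have gone unnoticed, whereas a flip in the exponent produces exactly the kind of result you can't help but notice. None of this proves that is what happened to us, of course.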

I can’t recall whether we just deleted the equation in the cell or started a new worksheet template or had to restart the computer in order to reset its memory. I don’t actually remember the particulars about the numbers or formulae either. I don’t even recall anything to do with the experiment. But what is important, what I do remember, is that computers can and do make mistakes.

Newer electronics never seem to be as reliable as older electronics, do they? That’s why it was so odd when this particular old 80286 computer made that one mistake. Those Intel processors were built like tanks in comparison to the fine nano-scale architecture that we see today. A sturdily built abacus would be even less likely to break, would it not?

Yet as electronic devices become more advanced, they seem more likely to make mistakes. Why is that?

It’s because the smaller the internal parts get, the less reliable they become. Smaller things are more susceptible to the effects of galvanic corrosion. And other things occur at microscopic scales; things like tin whiskers growing out of lead-free solder joints, for instance.

Then there is the sudden jolt of your smartphone being dropped on the floor (I have dropped my phone several times). It’s almost a gamble, a veritable lottery these days, to see whether a modern smartphone will survive a fall or not…

But what can happen to some modern electronic devices is that they just stop working altogether. Rather than produce something unexpected, the normal outcome is that they simply don’t produce any result at all, so you’re less likely to notice the mathematical mistakes they might otherwise have made. Perhaps they break because more than one component is damaged simultaneously and the integrated circuits can no longer compute anything correctly and therefore no result appears. Either that or the code breaks before the hardware.

So my takeaway point is that computers can and do make mistakes. That, and you might say that modern electronic systems are too complex for us to notice some of these mistakes.