Monday, July 4, 2011

Computers Make Mistakes

Error Free Computers
Are A Myth

By the blog author

You have probably heard some nonsense from computer nerds that "computers don't make mistakes." Yes, they do. There are two central reasons that computers make mistakes: 1) the way they are manufactured and 2) programming errors.

                              Manufacturing Errors

An integrated circuit is built up in layers on a silicon wafer, with the circuitry patterned into each layer, usually through photolithography. The entire process occurs in a "clean room," because silicon wafers suffer degraded performance when exposed to impurities of any sort.

On the other hand, specific impurities are deliberately added because of the electrical properties they induce in the material, creating doped transistors. The modern technique for this is called ion implantation. The doping is followed by setting and sealing through heat ("furnace anneal") or, for advanced integrated circuits, rapid thermal anneal (RTA). Transistors and capacitors are built into particular layers. Then the layers are fully interconnected. Then the wafer is tested for operation, called "the wafer test." Then the chips are packaged and tested again.

Detail available at http://en.wikipedia.org/wiki/Integrated_circuit_fabrication

[Photo: the same programmer from Poland who won a computer chess award, as discussed in yesterday's blog. Note the room fans used to force air past the processors and keep them cool!]

I'm going into these details to make a vital point. Integrated circuits are delicate: they won't function properly if impurities are introduced during manufacture; they are baked like cookies; and they are tested after the transistor and capacitor layers are added, tested again after the interconnections are added, and tested yet again once packaged.

How many integrated circuits "die" in the clean room because they fail a test during the manufacturing process? "Thirty to forty percent." The other sixty to seventy percent pass and are put into use. Let's be a little more precise and logical in stating this: sixty to seventy percent of clean room output fails to fail the tests during manufacture. That means sixty to seventy percent are approved for use with no known built-in shortcomings, which is not the same as being proven flawless.

The computer nerds are wrong, then, when they say that computers don't make mistakes. Nerds may counter that there is only a one in a million possibility that a mistake will be made. But computer programs use hundreds of thousands of lines of code, many iterations may be needed to get a result, and so a computer performs millions and millions of operations during regular processing. The potential for error is rampant.

Usually, the computer will lock up, get stuck in a loop, or refuse to function if there is a serious error. This is called "crashing." But any operating system that can crash can also proceed with erroneous data or calculations and continue functioning, producing incorrect results undetectably. Crashing can be caused by corrupt data, manufacturing error, equipment operating outside its reliable temperature range, or programming errors (discussed below).
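
To make the "undetectable" part concrete, here is a minimal sketch in Python (the numbers and the flip_bit helper are invented for illustration) of how a single flipped bit in memory turns one valid number into a different valid number, with no crash and no warning:

    import struct

    def flip_bit(x, bit):
        # Reinterpret the float as its raw 64-bit pattern, flip one bit, convert back.
        (raw,) = struct.unpack("<Q", struct.pack("<d", x))
        (damaged,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
        return damaged

    balance = 1234.56
    corrupted = flip_bit(balance, 52)   # flip the lowest bit of the exponent
    print(balance)     # 1234.56
    print(corrupted)   # 617.28 -- still a perfectly plausible number, just wrong

Nothing in that run looks like a failure; the wrong answer simply moves downstream into whatever calculation uses it next.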

Undetectable errors may well be the result of some weakness or manufacturing problem in the annealing process, where the wafer is "baked like a cookie" as I described above. And even the sixty to seventy percent of integrated circuits that pass all the factory tests will perform correctly only within the narrow temperature range for which they were designed. Specifically, if you heat any integrated circuit enough, it will certainly malfunction.

The layers of an integrated circuit are themselves getting thinner and thinner. This makes the device faster and able to store more data. Yet this miniaturization also makes it more susceptible to voltage surges and to outside interference such as static electricity.

Going back half a century, the Department of Defense used circuit boards mounted with individual transistors. These circuit boards were stacked in expensive, water-cooled cabinets. Errors still occurred. Master computer technicians used to carry rubber mallets, which they would bang against the cabinet to ensure the trays of circuit boards were still connected, often fixing an inoperative machine by brute force. Now the cabinet is all on a chip, and the chip is tested after its layers are connected together, but such a test is not an absolute guarantee of error-free functioning.

Further, there is an inherent potential for error in solid-state computer devices such as transistors and integrated circuits. Such devices use a semiconductor, usually silicon, properly "doped" with an intentional impurity for specific electrical properties. Current is passed through junctions of material arranged as P-N-P or N-P-N structures. Solid-state devices are much smaller and use much less power than the original computing devices, vacuum tubes. But vacuum tubes had one simple superiority: either the current was passed or it was blocked. Completely. A properly manufactured tube doesn't "leak" electrons when set to 0, even while the tube itself is powered up. P-N-P and N-P-N junctions are inherently capable of leaking, again especially outside their designed operating temperatures. These leaks are part of the reason that power surges or AC fluctuations can cause computational errors or loss of data.

= = = = = = = = = = = = = = = = = = = = = = = = =

PNP and NPN Transistor Vulnerabilities (from Wikipedia)

Vulnerabilities
Exposure of the transistor to ionizing radiation causes radiation damage. Radiation causes a buildup of 'defects' in the base region that act as recombination centers. The resulting reduction in minority carrier lifetime causes gradual loss of gain of the transistor.

Power BJTs are subject to a failure mode called secondary breakdown, in which excessive current and normal imperfections in the silicon die cause portions of the silicon inside the device to become disproportionately hotter than the others. The doped silicon has a negative temperature coefficient, meaning that it conducts more current at higher temperatures. Thus, the hottest part of the die conducts the most current, causing its conductivity to increase, which then causes it to become progressively hotter again, until the device fails internally. The thermal runaway process associated with secondary breakdown, once triggered, occurs almost instantly and may catastrophically damage the transistor package.

If the emitter-base junction is reverse biased into avalanche or Zener mode and current flows for a short period of time, the current gain of the BJT will be permanently degraded.

http://en.wikipedia.org/wiki/PNP_transistor#PNP

= = = = = = = = = = = = = = = = = = = = = = = = =

                                Programming Errors

There are very basic errors that are easy to make in programming. A program is a collection of "routines," and a routine contains processes that are performed over and over again ("loops") as well as outside processes brought in from time to time to perform certain functions ("subroutines"). It is the programmer's responsibility to ensure that no data dependencies exist between a subroutine and the loop(s) that call it. Local variables in a routine should be declared automatic instead of static. And allocating large local variables on the stack can result in stack overflow.

It’s not so simple. See http://download.oracle.com/docs/cd/E19205-01/819-5262/aeujh/index.html .
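
That Oracle page concerns Fortran and C routines being parallelized, but the two hazards just mentioned, a data dependency carried from one pass of a loop to the next and "static"-style local state that outlives a call, can be sketched in Python as a rough analogy (the function names here are made up for the illustration):

    # A loop-carried data dependency: each pass reads the value the previous
    # pass wrote, so the iterations cannot be reordered or run in parallel
    # without changing the answer.
    def running_total(values):
        total = 0.0
        results = []
        for v in values:
            total += v            # depends on the previous iteration's total
            results.append(total)
        return results

    # Python's closest analogue to a "static" local variable: a mutable default
    # argument is created once and shared by every later call, so state quietly
    # leaks between calls that look independent.
    def log_event(event, history=[]):    # bug: history persists across calls
        history.append(event)
        return history

    print(running_total([1, 2, 3]))   # [1.0, 3.0, 6.0]
    print(log_event("start"))         # ['start']
    print(log_event("stop"))          # ['start', 'stop'] -- surprising shared state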

Do programmers check all these things meticulously? No, they write a program and then run it against test data. If the test data produce the right answer, they assume the program is error free. But such programming may include needless additional steps, taking longer to compute and possibly introducing errors as data is carried from one operation to another. A "fix" for this is to use a dense and compact programming language, in which complex functions are reduced to very short code. APL is such a programming language.

The difficulty here is that the language itself is so dense that it is difficult to proofread it for errors.
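
I can't reproduce APL itself here, but the trade-off shows up in any language that permits very compact code. A small Python sketch of the idea (the word-counting task is invented for illustration):

    words = "the quick brown fox jumps over the lazy dog the".split()

    # Dense version: one line, hard to audit at a glance.
    freq = {w: sum(1 for x in words if x == w) for w in set(words)}

    # Spelled-out version: the same result, far easier to proofread for errors.
    freq_plain = {}
    for w in words:
        freq_plain[w] = freq_plain.get(w, 0) + 1

    print(freq == freq_plain)   # True, but only one of the two is easy to check by eye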

Computers are data preparation and comparison devices using base 2. The initial purpose of computers was to "crack" the cryptographic codes of enemy messages during World War II. This was a brilliant, perhaps war-winning, accomplishment for British intelligence: a special-purpose machine (the "Bronze Goddess") that cracked messages scrambled by the German Enigma cipher machine. Later, in America, special-purpose computers were created to develop the "explosive lens" needed to implode radioactive material so densely that an atomic explosion occurred. The individual central to that accomplishment was John von Neumann, who, shortly after the war, reshaped the computer by introducing the "central processing unit," or CPU, an architecture that remains the standard to this day.

The establishment of central processing units opened the door for standard instructions, a process greatly accelerated by the development of the FORTRAN computer language in 1956.

Computers remain the most efficient way to crack coded messages and perform certain mathematical calculations. But they can be twisted into setting up and calculating spreadsheets, performing word processing, and managing databases. It takes a lot of twisting to use computers in these inefficient ways.

It takes hundreds of thousands of lines of computer programming for each such application. The Department of Defense figured out years ago that it is impossible to write that much code without making mistakes. It is unwise to claim that a large program has been tested and reviewed to the point of certainty that it is perfect.

In yesterday's blog post, a winning computer chess tournament program was featured. The trophy was taken back because the contestant used programming from someone else, which he had tweaked further.

Let's get beyond the judgmental question of whether the contest judges should have taken the trophy away. There is another and more important consideration here. The programmer, rather than write a new subprogram for a certain set of circumstances, cut and pasted existing programming. This saves time and is probably no more error prone than doing it from scratch himself.

We can say that most programs are like this: a wall covered with pasted-in notes drafted by many different programmers, usually remote and unrelated to one another, assembled into a single program by an anthologizer.

There’s nothing inherently wrong with this. The problem is simpler and deeper. Programming lacks what a biologist would call bench protocols or what a chemist would call lab protocols or an engineer would call protocols. These are standard procedures everyone adopts to use certain equipment safely. In chemistry, it also involves using equipment accurately and correctly (infamously, including sparklingly clean and uncontaminated glassware for titrations and reactions).

The analog in software engineering would be set pieces of mathematical functions, universally available and uncorrupted, which could be used in programming without worrying about inaccuracy or corruption. These perfect building blocks do not appear to exist yet, even though programming is a skill that has been around for more than half a century.

When a software engineer talks about protocols, this is what is meant:

"In computing, a protocol is a set of rules which is used by computers to communicate with each other across a network. A protocol is a convention or standard that controls or enables the connection, communication, and data transfer between computing endpoints. In its simplest form, a protocol can be defined as the rules governing the syntax, semantics, and synchronization of communication. Protocols may be implemented by hardware, software, or a combination of the two. At the lowest level, a protocol defines the behavior of a hardware connection."

Read more at: http://wiki.answers.com/Q/What_are_software_protocols#ixzz1RCOSaRds
 
That’s syntax, that’s continuity, but it isn’t what a scientist would call a protocol.
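
For what it's worth, here is the kind of thing that quoted definition covers, reduced to a toy, made-up wire format in Python: the entire "protocol" is the rule that every message is sent as a 4-byte length followed by the message bytes.

    import struct

    def frame(message: bytes) -> bytes:
        # The rule both sides agree on: 4-byte big-endian length, then the payload.
        return struct.pack(">I", len(message)) + message

    def unframe(data: bytes) -> bytes:
        (length,) = struct.unpack(">I", data[:4])
        return data[4:4 + length]

    packet = frame(b"hello")
    print(packet)             # b'\x00\x00\x00\x05hello'
    print(unframe(packet))    # b'hello'

Both sides must agree on that rule or communication fails, but agreeing on a wire format is nothing like a lab-style procedure for guaranteeing correct results.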

Add to this the "dense" computer languages used to save computer time (such as APL), the sheer length and complexity of the coding, and programs written to resist being reconstructed by competitors, and there is an overwhelming tendency toward errors. This is why "beta" versions of new programs are given out to computer nerds and hackers to see whether they can find the shortcomings.

My point is that they seldom find all the bugs. The programs are too large to be error free, and there is no perfect bug-finding set of test data. And there are things that computers simply cannot do. They can't find the shortest route to multiple destinations without an exhaustive search, although bees can do this in nature. They can't truly generate an absolutely random series of numbers, although throwing dice performs exactly that function. Further, and dangerously, computers and their programmers don't fully appreciate the difference between an iterated calculation and genuine outside laboratory results. There is a dangerous tendency to equate the two (an error that was central to the financial crash of September 2008).
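
On the randomness point, a minimal Python sketch using the standard library's pseudorandom generator shows why the numbers are not truly random: seed it the same way twice and the "random" sequence repeats exactly.

    import random

    random.seed(42)
    first = [random.random() for _ in range(3)]

    random.seed(42)
    second = [random.random() for _ in range(3)]

    print(first)             # three numbers that look random
    print(second)            # the identical three numbers
    print(first == second)   # True: the sequence is completely determined by the seed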

Conclusion: Computers make errors. Programmers often make errors. Computer owners (insultingly dismissed as "users") are ignored when they suffer the consequences.
