Tuesday, December 19, 2017

Basics of Troubleshooting

Troubleshooting or dépanneuring is a form of problem solving, often applied to repair failed products or processes on a machine or a system. It is a logical, systematic search for the source of a problem in order to solve it, and make the product or process operational again. Troubleshooting is needed to identify the symptoms. Determining the most likely cause is a process of elimination—eliminating potential causes of a problem. Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.

In general, troubleshooting is the identification or diagnosis of "trouble" in the management flow of a corporation or a system caused by a failure of some kind. The problem is initially described as symptoms of malfunction, and troubleshooting is the process of determining and remedying the causes of these symptoms.

A system can be described in terms of its expected, desired or intended behavior (usually, for artificial systems, its purpose). Events or inputs to the system are expected to generate specific results or outputs. (For example, selecting the "print" option from various computer applications is intended to result in a hardcopy emerging from some specific device). Any unexpected or undesirable behavior is a symptom. Troubleshooting is the process of isolating the specific cause or causes of the symptom. Frequently the symptom is a failure of the product or process to produce any results. (Nothing was printed, for example). Corrective action can then be taken to prevent further failures of a similar kind.

The methods of forensic engineering are especially useful in tracing problems in products or processes, and a wide range of analytical techniques are available to determine the cause or causes of specific failures. Corrective action can then be taken to prevent further failure of a similar kind. Preventative action is possible using failure mode and effects (FMEA) and fault tree analysis (FTA) before full-scale production, and these methods can also be used for failure analysis.

Aspects of Troubleshooting

Usually troubleshooting is applied to something that has suddenly stopped working, since its previously working state forms the expectations about its continued behavior. So the initial focus is often on recent changes to the system or to the environment in which it exists. (For example, a printer that "was working when it was plugged in over there"). However, there is a well known principle that correlation does not imply causality. (For example, the failure of a device shortly after it has been plugged into a different outlet doesn't necessarily mean that the events were related. The failure could have been a matter of coincidence.) Therefore, troubleshooting demands critical thinking rather than magical thinking.

It is useful to consider the common experiences we have with light bulbs. Light bulbs "burn out" more or less at random; eventually the repeated heating and cooling of its filament, and fluctuations in the power supplied to it cause the filament to crack or vaporize. The same principle applies to most other electronic devices and similar principles apply to mechanical devices. Some failures are part of the normal wear-and-tear of components in a system.

A basic principle in troubleshooting is to start from the simplest and most probable possible problems first. This is illustrated by the old saying "When you see hoof prints, look for horses, not zebras", or to use another maxim, use the KISS principle. This principle results in the common complaint about help desks or manuals, that they sometimes first ask: "Is it plugged in and does that receptacle have power?", but this should not be taken as an affront, rather it should serve as a reminder or conditioning to always check the simple things first before calling for help.

A troubleshooter could check each component in a system one by one, substituting known good components for each potentially suspect one. However, this process of "serial substitution" can be considered degenerate when components are substituted without regard to a hypothesis concerning how their failure could result in the symptoms being diagnosed.

Simple and intermediate systems are characterized by lists or trees of dependencies among their components or subsystems. More complex systems contain cyclical dependencies or interactions (feedback loops). Such systems are less amenable to "bisection" troubleshooting techniques.

It also helps to start from a known good state, the best example being a computer reboot. A cognitive walkthrough is also a good thing to try. Comprehensive documentation produced by proficient technical writers is very helpful, especially if it provides a theory of operation for the subject device or system.

A common cause of problems is bad design, for example bad human factors design, where a device could be inserted backward or upside down due to the lack of an appropriate forcing function (behavior-shaping constraint), or a lack of error-tolerant design. This is especially bad if accompanied by habituation, where the user just doesn't notice the incorrect usage, for instance if two parts have different functions but share a common case so that it is not apparent on a casual inspection which part is being used.

Troubleshooting can also take the form of a systematic checklist, troubleshooting procedure, flowchart or table that is made before a problem occurs. Developing troubleshooting procedures in advance allows sufficient thought about the steps to take in troubleshooting and organizing the troubleshooting into the most efficient troubleshooting process. Troubleshooting tables can be computerized to make them more efficient for users.

Some computerized troubleshooting services (such as Primefax, later renamed Maxserve), immediately show the top 10 solutions with the highest probability of fixing the underlying problem. The technician can either answer additional questions to advance through the troubleshooting procedure, each step narrowing the list of solutions, or immediately implement the solution he feels will fix the problem. These services give a rebate if the technician takes an additional step after the problem is solved: report back the solution that actually fixed the problem. The computer uses these reports to update its estimates of which solutions have the highest probability of fixing that particular set of symptoms.

No comments:

Post a Comment