Wednesday, December 27, 2017

Byzantine Fault Tolerance (BFT) in Computers


In fault-tolerant computer systems, and in particular distributed computing systems, Byzantine fault tolerance (BFT) is the characteristic of a system that tolerates the class of failures known as the Byzantine Generals' Problem, which is a generalized version of the Two Generals' Problem – for which there is an unsolvability proof. The phrases interactive consistency or source congruency have been used to refer to Byzantine fault tolerance, particularly among the members of some early implementation teams. It is also referred to as error avalanche, Byzantine agreement problem, Byzantine generals problem and Byzantine failure.

Byzantine failures are considered the most general and most difficult class of failures among the failure modes. The so-called fail-stop failure mode occupies the simplest end of the spectrum. Whereas fail-stop failure model simply means that the only way to fail is a node crash, detected by other nodes, Byzantine failures imply no restrictions, which means that the failed node can generate arbitrary data, pretending to be a correct one. Thus, Byzantine failures can confuse failure detection systems, which makes fault tolerance difficult.

Definition and Background

A Byzantine fault is any fault presenting different symptoms to different observers. A Byzantine failure is the loss of a system service due to a Byzantine fault in systems that require consensus.

The objective of Byzantine fault tolerance is to be able to defend against Byzantine failures, in which components of a system fail with symptoms that prevent some components of the system from reaching agreement among themselves, where such agreement is needed for the correct operation of the system. Correctly functioning components of a Byzantine fault tolerant system will be able to provide the system's service, assuming there are not too many faulty components.

The terms fault and failure are used here according to the standard definitions originally created by a joint committee on "Fundamental Concepts and Terminology" formed by the IEEE Computer Society's Technical Committee on Dependable Computing and Fault-Tolerance and IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance. A version of these definitions is also described in the Dependability Wikipedia page.

Byzantine Generals’ Problem

Byzantine refers to the Byzantine Generals' Problem, an agreement problem (described by Leslie Lamport, Robert Shostak and Marshall Pease in their 1982 paper, "The Byzantine Generals Problem") in which a group of generals, each commanding a portion of the Byzantine army, encircle a city. These generals wish to formulate a plan for attacking the city. In its simplest form, the generals must only decide whether to attack or retreat. Some generals may prefer to attack, while others prefer to retreat. The important thing is that every general agrees on a common decision, for a halfhearted attack by a few generals would become a rout and be worse than a coordinated attack or a coordinated retreat.

The problem is complicated by the presence of traitorous generals who may not only cast a vote for a suboptimal strategy, they may do so selectively. For instance, if nine generals are voting, four of whom support attacking while four others are in favor of retreat, the ninth general may send a vote of retreat to those generals in favor of retreat, and a vote of attack to the rest. Those who received a retreat vote from the ninth general will retreat, while the rest will attack (which may not go well for the attackers). The problem is complicated further by the generals being physically separated and having to send their votes via messengers who may fail to deliver votes or may forge false votes.

Byzantine fault tolerance can be achieved if the loyal (non-faulty) generals have a majority agreement on their strategy. Note that there can be a default vote value given to missing messages. For example, missing messages can be given the value <Null>. Further, if the agreement is that the <Null> votes are in the majority, a pre-assigned default strategy can be used (e.g., retreat).

The typical mapping of this story onto computer systems is that the computers are the generals and their digital communication system links are the messengers. Although the problem is formulated in the analogy as a decision-making and security problem, in electronics, it cannot be solved simply by cryptographic digital signatures, because failures like incorrect voltages can simply propagate through the encryption process. Thus, a component may appear functioning to one component and faulty to another, which prevents forming a consensus if the component is faulty or not.

Examples of Byzantine Failures

Several examples of Byzantine failures that have occurred are given in two equivalent journal papers. These and other examples are described on the NASA DASHlink web pages. These web pages also describe some phenomenology that can cause Byzantine faults.

Byzantine errors were observed infrequently and at irregular points during endurance testing for the then-newly constructed Virginia class submarines, at least through 2005 (when the issues were publicly reported).

A similar problem faces honeybee swarms. They have to find a new home, and the many scouts and wider participants have to reach consensus about which of perhaps several candidate homes to fly to. And then they all have to fly there, with their queen. The bees' approach works reliably, but when researchers offer two hives, equally attractive by all the criteria bees apply, catastrophe ensues, the swarm breaks up, and all the bees die.

Byzantine Fault Tolerance in Practice

One example of BFT in use is bitcoin, a peer-to-peer digital currency system. The bitcoin network works in parallel to generate a chain of Hashcash style proof-of-work. The proof-of-work chain is the key to overcome Byzantine failures and to reach a coherent global view of the system state.

Some aircraft systems, such as the Boeing 777 Aircraft Information Management System (via its ARINC 659 SAFEbus network), the Boeing 777 flight control system, and the Boeing 787 flight control systems, use Byzantine fault tolerance. Because these are real-time systems, their Byzantine fault tolerance solutions must have very low latency. For example, SAFEbus can achieve Byzantine fault tolerance with on the order of a microsecond of added latency.

Some spacecraft such as the SpaceX Dragon flight system consider Byzantine fault tolerance in their design.

Byzantine fault tolerance mechanisms use components that repeat an incoming message (or just its signature) to other recipients of that incoming message. All these mechanisms make the assumption that the act of repeating a message blocks the propagation of Byzantine symptoms. For systems that have a high degree of safety or security criticality, these assumptions must be proven to be true to an acceptable level of fault coverage. When providing proof through testing, one difficulty is creating a sufficiently wide range of signals with Byzantine symptoms. Such testing likely will require specialized fault injectors.

                          https://en.wikipedia.org/wiki/Byzantine_fault_tolerance

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

Footnote by the Blog Author

Wirlds’ Hashgraph program has a new alternative solution to the Byzantine problem.  This software is described in detail in the December 23, 2017 blog entry.

No comments:

Post a Comment