
Fault Tolerance and Triple Modular Redundancy (TMR)


John Catsoulis


Industrial electronics and avionics/flight computers are increasingly being implemented as softcore processors within FPGAs. Using an FPGA as the basis for control has a number of distinct advantages. Developing certified safety-critical hardware is very expensive. With a single FPGA-based hardware platform, the non-recurring engineering (NRE) costs may be shared across many products and versions; only the FPGA configuration, the sensors and the external interfaces are likely to be application specific. An FPGA allows the application to be redesigned many times without re-qualifying the hardware, and allows the hardware to be pre-qualified before the overall design and functionality are finalised. Because the design remains changeable, design errors can be corrected even during flight (in space), which can also reduce qualification time. New functionality can be implemented once the electronics are in use, and the existing resources can be put to better use by reconfiguring the computer for different roles within the application.

Often, electronics must operate under extreme and harsh conditions. This is especially true with industrial electronics, and spacecraft and aircraft avionics. Without due care to the design, such extremes can prove lethal. Further, the system must possess a design fidelity and reliability beyond that of conventional electronics systems. It must perform its task consistently and dependably as any failure to do so can have catastrophic results. The task of the system designer is to create a system sufficiently robust such that hazards are mitigated, or at the very least, reduced to within acceptable limits.


Fault tolerance is important in the control systems for mission-critical and safety-critical systems such as spacecraft avionics, fly-by-wire aircraft avionics, drive-by-wire automotive electronics, railway signalling systems, medical and life-support systems, and nuclear and chemical processing.


All fault-tolerant systems employ some form of redundancy. The redundancy may be time-based, physical or functional. A system employing time-based redundancy will mitigate the fault condition by either performing a cold or warm restart or by reattempting the errant operation. A physically redundant system will duplicate some or all of the hardware. The duplicate hardware may function as a backup system, or act as a fault-detection and/or correction system operating in parallel with the primary control system. A functionally redundant system employs some form of fault tolerance or fault mitigation as part of the design. Error correction codes are an example of a functionally redundant system.
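
As an illustration of functional redundancy, the sketch below implements a Hamming(7,4) code, which adds three parity bits to four data bits so that any single flipped bit can be located and corrected. The choice of Hamming(7,4), the function names and the bit ordering are assumptions made for this example; the article does not name a particular code.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the bad bit, or 0 if none.
    position = s1 + 2 * s2 + 4 * s3
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single-bit upset in storage
print(hamming74_decode(word))  # [1, 0, 1, 1]
```

The redundancy here lives in the data representation rather than in duplicated hardware or repeated execution.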


Time-based redundancy is often implemented as a self-testing system. In such an implementation, the system periodically performs consistency checks by evaluating the functionality of hardware, or determining that the system is in a known or particular state at or within a given time. A simple example of such a system is the watchdog timer.
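
A watchdog timer can be modelled in a few lines. The sketch below is a software simulation only: the class name, the tick-based timing and the `run` helper are hypothetical, and a real watchdog is normally a hardware counter that asserts the processor's reset line.

```python
class WatchdogTimer:
    """Simulated watchdog: requests a reset if it is not kicked in time."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.counter = 0
        self.resets = 0

    def kick(self):
        """Called by healthy software to show that it is still running."""
        self.counter = 0

    def tick(self):
        """Advance time by one tick; return True if a reset was triggered."""
        self.counter += 1
        if self.counter >= self.timeout_ticks:
            self.resets += 1   # stands in for a warm restart of the system
            self.counter = 0
            return True
        return False


def run(healthy_ticks, hung_ticks, timeout_ticks=3):
    """Run healthy software, then hung software; return the reset count."""
    wd = WatchdogTimer(timeout_ticks)
    for _ in range(healthy_ticks):
        wd.tick()
        wd.kick()              # healthy code kicks the watchdog every tick
    for _ in range(hung_ticks):
        wd.tick()              # hung code never kicks
    return wd.resets


print(run(healthy_ticks=10, hung_ticks=0))  # 0
print(run(healthy_ticks=10, hung_ticks=3))  # 1
```

Note that the reset is only issued after the timeout has elapsed, so the fault is live for that interval.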


Time-based redundancy is limited by its lack of real-time response, which is often important in mission-critical applications. Further, it can only deal with soft errors pertaining to state information; physical damage cannot be remedied by a time-based redundant system. Duplicate hardware, in which two units run in parallel, is able to determine when a fault has occurred, but is unable to identify in which unit the fault lies. It is therefore unable to mask the fault within the operational system. Typically, duplicate hardware is used in conjunction with other techniques.
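
The limitation of duplicate hardware can be seen in a minimal sketch of a two-unit comparator. The function name and return convention are assumptions for illustration.

```python
def compare_pair(primary, shadow):
    """Compare the outputs of two duplicated units.

    Returns (output, fault_detected). On a mismatch there is no way to tell
    which unit is wrong, so no output can be trusted and None is returned.
    """
    if primary == shadow:
        return primary, False
    return None, True


print(compare_pair(0x2A, 0x2A))  # (42, False)
print(compare_pair(0x2A, 0x2B))  # (None, True)
```

A mismatch tells the system that something has failed, but with only two opinions there is no majority to fall back on.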


Triple Modular Redundancy, or TMR, is based on the assumption that the probability of a fault generated by a single-event upset (SEU) is finite, small and localised. TMR triplicates the functional systems of the electronics; the output of each is passed through a voter circuit. TMR is commonly used in spacecraft flight avionics.
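
The benefit of triplication can be estimated with the standard textbook reliability expression for a 2-of-3 system. The symbol R (the probability that a single module is fault-free over the interval of interest) is introduced here for illustration, and the result assumes that module failures are independent and that the voter itself never fails.

```latex
R_{\mathrm{TMR}} = R^{3} + 3R^{2}(1 - R) = 3R^{2} - 2R^{3}
```

For example, with R = 0.99 this gives 3(0.9801) - 2(0.970299) = 0.999702, so the probability of an incorrect output falls from 0.01 to roughly 0.0003. The expression only exceeds R when R > 0.5, which is why the assumption of a small fault probability matters.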

In the case of an SEU-generated fault, the affected subsystem will produce an output that is contrary to that of the other two subsystems. In comparing the output of the subsystems, the voter circuit disregards the erroneous output (Figure 3-9). In a TMR implementation, each subsystem must produce an identical output when operating nominally. Thus, each subsystem must be functionally and algorithmically identical and in an identical state. Further, the inputs to each subsystem within a TMR implementation must be the same and synchronised.
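
The voter described above can be sketched as a bitwise 2-of-3 majority function. This is a software model under the assumption that each subsystem's output is an integer word of the same width; the function name is hypothetical, and in hardware the same logic would be implemented as gates on each output bit.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three equal-width output words.

    Each result bit takes the value held by at least two of the inputs,
    so a corrupted bit in any single input is outvoted.
    """
    return (a & b) | (a & c) | (b & c)


good = 0b1011
upset = good ^ 0b1000          # one subsystem suffers a single bit flip
print(bin(majority_vote(good, good, upset)))  # 0b1011
```

The voter only masks the fault if the two healthy subsystems really do agree, which is why identical state and synchronised inputs are required.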

On detection of an error, the faulty subsystem may be restarted and resynchronised with the other two subsystems. How a TMR system responds to an error varies from implementation to implementation. It is important that the SEU-induced fault is corrected, otherwise faults will accumulate over time and render the system non-functional.
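
One possible response is sketched below: the voter identifies the dissenting subsystem and overwrites its state with the agreed value. This is only one of many strategies, and the function name, the use of integers as subsystem state and the in-place repair are all assumptions made for the example.

```python
def vote_and_repair(states):
    """Vote across three subsystem states and resynchronise any dissenter.

    `states` is a list of three values and is repaired in place.
    Returns (voted_value, indices_of_repaired_subsystems).
    """
    a, b, c = states
    if a == b or a == c:
        voted = a
    elif b == c:
        voted = b
    else:
        # All three disagree: more than one fault has accumulated.
        raise RuntimeError("no majority among subsystems")
    faulty = [i for i, s in enumerate(states) if s != voted]
    for i in faulty:
        states[i] = voted      # stands in for restart and resynchronisation
    return voted, faulty


states = [7, 7, 9]
print(vote_and_repair(states))  # (7, [2])
print(states)                   # [7, 7, 7]
```

Without the repair step, a later upset in a second subsystem would leave no majority, which is the accumulation problem noted above.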


TMR is based on the assumption that the probability of a second error occurring before this restart is very small. For systems where this assumption is invalid, TMR is sometimes extended to Quad Modular Redundancy (QMR), where four subsystems drive a voter circuit. In this instance, when a fault is detected in one of the four units, there is still a TMR fault-tolerant system in place. Further, should two errors occur, they are extremely unlikely to produce the same error condition. In this instance, the output of the two faulty subsystems will be ignored and the voter will pass through the output of the two subsystems in agreement.
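
A four-input voter with this behaviour might look like the following sketch. The function name and the decision to raise an error when no unique winner exists are assumptions; a real QMR voter could instead signal a fault line or hand over to a backup.

```python
from collections import Counter


def qmr_vote(outputs):
    """Vote across four subsystem outputs.

    Passes through the value on which the largest group agrees, provided
    at least two subsystems hold it and no other value ties with it.
    """
    ranked = Counter(outputs).most_common()
    best_value, best_count = ranked[0]
    if best_count < 2:
        raise RuntimeError("no two subsystems agree")
    if len(ranked) > 1 and ranked[1][1] == best_count:
        raise RuntimeError("two pairs disagree: no unique winner")
    return best_value


print(qmr_vote([5, 5, 5, 9]))  # 5  (one fault: three still agree)
print(qmr_vote([5, 5, 8, 9]))  # 5  (two faults that differ from each other)
```

The second case relies on the two faulty outputs differing from each other; if they happened to match, the voter would see a two-against-two tie and could not choose.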


The Space Shuttle uses a variation of QMR for its flight avionics. The primary system is a QMR implementation, with a fifth backup system in place. In the unlikely event that three of the four subsystems in the QMR group disagree, all four are shut down and control is transferred to the fifth system.


TMR is conventionally implemented as three separate and functionally identical avionics computers feeding a voting circuit. This configuration I term coarse-grained TMR.

Recently, a number of spacecraft flight computers have been implemented using three processors separated from a common system bus (and the memory and I/O) by a voting circuit. This is a variation of a shared-memory, MIMD parallel computing architecture adapted into a fault-tolerant TMR scheme. I term this configuration medium-grained TMR.

Usually, both coarse-grained and medium-grained TMR place a considerable implementation burden on the system designer. However, Embedded Ltd has significant experience in implementing TMR-based hardware solutions. A further form of fault tolerance, known as “software synchronisation”, is easy to implement on our fault-tolerant hardware and may be used alongside TMR techniques for an additional layer of error immunity.


Absent from the literature is what I term fine-grained TMR, whereby TMR is implemented within the processor itself. From the system designer’s viewpoint, the device is a conventional microprocessor with a conventional bus interface. Internal to the processor, functional units such as the ALU, instruction fetch and decode, and so on are triplicated. Internal voter circuits isolate localised errors from other functional units within the processor. The development of such a fine-grained TMR processor, and its evaluation against other forms of TMR, appear to be an open avenue of research.
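
The idea can be illustrated with a toy software model of a single triplicated functional unit. Everything here is hypothetical: the 8-bit ALU, its three operations and the `upset_masks` parameter used to inject bit flips are invented for the example and do not describe any existing processor.

```python
def alu(op, x, y, width=8):
    """A toy ALU operating on `width`-bit words."""
    mask = (1 << width) - 1
    if op == "add":
        return (x + y) & mask
    if op == "and":
        return x & y
    if op == "xor":
        return x ^ y
    raise ValueError("unknown operation: " + op)


def tmr_alu(op, x, y, upset_masks=(0, 0, 0)):
    """Three ALU copies feeding an internal bitwise majority voter.

    `upset_masks` XORs a bit pattern into each copy's result to simulate
    an upset local to that copy.
    """
    a, b, c = (alu(op, x, y) ^ m for m in upset_masks)
    return (a & b) | (a & c) | (b & c)


print(tmr_alu("add", 200, 100))                 # 44
print(tmr_alu("add", 200, 100, (0, 0b100, 0)))  # 44
```

Callers see the same interface as a plain ALU, which mirrors the point that the triplication is invisible from outside the processor.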

Books

Designing Embedded Hardware steers a course between those books dedicated to writing code for particular microprocessors, and those that stress the philosophy of embedded system design without providing any practical information. Loaded with real examples, this book also provides a roadmap to the pitfalls and traps to avoid. If you want to build your own embedded system, or tweak an existing one, this invaluable book gives you the understanding and practical skills you need.