Implementing DMHA via CompactPCI packet-switched backplanes

CompactPCI/PSB has the inherent ability to construct high-availability systems

** Reproduced from Integrated Communication Design Online November 1, 2001

By Chris Roos and Hank Heneghan

To reduce overall cost, developers often supplant fully redundant but expensive 2N critical-resource systems by building high-availability (HA) systems using N+1 redundancy. The 99.999% availability target can be met by following the PCI Industrial Computer Manufacturers Group (PICMG) 2.1 specification, using CompactPCI "hot swappable" plug-in peripheral boards. But development and integration time for such centralized systems is needlessly elongated, and some critical aspects are inherently limited or difficult to manage.
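To put the 99.999% figure in concrete terms, the short sketch below converts an availability target into a yearly downtime budget. The function name and the use of an average 365.25-day year are illustrative assumptions, not part of any PICMG specification.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, leap days included (assumption)


def allowed_downtime_minutes(availability: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


# "Five nines" leaves a little over five minutes of downtime per year.
five_nines_minutes = allowed_downtime_minutes(0.99999)
print(f"{five_nines_minutes:.2f} minutes per year")
```

A budget this small is why the time needed to detect, locate and replace a failed board matters so much in the sections that follow.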

There is now a significantly faster way to build more easily managed N+1 HA systems that still leverage the popular CompactPCI form factor and avoid the cost of a 2N system. The new embedded systems development model uses a distributed managed high-availability (DMHA) architecture and the CompactPCI/packet-switching backplane (CompactPCI/PSB) transport method. Based on the emerging PICMG 2.16 specification, it moves system traffic from CompactPCI's shared-bus architecture to a packet-based, fault-tolerant, embedded switched 10/100/1,000-Mbit/sec Ethernet network.
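The cost difference between the two redundancy schemes comes largely from how many boards each one needs. The sketch below counts boards only; the function name and the eight-board example are hypothetical, and real cost also depends on chassis, software and integration effort.

```python
def boards_required(active: int, scheme: str) -> int:
    """Boards needed to protect `active` working boards under a redundancy scheme."""
    if active < 1:
        raise ValueError("at least one active board is required")
    if scheme == "2N":
        return 2 * active   # every active board has its own dedicated standby
    if scheme == "N+1":
        return active + 1   # one spare is shared by all active boards
    raise ValueError(f"unknown scheme: {scheme}")


ACTIVE_BOARDS = 8  # hypothetical system size
comparison = {s: boards_required(ACTIVE_BOARDS, s) for s in ("2N", "N+1")}
print(comparison)
```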

Fault-Management Process

Although "hybrid" systems can be built using both the traditional PICMG 2.1 and new PICMG 2.16 specs, a pure DMHA approach has advantages that become clear by comparing fault management. Both achieve 99.999% availability, but system integration time is greatly reduced with DMHA, while several critical aspects of N+1 system management are simplified or eliminated. With a redundant star architecture, several factors critical to overall system availability are immediately addressed, with the rest readily dealt with using standard networking protocols already deployed in enterprise LANs.

While it varies from application to application, there is a basic framework for all systems. It involves the following six processes essential to ensuring N+1 fault management.

Detecting Faults

The ability to detect a fault and isolate it to a specific component is essential. In a classic N+1 system, there are no redundant resources to fail over to, as in the 2N redundant model, so quickly identifying a failure and initiating the recovery process is crucial. In a centrally managed bus-based PICMG 2.1 HA system, the CompactPCI bus and the radial HEALTHY# signal are the means of detecting a fault. A PICMG 2.1-compliant board may assert HEALTHY# when its backend power is good but its backend processor is malfunctioning, thus requiring a CompactPCI bus diagnostic to identify the faulty unit.

[Figure: Fault-tolerant configuration]

With PICMG 2.16 DMHA-based N+1 systems, the ability to detect a fault with fine granularity is inherent. Redundant star topology provides a direct link between each fabric and node in the system. Either or both of the dedicated links to each fabric are available for each node to use for reporting faults to any or all system hosts in (or connected to) the embedded-system area network (ESAN). Not having a shared bus to negotiate allows quicker identification of a failure and initiation of the recovery process. The redundant fabrics and links ensure that a node fault can be detected and communicated even during a fabric fault, which is impossible in a bus-based architecture.
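One common way to detect faults over dedicated links is a periodic heartbeat with a timeout. The sketch below assumes that approach; the article does not prescribe it, and the class name, fabric labels "A" and "B", and status strings are all hypothetical. Timestamps are passed in explicitly so the logic can be exercised without a network.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

FABRICS = ("A", "B")  # hypothetical labels for the two redundant fabrics


@dataclass
class HeartbeatMonitor:
    timeout_s: float
    # (node, fabric) -> time the last heartbeat arrived over that link
    last_seen: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def record(self, node: str, fabric: str, now: float) -> None:
        """Note that a heartbeat from `node` arrived via `fabric` at time `now`."""
        self.last_seen[(node, fabric)] = now

    def status(self, node: str, now: float) -> str:
        """Classify a node by how many of its links delivered a recent heartbeat."""
        fresh = [
            f for f in FABRICS
            if now - self.last_seen.get((node, f), float("-inf")) <= self.timeout_s
        ]
        if len(fresh) == len(FABRICS):
            return "healthy"
        if fresh:
            return "degraded"  # still reachable, but one link or fabric is silent
        return "failed"        # silent on every link


monitor = HeartbeatMonitor(timeout_s=1.0)
monitor.record("node3", "A", now=0.0)
monitor.record("node3", "B", now=0.0)
monitor.record("node3", "B", now=2.0)      # fabric A has gone quiet
print(monitor.status("node3", now=2.5))    # reachable through fabric B only
```

Because a node counts as failed only when it is silent on both links, losing one fabric does not hide or fake a node fault.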

Locating Failed Devices

Locating the failed device while the remainder of the system remains fully available is essential to the overall uptime rating. In the PICMG 2.1 HA system model, a hot-swap controller (HSC), which may or may not reside on the system host, monitors the radial HEALTHY# line while a system host runs HA system software to monitor the usability or "health" of various resources from a software perspective. Because the bus is a shared resource, communication between devices other than the HSC and system host must cease for the system host to determine which device is faulty.

[Figure: Distributed management high-availability alternatives in a PICMG 2.16 chassis]

Using the DMHA model, where the fabric-to-node connections are point-to-point, the link reporting a fault identifies the node that is faulty. Because the point-to-point link is not a shared resource for other connections in the ESAN, the rest of the system is fully available, enjoying normal operating bandwidth when a node failure occurs. This maximizes overall uptime rating.
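Because each link serves exactly one slot, locating a fault reduces to a lookup rather than a bus scan. The sketch below takes the set of down links seen by each fabric and separates fabric faults from node faults. The function name, data layout and slot numbers are assumptions made for illustration.

```python
from typing import Dict, Set


def locate_faults(down_links: Dict[str, Set[int]], slots: Set[int]) -> Dict[str, set]:
    """Split link-down reports into failed fabrics and failed node slots.

    down_links maps a fabric name to the slots whose link to that fabric is down.
    """
    # A fabric that has lost every one of its links is treated as the faulty part.
    failed_fabrics = {f for f, down in down_links.items() if slots and down >= slots}
    working = [f for f in down_links if f not in failed_fabrics]
    # A node is suspect only if it is unreachable through every working fabric.
    failed_nodes = {
        slot for slot in slots
        if working and all(slot in down_links[f] for f in working)
    }
    return {"fabrics": failed_fabrics, "nodes": failed_nodes}


SLOTS = {1, 2, 3}  # hypothetical node slots
report = locate_faults({"A": {1, 2, 3}, "B": {2}}, SLOTS)
print(report)  # fabric A is down, and slot 2 is also unreachable via fabric B
```

No traffic between the other slots has to stop while this decision is made, which is the contrast with the shared-bus case described above.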

Isolating Faulty Devices

Isolating the faulty device is critical, since the user must be able to remove the failed unit without disturbing the remainder of the system. In PICMG 2.1, such isolation is typically accomplished by the system host quiescing the faulty board's device driver and the HSC either resetting the faulty card or de-asserting BD_SEL# to the slot. The net result is a high-impedance, isolated bus interface. Failure to fully quiesce the resources of a board before extraction can crash the system host, bringing the entire system down.

With DMHA, isolating the faulty device is guaranteed by the physical interface used. All switched LANs allow the removal and addition of links at will, both for new and replacement nodes, without disturbing the remainder of the system. In the case of PICMG 2.16 node boards, no system-host drivers need quiescing, and there are no system-critical shared resources requiring management. In the case of PICMG 2.16 fabrics, the protocols running on the ESAN (such as TCP/IP) provide for reliable data transfer.

A bus-based N+1 system has interdependence between components that must be dealt with before a failed unit can be removed from the system. Again, there are typically software drivers on a system host that need to be quiesced before a faulty card can be extracted. Peer-to-peer communication across the CompactPCI bus, such as intelligent I/O (I2O), further complicates that issue because a central "host" is not always the agent that must ensure suspension.

With DMHA there are no interdependencies between components that must be dealt with before a failed unit can be removed from the system. Again, since there are no software drivers on a system host that need to be quiesced, a faulty card can be extracted without disturbance. Peer-to-peer communication across the fabric continues while the faulty card is "hot-replaced." Designers can also choose to have the fabric re-route network traffic to alternate node resources while the repairs or upgrades are being performed.
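Re-routing traffic to alternate node resources can be pictured as a dispatcher that skips any node taken out of service. The following is a minimal sketch under that assumption; the class, its method names and the round-robin policy are hypothetical and are not drawn from PICMG 2.16.

```python
from typing import Iterable, List, Set


class NodePool:
    """Round-robin dispatcher that skips nodes currently out of service."""

    def __init__(self, nodes: Iterable[str]):
        self._nodes: List[str] = list(nodes)
        self._in_service: Set[str] = set(self._nodes)
        self._next = 0  # index of the next candidate node

    def take_out_of_service(self, node: str) -> None:
        self._in_service.discard(node)

    def return_to_service(self, node: str) -> None:
        if node in self._nodes:
            self._in_service.add(node)

    def pick(self) -> str:
        """Return the next in-service node, or raise if none is available."""
        for _ in range(len(self._nodes)):
            node = self._nodes[self._next]
            self._next = (self._next + 1) % len(self._nodes)
            if node in self._in_service:
                return node
        raise RuntimeError("no node is in service")


pool = NodePool(["node1", "node2", "node3"])
pool.take_out_of_service("node2")            # board is being hot-replaced
during_repair = [pool.pick() for _ in range(4)]
pool.return_to_service("node2")              # replacement board is online
print(during_repair)
```

The other nodes keep receiving work throughout, and the replaced board rejoins the rotation as soon as it is returned to service.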

Forced Error States

In many bus-based N+1 systems, a failure on one unit can force others into error states, each of which must be cleared before normal operation can proceed. Before a faulty card can be extracted, remnants of its system-host drivers that may have failed to quiesce have to be removed (the faulty unit not allowing a clean software disconnect) or the system may be at risk for further errors. Again, peer-to-peer communication across the shared CompactPCI bus, such as I2O supports, further complicates this issue.

With DMHA, a failure on one unit can be made transparent to other units by selecting the proper standard protocols. This means that no error states propagate through the ESAN and normal operation returns as soon as a replacement for the faulty card comes online. Unlike the bus-based approach, a faulty card can be extracted without leaving remnants of its system-host drivers, since it has none. Drivers that fail to quiesce also cease to be a problem, so the system is not at risk for further errors because errant packets are rejected by the 802.3 media access control (MAC) layer protocols. Regardless of the size, complexity or tightness of coupling, peer-to-peer or client-server communication across the ESAN will continue in the event of a single failure. With proper system architecture, hot swapping a field-replaceable unit (FRU) can be a routine task, regardless of the proximal cause for the change out or the FRU being replaced.
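One of the checks an 802.3 MAC performs is the frame check sequence (FCS), a CRC-32 appended to each frame; a frame whose FCS does not match is discarded. The sketch below illustrates that single check in software, using zlib.crc32 (which implements the same CRC-32) and hypothetical helper names. Real MACs do this in hardware and apply other checks as well.

```python
import struct
import zlib


def add_fcs(frame: bytes) -> bytes:
    """Append a CRC-32 frame check sequence to the frame contents."""
    return frame + struct.pack("<I", zlib.crc32(frame) & 0xFFFFFFFF)


def fcs_ok(frame_with_fcs: bytes) -> bool:
    """Return True only if the trailing four bytes match the CRC-32 of the rest."""
    if len(frame_with_fcs) < 4:
        return False
    body, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF) == fcs


good = add_fcs(b"payload from a healthy node")
corrupted = bytes([good[0] ^ 0x01]) + good[1:]  # flip one bit, as a failing board might

print(fcs_ok(good), fcs_ok(corrupted))
```

A receiver that drops the corrupted frame never hands it to higher layers, so the damage stays local to the failing sender.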

Error Reporting

Error reporting with detail down to the FRU level is required in the PICMG 2.1 HA system model so an operator knows which individual unit has failed and needs replacement. Removing the wrong unit from a live bus-based system, particularly a unit that was not properly quiesced, can have severe and often catastrophic system impact, including crashed system host drivers. Even removing a failed unit carries certain risks. There may be local resources on the board being removed that, although faulty, need proper sequencing so that removal does not cause damage. The LED On/Off (LOO) bit in the HS_CSR is the mechanism for reporting to the operator which unit (board) is to be removed or replaced. This is done by illuminating the blue LED. While turning on the LED is a simple action, developing the tightly coupled system software that understands the status of each of its boards is a significant task.

Error reporting with granularity down to the FRU level is inherent in the DMHA architecture. Because removal of a node board does not perturb any other connectivity in the ESAN, the board itself can control its own operator-notification LED, illuminating it to indicate that removal is permitted. The only node resources that must be quiesced are resources local to the board that could be damaged in the event of an untimely removal (e.g., local databases, operating systems and other resources that might be irreparably damaged by non-orderly power removal). Since the board itself is responsible for managing its own resources, only local control is needed for proper operator notification. Further, removing the wrong unit from a live, redundant star fabric-based system has no system-level impact. If an operator were to "surprise extract" a unit that was not properly quiesced, only that local resource would be affected. System availability would not be compromised, although the board itself may not be reusable.
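The board-local sequence described above (quiesce local resources, then signal that removal is permitted) can be sketched as follows. The class, its attribute names and the two example quiesce steps are hypothetical; on real hardware the LED would be driven through a board-specific register rather than a Python attribute.

```python
from typing import Callable, List


class NodeBoard:
    """Models a node board that decides locally when it is safe to be removed."""

    def __init__(self, quiesce_steps: List[Callable[[], None]]):
        self._quiesce_steps = quiesce_steps
        self.removal_led_on = False  # stands in for the operator-notification LED

    def request_removal(self) -> bool:
        """Quiesce every local resource in order; light the LED only if all succeed."""
        try:
            for step in self._quiesce_steps:
                step()
        except Exception:
            self.removal_led_on = False  # not safe: leave the LED off
            return False
        self.removal_led_on = True
        return True


log: List[str] = []
board = NodeBoard([
    lambda: log.append("flush local database"),   # hypothetical local resource
    lambda: log.append("halt operating system"),  # hypothetical local resource
])
board.request_removal()
print(log, board.removal_led_on)
```

Nothing outside the board takes part in this decision, which is what removes the need for tightly coupled system-host software.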

In conclusion, the CompactPCI/PSB specification defines a redundant star topology based on CompactPCI packet interconnect of "node" and "fabric" boards via dual Ethernet links. It supports a full 21-plus slot chassis as a single ESAN and is extendable as a "virtual backplane" to an unlimited number of slots. With no single point of failure, it has the inherent ability to construct HA systems that can do more, are more readily managed, and can be delivered to market much faster than bus-based systems.

Chris Roos is a hardware engineer and Hank Heneghan is director of field services at Performance Technologies (Rochester, NY).

 



© 2008 Performance Technologies, Inc. All Rights Reserved.