Implementing DMHA via CompactPCI packet-switched backplanes
CompactPCI/PSB has the inherent ability to construct high-availability
systems
** Reproduced from Integrated Communication Design Online November
1, 2001
By Chris Roos and Hank Heneghan
To reduce overall cost, developers often supplant fully redundant but expensive
2N critical-resource systems by building high-availability (HA) systems using
N+1 redundancy. The 99.999% availability target can be met by following the PCI
Industrial Computer Manufacturers Group (PICMG) 2.1 specification, using CompactPCI
"hot swappable" plug-in peripheral boards. But development and integration
time for such centralized systems are needlessly elongated, and some critical
aspects are inherently limited or difficult to manage.
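To put the 99.999% target in concrete terms, the short calculation below converts
an availability fraction into a yearly downtime budget. It is only an arithmetic
sketch: the function name and the 365.25-day year are assumptions made for
illustration, not figures from any PICMG specification.

```python
# Assumption: an average year of 365.25 days is used for the budget.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Return the yearly downtime implied by an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


# "Five nines" leaves a budget of a little over five minutes per year.
budget = downtime_minutes_per_year(0.99999)
print(f"{budget:.2f} minutes of downtime per year")
```

Every fault-management step described in this article has to fit inside that
budget, which is why the time spent detecting and isolating a fault matters.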
There is now a significantly faster way to build more easily managed N+1 HA
systems that still leverage the popular CompactPCI form factor and avoid the
cost of a 2N system. The new embedded systems development model uses distributed
managed high-availability (DMHA) architecture and the CompactPCI/packet-switching
backplane transport method. Based on the emerging PICMG 2.16 specification,
it moves system traffic from CompactPCI's shared-bus architecture to a packet-based,
fault-tolerant, embedded switched 10/100/1,000-Mbit/sec Ethernet network.
Fault-Management Process
Although "hybrid" systems can be built using both the traditional
PICMG 2.1 and new PICMG 2.16 specs, a pure DMHA approach has advantages that
become clear by comparing fault management. Both achieve 99.999% availability,
but system integration time is greatly reduced with DMHA, while several critical
aspects of N+1 system management are simplified or eliminated. With a redundant
star architecture, several factors critical to overall system availability are
immediately addressed, with the rest readily dealt with using standard networking
protocols already deployed in enterprise LANs.
While it varies from application to application, there is a basic framework
for all systems. It involves the following six processes essential to ensuring
N+1 fault management.
Detecting Faults
The ability to detect a fault and isolate it to a specific component is essential.
In a classic N+1 system, there are no redundant resources to fail-over to, as
in the 2N redundant model, so quickly identifying a failure and initiating the
recovery process is crucial. In a centrally managed bus-based PICMG 2.1 HA
system, the CompactPCI bus and the radial HEALTHY# signal are the means of detecting
a fault. A PICMG 2.1-compliant board may assert HEALTHY# when its backend power
is good but its backend processor is malfunctioning, thus requiring a CompactPCI
bus diagnostic to identify the faulty unit.

Fault-tolerant configuration
With PICMG 2.16 DMHA-based N+1 systems, the ability to detect a fault with
fine granularity is inherent. Redundant star topology provides a direct link
between each fabric and node in the system. Either or both of the dedicated
links to each fabric are available for each node to use for reporting faults
to any or all system hosts in (or connected to) the embedded-system area network
(ESAN). Not having a shared bus to negotiate allows quicker identification of
a failure and initiation of the recovery process. The redundant fabrics and links
ensure simultaneous detection and communication of a node fault, even during a
fabric fault, which is impossible in a bus-based architecture.
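The detection logic described above can be sketched as a monitor that tracks
per-node heartbeats on each fabric. This is a minimal simulation, not part of
PICMG 2.16: the heartbeat mechanism, the one-second timeout, the fabric labels
"A" and "B", and every class and method name are assumptions chosen for
illustration.

```python
FABRICS = ("A", "B")       # assumption: two fabrics in the redundant star
HEARTBEAT_TIMEOUT_S = 1.0  # assumption: illustrative timeout only


class FaultMonitor:
    """Hypothetical host-side monitor fed by heartbeats from node boards."""

    def __init__(self, node_ids):
        # For each node, the time of the last heartbeat seen on each fabric.
        self.last_seen = {n: {} for n in node_ids}

    def heartbeat(self, node_id, fabric, now):
        self.last_seen[node_id][fabric] = now

    def link_alive(self, node_id, fabric, now):
        seen = self.last_seen[node_id].get(fabric)
        return seen is not None and now - seen <= HEARTBEAT_TIMEOUT_S

    def classify(self, now):
        """Separate suspected node faults from suspected fabric faults."""
        up = {f: [n for n in self.last_seen if self.link_alive(n, f, now)]
              for f in FABRICS}
        # A node is suspect when it is silent on every fabric.
        dead_nodes = [n for n in self.last_seen
                      if not any(n in up[f] for f in FABRICS)]
        # A fabric is suspect when nothing arrives through it while the
        # other fabric still carries heartbeats.
        dead_fabrics = [f for f in FABRICS
                        if not up[f] and any(up[g] for g in FABRICS if g != f)]
        return {"nodes": dead_nodes, "fabrics": dead_fabrics}


# Simulated scenario: fabric B and node 3 both fail after t = 0.
monitor = FaultMonitor([1, 2, 3])
for node in (1, 2, 3):
    for fabric in FABRICS:
        monitor.heartbeat(node, fabric, now=0.0)
for node in (1, 2):
    monitor.heartbeat(node, "A", now=2.0)  # only fabric A still delivers

result = monitor.classify(now=2.5)
print(result)
```

Because each node has its own link to each fabric, the node fault is still
attributed to a single board even though one fabric is down at the same time.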
Locating Failed Devices
Locating the failed device while the remainder of the system remains fully
available is essential to the overall uptime rating. In the PICMG 2.1 HA system
model, a hot-swap controller (HSC), which may or may not reside on the system
host, monitors the radial HEALTHY# line while a system host runs HA system software
to monitor the usability or "health" of various resources from a software
perspective. Because the bus is a shared resource, communication between devices
other than the HSC and system host must cease for the system host to determine
which device is faulty.

Distributed management high-availability alternatives in
a PICMG 2.16 chassis
Using the DMHA model, where the fabric-to-node connections are point-to-point,
the link reporting a fault belongs to the faulty node. Because the point-to-point
link is not a shared resource for other connections in the ESAN, the rest of
the system is fully available, enjoying normal operating bandwidth when a node
failure occurs. This maximizes overall uptime rating.
Isolating Faulty Devices
Isolating the faulty device is critical, since the user must be able to remove
the failed unit without disturbing the remainder of the system. In PICMG 2.1,
such isolation is typically accomplished by the system host quiescing the faulty
board's device driver and the HSC either resetting the faulty card or de-asserting
BD_SEL# to the slot. The net result is a high-impedance, isolated bus interface.
Failure to fully quiesce the resources of a board before extraction can result
in crashing of the system host, bringing the entire system down.
With DMHA, isolating the faulty device is guaranteed by the physical interface
used. All switched LANs allow the removal and addition of links at will, both
for new and replacement nodes, without disturbing the remainder of the system.
In the case of PICMG 2.16 Node boards, no system-host drivers need quiescing,
and there are no system-critical shared resources requiring management. In the
case of PICMG 2.16 fabrics, the protocols running on the ESAN (such as TCP/IP)
provide for reliable data transfer.
A bus-based N+1 system has interdependence between components that must be
dealt with before a failed unit can be removed from the system. Again, there
are typically software drivers on a system host that need to be quiesced before
a faulty card can be extracted. Peer-to-peer communication across the CompactPCI
bus, such as intelligent I/O (I2O), further complicates that issue
because a central "host" is not always the agent that must ensure
suspension.
With DMHA there are no interdependencies between components that must be dealt
with before a failed unit can be removed from the system. Again, since there
are no software drivers on a system host that need to be quiesced, a faulty
card can be extracted without disturbance. Peer-to-peer communication across
the fabric continues while the faulty card is "hot-replaced." Designers
can also choose to have the fabric re-route network traffic to alternate node
resources while the repairs or upgrades are being performed.
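One way to picture the rerouting option is a small dispatcher that substitutes
the spare node of an N+1 group while a failed board is being replaced. The
sketch below is an assumption-laden illustration: the `ServicePool` class, the
node names, and the modulo flow assignment are all hypothetical and do not
describe how any particular fabric board implements rerouting.

```python
class ServicePool:
    """Hypothetical N+1 group: N active nodes plus one spare."""

    def __init__(self, active, spare):
        self.active = list(active)
        self.spare = spare
        self.out_of_service = set()

    def mark_failed(self, node):
        self.out_of_service.add(node)

    def mark_replaced(self, node):
        self.out_of_service.discard(node)

    def route(self, flow_id):
        """Pick the serving node for a flow, using the spare if needed."""
        preferred = self.active[flow_id % len(self.active)]
        if preferred not in self.out_of_service:
            return preferred
        if self.spare not in self.out_of_service:
            return self.spare
        raise RuntimeError("no healthy node available for this flow")


pool = ServicePool(["node1", "node2", "node3"], spare="node4")
before = [pool.route(flow) for flow in range(3)]

pool.mark_failed("node2")      # node2 is pulled for repair
during = [pool.route(flow) for flow in range(3)]

pool.mark_replaced("node2")    # the replacement board comes online
after = [pool.route(flow) for flow in range(3)]

print(before, during, after)
```

Only the flows that were assigned to the failed node move; traffic for the other
nodes is untouched, which mirrors the lack of interdependence described above.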
Forced Error States
In many bus-based N+1 systems, a failure on one unit can force others into
error states, each of which must be cleared before normal operation can proceed.
Before a faulty card can be extracted, remnants of its system-host drivers that
may have failed to quiesce have to be removed (the faulty unit not allowing
a clean software disconnect) or the system may be at risk for further errors.
Again, peer-to-peer communication across the shared CompactPCI bus, such as
I2O supports, further complicates this issue.
With DMHA, a failure on one unit can be made transparent to other units by
selecting the proper standard protocols. This means that no error states propagate
through the ESAN and normal operation returns as soon as a replacement for the
faulty card comes online. Unlike the bus-based approach, a faulty card can be
extracted without leaving remnants of its system-host drivers, since it has
none. Drivers that fail to quiesce also cease to be a problem, so the system
is not at risk for further errors because errant packets are rejected by the
802.3 media access control (MAC) layer protocols. Regardless of the size, complexity
or tightness of coupling, peer-to-peer or client-server communication across
the ESAN will continue in the event of a single failure. With proper system
architecture, hot swapping a field-replaceable unit (FRU) can be a routine task,
regardless of the proximal cause for the change-out or the FRU being replaced.
Error Reporting
Error reporting with detail down to the FRU level is required in the PICMG
2.1 HA system model so an operator knows which individual unit has failed and
needs replacement. Removing the wrong unit from a live bus-based system, particularly
a unit that was not properly quiesced, can have severe and often catastrophic
system impact, including crashed system host drivers. Even removing a failed
unit carries certain risks. There may be local resources on the board being
removed that, although faulty, need proper sequencing before the board can be
removed without causing damage. The LED On/Off (LOO) bit in the HS_CSR is the
mechanism for reporting to the operator which unit (board) is to be removed or
replaced. This is done by illuminating the blue LED. While turning on the LED is a simple
action, developing the tightly coupled system software that understands the
status of each of its boards is a significant task.
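The LED control itself amounts to setting or clearing one bit in the HS_CSR, as
the sketch below shows. The bit masks are an assumption taken from the layout
used in the Linux kernel's PCI register definitions, and the helper names are
hypothetical. The register is modeled as a plain integer; a real driver would
access it through PCI configuration space and would also have to respect the
write semantics of the status bits, which this sketch ignores.

```python
# Assumed HS_CSR bit masks (layout as in Linux's pci_regs.h definitions).
HS_CSR_EIM = 0x02  # ENUM# interrupt mask
HS_CSR_LOO = 0x08  # LED On/Off: drives the blue hot-swap LED
HS_CSR_EXT = 0x40  # extraction pending
HS_CSR_INS = 0x80  # insertion pending


def set_blue_led(hs_csr: int, on: bool) -> int:
    """Return the register value with the LOO bit set or cleared."""
    if on:
        return hs_csr | HS_CSR_LOO
    return hs_csr & ~HS_CSR_LOO


def led_is_on(hs_csr: int) -> bool:
    return bool(hs_csr & HS_CSR_LOO)


# Example: a board with ENUM# masked and an extraction pending.
value = HS_CSR_EIM | HS_CSR_EXT
value = set_blue_led(value, True)
print(hex(value), led_is_on(value))
```

The hard part, as noted above, is not this bit operation but the system software
that decides when it is safe to perform it.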
Error reporting with granularity down to the FRU level is inherent in the DMHA
architecture. Because removal of a node board does not perturb any other connectivity
in the ESAN, the board itself can control its own operator-notification LED
illumination indicating removal is permitted. The only node resources that must
be quiesced are local resources to the board that could be damaged in the event
of an untimely removal (e.g., local databases or operating systems and other
resources that might be irreparably damaged by non-orderly power removal).
Since the board itself is responsible for managing its own resources, only
local control is needed for proper operator notification. Further, removing
the wrong unit from a live, redundant star fabric-based system has no system
level impact. If an operator were to "surprise extract" a unit that
was not properly quiesced, only that local resource would be affected. System
availability would not be compromised, although the board itself may not be
reusable.
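The node-local removal sequence can be sketched as follows: the board quiesces
its own resources and only then lights its operator-notification LED. The class
names, the resource names, and the reverse-order shutdown are assumptions for
illustration; the article does not prescribe a particular sequence.

```python
class LocalResource:
    """Hypothetical on-board resource that needs an orderly shutdown."""

    def __init__(self, name):
        self.name = name
        self.quiesced = False

    def quiesce(self):
        # A real board would flush data or halt a service here.
        self.quiesced = True


class NodeBoard:
    """Hypothetical node board that manages its own removal readiness."""

    def __init__(self, resources):
        self.resources = list(resources)  # listed in start-up order
        self.removal_led_on = False

    def request_removal(self):
        """Quiesce local resources, then signal that removal is permitted."""
        order = []
        # Assumption: shut down in the reverse of start-up order.
        for resource in reversed(self.resources):
            resource.quiesce()
            order.append(resource.name)
        # The LED is lit only after every local resource has been quiesced.
        self.removal_led_on = True
        return order


board = NodeBoard([LocalResource("operating system"),
                   LocalResource("local database")])
shutdown_order = board.request_removal()
print(shutdown_order, board.removal_led_on)
```

No system host takes part in this sequence, which is the point of the
distributed model: the decision to permit removal stays local to the board.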
In conclusion, the CompactPCI/PSB specification defines a redundant star topology
based on CompactPCI packet interconnect of "node" and "fabric"
boards via dual Ethernet links. It supports a full 21-plus slot chassis as a
single ESAN and is extendable as a "virtual backplane" to an unlimited
number of slots. With no single point of failure, it has the inherent ability
to construct HA systems that can do more, are more readily managed, and can
be delivered to market much faster than bus-based systems.
Chris Roos is a hardware engineer and Hank Heneghan is director of field services
at Performance Technologies (Rochester, NY).