RS Classic: Single Point Failures

>> Friday, July 16, 2010

Written in 2008.

On NASAWatch, they noted loss of a single component brought down the entire NASA email system (this just after they were down for nearly a week because the server goes through JSC only, so the hurricane [Ike] affected all NASA mail). There are few things, technologically, more irksome, more certain to fail, and more embarrassing when they do so, than single point failures.

Failure tolerance is a good thing. It’s the best defense there is against Murphy’s law. At one time, NASA strived to meet the principle of Fail Operational, Fail Safe. That meant that any single failure did not affect function (or at least not critical function). A second failure might leave the hardware nonfunctional, but it would leave it safe - if the function is critical, of course, nonfunctional is not safe. Even more, it ensured that there were at least three failures required before reaching an unsafe condition.

Now, of course, that isn’t always possible. A nuclear reactor containment vessel is unlikely to be redundant. Extra wings on a plane in case one shears off are likely less than useful. Sometimes adding redundancy and additional paths adds so much complexity, weight, etc. that the design becomes ridiculously unwieldy.

But, in general, it’s a good thing. If 12 bolts can bear all the stress necessary, adding one or two other bolts allows for failure without compromising the overall capability. (This is especially important in space endeavors, where a stripped or broken bolt may not be recoverable.) If one can accommodate redundancy, especially with unreliable equipment, it’s smart engineering practice to do so. This is even more true when hazards can result from failure.
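To put rough numbers on the bolt example, here is a minimal sketch in Python. It assumes each bolt fails independently with the same probability, and it ignores how load redistributes when a bolt is lost; real joints are not that tidy. The function name `joint_survival` and the 1% per-bolt failure chance are made up for illustration, not taken from any standard.

```python
from math import comb


def joint_survival(n_installed, n_required, p_fail):
    """Probability that at least n_required of n_installed bolts survive.

    Assumes independent failures with equal probability p_fail per bolt
    (an illustrative simplification, not a structural analysis).
    """
    p_ok = 1.0 - p_fail
    # Sum the binomial probabilities of exactly k survivors, for every
    # k that still leaves enough bolts to carry the load.
    return sum(
        comb(n_installed, k) * p_ok**k * p_fail ** (n_installed - k)
        for k in range(n_required, n_installed + 1)
    )


# Hypothetical 1% chance that any given bolt strips or breaks.
for n in (12, 13, 14):
    print(n, "bolts installed:", joint_survival(n, 12, 0.01))
```

With exactly 12 bolts, every one of them is a single point failure; each spare bolt means one more failure can happen before the joint drops below what it needs.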

So, if a complex or key system fails because of a single component or a readily foreseeable circumstance, that argues poor engineering. Bad in real life…

But it’s damn useful in science fiction. You want to make an interesting story, twist the plot, add hardship, shake things up a bit? Have a critical component fail. You want to add tension, pressure, up the excitement? Make your critical item one of a kind, no spares, hard to come by, made of something scarce and/or requiring interaction with someone hostile to recover it. Do so, and suddenly things are hot and exciting.

Just remember not to use this little trick too often because, if you do, your engineers will look like total morons or it will become cliché. Not that it isn’t a little cliché already thanks to Star Trek and the like. But it’s still good for excitement if you can exercise a little restraint.

Having said that, I'm not a personal fan of changing the two fault tolerance (fail operational/fail safe) philosophy to a more amorphous fault tolerance highly dependent on probabilistic risk assessment (PRA) and design to minimum risk (which cannot be applied to all hardware). NPR 8705.2B, the standard for human rating required by NASA, used to dictate 2 fault tolerance for catastrophic failures. By requiring that, vendors and designers were required to provide redundancy or justify each instance where it was not provided [paragraph 3.1.1 in Rev A, 3.2.2 in Rev B]. In the version released in May 2008, this was undone. Rev B is the current version.

Old version: "Space systems shall be designed so that no two failures result in crew or passenger fatality or permanent disability." [Note that there were caveats in the next requirement for exceptions, but the exceptions had to be brought to management]

New version: "The space system shall provide failure tolerance to catastrophic events (minimum of one failure tolerant), with the specific level of failure tolerance (one, two or more) and implementation (similar or dissimilar redundancy) derived from an integrated design and safety analysis (per the requirement in paragraph 2.3.7.1)(Requirement). Failure of primary structure, structural failure of pressure vessel walls, and failure of pressurized lines are excepted from the failure tolerance requirement provided the potentially catastrophic failures are controlled through a defined process in which approved standards and margins are implemented that account for the absence of failure tolerance. Other potentially catastrophic hazards that cannot be controlled using failure tolerance are excepted from the failure tolerance requirements with concurrence from the Technical Authorities provided the hazards are controlled through a defined process in which approved standards and margins are implemented that account for the absence of failure tolerance. "

I'm not objecting because PRA isn't useful (it is) but because I've seen a great deal of misuse of good models with bad assumptions or limited understanding of the uncertainties and caveats. Not necessarily by the analysts but by those feeding them data and those wanting "an answer" without taking the trouble to understand what that answer really means. And, I have to say, changing a straightforward requirement into a morass of legalese, well, it just gives this old safety engineer the willies. Requirements should be clear and verifiable.
