One particular aspect of the talk was the increasing relevance of availability in medical software. Much to my pleasure, this topic spawned several interesting discussions. This post summarizes the conclusions reached.
A major topic we often face when developing medical software is single-fault safety as defined in IEC 60601-1. According to the standard, there are two ways of implementing a function in a way that complies with this requirement (slightly simplified here):
- The function is complemented by a risk mitigation measure whose probability of failure is negligible.
- The risk mitigation measure may fail, but it is complemented by yet another measure whose probability of failure is negligible.
The existence of a fail-safe state eases matters dramatically. In general, it is considered sufficient if the device in question reacts to a failure as follows: The failure is detected by some monitoring measure, the system shuts down to the fail-safe state, and the device raises an alarm. The health professional observes the alarm and takes appropriate action.
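The reaction sequence above (detect, shut down, alarm) can be sketched as a minimal state machine. This is only an illustration of the pattern, not an implementation from any device or standard; all names are made up:

```python
from enum import Enum, auto

class DeviceState(Enum):
    OPERATING = auto()
    FAIL_SAFE = auto()

class Device:
    """Minimal sketch of the detect -> fail-safe -> alarm sequence."""

    def __init__(self) -> None:
        self.state = DeviceState.OPERATING
        self.alarm_raised = False

    def monitor(self, failure_detected: bool) -> None:
        # Step 1: some monitoring measure detects the failure.
        if failure_detected and self.state is DeviceState.OPERATING:
            # Step 2: the system shuts down to the fail-safe state.
            self.state = DeviceState.FAIL_SAFE
            # Step 3: the device raises an alarm so that the health
            # professional can observe it and take appropriate action.
            self.alarm_raised = True

device = Device()
device.monitor(failure_detected=True)
```

Note that the last step hands responsibility over to a human; the device itself only has to reach the fail-safe state and make the failure visible.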
As a consequence, we tend to stick to architectural patterns that adhere to this strategy when developing software for medical devices: two-channel systems that enter the fail-safe state as soon as one channel realizes that the other is dysfunctional, actuator/monitor pairs, program flow monitoring, self-tests and the like. Such approaches increase safety by increasing the total amount of monitoring in the system, but on the downside, they reduce availability because they rely on reversion to a fail-safe state whenever the monitoring believes a failure has occurred.
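Of the patterns listed above, the two-channel comparison is perhaps the easiest to show in miniature. The following sketch is hypothetical (the tolerance and values are invented): two channels compute the same quantity independently, and a disagreement beyond the tolerance is treated as a fault that would drive the system into the fail-safe state:

```python
def two_channel_check(channel_a: float, channel_b: float, tolerance: float) -> bool:
    """Return True if both channels agree within the tolerance.

    Each channel computes the same value independently; a disagreement
    beyond the tolerance means one channel considers the other
    dysfunctional, which triggers the fail-safe reaction.
    """
    return abs(channel_a - channel_b) <= tolerance

# Agreeing channels keep the device running...
assert two_channel_check(10.0, 10.1, tolerance=0.5)
# ...while a disagreement would send it to the fail-safe state.
assert not two_channel_check(10.0, 12.0, tolerance=0.5)
```

The same comparison structure underlies actuator/monitor pairs: one channel acts, the other only checks, but the decision logic is the same disagreement test.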
How so? The answer is rather simple: monitoring is not a black-and-white matter. On the contrary, the failure criteria employed in monitoring often depend on fuzzy conditions such as mechanical tolerances, electric thresholds, software timing et cetera. Adding more monitoring to a system therefore increases the risk of false alarms — unless additional effort is put into keeping the previous level of availability. In fact, this area of conflict between high safety, high availability and development effort is very similar to the infamous “iron triangle” of project management (where the conflicting items are scope, cost and time).
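One common way to spend that additional effort is debouncing: requiring a failure condition to persist over several consecutive samples before alarming, so that a single noisy reading does not needlessly force the fail-safe state. The sketch below is illustrative only; the limits and persistence count are invented, and choosing them for a real device is exactly the safety-versus-availability trade-off described above:

```python
class DebouncedMonitor:
    """Raise a fault only after `persistence` consecutive out-of-range
    samples, trading a slightly slower reaction for fewer false alarms."""

    def __init__(self, low: float, high: float, persistence: int) -> None:
        self.low, self.high = low, high
        self.persistence = persistence
        self._count = 0

    def sample(self, value: float) -> bool:
        if self.low <= value <= self.high:
            self._count = 0      # in range: reset the counter
        else:
            self._count += 1     # out of range: count towards a fault
        return self._count >= self.persistence

monitor = DebouncedMonitor(low=0.0, high=5.0, persistence=3)
# A single noisy spike does not trigger the fail-safe reaction...
assert monitor.sample(7.0) is False
assert monitor.sample(4.0) is False   # back in range, counter resets
# ...but a persistent violation does.
assert monitor.sample(7.0) is False
assert monitor.sample(7.0) is False
assert monitor.sample(7.0) is True
```

Raising the persistence count improves availability but delays the fail-safe reaction, which is why the parameter itself becomes a safety decision.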
The talk and the subsequent discussions highlighted two important points regarding availability:
- There are many domains beyond the “Medical” world in which one cannot simply revert to a fail-safe state. One particular example is the field of aviation — a field in which availability is a critical aspect of a system’s safety itself.
- We can already see a partial paradigm shift in the medical community: There is a new emphasis on availability for certain types of devices. In particular, this holds for infusion systems in the sense of IEC 60601-2-24 — systems for which low availability is interpreted as a lack of safety (the magic phrase here is “delay of therapy”).
Taken together, these two points lead directly to the following realization: organisations involved in the development of medical software should broaden their view and look into other domains where safety-critical systems cannot resort to a fail-safe state when issues arise.
On the positive side, there are many fields we can draw expertise from (avionics, automotive, space flight, rail, process automation and so on). And what’s more, these fields offer great material for reading. I recommend Robert S. Hanmer’s “Patterns for Fault Tolerant Software” (Wiley, 2007) as a starting point and David A. Mindell’s “Digital Apollo” (The MIT Press, 2011) for a historically exciting high-level view on systems that incorporate mechanics, electronics, software and human operators.