
Notes on "Engineering a safer world" - Part I

Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

“An accident is an unplanned and undesired loss event” — Nancy G. Leveson

Traditional safety engineering practices inherited from the industrial age have typically focused on component failure, operator fault and linear cause-and-effect relations. While of some value, these factors are insufficient to provide a satisfactory model to reason about accidents, particularly in modern software-intensive systems. STAMP (Systems-Theoretic Accident Model and Processes)[1] offers an alternative approach. It posits that safety is first and foremost an emergent property of the system (in the socio-technical sense), and changes the emphasis from preventing failures to enforcing behavioral safety constraints. In essence, safety is formulated as a control problem rather than a reliability problem.

This article provides a short, but hopefully useful, summary of why STAMP is relevant, its foundational assumptions, theoretical underpinnings, and key constructs. By examining both the limitations of traditional approaches and the advantages of systems-based thinking, we'll explore how STAMP offers a more effective framework for modern safety engineering.

The issue with traditional safety engineering models

The nature of the systems we are building is changing, and the value of traditional safety engineering models, based on establishing a chain of events from a perceived "root cause" to an accident, is limited. This model was adequate for old-school electro-mechanical systems, directly observable by operators and governed by physical constraints that limited the complexity of system design, construction and modification. However, today we are building increasingly ambitious systems composed of multiple interacting components, with software playing a pivotal role, especially in intermediating how operators sense and understand system state.

Several factors make traditional safety models increasingly inadequate:

  • Fast pace of technological change: New technologies emerge before we fully understand their safety implications. What we are seeing with the adoption of LLMs is a great example of this;
  • Reduced learning cycles: Faster time-to-market means less time to learn from experience and understand the potential behaviors and risks of a given technology;
  • Increasing complexity and coupling: Modern systems have more interdependencies and hidden interactions;
  • Changing nature of accidents: More incidents now result from problematic component interactions rather than simple component failures;
  • Human-automation complexity: The relationship between human operators and automated systems introduces new challenges[2];
  • Competing priorities: The difficulty in balancing conflicting goals (e.g. financial, ESG, regulatory) in an accelerating technological landscape.

The foundations for a new model to reason about accidents

The existing paradigm no longer provides the right framework for reasoning about accidents in modern systems, and by extension for aiding the design of safe systems. STAMP is a relatively new model that challenges the foundations of traditional safety engineering practices and builds up from there. The following table summarizes this shift in key assumptions[3]:

| Old assumption | New assumption |
| --- | --- |
| Increasing system component reliability increases safety. | Reliability is neither necessary nor sufficient for safety. |
| Accidents are caused by chains of related events. | Accidents are complex processes involving the entire sociotechnical system. Traditional event chain models cannot adequately describe this complexity. |
| Probabilistic Risk Analysis (PRA) based on event chains is the best way to assess and communicate safety and risk information. | Risk and safety may be best understood and communicated in ways other than PRA. |
| Most accidents are caused by operator error. | Operator error is a product of the environment in which it occurs. To reduce operator "error," we must change the environment in which the operator works. |
| Highly reliable software is safe. | Highly reliable software is not necessarily safe. Increasing software reliability will only have minimal impact on safety. |
| Major accidents occur from the chance simultaneous occurrence of random events. | Systems will tend to migrate toward states of higher risk[4][5]. Such migration is predictable and can be prevented by appropriate system design or detected during operations using leading indicators of increasing risk. |
| Assigning blame is necessary to learn from and prevent accidents. | Blame is the enemy of safety. The focus should be on understanding how the system behavior as a whole contributed to the loss and not who or what is to blame for it.[6] |

Goals of a new accident model

Having established that the foundational assumptions of traditional safety engineering are questionable, and that alternative framings are possible, it is important to understand the goals of the new model:

  • Expand the range of considered factors in accident analysis: While operator error and component failure are common proximal events, they are often just the tip of the iceberg. Other factors, including societal, regulatory, and cultural ones, should also be included.
  • Provide a more scientific way to model accidents: Event-chain models are highly dependent on the subjective choice of which events "matter." A new model should provide a better framework to capture both the factors involved and the adaptations that led to the accident.
  • Include system design errors and dysfunctional system interactions: Event-chain models often fail to capture dysfunctional interactions between system components.
  • Allow for new types of hazard analyses: Risk assessments that go beyond component failures and can address the complex interplay of software and humans.
  • Shift the focus on humans in accidents: Move from viewing humans as sources of normative deviations to understanding the mechanisms and factors that shape human behavior.
  • Focus on understanding the reasons behind accidents: Look beyond just the events and errors that lead to the supposed "cause" of the accident (which is often limiting or arbitrary).
  • Clearly separate facts from interpretation: Recognize that multiple valid viewpoints from different actors may exist and are valuable to capture, while distinguishing these from factual data.
  • Assist in defining operational metrics and analyzing performance data: Provide concrete tools for ongoing system evaluation.

STAMP

The theoretical foundation that allows STAMP to achieve these goals is Systems Theory.

Systems Theory emerged around World War II as a response to the limitations that classical analysis techniques encountered when developing cutting-edge systems of that era. While analytic reduction is an extremely powerful technique and indeed forms the foundation of the scientific method, it assumes that a system's behavior can be understood by analyzing each component independently. This approach assumes that components are not subject to feedback loops or nonlinear interactions, and that the principles governing the assembly of components into a whole are straightforward.

Such reductionist approaches prove insufficient for modern "high-tech" systems that exhibit emergent properties resulting from component interactions. These systems feature hierarchical structures that impose constraints on system behavior, and rely on communication and control mechanisms among interrelated components to maintain a state of dynamic equilibrium.

Based on this foundation, STAMP frames safety as a control problem, that is, controlling the behavior of the system by enforcing safety constraints in its design and operation. Note that this approach is much broader than the typical "component reliability" perspective, as component behavior and interactions may be controlled through technical means (e.g., redundant mechanisms), design choices, processes and procedures, and social controls (e.g., management oversight). The overarching goal is to design a control structure that imposes safety constraints on the system as a whole.

The three basic constructs of STAMP are:

  • Safety Constraints: These are the system-level safety requirements that the system must respect. Once identified, they should be translated into a set of appropriate controls to enforce them.

  • Hierarchical Controls: Emergent behaviors cannot be controlled at the component level, and therefore there is a need for a hierarchy of control processes that enforce safety constraints at the appropriate levels (which includes both social and technical elements). These controls act as adaptive feedback mechanisms and ultimately should govern the system in such a way that failures or undefined states are avoided.

  • Process Models: These describe the relationship between process variables, the current state of the process, and how the process can change state. The implementation of the process model can vary, from programmed logic in a controller to a mental model that a human operator maintains.

STAMP therefore frames accidents as failures by the control processes to enforce safety constraints, leading to events that culminate in an accident.
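
To make these constructs more concrete, below is a minimal, illustrative sketch in Python (all names are invented for illustration; this is not code from the book or any STAMP tooling). It models a single control loop: a controller enforces the safety constraint "tank level must stay below a maximum" through control actions chosen against its process model. The loss scenario is not a component that breaks in isolation, but a feedback channel that degrades the process model until the controller's beliefs diverge from the actual process state, so the constraint is never enforced.

```python
from dataclasses import dataclass

# The controlled process: the "real world" the controller acts upon.
@dataclass
class Tank:
    level: float  # actual liquid level

    def inflow(self, amount: float) -> None:
        self.level += amount

# Process model: the controller's *beliefs* about the process state.
# In STAMP terms this could equally be a human operator's mental model.
@dataclass
class ProcessModel:
    believed_level: float

    def update(self, sensor_reading: float) -> None:
        # Feedback channel: beliefs are only as good as the measurements.
        self.believed_level = sensor_reading

# Controller: enforces the safety constraint through control actions.
class Controller:
    MAX_SAFE_LEVEL = 100.0  # safety constraint: level must stay below this

    def __init__(self, model: ProcessModel) -> None:
        self.model = model

    def control_action(self) -> str:
        # Decisions are made against the process *model*, not the process.
        if self.model.believed_level >= self.MAX_SAFE_LEVEL:
            return "close_valve"
        return "keep_filling"

tank = Tank(level=95.0)
controller = Controller(ProcessModel(believed_level=95.0))

for _ in range(5):
    # Feedback failure: the sensor under-reports, so the process model
    # drifts away from the actual process state.
    controller.model.update(sensor_reading=50.0)
    if controller.control_action() == "keep_filling":
        tank.inflow(5.0)

# From the controller's point of view the constraint was never at risk,
# yet the actual process is in a hazardous state: the loss is explained
# as inadequate control (a stale process model), not a component failure.
print(f"believed: {controller.model.believed_level}, actual: {tank.level}")
assert tank.level > Controller.MAX_SAFE_LEVEL
```

In a full STAMP control structure, this controller would itself sit beneath higher-level controllers (procedures, management oversight, regulators), each with their own process models and feedback channels, and the same style of analysis applies at every level of the hierarchy.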

Conclusion

The Systems-Theoretic Accident Model and Processes (STAMP) addresses the limitations of traditional safety engineering approaches. By reframing safety as a control problem rather than merely a component reliability issue, STAMP provides a more comprehensive framework for understanding and preventing accidents in complex sociotechnical systems.

As our technological landscape continues to evolve toward increasingly complex and software-intensive systems, approaches like STAMP, which account for emergent properties and dynamic interactions, provide a very useful lens through which to design and reason about the systems we build and operate.


Footnotes

  1. Engineering a Safer World: Systems Thinking Applied to Safety

  2. The concept of the "line of representation" becomes quite useful in understanding a fundamental difference that software-intensive systems introduce. In these systems, operators cannot directly perceive the current state but must rely on intermediation through dashboards, logs, and other forms of output typically displayed on screens. This imposes significant cognitive effort on operators who must construct mental models of the current state and how the system functions. Importantly, it is very likely that different operators will construct different mental models based on the same representations. The seminal work introducing this concept is the paper Above the Line, Below the Line: The resilience of Internet-facing systems relies on what is above the line of representation.

  3. This is an abridged version of what is found in Table 2.1 of the book[1].

  4. In the presence of conflicting goals and finite resources, seemingly harmless decisions will surely but steadily set systems on course for failure. Drift Into Failure provides a very interesting account of this phenomenon.

  5. Figure 3 in Jens Rasmussen's Risk Management in a Dynamic Society: A modelling problem provides a very good visual representation of the economic, workload and system performance forces at play in a system. In a nutshell, economic and workload pressures push the system to operate at "higher" performance. As most changes do not result in accidents, this moves the system closer to operating outside its safety envelope, where accidents will occur. This in turn leads to corrective actions: adapting by operating the system at a "lower" performance, or expending resources to expand the perceived "safe" operational boundary of the system.

  6. While from a purely engineering perspective the focus should be on understanding how the system evolved into an unsafe state, we may never get away from assigning blame, as it remains an important element, for instance, when reasoning about sanctions and compensation for people impacted by an accident.