Advances in the Prevention and Control of HAIs

Using Socio-Technical Probabilistic Risk Assessment (ST-PRA) to Assess Risk and Improve Patient Safety and Reliability in Health Care Systems


Anthony D. Slonim, Ebru Bish, Laura Steighner


Health care during hospitalization has become more complex. With this complexity comes additional patient risk that may lead to patient safety events. These events usually occur infrequently, but when they do, they can be catastrophic for patients. Common approaches for analyzing serious or sentinel events in health care include root cause analysis (RCA), which is retrospective in nature and has a number of limitations. Hospitals are also required to prospectively perform a failure modes and effects analysis (FMEA) annually on one high risk event. The FMEA process also is not without limitations. In this paper, we describe an approach to prospectively analyze potential serious and sentinel events by using a tool known as socio-technical probabilistic risk assessment (ST-PRA). This tool, used in several other industries, has seen some moderate success in health care for analyzing risks related to blood transfusions, medication errors, and infections. The major benefits of this tool, in contrast to other available tools, are that it is prospective, accounts for combinations of risk failures, and is both quantitative and qualitative in nature. Unfortunately, because of its complexity, the tool is still used primarily in the research domain and has not yet made its way into mainstream quality improvement efforts.


Socio-technical probabilistic risk assessment (ST-PRA) is a tool that combines risk estimates from the literature with experiential estimates from health care providers to estimate risks in rare health care outcomes. The tool examines single point failures and failure combinations, thereby allowing investigators to design interventions to reduce risks associated with the performance of process steps in a health care procedure. The tool is most useful for very rare, high-risk events and has been used in health care for a variety of problems ranging from blood transfusion infection risks to medication errors.1,2 For example, for blood transfusion infection risks, the tool not only helps to identify the risks with the highest likelihood of error, but also points to important efforts to mitigate those infection risks.3

This paper summarizes the model building steps of ST-PRA, including data collection, literature review, database analysis, the use of technical experts, building ST-PRA fault trees, sensitivity analyses, and designing an intervention. It concludes with a discussion of the strengths and limitations of ST-PRA modeling as a tool aimed at improving patient safety.


Probabilistic risk assessment (PRA) is an engineering tool developed in the 1970s to quantify risks and identify threats to the safety of nuclear power plants.4 Subsequently, it has been applied in settings ranging from aerospace to manufacturing to natural disasters.5,6 PRA is a systematic tool that prospectively identifies a system's risk points. It utilizes quantitative and qualitative data to "map" the risks associated with adverse outcomes. PRA is thus a hybrid between qualitative process analysis techniques and quantitative decision-support models.7 PRA involves a detailed deductive method that utilizes logical relationships and probability theory to construct a model ("fault tree") of how risk points interact with one another and either individually or collectively combine to contribute to the adverse outcome.

ST-PRA expands the basic PRA model by accounting for human performance.8 Most quality and patient safety work involves the interactions of people, systems, and technology, which are all accounted for by the ST-PRA methodology. The challenge with ST-PRA is determining the probabilities associated with breakdowns in human performance that contribute to adverse outcomes.

The process mapped by an ST-PRA model incorporates both internal and external process factors and can disentangle the impact of factors that are related to individuals from those that are related to institutions or the system. In this way, ST-PRA addresses what has previously been described as a major limitation of isolated database analyses where interactions of different-level processes occur simultaneously. To ensure that ST-PRA captures all possible process factors, it is important to use several data sources in building the process map. Next, we describe the data sources used for this purpose.

Data Sources

A variety of data types (i.e., quantitative or qualitative) and sources are used in the development of ST-PRA fault tree models. Each source informs the data collection effort for other sources in an iterative fashion: information gleaned during a literature review can inform the analysis of databases; similarly, information collected during site visits, technical expert interviews, or focus groups can inform additional data analyses and literature searches.

Literature Review

A literature review can assist in identifying the potential risk factors associated with a patient safety event. The literature also provides discrete probability estimates and ranges for inclusion in the models and sensitivity testing.

Readily available search engines and databases, including PubMed®, the Cochrane Library, and the Cumulative Index to Nursing and Allied Health Literature (CINAHL), can be used for data collection. As with any literature review, keyword search terms can assist in better understanding the literature on the topic under study. The intent of the literature review is to ensure that relevant work is incorporated into the risk models. Potential articles are reviewed for relevance. For example, an entire article will often be reviewed only to identify a single probability estimate for a specific content area (e.g., the compliance of health care providers with hand washing). The reference lists may also be reviewed to ensure that the review is as inclusive as possible. General inclusion and exclusion criteria for an article's inclusion in the literature review need to be established.

A grey literature review can also provide important information. This category includes Web-based presentations, articles, and white papers that can be accessed through Google, Google Scholar, and Bing. The project team and technical experts may provide additional sources of information. A targeted search of Web sites known for improvement efforts or standards of care can identify information on general or specific risks relevant to the model. Examples of potentially helpful organizations are the Agency for Healthcare Research and Quality (AHRQ), the Institute for Healthcare Improvement, the Robert Wood Johnson Foundation, the Centers for Medicare & Medicaid Services (CMS), and The Joint Commission.

All relevant literature should be entered into a database to assist with the creation of a bibliography and to support the specific risk estimates in model building.

Database Analysis

Discharge databases are important for studying health care problems in a variety of settings. The Healthcare Cost and Utilization Project (HCUP) series of datasets provides information on inpatient (National Inpatient Sample [NIS]), emergency department (State Emergency Department Databases [SEDD]), and ambulatory surgery (State Ambulatory Surgery Databases [SASD]) settings. Subsets of the HCUP datasets (e.g., Kids' Inpatient Database [KID] for children) can also provide population-specific data. Analyzing quantitative information from these datasets provides occurrence rates and risk probabilities for model development.
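As an illustration of how an occurrence rate and a plausible range might be derived from discharge counts, the following minimal sketch computes a point estimate and a Wilson score interval. The counts are hypothetical and are not drawn from HCUP; the choice of the Wilson interval is an assumption made for this example, not a method described by the authors.

```python
from math import sqrt

def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95 percent)."""
    p_hat = events / n
    denom = 1.0 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical counts: 12 adverse events among 4,000 discharges.
events, discharges = 12, 4000
rate = events / discharges                      # point estimate for a basic-level event
low, high = wilson_interval(events, discharges) # candidate range for sensitivity testing

print(f"rate={rate:.4f}, interval=({low:.4f}, {high:.4f})")
```

The interval bounds could serve as the lower and upper values when the corresponding basic-level event probability is later varied in a sensitivity analysis.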

Site Visits

The third data source involves site visits that allow the exploration of patient care and the identification of risks in actual practice settings. Site visits also provide an opportunity to determine the boundaries of the risk modeling exercise. Each site visit represents a different context where different processes, errors, and risks may be identified. Site visits should be conducted in locations where complementary information regarding the process can be gathered. A semi-structured protocol ensures consistency in data gathering across sites. Site visits may consist of several activities that serve to inform the socio-technical elements of the ST-PRA models. Our research team has found the following activities valuable for this purpose: a review and comparison of policies and procedures; informal exploratory interviews with a selection of six to eight staff from the participating setting; and a comparison of the process flow across sites, noting differences in policies and procedures, facility characteristics, and other relevant issues, as necessary.

Technical Expert Panel

A technical expert panel (TEP) can provide valuable information to guide the ST-PRA modeling. TEP members should represent expertise that ensures comprehensive coverage of the relevant issues. Often, only a few meetings are necessary, and specific guidance can be achieved through the review of documents. The first meeting should orient TEP members to the study's objectives and the ST-PRA methodology and gather feedback on the ST-PRA's focus. A second meeting can be used to review the draft fault tree model and solicit feedback on areas for improvement. A third and final TEP meeting can be used to (1) review the ST-PRA modeling results, (2) identify the highest risk basic-level events and event combinations (cut sets), and (3) inform the design of an intervention.

Developing the Fault Tree Model

A fault tree is a graphical depiction that conjoins risk estimates associated with a specific outcome of interest. The initial development of the fault tree incorporates information collected via the four major data sources previously described, to identify the risks associated with an adverse outcome. Iteratively, the model can be refined and revised to ensure that it has face validity with technical experts who understand the procedure under study. In this section, we detail the steps involved in developing a fault tree model, as depicted in Figure 1. For additional information on this topic, the reader is referred to a detailed clinical example of an ST-PRA for blood product infections,1 which highlights how the top-level event and contributing factors are organized.

Figure 1. Steps involved in developing the fault tree model

Figure 1 depicts a series of interconnected circles that show the six steps necessary to develop a fault tree model: (1) identify all factors contributing to the outcome of interest, (2) identify the dependencies and interactions among the risk points, (3) validate the fault tree model, (4) identify the likelihood of the basic-level events in the fault tree, (5) conduct sensitivity analyses of the fault tree model, and (6) develop a risk-informed intervention.

Step 1: Identify All Factors Contributing to the Outcome of Interest

After determining the outcome of interest (also known as the "top-level event"), the first step involves identifying the risks that contribute most to the outcome. The objective is to identify a comprehensive list of variables (also known as "basic-level events") that contribute risk within the model and potentially lead to the outcome.

An initial list of basic-level events is created based on the major risk factors recognized in the literature. This list can be augmented by studying the process maps developed from the site visits and identifying points of failure in the processes (e.g., communication failure between health care professionals). Finally, based on discussions with TEP members, basic-level events can be added or deleted from the list, as appropriate. When additional basic-level events are considered for inclusion in the fault tree, a targeted literature review using these basic-level events as key search terms should be conducted to provide additional support for their inclusion.

Step 2: Identify the Dependencies and Interactions Among the Risk Points

The research team must consider how the basic-level events are connected to one another and how they result in the top-level event. For many of these basic-level events, this process is straightforward. For example, it is clear that contamination of surgical equipment contributes additional risk at the basic-event level, which can lead to a higher frequency of surgical site infections.

The research team uses an approach that incorporates the risk point estimates by considering each of the basic-level events along different components of the process under study. Creating a logic model and isolating the basic-level events in each part of the process improves the face validity and overall interpretability of the model for the relevant stakeholders. It is also a useful method for incorporating the data gathered from the site visits, where the patient's care is detailed in independent process flow maps. These flow maps allow the team to visually inspect the numerous interactions of the process across providers and care settings, making what may be very complex interactions more easily understood in the model under study.

As a preliminary step, specific parameters are established to guide the development of the model's framework and the relationships among the risk points. Parameters allow fault tree designers to home in on a top-level event and the numerous characteristics that may contribute to the outcome of interest. For example, investigators may consider limiting the project's scope to specific parts of the process or time limits (e.g., 30 days after hospital discharge).

Once the model's scope is appropriately defined, the relationships (i.e., dependencies and interactions) among the multiple risk points are studied to understand their contribution to the outcome. This is where clinical judgment, the results of the database analysis, the site visit process maps, and the input from the TEP are critical. Using these inputs, the identification of multiple connections and combinations associated with the occurrence of the outcome can be determined. For example, a patient-level factor (e.g., diabetes) is identified from the literature, a staff-level factor (e.g., wearing artificial nails) is specified in a hospital policy, and an organization-level factor (e.g., preoperative screening) is identified by the technical experts. These connections may be further enhanced by targeting additional literature searches to patient-level and staff-level factors that have been studied by other researchers.

The relationship between the basic-level events and the top-level event is established next. The fault tree uses "gates" to demonstrate the logic for joining all the basic-level events into an organized model that contributes to the outcome. The two major types of gates are "AND" gates (i.e., the output event occurs if all the input events connected to the AND gate occur) and "OR" gates (i.e., the output event occurs if at least one of the input events connected to the OR gate occurs). In combination, the basic-level events, modeled in the fault tree along with the AND gates and OR gates, produce a descriptive, hierarchical flow diagram of the process and the outcome under investigation. For the AND gates, the probabilities of the input events are multiplied together (which assumes that the input events are independent); for the OR gates, the probabilities of the input events are added together, with the probability of their joint occurrence subtracted so that cases in which both failures occur are not counted twice. Figures 2 and 3 present examples of AND and OR gates, respectively.
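The gate arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the input events are independent; the function names and the probabilities for Failures A and B are hypothetical and are not taken from the figures.

```python
from math import prod

def and_gate(probs):
    """Probability that all input events occur, assuming independent inputs."""
    return prod(probs)

def or_gate(probs):
    """Probability that at least one input event occurs, assuming independent inputs.

    For two inputs this equals p_a + p_b - p_a * p_b, i.e. the sum with the
    overlap subtracted.
    """
    return 1.0 - prod(1.0 - p for p in probs)

# Hypothetical probabilities for Failure A and Failure B.
p_a, p_b = 0.10, 0.20

p_and = and_gate([p_a, p_b])  # 0.10 * 0.20
p_or = or_gate([p_a, p_b])    # 0.10 + 0.20 - 0.10 * 0.20

print(f"AND gate: {p_and:.2f}, OR gate: {p_or:.2f}")
```

Writing the OR gate as one minus the probability that no input occurs generalizes the two-input formula to any number of independent inputs.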

Step 3: Validate the Fault Tree Model

Once the model is developed, the TEP reviews it and provides feedback on the connectivity and logic of the basic-level events with the top-level event. The model is then revised based on this feedback. To address specific questions that need further clarification, focused interviews and additional literature searches can be used. The goal of this validation step is to confirm that the logical relationships built into the fault tree are representative of the system and processes under study.

Figure 2. Example of an AND gate

Figure 2 represents the method of combining basic-level events, known as Failures A and B, depicted in the figure as circles with specific probabilities. An “AND” gate indicates that both Failure A and Failure B must occur for the undesirable event to occur. The “AND” gate uses the basic-level probabilities for the Failures to determine the probability of the undesirable outcome.

Figure 3. Example of an OR gate

Figure 3 represents the method of combining basic-level events, known as Failures A and B, depicted in the figure as circles with specific probabilities. An “OR” gate indicates that the undesirable outcome occurs if either Failure A or Failure B occurs. The “OR” gate uses the basic-level probabilities for the Failures to determine the probability of the undesirable outcome.

Step 4: Identify the Likelihood of the Basic-Level Events in the Fault Tree

The assignment of probabilities to each basic-level event in the fault tree occurs next. If available, information from the literature review provides a starting point for probability estimates of the basic-level events. When gaps exist, additional and more focused literature reviews or interviews may be necessary to estimate these probabilities. When technical experts' estimates are relied on, these estimates are targeted in the study's subsequent sensitivity testing.

Once probabilities are assigned to the basic-level events, the fault tree is modeled using Relex™ (Relex, Inc., Voronezh, Russia), a software package that calculates the remaining probability estimates for all intermediate and top-level events, using the logical relationships previously specified. This process leads to a probability estimate for the top-level event and the major risk points in the process (also known as cut sets) that are developed as the next step of the study.
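To make the calculation concrete, the following sketch propagates basic-level probabilities up a small fault tree to the top-level event. It is not the algorithm used by the fault tree software; the tree structure, event names, and probabilities are hypothetical. The recursion is exact only under the assumptions that basic-level events are independent and that each appears once in the tree.

```python
from math import prod

# Hypothetical tree: top-level event = OR(AND(A, B), C).
# A gate is a (gate_type, children) tuple; a basic-level event is a string.
TREE = ("OR", [("AND", ["A", "B"]), "C"])
BASIC = {"A": 0.10, "B": 0.20, "C": 0.05}  # hypothetical probabilities

def evaluate(node, basic):
    """Return the probability of a node, assuming independent, non-repeated events."""
    if isinstance(node, str):
        return basic[node]
    gate, children = node
    probs = [evaluate(child, basic) for child in children]
    if gate == "AND":
        return prod(probs)
    if gate == "OR":
        return 1.0 - prod(1.0 - p for p in probs)
    raise ValueError(f"unknown gate type: {gate}")

intermediate = evaluate(TREE[1][0], BASIC)  # the AND(A, B) intermediate event
top = evaluate(TREE, BASIC)                 # the top-level event

print(f"intermediate={intermediate:.3f}, top={top:.3f}")
```

When the same basic-level event feeds several gates, this simple recursion double counts it, which is one reason dedicated software works from cut sets instead.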

Step 5: Conduct Sensitivity Analyses of the Fault Tree Model

Because some of the model's probabilities are based on imprecise information from the databases, highly variable literature, or technical expert estimates, the use of sensitivity analyses can improve the model's reliability. The sensitivity analysis can be considered a series of grounded "what if" tests to study the robustness of the ST-PRA model. These analyses begin by examining the "base case" and then varying the basic-level event probabilities across a range of values to determine whether the combinations of the major events cause a change in the likelihood of the top-level event. These analyses involve identifying the minimal cut sets, defined in the next section, for the base case and for variations of the base case (obtained by modifying the probabilities) to study the robustness of the fault tree model. This process enables the identification of an intervention with the greatest likelihood of mitigating the risk of the top-level event.

Minimal cut sets. Cut sets are unique event combinations that lead to the occurrence of the top-level event. A cut set is considered a "minimal cut set" if, when any basic-level event is removed from the set, the remaining events are collectively no longer a cut set; that is, a minimal cut set is defined as a critical path through multiple failure points. By identifying the different cut sets associated with an event, the model can be reconsidered after removing specific failure points or system components as a result of implementing an intervention or series of interventions designed to reduce the rate of occurrence of the top-level event. The minimal cut sets are identified through the software, using the underlying logic depicted in the AND/OR gates. The software then combines basic-level event probabilities to identify the paths, based on the conditional probabilities of event combinations. The minimal cut sets with the highest risk for the top-level event are then listed in descending order of priority.
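The idea of a minimal cut set can be illustrated with a short sketch that expands the AND/OR logic into cut sets, discards any cut set that contains a smaller one, and ranks the remainder. The tree, event names, and probabilities are hypothetical, events are assumed independent, and this top-down expansion is a simple stand-in rather than the method implemented in the software.

```python
from itertools import product
from math import prod

# Hypothetical tree: top-level event = AND(OR(A, B), OR(A, C)).
TREE = ("AND", [("OR", ["A", "B"]), ("OR", ["A", "C"])])
BASIC = {"A": 0.01, "B": 0.10, "C": 0.20}  # hypothetical probabilities

def cut_sets(node):
    """Expand a tree into cut sets (each a frozenset of basic-level events)."""
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":   # any child's cut set is enough
        return [s for sets in child_sets for s in sets]
    if gate == "AND":  # need one cut set from every child
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unknown gate type: {gate}")

def minimal(sets):
    """Keep only cut sets that have no proper subset among the other cut sets."""
    unique = set(sets)
    return [s for s in unique if not any(other < s for other in unique)]

def probability(cut_set):
    """Probability that every event in the cut set occurs (independence assumed)."""
    return prod(BASIC[event] for event in cut_set)

ranked = sorted(minimal(cut_sets(TREE)), key=probability, reverse=True)
for cs in ranked:
    print(sorted(cs), f"{probability(cs):.3f}")
```

Here the expansion yields {A}, {A, C}, {A, B}, and {B, C}; the two sets containing A alongside another event are not minimal because {A} alone is already a cut set.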

Sensitivity analysis. A sensitivity analysis focuses on events with large probability variations and varies these probabilities in the base case within the ranges suggested by the literature. When a probability estimate is unavailable, an anchor estimate can be obtained from technical experts. For example, questions that arise from a process failure with relevance to pediatric patients can be referred to specific TEP members with expertise and professional experience in pediatrics. This estimate can then be considered the anchor estimate for the sensitivity analysis, which examines the range of intervals from 25 percent to 75 percent around the provided probability estimate.

For example, hand washing is a common approach to prevent the spread of bacteria and would be expected to have a positive impact on preventing healthcare-associated infections. The literature indicates that the hand-washing compliance rates for non-operating-room (OR) staff range between 40 percent and 90 percent. The hand-washing compliance rates for the OR staff are consistently higher and less variable, approximately 75 percent to 90 percent. In the sensitivity analysis, the conditional probability for non-OR hand-washing compliance can be varied across the range of 40 percent to 90 percent to understand the impact of hand washing on mitigating the occurrence of a healthcare-associated infection. Sensitivity analyses help to ensure that the model's conclusions hold even if basic-level event probabilities were grossly inaccurate at the beginning of the modeling exercise. If the same contributors are identified after the sensitivity analyses, confidence in the model's integrity increases; if the contributors are found to be different, the model can be reworked.
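A sensitivity sweep of this kind can be sketched as follows. Only the 40 percent to 90 percent compliance range comes from the text; the two-branch tree and the other probabilities (`p_contaminated`, `p_other`) are hypothetical placeholders chosen for illustration, and events are assumed independent.

```python
def top_event_probability(compliance, p_contaminated=0.30, p_other=0.01):
    """Top-level probability for a hypothetical tree: OR(AND(no hand washing, contamination), other route)."""
    p_no_hand_washing = 1.0 - compliance
    p_hand_route = p_no_hand_washing * p_contaminated    # AND gate
    return 1.0 - (1.0 - p_hand_route) * (1.0 - p_other)  # OR gate

# Vary non-OR hand-washing compliance across the range reported in the literature.
compliance_range = [0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
results = {c: top_event_probability(c) for c in compliance_range}

for compliance, p_top in results.items():
    print(f"compliance={compliance:.0%}  top-level probability={p_top:.4f}")
```

In a full analysis, the minimal cut sets would be recomputed at each point of the sweep to check whether the leading contributors stay the same.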

The fault tree model is examined for each variation of the base case where the outcome of the top-level event is modified. The top five minimal cut sets can be run to understand how and if they change beyond the base case.

Developing a Risk-Informed Intervention

One of the most important goals is to develop an intervention with the greatest likelihood of mitigating the risk of the outcome under consideration. There are three major steps to developing a risk-informed intervention: (1) conduct criticality analyses to inform the selection of an intervention, (2) identify the target event(s) for the intervention, and (3) design interventions to mitigate the risks associated with the target event(s).

Conduct a Criticality Analysis

Importance measures rank the most significant risks based on their contribution to the top-level event as a means of improving system performance. These measures help to assess the risk's criticality by its absolute risk, its relative importance within the model, or its frequency in the model. Commonly used relative importance measures include the criticality, Birnbaum, and Fussell-Vesely measures. These measures anchor an individual risk estimate within the context of the model's other risks. For example, the Birnbaum measure ranks the risks based on the relative contribution of individual component failures in a system; the Fussell-Vesely measure is a linear indicator of risk that accounts for the fractional contribution of a risk element to the total system for all scenarios under study based on the failure of an individual component. Alternatively, the criticality measure is a measure of absolute risk, which identifies the independent risk contribution of a basic-level event. For example, assuming that the top-level event occurs, the criticality of basic-level event A is the probability that the top-level event is a result of basic-level event A, thereby indicating the fundamental components of a system's liability. The importance measure selected depends, in part, on the type of model and the purpose of the modeling exercise.
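The following sketch computes the three measures for a small hypothetical tree using one common textbook formulation: Birnbaum as the change in top-level probability when an event goes from never occurring to always occurring, criticality as the Birnbaum value scaled by the event's probability and the top-level probability, and Fussell-Vesely as the share of the top-level probability attributable to minimal cut sets containing the event. These definitions, the tree, and the numbers are assumptions for illustration; a given software package may define the measures differently.

```python
from math import prod

# Hypothetical tree: top-level event = OR(AND(A, B), C), independent events.
BASIC = {"A": 0.10, "B": 0.20, "C": 0.05}
MIN_CUT_SETS = [{"A", "B"}, {"C"}]

def top_probability(p):
    """Top-level probability for the hypothetical tree above."""
    return 1.0 - (1.0 - p["A"] * p["B"]) * (1.0 - p["C"])

def with_value(p, event, value):
    """Copy of the probability table with one event fixed to a given value."""
    q = dict(p)
    q[event] = value
    return q

p_top = top_probability(BASIC)

# Birnbaum: P(top | event certain) - P(top | event impossible).
birnbaum = {
    e: top_probability(with_value(BASIC, e, 1.0)) - top_probability(with_value(BASIC, e, 0.0))
    for e in BASIC
}

# Criticality: Birnbaum weighted by the event's own probability, relative to P(top).
criticality = {e: birnbaum[e] * BASIC[e] / p_top for e in BASIC}

def fussell_vesely(event):
    """Approximate share of P(top) from minimal cut sets that contain the event."""
    sets = [cs for cs in MIN_CUT_SETS if event in cs]
    p_any = 1.0 - prod(1.0 - prod(BASIC[e] for e in cs) for cs in sets)
    return p_any / p_top

fv = {e: fussell_vesely(e) for e in BASIC}

for e in BASIC:
    print(e, f"Birnbaum={birnbaum[e]:.3f}", f"criticality={criticality[e]:.3f}", f"FV={fv[e]:.3f}")
```

In this example event C ranks highest on all three measures because it is a single point failure, whereas A and B matter only in combination.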

Identify Event(s) Targeted for Intervention

The criticality analysis provides a foundation for understanding the basic-level events with the highest probability of contributing to the top-level event. However, the real power of ST-PRA stems from the event combinations and probabilities that identify critical paths leading to the occurrence of the top-level event. Using both the criticality analysis and the cut sets to select the intervention increases the likelihood that the selected intervention will have the greatest impact on reducing the top-level event.

To be successful, it is also important to consider the ease of implementation, the likelihood of achieving substantive improvement based on the intervention, and the level of effort necessary to effectively implement the intervention within an existing system. As with other quality improvement efforts, the most feasible intervention is the one that is easiest to implement, has the greatest likelihood of yielding an impact, and consumes the fewest resources.

Design the Intervention

Based on the model's results, an intervention aimed at a specific event and the major components of the cut set is developed. When designing an intervention, it is important to look for opportunities where the intervention can be hardwired into the care system. Such an intervention should focus on aspects that the provider can control (as opposed to the patients' compliance), should be integrated into the care process, and should include redundant steps to minimize single point failures.

During the design, the investigative team considers both the results from the sensitivity analyses and information gleaned through the site visits. These results tend to identify major processes within the system under study that can impact how care is provided. The model can be used and revised in real time, depending on the impact achieved with the proposed intervention on reducing the likelihood of the top-level event.


Strengths shared between the more traditional PRA and the ST-PRA methodology include the following features:

  • Provides a broad perspective, including contextual elements such as operating procedures, system, and human factors, to the risk model.
  • Is proactive, identifying the possible adverse events before they actually occur, thereby enabling the decisionmaker to introduce targeted interventions for preventing these events from occurring.
  • Models complex interactions and dependencies among the multiple risk points that may lead to the adverse outcome, using logical relationships and Bayesian probabilities.
  • Allows the uncertainty associated with error rate estimates to be incorporated into the model through sensitivity analysis.
  • Allows an assessment of risk and a prioritization of risk reduction interventions based on sequences that have the highest probability of occurrence, providing a roadmap of targeted interventions.
  • Is dynamic in that PRA (and ST-PRA) can incorporate new estimates of probability as they are available.

The incremental value of the ST-PRA methodology lies in its capacity to consider both individual contributors of risk and unique combinations of risks that contribute to the adverse outcome, by incorporating both quantitative and qualitative data into the models. This modeling process creates a real world experience, which can be tested using the sensitivity analysis to ensure the scientific integrity of the tool and the ultimate results. Finally, the ST-PRA model also serves as a living document that can be modified as new risk information is acquired either through direct observation or through improved methods for studying the environment.

Despite these important strengths, notable limitations should be acknowledged. First, the quantitative estimates from datasets are limited because these data often fail to include the more granular risk estimates important for creating the risk models. As improvements in the SASD, NIS, and SEDD datasets occur, additional information regarding the care context will allow further refinement of this research. Second, the lack of integrated data systems that link patients between care contexts, such as between the emergency department and inpatient settings, significantly limits the ability to inform the model with real risk estimates across transition points of care. Finally, a common criticism of any modeling exercise such as ST-PRA is that the model will not be a "real world" representation of the process, but instead will involve some simplifying assumptions. Nonetheless, a careful use of quantitative estimates from the literature and the modeling of the in vivo process flows contribute to a realistic understanding of the system under study. When combined with the sensitivity analyses, which test whether the risk estimates and conclusions are supported across a range of values, the modeling effort can be made more robust.


The use of ST-PRA as a modeling tool to identify patient safety risks in a variety of contexts is an important method for advancing the understanding of risk and reliability in health care. The models can be refined as new information becomes available and as improvements in care are realized through interventions. Additional effort to make the ST-PRA methodology more accessible for use outside the research domain is a critical next step.

Although ST-PRA adds value over other existing risk assessment tools such as RCA and FMEA, the current fault tree software, Relex™, is difficult to use and not well understood by health care quality improvement teams. Until probability and fault tree analyses can be performed using more user-friendly and readily available software tools, ST-PRA will remain out of the reach of health care providers. The authors recommend that additional efforts be invested to make ST-PRA more accessible to enable the improvement of system design and the reduction of risks associated with the delivery of health care.


This project was funded under contract no. HHSA290200600019I, Task Order 12, from the Agency for Healthcare Research and Quality (AHRQ), U.S. Department of Health and Human Services. The findings and conclusions in this document are those of the authors, who are responsible for its content, and do not necessarily represent the views of AHRQ. No statement in this report should be construed as an official position of AHRQ or of the U.S. Department of Health and Human Services.

We gratefully acknowledge Rachel Crosno, Nasibeh Fard, and Xin Zeng for their valuable contributions during various phases of this study. In addition, we thank Kendall Hall, project officer at AHRQ, for her generous and thoughtful support throughout this study.

Authors' Affiliations

Barnabas Health, West Orange, NJ (ADS). New Jersey Medical School, University of Medicine and Dentistry of New Jersey, Newark, NJ (ADS). Grado Department of Industrial and Systems Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA (ADS, EB). American Institutes for Research, Washington, DC (LS).

Address correspondence to: Anthony D. Slonim, MD, DrPH, Executive Vice President and Chief Medical Officer, Barnabas Health, 95 Old Short Hills Road, West Orange, NJ 07052; Email:


1. Slonim AD, Bish EK, Xie RS. Red blood cell transfusion safety: probabilistic risk assessment and cost/benefits of risk reduction strategies. Ann Operations Research 2011 July 12;1-30. DOI 10.1007/s10479-011-0925-0.

2. Cohen MR, Smetzer JL, Westphal JE, et al. Risk models to improve safety of dispensing high-alert medications in community pharmacies. J Am Pharm Assoc 2012 Sep-Oct;52(5):584-602. PMID: 23023839.

3. Bish DR, Bish EK, Xie RS, et al. Optimal selection of screening assays for infectious agents in donated blood. IIE Trans Healthc Syst Eng 2011;1(2):67-90.

4. U.S. Nuclear Regulatory Commission. Fact Sheet on Probabilistic Risk Assessment. Accessed August 14, 2011.

5. Stamatelatos M, Vesely W. Fault Tree Handbook with Aerospace Applications. Washington, DC: NASA Office of Safety and Mission Assurance; 2002. Accessed August 14, 2011.

6. Westphal JE, Marx DA. Socio-technical probabilistic risk assessment: its application to aviation maintenance. Int J of Aviation Psychology 2008;18(1):51-60.

7. Modarres M. Risk Analysis in Engineering Techniques, Tools and Trends. Probabilistic risk assessment, 33-111; Identifying, ranking and predicting contributors to risk, 241-274. London: Taylor and Francis; 2006.

8. Marx DA, Slonim AD. Assessing patient safety risk before the injury occurs: an introduction to sociotechnical probabilistic risk modelling in health care. Qual Saf Health Care 2003 Dec;12(Suppl 2):ii33-38. PMID: 14645893.


Page last reviewed June 2014
Page originally created June 2014
Internet Citation: Using Socio-Technical Probabilistic Risk Assessment (ST-PRA) to Assess Risk and Improve Patient Safety and Reliability in Health Care Systems. Content last reviewed June 2014. Agency for Healthcare Research and Quality, Rockville, MD.