Improving the Measurement of Surgical Site Infection Risk Stratification/Outcome Detection
Chapter 2. Determining Surgical Site Infection Rates
The purpose of this task is to develop a surveillance tool that detects downstream manifestations of surgical site infection (SSI) in electronic data. Health care systems with electronic health information systems may improve the efficiency (time spent to find a positive case) of their SSI surveillance activities and improve reliability by leveraging electronic data. Although many approaches exist, the one long employed by the Intermountain Healthcare system uses electronic algorithms to populate more manageable queues of charts that an Infection Preventionist (IP) can subsequently review. This approach can capitalize on the IP's superior ability to discern the presence of SSI and may significantly unburden the IP of mundane, automatable tasks. When Intermountain Healthcare initially implemented this scheme, few facilities could have replicated the feat, but more and more facilities may now be able to employ similar strategies. Human-adjudicated electronic surveillance for SSI may now be generalizable to other institutions as more hospitals switch over to electronic medical records (EMRs) or electronic health records (EHRs).
The purpose of an IP performing manual surveillance is at least two-fold: to improve situational awareness and to be able to detect differences between times or places. For the former, it is useful to have a highly sensitive surveillance system. For the latter, it is useful for the system to be highly specific. IPs have been traditionally employed in this task because they are adaptable and have a better ability to discriminate between charts that have and do not have SSI than automated systems. Although employing IPs appears to be the ideal solution, they are routinely stretched and not allowed adequate time for all of their responsibilities.1-3 Also, adaptation can lead to problems when it comes to comparability.
On the other hand, completely automated systems can review charts quickly and usually do not adapt. There is some evidence to suggest that depending on the conditions and the purpose of surveillance, automated systems may be the instrument of choice.4 These systems can be extremely sensitive to artifacts of data manipulation or changes in practice. Usually, algorithms are restricted to structured data and cannot use as much information as manual systems. Also, their specificities are usually inferior to manual review.
We used a hybrid, human-adjudicated approach. Such systems are not new, but there have been barriers to implementation. They still need human reviewers, and adapting algorithms developed elsewhere to the local electronic health system may be difficult. The rationale for the combination of the two may be illustrated by invoking signal detection theory.
In signal detection theory, reviewers distinguish between the presence or absence of disease by assessing the chart, laboratory values, antibiotics, etc. These data are called signal. The reviewer has two important characteristics: the discriminability index and the criterion. The discriminability index is a measure of how well the reviewer perceives the differences in signal between the diseased and nondiseased states. The criterion is the threshold at which the reviewer interprets signal as disease. If the criterion is lowered, then sensitivity improves and specificity declines. If the criterion is raised, then the reverse is true. The only way to improve sensitivity and specificity simultaneously is to improve discriminability. A human reviewer's discriminability index is unlikely to change rapidly, but the criterion might. An automated system's discriminability index is usually lower than a human reviewer's, but it can review a large number of cases rapidly. Its criterion usually does not change unless the semantics of the data have changed. With this framework, we can build a two-tiered system. The first tier is run by the automated system. It removes charts where signal is weak enough that they can still be safely removed despite its inferior discriminability index. The second tier involves human review on more difficult cases, where the human's superior discriminability index can be used efficiently.
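The trade-off described above can be illustrated with a minimal sketch. It assumes the textbook equal-variance Gaussian model of signal detection, which this report does not specify; the function name and all numeric values are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

# Assumption: nondiseased charts produce signal ~ N(0, 1) and diseased
# charts produce signal ~ N(d_prime, 1). A chart is called positive when
# its signal exceeds the criterion.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def sensitivity_specificity(d_prime, criterion):
    """Return (sensitivity, specificity) for a reviewer in this toy model."""
    sensitivity = 1.0 - NormalDist(mu=d_prime, sigma=1.0).cdf(criterion)
    specificity = STANDARD_NORMAL.cdf(criterion)
    return sensitivity, specificity


# Same discriminability, different criterion: lowering the criterion
# raises sensitivity and lowers specificity.
strict = sensitivity_specificity(d_prime=1.5, criterion=1.0)
lenient = sensitivity_specificity(d_prime=1.5, criterion=0.0)

# Higher discriminability (criterion kept midway between the two
# distributions): sensitivity and specificity both improve.
weak_reviewer = sensitivity_specificity(d_prime=1.5, criterion=0.75)
strong_reviewer = sensitivity_specificity(d_prime=3.0, criterion=1.5)

print("strict :", strict)
print("lenient:", lenient)
print("weak   :", weak_reviewer)
print("strong :", strong_reviewer)
```

In this toy model the automated first tier would correspond to a low-discriminability reviewer run at a very lenient criterion, and the human second tier to a high-discriminability reviewer.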
The algorithm we constructed was developed for use in this two-tiered system. A strength of our approach was to rigorously differentiate between risk factors for and manifestations of SSI. Risk factor data could supply additional information to improve performance, but it would also curtail any analysis of risk from surveillance systems using the algorithm. We anticipated that the main characteristics that would facilitate its acceptability were a high sensitivity and a low number of charts needed to review per identified SSI.
Our approach seeks to capitalize on the superior specificity of human reviewers, the growing wealth of electronic data, and the speed of automated systems. If charts are reviewed in roughly 20 minutes5 and the fraction of SSI among procedures is roughly 1 percent,6 then 33 hours of review could be anticipated for every SSI found. If electronic tools could effectively remove 80 percent of charts, then only 6.6 hours would be spent for every SSI found. The impact of such savings may be large. The Virginia requirement for statewide detection/reporting would require 160 infection preventionists (IPs) at a cost of $11.5 million. More than 50 percent of IP time is spent at the desk7—time that could be applied to implementation, education, and other effective activities. The surveillance tool will enhance nurse work, moving them from being infection counters to being IPs, freeing these professionals up to do more prevention. Further, the surveillance system provides cognitive surveillance support of the human element of current practice (i.e., chart review, available electronic data, using “shoe leather”).
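The arithmetic behind these estimates can be reproduced with a short sketch. It assumes that the electronic filter removes only charts without SSI (that is, no infections are lost), which is the best case; the function name is hypothetical.

```python
def review_hours_per_ssi(minutes_per_chart, ssi_fraction, fraction_removed=0.0):
    """Hours of chart review needed, on average, to find one SSI.

    Assumption: every removed chart is a true negative, so the number of
    SSIs available to be found is unchanged by filtering.
    """
    charts_per_ssi = (1.0 - fraction_removed) / ssi_fraction
    return charts_per_ssi * minutes_per_chart / 60.0


# 20 minutes per chart and roughly 1 percent of procedures with SSI.
unfiltered_hours = review_hours_per_ssi(20, 0.01)
filtered_hours = review_hours_per_ssi(20, 0.01, fraction_removed=0.80)

print(f"unfiltered: {unfiltered_hours:.1f} hours per SSI")
print(f"filtered:   {filtered_hours:.1f} hours per SSI")
```

With these inputs the unfiltered figure is about 33 hours, and removing 80 percent of charts leaves one-fifth of that, in line with the estimates in the text.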
Exhibit 1. Advantages of surveillance automation
SSI = surgical site infection
Subtask 2.1. Identify Potential Automated/Electronic Sources of Health Care Data Useful for Surveillance of SSI
The investigators and supervising officers decided to investigate a human-adjudicated electronic system akin to the one currently used in Intermountain Healthcare. In such systems, electronic algorithms with high sensitivity and negative predictive value are employed to identify electronic markers of SSI and populate a manageable queue of charts that an IP would subsequently review. The decision to implement this hybrid type of surveillance system was made based on the difficulty of categorizing SSI subtypes and concern for the poor specificity of solely electronic approaches. Initially, the plan was to perform this task using data from the four participating hospitals; however, as we acquired SSI data from these hospitals, it became apparent that the rarity of SSI among the total number of procedures performed would make these data insufficient for the proper training and validation of the electronic algorithm component of our system.
In Exhibit 2, it can be seen that there were 73 SSIs, with the smallest hospital contributing only 3 (4.1 percent) and the largest hospital contributing 43 (58.9 percent). The use of data from only the principal four hospitals would produce algorithms based on small numbers and dominated by Intermountain Healthcare.
The standard approach to SSI surveillance, as implemented by the National Surgical Quality Improvement Program (NSQIP), facilitates a consistent measurement of SSI across facilities. Their methodologies are well-documented and there is a certification process for each reviewer. Additionally, NSQIP performs yearly audits to assess interrater reliability. The accrual of data contributed is also reviewed to maintain data quality. The entire Veterans Affairs (VA) network of 152 active hospitals participates in NSQIP (its implementation is called VASQIP) and performs a large number of surgeries. We subsequently received permission for and obtained VASQIP data for 2007 through 2010 for training and testing. Data from the four principal hospitals were used for the external validation process described in Task 2.3.2.
Of note, VASQIP reviewers do not review all cases. Surgical chart reviewers review cases in temporal order as they are identified by CPT code. Reviewing stops when they reach their quotas over 8-day cycles. In the VA, the quota is 36 procedures. No more than 5 of the procedures can be inguinal herniorrhaphies during a cycle.8 Although sampling is not random, the weekday on which each cycle begins shifts from one cycle to the next, so it is not obvious that this process produces systematic bias with regard to SSI outcomes. Exhibit 3 illustrates the fraction of cases reviewed among the listed procedures at the VA Salt Lake City Healthcare System (VA SLC HCS), whether documented only by ICD-9 codes, CPT codes, or both. Depending on the total number of cases of each type, we expect that there would be differences in the proportions reviewed that may vary over time. We cannot exclude bias and it appears that, for some procedures, only a minority of cases is reviewed, but the numbers of both surgeries and SSIs in hospitals of different sizes, locations, and acuity were seen as an asset for training algorithms to detect SSI. Again, because our objective is to develop an algorithm to detect SSI, the only bias we are concerned with is whether cases are sampled in ways that induce differential misclassification between diagnoses of SSI and our measured indicators of SSI.
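The sampling rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the VASQIP procedure itself: it assumes that inguinal herniorrhaphies beyond the cap are skipped (not reviewed) and that review simply stops once the quota is met. The function names and the start date are hypothetical.

```python
from datetime import date, timedelta

CYCLE_DAYS = 8      # length of a review cycle
QUOTA = 36          # procedures reviewed per cycle
MAX_INGUINAL = 5    # cap on inguinal herniorrhaphies per cycle


def select_for_review(cases):
    """Pick cases for review within one cycle.

    `cases` is a list of (case_id, is_inguinal_herniorrhaphy) tuples in
    temporal order. Assumption: hernia cases over the cap are skipped.
    """
    selected = []
    inguinal_count = 0
    for case_id, is_inguinal in cases:
        if len(selected) >= QUOTA:
            break
        if is_inguinal:
            if inguinal_count >= MAX_INGUINAL:
                continue
            inguinal_count += 1
        selected.append(case_id)
    return selected


def cycle_start_weekdays(first_start, n_cycles):
    """Weekday (0 = Monday) on which each consecutive 8-day cycle begins."""
    return [(first_start + timedelta(days=CYCLE_DAYS * i)).weekday()
            for i in range(n_cycles)]


# Ten hernias followed by forty other procedures in one cycle.
example_cases = [(f"H{i}", True) for i in range(10)] + \
                [(f"X{i}", False) for i in range(40)]
reviewed = select_for_review(example_cases)

# Because 8 days is one more than a week, seven consecutive cycles start
# on seven different weekdays.
weekdays = cycle_start_weekdays(date(2009, 1, 1), 7)
```

The second helper shows why the start of the cycle rotates through the week: each cycle begins one weekday later than the previous one.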
2.1.1. Identify data elements for inclusion in training datasets
A literature review was performed using Medline and the searches “(“surgical wound infection/diagnosis”[Mesh] AND “Data collection”[Mesh])” as well as “(“surgical wound infection/diagnosis”[Mesh] AND (“Blood Sedimentation”[Mesh] OR “C-Reactive Protein”[Mesh] OR “Leukocytosis”[Mesh]))”, which produced 256 and 75 results, respectively. Titles and abstracts were reviewed to identify articles to investigate. Our criterion was to identify articles that pertained to the manifestations of surgical site infections, especially those manifestations that can be identified electronically, as opposed to risk factors of disease. We also incorporated articles that the authors were aware of and allowed “snowballing” of related articles during review. We excluded articles that employed primary data collection. The identified data elements were: leukocyte count, leukocyte differential, fever, procalcitonin (not helpful, as this laboratory measurement is not readily available in the United States), erythrocyte sedimentation rate, C-reactive protein, microbiology results, and antimicrobial administration.9-43 A significant number of articles incorporated claims data into algorithms.16, 18, 20, 21, 26, 30, 34, 44 Unfortunately, these data are generally not available until well after an IP typically would be reviewing cases. We have elected not to include claims data in the algorithm here.
Based on our findings in the literature review, a data dictionary was sent to each of the participating centers, so that they could pull their data. Standardizing to a common physical data model allowed us to share the algorithm through the dissemination of SQL (structured query language) code scripts. Each of the centers then implemented the algorithm on their own data and identified charts that needed to be reviewed. The full data dictionary is included in Appendix C.
2.1.2. Each site pulls surgical-procedural and other identified data, based on the list for training sets
Some modifications were made to this subtask with approval from AHRQ. The movement of large sets of individual data between institutions was problematic. Instead, we focused on developing a portable algorithm, so that each of the centers would be able to implement it locally. Local (as opposed to central) implementation also demonstrates the feasibility of algorithm dissemination.
During the task, it became apparent that VA SLC HCS was the only hospital contributing to NSQIP for all four surgeries of interest. Intermountain Healthcare (IH) does participate in NSQIP, but does not perform NSQIP surveillance on all surgery types of interest. Neither Denver Health (DH) nor Vail participated in NSQIP. At the VA, VASQIP is the principal method of SSI surveillance. For the purposes of training and validating an algorithm, we needed a dataset much larger than the participating hospitals could provide, so we decided to use nationwide VASQIP data. A database of vitals, laboratories, medications, microbiology data, and SSI outcomes was constructed for the purposes of algorithm development. Drawbacks of developing an algorithm entirely in the VA and applying it to other hospitals include the fact that the veteran patient population does not necessarily generalize well to the populace at large, and that the VA system has a comprehensive inpatient and outpatient system. IH and DH are similar in this respect. The use of both inpatient and outpatient data improves postdischarge surveillance while making it more efficient. We anticipated that the algorithms might fare more poorly at Vail Valley Medical Center (VVMC) due to the nonintegrated inpatient and outpatient care systems.
A significant amount of work was devoted to ensuring data quality. Since VA data came from many individual hospitals with different data formats, the distributions of each of the data elements were examined to look for outliers. These outliers were then examined to see if they stemmed from differences in units or other unanticipated formats. The final distributions of each were within anticipated bounds.
Subtask 2.2. Develop Procedure-specific Algorithms Utilizing Identified Data Sources to Detect SSI Events
2.2.1. Create and train the algorithm
Our objective was to build an algorithm with high negative predictive value that favored sensitivity over specificity, and relied on human adjudication to improve the specificity of the SSI surveillance system. Traditionally, prospective surveillance systems that rely on manual human review have suffered from suboptimal sensitivity. Because SSIs are rare outcomes, many hours are spent to find each infection, which is extremely inefficient and time consuming. Further, sensitivity may be low because of reviewer fatigue.45 A human-adjudicated system that reduces workload by removing charts unlikely to contain SSI both reduces the amount of work necessary to detect SSIs and raises the reviewer's expectation that a chart might contain an SSI.
Previous experience detecting methicillin-resistant Staphylococcus aureus (MRSA) by means of electronic algorithms46–48 guided our efforts to find electronic signs of infection as opposed to risk factors. We began with identifying candidate surgeries among VASQIP data from 2007 through 2009. As VASQIP surgeries are identified by CPT code and not by ICD-9s, we built a map between the two for the four target procedures: coronary artery bypass grafts, total hip arthroplasties, total knee arthroplasties, and abdominal and inguinal herniorrhaphies. We used the UMLS (Unified Medical Language System) metathesaurus concepts to bridge between ICD-9 and CPT vocabularies at the level of coronary artery bypass grafting (CABG), herniorrhaphies, procedures of the hip, and procedures of the distal femur and knee. We then reviewed the children of these concepts and identified codes that described the types of procedures that were included in the ICD-9 list. We felt that this procedure was more reproducible and updatable than an entirely manual mapping attempt. Details of our findings while performing this mapping are included in Appendix D.
Once the necessary CPT codes were identified, they were used to identify candidate surgeries among all VA hospitals. Between 2007 and 2009, there were 71,102 targeted procedures performed and reviewed in our sampling of the VASQIP dataset. This set was randomly divided into two equal sets, one for training and one for testing. However, due to gaps in the laboratory data we received, the cases before January 1, 2008, were excluded from the testing set.
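A minimal sketch of this kind of split is shown below. The record layout (a dictionary with a `surgery_date` key), the seed, and the function name are hypothetical; only the equal random split and the January 1, 2008, cutoff for the testing half come from the text.

```python
import random
from datetime import date, timedelta


def split_train_test(cases, seed=0, test_cutoff=date(2008, 1, 1)):
    """Randomly halve the cases, then drop early cases from the testing half.

    Assumption: each case is a dict with a 'surgery_date' entry.
    """
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    half = len(shuffled) // 2
    train = shuffled[:half]
    test = [c for c in shuffled[half:] if c["surgery_date"] >= test_cutoff]
    return train, test


# Hypothetical cases spread evenly from 2007 onward.
example = [{"id": i, "surgery_date": date(2007, 1, 1) + timedelta(days=10 * i)}
           for i in range(100)]
train_set, test_set = split_train_test(example)
```

Because the exclusion is applied only to the testing half, the two sets end up unequal in size even though the initial split is even.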
The dataset also noted whether a superficial, deep, or organ-space SSI was identified within 30 days of the surgical procedure. For simplicity, we summarized the information present in the different levels of infection into dichotomous variables indicating the presence or absence of deep or organ-space SSI, or SSI of any type. As an IP would still need to review the chart, we felt that it was unnecessary for an algorithm to be trained to find each SSI type separately.
These data were then linked to potential manifestations of disease. We included electronic markers between postoperative days 4 and 30 because pre- or perioperative data might indicate risks for SSI or that the patient was already infected at the time of operation. We then investigated the relationship of leukocyte count, temperature, the sending of a microbiology culture, whether the culture matched, the administration of an antibiotic (inpatient or outpatient), readmission, erythrocyte sedimentation rate, and C-reactive protein to SSI. Maximum values during the eligible time-frame were used for laboratory values and vitals. The administration of an antibiotic was limited to systemic antibacterials and readmission was limited to admission to ICU or acute care medical or surgery wards. Although we recognized that risk factors could be associated with SSI as well, we were concerned about introducing mathematical coupling49—that bias would be introduced into any subsequent analyses of risk, because risk was used to determine eligibility for SSI.
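The windowing step can be sketched as follows. This is an illustration only: it assumes that the day of surgery is postoperative day 0 and that observations arrive as (date, value) pairs; the function name and example values are hypothetical.

```python
from datetime import date

WINDOW_START_DAY = 4    # first eligible postoperative day
WINDOW_END_DAY = 30     # last eligible postoperative day


def max_in_window(surgery_date, observations):
    """Maximum value observed on postoperative days 4 through 30.

    `observations` is an iterable of (observation_date, value) pairs.
    Returns None when no value falls inside the window.
    Assumption: the surgery date counts as postoperative day 0.
    """
    eligible = [
        value
        for obs_date, value in observations
        if WINDOW_START_DAY <= (obs_date - surgery_date).days <= WINDOW_END_DAY
    ]
    return max(eligible) if eligible else None


surgery = date(2009, 3, 1)
wbc_values = [
    (date(2009, 3, 3), 15.0),   # day 2: too early, ignored
    (date(2009, 3, 6), 9.5),    # day 5
    (date(2009, 3, 11), 12.1),  # day 10
    (date(2009, 4, 1), 20.0),   # day 31: too late, ignored
]
max_wbc = max_in_window(surgery, wbc_values)
```

Returning None for an empty window keeps "no measurement" distinct from "normal measurement", which matters because the presence or absence of a laboratory value may itself be informative.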
We also considered the potential need for rules that considered the dynamic evolution of the patient's status over time. We excluded the first three postoperative days when determining the administration of antibiotics and sending of cultures because of recommendations for antibiotic prophylaxis and because some operations are performed on known or suspected infected joints. Laboratories and vitals were more difficult because successful operations without complication are known to cause abnormalities that resolve over time. Exhibits 4 and 5 show the evolution of C-reactive protein in total hip arthroplasties both without SSI and with SSI. The laboratory values appeared to have poor correlation with outcome during preliminary analysis, so it was unclear whether this line of analysis would yield much extra information. We opted to simplify by not considering this aspect.
To increase the amount of information that a microbiology culture could provide and to improve the specificity of electronic algorithms, we mapped the reported sample and specimen fields to a single collection-site type. Each type was categorized as to whether it could be consistent with each of the surgeries of interest (see Appendix E). For example, a urine specimen was considered to be incompatible with an SSI from any of our surgeries of interest. A wound swab was considered to be compatible with any of the surgeries. Synovial fluid from the hip was considered to be only compatible with an SSI after a total hip arthroplasty. While all of the cultures were mapped, not all were considered to be postoperative. Only postoperative cultures were included for consideration in the algorithm. We also extracted information regarding whether there was growth of any organism, whether there was growth of a virulent organism, and whether the specimen came from a normally sterile site. But, as it became clear that implementation would be difficult at other centers, we discontinued development of this aspect.
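The compatibility mapping can be pictured as a lookup table. The sketch below uses only the three examples given in the text; the site labels, surgery codes, and the choice to treat unmapped sites as compatible (to favor sensitivity) are assumptions made for illustration, and the actual mapping is the one in Appendix E.

```python
ALL_SURGERIES = frozenset({"CABG", "THA", "TKA", "HERNIA"})

# Collection-site type -> surgeries for which a culture from that site
# could be consistent with an SSI. Hypothetical labels; see Appendix E.
COMPATIBLE_SURGERIES = {
    "urine": frozenset(),                      # never consistent with SSI
    "wound swab": ALL_SURGERIES,               # consistent with any surgery
    "hip synovial fluid": frozenset({"THA"}),  # hip arthroplasty only
}


def culture_is_compatible(site_type, surgery):
    """True when a culture from `site_type` could reflect an SSI after `surgery`.

    Assumption: an unmapped site type is treated as compatible so that
    the electronic filter errs toward flagging the chart.
    """
    return surgery in COMPATIBLE_SURGERIES.get(site_type, ALL_SURGERIES)


print(culture_is_compatible("urine", "TKA"))
print(culture_is_compatible("hip synovial fluid", "THA"))
```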
Various strategies are available for algorithm development. We targeted algorithms with high sensitivity that also could increase the efficiency of chart review by excluding a large fraction of negative charts. To do the latter while not impeding the former, we investigated methods that would allow interactions between variables. Classification and regression tree (CART) analysis, also called recursive partitioning, lends itself to the formulation of interacting rules and has been used previously in algorithms to detect SSI.16 This method is limited, however, because it does not analyze interactions along the entire range of variables. Another issue is that it is not as robust when dealing with frequent missing data. Random forest strategies may have had advantages, but we felt that, for user acceptability, it was important to have simple, understandable rules.
We used the rpart function for recursive partitioning in R to develop the algorithms. We used the function initially to detect all types of SSI, but because of the lack of sensitivity and inefficiencies when searching for superficial SSIs, it was subsequently trained to target only deep and organ-space SSIs. It was felt that the reliability in the reference standard would also be higher in this subgroup.50 The results presented in this document refer to these later algorithms; however, classification trees of these earlier attempts are also included in Appendix F. We specified a classification tree and a loss matrix to penalize false negatives. The loss matrix was weighted by the inverse of the prevalence of deep and organ-space infections in the set. The maximum depth was limited to three and the minimum number of cases in a branch, before a split was permitted, was three. Any tree that resulted in a change of the complexity parameter (cp) of more than 0.001 was investigated. Effort was taken to prune the tree at the cp that minimized the relative cross-validation error, but when the difference was small and the algorithm was not sensitive enough, values with more splits but slightly higher relative cross-validation errors were accepted. The model was built with the following R code:
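library(rpart)  # provides rpart() for recursive partitioning
# Classification tree for deep and organ-space SSI; 'loss' penalizes false negatives.
fit <- rpart(SSI ~ WBC + ESR + CRP + NE_N + NE_P + FERRITIN
             + P_WBC + P_ESR + P_CRP + P_FERRITIN + postopabx + postopcx + postopadmit,
             data = N, minsplit = 3, maxdepth = 3, method = "class", cp = 0.001,
             parms = list(loss = loss))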
Here, 'N' is the original data frame, 'N$SSI' is the vector representing deep and organ-space SSI, and 'loss' is the loss matrix. Variable names with the prefix 'P_' indicate the presence of a value, as we anticipated that the presence or absence of a lab may be informative as well. NE_N denotes the absolute number of neutrophils and NE_P denotes the percentage of neutrophils.
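The effect of a prevalence-weighted loss matrix can be illustrated with a small sketch in Python. It assumes one plausible reading of the weighting, namely a false-positive cost of 1 and a false-negative cost equal to the inverse of the prevalence; the exact matrix passed to rpart is not given in the text, and the prevalence used here is hypothetical.

```python
def build_loss_matrix(prevalence):
    """Loss matrix with rows = true class and columns = predicted class.

    Class 0 is "no SSI" and class 1 is "SSI". Correct calls cost nothing.
    Assumption: a false positive costs 1 and a false negative costs
    1 / prevalence.
    """
    return [
        [0.0, 1.0],                # true no-SSI: correct, false positive
        [1.0 / prevalence, 0.0],   # true SSI: false negative, correct
    ]


def flag_as_positive(p_ssi, loss):
    """Choose the class with the smaller expected loss for one node."""
    expected_loss_if_called_negative = p_ssi * loss[1][0]
    expected_loss_if_called_positive = (1.0 - p_ssi) * loss[0][1]
    return expected_loss_if_called_negative > expected_loss_if_called_positive


loss = build_loss_matrix(prevalence=0.004)   # hypothetical prevalence

# A node where only 1 percent of charts have SSI is still flagged,
# because a missed infection is weighted 250 times a false alarm.
print(flag_as_positive(0.01, loss))
print(flag_as_positive(0.001, loss))
```

Under this weighting, a node is flagged whenever its SSI fraction exceeds roughly the overall prevalence, which is what pushes the tree toward sensitivity at the cost of specificity.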
The initial tree, the correlation coefficient and cross-validation errors, and the final tree (if different from the initial tree) are included in Appendices F & G. The presentation of rules in Exhibit 6 is equivalent to the charts in the appendix. This is because the tree has been collapsed into an expression of the set of surgeries among which SSI are likely.
The way in which classification trees are collapsed is illustrated in Exhibit 6. If the first branch selects on condition A and its subbranch selects on condition B, then the set A∩B meets both conditions, where ∩ is the INTERSECTION operator. Because intersection is commutative, A∩B = B∩A and, by extension, A∩B∩C = C∩B∩A. Each branch that extends to the right joins the new set by the INTERSECTION operator. The rest of the figure describes how rules may be expanded. Exhibit 7 presents the same algorithm as the charts in the appendix, except that the commutative law has been used to make the rules more presentable. An important point is that missing values do not evaluate to true or to false; they evaluate to NULL. All of these NULLs are also interpreted as positive.
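The handling of missing values can be sketched with three-valued logic, using None to stand in for NULL. This is an illustration that assumes SQL-style semantics for combining conditions; the function names and thresholds are hypothetical and are not the rules from the exhibits.

```python
from functools import reduce


def exceeds(value, threshold):
    """Comparison that yields None (NULL) when the value is missing."""
    return None if value is None else value > threshold


def and3(a, b):
    """Three-valued AND: False dominates, otherwise NULL propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def rule_flags_chart(conditions):
    """Intersect the conditions of one collapsed branch.

    A result of True or None (NULL) flags the chart for review; only a
    definite False removes it.
    """
    combined = reduce(and3, conditions, True)
    return combined is not False


# Hypothetical branch A ∩ B with one laboratory value missing.
a = exceeds(12.0, 9.0)    # True
b = exceeds(None, 3.0)    # None: laboratory test not performed
print(rule_flags_chart([a, b]))   # flagged, because NULL counts as positive
print(rule_flags_chart([b, a]))   # same answer in either order
```

The second call shows the commutative property in practice: reordering the conditions of a branch does not change which charts are flagged.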
Alternatively, we tried an “inclusive” rule using the presence of any high-normal value.
- The presence of an erythrocyte sedimentation rate (ESR) greater than 20
- Or a total neutrophil count greater than 5,000/mm3
- Or a leukocyte count greater than 9,000/mm3
- Or a C-reactive protein greater than 3 mg/dL
- Or postoperative antibiotics given
- Or the presence of a postoperative culture, or the patient was readmitted within 30 days postoperatively.
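The inclusive rule above is a logical OR over its criteria and can be sketched as follows. The parameter names are hypothetical, the ESR unit is left as reported, and the sketch assumes that a missing laboratory value does not by itself trigger the rule, which the text does not state.

```python
def inclusive_rule(esr=None, neutrophils_per_mm3=None, wbc_per_mm3=None,
                   crp_mg_dl=None, postop_antibiotic=False,
                   postop_culture=False, readmitted_within_30_days=False):
    """Flag a chart when any single criterion of the inclusive rule is met.

    Assumption: a laboratory value of None (not measured) is treated as
    not meeting its threshold.
    """
    criteria = [
        esr is not None and esr > 20,
        neutrophils_per_mm3 is not None and neutrophils_per_mm3 > 5000,
        wbc_per_mm3 is not None and wbc_per_mm3 > 9000,
        crp_mg_dl is not None and crp_mg_dl > 3,
        postop_antibiotic,
        postop_culture,
        readmitted_within_30_days,
    ]
    return any(criteria)


print(inclusive_rule(wbc_per_mm3=9500))          # one criterion met
print(inclusive_rule(wbc_per_mm3=8000, esr=12))  # nothing met
```

Because any one criterion is sufficient, this rule can only gain sensitivity relative to a rule that requires combinations, at the price of flagging more charts.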
Finally, we also implemented a "simple" rule.
- Microbiology test ordered between postoperative days 4 and 30, inclusively
- An antibacterial was prescribed between postoperative days 4 and 30, inclusively
The algorithms' performance on the training set can be seen in Exhibits 8–10. In Exhibit 8, the total numbers of procedures are divided by procedure type. Additionally, the breakdown by depth of infection and fraction of total SSI are expressed. In Exhibit 9, two-by-two tables and diagnostic accuracy in the training set are listed. Additionally, the fraction excluded and the numbers of charts that need to be reviewed per positive case are expressed for both unfiltered review and algorithm filtered review. The sensitivity was as low as 93.3 percent for herniorrhaphies and total knee arthroplasties. When heavily penalizing false-negatives, even beyond the inverse prevalence, the rpart function would not return acceptable algorithms. The “inclusive algorithm” was implemented as well, which is detailed above, but did not employ any other logic than logical OR/set union. Exhibit 10 demonstrates the gains in sensitivity from a very inclusive algorithm. While there were some gains in sensitivity, particularly with respect to superficial SSI, which we were not targeting, this came at the expense of needing to review approximately one-third of the charts.
After the algorithm was developed, its sensitivity was 93.8% (95% confidence interval [CI] = 88.5–97.1) and its specificity was 93.0% (95% CI = 92.7–93.3) for all procedures compared to the training NSQIP dataset. Its positive and negative predictive values were 5.2% (95% CI = 4.3–6.1) and 99.99% (95% CI = 99.9–100), respectively. Thus, when an IP reviews these procedures, we would expect her/him to review 18.9 charts on average before finding an SSI using the recursive partitioning algorithm, 79.3 using the “inclusive” algorithm, and 246.9 if all charts were reviewed.
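The quantities reported here all derive from a two-by-two table, as the following sketch shows. The counts are hypothetical and are not the values from Exhibit 9; the sketch also assumes that "charts per SSI" means flagged charts divided by true positives for filtered review, and all charts divided by all SSIs for unfiltered review.

```python
def diagnostic_summary(tp, fp, fn, tn):
    """Diagnostic accuracy and review workload from a two-by-two table.

    tp/fp/fn/tn are counts of charts, with the algorithm's flag compared
    against the reference standard.
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "fraction_excluded": (tn + fn) / total,
        # Flagged charts an IP reviews per SSI found (reciprocal of PPV).
        "charts_per_ssi_filtered": (tp + fp) / tp,
        # Charts per SSI if every chart were reviewed.
        "charts_per_ssi_unfiltered": total / (tp + fn),
    }


# Hypothetical counts for 10,000 procedures with 100 SSIs.
summary = diagnostic_summary(tp=90, fp=910, fn=10, tn=8990)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In this hypothetical table the filter excludes 90 percent of charts and cuts the review burden from 100 charts to about 11 charts per SSI found, while missing 10 of the 100 infections.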
2.2.2. Externally validate data elements for inclusion in the algorithm & create final list
The VA NSQIP data were randomly divided into two equal-size sets for validation, because a second set of data was not collected prospectively. Data from the VA Salt Lake City Healthcare System were excluded because they would later be used in the analysis of our four principal hospitals of interest. We had initially considered bootstrap validation. However, when we decided on using NSQIP data and anticipated a much larger number of outcomes, we elected to use one-fold cross-validation, that is, a single split into training and testing halves.
Sensitivity, specificity, and positive and negative predictive value were calculated by comparing the electronic algorithm's output against the testing set to report final validation numbers. Its sensitivity and specificity were 73.1 percent and 92.9 percent, respectively. The positive predictive value (PPV) was 3.9 percent and the negative predictive value (NPV) was 99.9 percent. Unfortunately, the statistics for sensitivity were well below those seen on the training set. However, the inclusive and simple algorithms' performance remained stable, as seen when comparing Exhibits 10 and 11.
Not all elements were included in the final algorithm. Although we initially planned to include fever in the algorithm, Denver Health did not have this information extending back through the whole cohort so it was removed from analysis. To develop an easily comprehensible algorithm with face validity and easier implementation, we targeted an algorithm with a minimal set of easily pulled elements (Exhibit 12).
Each of the hospitals was sent the data elements necessary for the final algorithm. Actual code scripts were also sent to facilitate algorithm implementation; however, we realized, as others have before,51, 52 that tailoring and adjustments would have to be made to accommodate different data structures at each facility. The final algorithm was implemented in the SQL script included in Appendices F and G.
2.2.3. Create test database from final/validated list of data elements
As previously mentioned, it became apparent that the movement of individual level data between institutions for this task was an unworkable option. In lieu of this, we focused on algorithm portability. Instead of a single database upon which to run the electronic algorithm, multiple local databases and local implementation of electronic algorithms solved the issue of moving individual-level data back and forth. Each institution performed data pulls for the necessary elements. Scripts that encoded algorithms for each of the four target procedures were written and distributed to each of the facilities. Data and computer professionals at each hospital tailored the code to run on their data. Each center's code was reviewed by the team at Salt Lake VAMC in order to catch potential misunderstandings during the implementation and adaptation process. Finally, the electronic algorithm was used on local hospital data, which identified charts to be reviewed for Subtask 2.3.
Page originally created December 2012