Michael A. Rubin, Makoto Jones, Jefrey L. Huntington, M. Josh Durfee, Susan L. Moore, James F. Lloyd, Christopher Nielson, Heather Gilmartin, R. Scott Evans, Walter L. Biffl, Lucy A. Savitz, Connie Savor Price
Automated systems for surgical site infection (SSI) surveillance have been developed and used, but they are rarely tested for generalizability. We sought to develop highly sensitive algorithms for detecting deep and organ-space SSI based on electronic data to flag charts for subsequent clinical review. We developed three electronic algorithms to detect deep and organ-space SSI after coronary artery bypass grafting, total hip and knee arthroplasties, and herniorrhaphies, using a sample of nationwide Veterans Affairs Surgical Quality Improvement Program (VASQIP) data. One algorithm was created using recursive partitioning, while the other two used simpler methods based on abnormal laboratory values or the presence of postoperative microbiology or antimicrobial data. The algorithms were tested against VASQIP data and then assessed for generalization using data from hospitals in three different, external (non-VA) health care systems. Although all three algorithms performed reasonably well at identifying deep and organ-space SSI in the VASQIP test dataset, the recursive partitioning algorithm showed a lower sensitivity than expected. Performance worsened considerably when tested against data from the outside hospital systems, suggesting that the recursive partitioning algorithm was overfit (i.e., did not generalize well to test data samples), despite 10-fold cross validation. The simpler algorithms were more robust, but performance was variable between facilities. The observed variation was primarily due to differences in the data collected and stored in each system. The development of generalizable algorithms to detect SSI using electronic data necessitates careful consideration of the data readily available at most health care systems.
The purpose of traditional infection surveillance, performed manually by infection preventionists (IPs), is at least two-fold: (1) to improve situational awareness and (2) to accurately detect trends and differences across times or locations. For the former, it is most useful to have a high sensitivity; for the latter, it is most useful to have a high specificity. IPs have been favored over automated systems for this task because of their adaptability and clinical judgment about the presence or absence of surgical site infection (SSI). Continuing to solely use IPs in this role may appear ideal, but because of increasing time demands, they often cannot devote adequate time to all of their responsibilities.1–3 Also, the fact that they can and do use clinical judgment can potentially lead to issues concerning intra- and inter-rater comparability and reliability.
Health care systems with electronic health records (EHRs) may improve the efficiency of their SSI surveillance activities (i.e., time spent to find a positive case) and improve case finding reliability by leveraging electronic data. Although many potential approaches exist, the system long employed by Intermountain Healthcare (IH) uses electronic algorithms to screen potential cases and populate more manageable queues of charts for an IP to subsequently review.4 This approach can capitalize on the IP's superior specificity (i.e., ability to discern the presence of a true SSI) and may significantly reduce the work burden. When IH initially implemented this scheme, few facilities had the data infrastructure or capacity to replicate its system, but as more facilities employ sophisticated EHRs, more may now be able to implement similar electronic surveillance strategies that are augmented by clinician review.
Completely automated electronic systems can review charts rapidly and without concern for adaptation. There is some evidence to suggest that, in some situations, automated systems may be the instrument of choice.5 However, these systems can be extremely sensitive to artifacts of data manipulation or changes in clinical practice. Also, automated algorithms are usually limited to using structured data and cannot utilize the same body of information as manual review, such as the information contained within text notes. As a result, the specificity of these systems is typically inferior to manual review.
The purpose of our work was to develop an SSI surveillance tool that detects downstream manifestations of SSI as indicated in electronic data and to implement and test this tool at four disparate health care organizations. We chose to build a two-tiered system: the first tier is run by the automated system, which removes charts where the signal is weak enough to safely exclude; the second tier involves human review on the more difficult cases, where a superior ability to discriminate between signal and noise (i.e., cases and non-cases) can be efficiently applied. We hypothesized that this system would lead to comparable results between health care systems and considerable time savings during surveillance activities.
Study Sites and Cohort
Our study involved four participating centers: the VA Salt Lake City Health Care System (VASLCHCS), Denver Health (DH), Vail Valley Medical Center (VVMC), and Intermountain Healthcare (IH). The population of interest was all patients who underwent coronary artery bypass grafting (CABG), total hip arthroplasty (THA), total knee arthroplasty (TKA), and abdominal and inguinal herniorrhaphy. These patient populations were chosen because of the relatively high volume of procedures at these centers and the risk of deep and organ-space SSIs. To develop, train, and test our electronic algorithms, we used VA Surgical Quality Improvement Program (VASQIP) data on the outcomes of patients who underwent these procedures from January 1, 2007 through December 31, 2009. These data were selected because of the large volume of data available compared with the other centers and to amass a reasonable number of SSIs for training. We supplemented these data with VA enterprise-wide microbiology, laboratory, admission/discharge/transfer, bar code medication administration, and vitals data from 1 week prior to 30 days after the surgical procedure. Similar external test datasets were developed for each participating center.
Each of the centers had different pre-existing strategies for SSI surveillance. DH and VVMC generally followed National Healthcare Safety Network (NHSN) guidelines and performed traditional manual surveillance. While centers were opportunistic when recording post-discharge, prosthetic-related infections up to a year after surgery, they did not systematically follow up on patients beyond 30 days postoperatively. IH had previously pioneered electronically supported, clinician-managed surveillance systems and uses this modality routinely.6 The VA uses VASQIP for surveillance, with rules similar to (but not entirely the same as) NHSN. Each of the facilities pulled the results of routine surveillance based on its own methodologies into databases residing on its own systems. Each of these datasets served as a reference standard representing the status quo. Table 1 shows the procedure and SSI data gathered from the four centers.
We performed a literature review using MEDLINE to identify data elements that were likely to inform a diagnosis of SSI. We selected articles that pertained to the manifestations of SSI that were potentially included in electronic records. We identified leukocyte count, leukocyte differential, fever, procalcitonin, erythrocyte sedimentation rate (ESR), C-reactive protein (CRP), microbiology results, and antimicrobial administration as potential elements to include.7–20 Many published algorithms also incorporated claims data, but these data were excluded from our algorithm because they often are not available at the time of IP case review.9,10,12–16,21 Not all of the elements were included in the final algorithm; for instance, although we initially planned to include fever, it was excluded because DH did not record these data through the entire study period.
Algorithm Training and Testing Data
We began by identifying candidate surgeries among VASQIP data from 2007 through 2009. Because VASQIP surgeries are identified by Current Procedural Terminology (CPT) code and not by International Classification of Diseases, Ninth Revision (ICD-9), we built a map between the two vocabularies for the four target surgeries, using the Unified Medical Language System (UMLS) metathesaurus concepts. Included surgeries were identified by both CPT and ICD-9 codes.
VASQIP surveillance is the principal method of SSI accounting at the VA; as such, surveillance is not performed on all surgeries, but rather on a subset. During our study timeframe 63,290 of the target procedures were performed and reviewed in the VASQIP system. This set was divided randomly into two sets for training and testing of the algorithm. Data from VASLCHCS were excluded from the training set because they would later be used in the analysis of our four principal centers.
Table 1. Number of procedures and SSIs between 2008 and 2009 stratified by hospital and type
Abbreviations: CABG = coronary artery bypass grafting; DH = Denver Health; HERNIA = herniorrhaphy; IH = Intermountain Healthcare; SSI = surgical site infection; TKA = total knee arthroplasty; THA = total hip arthroplasty; sSSI = superficial SSI; dSSI = deep SSI; oSSI = organ-space SSI; VASLCHCS = VA Salt Lake City Health Care System; VVMC = Vail Valley Medical Center.
The VASQIP data included whether a superficial, deep, or organ-space SSI was identified within 30 days of the surgical procedure. For simplicity, we dichotomized this variable to indicate the presence or absence of any SSI type. These data were then linked to potential manifestations of disease. We included electronic markers between postoperative days 4 and 30 because earlier data might indicate that the patient was already infected at the time of operation. We examined the relationship of the following candidate electronic data elements to the presence of SSI: leukocyte count, temperature, the sending of a culture, the administration of a systemic antibiotic (inpatient or outpatient), hospital readmission, ESR, and CRP. Maximum values during the eligible timeframe were used for laboratory values and vitals.
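The windowing step described above can be sketched as follows. This is a minimal illustration, not the study's code: the function name, the `(postoperative day, value)` input format, and the sample leukocyte counts are all assumptions made for the example; only the day 4 to day 30 window and the use of the maximum value come from the text.

```python
from typing import Iterable, Optional, Tuple

WINDOW_START_DAY = 4   # first eligible postoperative day (from the text)
WINDOW_END_DAY = 30    # last eligible postoperative day (from the text)


def max_in_window(observations: Iterable[Tuple[int, float]]) -> Optional[float]:
    """Return the maximum value recorded on postoperative days 4-30.

    Returns None when nothing was measured in the window, which keeps
    "missing" distinct from a genuinely low value.
    """
    eligible = [value for day, value in observations
                if WINDOW_START_DAY <= day <= WINDOW_END_DAY]
    return max(eligible) if eligible else None


# Hypothetical leukocyte counts as (postoperative day, value) pairs.
wbc = [(1, 15.2), (5, 9.8), (12, 13.4), (45, 18.0)]
print(max_in_window(wbc))          # day 1 and day 45 fall outside the window
print(max_in_window([(2, 11.0)]))  # no eligible measurement
```

Returning `None` rather than a default number matters here because, as noted in the next paragraph, postoperative laboratory values are frequently missing.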
We targeted algorithms with high sensitivity and high negative predictive value that could increase the efficiency of chart review by excluding a large fraction of negative charts. To accomplish the latter while not impeding the former, we investigated methods that would allow interactions between variables. Classification and regression tree analysis (CART, also called recursive partitioning) lends itself to the formulation of interacting rules and has been used previously in algorithms to detect SSI.13 This method is somewhat limited in that it does not analyze interactions along the entire range of variables. Another issue is that postoperative laboratory elements are often missing. Random forest strategies may have advantages when dealing with datasets where much of the data are missing, but we thought that for user acceptability it was important to have simple, coherent rules.
We used the function rpart for recursive partitioning in the R software package22 to develop our algorithms. Initially, we tried to detect all SSI, but because of the lack of sensitivity and inefficiencies when searching for superficial SSI (sSSI), we trained on only deep (dSSI) and organ-space (oSSI) infections. We specified a classification tree and a loss matrix to penalize false negatives. The loss matrix was weighted by the inverse of the prevalence of dSSI and oSSI in the set. The maximum tree-depth was limited to three, and the minimum number of cases in a branch before a split was permitted was three. Any tree that resulted in a change of the complexity parameter (cp) of more than 0.001 was investigated. Effort was taken to prune the tree at the cp that minimized the relative cross-validation error, but when the difference was small and the algorithm was not sensitive enough, values with more splits but slightly higher relative cross-validation errors were accepted.
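To illustrate how an inverse-prevalence loss matrix shifts a classification tree toward sensitivity, the sketch below builds such a matrix and applies it to the labeling of a single leaf. It is written in Python for illustration only; the study itself used rpart in R, where a loss matrix is supplied through the `parms` argument. The matrix layout, the function names, and the counts are assumptions, and the exact weighting used in the study is not specified beyond "the inverse of the prevalence".

```python
from typing import List


def inverse_prevalence_loss(n_cases: int, n_total: int) -> List[List[float]]:
    """Build a 2x2 loss matrix (rows = true class, columns = predicted class).

    Class 0 = no SSI, class 1 = deep or organ-space SSI. A false positive
    costs 1; a false negative costs 1 / prevalence (assumed layout).
    """
    if not 0 < n_cases < n_total:
        raise ValueError("need at least one case and one non-case")
    fn_cost = n_total / n_cases  # inverse of the prevalence
    return [[0.0, 1.0],
            [fn_cost, 0.0]]


def leaf_label(n_pos: int, n_neg: int, loss: List[List[float]]) -> int:
    """Label a leaf with the class that minimizes total loss in that leaf."""
    cost_if_predict_neg = n_pos * loss[1][0]  # every case becomes a false negative
    cost_if_predict_pos = n_neg * loss[0][1]  # every non-case becomes a false positive
    return 1 if cost_if_predict_pos < cost_if_predict_neg else 0


# Hypothetical training set: 4 cases among 1,000 procedures.
loss = inverse_prevalence_loss(4, 1000)
print(loss[1][0])                 # false-negative cost
print(leaf_label(2, 300, loss))   # leaf with 2 cases and 300 non-cases
print(leaf_label(1, 300, loss))   # leaf with 1 case and 300 non-cases
```

With this weighting, a leaf is flagged once its ratio of cases to non-cases exceeds the overall prevalence, which is why such trees tend to trade positive predictive value for sensitivity.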
In addition to the rpart algorithm, we created an "inclusive" algorithm using the presence of any high-normal laboratory value and a "simple" algorithm that looked only for postoperative cultures and antimicrobials. The specific rules for all three algorithms are shown in Table 2.
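As a rough sketch of how the "simple" algorithm could populate a review queue, the example below flags a chart when any culture or systemic antimicrobial appears in the postoperative window. The `Chart` structure, field names, and sample charts are hypothetical, and treating either event alone as sufficient is an assumption; the actual rules are those given in Table 2.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chart:
    chart_id: str
    culture_days: List[int] = field(default_factory=list)        # postoperative days a culture was sent
    antimicrobial_days: List[int] = field(default_factory=list)  # postoperative days an antimicrobial was given


def in_window(day: int) -> bool:
    """Postoperative days 4 through 30, the eligible window described earlier."""
    return 4 <= day <= 30


def simple_flag(chart: Chart) -> bool:
    """Flag the chart if any culture or antimicrobial falls in the window (assumed rule)."""
    return any(in_window(d) for d in chart.culture_days + chart.antimicrobial_days)


def review_queue(charts: List[Chart]) -> List[str]:
    """Return the IDs of charts that should be passed to the IP for manual review."""
    return [c.chart_id for c in charts if simple_flag(c)]


charts = [
    Chart("A", culture_days=[2]),          # culture too early to count
    Chart("B", antimicrobial_days=[10]),   # antimicrobial inside the window
    Chart("C"),                            # no postoperative events recorded
]
print(review_queue(charts))
```

Charts that are not flagged never reach the reviewer, so any event the source system fails to record (for example, outpatient prescriptions) translates directly into lost sensitivity.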
Each of the hospitals was then sent the data elements necessary for the final algorithm. Actual code scripts were also sent to facilitate algorithm implementation; however, tailoring and adjustments were made to accommodate different data structures at each facility.
For clinical review at each center, we randomly selected up to 50 charts that had been flagged as positive by both the reference standard and the algorithm, and up to 50 charts that the algorithm had flagged but the reference standard classified as negative (i.e., false positives by the same standard). The reviewer was blinded to the result of routine surveillance and to the ratio of positives to negatives. Each chart was classified according to whether an SSI was present and, if so, the depth of the SSI. Charts not queued for review by the algorithm were considered negative by the manual review system.
Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated by comparing the modality's output against the reference standard.
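For reference, the four measures can be computed from a 2x2 table as in the sketch below, with the algorithm's output in the rows and the reference standard in the columns. The function name and the counts are hypothetical and are not taken from the study tables.

```python
from typing import Dict, Optional


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, Optional[float]]:
    """Compute sensitivity, specificity, PPV, and NPV from 2x2 counts.

    tp/fn: reference-standard SSIs that the algorithm did / did not flag.
    fp/tn: reference-standard non-SSIs that the algorithm did / did not flag.
    A measure is None when its denominator is zero.
    """
    def ratio(numerator: int, denominator: int) -> Optional[float]:
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical counts for one hospital.
metrics = diagnostic_metrics(tp=8, fp=92, fn=2, tn=898)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When SSIs are rare, the NPV stays close to 100 percent almost regardless of sensitivity, so sensitivity and PPV are the more informative measures for comparing algorithms.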
Human Subjects Research
This study was approved by the institutional review boards at all participating sites, including the University of Utah, IH, DH, and VVMC, as well as by the Research and Development Committee at VASLCHCS.
Table 2. Component rules of the rpart, inclusive, and simple algorithms
Total Knee Arthroplasty
Total Hip Arthroplasty
Abbreviations: CABG = coronary artery bypass grafting; CRP = C-reactive protein.
Algorithm Training and Testing Performance
For the VASQIP training set, the overall sensitivity of the recursive partitioning algorithm (for deep and organ-space SSIs combined) was 93.8 percent, and its specificity was 92.7 percent across all four surgical procedures; the positive and negative predictive values were 5.0 percent and 99.9 percent, respectively. Thus, we anticipated that an IP reviewing procedures identified by the algorithms would, on average, review 20.1 charts to find each SSI using the recursive partitioning algorithm, 57.3 charts using the "inclusive" algorithm, and 246.9 charts if all charts were reviewed.
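The chart counts above are the reciprocal of the PPV, and the PPV in turn follows from sensitivity, specificity, and prevalence. The sketch below reproduces the arithmetic under one assumption: that the figure of 246.9 charts per SSI when all charts are reviewed corresponds to a prevalence of 1/246.9. Function names are illustrative.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def charts_per_case(positive_predictive_value: float) -> float:
    """Expected number of flagged charts reviewed per SSI found."""
    return 1.0 / positive_predictive_value


# Assumption: reviewing every chart yields one SSI per 246.9 charts.
prevalence = 1.0 / 246.9
rpart_ppv = ppv(sensitivity=0.938, specificity=0.927, prevalence=prevalence)
print(f"PPV: {rpart_ppv:.3f}")
print(f"Charts per SSI found: {charts_per_case(rpart_ppv):.1f}")
```

Under these inputs the PPV comes out at roughly 5 percent and the review burden at about 20 charts per SSI, consistent with the reported values.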
When the algorithms were applied to the VASQIP test dataset, the overall sensitivity and specificity of the rpart algorithm were 73.1 percent and 92.9 percent, respectively, with a PPV of 3.9 percent and an NPV of 99.9 percent. The performance of the inclusive and simple algorithms was somewhat better and remained stable in both training and test sets.
We then applied our electronic algorithm to all surgical procedures that met our prespecified criteria at each principal hospital. The results are shown in Table 3. Overall, the sensitivity was 37.8 percent, the specificity was 94.3 percent, the PPV was 2.0 percent, and the NPV was 99.8 percent.
To investigate the reasons for false alarms by the algorithm at the various sites, we reviewed the false positives identified by the algorithm, as well as the positives identified by routine surveillance. At DH, the study reviewer agreed with all of the cases identified as positive by routine surveillance. Four surgeries were noted to have incorrect ICD-9 codes, indicating that they should not have been included. The study reviewer also identified one superficial SSI and one deep SSI queued by the algorithm but not found in routine surveillance records. At VASLCHCS, four additional deep and organ-space SSIs were identified by the study reviewer in addition to those identified by routine surveillance. At VVMC, all algorithm-identified cases were false positives. At IH, the study reviewer agreed with all positive cases identified by manual surveillance but with none identified by algorithm, except for two cases that appeared to have errors with identifiers.
False negatives were also reviewed at each center to determine the reasons for low sensitivity. At DH, two of the false negatives represented problems with the data pull. One SSI was assigned to the wrong hip replacement in the historical dataset. The hip replacement with infection was not in the dataset. Another procedure identified as having an SSI was actually a hysterectomy. Three surgeries were missed because the SSI occurred more than 30 days postoperatively. One SSI was missed because laboratories were only available from the outpatient setting. One SSI could only have been picked up from emergency department notes. Only two SSIs could have been picked up by electronic data, but they were missed due to the algorithm's threshold criteria.
Table 3. Accuracy of algorithm at each participating hospital
|Hospital|Algo no SSI: routine surveillance SSI|Algo no SSI: routine surveillance no SSI|Algo no SSI: total|Positive Predictive Value|Negative Predictive Value|
|---|---|---|---|---|---|
|DH|7|1345|1352|7.8%|99.5%|
|IH|16|10857|10873|1.3%|99.9%|
|VASLCHCS|2|531|533|5.7%|99.6%|
|VVMC|3|832|835|0.0%|99.6%|
Abbreviations: Algo = algorithm; DH = Denver Health; IH = Intermountain Healthcare; SSI = surgical site infection; VASLCHCS = VA Salt Lake City Health Care System; VVMC = Vail Valley Medical Center.
At VASLCHCS, only two SSIs were missed. Both occurred in total hip arthroplasties with onset of infection more than 30 days postoperatively. At VVMC, one infection was treated in the outpatient setting, and another was treated at an outside facility. The last infection developed 11 months after surgery and thus was not picked up because it occurred more than 30 days postoperatively.
At IH, 11 of 16 false negatives occurred because the algorithm missed important information in the notes and microbiology. All the data necessary to make an SSI diagnosis occurred after discharge from the initial surgery. In two cases, the reviewer thought that the cases were ambiguous; in another two, the reviewer disagreed that the cases were SSIs. In one case, the reviewer thought that the case was an sSSI rather than a dSSI or oSSI.
Our objective was to generate algorithms that would feature high sensitivity and require a low number of charts to review per SSI found; however, we found that our recursive partitioning algorithm had a low sensitivity in the testing set and even poorer performance when tested in outside hospitals. Our simpler algorithms were more robust, which suggests that the recursive partitioning algorithm was overfit (that is, fit too closely to the data, resulting in poor generalization) to both the sample data and the VA data. Performance was quite variable between facilities.
When SSI rates between facilities are compared, algorithm diagnostic accuracy and reliability must be carefully considered. Usually, routine prospective surveillance or some augmentation of it is used as a reference standard. Routine, manual, prospective surveillance is estimated to have a sensitivity between 30 percent and in excess of 90 percent, with most estimates in the 70 percent to 80 percent range.7,16,23–25 In addition, the reliability of manual healthcare-associated infection and SSI surveillance has been reported to be less than ideal.16,26–29 Any comparisons to such standards must take this into account.
Electronic algorithms are frequently reported to have sensitivities in excess of 80 percent.19,20 Only some of these algorithms have been applied to multiple hospitals, and none of them report individual hospital validation results among hospitals as heterogeneous as the principal hospitals in our study.7,13,16 Although our recursive partitioning algorithm had high sensitivity on the VASQIP training set, it was notably lower on the VASQIP test set. The pooled sensitivity at the four principal hospitals was only about 40 percent. These results contrast with the high performance seen in other published literature. Specificities and predictive values were relatively stable between our training and testing sets.
The differences in sensitivities that we saw in the recursive partitioning algorithm suggest that the model was overfit at two levels: first, overfitting to the training data, and second, overfitting to the VA data. One study in the literature used the same method to develop algorithms and reported high sensitivities;13 however, those algorithms were not applied to external data. We expected the sensitivity of the algorithms developed in our study to be high because of success with previously devised algorithms, and because we surmised that patients with either deep or organ-space SSIs were unlikely to have received neither antibiotic therapy nor any culture testing for etiologic microorganisms. However, when these algorithms were tested against other hospitals, sensitivity and PPV varied. At VASLCHCS, no improvement in sensitivity over the recursive partitioning algorithm was observed, perhaps due to small numbers. At IH, a relatively large number of false positives were generated; this appears to be largely due to the use of antimicrobials during the postoperative period at this center. At DH, the simple algorithm fared poorly, while the inclusive algorithm fared better, perhaps because the simple algorithm relied more heavily on antimicrobial prescribing, a large amount of which (on the outpatient side) may not be captured by the DH system. This underscores our concern that even more robust "common-sense" algorithms that included elements successful at other institutions still did not generalize well because of institutional differences in data collection and clinical practice.7,13,16
The strengths of our study include drawing from VASQIP data to amass a reasonable number of SSIs for training. Also, testing an algorithm that had already been derived with 10-fold cross validation against a single held-out portion of the VASQIP dataset, and then implementing it externally at other hospitals, presents a more realistic picture of algorithm accuracy and its variability. The main limitations of this work are related to three key issues. First is the use of routine, manual surveillance from each facility as the reference standard. Since the accuracy and reliability of manual SSI surveillance performed at different medical centers are thought to be suboptimal, cases identified as true positives at different centers may meet different sets of criteria. Second is the fact that small numbers of SSIs were observed in our four centers, limiting our ability to develop robust algorithms and to obtain an accurate assessment of their performance. The final issue is the variability in data availability and standardization across the different health care systems.
In the future, improving diagnostic sensitivity while keeping the number of charts needed to review low can only be accomplished by improving the algorithm's ability to distinguish between SSI and other abnormal conditions. This could be accomplished by using procedures more robust to sparse data for algorithm development, incorporating dynamic thresholds for laboratory values and vitals, and enriching the input data by using natural language processing to extract information from text notes. Any electronic algorithm used to compare SSI rates at different centers should undergo extensive testing before operational use.
The following key lessons were learned as a result of the work performed in this study, which will help guide future work in this area:
- Generating automated electronic algorithms to detect SSIs across disparate health care systems is complicated by incompatible or missing data and relatively small numbers of true cases.
- The reference standard for SSI surveillance—routine, manual, prospective surveillance—is suboptimal for comparison because of issues with sensitivity and reliability across different centers.
- Advanced methods for algorithm development, including procedures robust to sparse data and natural language processing, will likely be needed to produce algorithms useful for surveillance across disparate health care systems.
This project was funded under contract no. HHSA290200600020i, Task Order 8, from the Agency for Healthcare Research and Quality (AHRQ), U.S. Department of Health and Human Services; the VA Informatics and Computing Infrastructure (VINCI; VA HSR HIR 08-204); and the Consortium for Healthcare Informatics Research (CHIR; VA HSR HIR 08-374). The findings and conclusions in this document are those of the authors, who are responsible for its content, and do not necessarily represent the views of AHRQ. No statement in this report should be construed as an official position of AHRQ, the U.S. Department of Health and Human Services, or the Department of Veterans Affairs. This work was supported using resources and facilities at VASLCHCS, IH, DH, and VVMC. We also gratefully acknowledge the VA Surgical Quality Improvement Program (VASQIP) and VA Patient Care Services (PCS) for providing data and support for this project.
VA Salt Lake City Health Care System, Salt Lake City, UT (MAR, MJ). University of Utah, Salt Lake City, UT (MAR, MJ, RSE). Intermountain Healthcare, Salt Lake City, UT (JLH, JFL, LAS, RSE). Denver Health and Hospital, Denver, CO (CSP, MJD, SLM, WLB). University of Colorado School of Medicine, Denver, CO (CSP). VA Reno Medical Center, Reno, NV and Veterans Health Administration Office of Patient Care Services, Washington DC (CN). Vail Valley Medical Center, Vail, CO (HG).
Address correspondence to: Michael A. Rubin, MD, PhD, George E. Whalen Dept. of Veterans Affairs Medical Center, 500 Foothill Drive, Salt Lake City, UT 84148; Email: Michael.Rubin2@va.gov.
2. Goldrick BA. The Certification Board of Infection Control and Epidemiology white paper: the value of certification for infection control professionals. Am J Infect Control 2007 Apr;35(3):150-6. PMID: 17433937.
3. Stevenson KB, Murphy CL, Samore MH, et al. Assessing the status of infection control programs in small rural hospitals in the western United States. Am J Infect Control 2004 Aug;32(5):255-61. PMID: 15292888.
7. Bolon MK, Hooper D, Stevenson KB, et al. Improved surveillance for surgical site infections after orthopedic implantation procedures: extending applications for automated data. Clin Infect Dis 2009 May 1;48(9):1223-9. PMID: 19335165.
8. Hirschhorn LR, Currier JS, Platt R. Electronic surveillance of antibiotic exposure and coded discharge diagnoses as indicators of postoperative infection and other quality assurance measures. Infect Control Hosp Epidemiol 1993 Jan;14(1):21-8. PMID: 19335165.
9. Huang SS, Livingston JM, Rawson NS, et al. Developing algorithms for healthcare insurers to systematically monitor surgical site infection rates. BMC Med Res Methodol 2007 Jun 6;7:20. PMID: 17553168.
13. Sands K, Vineyard G, Livingston J, et al. Efficient identification of postdischarge surgical site infections: use of automated pharmacy dispensing information, administrative data, and medical record information. J Infect Dis 1999 Feb;179(2):434-41. PMID: 9878028.
14. Spolaore P, Pellizzer G, Fedeli U, et al. Linkage of microbiology reports and hospital discharge diagnoses for surveillance of surgical site infections. J Hosp Infect 2005 Aug;60(4):317-20. PMID: 16002016.
15. Stevenson KB, Khan Y, Dickman J, et al. Administrative coding data, compared with CDC/NHSN criteria, are poor indicators of health care-associated infections. Am J Infect Control 2008 Apr;36(3):155-64. PMID: 18371510.
21. Sands KE, Yokoe DS, Hooper DC, et al. Detection of postoperative surgical-site infections: comparison of health plan-based surveillance with hospital-based programs. Infect Control Hosp Epidemiol 2003 Oct;24(10):741-3. PMID: 14587934.
25. Rosenthal R, Weber WP, Marti WR, et al. Surveillance of surgical site infections by surgeons: biased underreporting or useful epidemiological data? J Hosp Infect 2010 Jul;75(3):178-82. PMID: 20227139.
26. Allami MK, Jamil W, Fourie B, et al. Superficial incisional infection in arthroplasty of the lower limb. Interobserver reliability of the current diagnostic criteria. J Bone Joint Surg Br 2005 Sep;87(9):1267-71. PMID: 16129756.
28. Mayer J, Howell J, Green T, et al. Assessing inter-rater reliability (IRR) of surveillance decisions by infection preventionists (IPs). Fifth Decennial International Conference on Healthcare-Associated Infections. 2010 Mar 19; Atlanta, GA.