Paper type: Proposal (eHealth)
Using Computational Approaches to Improve Risk-Stratified Patient Management: Rationale and Methods

Gang Luo1, PhD; Bryan L Stone2, MD, MS; Farrant Sakaguchi3, MD, MS; Xiaoming Sheng4, PhD; Maureen A Murtaugh5, PhD, RDN
1Department of Biomedical Informatics, University of Utah, Suite 140, 421 Wakara Way, Salt Lake City, UT 84108, USA
2Department of Pediatrics, University of Utah, 100 N Mario Capecchi Drive, Salt Lake City, UT 84113, USA
3Department of Family and Preventive Medicine, University of Utah, 375 Chipeta Way, Suite A, Salt Lake City, UT 84108, USA
4Department of Pediatrics, University of Utah, 295 Chipeta Way, Salt Lake City, UT 84108, USA
5Department of Internal Medicine, University of Utah, 295 Chipeta Way, Salt Lake City, UT 84108, USA
gang.luo@utah.edu, bryan.stone@hsc.utah.edu, farrant.sakaguchi@hsc.utah.edu, xiaoming.sheng@utah.edu, maureen.murtaugh@hsc.utah.edu


Introduction
Risk-stratified management of chronic disease patients

Table 1. Description of four patient management strategies.

Management strategy | Description
Case management | "A collaborative process that assesses, plans, implements, coordinates, monitors, and evaluates the options and services required to meet" a patient's "health and human service needs" [12]. It involves a case manager who calls the patient periodically, helps make doctor appointments, and arranges for health and health-related services.
Disease management | Example intervention: Check electronic medical records to find and call high-risk patients with the disease who require a specific test but have not had it for ≥2 years.
Supported self-care | Example intervention: Give patients electronic monitoring tools for self-management.
Wellness promotion | Example intervention: Mail educational materials on how to maintain health.
Chronic diseases affect ~52% of Americans and consume 86% of healthcare costs [1]. Management strategies include case management, disease management, supported self-care, and wellness promotion, listed in Table 1 in descending order of intensity. Each strategy is widely used and has its own benefits and properties [2,3]; e.g., most major employers purchase, and nearly all private health plans offer, case management services [2,4] that target early interventions at high-risk patients to prevent large expenditures and deterioration of health status. Proper use of case management can reduce hospital (re)admissions and emergency department visits by up to 30-40% [3,5-9], lower cost by up to 15% [6-10], and improve patient satisfaction, quality of life, and treatment adherence by 30-60% [5]. A case management program can cost >$5,000 per patient per year [6] and, due to resource limitations, typically enrolls only 1-3% of targeted patients [11]. For maximal benefit, only patients expected to incur the highest costs and/or to have the poorest prognosis should be enrolled.
Patients' healthcare use and costs have a pyramid-like distribution: a small portion of patients consume most healthcare resources and costs [13,14]. For instance, 25% and 80% of costs are spent on 1% and 20% of patients, respectively [11,14]. High costs often result from bad health outcomes or inappropriate use of healthcare. Typically, more intensive management strategies are more effective at improving health outcomes but are also more expensive. To use limited resources efficiently, risk stratification is widely used in managing patients with chronic diseases such as asthma, chronic obstructive pulmonary disease, diabetes, and heart diseases [13]. As shown in Fig. 1, available management strategies are arranged into a hierarchy [14]. Patients are stratified based on predicted risk [6], where risk can represent either high cost or a bad health outcome. Higher risk results in more intensive care to match expected returns [15]. For example, patients with predicted risk above the 99th percentile are put into case management, and so on.
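To make the stratification step concrete, the sketch below assigns patients to management strata by percentile cut-points on predicted risk. The specific cut-points (99th, 95th, 80th percentiles) and the four-level coding are hypothetical illustrations, not thresholds from this proposal; Aim 3 is precisely about computing such thresholds optimally rather than heuristically.

```python
import numpy as np

def stratify(predicted_risk, thresholds=(0.99, 0.95, 0.80)):
    """Assign each patient a management stratum by risk percentile.

    Hypothetical coding for illustration:
      3 = case management (above the 99th percentile)
      2 = disease management
      1 = supported self-care
      0 = wellness promotion
    """
    risk = np.asarray(predicted_risk, dtype=float)
    cuts = np.quantile(risk, thresholds)  # percentile cut-points, descending
    # A patient's stratum is the number of cut-points his/her risk reaches.
    return (risk[:, None] >= cuts[None, :]).sum(axis=1)
```

In practice the percentile boundaries would be replaced by the optimal thresholds computed in Aim 3, and case managers would still review charts before final enrollment decisions.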

Problems with the current risk-stratified patient management approach
The current risk-stratified patient management approach has three shortcomings, which result in many patients not receiving the most appropriate care and greatly degrade its outcomes.
First, existing methods for predicting individual patients' risk have low accuracy, resulting in mis-stratification. As shown in Allaudeen et al. [16], clinicians cannot predict well which patients will become high risk in the future. Criterion-based modeling uses a priori criteria to describe high-risk patients. It is ineffective partly due to regression to the mean: most patients who incurred high cost or healthcare use in one period will stop doing so in the next period [17]. Frequently, a predictive model of individual patient health outcomes or costs is used to automatically identify high-risk patients [5,18-23]. For instance, health plans in 9 of 12 communities are reported to use predictive modeling to identify candidate patients for case management [24]. For patients with predictions of the poorest outcomes or highest costs, case managers manually review patient charts and make final management decisions. Predictive modeling greatly outperforms clinicians and criterion-based modeling [17] and is the best method for identifying high-risk patients, yet it needs improvement.
Existing predictive models for individual patient health outcomes and costs have low accuracy. When predicting a patient's cost, the average error is usually as large as the average cost [25] and the R² accuracy measure is <20% [26]. When predicting a patient's health outcome, the area under the receiver operating characteristic (ROC) curve accuracy measure is often well below 0.8 [27, page 281, 28-31]. These large errors cause enrollment to align poorly with the patients who would benefit most from a management program [5]. As shown in Weir et al. [23], among the top 10% of patients who incurred the highest costs, >60% were missed in the top 10% risk group selected by a predictive model. Among the top 1% of patients who incurred the highest costs, >80% and ~50% were missed in the top 1% and 10% risk groups selected, respectively. Suppose a case management program could accommodate 1% of affected patients. Even if case managers had time to manually review the top 10% risk group selected by the model and made perfect enrollment decisions, they would still miss half of the top 1% of patients who incurred the highest costs. The case with health outcomes is similar [29,30].
Existing predictive models primarily use patient features only, implicitly assuming that a patient's health outcome and cost depend only on the patient's characteristics and are unrelated to the treating physician's characteristics, which are in fact influential. Treating physicians' characteristics, or physician profile features, have been exploited minimally in predictive modeling [28], leaving a knowledge gap.
Second, patients are at high risk for different reasons. Complex predictive models, including most machine learning models such as random forest, give no explanation for a prediction of high risk. Existing models also give no suggestion on interventions tailored to the patient's specific case. An intervention addressing the reason for being at high risk tends to be more effective than non-specific ones. For instance, for a patient who lives far from his/her physician and has difficulty accessing care, providing transportation can be effective.
A patient can be at high risk for multiple reasons, each corresponding to either a single patient or physician profile feature or a combination of multiple such features. A clinician may give the patient tailored interventions based on subjective and variable clinical judgment, but is likely to miss some suitable interventions due to three factors: (1) Large practice variation (e.g., by 1.6-5.6 times) exists across different clinicians, healthcare facilities, and regions [13,27,32-37]. (2) A clinician can concurrently process no more than a single-digit number of information items [38], making it difficult to identify all of these reasons given the vast number of possible feature combinations. (3) Clinicians usually give interventions addressing patient factors only and miss those addressing physician factors. For instance, a physician may be unfamiliar with the patient's disease; providing the physician continuing medical education on it can be effective.
Third, thresholds for risk strata are decided heuristically with no quality guarantee, leading to unnecessarily increased costs and/or suboptimal health outcomes. For instance, total future cost of all patients factoring in the management programs' costs is unlikely to be minimized even under the unrealistic assumption that we know exactly (1) each patient's future risk and (2) every program's impact on each patient's future cost if the patient is put into the program. Total future cost implicitly reflects patient health outcomes and the management programs' benefits. For instance, fewer hospitalizations usually lead to lower costs.

Improving prediction accuracy, explaining prediction results, suggesting tailored interventions, and computing optimal thresholds
New techniques are needed to improve risk-stratified patient management so that more patients can receive the most appropriate care. To fill the gap, we will (1) combine patient, physician profile, and environmental variable features to improve prediction accuracy of individual patient health outcomes and costs; (2) develop an algorithm to explain prediction results and suggest tailored interventions; (3) develop an algorithm to compute optimal thresholds for risk strata; and (4) conduct simulations to estimate outcomes of risk-stratified patient management under various configurations. A physician's practice profile contains his/her own information as well as historically aggregated clinical and administrative data of his/her patients. We hypothesize that using our techniques will increase prediction accuracy, improve outcomes, and reduce costs. The explanations and suggestions provided by our algorithm can help clinicians prioritize interventions and review structured attributes in patient charts more efficiently, and will be particularly useful for clinicians who are junior or unfamiliar with how to handle certain types of patients. After our methods identify the patients with the highest predicted risks and give explanations and suggestions, clinicians would review patient charts, consider various factors (e.g., social factors, how likely the patient's health outcome is to improve substantially [39, page 101]), and make final decisions on the management levels and interventions for these patients, as is often done in case management.

Innovation
This study is innovative for several reasons: (1) We will develop the first algorithm to a) explain prediction results, which is critical for clinicians to trust the results, and b) suggest tailored interventions. Currently no algorithm can do the latter. Our algorithm will explain results for any predictive model without degrading accuracy and solve a long-standing open problem. In contrast, existing explanation methods are usually model specific and decrease accuracy [40,41]. (2) We will transform risk-stratified patient management by personalizing management strategies based on objective data. At present, clinicians give interventions based on subjective and variable clinical judgment, and miss some of the suitable interventions for many high-risk patients. (3) The added value of physician profile features in predicting health outcomes and costs has never been systematically studied. We will include physician profile characteristics to construct new features and build new predictive models accurate for individual patients. (4) To better predict individual patient costs, we will develop a new and general technique for reducing features, a.k.a. independent variables. The technique can increase the prediction accuracy of any continuous outcome variable with a complex non-linear relationship with many independent variables. This is particularly useful when standard feature selection techniques [42] cannot narrow down many independent variables to a few effective features. (5) We will develop the first algorithm to compute optimal thresholds for risk strata. These thresholds aim at maximizing total expected return on the entire patient population, and will be better than those determined heuristically. Currently no algorithm exists for this purpose. (6) When a predictive model is used, our study will estimate outcomes of risk-stratified patient management with multiple management strategies. No such estimates have been provided before. 
Previous studies have estimated outcomes for a single management strategy: case management [43]. (7) We will use a new simulation method to determine which attributes are the most important to include in the predictive model. Different combinations of attributes will be used to determine the minimum performance requirement and allow tradeoffs for adapting use of our models beyond our setting based on available attributes. Previous predictive models have relied on a fixed set of attributes, which may not be collected by other sites, and thus do not generalize beyond the study site. (8) Often, a specific technique is useful for only a single disease or decision support application. In contrast, after proper extension our new techniques will generalize to a variety of decision support applications and disease settings. Examples of opportunities for future studies are: a) More precise models for health outcomes and costs will augment various decision support applications for managing limited resources, such as assisting with healthcare resource allocation planning [44], and automatically identifying patients likely to be admitted or readmitted in the near future, triggering earlier follow-up appointments or home visits by nurses to reduce admissions and readmissions. b) Adding physician profile features can improve prediction accuracy of other outcomes such as patient satisfaction [45], patient adherence [46], and missed appointments [47]. This would facilitate targeting resources, such as print and telephone reminders to reduce missed appointments [47], or interventions to improve treatment adherence [46]. c) The algorithm for explanations and suggestions can be used to explain prediction results and suggest interventions for various applications, such as to reduce missed appointments. d) The threshold computation algorithm can help target resources for various applications. e) Our simulation method can be used to deploy other predictive models in clinical practice.
In summary, the significance of this study is development of new techniques to help transform risk-stratified patient management and personalize management strategies so that more patients will receive the most appropriate care. Broad use of our techniques will improve clinical outcomes, patient satisfaction, and quality of life, and reduce healthcare use and cost.

Methods
Machine learning is a computer science area that studies computer algorithms that improve automatically through experience. Machine learning methods, such as neural network, decision tree, and support vector machine, are widely used for predictive modeling [48] and will be used in our study. With less strict assumptions, e.g., on data distribution, machine learning can achieve higher prediction accuracy than statistical methods, sometimes doubling it [11,49,50].

Data sets and test cases
This study will use a large clinical and administrative data set in Intermountain Healthcare (IH)'s enterprise data warehouse (EDW) for all four aims. IH is the largest healthcare system in Utah, with 185 clinics and 22 hospitals. IH's EDW contains ~9,000 tables and an extensive set of attributes [51]. Partial lists of patient and physician attributes follow.
Patient attributes: admission date and time; age; orders (medications, labs, exams, immunizations, imaging, counseling, etc.), including order name, ordering provider, performing date, and result date; allergies; barriers (hearing, language, learning disability, mental status, religion, vision, etc.); cause of death; chief complaint; death date; diagnoses; discharge date; exam results; facility seen for the patient visit; gender; health insurance; healthcare cost (billed charge, Intermountain-internal cost, and reimbursed cost); height; home address; immunizations; lab test results; language(s) spoken; medication refills; primary care physician as listed in the electronic medical record; problem list; procedure dates; procedures; providers involved in the visit; race/ethnicity; referrals; religion; visit type (inpatient, outpatient, urgent care, or emergency department); vital signs; weight; …
Physician attributes: age; gender; health insurances accepted; level of affiliation with IH; office location(s); specialties; type of primary care physician; years in practice; …
Our contracted IH data analyst will execute Oracle database SQL queries to extract a de-identified version of the data set, encrypt it, and transfer it securely to a password-protected and encrypted computer, on which we will perform secondary analysis. IH uses dedicated tables to track changes in diagnosis and procedure codes over time. The data set contains information on patient encounters over the past 11 years.
For the last five years, data captured for children cover more than 400 pediatric primary care physicians, 360,698 pediatric patients (age 0 to 17), and 1,557,713 clinical encounters per year. Data captured for adults cover more than 600 primary care physicians, 878,448 adult patients (age ≥18), and 5,786,414 clinical encounters per year. Asthma prevalence is ~7.6% in the IH pediatric population and ~8.6% in the IH adult population. The data set includes ~400 attributes and represents electronic documentation of ~85% of pediatric care and ~60% of adult care delivered in Utah [33,52]. IH dedicates extensive resources to data accuracy and integrity. Due to its large size and attribute richness, the data set gives us many advantages for exploring the proposed predictive models.
In addition, we will use 21 environmental variables recorded over 11 years by regional monitoring stations within the geographic area covered by IH. These variables include PM2.5, PM10, CO, NO2, SO2, O3, temperature, relative humidity, wind speed, precipitation, dew point, and activities of viruses (adenovirus, enterovirus, human metapneumovirus, influenza A virus, influenza B virus, parainfluenza virus types 1, 2, and 3, rhinovirus, and respiratory syncytial virus). Since the monitoring stations are spread across a large geographic area including the entire state of Utah, at any time the readings of the same environmental variable can differ greatly across monitoring stations.
Using IH data, we will demonstrate our techniques on the test case of asthma patients. In the U.S., asthma affects 18.7 million adults (8%) [53] and 7.1 million children (9.6%) [54,55]. Patient management strategies such as case management can ensure proper care to reduce asthma exacerbations, improve school attendance and performance, and reduce hospitalizations and emergency department visits. This affects both quality of life and the 63% of total annual asthma costs attributable to asthma exacerbations [8,56].
Our analysis results will use different combinations of attributes to determine the minimum performance requirement and allow tradeoffs for adapting use of our models beyond our setting based on available attributes. Our results will provide a cornerstone to expand testing of our techniques on other clinical data sets, patient populations, and diseases beyond asthma in the future. As patient status and feature patterns associated with high risk change over time, our techniques can be periodically re-applied, e.g., to move patients across different management levels and identify newly occurring feature patterns.
Aim 1: Combine patient, physician profile, and environmental variable features to improve prediction accuracy of individual patient health outcomes and costs.
Aim 1.a: Build predictive models for individual patient health outcomes.
Framework: We will apply the framework shown in Fig. 2 to build predictive models using patient, physician profile, and environmental variable features. Environmental variables impact outcomes of certain diseases such as asthma [57,58]. The models will be used to predict individual patient health outcomes. For each physician, we build a practice profile including his/her own (e.g., demographic) information as well as aggregated historical information on his/her patients (excluding the index patient) from the provider's electronic medical record and administrative systems. An example physician practice profile attribute is the number of the physician's patients with a specific disease [59]. We use patient attributes to form patient features. We use both patient and physician practice profile attributes to form physician profile features. Each feature is formed from one or more base attributes. If the outcome variable is affected by environmental variables, we also use environmental variable attributes to construct features. Predictive models are built using patient, physician profile, and environmental variable features.
There is an almost infinite number of possible such features. In addition, factors such as the characteristics of a pediatric patient's parents can impact patient outcomes. This study's purpose is not to list all possible features, exhaust all possible factors that can impact patient outcomes, or reach the theoretical limit of maximum possible prediction accuracy. Instead, our goal is to demonstrate that adding physician profile features can improve prediction accuracy and, subsequently, risk-stratified patient management. A non-trivial improvement in health outcomes and/or reduction in costs can benefit society greatly. As is typical with predictive modeling and adequate for our targeted decision support application, our study focuses on associations.
Data pre-processing: We will use established techniques, such as imputation, to deal with missing values and to detect and remove/correct invalid values [48,60]. For environmental variables, we will use standard methods [61,62] to obtain aggregate values, such as monthly averages, from raw values. For administrative and clinical attributes, we will use grouper models such as the Diagnostic Cost Groups (DCG) system to group diseases, procedures, and drugs and thereby reduce features [13, Chapter 5; 25].
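The two pre-processing steps for environmental variables can be sketched as below. The data, column name, and the forward-fill imputation rule are illustrative assumptions; the proposal only commits to established imputation techniques and standard aggregation methods.

```python
import pandas as pd

# Hypothetical daily PM2.5 readings with gaps; values are illustrative only.
readings = pd.DataFrame(
    {"pm25": [12.0, None, 15.0, 14.0, None, 18.0]},
    index=pd.date_range("2014-01-30", periods=6, freq="D"),
)

# Impute missing values; forward fill stands in for the established
# imputation techniques cited in the text.
readings["pm25"] = readings["pm25"].ffill()

# Aggregate raw daily values into monthly averages.
monthly = readings["pm25"].resample("MS").mean()
```

Grouper models for diagnoses, procedures, and drugs would be applied analogously, mapping many raw codes to a smaller set of group-level features.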

Patient features:
We will use standard patient features, such as age and diagnoses, that have been studied in the clinical predictive modeling literature [13,27,48]. Commonly used features are listed in Luo [32] and Schatz et al. [29].
Physician profile features: Some physician profile features are computed using only physician practice profile attributes. Examples of such features are:
1) The logarithm of the normalized number of a physician's patients with a specific characteristic, such as a specific disease, gender, race, or age range. Here, a logarithm is used to diminish the difference in the number across physicians.
2) The logarithm of the number of specific procedures performed by a physician.
3) The mean outcome of a physician's patients with a specific disease. If a physician does not have enough patients with a specific disease, we will set the disease's mean outcome in the physician's practice profile to the mean outcome of all patients with the disease.
4) The average cost of a physician's patients with a specific disease.
5) The average ratio of chronic controller to total asthma medications of a physician's asthma patients. This ratio is an asthma care quality measure [63-66].
6) The mean of a feature of a physician's (pediatric) asthma patients with desirable/undesirable outcomes.
7) A physician's age.
8) The number of a physician's office hours per week.
9) A physician's years in practice.
10) A physician's specialty.
Other physician profile features are formed by combining patient and physician practice profile attributes, characterizing the match between patient and physician. Examples of such features are:
1) The distance between the physician's office and the patient's home.
2) An indicator of whether the physician and patient are of the same gender [67].
3) An indicator of whether the physician and patient speak the same language.
4) An indicator of whether the physician accepts the patient's insurance.
The above two lists of physician profile features are for illustration purposes only and by no means exhaustive. More physician profile features will be investigated in this study. When a patient is managed by multiple physicians simultaneously, the patient's outcomes are affected by the profile features of all of these physicians. A traditional method for handling this situation is to use episode grouper software to split the whole span of patient care into episodes and assign each episode to a single physician [13, page 265, 68]. An episode of care is "a series of temporally contiguous healthcare services related to treatment of a given spell of illness or provided in response to a specific request by the patient or other relevant entity" [27, page 84, 69]. Apart from the episode method, we will investigate other methods to combine multiple physicians' profile features.
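Two of the feature types above can be sketched as follows. The exact normalization for the log-count feature is not specified in the text, so the proportion with add-one smoothing used here is an assumption, as are the function names.

```python
import math

def log_normalized_count(n_with_trait, n_total):
    """Log of the normalized number of a physician's patients with a
    given characteristic (e.g., a specific disease). Normalizing by the
    physician's total patient count and add-one smoothing to avoid
    log(0) are illustrative assumptions."""
    return math.log((n_with_trait + 1) / (n_total + 1))

def same_language(physician_langs, patient_lang):
    """Indicator feature: physician and patient speak the same language."""
    return int(patient_lang in physician_langs)
```

The match-type features (distance, gender, language, insurance) are all computed this way, by combining one patient attribute with one physician practice profile attribute.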

Environmental variable features:
We will use standard environmental variable features, such as monthly averages, from the clinical predictive modeling literature [57].
Definition of asthma cases and outcomes: As test cases, we will focus on primary care physicians and develop and test our ideas using (i) pediatric asthma and (ii) adult asthma. The method described in Schatz et al. [29,70,71] will be used to identify asthma patients. A patient is considered to have asthma if he/she has (1) ≥1 ICD-9 diagnosis code of asthma (493.xx) or (2) ≥2 asthma-related medication dispensing records (excluding oral steroids) in a one-year period, including inhaled steroids, β-agonists (excluding oral terbutaline), oral leukotriene modifiers, and other inhaled anti-inflammatory drugs [29]. We will use two outcome measures for asthma: (1) primary outcome: whether acute care (inpatient stay, urgent care, or emergency department visit) with a primary diagnosis of asthma (ICD-9 code: 493.xx) occurred for a patient in the following year [28,29,31,32,56,72,73]; and (2) secondary outcome: the total amount of reliever medication and oral steroid medication for acute asthma exacerbations that a patient refilled in the following year. The total refill amount reflects the number and degree of asthma exacerbations experienced by the patient [63,64] and is available in our data set.
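The case definition above reduces to a simple rule; a minimal sketch follows, assuming the qualifying dispensing records have already been counted within a one-year window and filtered to exclude oral steroids.

```python
def has_asthma(diagnosis_codes, asthma_med_dispenses_in_year):
    """Asthma case identification per the rule described in the text:
    >=1 ICD-9 asthma diagnosis code (493.xx), or >=2 asthma-related
    medication dispensing records (excluding oral steroids) in one year."""
    has_dx = any(code.startswith("493") for code in diagnosis_codes)
    return has_dx or asthma_med_dispenses_in_year >= 2
```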
Predictive models: We will use Weka [74], a widely used open-source machine learning and data mining toolkit, to build predictive models. Weka integrates an extensive set of popular machine learning algorithms, ensemble techniques combining multiple predictive models, feature selection techniques, and methods for handling the imbalanced class problem. Both numerical and categorical variables appear in clinical, administrative, and environmental data. We will use supervised algorithms that can handle both types of variables, such as decision tree and k-nearest neighbor. We will test every applicable algorithm and manually tune hyper-parameters.
The accuracy achieved by state-of-the-art predictive models is usually far below 80% [28,29]. We would regard Aim 1.a as partially successful if we can improve accuracy by ≥10% for either pediatric or adult asthma, and as completely successful if we can improve accuracy by ≥10% for both. Given a set of features, we will use three methods to improve model accuracy. First, some features are unimportant or highly correlated with each other, which may degrade model accuracy. To address this, we will use standard feature selection techniques, such as the information gain method, to identify the important features that will be used in the model [28,42,74]. Second, for a categorical outcome variable with two values, the corresponding two classes in our data set can be imbalanced, meaning that many more instances exist for one class than for the other. This can degrade model accuracy. We will use standard techniques such as SMOTE (Synthetic Minority Over-sampling TEchnique) to address this [74]. Third, we will try ensemble techniques, such as random forest, that combine multiple models and usually work better than individual models [74].
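The three accuracy-improvement steps can be sketched as a single pipeline. This sketch uses Python/scikit-learn rather than the Weka toolkit named in the text: mutual-information scoring stands in for Weka's information gain selection, class weighting stands in for SMOTE-style rebalancing, and the synthetic data set is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline

# Synthetic imbalanced two-class data standing in for the clinical data set.
X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

model = make_pipeline(
    # Step 1: information-gain-style feature selection.
    SelectKBest(mutual_info_classif, k=10),
    # Steps 2 and 3: class weighting to counter imbalance, inside a
    # random forest ensemble.
    RandomForestClassifier(n_estimators=100, class_weight="balanced",
                           random_state=0),
)
model.fit(X, y)
```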
Accuracy evaluation and sample size justification: We have 11 years' data. We will use a standard approach to train and test predictive models. We will conduct stratified 10-fold cross-validation [74] on the first 10 years' data to train the models and estimate their accuracy. The 11th year's data will be used to assess the best models' performance, reflecting use in practice. For categorical outcome variables, we will use the standard performance metric of the area under the ROC curve (AUC) [74] to select the best model. For continuous outcome variables, we will use the standard performance metric of R² to select the best model and also report Cumming's Prediction Measure (equivalent to the Mean Absolute Prediction Error) [25,32]. To determine the clinical, administrative, and environmental variable attributes essential for high accuracy, backward elimination [48] will be used to drop independent variables as long as the accuracy does not drop by >0.02.
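The evaluation protocol (stratified 10-fold cross-validation scored by AUC) can be sketched as follows; the logistic regression learner and synthetic data are placeholders for the algorithms and data set described in the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data standing in for the first 10 years of IH data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Stratified 10-fold cross-validation scored by AUC, as in the plan.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
mean_auc = scores.mean()
```

The held-out 11th year's data would then be scored once with the best model selected by this procedure.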
We will test the hypothesis that adding physician profile features can increase prediction accuracy twice: once for children and once for adults. We will compare the accuracies achieved by two predictive models using the best machine learning algorithm: the first model will use patient, physician profile, and environmental variable features; the second, only patient and environmental variable features. We will accept the hypothesis if the first model achieves higher accuracy (AUC or R²) than the second by ≥10%.
Consider the categorical outcome variable of acute care use with two values (classes). A predictive model using only patient and environmental variable features usually achieves an AUC far below 0.8 [28,29]. Using a two-sided Z-test at a significance level of 0.05 and assuming for both classes a correlation coefficient of 0.6 between the two models' prediction results, a sample size of 137 instances per class has 90% power to detect a difference of 0.1 in AUC between the two models. The 11th year's data include about 27,000 children and 75,000 adults with asthma, providing adequate power for testing our hypothesis. To train a predictive model well, the ratio of the number of data instances to the number of features should typically be 10 or larger. In our case, at most a few hundred features will be used, so our data set is large enough for training the predictive models. The case with the continuous outcome variable is similar (see Aim 1.b, sample size justification).

Aim 1.b: Build predictive models for individual patient costs.
We will use an approach similar to that in Aim 1.a, changing the prediction target from health outcomes to individual patients' total costs in the following year [13,25,27]. Each medical claim is associated with a billed cost, an Intermountain-internal cost, and a reimbursed cost [13, page 43]. We will use the Intermountain-internal cost [33], which is less subject to variation due to member cost-sharing [13, page 45] and reflects actual cost more closely. To address inflation, we will standardize all costs to 2014 dollars using the medical consumer price index [75].
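The inflation adjustment described above is a simple ratio computation; a minimal sketch follows. The index values in the lookup table are hypothetical placeholders, not real medical consumer price index figures, which would come from published CPI data.

```python
# Hypothetical medical-care CPI values for illustration; real values
# would be taken from the published medical consumer price index.
MEDICAL_CPI = {2010: 388.4, 2014: 435.3}

def to_2014_dollars(cost, year):
    """Standardize a cost to 2014 dollars using the medical consumer
    price index, as described in the text."""
    return cost * MEDICAL_CPI[2014] / MEDICAL_CPI[year]
```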
Besides the rare use of physician profile features, two other major factors also cause low accuracy in predicting an individual's cost. First, most existing work on predicting costs uses linear regression models [13,25,27]; in reality, costs are far from following a linear model [26]. Second, the cost of a patient with a specific disease is the cost of treating all of his/her diseases [25]. To account for this, each model uses many features (independent variables), e.g., one feature per disease, and can easily have insufficient training data [48, page 102]. To address these two problems, we will try non-linear, disease-specific machine learning models, which were proposed in our recent paper [32] but have not yet been implemented. The key idea of this method is to reduce features by merging several less important features into one feature while keeping important features separate. The current approach of identifying important features and grouping the others is manual. We will also investigate automatic approaches. For example, we can regard the top features with the largest associations with the outcome variable as the important ones; the remaining features are then clustered using a similarity metric to form groups. The automatic approach is general and can be used to improve prediction accuracy of any continuous outcome variable that has a complex non-linear relationship with many independent variables.
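The feature-reduction idea can be sketched as below. For simplicity, this sketch merges all less important features into a single averaged feature; the proposal instead envisions clustering them into several groups with a similarity metric, and the simple-mean merge rule is an assumption, since the text leaves the grouping and merging methods open.

```python
import numpy as np

def reduce_features(X, importance, n_keep):
    """Keep the n_keep features most associated with the outcome as
    separate columns and merge the rest into one averaged column.
    X: (n_samples, n_features) array; importance: per-feature scores."""
    order = np.argsort(importance)[::-1]          # features by descending importance
    keep, rest = order[:n_keep], order[n_keep:]
    merged = X[:, rest].mean(axis=1, keepdims=True)  # merge rule: simple mean
    return np.hstack([X[:, keep], merged])
```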

Sample size justification:
In predicting an individual's cost, a predictive model using only patient and environmental variable features usually achieves an R² <20% [26]. Using an F-test at a significance level of 0.05 and assuming the presence of 70 patient and environmental variable features, a sample size of 245 patients has 90% power to detect an increase of 10% in R² attributed to 30 physician profile features. The 11th year's data include about 27,000 children and 75,000 adults with asthma, providing adequate power for testing our hypothesis of an increase of ≥10% in R².
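This power computation can be sketched with the noncentral F distribution. The sketch below is illustrative, not the exact software used for the proposal; one common convention (assumed here) sets the noncentrality parameter to f² × n, where f² is Cohen's effect size for the R² increase.

```python
from scipy.stats import f as f_dist, ncf

def power_r2_increase(n, k_base=70, k_added=30, r2_base=0.20,
                      delta_r2=0.10, alpha=0.05):
    """Power of the partial F-test that k_added extra predictors raise
    R^2 from r2_base to r2_base + delta_r2, at sample size n."""
    df1 = k_added                                  # numerator df: added predictors
    df2 = n - k_base - k_added - 1                 # denominator df: residual
    f2 = delta_r2 / (1.0 - (r2_base + delta_r2))   # Cohen's f^2 effect size
    ncp = f2 * n                                   # noncentrality (one convention)
    f_crit = f_dist.ppf(1.0 - alpha, df1, df2)     # critical value under H0
    return 1.0 - ncf.cdf(f_crit, df1, df2, ncp)    # P(reject H0 | H1 true)
```

Power increases with the sample size n, so the ~27,000 children and ~75,000 adults available vastly exceed the 245 patients needed.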
Our goal is to achieve a ≥10% improvement in accuracy. If our models cannot achieve high accuracy on the entire group of asthma patients, we will build separate models for different subgroups of asthma patients. Patient subgroups are defined by specific characteristics, such as age, prematurity, co-morbidity, or insurance type, that are usually independent variables of the original models. If our models still cannot achieve high accuracy, we will conduct sub-analyses to identify patient subgroups on which our models perform well. In this case, our final models will be applied only to the identified patient subgroups.
A missing data problem occurs when a patient has several physicians belonging to different provider groups, with no single provider having complete information on the patient. We anticipate that adding physician profile features can improve prediction accuracy even if some data are missing. The missing data problem is unlikely to be an issue for children in our case, as IH provides ~85% of pediatric care in Utah [52]. If the IH EDW is missing too much data for adults, we will use claim data in the all-payer claims database [76] to compensate. When our predictive models are applied to other healthcare systems in the future, the same compensation strategy can be used. Also, we expect missing data problems to be uncommon in Health Maintenance Organization (HMO) settings, where all physicians managing the patient belong to the same provider group, and the provider's electronic medical record and administrative systems usually have all medical data collected on the patient [77].
As mentioned in "Definition of asthma cases and outcomes," identifying asthma requires medication order and refill information. Our data set includes this information, as IH has its own health insurance plan (SelectHealth [78]). If the IH EDW is missing too much refill information, we will use claim data in the all-payer claims database [76] to compensate. If adding physician profile features cannot significantly increase prediction accuracy for asthma, we will choose chronic obstructive pulmonary disease or heart diseases for Aims 1-4.
We have a large data set. If we experience scalability issues using Weka, we will use a parallel machine learning toolkit such as Spark's MLlib [79][80][81] to build predictive models on a secure computer cluster available to us at the University of Utah Center for High Performance Computing [82].

Aim 2: Develop an algorithm to explain prediction results and suggest tailored interventions.
For patients with predicted risk above a pre-determined threshold, such as the 95th percentile, this aim will explain prediction results and suggest tailored interventions. These explanations and suggestions can help clinicians make final decisions on the management levels and interventions for these patients. Prediction accuracy and model interpretability are frequently conflicting goals: a model achieving high accuracy is usually complex and difficult to interpret. How to achieve both goals simultaneously has been a long-standing open problem.
Our key idea to solve this problem is to separate prediction and explanation by using two models concurrently, each for a different purpose. The first model makes predictions and targets maximizing accuracy. In this study, this model is the best one built for the outcome variable in Aim 1. The second model is rule-based and easy to interpret. It is used to explain the first model's results rather than make predictions. The rules used in the second model are mined directly from historical data rather than coming from the first model. For each patient whom the first model predicts to be at high risk, the second model will show zero or more rules. Each rule gives a reason why the patient is predicted to be at high risk. Since some patients can be at high risk for rare reasons that are difficult to identify, we make no attempt to ensure that at least one rule will be shown for every patient predicted to be at high risk. Instead, we focus on common reasons that are more important and relevant to the patient population than rare ones. We expect most high-risk patients to be covered by one or more common reasons.
We will use an associative classifier [83][84][85] from the data mining field as the second model. Associative classifiers can handle both numerical and categorical variables and can be built efficiently from historical data. Compared to several other rule-based models, an associative classifier includes a more complete set of interesting and useful rules and can better explain prediction results. For ease of description, our presentation focuses on the case where each patient has exactly one data instance (row). The case where a patient has more than one data instance can be handled similarly. We will proceed in three steps.
Step 1: Mine association rules from historical data. As mentioned in Aim 1, each patient is described by the same set of patient, physician profile, and environmental variable features, and labelled as either high risk or not. An associative classifier includes a set of class-based association rules. Each rule includes a feature pattern associated with high risk and is of the form: p1 ∧ p2 ∧ … ∧ pk → high risk. Here, ∧ is the logical AND operator. The value of k varies across different rules. Each item pi (1≤i≤k) is a feature-value pair of the form (f, v) indicating that feature f takes a value equal to v (if v is a value) or within v (if v is a range). The rule suggests that a patient is likely to be at high risk if he/she satisfies p1, p2, …, and pk. An example rule is: the patient was hospitalized for asthma last year ∧ the patient's primary care physician has <10 asthma patients → high risk.
For a given association rule R, the percentage of patients satisfying R's left side and being at high risk reflects R's coverage and is called R's support. Among all patients satisfying R's left side, the percentage of patients at high risk reflects R's accuracy and is called R's confidence. An associative classifier includes association rules at a given level of minimum support (e.g., 1%) and confidence (e.g., 70%). These rules can be efficiently mined from historical data using existing techniques [83][84][85], which can eliminate redundant and noisy rules. As we need only rules suggesting high risk, we can mine desired feature patterns, i.e., the rules' left side, from high-risk patients' data rather than from all patients' data to improve the efficiency of rule generation.
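The support and confidence computations, restricted to patterns drawn from high-risk patients' data, can be sketched as below. This brute-force illustration uses a simplified data format of our own devising (each patient as a set of feature-value items plus a high-risk label); production mining would use Apriori or FP-growth style algorithms as in the cited techniques [83-85].

```python
from itertools import combinations

def mine_rules(patients, min_support=0.01, min_confidence=0.70, max_items=4):
    """patients: list of (item_set, is_high_risk) pairs. Returns rules of the
    form (pattern, support, confidence) suggesting high risk, where
    support    = fraction of all patients matching the pattern AND at high risk,
    confidence = fraction at high risk among patients matching the pattern."""
    n = len(patients)
    # candidate patterns come only from high-risk patients' data
    candidates = set()
    for items, high in patients:
        if high:
            for k in range(1, max_items + 1):
                candidates.update(combinations(sorted(items), k))
    rules = []
    for pattern in candidates:
        matched = [high for items, high in patients if set(pattern) <= items]
        hits = sum(matched)
        support = hits / n
        confidence = hits / len(matched)
        if support >= min_support and confidence >= min_confidence:
            rules.append((pattern, support, confidence))
    return rules
```

Restricting candidate generation to high-risk patients keeps the search space small, since only rules whose right side is "high risk" are wanted.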
Typically, many association rules will be mined from historical data [83][84][85][86]. Keeping all of them would overwhelm clinicians. To address this issue, we will use three methods to reduce the number of rules. First, in forming rules, we will consider only features appearing in the first model, the one used to make predictions. As mentioned in Aim 1.a, many nonessential features will be removed during feature selection and backward elimination when building the first model. Second, we will focus on rules with no more than a pre-determined small number (e.g., 4) of items, as long feature patterns are difficult to understand and act on [83]. Third, users can optionally specify, for a feature, which values or types of range (e.g., stating that the feature is above a threshold) may potentially indicate high risk and appear in rules [40,87]. The other values or types of range are not allowed to appear in rules. This also helps form clinically meaningful rules.
Step 2: List interventions for the mined association rules. Through discussion and consensus, our clinical team will examine the mined association rules and remove those that make little or no clinical sense. For each remaining rule, the clinicians will list zero or more interventions addressing the reason given by the rule. Example interventions for patients include: 1) Provide transportation or telemedicine for a patient living far from his/her physician. 2) Schedule longer or more frequent doctor appointments for a patient with multiple co-morbidities. 3) Schedule appointments with nurse educators or clinical pharmacists for a patient with multiple co-morbidities. 4) Arrange language services for doctor appointments if the patient and physician speak different languages. 5) Give wearable air purifiers to certain types of asthma patients living in an area with bad air quality. Example interventions at the system level include: 1) Provide the primary care physician continuing medical education on a specific disease, cultural competence, women's health, or pediatric health if he/she is unfamiliar with, or cannot manage well, the disease, patients of a particular race, diseases in women, or pediatric diseases. A physician may be unfamiliar with a disease if he/she has few patients with it. A poor mean outcome among a physician's patients with the disease may indicate, though not always, that the physician cannot manage the disease well. 2) Extend physician office hours. 3) Open a new primary care clinic in an area with no such clinic nearby.
Interventions for patients are displayed to clinicians in Step 3. Interventions at the system level are optional and may be viewed only by managers of the healthcare system. We call a rule actionable or non-actionable based on whether or not at least one intervention is associated with it. The remaining rules and their associated interventions will be stored in a database to facilitate reuse.
Step 3: Explain prediction results and suggest tailored interventions. At prediction time, for each patient identified as high risk by the first model, we will find all association rules whose left side the patient satisfies, using an index on rules [84]. We will display the actionable rules above the non-actionable ones, each group sorted in descending order of confidence [84]. If two rules have equal confidence, the rule with higher support will be ranked higher. If two rules have the same confidence and support, the one with fewer items will be ranked higher. Our rule sorting method differs from several traditional ones [83][84][85], as our goal is to explain the prediction result for a patient rather than to maximize the average prediction accuracy in a patient group. We will list confidence and associated interventions, if any, next to each rule to help the clinician identify suitable tailored interventions. By default we will show no more than a pre-determined small number of rules (e.g., 3). If desired, the clinician can opt to view all rules applicable to the patient.
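The rule ordering described in Step 3 reduces to a single sort key; a minimal sketch (the dictionary field names are our own):

```python
def rank_rules(rules):
    """rules: list of dicts with keys 'actionable' (bool), 'confidence',
    'support', and 'items' (tuple). Sorts per the ordering in Step 3."""
    return sorted(rules, key=lambda r: (not r["actionable"],   # actionable first
                                        -r["confidence"],      # higher confidence
                                        -r["support"],         # then higher support
                                        len(r["items"])))      # then fewer items

def top_rules(rules, limit=3):
    """By default, show no more than `limit` rules."""
    return rank_rules(rules)[:limit]
```

Python's tuple comparison makes each criterion act as a tiebreaker for the previous one, mirroring the prose exactly.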
Commonly used support and confidence thresholds [83][84][85] may not suit our case, where only a small percentage of patients are at high risk. We will adjust the support and confidence thresholds if the commonly used ones cannot produce enough meaningful association rules. By setting the thresholds low enough, we can produce meaningful rules, at the expense of our clinicians spending time removing rules that make little or no clinical sense. Since existing predictive models give no suggestions on tailored interventions, we will regard Aim 2 as successful if a non-trivial percentage (e.g., ≥20%) of high-risk patients are covered by actionable rules.

Performance evaluation:
The algorithm for explanations and suggestions will be evaluated in Aim 4.

Aim 3: Develop an algorithm to compute optimal thresholds for risk strata.
In risk-stratified management, chronic disease patients are stratified into multiple levels [14,15]. This aim will compute the optimal thresholds for these levels that minimize total future cost of all patients factoring in the management programs' costs. Total future cost implicitly reflects patient health outcomes, healthcare use, efficiency of care, and the management programs' benefits. For instance, fewer hospitalizations usually lead to lower costs. The following discussion focuses on stratification based on predicted patient risk of experiencing a specific type of undesirable event (e.g., hospitalization or emergency department visit). The case of stratification based on predicted cost or with more than one type of undesirable event can be handled similarly. Our discussion applies to any predictive model and is based on a fixed period in the future, such as the next 12 months.
Threshold computation algorithm: We will conduct quantitative analysis to determine the optimal management level for each risk percentile. We will proceed through the risk percentiles one by one, from the highest to the lowest. Given a risk percentile, we compute for each management level the average future cost per patient in the percentile if patients in the percentile are put into the level. The level with capacity remaining in its management program and the lowest average future cost per patient will be chosen for the risk percentile.
More specifically, consider a risk percentile and an average patient whose predicted risk falls into the percentile. If the patient is enrolled in a management program, we estimate that, compared with no enrollment, the patient's future cost will change by Δ = c_i - avg_n_e × p × c_e, i.e., the program's cost minus the program's benefit gained by reducing undesirable events. Here, c_i is the program's average cost per patient. Factors such as increased medication cost due to better medication adherence are included in c_i. avg_n_e is the average number of undesirable events that a patient in the risk percentile will experience in the future. p is the percentage of undesirable events the management program can help avoid, reflecting the program's benefit. c_e is the average cost of experiencing the undesirable event once. c_i and p can be obtained from statistics reported in the literature for the management program [39, chapters 5 and 18, 88]. avg_n_e can be obtained by making predictions on historical data and checking the corresponding statistics for the risk percentile. c_e is obtained from statistics on historical data. The management level with the smallest Δ is optimal for the risk percentile. If no statistics on c_i and p of a management program are available in the literature, the clinician on our research team (Dr. Stone) will provide rough estimates based on experience. We will perform sensitivity analysis when choosing thresholds by varying the estimated values of c_i and p to obtain the full spectrum of possible outcomes in Aim 4.
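The percentile-by-percentile assignment can be sketched as below. This is an illustration under simplifying assumptions we introduce: program capacity is counted in whole risk percentiles, and c_i, p, c_e, and avg_n_e are supplied as plain numbers.

```python
def assign_levels(percentiles, programs, c_e):
    """percentiles: dicts with 'avg_n_e', ordered from highest risk to lowest.
    programs: dicts with 'cost' (c_i), 'p' (fraction of events avoided), and
    'capacity' (counted, as a simplification, in whole percentiles).
    Chooses per percentile the program with remaining capacity and the
    smallest estimated cost change delta = c_i - avg_n_e * p * c_e."""
    assignment = []
    for pct in percentiles:
        best, best_delta = None, float("inf")
        for j, prog in enumerate(programs):
            if prog["capacity"] <= 0:
                continue  # program is full; skip it
            delta = prog["cost"] - pct["avg_n_e"] * prog["p"] * c_e
            if delta < best_delta:
                best, best_delta = j, delta
        if best is not None:
            programs[best]["capacity"] -= 1
        assignment.append(best)
    return assignment
```

Processing percentiles from highest to lowest risk means scarce high-intensity programs are claimed first by the percentiles that benefit most from them.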
The above-mentioned method performs an exhaustive search among all management levels for each risk percentile. In practice, we would expect avg_n_e to decrease as the predicted patient risk of experiencing undesirable events becomes smaller. We will investigate using this property to reduce the search space when going through the risk percentiles one by one, from the highest to the lowest.

Performance evaluation:
The threshold computation algorithm will be evaluated in Aim 4.

Aim 4: Conduct simulations to estimate outcomes of risk-stratified patient management for various configurations.
To determine a predictive model's value for future deployment in clinical practice, we need to estimate outcomes of risk-stratified patient management when the model is used, and determine how to generalize the model to differing sites collecting different sets of attributes. Our models will be built on IH data sets. Our simulations will guide how to deploy the models in another healthcare system. No previous study has either estimated outcomes for a model with >1 management strategy or determined the attributes most important for generalizing the model. We will demonstrate our simulation method for the task of risk-stratified management of (i) asthmatic children and (ii) asthmatic adults, by using our models for predicting acute care use for asthma in the following year (see Aim 1.a - definition of asthma cases and outcomes), the hierarchy of risk-stratified management levels shown in Fig. 1, and our algorithms described in Aims 2 and 3. Our simulation method is general and can be used to deploy other models in clinical practice. We will first evaluate the technique developed in Aim 1.
Outcomes: We will focus on the outcomes of costs, hospital admission, and emergency department visit in the following year. Cost is the primary outcome, reflecting healthcare use and efficiency of care. Other outcomes are secondary and indirectly reflected in costs.
Estimate outcomes: Given a set of attributes and a predictive model, we will estimate each outcome. We will use the same method as in Aim 1 to train the model on the first 10 years' data. For the 11th year's data, we will obtain prediction results, compute thresholds for risk strata, then estimate the outcome in a way similar to Aim 3. For example, consider a patient who would have a cost of h and experience n_e undesirable events in the following year with no program enrollment. If the patient is enrolled in a management program, we estimate that the patient's future cost will become h + c_i - n_e × p × c_e, where c_i, p, and c_e are as defined in Aim 3. The overall outcome estimate is the aggregate of estimated outcomes for all patients. Using a similar approach, we can identify the minimum accuracy the model must achieve to be clinically valuable.
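The aggregate estimate can be sketched by applying h + c_i - n_e × p × c_e per enrolled patient; the data structures below are our own simplification for illustration.

```python
def estimate_total_cost(patients, assignment, programs, c_e):
    """patients: dicts with 'h' (cost with no enrollment) and 'n_e' (undesirable
    events with no enrollment). assignment maps each patient to a program index
    or None (no enrollment). Returns the aggregate estimated future cost."""
    total = 0.0
    for pt, j in zip(patients, assignment):
        if j is None:
            total += pt["h"]  # unenrolled: cost is unchanged
        else:
            prog = programs[j]
            # enrolled: add program cost, subtract cost of avoided events
            total += pt["h"] + prog["cost"] - pt["n_e"] * prog["p"] * c_e
    return total
```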
Sensitivity analysis: IH collects an extensive set of attributes. Another healthcare system may collect only a subset of these attributes. To ensure the model's generalizability, we will test various combinations of attributes and estimate outcomes when the modified model is used. The estimate will identify which attributes are critical. If an important attribute is unavailable in a specific healthcare system, the estimate can suggest alternative attributes with minimal negative impact on outcomes.
Our full model will use up to 400 attributes. It is not possible to conduct simulations for every possible combination of these attributes. Instead, we will use an attribute grouping approach associating attributes likely to co-exist, such as attributes associated in a lab test panel, based on our clinical expert's judgment. We will construct and publish a table listing possible combinations of attributes by groups, including outcomes estimated through simulations and the predictive model's trained parameters. A healthcare system interested in deploying the model can use the table to determine expected outcomes for their data environment and identify attributes that need to be collected. One entry in the table will correspond to the attributes available in the OMOP (Observational Medical Outcomes Partnership) common data model [89], which standardizes clinical and administrative attributes from ≥10 large healthcare systems in the U.S. [90]. The model in this entry will directly apply to at least those healthcare systems. If conducting simulations for the many combinations of attribute groups is too slow on one computer, we will parallelize simulations on a secure computer cluster available to us [82].
Outcome evaluation and sample size justification: We will compare outcomes achieved by two predictive models using the best machine learning algorithm. The first model will use patient, physician profile, and environmental variable features; the second, only patient and environmental variable features. We will test three hypotheses: adding physician profile features will be associated with reduced (1) costs, (2) hospital admissions, and (3) emergency department visits. We will test each hypothesis twice, once for children and once for adults. Cost data will be log-transformed due to their skewed distribution [13, page 134]. We will accept the primary hypothesis if the first model can reduce the log cost by 10% × its standard deviation compared with the second model. A one-sided paired-sample t-test will be used to test the difference in log cost between the two models' outcomes. McNemar's test will be used to test the difference in hospital admissions and emergency department visits. At a significance level of 0.05, a sample size of 857 instances has 90% power to confirm the primary hypothesis. The 11th year's data include about 27,000 children and 75,000 adults with asthma, providing adequate power for testing the primary hypothesis.
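The two planned tests can be sketched with scipy. This is illustrative only (the actual analysis may use a dedicated statistical package); the McNemar variant shown is the exact binomial test on the discordant pairs.

```python
import numpy as np
from scipy.stats import t as t_dist, binomtest

def paired_one_sided_t(log_cost_model1, log_cost_model2):
    """One-sided paired t-test; H1: model 1 (with physician profile features)
    yields lower log cost than model 2. Returns (t statistic, one-sided p)."""
    d = np.asarray(log_cost_model1) - np.asarray(log_cost_model2)
    n = len(d)
    t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return t_stat, t_dist.cdf(t_stat, df=n - 1)  # lower tail: model 1 smaller

def mcnemar_exact(b, c):
    """Exact McNemar's test from the two discordant cell counts:
    b = event under model 1 only, c = event under model 2 only."""
    return binomtest(b, b + c, 0.5).pvalue
```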
We will do two similar analyses to compare our threshold computation algorithm vs. the current method of determining thresholds heuristically (evaluating the technique in Aim 3), and our algorithm for explanations and suggestions vs. the current method of giving no explanation or suggestion (evaluating the technique in Aim 2). Physician profile features will be used in both analyses. In the first analysis, we will use the heuristically determined thresholds reported in the literature [15]. In the second analysis, we will use our threshold computation algorithm and estimate outcomes of our algorithm for explanations and suggestions. For an intervention, we will use statistics on its benefits and average cost per patient from the literature [39, chapters 5 and 18] where available. If no information is available, the clinician on our research team (Dr. Stone) will conservatively estimate these numbers' minimum and maximum values based on experience. For each number, we will use five levels ranging from the minimum to the maximum value. To obtain the entire spectrum of possible outcomes, we will perform sensitivity analysis by varying the level and the percentage of suggested interventions that clinicians will use. For the current method of giving no explanation or suggestion, we will proceed in a similar way by letting Dr. Stone estimate the lower and upper bounds of the likelihood that clinicians will use an intervention. If Dr. Stone has difficulty estimating this likelihood, we will interview clinicians using sample patient cases to help with the estimation. Based on its own estimate of the situation, a healthcare system can check where in the spectrum it will fall.

Ethics approval
We have already obtained institutional review board approvals from the University of Utah and IH for this study.

Results
We are currently in the process of extracting clinical and administrative data from the IH EDW. We plan to complete this study in about five years.

Discussion
Our techniques' principles are general and rely on no special property of any disease, patient population, or healthcare system. Just as predictive models are used for case management for various diseases and patient populations [13,24,30,31], after proper extension our techniques can be used for a range of decision support applications in various settings (see Introduction, innovation). Our simulation method will determine how to generalize a predictive model to differing sites collecting different sets of attributes, and which attributes are most important for generalization. Using data from an integrated healthcare system with many heterogeneous facilities spread over a large geographic area, we will demonstrate our techniques on the test case of asthma patients. These facilities include 22 hospitals and 185 clinics, ranging from tertiary care hospitals in metropolitan areas staffed by sub-specialists to community urban and rural clinics staffed by family physicians and general practitioners with limited resources. Variation in geographic location, patient population, cultural background, staff composition, and scope of services provides a realistic situation for identifying factors generalizable to other facilities nationwide. When conducting simulations for each disease (pediatric/adult asthma), one of the models produced will directly apply to ≥10 large healthcare systems.
As inaccurate predictive models are already commonly used for case management [24], we expect our more precise models to have practical value. Future studies will demonstrate our techniques on other diseases, test cases, and patient populations; implement our techniques in a major healthcare system for risk-stratified management of asthmatic children; and test the impact in a randomized controlled trial.
In summary, our work will transform risk-stratified patient management and personalize management strategies based on objective data so that more patients will receive the most appropriate care. This will improve clinical outcomes and reduce healthcare use and cost. We will achieve generalizable advances in predictive modeling, explaining prediction results, tailoring interventions, and resource allocation. After proper extension, our new techniques can be used for a variety of decision support applications in various disease settings. The new simulation method will be useful for estimating outcomes for a predictive model in dissimilar data environments.