Appendix B—Semilog Modeling
Certain outcome measures, notably costs and length-of-stay (LOS), are distributed with a rightward (positive) skew, as depicted below in Figure 1(a). Applying linear regression to models with skewed dependent variables gives rise to a number of pathologies, including inefficient, often biased, parameter estimates and predictions outside logical bounds, such as negative values for LOS and costs. When outcome measures are not symmetrically distributed, analysis of performance can be disproportionately influenced by outliers and special or extreme cases. This phenomenon can require a manual procedure for identifying and removing outliers, a subjective technique at best.
A more robust solution is to take the natural log of the dependent variable, which results in an approximately symmetric distribution and contracts the outliers inward toward the center of the data, as shown in Figure 1(b). It also ensures that all predicted values will be positive. (No matter how negative the log value is, taking the anti-log to restore the values will guarantee that they are positive.)
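As a minimal sketch of this effect, the following Python snippet draws a synthetic right-skewed sample and compares the sample skewness before and after the log transform. The lognormal distribution, its parameters, and the helper name `sample_skewness` are assumptions made purely for illustration; this is not the data behind Figure 1.

```python
import numpy as np

def sample_skewness(values):
    """Return the (biased) sample skewness: the third standardized moment."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic, illustrative costs only: lognormal draws are strictly positive
# and right-skewed, which mimics the shape described in the text.
rng = np.random.default_rng(seed=0)
costs = rng.lognormal(mean=7.0, sigma=0.9, size=5000)
log_costs = np.log(costs)

skew_raw = sample_skewness(costs)
skew_log = sample_skewness(log_costs)

# Any value predicted on the log scale, however negative, maps back to a
# positive number once the anti-log is taken.
back_transformed = np.exp(np.array([-50.0, -1.0, 0.0, 3.0]))

print(f"skewness of cost:     {skew_raw:.2f}")
print(f"skewness of ln(cost): {skew_log:.2f}")
print("all back-transformed values positive:", bool((back_transformed > 0).all()))
```

In a sample like this the raw skewness is strongly positive while the skewness of the logged values is close to zero.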
We conducted a systematic review of non-adverse outcome measures (LOS, charges, and costs) by three-digit ICD-9 code to monitor the positive skew and measure its magnitude. In symmetric distributions two measures of central tendency, the geometric mean and the arithmetic mean (see below), are approximately equal. As the skew increases in unimodal distributions, the ratio of the arithmetic mean to the geometric mean grows from unity.
To illustrate skew: total cost is skewed right, but its natural log, ln(cost), is approximately symmetrically distributed. Using linear regression to forecast ln(cost) therefore yields much better estimates with smaller error.
A numeric illustration:
Depicted below is the total cost frequency distribution for a sample of 200 hospital discharges. It displays the characteristic positive skew (skew coefficient = 2.6).
Figure 2(a) Total cost for 200 discharges
Figure 2(b) Log of cost for 200 discharges
Geometric vs. arithmetic means:
The arithmetic mean is the simple average, computed by adding up all values ($x_i$) in the sample and dividing by the number of such values ($n$):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The geometric mean follows the same principle, but the values are multiplied together instead of added, and the nth root of the product is taken instead of dividing by n:
$$GM = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$$
An equivalent way to compute the geometric mean is to take advantage of natural logarithms. Defining y as the natural log of x [y = ln(x)], the geometric mean is the anti-log (exp) of the arithmetic mean of y:
$$GM = \exp\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) = \exp(\bar{y})$$
Because the geometric mean is based on log values and the log transformation tends to draw extreme values toward the center of the data, the geometric mean is more "robust" than the arithmetic mean. "Robust" here means less influenced by outliers.
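The robustness claim can be checked with a small sketch. The cost values below are invented for illustration, and the function names are hypothetical; the snippet adds one extreme value to a small sample and compares how far each mean moves.

```python
import math

def arithmetic_mean(values):
    """Sum of the values divided by their count."""
    return sum(values) / len(values)

def geometric_mean(values):
    """exp of the arithmetic mean of the logs; needs strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Made-up costs, then the same sample with one extreme case appended.
costs = [800.0, 900.0, 1000.0, 1100.0, 1200.0]
with_outlier = costs + [20000.0]

# Factor by which each mean grows once the outlier is included.
am_shift = arithmetic_mean(with_outlier) / arithmetic_mean(costs)
gm_shift = geometric_mean(with_outlier) / geometric_mean(costs)

print(f"arithmetic mean grows by a factor of {am_shift:.2f}")
print(f"geometric mean grows by a factor of  {gm_shift:.2f}")
```

The geometric mean moves far less than the arithmetic mean, which is the sense of "robust" used above.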
Back to the cost example from 200 hospital discharges:
Transforming cost from Fig. 2(a) by taking the natural log gives the frequency distribution in Fig. 2(b), which exhibits the typical symmetric bell shape of the normal distribution. The arithmetic mean cost is marked on the first (skewed) frequency histogram, which in this illustration is $1670. The mean of the log(cost) is marked on the second histogram at 6.95. Taking the anti-log of this value yields the geometric mean equal to $1043, which is much closer to the mode of the original (untransformed) histogram. The pronounced positive skew in the original cost distribution guarantees that the arithmetic mean is much larger than the geometric mean, which tends to pull back the extreme values in the upper tail. In this illustration the ratio of the arithmetic mean to the geometric mean is $1670/$1043 = 1.60.
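The arithmetic in this illustration can be verified directly. The two input figures are taken from the text above; the variable names are chosen here for readability.

```python
import math

mean_log_cost = 6.95        # mean of ln(cost), as marked on Fig. 2(b)
arithmetic_mean = 1670.0    # arithmetic mean cost in dollars, from Fig. 2(a)

# The geometric mean is the anti-log of the mean of the logged values.
geometric_mean = math.exp(mean_log_cost)
ratio = arithmetic_mean / geometric_mean

print(f"geometric mean: ${geometric_mean:.0f}")
print(f"arithmetic / geometric ratio: {ratio:.2f}")
```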
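raw arithmetic mean:
$$\frac{1}{n}\sum_{i} x_{ijkl}$$
raw geometric mean:
$$\exp\left(\frac{1}{n}\sum_{i} y_{ijkl}\right)$$
arithmetic mean risk:
$$\frac{1}{n}\sum_{i} \exp\left(\hat{y}_{ijkl}\right)$$
geometric mean risk:
$$\exp\left(\frac{1}{n}\sum_{i} \hat{y}_{ijkl}\right)$$
where
$x_{ijkl}$ = patient.total_charges, patient.comparative_costs, and patient.length_of_stay
$y_{ijkl}$ = ln($x_{ijkl}$)
$\hat{y}_{ijkl}$ = ln(total_charges) risk, ln(comparative_costs) risk, and ln(length_of_stay) risk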
i = patient (each row in the patient table)
j = provider or grouping
k = icd9 diagnosis (3 digit)
l = outcome (length of stay, charges, cost)
n = all observations including zeros
- Create additional columns in the patient/episode table to hold ln_x, ln_x_risk, x_risk (equal to e^(ln_x_risk)), and ln_x_stderr, where x represents each dependent variable.
- Populate the ln_x column with ln(total_charge), ln(comparative_costs), and ln(ccms_length_of_stay), respectively. The ln value is set to '99' when costs or charges are zero. [20]
- Regress ln(total_charge), ln(comparative_costs), and ln(length of stay) on the original vector of independent variables. Cases with a null value or a 99 for the dependent variable, as well as incomplete cases, are not included in the regression. Nevertheless, fitted values (risks) and their standard errors are generated for all complete cases. [21] We use n to designate the number of complete observations, including those with null or '99' dependent values; m indicates the number of observations included in the regression (excluding incomplete cases and those with null or '99' dependent values). To illustrate, suppose a given model stratum has 100 observations, of which 95 are complete, and of these 95, ten have cost equal to zero (and so are given a value of '99' in the log column). Then n = 95 and m = 85. The regression is run on the m = 85 cases, and fitted values (risks) together with their standard errors are generated for all n = 95 cases.
- The back-end values are left in log form and antilogs are applied only after aggregation on the front end.
- The front-end software application then performs the appropriate calculation (sum, average, etc.) on the log values to display the raw, standardized, and deviation results in the reports. (The calculations relevant to this conversion are found above.)
- Deviations are based on geometric means, that is, the raw geometric mean minus the geometric mean risk:
Geometric Deviation = exp(avg(ln_x)) - exp(avg(ln_x_risk))
- The front-end software calculates the p-value to determine significance with all measures in logs (i.e., without converting raw values, risks, or standard errors to the original units). The calculation will not change from what is currently in use but will be based on the m nonzero cases.
- The deviations column on all CareScience Quality Manager reports must equal the raw minus the standardized values up to rounding error in the first decimal place, such that the deviation is no more than 0.1 different from the difference between raw and the standardized value.
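The back-end steps in the list above (a log column with a '99' placeholder, a regression on the m usable cases, and fitted values for all n complete cases) can be sketched as follows. This is an illustrative outline under stated assumptions, not the production code: ordinary least squares via `numpy.linalg.lstsq` stands in for whatever regression routine is actually used, the single covariate and all data are synthetic, every function and variable name is hypothetical, and standard errors are omitted for brevity.

```python
import numpy as np

PLACEHOLDER = 99.0  # stored when ln(x) is undefined because x == 0

def log_with_placeholder(x):
    """Return ln(x) for x > 0, the placeholder for x == 0, NaN for nulls."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    out[x == 0] = PLACEHOLDER
    positive = x > 0
    out[positive] = np.log(x[positive])
    return out

def fit_log_model(covariates, outcome):
    """Fit on the m usable rows; return fitted ln-values for the n complete rows."""
    ln_x = log_with_placeholder(outcome)
    complete = ~np.isnan(covariates).any(axis=1)        # all covariates present
    usable = complete & ~np.isnan(ln_x) & (ln_x != PLACEHOLDER)
    design = np.column_stack([np.ones(len(covariates)), covariates])
    # Assumption: plain OLS stands in for the actual regression routine.
    beta, *_ = np.linalg.lstsq(design[usable], ln_x[usable], rcond=None)
    ln_x_risk = np.full(len(covariates), np.nan)
    ln_x_risk[complete] = design[complete] @ beta       # risks stay in log form
    return ln_x, ln_x_risk, int(complete.sum()), int(usable.sum())

# Synthetic stratum mirroring the example: 100 rows, 5 incomplete,
# and 10 of the complete rows with zero cost.
rng = np.random.default_rng(seed=1)
covariates = rng.normal(size=(100, 1))
cost = np.exp(7.0 + 0.5 * covariates[:, 0] + rng.normal(scale=0.3, size=100))
covariates[:5, 0] = np.nan
cost[5:15] = 0.0

ln_cost, ln_cost_risk, n, m = fit_log_model(covariates, cost)
print(f"n = {n}, m = {m}")
```

The zero-cost rows are excluded from the fit but still receive a fitted value, which matches the n = 95, m = 85 example in the list.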
Implementation is a combination of front-end and back-end changes. The database must hold logarithmic values—ln(total_charges), ln(comparative_costs), and ln(length of stay)—and standard errors in log form. All computations of confidence intervals and significance are in logs, including necessary aggregations. All computations on risk values are done before conversion back to "levels" (in log units), hence excluding cases with zero values in the raw data. This approach to aggregation generates geometric (not arithmetic) means. Moreover, the log transformation method guarantees that expected level values (after taking the antilog) be positive, which eliminates the need for front-end data trimming.
Comparative Costs as an example:
a = exp(avg(ln_comparative_costs))
b = exp(avg(ln_comp_cost_risk))
c = sum(decode(ln_comparative_costs,null,0,1))
d = sqrt[ sum(ln_comp_cost_risk_stderr ^ 2) ]
k = avg(ln_comparative_costs) - avg(ln_comp_cost_risk)
Comparative cost deviation = a - b
Comparative cost sig flags: t-value = k * c / d (with degrees of freedom: c - 1)
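For readers who prefer executable code to the SQL-style expressions, the same quantities can be sketched in Python. The function name and the sample numbers are hypothetical, and the sketch assumes all three input lists are restricted to the same cases with a defined ln(comparative_costs); the database aggregates may be taken over different row sets.

```python
import math

def geometric_deviation_and_t(ln_cost, ln_cost_risk, ln_cost_risk_stderr):
    """Mirror the quantities a, b, c, d, k and the t-value from the text.

    Assumption: the three parallel lists cover the same non-null cases.
    """
    c = len(ln_cost)                          # count of non-null cases
    avg_ln = sum(ln_cost) / c
    avg_ln_risk = sum(ln_cost_risk) / c
    a = math.exp(avg_ln)                      # raw geometric mean
    b = math.exp(avg_ln_risk)                 # geometric mean risk
    d = math.sqrt(sum(se ** 2 for se in ln_cost_risk_stderr))
    k = avg_ln - avg_ln_risk                  # deviation in log units
    t_value = k * c / d                       # compare against c - 1 d.f.
    return a - b, t_value

# Invented values for four discharges.
deviation, t_value = geometric_deviation_and_t(
    ln_cost=[7.0, 7.2, 6.8, 7.4],
    ln_cost_risk=[6.9, 7.0, 6.9, 7.0],
    ln_cost_risk_stderr=[0.1, 0.1, 0.1, 0.1],
)
print(f"deviation = {deviation:.2f}, t = {t_value:.2f}")
```

Note that the deviation is reported in original units (a - b) while the t-value is computed entirely in logs, as the procedure above requires.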
Addendum on LOS
Within the CareScience database, patients discharged on the same day as admitted are assigned length-of-stay = 1, not 0, which conforms to most billing practices. LOS is defined as the number of days present, not including the day of discharge, with a minimum LOS = 1. This rule eliminates the possibility of an undefined value of ln(length_of_stay), which would otherwise arise when LOS = 0.
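A minimal sketch of this rule, assuming admission and discharge are available as calendar dates; the function name `length_of_stay` is hypothetical and does not refer to any actual database routine.

```python
import math
from datetime import date

def length_of_stay(admit, discharge):
    """Days present, excluding the discharge day, with a floor of one day."""
    if discharge < admit:
        raise ValueError("discharge precedes admission")
    return max((discharge - admit).days, 1)

same_day = length_of_stay(date(2004, 3, 1), date(2004, 3, 1))
three_day = length_of_stay(date(2004, 3, 1), date(2004, 3, 4))

# Because LOS is never 0, its natural log is always defined.
print(same_day, three_day, math.log(same_day))
```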
[20] The value 99 is a placeholder used by Data Manager to identify observations that should be excluded from the regression because the dependent variable is undefined (ln of 0 is undefined).
[21] Complete cases are defined as having values for all independent variables required for the regression.