Frequently Asked Questions: Predictive Analytics Challenge
How do I get access to the customized data files?
Participants must first submit an executed "AHRQ Bringing Predictive Analytics to Healthcare Challenge Data Use Agreement" [PDF File; 123.4 KB and Word; 21 KB] to AHRQ. The participant will then receive a spreadsheet with customized data files for prior years. This spreadsheet will also serve as part of the submission form for providing predictive estimates for the Challenge.
What is the final submission, and where do I send it?
The final submission should include the Excel spreadsheet with predictive estimates added to it. Participants should also include backup documentation on how these predictive estimates were derived. These documents should be uploaded to the registration page.
Where do I upload my submission?
Participants can submit their applications through this Web portal. They may select “Submission Requirements/Registration” any time during the submission period. If an application is submitted multiple times, AHRQ will review the application that was submitted at the latest time/date. Participants must also complete the Data Release Forms.
Where can I find the data in Table 1, "Calculating community-level statistics," from the HCUP methods documents?
Here is the link to the HCUP Community Resources Document.
Does the data release agreement (and signatory) pertain to the organization or to the individual members?
It pertains to the one individual who represents the organization, so other team members do not need to also submit agreements. However, the signatory serves as the steward for the data file and will be held responsible for the protection and destruction of the data after the Challenge.
What does AHRQ plan to do with the end-product from this Challenge?
AHRQ is interested in the predictive analytic space in general. The Agency's intent is to better understand the current science especially as it applies to healthcare data, including those datasets managed by AHRQ.
Should contestants focus on a novel variable versus a novel algorithm?
Challengers should focus on the product (a model predicting outcomes) rather than merely on algorithms.
Can we use proprietary tools like SAS to create models?
Proprietary or purchased tools are eligible for use in your predictions.
Are any predictors ineligible, such as by race, age, gender?
Any predictors that you feel accurately predict the data provided are acceptable.
Is interoperability an issue, or can the model be a "black box"?
The methods that you employ have to be replicable by AHRQ.
Are the submissions judged solely on metrics?
There are two components: the metrics and the report. The report will be used to replicate the results and to provide face validity. In terms of empirical metrics, the data will be weighted with the 2017 weights being higher than the 2016 weights.
Do the acute care numbers include all data, including the VA?
The community data document has additional details about the data fields. The acute care data does not include Federal hospitals, such as the VA.
Can individuals participate?
Yes. However, the individual should be a U.S. citizen or permanent resident. Several individuals may also participate as a team.
Can an individual receive the prize money?
Yes, if they are a U.S. citizen or have permanent residency. It is important to note that there would be tax implications in receiving the prize.
Are there restrictions on the platform or programming software used, such as cloud-based computing platforms?
AHRQ imposes no restrictions on data platform or programming software used in this Challenge.
There are discrepancies in the data file for length-of-stay variable.
NOTE: This was corrected, and the analytic file was updated on April 12th. If you received the data file before that date and did not receive the updated file, please contact Susan Kerin at email@example.com.
Does the data represent the insured, general population and/or Medicaid population?
Participants should reference the community data document, which provides background details on the data.
Can participants use the AHRQ datasets which are not open to the public?
No, your application should be based on publicly available and free datasets.
What specific accuracy metric are you using to judge? Population-weighted, MSE, MAE, etc.?
The criteria will be based on our calculations of both the average length of stay and the number of discharges for all of those counties, based on the databases that AHRQ produces. Predictions will be compared to those numbers, yielding a percentage difference. AHRQ will weight the results by county, so the size of the county will matter and larger counties will carry greater weight, and we will weight the 2017 predictions more heavily than the 2016 predictions. This latter weighting is because AHRQ wants these models to be able to forecast into the future.
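The population-weighted scoring described above can be sketched as follows. This is a hedged illustration only: the county populations, actual values, and predictions are invented, and AHRQ's exact computation may differ.

```python
# Sketch: population-weighted absolute percentage difference for one
# measure (e.g., number of discharges) in one year. All numbers are
# illustrative, not real Challenge data.
counties = {
    # county: (population, actual value, predicted value)
    "County A": (500_000, 12_000, 11_400),
    "County B": (100_000, 2_000, 2_300),
}

total_pop = sum(pop for pop, _, _ in counties.values())

score = 0.0
for pop, actual, predicted in counties.values():
    abs_pct_diff = abs(predicted - actual) / actual
    # Larger counties carry greater weight via their population share.
    score += (pop / total_pop) * abs_pct_diff

# Year scores would then be combined with 2017 weighted above 2016,
# e.g., final = 0.8 * score_2017 + 0.2 * score_2016.
```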
Is there a performance metric that the judges wish to optimize, for example, F1 score versus precision?
No. Submission item #1 will be weighted by 80%, and submission item #2 will be weighted by 20%.
Are statistical approaches preferred over black box approaches, or can either be used and be acceptable?
Black boxes, which AHRQ is unable to replicate, will be disqualified.
I understand that publicly available information may be used to supplement this file. Does this include information that can be gathered using a Freedom of Information Act request? I found a dataset available through HRSA, but it is a more micro (Census tract) level than I am looking for. I would need to submit a request to have it modified for the purposes of this Challenge, so I want to be sure that is still acceptable.
Any data that can be obtained by any individual that is free of charge and does not prohibit the participant's intended use may be used in the Challenge.
Are we able to use data from the Predictive Analytics Challenge for the SDOH Challenge?
No, the customized AHRQ file provided to participants may only be used for the "Bringing Predictive Analytics to Healthcare Challenge."
Could you help me understand what the requirements are for the timeline in which the predictor data need to be available? There's apparently nothing mentioned on the topic, but it seems hard to believe it's not an important criteria. The issue is this: All of the predictions we're being asked to make are for years that are entirely completed and, therefore, data are available for the full year from most sources. However, if you were to use this model to predict for future years, this would not be the case. For example, if I use the number of child births in 2017 to predict utilization for 2017, and you then try to predict 2020 going forward, it won't be possible because the number of child births aren't in for that year. Normally, for something like this, we would need to know in what timeframe the data need to be available in order to predict the following year. For example, data used to predict 2020 utilization would have to have been available by the end of the first quarter of 2020, or no later than 12/31/2019. This makes a very large difference in our approach because if there really is no guidance on this question, and we can use any data that are available NOW to predict 2016/2017, well, that's likely going to give us the most accurate model (and therefore the winner) but the model will be completely useless in actually predicting anything that hasn't already happened. It's kind of like asking people to predict the winner of the basketball game after you already know the score.
AHRQ has no restrictions regarding the year in which supplemental data is acquired. As noted in the Challenge, participants must describe their METHODOLOGICAL approach in obtaining their estimates. While participants may observe the actual values for 2016, they MUST, for their 2016 estimates, apply the methodological approach deployed in obtaining the 2017 estimates. Simply reporting the actual numbers for 2016 as your estimates would not be consistent with deploying your methodological approach for your 2017 estimates.
I was hoping you could answer a question about the evaluation metric for this challenge. I’ve reviewed it, and the FAQs, and I am still confused. Could you provide an example? Or clarify the following:
I understand there are two predictions for each year for each County. So, four total predictions for each County in the dataset:
- Number of discharges in 2016
- Average length of stay, 2016
- Number of discharges in 2017
- Average length of stay, 2017
Then the weighted, absolute percent difference (wAPD) between each of the four predictions and their actual values is calculated. I also understand that the score for 2017 will be weighted as 80% of the final score, while the score for 2016 will be weighted as 20% of the final score. What I do not understand is how the two components (wAPD of predicted discharges and wAPD of predicted length of stay--or the annual sums of these) are combined to get a single score for the year. I am assuming from the reading that item #1 is the 2016 single score for the year, and item #2 is the 2017 score?
AHRQ is providing a refined description with more information on the “Evaluation Criteria” for this challenge.
An overall evaluation metric for the submitted model or method is calculated in the following way:
- Four categories of predictions are associated with this Challenge.
- 2017 number of discharges by county
- 2017 average length of stay by county
- 2016 number of discharges by county
- 2016 average length of stay by county
- For each county in each category, the absolute percentage difference between the predicted value and the actual value is determined.
- For each county in each category, the calculated value is weighted by the share of the population in that county to the total population of all selected counties in the dataset.
- For each category, the weighted values by county are summed to determine the participant's overall weighted value for that category.
- For each category, participants' overall weighted values are standardized to facilitate across-category comparison.
- To obtain the overall evaluation metric for each participant, the standardized weighted values in each category are weighted such that:
- 2017 and 2016 predictions are weighted 80% and 20%, respectively;
- the number of discharges and average length of stay are weighted 65% and 35%, respectively.
- The lower the participant's overall evaluation metric, the more reliable and valid the predictive method.
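The steps above might be combined as in the following sketch. The team names and per-category scores are invented, and the z-score standardization across participants is an assumption about what "standardized" means here; AHRQ's actual standardization may differ.

```python
from statistics import mean, pstdev

# Illustrative per-category scores (population-weighted absolute percent
# differences) for two invented participants.
scores = {
    "team1": {"disch_2017": 0.05, "los_2017": 0.08,
              "disch_2016": 0.04, "los_2016": 0.06},
    "team2": {"disch_2017": 0.07, "los_2017": 0.05,
              "disch_2016": 0.06, "los_2016": 0.05},
}

# 2017 vs. 2016 weighted 80%/20%; discharges vs. length of stay 65%/35%.
weights = {
    "disch_2017": 0.80 * 0.65,
    "los_2017":   0.80 * 0.35,
    "disch_2016": 0.20 * 0.65,
    "los_2016":   0.20 * 0.35,
}

def standardize(cat):
    # Assumed standardization: z-scores across participants in a category.
    vals = [s[cat] for s in scores.values()]
    mu, sd = mean(vals), pstdev(vals)
    return {team: (s[cat] - mu) / sd for team, s in scores.items()}

z = {cat: standardize(cat) for cat in weights}
overall = {team: sum(weights[cat] * z[cat][team] for cat in weights)
           for team in scores}
# The lower a participant's overall value, the better the submission ranks.
```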
A question for you on the guidelines for this Challenge. We understand that our prediction model can use 2011-2016 AHRQ data to predict 2017 discharges/average LOS. For our other data sources (from free, publicly available sources), are we restricted to the same timeframe--2011-2016--or are we able to include 2017 data if available?
AHRQ has no restrictions regarding the year in which supplemental data is acquired.
For the evaluation criteria, it says 20% of the weight is on validity, determined "by how well the model performs on earlier years of data." How exactly is the validity assessment made? Do you assess the model on just the previous year (i.e., the 2016 data), or multiple previous years (e.g., the preceding 5 years of utilization rates)? If you use multiple previous years, are they each weighted equally?
To assess validity, we will compare your 2016 estimates, which must be derived by the model you deployed to obtain your 2017 estimates, with the actual values in 2016. For each county, the absolute percentage difference between the predicted value and the actual value is determined. These values are weighted by the share of the population in that county to the total population of all selected counties in the dataset.
The competition page states: For each cell, the calculated value is weighted by the share of the population in that county to the total population of all selected counties in the dataset. What data source and year will be used for the county weights? The 2017 Census county population estimates?
The estimated resident population for 2017 from the U.S. Census Bureau will be used.
What column should we put our answers in?
As long as participants provide AHRQ with the state-county code and the requested predicted values that are CLEARLY LABELED, the values may be placed in any column.
When using other data sources, do we need to submit each dataset in raw form with the full audit trail of transformation code? Or can we just submit the final transformed data files along with the locations of the original raw data sources?
The final transformed data with a description of the original data source is sufficient for purposes of this Challenge.
Are only historical data allowed to predict the outcome? Or can we use data that is concurrent with the predicted outcome? For example, if we use county-level Census data, can we use the 2017 demographic estimates to predict the 2017 outcome? Or should we use the 2016 estimates? For the Census data, estimates are available mid-year (e.g., the 2017 estimates are available on 07/01/2017).
AHRQ has no restrictions regarding the year in which supplemental data is acquired.
If a data source is free but requires a license of some sort to be used (e.g., narrowing the purposes to which the dataset can be used), is that still considered a "public" data source that we are allowed to use?
Any data that can be obtained by any individual that is free of charge and does not prohibit the participant's intended use may be used in the Challenge.
How do we demonstrate that we have destroyed the analytic file after the Challenge is over?
Participants should send an email to Susan Kerin at firstname.lastname@example.org after the Challenge to confirm that they have complied with this requirement.
Are there any options by which we can gain access to the AHRQ county-level data for future use? For example, will they be available to purchase?
Participants may obtain similar data at https://hcupnet.ahrq.gov/#setup. From this homepage, select "Create a New Analysis" and then "Community."
We blended the AHRQ dataset with the dataset that we had identified and used for the Aetna Big Data Challenge for Social Determinants, which we had attempted at the beginning of the year. The prediction shows a negative correlation with equity: it is catching that a higher standard of health, which generally correlates with better economic/income conditions, is associated with longer length of stay and also with less severe health/disease conditions.
Thank you for sharing this information. We will forward this information to the AHRQ evaluation team.
The main thing we do not understand is what specific values we need to predict. From reading the website, we expected that the Excel spreadsheet would have rows and columns that were only partially filled in (with missing values in the year 2016 and 2017 columns), and that we would fill in missing values and send the spreadsheet back, but that does not seem to be the case: everything in the 2016 columns are filled in, and there are no columns for 2017. For Item #1 of the submission, should we create our own columns for 2017 (and if so, does it matter where in the spreadsheet we place them?) and fill in predicted values for all of the current rows? And for Item #2, for what counties should we predict 2016 values? We notice that not all 50 States or all counties are included in the spreadsheet--should we create new rows for those and predict 2016 and/or 2017 values for the missing counties?
The customized AHRQ data file contains the ACTUAL number of discharges and average length of stay for years 2011-2016. Please place your predictions in any other column not used in the spreadsheet (e.g., Q, R, S, and T). Please be sure to LABEL the appropriate predictions at the top of the column (e.g., "Predicted Discharges 2016"). Make predictions for ALL counties identified in the customized AHRQ file. You do NOT need to make predictions for counties that are not listed.
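For illustration only, labeled prediction columns could be assembled as below (shown as CSV using Python's standard library; the column labels and FIPS code are hypothetical examples, and the official template is the Excel file AHRQ provides):

```python
import csv
import io

# Hypothetical submission rows keyed by state-county FIPS code; labels and
# values are illustrative, not the official AHRQ template.
rows = [
    {"State-County Code": "01001",
     "Predicted Discharges 2016": 4200,
     "Predicted ALOS 2016": 4.6,
     "Predicted Discharges 2017": 4350,
     "Predicted ALOS 2017": 4.5},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()  # clearly labeled column headers, as AHRQ requires
writer.writerows(rows)
submission = buf.getvalue()
```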
Can you confirm whether the number of discharges refers to the total number of discharges from facilities in the specific county, or number of discharges from residents in the county regardless of where they are discharged from?
The number of discharges is based on the patient's county of residence.
Submission requirements #1 and #2 use the phrase "selected counties." Are we to make predictions on all the counties listed in the Excel spreadsheet, or is there another list of "selected" counties we are to be using? This is particularly confusing for #2 because it seems we are supposed to predict values for 2016, but as I mentioned in my question above, it appears this information is already populated in the data file received. There are U.S. counties missing from the Excel spreadsheet. Are we to figure out which counties are missing and add or only use the counties in the list provided?
The customized AHRQ files contain ACTUAL number of discharges and average length of stay for years 2011-2016. Please place your predictions in any other columns (e.g., Q, R, S, and T). Please be sure to LABEL the appropriate predictions at the top of the column (e.g., "Predicted Discharges 2016"). Make predictions for ALL counties identified in the customized AHRQ file. You do NOT need to make predictions for counties that are not listed.
I'm interested in how data leakage will be considered during an evaluation. Presumably, there are publicly available datasets that have been released since 2017 that include factors that are highly correlated with the number of inpatient discharges and mean length of stay. While those files may improve predictive accuracy, it feels like they go against the spirit of the challenge. For example, if someone were to use 2017 CMS Hospital Compare Data in their submission (data captured for CY 2017), and you wanted to use the exact model/methodology to predict utilization for 2019 or 2020, you would have to wait several years after the year you intended to predict in order for the 2019/2020 dataset to be released.
AHRQ has no restrictions regarding the year in which the supplemental data is acquired.
Can I apply with an existing product?
While the submission requirements for this Challenge allow for the use of an existing solution as part of a participant’s entry, there still needs to be clear innovation. Participants will need to show they did not simply apply an existing solution without modification, enhancement, or augmentation. The purpose of the Challenge is to drive innovation, so there has to be a clearly innovative offering in your submission for it to compete successfully.
How does a Challenge differ from a grant or contract?
Challenges are different from grants and contracts. No budget needs to be submitted. There is a set prize award. However, there are no restrictions on how the prize award is spent.
I receive Federal grants or contracts. Can I still apply?
That depends. Federal grantees may not use Federal funds to develop COMPETES Act Challenge applications unless consistent with the purpose of their grant award. Federal contractors may not use Federal funds from a contract to develop COMPETES Act Challenge applications or to fund efforts in support of a COMPETES Act Challenge submission.
Can I maximize my odds for winning by providing multiple submissions?
UPDATED: Unfortunately, multiple submissions will not be allowed.
Do I retain intellectual property ownership?
Each entrant retains title and full ownership in and to their submission. Entrants expressly reserve all intellectual property rights not expressly granted under the Challenge agreement. By participating in the Challenge, each entrant hereby irrevocably grants to AHRQ a limited, non-exclusive, royalty-free, worldwide license and right to reproduce, publicly perform, publicly display, and use the submission to the extent necessary to administer the Challenge, and to publicly perform and publicly display the submission, including, without limitation, for advertising and promotional purposes relating to the Challenge.
Additionally, the grand prize winner will be required to post an open source version of the application’s code on the GitHub source code repository, made publicly available under the Creative Commons license CC0 1.0 Universal (CC0 1.0, Public Domain Dedication). For a summary and full text of the CC0 1.0 Universal license, go to https://creativecommons.org/publicdomain/zero/1.0/. The GitHub source code repository is accessible at https://github.com.
What potential benefits do I get from making a submission to one of the Challenges?
In addition to the Prize award for the winning applications, winning participants get expert validation of their solution and benefit from promotion efforts which may expand their exposure to a wide audience of potential users.
Can more than one person work on a submission and receive credit? And if the submission wins, will the prize money be split among the participants?
Yes, more than one individual is permitted to complete a submission package during both phases of the Challenge. Team members are solely responsible for allocating the value of services, awards, or prizes.
Who can I contact if I have questions?
Please email the logistics contractor, Susan Kerin with Capital Consulting Corporation, at email@example.com. In fairness to all potential contestants, we may post the answer to your question(s) online on our FAQ page.