
Illustrating potential effects of alternate control populations on real-world evidence-based statistical analyses

Cite this dataset

Huang, Yidi; Yuan, William; Kohane, Isaac; Beaulieu-Jones, Brett (2021). Illustrating potential effects of alternate control populations on real-world evidence-based statistical analyses [Dataset]. Dryad.


Objective: Case-control study designs are commonly used in retrospective analyses of Real-World Evidence (RWE). Due to the increasingly wide availability of RWE, it can be difficult to determine whether findings are robust or the result of testing multiple hypotheses.

Materials and Methods: We investigate the potential effects of modifying cohort definitions in a case-control association study between depression and Type 2 Diabetes Mellitus (T2D). We used a large (>75 million individuals) de-identified administrative claims database to observe the effects of minor changes to the requirements of glucose and hemoglobin A1c tests in the control group.
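To make the idea of modifying the control definition concrete, the sketch below computes an odds ratio for the same case group against three alternate control groups. It is an illustration only: the `Person` fields, the three control rules, and all counts are hypothetical toy values, not the criteria or data used in the study.

```python
from dataclasses import dataclass


@dataclass
class Person:
    t2d: bool            # has a T2D diagnosis (case) or not
    depression: bool     # exposure of interest
    glucose_tests: int   # hypothetical count of glucose tests on record
    a1c_tests: int       # hypothetical count of hemoglobin A1c tests on record


def make_group(n_dep, n_no_dep, t2d, glucose_tests, a1c_tests):
    """Build a toy group with a fixed number of depressed / non-depressed people."""
    return (
        [Person(t2d, True, glucose_tests, a1c_tests) for _ in range(n_dep)]
        + [Person(t2d, False, glucose_tests, a1c_tests) for _ in range(n_no_dep)]
    )


def odds_ratio(cases, controls):
    """Odds of depression among cases divided by odds among controls.

    No zero-cell handling: this toy data never produces an empty cell.
    """
    a = sum(p.depression for p in cases)
    b = len(cases) - a
    c = sum(p.depression for p in controls)
    d = len(controls) - c
    return (a * d) / (b * c)


# Toy cohort (all counts invented for illustration).
cases = make_group(40, 60, t2d=True, glucose_tests=2, a1c_tests=2)
non_t2d = (
    make_group(10, 90, t2d=False, glucose_tests=0, a1c_tests=0)
    + make_group(20, 80, t2d=False, glucose_tests=1, a1c_tests=0)
    + make_group(30, 70, t2d=False, glucose_tests=1, a1c_tests=1)
)

# Three hypothetical control definitions that differ only in lab-test requirements.
control_rules = {
    "no test requirement": lambda p: True,
    ">=1 glucose test": lambda p: p.glucose_tests >= 1,
    ">=1 glucose and >=1 A1c test": lambda p: p.glucose_tests >= 1 and p.a1c_tests >= 1,
}

results = {
    name: odds_ratio(cases, [p for p in non_t2d if rule(p)])
    for name, rule in control_rules.items()
}

for name, value in results.items():
    print(f"{name}: OR = {value:.2f}")
```

In this toy cohort the same cases yield a different odds ratio under each control rule, which is the kind of sensitivity the analysis examines on real claims data.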

Results: We found that small permutations to the criteria used to define the control population result in significant shifts in both the demographic structure of the identified cohort and the odds ratio of association. These differences remain present when testing against age- and sex-matched controls.

Discussion: Analyses of RWE need to be carefully designed to avoid issues of multiple testing. Minor changes to control cohorts can lead to significantly different results and have the potential to alter even prospective studies through selection bias.

Conclusion: We believe this work offers strong support for the need for robust guidelines, best practices, and regulations around the use of observational RWE for clinical or regulatory decision making.


Included are the code and bootstrap results from our analysis on a large nationwide commercial insurance claims dataset covering >75 million individuals from Jan 1, 2008 to Aug 31, 2019. Results can be used to recreate our figures. For licensing reasons, we are unable to share the source data.

Usage notes

T2D Data prep.ipynb creates helper tables from source data which are used for T2D case/control, as well as depression phenotyping.

Exploration.ipynb generates the phenotyped cohorts and performs bootstrap association testing. This file also includes queries for population characteristics.
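For readers unfamiliar with bootstrap association testing, the following is a minimal sketch of a percentile bootstrap for an odds ratio. It is not the code from Exploration.ipynb: the function names, the 0.5 continuity correction, the number of replicates, and the toy exposure flags are all assumptions made for illustration.

```python
import random


def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table.

    a, b: exposed / unexposed cases; c, d: exposed / unexposed controls.
    Adding 0.5 to every cell (an assumption of this sketch) avoids
    division by zero when a resample produces an empty cell.
    """
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return (a * d) / (b * c)


def table_counts(flags):
    """Return (exposed, unexposed) counts from a list of 0/1 exposure flags."""
    exposed = sum(flags)
    return exposed, len(flags) - exposed


def bootstrap_or(cases, controls, n_boot=1000, seed=0):
    """Point estimate and 95% percentile bootstrap interval for the odds ratio.

    cases, controls: lists of 0/1 flags (1 = exposed, e.g. depression).
    Cases and controls are resampled with replacement, separately.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resampled_cases = rng.choices(cases, k=len(cases))
        resampled_controls = rng.choices(controls, k=len(controls))
        estimates.append(
            odds_ratio(*table_counts(resampled_cases), *table_counts(resampled_controls))
        )
    estimates.sort()
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    point = odds_ratio(*table_counts(cases), *table_counts(controls))
    return point, lower, upper


# Toy data (invented): 30 of 100 cases exposed, 15 of 100 controls exposed.
toy_cases = [1] * 30 + [0] * 70
toy_controls = [1] * 15 + [0] * 85

point, lower, upper = bootstrap_or(toy_cases, toy_controls)
print(f"OR = {point:.2f}, 95% bootstrap interval = ({lower:.2f}, {upper:.2f})")
```

Resampling cases and controls separately keeps the group sizes fixed in every replicate, which matches a case-control design; other resampling schemes are possible.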

Paper figs.ipynb generates the figures shown in our manuscript from results stored in figdata/.

A separate helper script maps NDC drug codes to RxCUI ingredient codes.


Funding

United States National Library of Medicine, Award: T15LM007092

National Institute of Neurological Disorders and Stroke, Award: K99NS114850

National Institutes of Health