On the cross-population generalizability of gene expression prediction models
Keys, Kevin L. et al. (2020), On the cross-population generalizability of gene expression prediction models, Dryad, Dataset, https://doi.org/10.7272/Q6RN362Z
The genetic control of gene expression is a core component of human physiology. For the past several years, transcriptome-wide association studies have leveraged large datasets of linked genotype and RNA sequencing information to create a powerful gene-based test of association that has been used in dozens of studies. While numerous discoveries have been made, the populations in the training data are overwhelmingly of European descent, and little is known about the generalizability of these models to other populations. Here, we test for cross-population generalizability of gene expression prediction models using a dataset of African American individuals with RNA-Seq data in whole blood. We find that the default models trained in large datasets such as GTEx and DGN fare poorly in African Americans, with a notable reduction in prediction accuracy when compared to European Americans. We replicate these limitations in cross-population generalizability using the five populations in the GEUVADIS dataset. Via realistic simulations of both populations and gene expression, we show that accurate cross-population generalizability of transcriptome prediction only arises when eQTL architecture is substantially shared across populations. In contrast, models with non-identical eQTLs showed patterns similar to real-world data. Therefore, generating RNA-Seq data in diverse populations is a critical step towards multi-ethnic utility of gene expression prediction.
This dataset is linked to a manuscript. For a complete description of how these data were produced, processed, and analyzed, see the preprint on bioRxiv. The study comprises three separate analyses, summarized below:
- Analysis of expression data from SAGE, a pediatric asthma cohort;
- Analysis of paired genotype-expression data from the GEUVADIS study;
- Simulated data using genotype data from the 1000 Genomes Project (1KGP).
Genotype data from SAGE are available on dbGaP under accession number phs000921.v1.p1.
Expression data were processed in accordance with the GTEx v6p pipeline. Inverse quantile normalized expression values on 39 SAGE subjects are provided here. These data are stored in the file sage_39_wgs_for_rnaseq_expression_sorted_headered.bed.tar.gz.
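As an illustration of what inverse quantile normalization does, the following minimal Python sketch maps one gene's expression values across samples to standard-normal quantiles by rank. It is not the GTEx v6p pipeline code: the function name `inverse_normal_transform` and the rank offset r / (n + 1) are assumptions made for this example, and the actual pipeline may treat ties and offsets differently.

```python
from statistics import NormalDist


def inverse_normal_transform(values):
    """Map values to standard-normal quantiles by rank (ties share an average rank)."""
    n = len(values)
    # Indices of the samples, ordered from lowest to highest expression.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied values.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # Assumed offset: rank / (n + 1) keeps every quantile strictly inside (0, 1).
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]
```

The transform preserves the ordering of samples while discarding the original scale, so the median sample maps to approximately zero and the extremes map to symmetric positive and negative values.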
Genome data for GEUVADIS were downloaded from the 1KGP data portal.
Expression data from GEUVADIS were taken from the file GD462.GeneQuantRPKM.50FN.samplename.resk10.txt.gz, downloaded from the GEUVADIS data portal (originally at https://www.ebi.ac.uk/Tools/geuvadis-das/, but defunct as of May 2020; try the 1KGP page or the EBI page).
Simulations from 1000 Genomes
Simulations used haplotype data originally from 1KGP. Haplotype data were downloaded from the IMPUTE website (download link not working as of May 2020) and are provided here for completeness (see file "HM3.tgz"). Forward-simulated haplotypes from HAPGEN2 are also provided here in three archives: AA.chr22.tar.gz, CEU.chr22.tar.gz, and YRI.chr22.tar.gz. A list of genes from chromosome 22 is also included as chr22.genelist.txt.
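The archives named above can be inspected before extraction. The sketch below uses only the Python standard library; the helper name `list_archive_members` is hypothetical, and it assumes the archive has already been downloaded to a local path.

```python
import tarfile


def list_archive_members(archive_path):
    """Return the names of all members in a tar archive without extracting it."""
    # Mode "r:*" lets tarfile detect the compression (gzip for .tar.gz and .tgz).
    with tarfile.open(archive_path, mode="r:*") as archive:
        return archive.getnames()
```

For example, `list_archive_members("CEU.chr22.tar.gz")` would return the file names stored in that archive, assuming it sits in the working directory.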
Results are separated by analysis.
Results from analysis of SAGE are stored in the archive sage.results.tar.gz. It contains three files:
- sage.predixcan.all.gene.results.txt, which contains all R2 and correlation results from comparison of measured gene expression to predictions from PrediXcan;
- gtex7.compare.r2.txt, which contains the comparison of GTEx v7 training R2 versus empirical PrediXcan R2 in SAGE, illustrated in Figure 3 of Keys et al. (2020);
- sage_predixcan_allresults_allplots_2020-02-17.Rdata, an R data file with manuscript figures included. Plotting code is available on GitHub.
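For readers who want to recompute this kind of accuracy metric on their own data, the sketch below compares measured and predicted expression for a single gene. It assumes R2 means the squared Pearson correlation (equivalent to the R2 of a simple linear regression of measured on predicted values); the manuscript should be consulted for the exact definitions used. The function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A constant vector has no variance, so the correlation is undefined.
        raise ValueError("correlation is undefined for a constant input")
    return sxy / sqrt(sxx * syy)


def prediction_accuracy(measured, predicted):
    """Return Pearson r and its square for one gene across samples."""
    r = pearson_r(measured, predicted)
    # Assumption: R2 is taken as the squared Pearson correlation.
    return {"r": r, "r2": r * r}
```

Reporting r alongside R2 is useful because R2 alone hides the sign: a gene whose predictions are anticorrelated with measurements can still have a high R2.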
Results from analysis of GEUVADIS data are split into three archives:
- geuvadis.numpred.tar.gz, which contains a count of the number of samples predicted for each gene in GEUVADIS analyses;
- geuvadis.predictionweights.tar.gz, which contains the prediction weights produced by the PrediXcan pipeline applied to GEUVADIS populations;
- geuvadis.results.tar.gz, which contains the comparisons between measurements and predictions in GEUVADIS subpopulations (see Tables 1-4, Supplementary Tables 5 and 8, and Supplementary Figures 13-17 of Keys et al. (2020)).
1000 Genomes Simulations
Gene expression prediction and TWAS association testing results from the analysis of simulated cross-population prediction with 1KGP populations are stored in the single archive 1000genomes-simulation.results.tar.gz. See Figures 4-7, Supplementary Tables 6-7 and 9-12, and Supplementary Figures 20-23 of Keys et al. (2020). Analysis and plotting code are on GitHub.
All source code for this project can be found on GitHub.
Genotype data are stored on dbGaP under accession number phs000921.v4.p1.
National Heart, Lung, and Blood Institute, Award: R01HL117004
National Heart, Lung, and Blood Institute, Award: R01HL128439
National Heart, Lung, and Blood Institute, Award: R01HL135156
National Heart, Lung, and Blood Institute, Award: X01HL134589
National Heart, Lung, and Blood Institute, Award: R01HL141992
National Heart, Lung, and Blood Institute, Award: R01HL104608
National Human Genome Research Institute, Award: U01HG007419
National Institute of Environmental Health Sciences, Award: R01ES015794
National Institute on Minority Health and Health Disparities, Award: P60MD006902
National Institute of General Medical Sciences, Award: RL5GM118984
Tobacco-Related Disease Research Program, Award: 24RT-0025
Tobacco-Related Disease Research Program, Award: 27IR-0030
National Institute of General Medical Sciences, Award: TL4GM118986
National Institute of General Medical Sciences, Award: UL1GM118985
National Heart, Lung, and Blood Institute, Award: R01HL135156-S1
Gordon and Betty Moore Foundation, Award: GBMF3834
Alfred P. Sloan Foundation, Award: 2013-10-27
National Heart, Lung, and Blood Institute, Award: R01HL117004-S1
National Institute of General Medical Sciences, Award: K12GM081266
National Heart, Lung, and Blood Institute, Award: K01HL140218
National Institute of General Medical Sciences, Award: T34GM008574
National Human Genome Research Institute, Award: R56HG010297
National Human Genome Research Institute, Award: T32HG00044