On the cross-population generalizability of gene expression prediction models

Keys, Kevin L.1; Mak, Angel C.Y.1; White, Marquitta J.1; Eckalbar, Walter L.1; Dahl, Andrew W.1; Mefford, Joel1; Mikhaylova, Anna V.2; Contreras, María G.1; Elhawary, Jennifer R.1; Eng, Celeste1; Hu, Donglei1; Huntsman, Scott1; Oh, Sam S.1; Salazar, Sandra1; Lenoir, Michael A.3; Ye, Jimmie Chun1; Thornton, Timothy A.2; Zaitlen, Noah4; Burchard, Esteban G.1; Gignoux, Christopher R.5

Published Aug 06, 2020 on Dryad. https://doi.org/10.7272/Q6RN362Z

Abstract

The genetic control of gene expression is a core component of human physiology. For the past several years, transcriptome-wide association studies have leveraged large datasets of linked genotype and RNA sequencing information to create a powerful gene-based test of association that has been used in dozens of studies. While numerous discoveries have been made, the populations in the training data are overwhelmingly of European descent, and little is known about the generalizability of these models to other populations. Here, we test for cross-population generalizability of gene expression prediction models using a dataset of African American individuals with RNA-Seq data in whole blood. We find that the default models trained in large datasets such as GTEx and DGN fare poorly in African Americans, with a notable reduction in prediction accuracy when compared to European Americans. We replicate these limitations in cross-population generalizability using the five populations in the GEUVADIS dataset. Via realistic simulations of both populations and gene expression, we show that accurate cross-population generalizability of transcriptome prediction only arises when eQTL architecture is substantially shared across populations. In contrast, models with non-identical eQTLs showed patterns similar to real-world data. Therefore, generating RNA-Seq data in diverse populations is a critical step towards multi-ethnic utility of gene expression prediction.

Data

This dataset accompanies a manuscript. For a complete description of how these data were produced, processed, and analyzed, see the preprint on bioRxiv. The study comprises three separate analyses, summarized below:

Analysis of expression data from SAGE, a pediatric asthma cohort;
Analysis of paired genotype-expression data from the GEUVADIS study;
Analysis of simulated data generated from genotype data of the 1000 Genomes Project (1KGP).

SAGE

Genotype data from SAGE are available on dbGaP under accession number phs000921.v1.p1.

Expression data were processed in accordance with the GTEx v6p pipeline. Inverse-quantile-normalized expression values for 39 SAGE subjects are provided here, in the file sage_39_wgs_for_rnaseq_expression_sorted_headered.bed.tar.gz.
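For readers unfamiliar with the normalization mentioned above, the following is a minimal sketch of a rank-based inverse normal transformation applied to one gene's expression values across samples. It is an illustration only: the function name, the tie handling, and the rank offset of 0.5 are assumptions, and the exact steps of the GTEx v6p pipeline may differ.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=0.5):
    """Map values to standard-normal quantiles according to their ranks.

    Hypothetical helper for illustration. Tied values share their average
    rank. The offset of 0.5 is an assumed convention, not taken from GTEx.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n

    # Walk through the sorted order, assigning average 1-based ranks to ties.
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    # Convert each rank to a probability strictly inside (0, 1), then to a z-score.
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((rank - offset) / n) for rank in ranks]


if __name__ == "__main__":
    # Made-up expression values for one gene across five samples.
    print(inverse_normal_transform([12.1, 3.4, 7.7, 7.7, 0.2]))
```

The transformed values depend only on the ordering of the inputs, so the result is symmetric around zero regardless of the original scale.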

GEUVADIS

Genome data for GEUVADIS were downloaded from the 1KGP data portal.

Expression data from GEUVADIS were taken from the file GD462.GeneQuantRPKM.50FN.samplename.resk10.txt.gz, downloaded from the GEUVADIS data portal (originally at https://www.ebi.ac.uk/Tools/geuvadis-das/, but defunct as of May 2020; try the 1KGP page or the EBI page).

Simulations from 1000 Genomes

Simulations used haplotype data originally from 1KGP. Haplotype data were downloaded from the IMPUTE website (download link not working as of May 2020) and are provided here for completeness (see file "HM3.tgz"). Forward-simulated haplotypes from HAPGEN2 are also provided here in three archives: AA.chr22.tar.gz, CEU.chr22.tar.gz, and YRI.chr22.tar.gz. A list of genes from chromosome 22 is also included as chr22.genelist.txt.
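The internal layout of these archives is not described on this page, so it can help to list their contents before extracting anything. The sketch below uses Python's standard tarfile module; the helper name is hypothetical, and the archive name in the usage example assumes the file has already been downloaded to the working directory.

```python
import tarfile


def list_archive_members(archive_path):
    """Return the names of regular files inside a tar archive.

    Hypothetical helper for illustration. The mode "r:*" lets tarfile
    detect the compression (gzip for .tar.gz and .tgz) automatically.
    """
    with tarfile.open(archive_path, "r:*") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


if __name__ == "__main__":
    # Assumes CEU.chr22.tar.gz was downloaded next to this script.
    for name in list_archive_members("CEU.chr22.tar.gz"):
        print(name)
```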

Results

Results are separated by analysis.

SAGE

Results from analysis of SAGE are stored in the archive sage.results.tar.gz. It contains three files:

sage.predixcan.all.gene.results.txt, which contains all R² and correlation results from comparison of measured gene expression to predictions from PrediXcan;
gtex7.compare.r2.txt, which contains the comparison of GTEx v7 training R² versus empirical PrediXcan R² in SAGE, illustrated in Figure 3 of Keys et al. (2020);
sage_predixcan_allresults_allplots_2020-02-17.Rdata, an R data file with manuscript figures included. Plotting code is available on GitHub.
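The R² and correlation values in these files compare measured expression against predicted expression for each gene. As a rough illustration of that kind of comparison, the sketch below computes a Pearson correlation and its square for a single gene. This page does not state how R² was computed, so treating it as the squared Pearson correlation is an assumption; the function names and example numbers are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Correlation is undefined when either sequence is constant.
        return float("nan")
    return sxy / sqrt(sxx * syy)


def prediction_r2(measured, predicted):
    """Squared Pearson correlation (assumed definition of R² for this sketch)."""
    r = pearson_r(measured, predicted)
    return r * r


if __name__ == "__main__":
    # Made-up values for one gene across four samples.
    measured = [0.3, -1.2, 0.8, 1.5]
    predicted = [0.1, -0.7, 0.4, 0.9]
    print(pearson_r(measured, predicted), prediction_r2(measured, predicted))
```

Reporting the correlation alongside R² preserves the sign of the relationship, which R² alone discards.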

GEUVADIS

Results from analysis of GEUVADIS data are split into three archives:

geuvadis.numpred.tar.gz, which contains a count of the number of samples predicted for each gene in GEUVADIS analyses;
geuvadis.predictionweights.tar.gz, which contains the prediction weights produced by the PrediXcan pipeline applied to GEUVADIS populations;
geuvadis.results.tar.gz, which contains the comparisons between measurements and predictions in GEUVADIS subpopulations (see Tables 1-4, Supplementary Tables 5 and 8, and Supplementary Figures 13-17 of Keys et al. (2020)).

1000 Genomes Simulations

Gene expression prediction and TWAS association testing results from analysis of simulated cross-population prediction with 1KGP populations are stored in the single archive 1000genomes-simulation.results.tar.gz. See Figures 4-7, Supplementary Tables 6-7 and 9-12, and Supplementary Figures 20-23 of Keys et al. (2020). Analysis and plotting code is on GitHub.
