eQTM catalogue in children's blood
Ruiz-Arenas, Carlos; Bustamante, Mariona (2021), eQTM catalogue in children's blood, Dryad, Dataset, https://doi.org/10.5061/dryad.fxpnvx0t0
Background: The identification of expression quantitative trait methylation (eQTMs), defined as associations between DNA methylation levels and gene expression, might help the biological interpretation of epigenome-wide association studies (EWAS). We aimed to identify autosomal cis eQTMs in children’s blood, using data from 832 children of the Human Early Life Exposome (HELIX) project.
Methods: Blood DNA methylation and gene expression were measured with the Illumina 450K and the Affymetrix HTA v2 arrays, respectively. The relationship between methylation levels and expression of nearby genes (1 Mb window centered at the transcription start site, TSS) was assessed by fitting 13.6 M linear regressions adjusting for sex, age, cohort, and blood cell composition.
Results: We identified 39,749 blood autosomal cis eQTMs, representing 21,966 unique CpGs (eCpGs, 5.7% of total CpGs) and 8,886 unique transcript clusters (eGenes, 15.3% of total transcript clusters, equivalent to genes). In 87.9% of these cis eQTMs, the eCpG was located at <250 kb from eGene’s TSS; and 58.8% of all eQTMs showed an inverse relationship between the methylation and expression levels. Only around half of the autosomal cis-eQTMs eGenes could be captured through annotation of the eCpG to the closest gene. eCpGs had less measurement error and were enriched for active blood regulatory regions and for CpGs reported to be associated with environmental exposures or phenotypic traits. 40.4% of eQTMs had at least one genetic variant associated with methylation and expression levels. The overlap of autosomal cis eQTMs in children’s blood with those described in adults was small (13.8%), and age-shared cis eQTMs tended to be proximal to the TSS and enriched for genetic variants.
To test associations between DNA methylation levels and gene expression levels in cis (cis eQTMs), we paired each Gene to CpGs closer than 500 kb from its TSS, either upstream or downstream. For each Gene, the TSS was defined based on HTA-2.0 annotation, using the start position for transcripts in the + strand, and the end position for transcripts in the - strand. CpGs position was obtained from Illumina 450K array annotation. Only CpGs in autosomal chromosomes (from chromosome 1 to 22) were tested. In the main analysis, we fitted for each CpG-Gene pair a linear regression model between gene expression and methylation levels adjusted for age, sex, cohort, and blood cell type composition. A second model was run without adjusting for blood cellular composition and it is only reported on the online web catalog, but not discussed in this manuscript. Although some of the unique associations of the unadjusted model might be real, others might be confounded by the large methylation and expression changes among blood cell types.
To ensure that CpGs paired to a higher number of Genes do not have higher chances of being part of an eQTM, multiple-testing was controlled at the CpG level, following a procedure previously applied in the Genotype-Tissue Expression (GTEx) project (Gamazon et al., 2018). Briefly, our statistic used to test the hypothesis that a pair CpG-Gene is significantly associated is based on considering the lowest p-value observed for a given CpG and all its paired Gene (e.g., those in the 1 Mb window centered at the TSS). As we do not know the distribution of this statistic under the null, we used a permutation test. We generated 100 permuted gene expression datasets and ran our previous linear regression models obtaining 100 permuted p-values for each CpG-Gene pair. Then, for each CpG, we selected among all CpG-Gene pairs the minimum p-value in each permutation and fitted a beta distribution that is the distribution we obtain when dealing with extreme values (e.g. minimum) (Dudbridge and Gusnanto, 2008). Next, for each CpG, we took the minimum p-value observed in the real data and used the beta distribution to compute the probability of observing a lower p-value. We defined this probability as the empirical p-value of the CpG. Then, we considered as significant those CpGs with empirical p-values to be significant at 5% false discovery rate using Benjamini-Hochberg method. Finally, we applied a last step to identify all significant CpG-Gene pairs for all eCpGs. To do so, we defined a genome-wide empirical p-value threshold as the empirical p-value of the eCpG closest to the 5% false discovery rate threshold. We used this empirical p-value to calculate a nominal p-value threshold for each eCpG, based on the beta distribution obtained from the minimum permuted p-values. This nominal p-value threshold was defined as the value for which the inverse cumulative distribution of the beta distribution was equal to the empirical p-value. Then, for each eCpG, we considered as significant all eCpG-Gene variants with a p-value smaller than nominal p-value.
For the meQTLs catalogue, we selected 9.9 M cis and trans meQTLs with a p-value <1e-7 in the ARIES dataset consisting of data from children of 7 years old (Gaunt et al., 2016). Then, we tested whether this subset of 9.9 M SNPs were also meQTLs in HELIX by running meQTL analyses using MatrixEQTL R package (Shabalin, 2012), adjusting for cohort, sex, age, blood cellular composition and the first 20 principal components (PCs) calculated from genome-wide genetic data of the GWAS variability. We confirmed 2.8 M meQTLs in HELIX (p-value <1e-7). Trans meQTLs represented <10% of the 2.8 M meQTLs. Enrichment of eCpGs for meQTLs was computed using a Chi-square test, using non eCpGs as background.
Finally, we tested whether meQTLs were also eQTLs for the eGenes linked to the eCpGs. To this end, we run eQTL analyses (gene expression being the outcome and 2.8 M SNPs the predictors) with MatrixEQTL adjusting for cohort, sex, age, blood cellular composition and the first 20 GWAS PCs in HELIX. We considered as significant eQTLs the SNP-Gene pairs with p-value <1e-7 and with the direction of the effect consistent with the direction of the meQTL and the eQTM.