Generative prediction of causal gene sets responsible for complex traits
Data files
Apr 15, 2025 version; files total 1.96 GB
- data_aa_lam_3.zip (425.96 MB)
- depmap_expression.npy (197.30 MB)
- depmap_y.npy (10.41 KB)
- expression_GSE114065_series_matrix.npy (20.22 MB)
- expression_GSE129653_series_matrix.npy (23.23 MB)
- expression_GSE135092_series_matrix.npy (93.29 MB)
- expression_GSE182870_series_matrix.npy (224.87 MB)
- expression_GSE193677_series_matrix.npy (437.70 MB)
- expression_GSE202695_series_matrix.npy (189.67 MB)
- expression_GSE89843_series_matrix.npy (29.10 MB)
- expression_GSE96783_series_matrix.npy (62.82 MB)
- gene_pert_dict.npy (2.97 MB)
- gene_perturbations_for_ben.gctx (248.72 MB)
- mart_export.txt-2.gz (4.10 MB)
- matching_indices_allergy_rnaseq.npy (104.05 KB)
- matching_indices_amd_rnaseq.npy (307.44 KB)
- matching_indices_asthma_rnaseq.npy (255.26 KB)
- matching_indices_cancermeta_rnaseq.npy (66.83 KB)
- matching_indices_depmap_rnaseq.npy (274.90 KB)
- matching_indices_ib_rnaseq.npy (169.82 KB)
- matching_indices_MODY3_rnaseq.npy (153.54 KB)
- matching_indices_t1d_rnaseq.npy (132.82 KB)
- matching_indices_tep_rnaseq.npy (66.83 KB)
- README.md (6.77 KB)
- traits_allergy_GSE114065_series_matrix.npy (3.42 KB)
- traits_amd_GSE135092_series_matrix.npy (8.72 KB)
- traits_asthma_GSE96783_series_matrix.npy (7.22 KB)
- traits_cancermeta_GSE202695_series_matrix.npy (39.23 KB)
- traits_ib_GSE193677_series_matrix.npy (39.97 KB)
- traits_MODY3_GSE129653_series_matrix.npy (4.46 KB)
- traits_t1d_GSE182870_series_matrix.npy (49.07 KB)
- traits_tep_GSE89843_series_matrix.npy (12.54 KB)
Abstract
The relationship between genotype and phenotype remains an outstanding question for organism-level traits because these traits are generally complex. The challenge arises from complex traits being determined by a combination of multiple genes (or loci), which leads to an explosion of possible genotype-phenotype mappings. The primary techniques to resolve these mappings are genome/transcriptome-wide association studies, which are limited by their lack of causal inference and statistical power. Here, we develop an approach that leverages transcriptional data endowed with causal information and a generative machine learning model to strengthen statistical power. Our implementation of the approach, dubbed TWAVE, includes a variational autoencoder trained on human transcriptional data, which is incorporated into an optimization framework. TWAVE generates trait expression profiles, which we dimensionally reduce by identifying independently varying generalized pathways (eigengenes). We then conduct constrained optimization to find causal gene sets that are the gene perturbations whose measured transcriptomic responses best explain trait differences. By considering several complex traits, we show that the approach identifies causal genes that cannot be detected by the primary existing techniques. Moreover, the approach identifies complex diseases caused by distinct sets of genes, meaning that the disease is polygenic and exhibits distinct subtypes driven by different genotype-phenotype mappings. We suggest that the approach will enable the design of tailored experiments to identify multi-genic targets to address complex diseases.
https://doi.org/10.5061/dryad.s4mw6m9hf
Description of the data and file structure
This is the data repository for the project ‘Generative prediction of causal gene sets responsible for complex traits’.
This repository contains data to run Jupyter notebooks and a Python script in the associated Zenodo code repository (doi: 10.5281/zenodo.12955283).
Data: 1) single-cell RNAseq data on the human complex disease traits featured in the manuscript (labeled by GEO series, see Table 1 in main text). In the files below, the traits are labeled by their abbreviations and GEO series:
- tep = Non-small cell lung cancer (GSE89843)
- t1d = Type-1 diabetes (GSE182870)
- MODY3 = Maturity-onset diabetes of the young type 3 (GSE129653)
- ib = Inflammatory bowel (GSE193677)
- cancermeta = Cancer metastasis (GSE202695)
- asthma = Allergic asthma (GSE96783)
- allergy = Food allergy (GSE114065)
- amd = Macular degeneration (GSE135092)
- depmap = Pan-cancer metastasis (no GEO i.d.)
'traits' files are binary one-hot encoded arrays labeling baseline and variant phenotypes for each trait (labeled by abbreviation and GEO i.d., see above).
'matching_indices' files contain indices that match genes between the perturbation and trait datasets, labeled by abbreviation (see above).
mart_export.txt-2.gz, gene_pert_dict.npy and gene_perturbations_for_ben.gctx contain the gene perturbation data.
'expression' files are gene expression arrays (cells by genes) containing gene expression for each trait in transcripts-per-million (labeled by GEO series, see above; the depmap file is labeled by abbreviation).
2) Sample optimization data for the allergic asthma trait, needed to run the second Jupyter notebook in the Zenodo repository (TWAVE2, below): data_aa_lam_3.zip.
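As a rough illustration of the 'traits' encoding described above, the sketch below converts a one-hot array into integer class labels. It assumes that rows are samples and columns are phenotype classes; which column corresponds to baseline versus variant is not stated in this README, so the column order should be checked against the notebooks. The function name `onehot_to_labels` is hypothetical, and the toy input stands in for the result of calling `numpy.load(...).tolist()` on one of the `traits_*.npy` files.

```python
def onehot_to_labels(onehot):
    """Convert a one-hot matrix (rows = samples) into a list of class indices.

    Assumption: each row contains exactly one 1 and the rest 0s.
    Which index means "baseline" and which means "variant" is NOT
    specified in the README, so treat the integers as opaque class ids.
    """
    labels = []
    for i, row in enumerate(onehot):
        values = [int(v) for v in row]
        # A valid one-hot row sorts to [0, ..., 0, 1].
        if sorted(values) != [0] * (len(values) - 1) + [1]:
            raise ValueError(f"row {i} is not one-hot: {values}")
        labels.append(values.index(1))
    return labels


# Toy stand-in for a loaded traits array (4 samples, 2 phenotype classes).
toy_traits = [[1, 0], [0, 1], [0, 1], [1, 0]]
toy_labels = onehot_to_labels(toy_traits)
print(toy_labels)
```

The explicit validity check is there because the exact array layout is an assumption; if the real files are stored differently (for example as a single 0/1 vector), the function fails loudly instead of returning misleading labels.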
Jupyter notebooks in the associated Zenodo repository:
1) TWAVE1: a notebook implementing the variational autoencoder TWAVE, as well as dimensionality reduction via selection of causal eigengenes
2) TWAVE2: a notebook implementing post-optimization analysis by the maximum entropy graph null model and construction of gene perturbation co-occurrence networks.
Python script in the associated Zenodo repository:
TWAVE_optimization: a script implementing constrained optimization to find relevant genes that drive the transition between baseline and variant clusters in causal eigengene space. This script (see Zenodo) likely needs to be run on a computing cluster (2500 × 2 optimizations, each taking ~15-30 minutes).
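To give a sense of why a cluster is recommended, the following back-of-the-envelope sketch turns the figures quoted above into CPU-hours. It reads "2500 X 2" as 5000 independent runs and assumes they parallelize perfectly across workers; both are assumptions, and the worker count of 100 is purely illustrative. The helper `compute_budget` is hypothetical and not part of the Zenodo code.

```python
def compute_budget(n_runs, minutes_low, minutes_high, n_workers):
    """Return (cpu_hours_low, cpu_hours_high, wall_hours_low, wall_hours_high).

    Assumes runs are independent and spread evenly over n_workers,
    with no scheduling or I/O overhead.
    """
    cpu_low = n_runs * minutes_low / 60
    cpu_high = n_runs * minutes_high / 60
    return cpu_low, cpu_high, cpu_low / n_workers, cpu_high / n_workers


n_runs = 2500 * 2  # assumption: "2500 X 2" means 5000 runs in total
budget = compute_budget(n_runs, minutes_low=15, minutes_high=30, n_workers=100)
print(budget)  # (1250.0, 2500.0, 12.5, 25.0)
```

Under these assumptions the full sweep costs roughly 1250 to 2500 CPU-hours, which is about 52 to 104 days on a single core but only 12.5 to 25 hours on 100 workers.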
Please reach out to Ben Kuznets-Speck at biophysben@gmail.com with any questions.
Files and variables
File: traits_tep_GSE89843_series_matrix.npy
Description: binary trait matrix for Non-small cell lung cancer trait
File: traits_t1d_GSE182870_series_matrix.npy
Description: binary trait matrix for Type-1 diabetes
File: traits_MODY3_GSE129653_series_matrix.npy
Description: binary trait matrix for Maturity-onset diabetes of the young type 3
File: traits_ib_GSE193677_series_matrix.npy
Description: binary trait matrix for Inflammatory bowel
File: traits_cancermeta_GSE202695_series_matrix.npy
Description: binary trait matrix for cancer metastasis
File: traits_asthma_GSE96783_series_matrix.npy
Description: binary trait matrix for allergic asthma
File: traits_allergy_GSE114065_series_matrix.npy
Description: binary trait matrix for food allergy
File: traits_amd_GSE135092_series_matrix.npy
Description: binary trait matrix for macular degeneration
File: matching_indices_tep_rnaseq.npy
Description: matching gene indices between traits and perturbations (Non-small cell lung cancer)
File: matching_indices_ib_rnaseq.npy
Description: matching gene indices between traits and perturbations (inflammatory bowel)
File: matching_indices_t1d_rnaseq.npy
Description: matching gene indices between traits and perturbations (type-1 diabetes)
File: matching_indices_depmap_rnaseq.npy
Description: matching gene indices between traits and perturbations (pan-cancer metastasis)
File: matching_indices_allergy_rnaseq.npy
Description: matching gene indices between traits and perturbations (food allergy)
File: matching_indices_asthma_rnaseq.npy
Description: matching gene indices between traits and perturbations (allergic asthma)
File: matching_indices_MODY3_rnaseq.npy
Description: matching gene indices between traits and perturbations (maturity-onset diabetes of the young type 3)
File: matching_indices_cancermeta_rnaseq.npy
Description: matching gene indices between traits and perturbations (cancer metastasis)
File: matching_indices_amd_rnaseq.npy
Description: matching gene indices between traits and perturbations (macular degeneration)
File: depmap_y.npy
Description: depmap binary trait matrix (pan-cancer metastasis)
File: mart_export.txt-2.gz
Description: file needed for perturbations
File: gene_pert_dict.npy
Description: gene perturbation dictionary
File: expression_GSE114065_series_matrix.npy
Description: trait expression data (food allergy)
File: expression_GSE129653_series_matrix.npy
Description: trait expression data (MODY3)
File: expression_GSE202695_series_matrix.npy
Description: trait expression data (cancer metastasis)
File: expression_GSE89843_series_matrix.npy
Description: trait expression data (Non-small cell lung cancer, tep)
File: gene_perturbations_for_ben.gctx
Description: gene perturbation expression
File: expression_GSE96783_series_matrix.npy
Description: trait expression data (allergic asthma)
File: expression_GSE135092_series_matrix.npy
Description: trait expression data (macular degeneration, amd)
File: expression_GSE182870_series_matrix.npy
Description: trait expression data (type-1 diabetes, t1d)
File: depmap_expression.npy
Description: depmap expression data (pan-cancer metastasis)
File: expression_GSE193677_series_matrix.npy
Description: trait expression data (inflammatory bowel, ib)
File: data_aa_lam_3.zip
Description: data from allergic asthma optimization
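The naming pattern in the file list above can be captured in a small helper, sketched here as one possible convenience for locating the three `.npy` files that belong to a trait. The abbreviation-to-GEO mapping is copied from this README; the names `GEO_SERIES`, `trait_files` and `load_trait` are hypothetical and do not appear in the Zenodo code. Loading uses `numpy.load` with default settings, which assumes the files are plain arrays.

```python
from pathlib import Path

# Abbreviation -> GEO series, copied from the trait list in this README.
GEO_SERIES = {
    "tep": "GSE89843",
    "t1d": "GSE182870",
    "MODY3": "GSE129653",
    "ib": "GSE193677",
    "cancermeta": "GSE202695",
    "asthma": "GSE96783",
    "allergy": "GSE114065",
    "amd": "GSE135092",
}


def trait_files(abbrev):
    """Return the expected file names for one trait abbreviation."""
    if abbrev == "depmap":  # depmap has no GEO i.d. and its own file names
        return {
            "expression": "depmap_expression.npy",
            "traits": "depmap_y.npy",
            "matching_indices": "matching_indices_depmap_rnaseq.npy",
        }
    geo = GEO_SERIES[abbrev]
    return {
        "expression": f"expression_{geo}_series_matrix.npy",
        "traits": f"traits_{abbrev}_{geo}_series_matrix.npy",
        "matching_indices": f"matching_indices_{abbrev}_rnaseq.npy",
    }


def load_trait(data_dir, abbrev):
    """Load the three arrays for a trait from data_dir (requires NumPy)."""
    import numpy as np  # imported lazily so trait_files works without NumPy

    # Assumption: these files are plain arrays. If NumPy reports that
    # pickled objects are present, pass allow_pickle=True only for
    # files you trust.
    return {
        key: np.load(Path(data_dir) / name)
        for key, name in trait_files(abbrev).items()
    }
```

For example, `load_trait("path/to/download", "asthma")` would be expected to return a dictionary with `expression`, `traits` and `matching_indices` entries, provided the files were downloaded into that directory under their original names.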
Code/software
See Zenodo Generative-prediction-of-causal-gene-sets-responsible-for-complex-traits doi: 10.5281/zenodo.12955283
Access information
Other publicly accessible locations of the data:
- Gene Expression Omnibus
Data were derived from the following sources:
- Gene Expression Omnibus (GEO)