Expression-based machine learning models for predicting plant tissue identity
Data files
Sep 25, 2024 version files 13.68 GB
all_tissue_type.csv (12.11 KB)
Angiosperm_data_clean.csv (151.84 MB)
Arabidopsis_metadata.tsv (2.85 MB)
gene_FPKM_200501.parquet (5.07 GB)
gene_FPKM_transposed_UMR75.parquet (3.61 GB)
gene_FPKM_transposed.parquet (4.84 GB)
metadata_UMR75.csv (2.57 MB)
README.md (5.86 KB)
tissue_type_map_UMR75.csv (11.94 KB)
Abstract
The selection of Arabidopsis as a model organism played a pivotal role in advancing genomic science. Competing frameworks to select an agricultural- or ecological-based model species were selected against in favor of building knowledge in a species that would facilitate genome-enabled research. Here, we examine the ability of models based on Arabidopsis gene expression data to predict tissue identity in other flowering plants. Comparing different machine learning algorithms, models trained and tested on Arabidopsis data achieved near-perfect precision and recall values, whereas when tissue identity is predicted across the flowering plants using models trained on Arabidopsis data, precision values range from 0.69 to 0.74 and recall from 0.54 to 0.64. Below-ground tissue is more predictable than other tissue types, and the ability to predict tissue identity is not correlated with phylogenetic distance from Arabidopsis. K-Nearest Neighbors is the most successful algorithm and suggests that gene expression signatures, rather than marker genes, are more valuable in creating models for tissue and cell type prediction in plants. Our data-driven results highlight that the assertion that knowledge from Arabidopsis is translatable to other plants is not always true. Considering the current landscape of abundant sequencing data, we should reevaluate the scientific emphasis on Arabidopsis and prioritize plant diversity.
README: Expression-based machine learning models for predicting plant tissue identity
Arabidopsis Gene Expression Dataset
https://doi.org/10.5061/dryad.4b8gthtn7
The dataset contains three .parquet files:
1) gene_FPKM_200501.parquet: The original gene expression database, downloaded from the Arabidopsis RNA-Seq Database (Zhang et al., 2020). It contains 28,165 Arabidopsis gene expression profiles across 37,334 genes.
2) gene_FPKM_transposed.parquet: The transposed version of gene_FPKM_200501.parquet, which is better aligned with typical machine learning datasets, where samples are represented in rows.
3) gene_FPKM_transposed_UMR75.parquet: The gene expression profiles (gene_FPKM_transposed.parquet) filtered to remove samples with a unique mapped rate below 75%. This dataset is used to train and test the machine learning models.
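The transposition and filtering steps described in items 2 and 3 can be sketched with pandas as follows. This is a minimal illustration on an invented toy table, not the authors' code: the function name `transpose_and_filter`, the sample and gene IDs, and the assumption that the unique mapped rate is stored as a percentage indexed by sample ID are all hypothetical.

```python
import pandas as pd


def transpose_and_filter(expr_genes_by_samples, unique_mapped_rate, min_rate=75.0):
    """Return a samples-by-genes table restricted to samples at or above min_rate.

    expr_genes_by_samples: DataFrame with genes as rows and sample IDs as columns.
    unique_mapped_rate: Series of percentages indexed by sample ID (assumed layout).
    """
    # Samples become rows, genes become columns.
    expr_samples_by_genes = expr_genes_by_samples.T
    # Keep samples whose unique mapped rate is not below the threshold.
    keep = unique_mapped_rate[unique_mapped_rate >= min_rate].index
    return expr_samples_by_genes.loc[expr_samples_by_genes.index.intersection(keep)]


# Toy stand-in for the real matrix; IDs and values are invented for illustration.
toy_expr = pd.DataFrame(
    {"S1": [1.0, 0.0], "S2": [2.5, 3.0], "S3": [0.0, 4.2]},
    index=["GENE_A", "GENE_B"],
)
toy_rate = pd.Series({"S1": 91.2, "S2": 60.0, "S3": 75.0})

filtered = transpose_and_filter(toy_expr, toy_rate)
print(filtered)
```

With the real files, the input table would presumably come from `pd.read_parquet` (which needs a Parquet engine such as pyarrow installed); given the file sizes listed above, loading the full matrix may require a machine with ample memory.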
In addition to the .parquet files, the following .tsv and .csv files are also included in this upload:
4) Arabidopsis_metadata.tsv: The metadata accompanying the original gene expression dataset (gene_FPKM_200501.parquet). It provides the sample ID, project ID, sample name, PMID, genotype, ecotype, tissue type, total reads, unique mapped rate, and release date for each sample in the dataset. A "/" is placed in fields with missing information. In the other metadata files, "/" is replaced with "NA" for better compatibility with Python and pandas.
5) metadata_UMR75.csv: The metadata corresponding to the filtered gene expression profiles (gene_FPKM_transposed_UMR75.parquet).
6) all_tissue_type.csv: Lists the tissue type labels encountered in the metadata (Arabidopsis_metadata.tsv) and the various aggregations thereof that were ultimately used in the analyses. The variables (columns) in the file are described below:
Tissue: Contains all unique tissue type labels encountered in the metadata. "/" indicates missing or unknown labels.
Counts: Indicates the number of samples with the given Tissue label in the gene expression dataset.
TissueClean: Sanitized tissue labels based on the original designations. The following 23 distinct tissue labels are adopted: “anther,” “carpel,” “cotyledon,” “flower,” “hypocotyl,” “inflorescence,” “internode,” “leaf,” “other,” “petal,” “petiole,” “pistil,” “reproductive-other,” “root,” “root cell,” “seed,” “seedling,” “sepal,” “shoot,” “stamen,” “stigma,” “vasculature,” and “whole plant.” Missing or unknown labels are indicated by "NA", and samples with "NA" labels are eventually dropped from the analysis.
VegetativeRepro: The sanitized labels are further aggregated into "vegetative", "reproductive", "root", "whole plant", and "hypotocyl". Samples for which tissue identity could not be determined from their description were marked as "NA" and eventually discarded, as they were incompatible with our machine learning pipeline.
AboveBelow: Similarly, the sanitized tissue labels were aggregated into "above ground", "below ground", "seed", and "whole plant". Once again, samples for which tissue identity could not be determined were marked as "NA" and discarded.
Debatable: For some samples, the assigned VegetativeRepro or AboveBelow designation could be debatable. For example, we assigned the "whole plant" designation to "seedlings". Such debatable assignments are marked with a "Yes" in this column. The rest are marked with "NA" to indicate no ambiguity in the label assignment.
7) tissue_type_map_UMR75.csv: Contains the unique tissue type labels encountered in the metadata after discarding samples with a low unique mapped rate. It has the same column headers as all_tissue_type.csv, with two main differences: 1) it no longer has the Debatable column, and 2) all samples with "NA" labels are dropped. This file is used to map the aggregated tissue type labels to the gene expression profiles (gene_FPKM_transposed_UMR75.parquet). After discarding samples with a low unique mapped rate (<75%) and samples for which tissue identity could not be determined, the final Arabidopsis database was left with 19,415 samples.
8) Angiosperm_data_clean.csv: This database, created by Palande et al. (2023), contains 2,671 flowering plant gene expression profiles across 6,327 orthogroups. For more detailed information about the database, refer to the PLoS Biology article titled "The topological shape of gene expression across the evolution of flowering plants".
9) arabidopsis_gene_expression_main.zip: This archive contains all the code required to perform the analyses and reproduce the results presented in "Expression-based machine learning models for predicting plant tissue identity". To run the analyses, download the archive and extract all files into the code folder under the project directory. Similarly, download all the .parquet and .csv files into the data folder under the project directory. From there, the Python scripts and the Jupyter notebooks should work as intended.
The code is also available on GitHub at: "https://github.com/PlantsAndPython/arabidopsis-gene-expression". To run the analysis, simply clone the GitHub repository, then download the parquet files and put them in the "data" directory of the cloned repository. From there, the scripts and Jupyter Notebooks should work as intended. The GitHub method is preferred because it preserves the directory structure and relative paths.
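For readers working with the metadata and tissue map files described in items 4 to 7, the following sketch shows one way to treat "/" as a missing value and attach the aggregated labels to each sample. It is an assumption-laden illustration rather than the repository's code: the toy rows, the raw label "rosette leaf", and the metadata column names `Sample` and `Ecotype` are invented, while `Tissue`, `TissueClean`, and `AboveBelow` follow the column descriptions above.

```python
import io

import pandas as pd

# Toy stand-in for Arabidopsis_metadata.tsv; rows and column names are invented.
toy_tsv = (
    "Sample\tTissue\tEcotype\n"
    "S1\troot\tCol-0\n"
    "S2\t/\tCol-0\n"
    "S3\trosette leaf\t/\n"
)
# Treat "/" as missing, in addition to pandas' default missing-value markers.
meta = pd.read_csv(io.StringIO(toy_tsv), sep="\t", na_values=["/"])

# Toy stand-in for tissue_type_map_UMR75.csv (column names follow the README).
tissue_map = pd.DataFrame(
    {
        "Tissue": ["root", "rosette leaf"],
        "TissueClean": ["root", "leaf"],
        "AboveBelow": ["below ground", "above ground"],
    }
)

# An inner join keeps only samples whose raw label appears in the map;
# samples with a missing tissue label are dropped beforehand.
labeled = meta.dropna(subset=["Tissue"]).merge(tissue_map, on="Tissue", how="inner")
print(labeled[["Sample", "TissueClean", "AboveBelow"]])
```

Because pandas already parses "NA" as missing by default, the .csv metadata files should need no extra `na_values` argument.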
Methods
We analyzed gene expression data from two sources. The first (Zhang et al., 2020) contains 28,165 Arabidopsis gene expression profiles across 37,334 genes. The second (Palande et al., 2023) contains 2,671 flowering plant gene expression profiles across 6,327 orthogroups.
Gene expression profiles were initially classified into 23 tissue types based on their original designations: “anther,” “carpel,” “cotyledon,” “flower,” “hypocotyl,” “inflorescence,” “internode,” “leaf,” “other,” “petal,” “petiole,” “pistil,” “reproductive-other,” “root,” “root cell,” “seed,” “seedling,” “sepal,” “shoot,” “stamen,” “stigma,” “vasculature,” and “whole plant.”
Due to large differences in sample size between these categories, they were aggregated into four tissue type labels: "aboveground", "below ground", "whole plant", and "other". The categories are purposefully encompassing and were chosen to facilitate accurate assignment across the broad categories of experimental data we analyzed, focusing on aboveground and belowground tissue identity as one of the simplest cases to test tissue predictability.
Samples for which tissue identity could not be determined from their description were discarded, as they were incompatible with our machine learning pipeline. Additionally, we discarded low-quality samples, with quality measured by the unique mapped rate: the number of uniquely mapping reads divided by the total number of reads. After removing samples with missing metadata and samples with a low unique mapped rate (<75%), the Arabidopsis database was left with 19,415 samples. A conserved Arabidopsis database was also constructed by keeping only the genes mapped to orthogroups from the flowering plant database. The conserved Arabidopsis database contains the same number of samples but much smaller expression profiles, spanning only the 6,327 orthogroups shared with the angiosperm dataset.
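The unique mapped rate and the construction of the conserved database can be illustrated with a short sketch. The helper names, gene IDs, orthogroup IDs, and read counts below are hypothetical, and the sketch only restricts the table to genes that have an orthogroup assignment; how expression values from several genes in the same orthogroup were combined is not stated here, so that step is left out.

```python
import pandas as pd


def unique_mapped_rate(unique_reads, total_reads):
    """Percentage of reads that map uniquely."""
    return 100.0 * unique_reads / total_reads


def conserved_subset(expr_samples_by_genes, gene_to_orthogroup):
    """Keep only gene columns that have an orthogroup assignment."""
    shared = [g for g in expr_samples_by_genes.columns if g in gene_to_orthogroup]
    return expr_samples_by_genes[shared]


# Toy samples-by-genes table; IDs and values are invented for illustration.
toy_expr = pd.DataFrame(
    {"GENE_A": [1.0, 2.0], "GENE_B": [0.0, 3.5], "GENE_C": [4.0, 0.5]},
    index=["S1", "S2"],
)
# Hypothetical gene-to-orthogroup assignments; GENE_B has none.
toy_mapping = {"GENE_A": "OG_1", "GENE_C": "OG_2"}

conserved = conserved_subset(toy_expr, toy_mapping)
rate = unique_mapped_rate(8_200_000, 10_000_000)
print(conserved.shape, rate)
```

The subset keeps every sample and drops only columns, which matches the statement that the conserved database has the same number of samples as the filtered one.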
References:
Zhang, H., F. Zhang, Y. Yu, L. Feng, J. Jia, B. Liu, B. Li, et al. 2020. A comprehensive online database for exploring ∼20,000 public Arabidopsis RNA-seq libraries. Molecular Plant 13(9): 1231–1233.
Palande, S., J. A. Kaste, M. D. Roberts, K. S. Aba, C. Claucherty, J. Dacon, R. Doko, et al. 2023. The topological shape of gene expression across the evolution of flowering plants. PLoS Biology 21(12): e3002397.