Data from: Application of a metabolic network-based graph neural network for the identification of toxicant-induced perturbations
Data files
Jun 09, 2025 version files 65.27 MB
- File_S1.zip (24.36 MB)
- File_S2.cys (786 KB)
- File_S3.zip (35.90 KB)
- README.md (21.77 KB)
- Table_S1.txt.tar.gz (4.63 MB)
- Table_S10.txt (1.20 KB)
- Table_S2.txt (35.22 MB)
- Table_S3.txt (204.43 KB)
- Table_S4.txt (676 B)
- Table_S5.txt (1.96 KB)
- Table_S6.txt (137 B)
- Table_S7.txt (691 B)
- Table_S8.txt (1.63 KB)
- Table_S9.txt (152 B)
Abstract
This study applies a graph neural network (GNN)-based approach to investigate metabolic perturbations in mouse liver transcriptomic data following toxicant exposure. A mouse-specific metabolic reaction network was constructed from Reactome, replacing the human network used in prior models. Publicly available transcriptomic datasets (n = 7,903 control samples across 26 tissues) were curated from Recount3 for model training and validation. Test datasets (n = 299) included liver samples from mice exposed to the environmental toxicant 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD).
Gene counts were filtered to retain only those linked to known metabolic reactions and transformed using DESeq2. Principal component analysis (PCA) was applied to genes per reaction, with the first principal component (PC1) used as node features. A GNN architecture using PyTorch Geometric with GraphConv layers and global mean pooling was trained to classify tissue type and later adapted via transfer learning for toxicant response classification.
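The per-reaction PC1 feature described above can be illustrated with a minimal sketch. It assumes a small matrix of transformed expression values (rows are samples, columns are the genes linked to one reaction) and uses plain power iteration in place of a library PCA routine. The function name and the toy data are hypothetical and are not taken from the authors' code.

```python
import math

def first_principal_component(samples):
    """Return PC1 scores for `samples` (rows = samples, columns = genes).

    Illustrative only: the sign of PC1 is arbitrary, as in any PCA.
    """
    n, p = len(samples), len(samples[0])
    # Center each gene (column) on its mean.
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    # Sample covariance matrix (p x p).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Non-uniform start vector, so it is unlikely to be orthogonal to the
    # leading eigenvector (a uniform start fails for anti-correlated genes).
    v = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    # Power iteration converges to the leading eigenvector of the covariance.
    for _ in range(500):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance along the current direction
            return [0.0] * n
        v = [x / norm for x in w]
    # Project each centered sample onto the leading eigenvector.
    return [sum(r[j] * v[j] for j in range(p)) for r in centered]
```

Each returned score would then serve as the feature of one reaction node for one sample.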
Integrated Gradients were used to estimate the importance of individual edges in the reaction network, and network centrality measures identified key reactions. Comparative differential gene expression and enrichment analyses were performed to contextualize GNN findings. All data were obtained from public sources and all code is available on GitHub.
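As a rough illustration of how Integrated Gradients assigns importance, the sketch below approximates the path integral of the gradient between a baseline and an input with a midpoint Riemann sum. It operates on a toy scalar function with numerically estimated gradients, not on the trained GNN or its edges; the function name, step count, and example function are assumptions made for illustration.

```python
def integrated_gradients(f, x, baseline, steps=50, eps=1e-6):
    """Approximate Integrated Gradients of scalar function f at input x."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        # Midpoint of the k-th interval on the straight path baseline -> x.
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            up, down = point[:], point[:]
            up[i] += eps
            down[i] -= eps
            # Central-difference estimate of the partial derivative.
            avg_grad[i] += (f(up) - f(down)) / (2 * eps) / steps
    # Scale the path-averaged gradient by the input difference.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]
```

A useful sanity check is the completeness property: the attributions should sum to approximately f(x) - f(baseline).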
Provenance for this README
- File name: README.txt
- Authors: Keji Yuan, Rance Nault
- Date created: 2025-04-15
- Date modified: 2025-04-22
Dataset Version and Release History
- Current Version:
- Number: 1.0.0
- Date: 2025-04-22
- Persistent identifier: https://doi.org/10.5061/dryad.k3j9kd5kz
- Summary of changes: n/a
- Embargo Provenance: n/a
- Scope of embargo: n/a
- Embargo period: n/a
Dataset Attribution and Usage
- Dataset Title: Data for the article “Application of a metabolic network-based graph neural network for the identification of toxicant-induced perturbations”
- Persistent Identifier: n/a
- Dataset Contributors:
- Creators: Keji Yuan, Rance Nault
- Date of Issue: 2025-04-15
- Publisher: Michigan State University
- License: Use of these data is covered by the following license:
- Title: CC0 1.0 Universal (CC0 1.0)
- Specification: https://creativecommons.org/publicdomain/zero/1.0/; The authors respectfully request that in the case of re-use that the associated manuscripts be cited.
- Suggested Citations:
- Dataset citation: Yuan, K. and Nault, R. 2025. Data from: Application of a metabolic network-based graph neural network for the identification of toxicant-induced perturbations, Dryad, Dataset, https://doi.org/10.5061/dryad.k3j9kd5kz
- Corresponding publication: Yuan, K. and Nault, R. 2025. Application of a metabolic network-based graph neural network for the identification of toxicant-induced perturbations.
Contact Information
- Name: Rance Nault
- Affiliations: Department of Pharmacology and Toxicology, Institute for Integrative Toxicology, Michigan State University
- ORCID ID: https://orcid.org/0000-0002-6822-4962
- Email: naultran@msu.edu
- Address: e-mail preferred
Additional Dataset Metadata
Acknowledgements
- Funding sources: National Institute of Environmental Health Sciences Superfund Research Program, Award: P42ES004911
Data and File Overview
Summary Metrics
- File count: 17
- Total file size: 42526832 bytes
- Range of individual file sizes: 137 bytes - 35221577 bytes
- File formats: .cys .pdf .tar.gz .txt .zip
Naming Conventions
- File naming scheme: files contain the prefix “Table_S”, “File_S”, or “Fig S” followed by the number representing their order of appearance in the associated manuscript.
Table of Contents
- Table_S1.txt.tar.gz
- Table_S2.txt
- Table_S3.txt
- Table_S4.txt
- Table_S5.txt
- Table_S6.txt
- Table_S7.txt
- Table_S8.txt
- Table_S9.txt
- Table_S10.txt
- File_S1.zip
- File_S2.cys
- File_S3.zip
- Fig S1.pdf
- Fig S2.pdf
- Fig S3.pdf
- Fig S4.pdf
Setup
- Unpacking instructions:
tar -xzf Table_S1.txt.tar.gz
unzip File_S1.zip
unzip File_S3.zip
- Relationships between files/folders: n/a
- Recommended software/tools:
Cytoscape
Text editor
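After downloading, files can be checked against the checksums listed in the file details. The sketch below assumes those 32-character hexadecimal values are MD5 digests, which this README does not state explicitly; the function names are hypothetical.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected_hex):
    """Compare a file's digest with an expected value (assumed to be MD5)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_checksum("Table_S3.txt", "d2ecf779cae539bd8a416b6456b6c7fe")` should return True for an intact download if the MD5 assumption holds.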
File/Folder Details
File Structure
This dataset is organized into several files and directories to support the findings of our research. Below is a description of the data structure and how a potential user might navigate and utilize these files.
├── Fig S1.pdf
├── Fig S2.pdf
├── Fig S3.pdf
├── Fig S4.pdf
├── File_S1
│ ├── Training_ARI
│ │ ├── ARI_v_ Adipose misclassification.png
│ │ ├── ARI_v Adrenal misclassification.png
│ │ ├── ARI_v Aorta misclassification.png
│ │ ├── ARI_v Bone misclassification.png
│ │ ├── ARI_v Brain misclassification.png
│ │ ├── ARI_v Eye misclassification.png
│ │ ├── ARI_v Heart misclassification.png
│ │ ├── ARI_v Intestine misclassification.png
│ │ ├── ARI_v Kidney misclassification.png
│ │ ├── ARI_v Liver misclassification.png
│ │ ├── ARI_v Lung misclassification.png
│ │ ├── ARI_v MOE misclassification.png
│ │ ├── ARI_v Mammary gland misclassification.png
│ │ ├── ARI_v Muscle misclassification.png
│ │ ├── ARI_v Nerve misclassification.png
│ │ ├── ARI_v Olfactory misclassification.png
│ │ ├── ARI_v Ovary misclassification.png
│ │ ├── ARI_v Pancreas misclassification.png
│ │ ├── ARI_v Skin misclassification.png
│ │ ├── ARI_v Sperm misclassification.png
│ │ ├── ARI_v Spinal cord misclassification.png
│ │ ├── ARI_v Stomach misclassification.png
│ │ ├── ARI_v Tendon misclassification.png
│ │ ├── ARI_v Testes misclassification.png
│ │ ├── ARI_v Tongue misclassification.png
│ │ ├── ARI_v Uterus misclassification.png
│ │ ├── ARI_v VNO misclassification.png
│ │ ├── top_10_Adipose_rxns.csv
│ │ ├── top_10_Adrenal_rxns.csv
│ │ ├── top_10_Aorta_rxns.csv
│ │ ├── top_10_Bone_rxns.csv
│ │ ├── top_10_Brain_rxns.csv
│ │ ├── top_10_Eye_rxns.csv
│ │ ├── top_10_Heart_rxns.csv
│ │ ├── top_10_Intestine_rxns.csv
│ │ ├── top_10_Kidney_rxns.csv
│ │ ├── top_10_Liver_rxns.csv
│ │ ├── top_10_Lung_rxns.csv
│ │ ├── top_10_MOE_rxns.csv
│ │ ├── top_10_Mammary gland_rxns.csv
│ │ ├── top_10_Muscle_rxns.csv
│ │ ├── top_10_Nerve_rxns.csv
│ │ ├── top_10_Olfactory_rxns.csv
│ │ ├── top_10_Ovary_rxns.csv
│ │ ├── top_10_Pancreas_rxns.csv
│ │ ├── top_10_Skin_rxns.csv
│ │ ├── top_10_Sperm_rxns.csv
│ │ ├── top_10_Spinal cord_rxns.csv
│ │ ├── top_10_Stomach_rxns.csv
│ │ ├── top_10_Tendon_rxns.csv
│ │ ├── top_10_Testes_rxns.csv
│ │ ├── top_10_Uterus_rxns.csv
│ │ └── top_10_VNO_rxns.csv
│ └── Validation_ARI
│ ├── ARI_v Adipose misclassification.png
│ ├── ARI_v Brain misclassification.png
│ ├── ARI_v Eye misclassification.png
│ ├── ARI_v Heart misclassification.png
│ ├── ARI_v Intestine misclassification.png
│ ├── ARI_v Kidney misclassification.png
│ ├── ARI_v Liver misclassification.png
│ ├── ARI_v Lung misclassification.png
│ ├── ARI_v MOE misclassification.png
│ ├── ARI_v Muscle misclassification.png
│ ├── ARI_v Pancreas misclassification.png
│ ├── ARI_v Skin misclassification.png
│ ├── ARI_v Testes _misclassification.png
│ ├── top_10_Adipose_rxns.csv
│ ├── top_10_Brain_rxns.csv
│ ├── top_10_Eye_rxns.csv
│ ├── top_10_Heart_rxns.csv
│ ├── top_10_Intestine_rxns.csv
│ ├── top_10_Kidney_rxns.csv
│ ├── top_10_Liver_rxns.csv
│ ├── top_10_Lung_rxns.csv
│ ├── top_10_MOE_rxns.csv
│ ├── top_10_Muscle_rxns.csv
│ ├── top_10_Pancreas_rxns.csv
│ ├── top_10_Skin_rxns.csv
│ └── top_10_Testes_rxns.csv
├── File_S2.cys
├── File_S3
│ ├── repeated random subsampling validation
│ │ ├── rrs_detailed_metrics_1.csv
│ │ ├── rrs_detailed_metrics_10.csv
│ │ ├── rrs_detailed_metrics_2.csv
│ │ ├── rrs_detailed_metrics_3.csv
│ │ ├── rrs_detailed_metrics_4.csv
│ │ ├── rrs_detailed_metrics_5.csv
│ │ ├── rrs_detailed_metrics_6.csv
│ │ ├── rrs_detailed_metrics_7.csv
│ │ ├── rrs_detailed_metrics_8.csv
│ │ └── rrs_detailed_metrics_9.csv
│ └── stratified_kfold_validation
│ ├── k_fold_detailed_metrics_1.csv
│ ├── k_fold_detailed_metrics_10.csv
│ ├── k_fold_detailed_metrics_2.csv
│ ├── k_fold_detailed_metrics_3.csv
│ ├── k_fold_detailed_metrics_4.csv
│ ├── k_fold_detailed_metrics_5.csv
│ ├── k_fold_detailed_metrics_6.csv
│ ├── k_fold_detailed_metrics_7.csv
│ ├── k_fold_detailed_metrics_8.csv
│ └── k_fold_detailed_metrics_9.csv
Details for: Table_S1.txt.tar.gz
- Description: Identifiers and extracted metadata for all Recount3 mouse datasets available at the time of analysis.
- Format(s): .txt.tar.gz
- Size(s): 4629793 bytes
- checksum: e48b5819d89d22b5d4111947ff79b689
- Dimensions: 225558 rows x 2696 columns
- Variables:
- Missing data codes: n/a
- Other encoding details: n/a
Details for: Table_S2.txt
- Description: Identifiers and extracted metadata for all Recount3 mouse datasets used in the training and validation of the graph neural network.
- Format(s): .txt
- Size(s): 35221577 bytes
- checksum: 4f97ae237208b842fcbbe848b6845390
- Dimensions: 7904 rows x 203 columns
- Variables:
- Missing data codes: n/a
- Other encoding details: n/a
Details for: Table_S3.txt
- Description: Eigenvector and closeness centrality measures for the top reactions identified by IG values for each percentage subset of the original data.
- Format(s): .txt
- Size(s): 204429 bytes
- checksum: d2ecf779cae539bd8a416b6456b6c7fe
- Dimensions: 4500 rows x 5 columns
- Variables:
- Node: The reaction ID of each node from the integrated gradient analysis.
- centrality_value: The centrality score of the reaction node; higher values indicate a more central reaction.
- percent_subset: The percentage subset of the binary dose testing dataset; the dataset was split into 10 subsets.
- selection_criteria: Which of the two criteria was used to select high-centrality reaction nodes.
- centrality_method: Which of the two centrality analyses was used (eigenvector or closeness).
- Missing data codes: n/a
- Other encoding details: n/a
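For readers unfamiliar with the two centrality measures in this table, the following sketch computes both on a small undirected, unweighted toy graph. It is not the authors' pipeline: the adjacency-dictionary input, the treatment of the network as undirected, and the function names are assumptions made for illustration.

```python
from collections import deque

def closeness_centrality(adjacency):
    """adjacency: dict mapping node -> set of neighbours (undirected)."""
    scores = {}
    for source in adjacency:
        # Breadth-first search gives shortest path lengths from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adjacency[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        total = sum(dist.values())
        # Reachable nodes (excluding the source) divided by total distance.
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores

def eigenvector_centrality(adjacency, iterations=200):
    """Power iteration on (A + I); returns scores with unit Euclidean norm."""
    scores = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        # Adding the node's own score avoids oscillation on bipartite graphs
        # and leaves the eigenvectors of A unchanged.
        updated = {n: scores[n] + sum(scores[nb] for nb in adjacency[n])
                   for n in adjacency}
        norm = sum(v * v for v in updated.values()) ** 0.5
        scores = {n: v / norm for n, v in updated.items()}
    return scores
```

On a three-node path a-b-c, both measures rank the middle node b highest, which matches the intuition that central reactions connect many others.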
Details for: Table_S4.txt
- Description: Area under the receiver operating characteristic curve (AUROC) for each tissue type using either the GNN model or ResNet18 on the validation dataset.
- Format(s): .txt
- Size(s): 676 bytes
- checksum: 75543b2f2b3a966c2ff41751d52f7eb1
- Dimensions: 28 rows x 3 columns
- Variables:
- Group: The tissue name in the validation dataset.
- Method: The deep learning model used (GNN or ResNet18).
- Mean_AUROC: The mean AUROC for each tissue and model.
- Missing data codes: n/a
- Other encoding details: n/a
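As a reminder of what an AUROC value measures, the sketch below computes it as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. Treating each tissue as a one-vs-rest binary problem is an assumption here; the README does not specify how the per-tissue values were computed, and the function name is hypothetical.

```python
def auroc(labels, scores):
    """AUROC from binary labels (1 = positive, 0 = negative) and scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise form is quadratic in the number of samples, which is fine for illustration but slower than rank-based implementations on large datasets.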
Details for: Table_S5.txt
- Description: Performance metric comparison between the GNN model and ResNet18 for validation data.
- Format(s): .txt
- Size(s): 1962 bytes
- checksum: 802fdcf791df0dbc7e40bfa751707398
- Dimensions: 30 rows x 15 columns
- Variables:
- name: The tissue name in the training and validation datasets.
- tissue: The index of each tissue.
- TP: True Positives
- TN: True Negatives
- FP: False Positives
- FN: False Negatives
- POSITIVES: Total number of actual positive instances
- NEGATIVES: Total number of actual negative instances
- MCC: Matthews Correlation Coefficient
- TPR: True Positive Rate (Sensitivity, Recall)
- TNR: True Negative Rate (Specificity)
- FPR: False Positive Rate (1 - Specificity)
- FNR: False Negative Rate (1 - Sensitivity)
- PRECISION: Precision (Positive Predictive Value)
- RECALL: Recall (Sensitivity, TPR)
- F1_score: Harmonic mean of Precision and Recall
- Missing data codes: n/a
- Other encoding details: n/a
Details for: Table_S6.txt
- Description: Performance metrics for the binary dose classification using the GNN model.
- Format(s): .txt
- Size(s): 137 bytes
- checksum: afabddc41c4d9f29d0245772932fce99
- Dimensions: 4 rows x 14 columns
- Variables:
- name: The dosage in the binary dose testing dataset.
- TP: True Positives
- TN: True Negatives
- FP: False Positives
- FN: False Negatives
- POSITIVES: Total number of actual positive instances
- NEGATIVES: Total number of actual negative instances
- MCC: Matthews Correlation Coefficient
- TPR: True Positive Rate (Sensitivity, Recall)
- TNR: True Negative Rate (Specificity)
- FPR: False Positive Rate (1 - Specificity)
- FNR: False Negative Rate (1 - Sensitivity)
- PRECISION: Precision (Positive Predictive Value)
- RECALL: Recall (Sensitivity, TPR)
- F1_score: Harmonic mean of Precision and Recall
- Missing data codes: n/a
- Other encoding details: n/a
Details for: Table_S7.txt
- Description: Performance metrics for the dose-response classification using the GNN model.
- Format(s): .txt
- Size(s): 691 bytes
- checksum: b9a13712ac3947a7d6f642b6bebd9649
- Dimensions: 11 rows x 15 columns
- Variables:
- name: The dosage in the dose-response testing dataset.
- dose: The index of each dosage.
- TP: True Positives
- TN: True Negatives
- FP: False Positives
- FN: False Negatives
- POSITIVES: Total number of actual positive instances
- NEGATIVES: Total number of actual negative instances
- MCC: Matthews Correlation Coefficient
- TPR: True Positive Rate (Sensitivity, Recall)
- TNR: True Negative Rate (Specificity)
- FPR: False Positive Rate (1 - Specificity)
- FNR: False Negative Rate (1 - Sensitivity)
- PRECISION: Precision (Positive Predictive Value)
- RECALL: Recall (Sensitivity, TPR)
- F1_score: Harmonic mean of Precision and Recall
- Missing data codes: n/a
- Other encoding details: n/a
Details for: Table_S8.txt
- Description: Top 50 reactions sorted by integrated gradient (IG) value, from the GNN results on the binary dose testing dataset.
- Format(s): .txt
- Size(s): 1629 bytes
- checksum: d16ef85167782b4fb2cff800d01925a7
- Dimensions: 52 rows x 3 columns
- Variables:
- start: The start node of reaction.
- end: The end node of reaction.
- repeated: Boolean representing whether value appears more than once among top 50 reactions.
- Missing data codes: n/a
- Other encoding details: n/a
Details for: Table_S9.txt
- Description: Accuracy of binary dose classification using the GNN model for data subsets from 10% to 100% in 10% intervals.
- Format(s): .txt
- Size(s): 152 bytes
- checksum: b36606beac024cf063a4a1880de8ec14
- Dimensions: 4 rows x 11 columns
- Variables:
- Dataset: The percentage subset of the binary dose testing dataset; the dataset was split into 10 groups by percentage.
- Accuracy: The final accuracy of the GNN for each group.
- Samples: Number of samples in each group.
- Missing data codes: n/a
- Other encoding details: n/a
Details for: Table_S10.txt
- Description: Spearman correlation coefficients for reactions rank-ordered by IG value across subsets of the original dataset (10% - 90%), with 10 repeated tests (10%, 90%).
- Format(s): .txt
- Size(s): 1204 bytes
- checksum: ec4135c6732b68217b3a7ae4a9f9fa9f
- Dimensions: 29 rows x 4 columns
- Variables:
- Comparison: The two groups being compared.
- rho: Spearman’s rank correlation coefficient (ρ). It measures the strength and direction of the monotonic relationship between two ranked variables. Values range from -1 to +1.
- P-value: The probability of observing a correlation at least as strong as the one calculated if there were no true correlation between the variables.
- S-value: The test statistic calculated from the data and used to determine the p-value.
- Missing data codes: n/a
- Other encoding details: n/a
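The relationship between rho and the S-value can be sketched as follows, assuming the S-value is the sum of squared rank differences as reported by R's `cor.test` with `method = "spearman"`. That interpretation is an assumption, not something this README states, and the function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Return (rho, S), where S is the sum of squared rank differences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    s_value = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Pearson correlation of the ranks; without ties this equals
    # 1 - 6 * S / (n * (n**2 - 1)).
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5, s_value
```

The sketch does not compute the p-value, which depends on the test variant used.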
Details for: Fig S1.pdf
- Description: Comparison of GNN and ResNet18 performance metrics.
- Format(s): .pdf
- Size(s): 15259 bytes
- checksum: 78bd0121e3091437ca25890e82c2bdfe
- dimensions: n/a
- variables:
- MCC: Matthews Correlation Coefficient
- TPR: True Positive Rate (Sensitivity, Recall)
- TNR: True Negative Rate (Specificity)
- FPR: False Positive Rate (1 - Specificity)
- FNR: False Negative Rate (1 - Sensitivity)
- PRECISION: Precision (Positive Predictive Value)
- RECALL: Recall (Sensitivity, TPR)
- F1_score: Harmonic mean of Precision and Recall
- Missing data codes: n/a
- Other encoding details: n/a
Details for: Fig S2.pdf
- Description: Variance in gene expression counts for mouse and TCGA/GTEx datasets.
- Format(s): .pdf
- Size(s): 1262116 bytes
- checksum: 8be91faee63ccb7e26b5a9e9ec60cbe3
- dimensions: n/a
- variables:
- TCGA/GTex: Datasets from the TCGA/GTEx samples.
- Mouse: Datasets from Recount3.
- Variance: Variance in the expression counts for each gene.
- Missing data codes: n/a
- Other encoding details: n/a
Details for: Fig S3.pdf
- Description: F1 score for 10-fold (A) repeated random sampling validation and (B) stratified validation analyses.
- Format(s): .pdf
- Size(s): 323829 bytes
- checksum: 8827aed2f9b712d721ec9a1cd000194b
- dimensions: n/a
- variables:
- Missing data codes: n/a
- Other encoding details: n/a
Details for: Fig S4.pdf
- Description: Graphical representation of manuscript analyses.
- Format(s): .pdf
- Size(s): 56224 bytes
- checksum: e49df1628fc5f7ceedbd680abc597ba9
- dimensions: n/a
- variables:
- Missing data codes: n/a
- Other encoding details: n/a
Details for: File_S1.zip
- Description: This ZIP archive contains tabular data and figures related to clustering performance for each tissue, assessed using the Adjusted Rand Index (ARI). Separate results are provided for training and validation datasets. Each table reports the top 10 ARI scores per tissue, along with tissue-specific misclassification visualizations.
- Format(s): .zip
- Size(s): 4629793 bytes
- checksum: 61af580710c2b0780509201ea47252fc
- dimensions: n/a
- variables:
- x-axis (in figure): Adjusted Rand Index (same as the ARI column below).
- y-axis (in figure): Classification accuracy, expressed as 1 - misclassification rate.
- RXN_ID: A stable identifier for a specific biological reaction in the Reactome pathway database (e.g., R-MMU-XXXXXX).
- [Tissue name]: Tissue-specific column (e.g., Liver, Kidney, Heart) representing the classification accuracy for that tissue, computed as 1 - misclassification rate. Values closer to 1 indicate better performance. Unitless.
- ARI: A metric (unitless) for evaluating clustering accuracy, ranging from -1 to 1. Higher ARI values indicate stronger agreement between predicted and reference clusters.
- Missing data codes: n/a
- Other encoding details: n/a
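For reference, the Adjusted Rand Index can be computed from two label assignments using the standard pair-counting formula, as in the sketch below. How the clusters were derived per reaction is not described in this README, so the inputs here are simply two hypothetical label lists and the function name is illustrative.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same samples (pair-counting form)."""
    n = len(labels_true)
    # Pairs of samples that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2)
                    for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a cluster within each labeling separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_cells - expected) / (maximum - expected)
```

The score is invariant to relabeling, so swapping cluster names in one of the two lists leaves it unchanged.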
Details for: File_S2.cys
- Description: Cytoscape session file visualizing the GNN model network for the binary dose test dataset.
- Format(s): .cys
- Size(s): 786004 bytes
- checksum: 1fee78ef9f2a88214f2f7ef8c590db6c
- dimensions: n/a
- variables:
- Missing data codes: n/a
- Other encoding details: n/a
- Network Data: networks/87-sorted_by_ig1030_time_cross.csv.xgmml: The main network in XGMML format; views/155-45772-sorted_by_ig1030_time_cross.csv.xgmml: The corresponding visual style/view information.
- Tables (Attributes): SHARED_ATTRS-org.cytoscape.model.CyNode-…cytable; LOCAL_ATTRS-org.cytoscape.model.CyEdge-…cytable
- Session Settings: session_vizmap.xml: Visual mapping and style settings; filters.json & filterChains.json: Any filters applied within the session.
Details for: File_S3.zip
- Description: This ZIP archive contains classification performance metrics derived from repeated random subsampling and stratified k-fold cross-validation. The results are summarized by tissue and include standard confusion matrix components and derived performance measures.
- Format(s): .zip
- Size(s): 35900 bytes
- checksum: 3e5fdb15e80af53eb91c0eeba87b22c8
- dimensions: n/a
- variables:
- tissue: Name of the tissue (e.g., Liver, Kidney) corresponding to the model evaluation results.
- TP: Number of correctly classified positive instances (True Positive).
- TN: Number of correctly classified negative instances (True Negative).
- FP: Number of negative instances incorrectly classified as positive (False Positive).
- FN: Number of positive instances incorrectly classified as negative (False Negative).
- POSITIVES: Total number of actual positive instances in the dataset.
- NEGATIVES: Total number of actual negative instances in the dataset.
- MCC: Matthews Correlation Coefficient. A balanced metric for binary classification performance, ranging from -1 (inverse prediction) to +1 (perfect prediction). Unitless.
- TPR: True Positive Rate. Also known as Recall or Sensitivity; calculated as TP / (TP + FN). Unitless.
- TNR: True Negative Rate. Also known as Specificity; calculated as TN / (TN + FP). Unitless.
- FPR: False Positive Rate. Complement of specificity; calculated as FP / (FP + TN). Unitless.
- FNR: False Negative Rate. Complement of sensitivity; calculated as FN / (FN + TP). Unitless.
- PRECISION: Also known as Positive Predictive Value; calculated as TP / (TP + FP). Unitless.
- RECALL: Another term for Sensitivity or TPR; calculated as TP / (TP + FN). Unitless.
- F1_score: Harmonic mean of Precision and Recall; provides a balanced measure of performance when classes are imbalanced. Unitless.
- Missing data codes: n/a
- Other encoding details: n/a
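The derived measures listed above follow directly from the four confusion matrix counts. The sketch below recomputes them from TP, TN, FP, and FN using the formulas given in this section plus the standard MCC definition; the function name is hypothetical, and returning NaN for undefined ratios is a choice made here, not necessarily what the data files contain.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Derive the per-class metrics from confusion matrix counts."""
    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "POSITIVES": tp + fn,
        "NEGATIVES": tn + fp,
        "TPR": recall,
        "TNR": ratio(tn, tn + fp),
        "FPR": ratio(fp, fp + tn),
        "FNR": ratio(fn, fn + tp),
        "PRECISION": precision,
        "RECALL": recall,
        "F1_score": ratio(2 * precision * recall, precision + recall),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }
```

Recomputing these values from the TP, TN, FP, and FN columns is a quick consistency check on any of the metric tables.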