This study applies a graph neural network (GNN)-based approach to investigate metabolic perturbations in mouse liver transcriptomic data following toxicant exposure. A mouse-specific metabolic reaction network was constructed from Reactome, replacing the human network used in prior models. Publicly available transcriptomic datasets (n = 7,903 control samples across 26 tissues) were curated from Recount3 for model training and validation. Test datasets (n = 299) included liver samples from mice exposed to the environmental toxicant 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD).

Gene counts were filtered to retain only genes linked to known metabolic reactions and transformed using DESeq2. Principal component analysis (PCA) was applied to the genes associated with each reaction, and the first principal component (PC1) was used as the node feature. A GNN built with PyTorch Geometric, using GraphConv layers and global mean pooling, was trained to classify tissue type and later adapted via transfer learning for toxicant response classification.
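
The per-reaction PCA step described above can be illustrated with a minimal sketch. It is not the authors' code: the function name, the toy expression matrix, and the reaction IDs ("rxn_A", "rxn_B") are hypothetical, and NumPy's SVD stands in for whatever PCA routine was actually used. Note that the sign of a principal component is arbitrary.

```python
import numpy as np


def reaction_pc1_features(expr, reaction_genes):
    """Return a (samples x reactions) matrix of PC1 scores.

    expr           : 2-D array, rows = samples, columns = genes (already transformed).
    reaction_genes : dict mapping a reaction ID to the gene column indices linked to it.
    """
    columns = []
    for gene_idx in reaction_genes.values():
        block = expr[:, gene_idx]               # genes belonging to this reaction
        centered = block - block.mean(axis=0)   # PCA operates on mean-centered columns
        # Rows of vt are the principal axes, ordered by explained variance.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        columns.append(centered @ vt[0])        # projection of each sample onto PC1
    return np.column_stack(columns)


# Hypothetical toy input: 6 samples, 5 genes, 2 reactions.
rng = np.random.default_rng(0)
expr = rng.normal(size=(6, 5))
reaction_genes = {"rxn_A": [0, 1, 2], "rxn_B": [3, 4]}
node_features = reaction_pc1_features(expr, reaction_genes)
print(node_features.shape)  # (6, 2)
```

Each column of node_features would then supply one scalar feature per reaction node for a given sample.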

Integrated Gradients were used to estimate the importance of individual edges in the reaction network, and network centrality measures identified key reactions. Comparative differential gene expression and enrichment analyses were performed to contextualize GNN findings. All data were obtained from public sources and all code is available on GitHub.
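
For readers unfamiliar with Integrated Gradients, the following toy sketch shows the general idea on a three-input logistic model with an analytic gradient. It is an illustration under stated assumptions only: the model, weights, inputs, and all-zero baseline are hypothetical, and it attributes importance to input features rather than to edges of a reaction network as in the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w):
    """Toy differentiable model: logistic score of a weighted sum."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def model_grad(x, w):
    """Analytic gradient of the toy model with respect to each input."""
    s = model(x, w)
    return [s * (1.0 - s) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Approximate IG with a midpoint Riemann sum along the straight path
    from the baseline to the input."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(model_grad(point, w)):
            avg_grad[i] += g / steps
    # Attribution = (input - baseline) * path-averaged gradient.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


w = [0.8, -1.5, 0.3]          # hypothetical weights
x = [1.0, 0.5, 2.0]           # hypothetical input
baseline = [0.0, 0.0, 0.0]    # hypothetical all-zero baseline
attributions = integrated_gradients(x, baseline, w)

# Completeness property: attributions sum to model(x) - model(baseline).
gap = sum(attributions) - (model(x, w) - model(baseline, w))
```

The completeness check at the end is a useful sanity test for any IG implementation: a large gap usually means too few integration steps.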

Provenance for this README

File name: README.txt
Authors: Keji Yuan, Rance Nault
Date created: 2025-04-15
Date modified: 2025-04-22

Dataset Version and Release History

Current Version:
- Number: 1.0.0
- Date: 2025-04-22
- Persistent identifier: https://doi.org/10.5061/dryad.k3j9kd5kz
- Summary of changes: n/a
Embargo Provenance: n/a
- Scope of embargo: n/a
- Embargo period: n/a

Dataset Attribution and Usage

Dataset Title: Data for the article “Application of a metabolic network-based graph neural network for the identification of toxicant-induced perturbations”
Persistent Identifier: n/a
Dataset Contributors:
- Creators: Keji Yuan, Rance Nault
Date of Issue: 2025-04-15
Publisher: Michigan State University
License: Use of these data is covered by the following license:
- Title: CC0 1.0 Universal (CC0 1.0)
- Specification: https://creativecommons.org/publicdomain/zero/1.0/; The authors respectfully request that in the case of re-use that the associated manuscripts be cited.
Suggested Citations:
- Dataset citation:
  
  Yuan, K. and Nault, R. 2025. Data from: Application of a metabolic network-based graph neural network for the identification of toxicant-induced perturbations, Dryad, Dataset, https://doi.org/10.5061/dryad.k3j9kd5kz
- Corresponding publication:
  
  Yuan, K. and Nault, R. 2025. Application of a metabolic network-based graph neural network for the identification of toxicant-induced perturbations.

Contact Information

Name: Rance Nault
Affiliations: Department of Pharmacology and Toxicology, Institute for Integrative Toxicology, Michigan State University
ORCID ID: https://orcid.org/0000-0002-6822-4962
Email: naultran@msu.edu
Address: e-mail preferred

Additional Dataset Metadata

Acknowledgements

Funding sources: National Institute of Environmental Health Sciences Superfund Research Program, Award: P42ES004911

Data and File Overview

Summary Metrics

File count: 17
Total file size: 42526832 bytes
Range of individual file sizes: 137 bytes - 35221577 bytes
File formats: .pdf .tar.gz .txt .zip .cys

Naming Conventions

File naming scheme: files contain the prefix “Table_S”, “File_S”, or “Fig S” followed by the number representing their order of appearance in the associated manuscript.

Table_S1.txt.tar.gz
Table_S2.txt
Table_S3.txt
Table_S4.txt
Table_S5.txt
Table_S6.txt
Table_S7.txt
Table_S8.txt
Table_S9.txt
Table_S10.txt
File_S1.zip
File_S2.cys
File_S3.zip
Fig S1.pdf
Fig S2.pdf
Fig S3.pdf
Fig S4.pdf

Setup

Unpacking instructions:
tar -xzf Table_S1.txt.tar.gz
unzip File_S1.zip
unzip File_S3.zip
Relationships between files/folders: n/a
Recommended software/tools:
Cytoscape
Text editor

File/Folder Details

File Structure

This dataset is organized into several files and directories to support the findings of our research. Below is a description of the data structure and how a potential user might navigate and utilize these files.
├── Fig S1.pdf
├── Fig S2.pdf
├── Fig S3.pdf
├── Fig S4.pdf
├── File_S1
│   ├── Training_ARI
│   │   ├── ARI_v_ Adipose misclassification.png
│   │   ├── ARI_v Adrenal misclassification.png
│   │   ├── ARI_v Aorta misclassification.png
│   │   ├── ARI_v Bone misclassification.png
│   │   ├── ARI_v Brain misclassification.png
│   │   ├── ARI_v Eye misclassification.png
│   │   ├── ARI_v Heart misclassification.png
│   │   ├── ARI_v Intestine misclassification.png
│   │   ├── ARI_v Kidney misclassification.png
│   │   ├── ARI_v Liver misclassification.png
│   │   ├── ARI_v Lung misclassification.png
│   │   ├── ARI_v MOE misclassification.png
│   │   ├── ARI_v Mammary gland misclassification.png
│   │   ├── ARI_v Muscle misclassification.png
│   │   ├── ARI_v Nerve misclassification.png
│   │   ├── ARI_v Olfactory misclassification.png
│   │   ├── ARI_v Ovary misclassification.png
│   │   ├── ARI_v Pancreas misclassification.png
│   │   ├── ARI_v Skin misclassification.png
│   │   ├── ARI_v Sperm misclassification.png
│   │   ├── ARI_v Spinal cord misclassification.png
│   │   ├── ARI_v Stomach misclassification.png
│   │   ├── ARI_v Tendon misclassification.png
│   │   ├── ARI_v Testes misclassification.png
│   │   ├── ARI_v Tongue misclassification.png
│   │   ├── ARI_v Uterus misclassification.png
│   │   ├── ARI_v VNO misclassification.png
│   │   ├── top_10_Adipose_rxns.csv
│   │   ├── top_10_Adrenal_rxns.csv
│   │   ├── top_10_Aorta_rxns.csv
│   │   ├── top_10_Bone_rxns.csv
│   │   ├── top_10_Brain_rxns.csv
│   │   ├── top_10_Eye_rxns.csv
│   │   ├── top_10_Heart_rxns.csv
│   │   ├── top_10_Intestine_rxns.csv
│   │   ├── top_10_Kidney_rxns.csv
│   │   ├── top_10_Liver_rxns.csv
│   │   ├── top_10_Lung_rxns.csv
│   │   ├── top_10_MOE_rxns.csv
│   │   ├── top_10_Mammary gland_rxns.csv
│   │   ├── top_10_Muscle_rxns.csv
│   │   ├── top_10_Nerve_rxns.csv
│   │   ├── top_10_Olfactory_rxns.csv
│   │   ├── top_10_Ovary_rxns.csv
│   │   ├── top_10_Pancreas_rxns.csv
│   │   ├── top_10_Skin_rxns.csv
│   │   ├── top_10_Sperm_rxns.csv
│   │   ├── top_10_Spinal cord_rxns.csv
│   │   ├── top_10_Stomach_rxns.csv
│   │   ├── top_10_Tendon_rxns.csv
│   │   ├── top_10_Testes_rxns.csv
│   │   ├── top_10_Uterus_rxns.csv
│   │   └── top_10_VNO_rxns.csv
│   └── Validation_ARI
│       ├── ARI_v Adipose misclassification.png
│       ├── ARI_v Brain misclassification.png
│       ├── ARI_v Eye misclassification.png
│       ├── ARI_v Heart misclassification.png
│       ├── ARI_v Intestine misclassification.png
│       ├── ARI_v Kidney misclassification.png
│       ├── ARI_v Liver misclassification.png
│       ├── ARI_v Lung misclassification.png
│       ├── ARI_v MOE misclassification.png
│       ├── ARI_v Muscle misclassification.png
│       ├── ARI_v Pancreas misclassification.png
│       ├── ARI_v Skin misclassification.png
│       ├── ARI_v Testes _misclassification.png
│       ├── top_10_Adipose_rxns.csv
│       ├── top_10_Brain_rxns.csv
│       ├── top_10_Eye_rxns.csv
│       ├── top_10_Heart_rxns.csv
│       ├── top_10_Intestine_rxns.csv
│       ├── top_10_Kidney_rxns.csv
│       ├── top_10_Liver_rxns.csv
│       ├── top_10_Lung_rxns.csv
│       ├── top_10_MOE_rxns.csv
│       ├── top_10_Muscle_rxns.csv
│       ├── top_10_Pancreas_rxns.csv
│       ├── top_10_Skin_rxns.csv
│       └── top_10_Testes_rxns.csv
├── File_S2.cys
├── File_S3
│   ├── repeated random subsampling validation
│   │   ├── rrs_detailed_metrics_1.csv
│   │   ├── rrs_detailed_metrics_10.csv
│   │   ├── rrs_detailed_metrics_2.csv
│   │   ├── rrs_detailed_metrics_3.csv
│   │   ├── rrs_detailed_metrics_4.csv
│   │   ├── rrs_detailed_metrics_5.csv
│   │   ├── rrs_detailed_metrics_6.csv
│   │   ├── rrs_detailed_metrics_7.csv
│   │   ├── rrs_detailed_metrics_8.csv
│   │   └── rrs_detailed_metrics_9.csv
│   └── stratified_kfold_validation
│       ├── k_fold_detailed_metrics_1.csv
│       ├── k_fold_detailed_metrics_10.csv
│       ├── k_fold_detailed_metrics_2.csv
│       ├── k_fold_detailed_metrics_3.csv
│       ├── k_fold_detailed_metrics_4.csv
│       ├── k_fold_detailed_metrics_5.csv
│       ├── k_fold_detailed_metrics_6.csv
│       ├── k_fold_detailed_metrics_7.csv
│       ├── k_fold_detailed_metrics_8.csv
│       └── k_fold_detailed_metrics_9.csv

Details for: Table_S1.txt.tar.gz

Description: Identifiers and extracted metadata for all Recount3 mouse datasets available at the time of analysis.
Format(s): .txt.tar.gz
Size(s): 4629793 bytes
checksum: e48b5819d89d22b5d4111947ff79b689
Dimensions: 225558 rows x 2696 columns
Variables:
Missing data codes: n/a
Other encoding details: n/a
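
The checksums listed in this README are 32 hexadecimal characters long, which is consistent with MD5; the hash algorithm is not stated, so treat that as an assumption. Under that assumption, a downloaded file could be verified with a short sketch such as the one below (the helper names are illustrative only).

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's MD5 digest with the value listed in the README."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Example call (assumes the archive sits in the working directory):
# matches_checksum("Table_S1.txt.tar.gz", "e48b5819d89d22b5d4111947ff79b689")
```

Reading in chunks keeps memory use flat even for the largest file in the dataset.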

Details for: Table_S2.txt

Description: Identifiers and extracted metadata for all Recount3 mouse datasets used in the training and validation of the graph neural network.
Format(s): .txt
Size(s): 35221577 bytes
checksum: 4f97ae237208b842fcbbe848b6845390
Dimensions: 7904 rows x 203 columns
Variables:
Missing data codes: n/a
Other encoding details: n/a

Details for: Table_S3.txt

Description: Eigenvector and closeness centrality measures for the top reactions identified by IG values for each percentage subset of the original data.
Format(s): .txt
Size(s): 204429 bytes
checksum: d2ecf779cae539bd8a416b6456b6c7fe
Dimensions: 4500 rows x 5 columns
Variables:
- Node: The reaction ID of each node from the integrated gradient analysis.
- centrality_value: The centrality score of each reaction node.
- percent_subset: The percentage subset of the binary dose testing dataset (the dataset was split into 10 subsets).
- selection_criteria: The criterion used to extract high-centrality reaction nodes (one of two approaches).
- centrality_method: The centrality analysis method (eigenvector or closeness).
Missing data codes: n/a
Other encoding details: n/a
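
As a reminder of what the two centrality measures in this table capture, the sketch below computes both on a tiny graph in plain Python. It is illustrative only: the three-node undirected path and node names are hypothetical, the study's network may be directed or weighted, and the authors' actual tooling is not specified here.

```python
from collections import deque
import math


def closeness_centrality(adj):
    """(reachable nodes - 1) / sum of shortest-path distances, via unweighted BFS."""
    scores = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return scores


def eigenvector_centrality(adj, iterations=200):
    """Power iteration on (A + I); the identity shift avoids oscillation
    on bipartite graphs and leaves the leading eigenvector unchanged."""
    x = {node: 1.0 for node in adj}
    for _ in range(iterations):
        nxt = {node: x[node] + sum(x[nbr] for nbr in adj[node]) for node in adj}
        norm = math.sqrt(sum(v * v for v in nxt.values()))
        x = {node: v / norm for node, v in nxt.items()}
    return x


# Hypothetical undirected path: r1 - r2 - r3
adj = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2"]}
close = closeness_centrality(adj)
eig = eigenvector_centrality(adj)
```

In this toy graph the middle node r2 scores highest on both measures, since it is closest to every other node and is connected to all of them.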

Details for: Table_S4.txt

Description: Area under the receiver operating characteristic curve (AUROC) for each tissue type using either a GNN model or ResNet18 in the validation dataset.
Format(s): .txt
Size(s): 676 bytes
checksum: 75543b2f2b3a966c2ff41751d52f7eb1
Dimensions: 28 rows x 3 columns
Variables:
- Group: The tissue name in the validation dataset.
- Method: The deep learning model used (GNN or ResNet18).
- Mean_AUROC: The mean value of AUROC for each tissue in each model.
Missing data codes: n/a
Other encoding details: n/a

Details for: Table_S5.txt

Description: Performance metric comparison between the GNN model and ResNet18 for validation data.
Format(s): .txt
Size(s): 1962 bytes
checksum: 802fdcf791df0dbc7e40bfa751707398
Dimensions: 30 rows x 15 columns
Variables:
- name: The tissue name in the training and validation datasets.
- tissue: The index of each tissue.
- TP: True Positives
- TN: True Negatives
- FP: False Positives
- FN: False Negatives
- POSITIVES: Total number of actual positive instances
- NEGATIVES: Total number of actual negative instances
- MCC: Matthews Correlation Coefficient
- TPR: True Positive Rate (Sensitivity, Recall)
- TNR: True Negative Rate (Specificity)
- FPR: False Positive Rate (1 - Specificity)
- FNR: False Negative Rate (1 - Sensitivity)
- PRECISION: Precision (Positive Predictive Value)
- RECALL: Recall (Sensitivity, TPR)
- F1_score: Harmonic mean of Precision and Recall
Missing data codes: n/a
Other encoding details: n/a

Details for: Table_S6.txt

Description: Performance metrics for the binary dose classification using the GNN model.
Format(s): .txt
Size(s): 137 bytes
checksum: afabddc41c4d9f29d0245772932fce99
Dimensions: 4 rows x 14 columns
Variables:
- name: The dosage in the binary dose testing dataset.
- TP: True Positives
- TN: True Negatives
- FP: False Positives
- FN: False Negatives
- POSITIVES: Total number of actual positive instances
- NEGATIVES: Total number of actual negative instances
- MCC: Matthews Correlation Coefficient
- TPR: True Positive Rate (Sensitivity, Recall)
- TNR: True Negative Rate (Specificity)
- FPR: False Positive Rate (1 - Specificity)
- FNR: False Negative Rate (1 - Sensitivity)
- PRECISION: Precision (Positive Predictive Value)
- RECALL: Recall (Sensitivity, TPR)
- F1_score: Harmonic mean of Precision and Recall
Missing data codes: n/a
Other encoding details: n/a

Details for: Table_S7.txt

Description: Performance metrics for the dose-response classification using the GNN model.
Format(s): .txt
Size(s): 691 bytes
checksum: b9a13712ac3947a7d6f642b6bebd9649
Dimensions: 11 rows x 15 columns
Variables:
- name: The dosage in the dose-response testing dataset.
- dose: The index of each dosage.
- TP: True Positives
- TN: True Negatives
- FP: False Positives
- FN: False Negatives
- POSITIVES: Total number of actual positive instances
- NEGATIVES: Total number of actual negative instances
- MCC: Matthews Correlation Coefficient
- TPR: True Positive Rate (Sensitivity, Recall)
- TNR: True Negative Rate (Specificity)
- FPR: False Positive Rate (1 - Specificity)
- FNR: False Negative Rate (1 - Sensitivity)
- PRECISION: Precision (Positive Predictive Value)
- RECALL: Recall (Sensitivity, TPR)
- F1_score: Harmonic mean of Precision and Recall
Missing data codes: n/a
Other encoding details: n/a

Details for: Table_S8.txt

Description: Top 50 reactions sorted by integrated gradient (IG) value from the GNN results for the binary dose testing dataset.
Format(s): .txt
Size(s): 1629 bytes
checksum: d16ef85167782b4fb2cff800d01925a7
Dimensions: 52 rows x 3 columns
Variables:
- start: The start node of the reaction.
- end: The end node of the reaction.
- repeated: Boolean indicating whether the value appears more than once among the top 50 reactions.
Missing data codes: n/a
Other encoding details: n/a

Details for: Table_S9.txt

Description: Accuracy of binary dose classification using the GNN model for data subsets from 10% to 100% in 10% increments.
Format(s): .txt
Size(s): 152 bytes
checksum: b36606beac024cf063a4a1880de8ec14
Dimensions: 4 rows x 11 columns
Variables:
- Dataset: The percentage-based subset of the binary dose testing dataset (10 groups).
- Accuracy: The final accuracy of the GNN for each group.
- Samples: Number of samples in each group.
Missing data codes: n/a
Other encoding details: n/a

Details for: Table_S10.txt

Description: Spearman correlation coefficients for reactions rank-ordered by IG value across subsets of the original dataset (10% - 90%), with 10 repeated tests (10%, 90%).
Format(s): .txt
Size(s): 1204 bytes
checksum: ec4135c6732b68217b3a7ae4a9f9fa9f
Dimensions: 29 rows x 4 columns
Variables:
- Comparison: The two groups being compared.
- rho: Spearman’s rank correlation coefficient (ρ), which measures the strength and direction of the monotonic relationship between two ranked variables. Values range from -1 to +1.
- P-value: The probability of observing a correlation at least as strong as the one calculated if there were no true correlation between the variables.
- S-value: The test statistic calculated from the data and used to determine the p-value.
Missing data codes: n/a
Other encoding details: n/a
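
To make the rho column concrete, the sketch below computes Spearman's rank correlation in plain Python as the Pearson correlation of average ranks. It also returns the sum of squared rank differences; whether this is the quantity stored in the S-value column is an assumption (some statistics packages report it as the Spearman test statistic), and the input lists are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Return (rho, sum of squared rank differences) for two equal-length lists."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    rho = cov / (var_x * var_y) ** 0.5
    s_stat = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return rho, s_stat


# Hypothetical IG-based scores for five reactions in two subsets.
rho, s_stat = spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(rho, 3), s_stat)  # 0.8 4.0
```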

Details for: Fig S1.pdf

Description: Comparison of GNN and ResNet18 performance metrics.
Format(s): .pdf
Size(s): 15259 bytes
checksum: 78bd0121e3091437ca25890e82c2bdfe
Dimensions: n/a
Variables:
- MCC: Matthews Correlation Coefficient
- TPR: True Positive Rate (Sensitivity, Recall)
- TNR: True Negative Rate (Specificity)
- FPR: False Positive Rate (1 - Specificity)
- FNR: False Negative Rate (1 - Sensitivity)
- PRECISION: Precision (Positive Predictive Value)
- RECALL: Recall (Sensitivity, TPR)
- F1_score: Harmonic mean of Precision and Recall
Missing data codes: n/a
Other encoding details: n/a

Details for: Fig S2.pdf

Description: Variance in gene expression counts for mouse and TCGA/GTEx datasets.
Format(s): .pdf
Size(s): 1262116 bytes
checksum: 8be91faee63ccb7e26b5a9e9ec60cbe3
Dimensions: n/a
Variables:
- TCGA/GTEx: Datasets from the TCGA/GTEx samples.
- Mouse: Datasets from Recount3.
- Variance: Variance in the expression counts for each gene.
Missing data codes: n/a
Other encoding details: n/a

Details for: Fig S3.pdf

Description: F1 scores for the 10-fold (A) repeated random subsampling validation and (B) stratified k-fold validation analyses.
Format(s): .pdf
Size(s): 323829 bytes
checksum: 8827aed2f9b712d721ec9a1cd000194b
dimensions: n/a
variables:
Missing data codes: n/a
Other encoding details: n/a

Details for: Fig S4.pdf

Description: Graphical representation of manuscript analyses.
Format(s): .pdf
Size(s): 56224 bytes
checksum: e49df1628fc5f7ceedbd680abc597ba9
Dimensions: n/a
Variables:
Missing data codes: n/a
Other encoding details: n/a

Details for: File_S1.zip

Description: This ZIP archive contains tabular data and figures related to clustering performance for each tissue, assessed using the Adjusted Rand Index (ARI). Separate results are provided for training and validation datasets. Each table reports the top 10 ARI scores per tissue, along with tissue-specific misclassification visualizations.
Format(s): .zip
Size(s): 4629793 bytes
checksum: 61af580710c2b0780509201ea47252fc
Dimensions: n/a
Variables:
- x-axis (in figure): Adjusted Rand Index (same as the ARI column below).
- y-axis (in figure): Classification accuracy, expressed as 1 - misclassification rate.
- RXN_ID: A stable identifier for a specific biological reaction in the Reactome pathway database (e.g., R-MMU-XXXXXX).
- [Tissue name]: Tissue-specific column (e.g., Liver, Kidney, Heart) representing the classification accuracy for that tissue, computed as 1 - misclassification rate. Values closer to 1 indicate better performance. Unitless.
- ARI: A metric (unitless) for evaluating clustering accuracy, ranging from -1 to 1. Higher ARI values indicate stronger agreement between predicted and reference clusters.
Missing data codes: n/a
Other encoding details: n/a
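
For reference, the Adjusted Rand Index can be computed from two label assignments as sketched below. This is the standard pair-counting formulation written in plain Python, not the authors' code; the function name and the toy labels are hypothetical, and at least two samples are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI between two labelings of the same samples (n >= 2)."""
    n = len(labels_a)
    # Number of sample pairs that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Number of sample pairs that share a cluster within each labeling.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:               # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Same partition with swapped label names: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because the index is corrected for chance, label names do not matter, only which samples are grouped together.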

Details for: File_S2.cys

Description: Visualization of GNN model network of the binary dose test dataset.
Format(s): .cys
Size(s): 786004 bytes
checksum: 1fee78ef9f2a88214f2f7ef8c590db6c
Dimensions: n/a
Variables:
Missing data codes: n/a
Other encoding details: n/a
Network Data: networks/87-sorted_by_ig1030_time_cross.csv.xgmml: The main network in XGMML format; views/155-45772-sorted_by_ig1030_time_cross.csv.xgmml: The corresponding visual style/view information.
Tables (Attributes): SHARED_ATTRS-org.cytoscape.model.CyNode-…cytable; LOCAL_ATTRS-org.cytoscape.model.CyEdge-…cytable
Session Settings: session_vizmap.xml: Visual mapping and style settings; filters.json & filterChains.json: Any filters applied within the session.

Details for: File_S3.zip

Description: This ZIP archive contains classification performance metrics derived from repeated random subsampling and stratified k-fold cross-validation. The results are summarized by tissue and include standard confusion matrix components and derived performance measures.
Format(s): .zip
Size(s): 35900 bytes
checksum: 3e5fdb15e80af53eb91c0eeba87b22c8
Dimensions: n/a
Variables:
- tissue: Name of the tissue (e.g., Liver, Kidney) corresponding to the model evaluation results.
- TP: Number of correctly classified positive instances (True Positive).
- TN: Number of correctly classified negative instances (True Negative).
- FP: Number of negative instances incorrectly classified as positive (False Positive).
- FN: Number of positive instances incorrectly classified as negative (False Negative).
- POSITIVES: Total number of actual positive instances in the dataset.
- NEGATIVES: Total number of actual negative instances in the dataset.
- MCC: Matthews Correlation Coefficient. A balanced metric for binary classification performance, ranging from -1 (inverse prediction) to +1 (perfect prediction). Unitless.
- TPR: True Positive Rate. Also known as Recall or Sensitivity; calculated as TP / (TP + FN). Unitless.
- TNR: True Negative Rate. Also known as Specificity; calculated as TN / (TN + FP). Unitless.
- FPR: False Positive Rate. Complement of specificity; calculated as FP / (FP + TN). Unitless.
- FNR: False Negative Rate. Complement of sensitivity; calculated as FN / (FN + TP). Unitless.
- PRECISION: Also known as Positive Predictive Value; calculated as TP / (TP + FP). Unitless.
- RECALL: Another term for Sensitivity or TPR; calculated as TP / (TP + FN). Unitless.
- F1_score: Harmonic mean of Precision and Recall; provides a balanced measure of performance when classes are imbalanced. Unitless.
Missing data codes: n/a
Other encoding details: n/a
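
The derived measures listed above follow directly from the four confusion matrix counts. The sketch below recomputes them from TP, TN, FP, and FN using the formulas given in the variable descriptions plus the standard MCC and F1 definitions; the function names and the example counts are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here, not necessarily the one used to produce the files.

```python
import math


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a ratio is undefined."""
    return numerator / denominator if denominator else 0.0


def derived_metrics(tp, tn, fp, fn):
    """Recompute the derived columns from the confusion matrix counts."""
    tpr = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "TPR": tpr,
        "TNR": safe_div(tn, tn + fp),
        "FPR": safe_div(fp, fp + tn),
        "FNR": safe_div(fn, fn + tp),
        "PRECISION": precision,
        "RECALL": tpr,
        "F1_score": safe_div(2 * precision * tpr, precision + tpr),
        "MCC": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts for one tissue.
metrics = derived_metrics(tp=8, tn=9, fp=1, fn=2)
```

Recomputing these values from the count columns is a quick way to check that a table was read with the intended delimiter and column order.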
