Data and code from: Fatty acid metabolism reprograms immune microenvironment in retinal artery occlusion: Multi-Omics analysis highlights immunometabolic crosstalk
Data files
May 28, 2026 version files 19.14 MB
Abstract
This dataset contains integrated multi-omics data from a study investigating fatty acid metabolism and immune dysregulation in retinal artery occlusion (RAO). Data include serum metabolomics profiles, peripheral blood mononuclear cell (PBMC) transcriptomics, and immune phenotyping from 66 RAO patients (34 with dyslipidemia, 32 without) and 66 cataract controls. Analytical methods applied include multivariate analyses (sPLS-DA, OPLS-DA), random forest modeling, and SHAP evaluations to identify key metabolic and immune features. The dataset highlights differential fatty acid metabolites (e.g., C22:5n-3, C22:2n-6, C22:1n-9, C20:2n-6), associated gene expression changes, and immune cell alterations (elevated neutrophils, reduced lymphocytes). The data support exploration of metabolic–immune crosstalk in RAO pathogenesis and provide resources for biomarker discovery and therapeutic target validation.
Dataset DOI: 10.5061/dryad.98sf7m0wt
File: Fatty_acid_metabolomics_code_and_related_data.zip
Description:
This package contains code and data used for machine learning and statistical analyses of fatty acid metabolism in RAO. The repository is organized into folders corresponding to Random Forest, SVM, XGBoost, LightGBM, and ROC analyses. All code files are written in R, and input data are provided as tabular files in .csv format.
Folder Structure
· Random forest analysis/: contains R scripts and input data for two random forest feature-importance approaches: MeanDecreaseAccuracy and MeanDecreaseGini.
· SVM/: contains R code and input data for support vector machine classification and SHAP-based model interpretation.
· XGBoost/: contains R code and input data for extreme gradient boosting modeling, feature ranking, and SHAP visualization.
· LightGBM/: contains R code and input data for Light Gradient Boosting Machine modeling and SHAP visualization.
· ROC/: contains R code and associated data used for receiver operating characteristic curve generation and area under the curve calculation.
Data Tables
· Only fatty acids.csv: contains fatty acid measurements for RAO patients and cataract controls. The group variable is defined as 1 = RAO patients and 2 = cataract controls. Fatty acid concentrations are expressed in µmol/L.
· RAO Only fatty acids.csv: contains fatty acid measurements for RAO patients stratified by lipid status. The group variable is defined as 1 = dyslipidemia group and 0 = non-dyslipidemia group.
Dyslipidemia was defined according to the following criteria: total cholesterol (TC) >= 6.2 mmol/L, low-density lipoprotein cholesterol (LDL-C) >= 4.1 mmol/L, triglycerides (TG) >= 2.3 mmol/L, or high-density lipoprotein cholesterol (HDL-C) < 1.0 mmol/L.
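The criteria above can be expressed as a single logical rule. The following is a minimal sketch, not code from the shared scripts: the helper name is_dyslipidemia, the column names, and the three example patients are all hypothetical, and all values are assumed to be in mmol/L.

```r
# Hypothetical helper: flag dyslipidemia from a lipid panel (values in mmol/L).
# Argument names are assumptions, not column names from the shared files.
is_dyslipidemia <- function(tc, ldl_c, tg, hdl_c) {
  # Meeting any one of the four criteria is sufficient.
  tc >= 6.2 | ldl_c >= 4.1 | tg >= 2.3 | hdl_c < 1.0
}

# Illustrative values only (not study data).
lipids <- data.frame(
  id    = c("P1", "P2", "P3"),
  tc    = c(5.0, 6.5, 4.8),
  ldl_c = c(3.0, 3.9, 2.9),
  tg    = c(1.5, 1.8, 1.2),
  hdl_c = c(1.3, 1.2, 0.9)
)

# Encode as 1 = dyslipidemia, 0 = non-dyslipidemia, matching the group coding
# described for "RAO Only fatty acids.csv".
lipids$group <- as.integer(with(lipids, is_dyslipidemia(tc, ldl_c, tg, hdl_c)))
print(lipids)
```

In this toy table P2 is flagged by total cholesterol and P3 by low HDL-C, while P1 meets none of the criteria.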
File: Joint_analysis_code_of_metabolomics_and_transcriptomics_and_related_data.zip
This package contains the integrated analysis scripts and related data for linking fatty acid metabolomics, immune feature scores, and transcriptomic data. It includes overall analyses using all available samples and subgroup analyses focusing on RAO patients stratified by dyslipidemia status.
· All/: contains the full dataset and analysis scripts integrating immune features, transcriptomics, and metabolomics.
· RAO/: contains subgroup analyses focusing on RAO patients stratified by dyslipidemia status. Patient files are labeled with prefix C for RAO patients with normal lipid levels and prefix H for RAO patients with dyslipidemia.
· Immune inflammation gene set.xlsx or related gene-set files: provide immune-related gene sets used to calculate immune feature scores from transcriptomic data.
· Combination of metabolome and immune characteristics.csv and Combination of metabolome and immune characteristics dyslipidemia.csv: integrated matrices linking fatty acid metabolites and immune characteristic scores.
File: Transcriptomics_codes_and_related_data.zip
This package contains transcriptomic data and R scripts for preprocessing, differential expression analysis, enrichment analysis, PCA, and immune infiltration analysis. It includes two main components: an overall comparison folder and a RAO subgroup folder for dyslipidemia-related analyses.
· All/: contains gene.expression.csv, group information, and scripts for overall transcriptomic analysis.
· RAO/: contains gene.expression dyslipidemia.csv, subgroup information, and scripts for transcriptomic analysis of RAO patients with or without dyslipidemia.
· Original transcriptomic expression values are provided as FPKM and converted to TPM during preprocessing.
All analyses were conducted in R version 4.2.0 or higher using freely available packages. The repository contains scripts and data for fatty acid-based machine learning, ROC analysis, immune feature scoring, immune–metabolic correlation analysis, transcriptomic preprocessing, differential gene expression analysis, GO and KEGG enrichment, PCA, and CIBERSORT immune infiltration analysis. Each script corresponds to a specific analytical module and can be used to reproduce the corresponding results and figures.
Software and Dependencies
· Core packages: tidyverse, dplyr, caret, ggplot2, readr, readxl, tidyr, reshape2.
· Random Forest: randomForest, rfPermute, vegan, RColorBrewer.
· SVM: e1071, pROC, rms, kernelshap, shapviz.
· XGBoost: xgboost, Matrix, data.table, skimr, DataExplorer, GGally, ggpubr, ggprism, vip, Ckmeans.1d.dp, shapviz.
· LightGBM: lightgbm, shapviz, ROCit.
· ROC analysis: pROC.
· Correlation analysis: Hmisc, pheatmap, igraph, ggraph.
· Transcriptomic analysis: limma, pheatmap, ggsci, WGCNA, GSEABase, GSVA.
· Functional enrichment: clusterProfiler, enrichplot, org.Hs.eg.db, topGO, Rgraphviz, pathview, GOplot.
· PCA and immune infiltration: FactoMineR, factoextra, CIBERSORT, ggpubr, ggsci.
Random Forest Analysis: MeanDecreaseAccuracy
This R script builds random forest classification models using fatty acid variables as predictors. It reads the fatty acid datasets, converts the group variable into a factor, removes missing values, splits the dataset into training and testing subsets, and constructs a random forest model with 500 trees. The rfPermute package is used to perform permutation-based significance testing for variable importance. Features are ranked by MeanDecreaseAccuracy, and the top variables are visualized as bar plots with p-value-based significance annotations. Separate outputs are generated for RAO versus control comparison and for dyslipidemia versus non-dyslipidemia comparison among RAO patients.
Random Forest Analysis: MeanDecreaseGini
This R script follows the same random forest modeling framework but ranks fatty acid variables according to MeanDecreaseGini. The code extracts scaled importance scores, obtains permutation-based p values, selects the top-ranked variables, and generates publication-ready bar plots with significance symbols. The script is applied to both the full RAO-control dataset and the RAO dyslipidemia subgroup dataset.
Support Vector Machine Analysis
This R script constructs radial-kernel support vector machine models for fatty acid-based classification. It performs stratified data partitioning, converts the outcome variable into labeled factors, tunes cost and gamma parameters using 10-fold cross-validation, builds the final SVM model with optimal parameters, and evaluates model performance using confusion matrices. SHAP values are then calculated with kernelshap and visualized with shapviz to identify fatty acids contributing most strongly to model prediction.
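Stratified partitioning keeps the class proportions the same in the training and testing subsets. The sketch below shows the idea in base R on synthetic data; the 70/30 ratio, the seed, and the column name FA_1 are assumptions, and the shared scripts may use a package helper for this step instead.

```r
set.seed(42)  # fixed seed so the split is reproducible

# Synthetic data: 10 samples per class, one placeholder fatty acid column.
dat <- data.frame(
  group = factor(rep(c("RAO", "Control"), times = c(10, 10))),
  FA_1  = rnorm(20)
)

# Sample 70% of the row indices within each class (ratio is an assumption).
train_idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$group), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

table(train$group)  # class counts in the training subset
table(test$group)   # class counts in the testing subset
```

Sampling within each class, rather than from all rows at once, is what prevents a small dataset from ending up with an unbalanced test set by chance.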
XGBoost Analysis
This R script implements XGBoost binary classification using fatty acid profiles. It includes data cleaning, missing-value inspection, feature and label extraction, DMatrix conversion, model parameter setting, model training with early stopping, and initial performance evaluation. Feature importance is ranked using Gain, Cover, and Frequency. SHAP values are calculated with shapviz, and SHAP bar plots are exported to summarize feature contributions.
LightGBM Analysis
This R script builds LightGBM binary classification models using fatty acid variables. It splits the dataset into training and testing subsets, converts input features into matrix format, constructs LightGBM datasets, defines model parameters including learning rate, maximum tree depth, and number of leaves, and trains the model with early stopping. Test-set accuracy is calculated, and SHAP importance plots are generated to interpret the contribution of each fatty acid feature.
ROC Curve Analysis
This R script evaluates the classification performance of individual fatty acids using ROC curves. The pROC package is used to calculate ROC curves and AUC values. For the full fatty acid dataset, the script loops through multiple fatty acid variables and overlays their ROC curves in a single PDF plot. For the RAO-only dyslipidemia dataset, the script generates a ROC curve for the selected fatty acid variable and exports the figure as a PDF file.
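For readers who want to check an AUC value without installing pROC, the AUC of a single variable equals the Mann-Whitney rank statistic scaled by the number of case-control pairs. The sketch below is a base-R substitute for illustration, not the pROC-based code in the ROC/ folder; the function name auc_rank, the column names FA_a and FA_b, and the toy values are hypothetical.

```r
# AUC via the Mann-Whitney rank statistic (base R; swapped in for pROC here).
auc_rank <- function(score, is_case) {
  r  <- rank(score)          # average ranks are used for ties
  n1 <- sum(is_case)         # number of cases
  n0 <- sum(!is_case)        # number of controls
  (sum(r[is_case]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data using the group coding of "Only fatty acids.csv":
# 1 = RAO patients, 2 = cataract controls.
toy <- data.frame(
  group = c(1, 1, 1, 1, 2, 2, 2, 2),
  FA_a  = c(9, 8, 7, 3, 6, 4, 2, 1),
  FA_b  = c(5, 1, 6, 2, 4, 3, 8, 7)
)
is_rao <- toy$group == 1

# Loop over the fatty acid columns, one AUC per variable.
aucs <- sapply(toy[c("FA_a", "FA_b")], auc_rank, is_case = is_rao)

# A value below 0.5 means the variable is lower in cases; report the
# direction-free AUC alongside it.
data.frame(auc = aucs, auc_direction_free = pmax(aucs, 1 - aucs))
```

Here FA_a gives 0.875 because 14 of the 16 case-control pairs are ordered with the case higher, and FA_b gives 0.25 (0.75 once direction is ignored).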
Immune Feature Score Calculation: Full Dataset
This R script reads immune-related gene sets from the immune inflammation gene-set table and transcriptomic expression data from gene.expression.csv. For each immune feature, the script identifies matched genes in the transcriptome matrix and calculates the mean expression value across matched genes for each patient. The resulting immune characteristic score matrix is exported as a CSV file for downstream immune–metabolic correlation analysis.
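The scoring rule described above (mean expression across the matched genes of each gene set, per patient) can be sketched as follows. This is an illustration in base R with made-up gene names, gene sets, and values; it assumes genes in rows and patients in columns, which may differ from the layout of gene.expression.csv.

```r
# Toy expression matrix: genes in rows, patients in columns (illustrative values).
expr <- matrix(
  c(1, 2, 3,
    4, 5, 6,
    7, 8, 9,
    2, 2, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C", "GENE_D"), c("S1", "S2", "S3"))
)

# Hypothetical gene sets; GENE_X is absent from the matrix and is ignored.
gene_sets <- list(
  feature_1 = c("GENE_A", "GENE_B", "GENE_X"),
  feature_2 = c("GENE_C", "GENE_D")
)

# Mean expression over the genes of one set that are present in the matrix.
score_one <- function(genes, expr) {
  matched <- intersect(genes, rownames(expr))
  if (length(matched) == 0) {
    return(setNames(rep(NA_real_, ncol(expr)), colnames(expr)))
  }
  colMeans(expr[matched, , drop = FALSE])
}

# Result: immune features in rows, patients in columns.
scores <- t(sapply(gene_sets, score_one, expr = expr))
print(scores)
```

Taking the intersection first means genes missing from the transcriptome matrix do not introduce NA values into the score; a set with no matched genes returns NA for every patient.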
Immune Feature Score Calculation: Dyslipidemia Subgroup
This R script performs the same immune feature scoring procedure for the RAO dyslipidemia subgroup using gene.expression dyslipidemia.csv. It calculates patient-level immune characteristic scores for each immune gene set and exports the dyslipidemia-specific immune score table for downstream subgroup correlation analysis.
Immune–Metabolic Correlation Heatmap: Full Dataset
This R script integrates fatty acid metabolomic variables and immune feature scores. It separates fatty acid columns and immune feature columns, converts them to numeric matrices, and calculates correlation coefficients and p values using Hmisc::rcorr. Multiple testing correction is performed using the false discovery rate method. Significant correlations are annotated, and a heatmap is generated using pheatmap to visualize immune–metabolic associations in the full dataset.
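The correlation-plus-FDR step can be reproduced in outline with base R alone. The sketch below substitutes cor.test for Hmisc::rcorr and uses synthetic data; the Pearson coefficient, the significance cut-offs for the star labels, and all variable names are assumptions rather than settings taken from the scripts.

```r
set.seed(1)
n <- 20

# Synthetic stand-ins for the fatty acid and immune feature columns.
fatty_acids <- data.frame(FA_1 = rnorm(n), FA_2 = rnorm(n))
immune <- data.frame(
  score_1 = fatty_acids$FA_1 + rnorm(n, sd = 0.1),  # built to track FA_1
  score_2 = rnorm(n)
)

# Correlation coefficients: fatty acids in rows, immune features in columns.
r_mat <- cor(as.matrix(fatty_acids), as.matrix(immune))

# Matching matrix of raw p values, one test per pair.
p_mat <- r_mat
for (i in names(fatty_acids)) {
  for (j in names(immune)) {
    p_mat[i, j] <- cor.test(fatty_acids[[i]], immune[[j]])$p.value
  }
}

# Benjamini-Hochberg FDR correction across all tested pairs.
q_mat <- matrix(p.adjust(p_mat, method = "BH"),
                nrow = nrow(p_mat), dimnames = dimnames(p_mat))

# Significance labels that could be overlaid on heatmap cells.
stars <- ifelse(q_mat < 0.001, "***",
         ifelse(q_mat < 0.01, "**",
         ifelse(q_mat < 0.05, "*", "")))
print(round(r_mat, 2))
print(stars)
```

A label matrix such as stars can be passed to the display_numbers argument of pheatmap so that only the annotated cells carry a symbol.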
Immune–Metabolic Correlation Heatmap: Dyslipidemia Subgroup
This R script applies the immune–metabolic correlation workflow to the RAO dyslipidemia subgroup. It calculates correlations between fatty acid metabolites and immune features, marks statistical significance, handles missing values, and exports a dyslipidemia-specific correlation heatmap as a PDF file.
Transcriptomic Preprocessing and Differential Expression: RAO versus Control
This R script preprocesses transcriptomic expression data for the overall RAO-control comparison. It converts FPKM values to TPM, removes genes with all-zero expression, applies log2 transformation when necessary, and normalizes expression values using limma::normalizeBetweenArrays. Boxplots before and after normalization are generated for quality assessment. Differential expression analysis is performed using the standard limma workflow, and both full differential results and filtered significant genes are exported as CSV files.
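The FPKM-to-TPM conversion rescales each sample so that its values sum to one million: TPM = FPKM / sum(FPKM) × 10^6, computed per sample. The sketch below illustrates that step together with the all-zero filter on a toy matrix; the pseudocount of 1 in the log2 transform is an assumption, and the limma normalization that follows in the scripts is not reproduced here.

```r
# Toy FPKM matrix: genes in rows, samples in columns (illustrative values).
fpkm <- matrix(
  c(10,  20,
     0,   0,
    30,  60,
    60, 120),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("G1", "G2", "G3", "G4"), c("S1", "S2"))
)

fpkm_to_tpm <- function(fpkm) {
  # Divide each column by its own total, then scale to one million.
  sweep(fpkm, 2, colSums(fpkm), "/") * 1e6
}

tpm <- fpkm_to_tpm(fpkm)

# Remove genes with zero expression in every sample (G2 here).
tpm <- tpm[rowSums(tpm) > 0, , drop = FALSE]

# log2 transform; the +1 pseudocount is an assumption to avoid log2(0).
log_tpm <- log2(tpm + 1)

colSums(tpm)  # each sample sums to 1e6 after conversion
```

Because both toy samples have the same relative composition, their TPM columns are identical even though the FPKM totals differ by a factor of two.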
Volcano Plot: RAO versus Control
This code section visualizes differential expression results from the RAO-control comparison. Genes are classified as upregulated, downregulated, or not significant based on log2 fold-change and p-value thresholds. The plot includes threshold lines, color coding by class, selected gene labels using ggrepel, and high-resolution TIFF export.
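The three-way classification behind the volcano plot can be sketched as below. The thresholds (|log2FC| >= 1, p < 0.05) are placeholders, since this README does not state the values used in the scripts; the gene names and statistics are invented. The column names logFC and P.Value follow the output of limma::topTable.

```r
# Hypothetical thresholds; check the scripts for the values actually used.
lfc_cut <- 1
p_cut   <- 0.05

# Illustrative differential expression table (not study results).
deg <- data.frame(
  gene    = c("g1", "g2", "g3", "g4"),
  logFC   = c(2.1, -1.6, 0.3, 1.8),
  P.Value = c(0.001, 0.020, 0.004, 0.300)
)

# Default to "Not significant", then overwrite genes that pass both thresholds.
deg$status <- "Not significant"
deg$status[deg$P.Value < p_cut & deg$logFC >=  lfc_cut] <- "Up"
deg$status[deg$P.Value < p_cut & deg$logFC <= -lfc_cut] <- "Down"
deg$status <- factor(deg$status, levels = c("Up", "Down", "Not significant"))

table(deg$status)
```

A gene must pass both thresholds to be colored: g3 has a small p value but a small fold change, and g4 has a large fold change but a large p value, so both stay in the "Not significant" class.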
GO and KEGG Enrichment: RAO versus Control
This code section performs functional enrichment analysis for differentially expressed genes. Gene symbols are converted to Entrez IDs using org.Hs.eg.db. GO enrichment is performed for Biological Process, Cellular Component, and Molecular Function categories, and KEGG pathway enrichment is conducted using clusterProfiler::enrichKEGG. Enrichment results are exported as CSV files and visualized as bar plots. GO and KEGG plots are also combined into a single functional enrichment figure.
PCA: RAO versus Control
This code section performs principal component analysis based on normalized transcriptomic expression data. Samples are grouped as control or RAO, and PCA plots with group-specific colors and confidence ellipses are generated to visualize transcriptomic separation between groups.
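As a rough illustration of the PCA step, the sketch below uses base R's prcomp on a synthetic matrix in place of the FactoMineR and factoextra workflow listed under dependencies. The matrix dimensions, group labels, and the choice to center without scaling are assumptions.

```r
set.seed(1)

# Synthetic stand-in for a normalized expression matrix: 50 genes x 6 samples.
expr <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("S", 1:6)))
group <- factor(rep(c("Control", "RAO"), each = 3))

# prcomp() expects samples in rows, so the matrix is transposed.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component (for axis labels).
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Sample coordinates on the first two components, ready for plotting by group.
pca_scores <- data.frame(pca$x[, 1:2], group = group)
print(pca_scores)
round(var_explained[1:2], 3)
```

The pca_scores table has one row per sample, so it can be handed directly to a ggplot2 scatter plot with the group column mapped to color.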
CIBERSORT Immune Infiltration: RAO versus Control
This code section estimates immune cell proportions from transcriptomic data using the CIBERSORT package and the LM22 signature matrix. Duplicate gene symbols are merged by mean expression before analysis. CIBERSORT is run with 1,000 permutations, and immune cell fractions are exported as CSV files. Violin and box plots are generated to compare immune cell proportions between groups, with Kruskal–Wallis tests used for statistical comparison.
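Two of the steps above, merging duplicate gene symbols by mean expression and the Kruskal–Wallis comparison, can be illustrated in base R without running CIBERSORT itself. The gene symbols, sample names, and fraction values below are synthetic, and the CIBERSORT call with the LM22 matrix is deliberately left out.

```r
# Toy expression table in which GENE_A appears twice (illustrative values).
expr <- data.frame(
  symbol = c("GENE_A", "GENE_A", "GENE_B"),
  S1 = c(2, 4, 10),
  S2 = c(6, 8, 20)
)

# Average rows that share a symbol: rowsum() adds them up per symbol,
# then each row is divided by the number of rows that were merged.
sums   <- rowsum(as.matrix(expr[, -1]), group = expr$symbol)
counts <- as.vector(table(expr$symbol)[rownames(sums)])
merged <- sums / counts
print(merged)

# Synthetic fractions for one unnamed cell type in two groups.
fractions <- data.frame(
  group    = factor(rep(c("Control", "RAO"), each = 5)),
  fraction = c(0.10, 0.12, 0.11, 0.13, 0.09, 0.20, 0.22, 0.19, 0.25, 0.21)
)

# Kruskal-Wallis rank test comparing the fraction between groups.
kw <- kruskal.test(fraction ~ group, data = fractions)
kw$p.value
```

With only two groups the Kruskal–Wallis test is the rank-based equivalent of a two-sided Wilcoxon rank-sum test, so the same call covers both the overall and subgroup comparisons.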
Transcriptomic Preprocessing and Differential Expression: Dyslipidemia Subgroup
This R script performs transcriptomic preprocessing and differential expression analysis for RAO patients stratified by dyslipidemia status. It converts FPKM values to TPM, removes all-zero genes, applies log transformation and normalization, defines non-dyslipidemia and dyslipidemia groups, and uses limma to identify differentially expressed genes. Full and filtered differential expression tables are exported for downstream analysis.
Volcano Plot: Dyslipidemia Subgroup
This code section generates a volcano plot for the dyslipidemia subgroup comparison. Genes are classified according to significance and log2 fold-change thresholds. The plot uses red, blue, and gray color coding, threshold lines, selected gene labels, and high-resolution TIFF output.
GO and KEGG Enrichment: Dyslipidemia Subgroup
This code section performs GO and KEGG enrichment analyses for genes differentially expressed between RAO patients with and without dyslipidemia. GO terms are summarized across Biological Process, Cellular Component, and Molecular Function categories. KEGG pathways are ranked by gene count. Separate and combined enrichment plots are generated as high-resolution TIFF files.
PCA: Dyslipidemia Subgroup
This code section conducts PCA using transcriptomic profiles of RAO patients with or without dyslipidemia. Samples are colored according to lipid status, and group ellipses are added to visualize transcriptomic differences between dyslipidemia and non-dyslipidemia groups.
CIBERSORT Immune Infiltration: Dyslipidemia Subgroup
This code section estimates immune cell fractions in RAO patients stratified by dyslipidemia status. After merging duplicate genes and preparing the CIBERSORT input matrix, immune infiltration proportions are calculated using the LM22 signature file. The resulting immune cell profiles are visualized with violin and box plots, and between-group comparisons are annotated using Kruskal–Wallis tests.
Main Outputs
· Random forest feature importance plots based on MeanDecreaseAccuracy and MeanDecreaseGini.
· SHAP importance plots from SVM, XGBoost, and LightGBM models.
· ROC curves and AUC values for fatty acid variables.
· Immune characteristic score tables derived from transcriptomic gene sets.
· Correlation heatmaps between fatty acid metabolites and immune characteristics.
· TPM-normalized transcriptomic expression matrices.
· Differential gene expression result tables for overall and subgroup analyses.
· Volcano plots of differentially expressed genes.
· GO and KEGG enrichment result tables and plots.
· PCA plots of transcriptomic profiles.
· CIBERSORT immune infiltration result tables and immune cell comparison plots.
Notes
· Missing values are indicated as "NA".
· Gene expression values are provided as FPKM in the original transcriptomic files and converted to TPM during preprocessing.
· Metabolite concentrations are expressed in µmol/L; some data tables label the same unit as µM.
· All code files are written in R and are provided to ensure reproducibility of statistical analyses, machine learning models, and figure generation.
· File paths in scripts should be modified according to the local working directory before running the code.
Other publicly accessible locations of the data: Not Applicable
Data was derived from the following sources: Not Applicable
| Project / Module | Main Input Files | Main Outputs |
|---|---|---|
| Random Forest | Only fatty acids.csv; RAO Only fatty acids.csv | MeanDecreaseAccuracy and MeanDecreaseGini feature-importance plots |
| SVM | Only fatty acids.csv; RAO Only fatty acids.csv | Confusion matrices and SVM SHAP importance plots |
| XGBoost | Only fatty acids.csv; RAO Only fatty acids.csv | XGBoost feature rankings and SHAP importance plots |
| LightGBM | Only fatty acids.csv; RAO Only fatty acids.csv | Test-set accuracy and LightGBM SHAP importance plots |
| ROC Analysis | Only fatty acids.csv; RAO Only fatty acids.csv | ROC curves and AUC values |
| Immune Feature Scoring | Immune inflammation gene set.xlsx; gene.expression files | Immune characteristic score result tables |
| Immune-Metabolic Correlation | Combination of metabolome and immune characteristics files | Correlation heatmaps between fatty acids and immune features |
| Transcriptomic Differential Analysis | gene.expression.csv; gene.expression dyslipidemia.csv | TPM matrices, DEG tables, volcano plots |
| Functional Enrichment | Differentially expressed gene lists | GO and KEGG enrichment tables and plots |
| PCA | Normalized transcriptomic expression matrices | PCA plots |
| CIBERSORT | Transcriptomic expression matrices; LM22 signature matrix | Immune cell proportion tables and comparison plots |
Human subjects data
De-identification statement
The human-derived omics data included in this dataset were de-identified before sharing. All direct personal identifiers, including names, hospital identification numbers, contact information, and other individually identifying information, were removed. Samples were assigned anonymized study IDs, and no key linking these IDs to individual participants is included in the shared files. Clinical and omics data are provided only in a de-identified format to minimize the risk of participant identification.
