Multidimensional scaling informed by F-statistic: Visualizing microbiome for inference
Data files
May 07, 2025 version files (4.37 MB)
- Data.zip (918.23 KB)
- Figure_codes.zip (35.86 KB)
- README.md (7.99 KB)
- source.zip (3.41 MB)
Oct 13, 2025 version files (11.91 MB)
- Data.zip (8.40 MB)
- Figure_codes.zip (92.19 KB)
- README.md (9.61 KB)
- source.zip (3.41 MB)
Abstract
Multidimensional scaling (MDS) is a widely used dimensionality reduction technique in microbial ecology data analysis that captures the multivariate structure of the data while preserving pairwise distances between samples. While improvements in MDS have enhanced the ability to reveal group-specific data patterns, these MDS-based methods require prior assumptions for inference, limiting their application in general microbiome analysis. In this study, we introduce a new MDS-based ordination method, "F-informed MDS," which configures the data distribution based on the F-statistic, the ratio of dispersion between groups sharing common and different characteristics. Using semisynthetic datasets, we demonstrate that the proposed method is robust to hyperparameter selection while maintaining statistical significance throughout the ordination process. Various quality metrics for evaluating dimensionality reduction confirm that F-informed MDS is comparable to state-of-the-art methods in preserving both local and global data structures. Its application to a diatom-associated bacterial community suggests the role of this new method in interpreting the community’s response to the host. Our approach offers a well-founded refinement of MDS that aligns with statistical test results, which can be beneficial for broader multidimensional data analyses in microbiology and ecology. This new visualization tool can be incorporated into standard microbiome data analyses.
- Software: https://bioconductor.org/packages/FinfoMDS
- File or folder names are italicized. Package or variable names are monospaced.
File: Data.zip
Description: Raw data used in this study. Includes 4 folders and 4 files (see below).
- Folder Simulated
  - Contains pairwise distances and ordination results. Includes 6 subfolders and 20 files (see below).
  - Folder F-MDS contains the training log by epoch (folder TrainingLog) and the resulting representations `Z` (folder Results).
    - File names inside the folder are formatted as "sim_rev_{x}-N{n}-{method}-{param}-{type}.csv". The formatting rules are described in the table below.
    - Each "-Z.csv" file has one row per sample and one column per coordinate of the 2D representation.
    - Each "-log.csv" file has one row per training epoch. Column `obj` is the value of the F-MDS objective: the sum of `obj_mds`, the stress term (metric MDS), and `obj-confr`, the confirmatory term weighted by the hyperparameter `lambda`. Columns `p-z` and `p-0` are the P values computed from the data distributions in the 2D representation and in the original structure, respectively.
  - Folders Isomap, superMDS, t-SNE, UMAP-S, and UMAP-U contain the trained `Z` from each ordination method.
  - 5 files ending with "-Y.csv" are the response vectors for each training dataset.
    - They follow the same naming rules as the "-Z.csv" files.
    - Response vectors are the same for all replicates, so their names do not include the replicate number.
  - 15 files ending with "-data.Rds" are the training datasets.
    - They follow the same naming rules as the "-Z.csv" files.
- Folder Alga
  - Ordination results using microbiome data sampled from a laboratory algal mesocosm [1].
  - The microbiome dataset can be found in the GitHub package `FinfoMDS` (https://github.com/soob-kim/FinfoMDS).
  - Contains 8 folders for different ordination methods.
  - File names are formatted similarly to those described for the folder Simulated.
- Folder HumanGut
  - Contains microbiome data sampled from the human gut [2-4] and analysis results.
  - The data were retrieved from a publicly available repository [5] and were previously merged at the genus level [6].
  - Requires the R package `phyloseq`.
  - File names start with either "cirrhosis-" or "t2d-", indicating the microbiome dataset sampled from patients diagnosed with liver cirrhosis or type 2 diabetes, respectively.
    - "-mds.csv" files are the ordination results using MDS.
    - "-phyloseq.rds" files are the phyloseq objects of each community.
    - "-Y.csv" files are the data labels ("1" for patients and "2" for controls).
- Folder Ternary
  - Contains 8 files of simulated ternary data (Fig 7 of main text).
  - File names are formatted similarly to those described for the folder Simulated.
- File simulated.R
  - Generates binary simulation datasets using SparseDOSSA2 [7] and performs F-MDS.
  - Requires the R packages `vegan`, `SparseDOSSA2`, `parallel`, and `FinfoMDS`.
- File alga.R
  - Performs F-MDS using the algal microbiome dataset.
  - Requires the R packages `parallel` and `FinfoMDS`.
- File ternary.R
  - Generates a ternary simulation dataset using SparseDOSSA2 [7] and performs F-MDS.
  - Requires the R packages `vegan`, `SparseDOSSA2`, and `FinfoMDS`.
- File ordinations.R
  - Produces a 2D representation of a dataset using ordination methods.
  - Requires the R packages `vegan`, `tsne`, and `uwot`.
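The relationship among the "-log.csv" columns described for the folder F-MDS can be checked with a short script. The following is a minimal sketch in Python, not part of the deposited code. It assumes that the columns are named `obj`, `obj_mds`, and `obj-confr` as listed above, that the confirmatory term is stored unweighted, and that `lambda` is supplied by the user. The helper name `check_objective` and the toy values are hypothetical.

```python
import csv
import io
import math


def check_objective(rows, lam, tol=1e-8):
    """Return 0-based row indices where obj != obj_mds + lam * obj-confr."""
    bad = []
    for i, row in enumerate(rows):
        expected = float(row["obj_mds"]) + lam * float(row["obj-confr"])
        if not math.isclose(float(row["obj"]), expected, rel_tol=tol, abs_tol=tol):
            bad.append(i)
    return bad


# Toy log with made-up values (not taken from the dataset).
toy_log = io.StringIO(
    "obj,obj_mds,obj-confr\n"
    "3.0,2.0,2.0\n"
    "2.5,1.5,2.0\n"
)
rows = list(csv.DictReader(toy_log))

# An empty list means every epoch satisfies the assumed relationship.
mismatches = check_objective(rows, lam=0.5)
print(mismatches)
```

If the confirmatory term in a log file turns out to be stored already weighted, calling the helper with `lam=1` checks the plain sum instead.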
Table: Formatting rules for data files named "sim_rev_{x}-N{n}-{method}-{param}-{type}.csv".
| Format | Description |
|---|---|
| {x} | Replicate number |
| {n} | Data size |
| {method} | Ordination method |
| {param} | Hyperparameter |
| {type} | Data type, e.g., Z, Y, log or data |
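Because several method names contain hyphens (e.g., F-MDS, t-SNE), splitting a file name on "-" does not recover the fields in the table. The following is a minimal sketch in Python of one way to parse the pattern, assuming the method is one of the six folder names under Simulated and the type is one of Z, Y, log, or data. The function name `parse_name` and the example file name are hypothetical, and the exact form of `{x}` and `{param}` is an assumption.

```python
import re

# Method names taken from the subfolders of the folder Simulated.
METHODS = ["F-MDS", "Isomap", "superMDS", "t-SNE", "UMAP-S", "UMAP-U"]

# Assumed layout: sim_rev_{x}-N{n}-{method}-{param}-{type}.csv (or .Rds)
PATTERN = re.compile(
    r"^sim_rev_(?P<x>[^-]+)-N(?P<n>\d+)-(?P<method>"
    + "|".join(re.escape(m) for m in METHODS)
    + r")-(?P<param>.+)-(?P<type>Z|Y|log|data)\.(?:csv|Rds)$"
)


def parse_name(filename):
    """Return the fields of a file name as a dict, or None if it does not match."""
    match = PATTERN.match(filename)
    return match.groupdict() if match else None


# Hypothetical file name for illustration only.
parsed = parse_name("sim_rev_1-N100-F-MDS-0.5-log.csv")
print(parsed)
```

Files that omit some fields, such as the "-Y.csv" files without a replicate number, do not match this pattern and return `None`; they would need a looser expression.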
File: Figure_codes.zip
Description: Codes written in R language for producing main figures and supplementary materials.
| Filename | Required package(s) | Description |
|---|---|---|
| fig_util.R | | Defines functions that are frequently used |
| fig2_rev.R | `ggh4x`, `ggplot2` | Main Figure 2 |
| fig3_rev.R | `dplyr`, `ggplot2`, `vegan` | Main Figure 3 |
| fig4_rev.R | `dplyr`, `dreval`, `ggplot2`, `vegan` | Main Figure 4 |
| fig5_rev.R | `dreval`, `ggplot2` | Main Figure 5 |
| fig6_rev.R | `ggplot2` | Main Figure 6 |
| fig7_rev.R | `ggplot2` | Main Figure 7 |
| S1 Fig_rev.R | `dplyr`, `ggnewscale`, `tidyr`, `vegan` | Supplementary Figure S1 |
| S2 Fig_rev.R | `ggplot2`, `lemon`, `tidyr` | Supplementary Figure S2 |
| S3 Fig_rev.R | `dplyr`, `ggplot2`, `vegan` | Supplementary Figure S3 |
| S4 Fig_rev.R | `cowplot`, `ggforce`, `ggplot2`, `vegan` | Supplementary Figure S4 |
| S5 Fig_rev.R | `dplyr`, `dreval`, `ggplot2`, `vegan` | Supplementary Figure S5 |
| S6 Fig_rev.R | | Supplementary Figure S6 |
| S7 Fig_rev.R | `scales` | Supplementary Figure S7 |
| S8 Fig_rev.R | `cowplot`, `dplyr`, `ggplot2`, `scales` | Supplementary Figure S8 |
| S9 Fig_rev.R | `dplyr`, `ggnewscale`, `ggplot2`, `tidyr`, `vegan` | Supplementary Figure S9 |
| S4 Appendix.R | `ape`, `FinfoMDS`, `phyloseq` | Figures for Appendix S4 |
File: source.zip
Description: Cloned repositories or packages from other work. Includes one subfolder.
- Package `PopPhy` (Reiman et al., 2020) provides a neural network model for classifying the algal microbiome. The package was modified in this study by combining it with the SimCLR framework (Chen et al., 2020), and contains three subfolders.
  - Folder src includes source code to construct the convolutional neural network as outlined in Reiman et al., 2020. Unless otherwise specified below, please refer to the authors' documentation.
    - File "config.py" specifies parameter settings for training and evaluating the model.
    - File "train.ipynb" outlines the overall process to load the data, train the model, and evaluate it. Contrastive learning is additionally implemented as a pretraining stage using the library `TensorFlow`.
    - Subfolder models contains three supporting Python class files to build the models. Additionally included in this work is the file "ContrastivePopPhy.py", which performs contrastive learning via SimCLR.
    - Subfolder utils contains seven supporting files that are used to perform the above. Additionally included is "image_augment.py", with functions that adjust data values for self-supervised pretraining.
  - Folder trees includes a reference file of a phylogenetic tree generated by `phyloT` (see Supplementary Note S3).
  - The algal microbiome dataset can alternatively be found in the GitHub package `FinfoMDS` (https://github.com/soob-kim/FinfoMDS).
References
[1] H. Kim, J. A. Kimbrel, C. A. Vaiana, J. R. Wollard, X. Mayali, and C. R. Buie. Bacterial response to spatial gradients of algal-derived nutrients in a porous microplate. The ISME Journal, 16(4):1036–1045, 2022.
[2] J. Qin, Y. Li, Z. Cai, S. Li, J. Zhu, F. Zhang et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature, 490(7418):55–60, 2012.
[3] N. Qin, F. Yang, A. Li, E. Prifti, Y. Chen, L. Shao et al. Alterations of the human gut microbiome in liver cirrhosis. Nature, 513(7516):59–64, 2014.
[4] F. H. Karlsson, V. Tremaroli, I. Nookaew, G. Bergstrom, C. J. Behre, B. Fagerberg, J. Nielsen, and F. Backhed. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature, 498(7452):99–103, 2013.
[5] E. Pasolli, D. T. Truong, F. Malik, L. Waldron, and N. Segata. Machine learning meta-analysis of large metagenomic datasets: Tools and biological insights. PLoS Computational Biology, 12(7):e1004977, 2016.
[6] D. Reiman, A. A. Metwally, J. Sun, and Y. Dai. PopPhy-CNN: A phylogenetic tree embedded architecture for convolutional neural networks to predict host phenotype from metagenomic data. IEEE Journal of Biomedical and Health Informatics, 24(10):2993–3001, 2020.
[7] S. Ma, B. Ren, H. Mallick, Y. S. Moon, E. Schwager, S. Maharjan, T. L. Tickle, Y. Lu, R. N. Carmody, E. A. Franzosa, L. Janson, and C. Huttenhower. A statistical model for describing and simulating microbial community profiles. PLoS Computational Biology, 17(9):e1008913, 2021.
Changes after May 7, 2025:
File: Data.zip
Folder Alga and files alga.R, simulated.R, ternary.R have newly been added. Folders Simulated and Ternary have been revised.
Newly added files/folders
- Folder Alga
  - Ordination results from the algal microbiome dataset (Kim et al., 2022).
  - The microbiome dataset can be obtained elsewhere, e.g., https://github.com/soob-kim/FinfoMDS
- File alga.R
- File simulated.R
- File ternary.R
Revised folders
- Folder Simulated
  - The previous 6 files have been replaced with 20 new files.
  - The replacement represents new simulation datasets with revised conditions, i.e., data size and dimension.
  - The previous folder `MDS` has been removed as it is not used in the revised manuscript version.
  - All other folders (`F-MDS`, `Isomap`, `superMDS`, `t-SNE`, `UMAP-S`, `UMAP-U`) contain newly replaced files after performing the ordinations with the new simulation datasets.
- Folder Ternary
  - The previous 5 files have been replaced with 8 new files.
  - The replacement represents new simulation datasets and F-MDS results with revised conditions, i.e., data size and dimension.
File: Figure_codes.zip
13 files have been revised. 2 files have been newly added. 1 file has been removed. 1 file has its name changed.
| Filename | Filename (old) | Change details |
|---|---|---|
| fig2_rev.R | Fig2.R | revised |
| fig3_rev.R | Fig3.R | revised |
| fig4_rev.R | Fig4.R | revised |
| fig5_rev.R | Fig5.R | revised |
| fig6_rev.R | Fig6.R | revised |
| fig7_rev.R | Fig7.R | revised |
| S1 Fig_rev.R | S1 Fig.R | revised |
| S2 Fig_rev.R | S2 Fig.R | revised |
| S3 Fig_rev.R | S3 Fig.R | revised |
| S4 Fig_rev.R | S4 Fig.R | revised |
| S5 Fig_rev.R | S5 Fig.R | revised |
| S6 Fig_rev.R | S6 Fig.R | revised |
| S7 Fig_rev.R | S7 Fig.R | revised |
| S8 Fig_rev.R | | added |
| S9 Fig_rev.R | | added |
| S4 Appendix.R | S3 Appendix.R | renamed |
| | S7 Fig.xlsx | removed |
