Data from: Improving the robustness of phylogenetic independent contrasts: Addressing abrupt evolutionary shifts with outlier- and distribution-guided correlation
Data files
Mar 17, 2026 version files 350.65 MB
- RawData.zip (350.53 MB)
- README.md (7.89 KB)
- Supplementary_Tables.xlsx (120.02 KB)
Abstract
This dataset provides a comprehensive resource for evaluating phylogenetic comparative methods under diverse evolutionary scenarios. The dataset includes: (1) simulated phylogenetic trees (fixed, fully balanced trees and randomly generated trees); (2) trait data for 16, 128, and 256 species, incorporating both gradual evolution and abrupt evolutionary shifts; (3) statistical outputs from multiple phylogenetic comparative methods, including PIC-OGC, PIC-MM, and other robust regression models; and (4) benchmark results for detecting trait correlations under varying degrees of phylogenetic autocorrelation and noise. This dataset enables researchers to explore the impact of evolutionary shifts on trait correlation analysis, compare the performance of phylogenetic methods, and validate novel approaches for handling outliers and non-normal data distributions.
https://doi.org/10.5061/dryad.8w9ghx3xp
Description of the data and file structure
This dataset was generated to evaluate and enhance phylogenetic comparative methods, particularly in handling abrupt evolutionary shifts in trait data. The study aimed to develop and test the Outlier- and Distribution-Guided Correlation (OGC) method, comparing its performance with other phylogenetic methods such as PIC-MM and PGLS under diverse evolutionary scenarios. Simulated phylogenetic trees and trait data were created to reflect varying levels of noise, phylogenetic autocorrelation, and evolutionary shifts. This dataset supports robust statistical analyses and benchmarks for detecting true trait correlations and minimizing false positives in comparative studies.
Files and variables
File: Supplementary_Tables.xlsx
Description: This file includes Supplementary Tables S1–S28, which contain the data used to plot the figures presented in the main text and Supplementary materials.
Column descriptions for Supplementary_Tables.xlsx:
Table S1: ID - the index of the tip; X1 - value of trait X[1]; X2 - value of trait X[2]; label - labels identifying the two subtrees descending from the ancestral branch, or labels identifying whether a PIC falls on a shifted or an unshifted branch.
Table S2: label - the simulation scenario; Method - the phylogenetic comparative method; -3-7 - shift magnitude or error-term variance.
Table S3: Method - the phylogenetic comparative method; 1-7 - the depth of the shift location.
Table S4: PGLS_TR - PGLS accuracy on the second benchmark (Data-Pattern–Based); PGLS_NR - PGLS accuracy on the first benchmark (Structural-Coupling–Based).
Table S5-S7: Coefficient - the slope (or correlation coefficient); p - the p-value of the coefficient.
Table S8: dataset - the dataset analyzed; sample_size - the sample size of the dataset.
Table S10-S19: the columns are identical to those in Table S2.
Table S20: the columns are identical to those in Table S3.
Table S21-S28: the columns are identical to those in Table S10-S19.
Measurement units: the simulated data have no measurement units. Units apply only to the empirical data in Tables S5, S6, and S7.
Table S5: PIC_log_GS - 'GB'; PIC_log_PS - 'cm'
Table S6: PIC_SO and PIC_FG are both rates.
Table S7: PIC_BodyMass - 'kg'; PIC_PopDensity - 'ind./km²'
File: RawData.zip
Description: The zip file contains the raw data for the simulated phylogenetic analyses, including trait correlation data for 16-species fixed tree, 128-species fixed tree, 128-species random trees and 256-species fixed tree across shift and non-shift scenarios and the results of shifts in different locations.
For RawData.zip - shiftlevel/: this directory contains two sub-directories, Dualtrait_shifts and Singletrait_shifts, which hold the results for the two shift scenarios in the main-text section "Sensitivity of Methods to the Phylogenetic Location of Evolutionary Shifts". Each file X.csv holds the results for a shift at depth X.
For RawData.zip - all .csv files:
XXX_est - the regression coefficient of method XXX.
XXX_p - the p-value of the regression coefficient of method XXX.
For example, BM_xy_p is the p-value of the PGLS (BM) model with regression formula "X1~X2", BM_yx_p is the p-value of the PGLS (BM) model with regression formula "X2~X1", and PIC_L1_xy_p is the p-value of PIC-L1 with regression formula "X1~X2".
X1_lambda - phylogenetic signal (lambda) of X1; X2_lambda - phylogenetic signal (lambda) of X2; X1_K - phylogenetic signal (K) of X1; X2_K - phylogenetic signal (K) of X2.
change_pearson_est - the Pearson correlation coefficient of trait changes across branches.
change_pearson_p - the p-value of the Pearson correlation coefficient of trait changes across branches.
change_spearman_est - the Spearman correlation coefficient of trait changes across branches.
change_spearman_p - the p-value of the Spearman correlation coefficient of trait changes across branches.
change_final_est - the correlation coefficient of trait changes across branches after selecting between Pearson and Spearman.
change_final_p - the p-value of the correlation of trait changes across branches after selecting between Pearson and Spearman.
Some variables pertain to regression models, such as AIC values (OUrandom_xy_aic) and likelihood values (OUrandom_xy_loglik...), or are intermediate variables (X1_lambda_p,...). These variables were not used in the final analysis, so their definitions are not provided.
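As a reading aid, the column-naming convention described above can be unpacked programmatically. The Python sketch below is illustrative only (the dataset's own scripts are written in R). The helper `parse_column` and its regular expression are assumptions made for this example, and it handles only names that follow the `XXX_xy_est` / `XXX_xy_p` pattern; other columns return `None`.

```python
import re

# Hypothetical helper, not part of the dataset's code.
# "xy" is read as the formula X1~X2 and "yx" as X2~X1, as described above.
_COLUMN_RE = re.compile(r"^(?P<method>.+)_(?P<direction>xy|yx)_(?P<stat>est|p)$")

FORMULAS = {"xy": "X1~X2", "yx": "X2~X1"}


def parse_column(name):
    """Return (method, formula, statistic), or None if the name does not fit."""
    match = _COLUMN_RE.match(name)
    if match is None:
        return None
    return (
        match.group("method"),
        FORMULAS[match.group("direction")],
        match.group("stat"),
    )


if __name__ == "__main__":
    for column in ["BM_xy_p", "BM_yx_p", "PIC_L1_xy_p", "X1_lambda"]:
        print(column, "->", parse_column(column))
```

Columns such as `X1_lambda` or `OUrandom_xy_aic` do not match the pattern, so a caller can use the `None` result to separate regression outputs from signal statistics and unused intermediates.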
Code/software
The scripts provided in this dataset were used to generate the results presented in the main text. These scripts cover all major steps of data processing and analysis, including phylogenetic tree construction, trait data simulation, outlier detection, and regression model fitting. All scripts are bundled in Code.zip. Below is a detailed description of the scripts, the required software, and the workflow:
Software Requirements:
- R: Version 4.1.3.
- R Packages: The following R packages need to be installed and loaded:
  - ape (for trait simulation, PIC calculation, and Corphylo model fitting)
  - phytools (for phylogenetic tree construction and manipulation)
  - phylolm (for PGLS regression)
  - MCMCglmm (for Bayesian phylogenetic modeling)
  - ROBRT (for robust phylogenetic regression)
Scripts Overview:
- Phylogenetic Tree Construction:
  - 01Simulation_BalancedTree.R: Generates balanced tree topology.
  - 01Simulation_RandomTree.R: Creates random tree topologies.
- Traits Simulation:
  - 02Simulation_BM1&BM1+BM2.R: Simulates traits of "BM1&BM1+BM2" scenario.
  - 02Simulation_BM&BM+Norm.R: Simulates traits of "BM&BM+Norm" scenario.
  - 02Simulation_Norm&Norm+BM.R: Simulates traits of "Norm&Norm+BM" scenario.
  - 02Simulation_ShiftsinBothTraits_fixedlocation.R: Simulates traits of "shifts in both traits" scenario in root branch.
  - 02Simulation_ShiftsinOneTrait_fixedlocation.R: Simulates traits of "shifts only in one trait" scenario in root branch.
  - 02Simulation_ShiftsinBothTraits_randomlocation.R: Simulates traits of "shifts in both traits" scenario in random branch.
  - 02Simulation_ShiftsinOneTrait_randomlocation.R: Simulates traits of "shifts only in one trait" scenario in random branch.
- Outlier Detection:
  - CalculateIQR.R: Detects outliers in trait data using the IQR method.
- Regression Model Fitting:
  - 03Analysis_01_PIC.R: Implements Phylogenetic Independent Contrasts (PIC) for trait correlation analysis.
  - 03Analysis_02_PGLS_RPR.R: Performs Phylogenetic Generalized Least Squares (PGLS) regression and robust regression.
  - 03Analysis_03_Corphylo.R: Applies the Corphylo method for trait correlation analysis.
  - 03Analysis_04_MRPMM.R: Fits models using MRPMM.
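For readers unfamiliar with IQR-based outlier detection, the following Python sketch shows the general idea. It is not a port of CalculateIQR.R: the fence multiplier `k=1.5` (Tukey's conventional value) and the quartile definition are assumptions, since this description does not state which ones the script uses, and the function name `iqr_outlier_flags` is hypothetical.

```python
import statistics


def iqr_outlier_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is an assumed default, not a value taken from the dataset.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [value < lower or value > upper for value in values]


if __name__ == "__main__":
    trait = [1, 2, 3, 4, 5, 100]
    print(iqr_outlier_flags(trait))
```

Different quartile definitions can move the fences slightly for small samples, so flags near the boundary may differ from those produced by the R script.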
Workflow:
- Tree Construction and Trait Simulation:
  - Run 01Simulation_BalancedTree.R and 01Simulation_RandomTree.R to generate phylogenetic trees and associated trait data.
  - Run 02Simulation_BM1&BM1+BM2.R, 02Simulation_BM&BM+Norm.R, and 02Simulation_Norm&Norm+BM.R for trait simulation without jumps.
  - Run the 02Simulation_ShiftsinBothTraits and 02Simulation_ShiftsinOneTrait scripts (the _fixedlocation.R or _randomlocation.R variants listed above) for trait simulation with jumps.
- Outlier Detection:
  - Use CalculateIQR.R to identify outliers in trait data. Results are saved as CSV files with labeled outlier points.
- Regression Analysis:
  - Execute 03Analysis_01_PIC.R, 03Analysis_02_PGLS_RPR.R, 03Analysis_03_Corphylo.R, and 03Analysis_04_MRPMM.R sequentially to fit regression models.
Phylogenetic tree simulation: Two types of phylogenetic trees were simulated: balanced trees with fixed topologies, and random trees generated under a coalescent model. The random trees introduced variability in branching rates to reflect diverse phylogenetic scenarios. Tree sizes were 16, 128, and 256 species.
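To make the "fixed, fully balanced" topology concrete, here is a minimal Python sketch that writes such a tree as a Newick string. It is an illustration only, not the procedure in 01Simulation_BalancedTree.R: the function name `balanced_newick`, the tip labels `t1..tN`, and the equal branch lengths are all assumptions.

```python
def balanced_newick(n_tips, branch_length=1.0):
    """Return a Newick string for a fully balanced binary tree.

    Assumes n_tips is a power of two and every branch has the same length.
    """
    if n_tips < 2 or n_tips & (n_tips - 1):
        raise ValueError("n_tips must be a power of two and at least 2")
    # Start from the tips and repeatedly join neighbouring clades in pairs.
    clades = [f"t{i}" for i in range(1, n_tips + 1)]
    while len(clades) > 1:
        clades = [
            f"({left}:{branch_length},{right}:{branch_length})"
            for left, right in zip(clades[0::2], clades[1::2])
        ]
    return clades[0] + ";"


if __name__ == "__main__":
    print(balanced_newick(4))
```

The same call with 16, 128, or 256 tips gives trees of the sizes used in the dataset, and the resulting string can be read by any Newick parser.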
Trait data simulation: Trait data were generated under both Brownian motion (BM) and abrupt evolutionary shift scenarios. For abrupt shifts, two traits (X1 and X2) were simulated with independent evolution except for a significant shift at the root branch. Gradual evolution data were simulated under BM with varying levels of noise.
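The abrupt-shift scenario can be pictured with a small Python sketch: Brownian motion increments are drawn independently on every branch of a fully balanced tree, and a single jump is added on one branch at the root. This is a simplified illustration under stated assumptions, not the code in the 02Simulation scripts. In particular, placing the jump on the branch leading to the first of the two root subtrees, the equal branch lengths, and the name `simulate_trait` are all assumptions.

```python
import math
import random


def simulate_trait(depth, sigma=1.0, shift=0.0, branch_length=1.0, rng=None):
    """Simulate one trait on a fully balanced tree with 2**depth tips.

    Assumption: the abrupt shift is added once, on the branch leading to the
    first of the two subtrees below the root.
    """
    rng = rng or random.Random()
    step_sd = sigma * math.sqrt(branch_length)  # SD of the BM increment per branch
    values = [0.0]  # root state
    for level in range(depth):
        next_values = []
        for parent in values:
            for child in range(2):
                value = parent + rng.gauss(0.0, step_sd)
                if level == 0 and child == 0:
                    value += shift  # jump on one root branch only
                next_values.append(value)
        values = next_values
    return values  # tip values in tree order


if __name__ == "__main__":
    rng = random.Random(1)
    # Two traits that evolve independently but share a jump on the same branch.
    x1 = simulate_trait(4, shift=10.0, rng=rng)
    x2 = simulate_trait(4, shift=10.0, rng=rng)
    print(len(x1), len(x2))
```

Because both traits jump on the same branch, the tips split into two clusters, which can produce a strong apparent correlation even though the within-subtree evolution is independent.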
Statistical analysis: Multiple phylogenetic comparative methods were applied to the datasets, including:
- PIC-OGC: A hybrid framework integrating Pearson and Spearman correlations to handle outliers and non-normal data.
- Robust regression methods (PIC-MM, PIC-L1, etc.).
- PGLS models optimized across evolutionary scenarios (BM, λ, OU fixed/random, EB).
- PGLMM: Phylogenetic generalized linear mixed model.
- MR-PMM: Multi-response phylogenetic mixed model.
- Benchmarks for evaluating true and spurious correlations were constructed using simulation parameters.
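The building blocks behind the PIC-based methods listed above can be sketched in a few lines of Python: Felsenstein's standardized contrasts, followed by Pearson and Spearman correlations of those contrasts. This is a simplified illustration, not the dataset's R implementation. It assumes a fully balanced tree with equal branch lengths and tips given in tree order, it uses an ordinary centered Pearson correlation (contrast analyses are often forced through the origin instead), and it does not reproduce the rule PIC-OGC uses to choose between Pearson and Spearman, which is not described here. All function names are hypothetical.

```python
import math


def balanced_pic(tips, branch_length=1.0):
    """Standardized contrasts for a fully balanced tree with equal branch lengths.

    `tips` lists tip values in tree order (neighbours are sisters); its length
    must be a power of two.
    """
    values = list(tips)
    lengths = [branch_length] * len(values)
    contrasts = []
    while len(values) > 1:
        next_values, next_lengths = [], []
        for i in range(0, len(values), 2):
            xi, xj = values[i], values[i + 1]
            vi, vj = lengths[i], lengths[i + 1]
            contrasts.append((xi - xj) / math.sqrt(vi + vj))  # standardized contrast
            next_values.append((xi / vi + xj / vj) / (1 / vi + 1 / vj))  # ancestral estimate
            next_lengths.append(branch_length + vi * vj / (vi + vj))  # lengthened parent branch
        values, lengths = next_values, next_lengths
    return contrasts


def pearson(x, y):
    """Ordinary (centered) Pearson correlation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For two traits measured on the same tips, `pearson(balanced_pic(x1), balanced_pic(x2))` and `spearman(balanced_pic(x1), balanced_pic(x2))` give the two candidate coefficients. A single large contrast on a shifted branch tends to dominate the Pearson value far more than the rank-based Spearman value.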
Data processing: All simulations and analyses were conducted using R (version 4.1.3). Packages including phytools, ape, phylolm, ROBRT, and MCMCglmm were employed for tree generation, PIC calculation, and statistical modeling. The dataset was pre-processed to include raw and processed outputs for reproducibility and ease of use.
