Data from: Impact of model violations on the inference of species boundaries under the multispecies coalescent

Barley, Anthony J.; Brown, Jeremy M.; Thomson, Robert C.

Published Sep 01, 2017 on Dryad. https://doi.org/10.5061/dryad.h6s2k

Data files

Sep 01, 2017 version files 105.28 KB

Abstract

The use of genetic data for identifying species-level lineages across the tree of life has received increasing attention in the field of systematics over the past decade. The multispecies coalescent model provides a framework for understanding the process of lineage divergence and has become widely adopted for delimiting species. However, because these studies lack an explicit assessment of model fit, in many cases the accuracy of the inferred species boundaries is unknown. This is concerning given the large amount of empirical data and theory that highlight the complexity of the speciation process. Here, we seek to fill this gap by using simulation to characterize the sensitivity of inference under the multispecies coalescent to several violations of model assumptions thought to be common in empirical data. We also assess the fit of the multispecies coalescent model to empirical data in the context of species delimitation. Our results show substantial variation in model fit across datasets. Posterior predictive tests find the poorest model performance in datasets that were hypothesized to be impacted by model violations. We also show that while inferences assuming the multispecies coalescent are robust to minor model violations, such inferences can be biased under some biologically plausible scenarios. Taken together, these results suggest that researchers can identify individual datasets in which species delimitation under the multispecies coalescent is likely to be problematic, thereby highlighting the cases where additional lines of evidence to identify species boundaries are particularly important to collect. Our study supports a growing body of work highlighting the importance of model checking in phylogenetics, and the usefulness of tailoring tests of model fit to assess the reliability of particular inferences.

Simulation in CoMuS for BPP

This Python script can be used to simulate datasets under the multispecies coalescent using CoMuS and to set up input files for analyzing these datasets in BPP. Usage instructions are contained in the header of the file.

Comus_BPP.py

Simulation in CoMuS for STACEY

This Python script can be used to simulate datasets under the multispecies coalescent using CoMuS and to set up input files for analyzing these datasets in STACEY. Usage instructions are contained in the header of the file.

Comus_stacey.py

Simulation in ms for BPP

This Python script can be used to simulate coalescent genealogies and sequence datasets using ms, and then to set up input files for analyzing these datasets in BPP. Usage instructions are contained in the header of the file.

msSimulations_BPP.py

Simulation in ms for STACEY

This Python script can be used to simulate coalescent genealogies and sequence datasets using ms, and then to set up input files for analyzing these datasets in STACEY. Usage instructions are contained in the header of the file.

msSimulations_stacey.py

Posterior prediction in BPP

This Python script can be used to perform posterior prediction by sampling the posterior of a BPP analysis, simulating new datasets using McCoal, and setting up input files for analyzing the simulated datasets in BPP. Usage instructions are contained in the header of the file.

posteriorprediction_BPP.py
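
The first step described above, drawing samples from the posterior, can be illustrated with a minimal sketch. This is not the code in posteriorprediction_BPP.py. It assumes the MCMC samples have already been parsed into a Python list, and the function name `thin_posterior`, the burn-in fraction, and the number of draws are all hypothetical choices made for illustration.

```python
def thin_posterior(samples, n_draws, burnin_frac=0.1):
    """Return n_draws evenly spaced samples after discarding burn-in.

    samples     : list of posterior samples in MCMC order (assumed format)
    n_draws     : number of samples to keep for posterior prediction
    burnin_frac : fraction of the chain to discard from the start (assumed)
    """
    if not 0 <= burnin_frac < 1:
        raise ValueError("burnin_frac must be in [0, 1)")
    start = int(len(samples) * burnin_frac)
    kept = samples[start:]
    if n_draws < 1 or n_draws > len(kept):
        raise ValueError("n_draws must be between 1 and the number of post-burn-in samples")
    # Evenly spaced indices across the retained part of the chain.
    step = len(kept) / n_draws
    return [kept[int(i * step)] for i in range(n_draws)]
```

Each retained sample would then supply the parameter values for one simulated dataset. Even spacing is only one option; random draws without replacement would serve the same purpose.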

Posterior predictive test statistics

This Python script can be used to calculate several posterior predictive test statistics using the output from a BPP empirical analysis and the associated posterior predictive BPP analyses. Usage instructions are contained in the header of the file.

PPDTestStatistics.py
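
As a general illustration of how an empirical test statistic is compared with its posterior predictive distribution, the sketch below computes a two-tailed posterior predictive p-value. It is not taken from PPDTestStatistics.py: the function name, the input format (one number per analysis), and the two-tailed convention are assumptions, and the statistics actually used are those documented in the script header.

```python
def posterior_predictive_pvalue(empirical, predictive):
    """Two-tailed posterior predictive p-value for one test statistic.

    empirical  : statistic computed from the empirical analysis
    predictive : statistics computed from the posterior predictive analyses
    """
    if not predictive:
        raise ValueError("predictive must contain at least one value")
    n = len(predictive)
    # Proportion of simulated values at or below / at or above the empirical value.
    lower = sum(1 for x in predictive if x <= empirical) / n
    upper = sum(1 for x in predictive if x >= empirical) / n
    # Double the smaller tail and cap at 1 (an assumed convention).
    return min(1.0, 2 * min(lower, upper))
```

A small value indicates that the empirical statistic falls in a tail of the simulated distribution, which is the pattern expected when the model fits the data poorly for that statistic.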

IBDSim Settings

This text file contains the parameters used for the simulations performed in IBDSim.

IbdSettings_file.txt

Supplementary Table 1

This table shows summaries of the dataset characteristics for eight empirical datasets analyzed using posterior prediction.

TableS1.pdf

Supplementary Table 2

This table summarizes estimates of the number of species, divergence times, and population sizes for the empirical data (posterior) and across all posterior predictive datasets (PPD) for the seven datasets analyzed in BPP.

TableS2.pdf