DNA-based species delimitation may be compromised by limited sampling effort and species rarity, including “singleton” representatives of species, which hampers estimates of intra- versus interspecies evolutionary processes. In a case study of southern African chafers (beetles in the family Scarabaeidae), many species and subclades were poorly represented and 48.5% of species were singletons. Using cox1 sequences from >500 specimens and ∼100 species, the Generalized Mixed Yule Coalescent (GMYC) analysis as well as various other approaches for DNA-based species delimitation (Automatic Barcode Gap Discovery (ABGD), Poisson tree processes (PTP), Species Identifier, Statistical Parsimony) frequently produced poor results when only a narrow target group was analyzed, but the performance improved when several subclades were combined. Hence, low sampling may be compensated for by “clade addition” of lineages outside of the focal group. Similar findings were obtained in reanalysis of published data sets of taxonomically poorly known species assemblages of insects from Madagascar. The low performance of undersampled trees is not due to high proportions of singletons per se, as shown in simulations (with 13%, 40% and 52% singletons). However, the GMYC method was highly sensitive to variable effective population size (Ne), which was exacerbated by variable species abundances in the simulations. Hence, low sampling success and rarity of species affect the power of the GMYC method only if they reflect great differences in Ne among species. Potential negative effects of skewed species abundances and prevalence of singletons are ultimately an issue of the variation in Ne and the degree to which this is correlated with the census population size and sampling success. Clade addition beyond a limited study group can overcome poor sampling for the GMYC method, in particular under variable Ne.
This effect was less pronounced for methods of species delimitation not based on coalescent models.
MS_Ahrensetal_SupplementFigure1
Supplementary Fig. 1. Map of collecting sites (numbers refer to Supplementary Table 1).
MS_Ahrensetal_SupplementFigure2
Supplementary Fig. 2. Ultrametric tree of the southern African Sericini species showing tip labels for each haplotype, branch support values (aLRT) as well as the principal clades analysed separately.
MS_Ahrensetal_SupplementFigure3
Supplementary Fig. 3. The fit of the GMYC model to cox1 data of the subclades (A, C, E, G, M, Q, R) and the complete Sericini data set (All). Top panels: LTT plot with GMYC single threshold time. Middle panels: likelihood surface and best solution of the GMYC model. Bottom panels: likelihood-time relationship.
MS_Ahrensetal_SupplementFigure4b
Supplementary Fig. 4. Match ratio of the cumulative GMYC subclade analysis on empirical data (Sericini) with respect to the number of sampled species, with alternative accumulation orders of subclades (from bottom to top and the inverse: sets 1 and 2) and the respective pLRT values.
MS_Ahrensetal_SupplementFigure5
Supplementary Fig. 5. Comparison of the performance of the subclades’ distance-based cluster analyses (A, C, E, G, M, Q, R) with that of the complete Sericini data set (All). X-axis: threshold divergence (%); Y-axis: number of species. Blue graph: estimated species number; pink graph: number of species matching the a priori species assignments.
MS_Ahrensetal_SupplementFigure6
Supplementary Fig. 6. Example of the simulated trees with increased species samples, distributed according to a log-normal distribution with mean 5.
MS_Ahrensetal_SupplementFigure7
Supplementary Fig. 7. Mean match ratio for the different numbers of sampled species under the random, clustered and clade-wise GMYC sampling simulations, for simulation schemes with constant sample size and Ne (sd = 0; cross), with variable sample size but constant Ne (simple line), with variable Ne but constant sample size (square), and with variable Ne and sample size (triangle), assuming a median proportion of singleton species of 13% (red), 40% (green) and 52% (orange) (sd = 1, 1.5 or 2, respectively).
MS_Ahrensetal_SupplementFigure8
Supplementary Fig. 8. Lumping and oversplitting behavior of the GMYC model in simulations in relation to the sampling bias: ratio of GMYC vs true species compared for the different sampling schemes for constant Ne and constant sample size (sd=0).
MS_Ahrensetal_SupplementFigure9
Supplementary Fig. 9. Relation of AIC confidence sets (GMYC) from simulations to the number of sampled species (above) and to the match ratio (GMYC entities vs. true species; below) in the framework of the different sampling schemes.
MS_Ahrensetal_SupplementTable1
Supplementary Table 1. Sampling site data as given for the localities; collection site numbers refer to the sites plotted in Supplementary Fig. 1.
MS_Ahrensetal_SupplementTable2
Supplementary Table 2. GenBank accession numbers for the data set, including voucher number (*numbers without “DA” refer to the BMNH voucher codes used at the NHM London), shortcut, and locality information.
MS_Ahrensetal_SupplementTable3
Supplementary Table 3. Model outputs of GMYC modeling with the empirical data: Likelihoods for null hypothesis (L0; i.e., no shift in branching rate) and GMYC (LGMYC) models, their likelihood ratio (LR) and its significance (pLRT, evaluated using a chi-square test with 3 degrees of freedom to compare GMYC and null hypothesis models), and the threshold genetic distance (T).
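The significance test described for Supplementary Table 3 can be illustrated with a minimal R sketch. It assumes that LR takes the usual form 2 × (LGMYC − L0) and that L0 and LGMYC are log-likelihoods; the function name gmyc_lrt and the input values are hypothetical and are not taken from the table.

```r
# Minimal sketch of a likelihood ratio test of the GMYC model against the
# null model (no shift in branching rate), using a chi-square test with
# 3 degrees of freedom as stated in the table caption.
# ASSUMPTION: LR = 2 * (LGMYC - L0), with both values being log-likelihoods.
gmyc_lrt <- function(l0, l_gmyc, df = 3) {
  lr <- 2 * (l_gmyc - l0)                        # likelihood ratio statistic
  p  <- pchisq(lr, df = df, lower.tail = FALSE)  # upper-tail p-value (pLRT)
  list(LR = lr, pLRT = p)
}

# Hypothetical log-likelihoods, for illustration only.
res <- gmyc_lrt(l0 = 100.0, l_gmyc = 106.5)
print(res)
```

A small pLRT indicates that the GMYC model fits the branching times better than the null model of a single branching process.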
Nexus_Files_datasets
Sericini data: full and subclade data sets
Examples_simulated.trees
Examples of simulated trees for random, clustered, and clade-wise sampling
Code
Code for simulating gene trees within species trees, with sampling.
requirements:
R packages "ape", "apTreeshape"
SIMCOAL (SIMCOAL must be in your working directory)
codes:
simulation.R: example code for the simulations used in the manuscript
gene.tree.simulations.R: functions used to run SIMCOAL simulation, sample species and simulate species trees
simcoal.functions.R: functions for running SIMCOAL from R
sample.lineage.R: functions for sampling species from trees
usage:
"gene.tree.simulations.R", "simcoal.functions.R" and "sample.lineage.R" contain the functions used to run the simulations. "simulation.R" is a script that calls these functions and runs a simulation.
Put the four R script files and SIMCOAL in one directory, then set your working directory to it (setwd("...")).
Run the code in simulation.R with source("./simulation.R"), or copy and paste the code into the R console.
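The usage steps above can be sketched as a short R session. The directory path is a placeholder that must be replaced before running (the original leaves it elided), and the variable names sim_dir and scripts are hypothetical. Whether simulation.R loads the three function files itself is not stated, so the sketch sources them first as a precaution.

```r
# Placeholder: replace "..." with the directory that holds the four R
# scripts and SIMCOAL. As written, setwd("...") will fail.
sim_dir <- "..."
setwd(sim_dir)

# Packages listed under "requirements".
library(ape)
library(apTreeshape)

# Check that all four scripts are present before running anything.
scripts <- c("gene.tree.simulations.R", "simcoal.functions.R",
             "sample.lineage.R", "simulation.R")
stopifnot(all(file.exists(scripts)))

# ASSUMPTION: sourcing the three function files first is harmless if
# simulation.R already loads them itself.
for (f in scripts[1:3]) source(file.path(".", f))

# Run the example simulation.
source("./simulation.R")
```

SIMCOAL itself is not checked here because the name of its executable is not given in the description.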