
Coalescent-based species delimitation is sensitive to geographic sampling and isolation by distance

Cite this dataset

Mason, Nicholas et al. (2020). Coalescent-based species delimitation is sensitive to geographic sampling and isolation by distance [Dataset]. Dryad.


Species are a fundamental unit of biodiversity that are delimited via genetic data and coalescent-based methods with increasing frequency. Despite the widespread use of coalescent-based species delimitation, we do not fully understand the sensitivity of these methods to potential sources of bias and violations of their underlying assumptions. One implicit assumption of coalescent-based species delimitation is that geographic sampling is adequate and representative of genetic variation among populations within the lineage of interest. Yet exhaustive geographic sampling is logistically difficult, if not impossible, for many taxa that span large geographic expanses or occupy remote regions. Here, we examine the impact of geographic sampling on the output of Bayes-factor delimitation with SNAPP, a popular coalescent-based species delimitation pipeline. First, we demonstrate the problematic nature of sparse geographic sampling and isolation by distance for species delimitation using simulated data sets of populations connected by different levels of gene flow. We then examine whether similar trends are present in an empirical dataset of Andesiops mayflies (Ephemeroptera: Baetidae) from a high elevation transect in the Ecuadorian Andes. In both the simulated and empirical analyses, we systematically exclude geographically intermediate sites to quantify the impact of geographic sampling and isolation by distance on coalescent-based species delimitation. We find that removing intermediate sites with genetically admixed individuals incorrectly favors multi-species delimitation scenarios. Oversplitting is especially pronounced when isolation by distance is strong, but exists even when gene flow among neighboring populations is relatively high. 
These findings highlight the importance of adequate geographic sampling in species delimitation and urge caution in interpreting the output of such methods when species’ distributions are sparsely sampled and in systems characterized by strong patterns of isolation by distance.
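The site-exclusion step described above can be illustrated with a minimal sketch. It assumes sampling sites are ordered along a transect and that the two endpoint sites are always retained; the function name `exclusion_schemes` and the site labels are hypothetical and are not taken from the dataset or the authors' pipeline. It only enumerates which sites would be kept in each reduced sampling scheme and does not run any delimitation analysis.

```python
from itertools import combinations


def exclusion_schemes(sites, n_drop):
    """Return every sampling scheme that drops n_drop intermediate sites.

    `sites` is a list of unique labels ordered along the transect.
    Assumption for this sketch: the first and last sites are always kept.
    """
    if len(sites) < 2:
        raise ValueError("need at least two sites (the transect endpoints)")
    interior = sites[1:-1]
    if not 0 <= n_drop <= len(interior):
        raise ValueError("n_drop must be between 0 and the number of interior sites")
    schemes = []
    for dropped in combinations(interior, n_drop):
        # Keep the original transect order for the retained sites.
        schemes.append([s for s in sites if s not in dropped])
    return schemes


# Hypothetical site labels, ordered from one end of the transect to the other.
transect = ["S1", "S2", "S3", "S4", "S5"]
for scheme in exclusion_schemes(transect, 2):
    print(scheme)
```

Each retained-site list could then be used to subset the SNP data before a separate delimitation run, so that support for multi-species scenarios can be compared across sampling schemes.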


This data set combines empirical and simulated data. As described in the methods section, the empirical data come from a genus of mayflies, Andesiops, distributed along a high-elevation drainage in the Andes of Ecuador. The data were generated in a ddRAD experiment, and loci were assembled with Stacks. The simulated data sets were generated with the program msLandscape.


National Science Foundation, Award: DEB-1046408

National Science Foundation, Award: DEB-1045960

National Science Foundation, Award: DBI-1710739