Dryad

Data for: PickMe: sample selection for species tree reconstruction using coalescent weighted quartets

Abstract

After collecting large data sets for phylogenomic studies, researchers must decide which genes or samples to include when reconstructing a species tree. Incomplete or unreliable data sets make this decision more difficult, and researchers typically rely on ad hoc strategies to maximize sampling while ensuring sufficient data for accurate inference. An algorithm called PickMe formalizes the sample selection process, assuming that the samples evolved under the Tree Multispecies Coalescent model. We propose a Bayesian framework for selecting samples for species tree analysis. Given a collection of gene trees, we compute a posterior probability for each quartet, describing how likely the species tree is to display that quartet topology. From these, we assign each sample a reliability score, computed as the average of a scaled version of the posterior probabilities. PickMe uses these scores as weights to recommend which samples to include in a species tree analysis. Analysis of simulated data showed that including the samples suggested by PickMe produced species trees closer to the true species trees than either unfiltered data sets or data sets with ad hoc gene occupancy cut-offs applied. To further illustrate the efficacy of this tool, we apply PickMe to gene trees generated from target capture data from milkweeds. PickMe indicates that more samples could reliably have been included in a previous milkweed phylogenomic analysis than its authors, who lacked a formal methodology for sample selection, chose to analyze. Using simulated and empirical data, we also compare PickMe to existing sample selection methods. Including PickMe in phylogenomic data analysis pipelines will enhance them by providing a formal structure for sample selection.
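The scoring step described in the abstract can be illustrated with a minimal Python sketch. It is not the PickMe implementation. The abstract does not specify the scaling or the selection rule, so the function names, the linear rescaling of each posterior from [1/3, 1] to [0, 1], and the fixed threshold are all assumptions made for illustration. The input is assumed to be a mapping from each quartet (a set of four sample names) to the posterior probability of its best-supported topology.

```python
from collections import defaultdict


def sample_scores(quartet_posteriors):
    """Average a scaled quartet posterior over all quartets containing each sample.

    quartet_posteriors: dict mapping frozenset of 4 sample names -> posterior
    probability of the best-supported quartet topology (assumed input format).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for quartet, posterior in quartet_posteriors.items():
        # Assumed scaling: 1/3 (no signal among three topologies) maps to 0,
        # and 1 (full support) maps to 1. The real scaling may differ.
        scaled = max(0.0, (posterior - 1 / 3) / (2 / 3))
        for sample in quartet:
            totals[sample] += scaled
            counts[sample] += 1
    return {sample: totals[sample] / counts[sample] for sample in totals}


def recommend(scores, threshold=0.5):
    """Return the samples whose score meets a hypothetical threshold, sorted by name."""
    return sorted(sample for sample, score in scores.items() if score >= threshold)
```

With this scaling, a sample that appears only in quartets with uninformative posteriors (near 1/3) receives a score near 0, while a sample whose quartets are consistently well resolved scores near 1.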