# Data for: PickMe: sample selection for species tree reconstruction using coalescent weighted quartets

## Citation

Rusinko, Joseph et al. (2021), Data for: PickMe: sample selection for species tree reconstruction using coalescent weighted quartets, Dryad, Dataset, https://doi.org/10.5061/dryad.3r2280ggv

## Abstract

After collecting large data sets of many genes for many species for phylogenomics studies, researchers may make ad hoc decisions about which genes or samples to include in a species tree reconstruction analysis based on various parameters, including the amount of missing data. Optimally, sampling would be maximized, but it can be difficult for empiricists to determine where to draw the line for sample inclusion when data sets are incomplete. Under the multispecies coalescent model, in which the dominant quartet topology displayed across gene trees matches the topology of that quartet on the species tree, we propose a Bayesian framework to select samples for which there is support for inclusion in a species tree analysis. Given a collection of gene trees, a posterior probability is assigned to each quartet topology, describing the likelihood that the species tree displays this topology. From this, individual samples are assigned reliability scores computed as the average of a rescaling of the posterior probabilities. These weights are used in a Bayesian framework in an algorithm called PickMe, which determines which individuals should be included in a species tree analysis. To illustrate the efficacy of this tool, PickMe is applied to gene trees generated from target capture data from milkweeds. PickMe indicates that more samples could have reliably been included in a previous milkweed phylogenomic analysis than the authors analyzed, without access to a formal decision-making procedure. Thus, PickMe will be a valuable addition to data analysis pipelines for phylogenomics studies.
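The idea of a dominant quartet topology can be illustrated with a small sketch. It only counts how often each topology of one four-taxon set appears across gene trees and reports the most frequent one. It is not the PickMe posterior probability or reliability score, whose exact form is not given here. The function names and the pair-of-pairs input format are assumptions made for illustration.

```python
from collections import Counter


def canonical_quartet(pair_one, pair_two):
    """Return an order-independent key for the quartet topology pair_one | pair_two."""
    return frozenset([frozenset(pair_one), frozenset(pair_two)])


def dominant_quartet(observed):
    """Return the most frequent quartet topology and its relative frequency.

    `observed` holds one (pair_one, pair_two) entry per gene tree, all on the
    same four taxa (hypothetical input format). Ties are resolved in favour of
    the topology seen first.
    """
    if not observed:
        raise ValueError("at least one observed quartet is required")
    counts = Counter(canonical_quartet(p, q) for p, q in observed)
    topology, count = counts.most_common(1)[0]
    return topology, count / sum(counts.values())


# Three gene trees: two display ab|cd, one displays ac|bd.
gene_tree_quartets = [
    (("a", "b"), ("c", "d")),
    (("b", "a"), ("d", "c")),
    (("a", "c"), ("b", "d")),
]
best_topology, frequency = dominant_quartet(gene_tree_quartets)
```

Under the multispecies coalescent assumption stated in the abstract, the topology returned here is the one expected to match the species tree for those four taxa, given enough gene trees.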

## Methods

We obtained targeted sequence data for 763 putatively single-copy nuclear loci for samples of 62 North American and two African outgroup species, Asclepias physocarpa and A. fornicata, using the target enrichment baits of Weitemier et al. (2014). Data for 32 of these samples and orthologs from the genome sequence of Asclepias syriaca (Weitemier et al., 2019) were included in the analyses of Boutte et al. (2019), and nuclear sequence data for the additional 30 samples were generated using the DNA sequencing and assembly methods described therein. Boutte et al. (2019) had excluded the 30 newly analyzed samples based on an ad hoc minimum gene recovery criterion of 600 genes (79%) with the goal of high gene occupancy for species tree analyses. For the analyses conducted here, we masked assembled sequences with Ns at sites with very low read depth (≤ 2 reads) and at heterozygous sites (i.e., intra-individual SNPs). For each gene, we aligned masked sequences using MAFFT v. 7.245 with default parameters (Katoh and Standley, 2013), and then removed sequences with less than 50% of the total alignment length following Sayyari et al. (2017).
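The masking and length-filtering steps described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used for this data set: the function names, the per-site depth list, the set of heterozygous positions, and the choice to measure sequence length as the number of non-gap characters are all assumptions.

```python
def mask_sequence(seq, depths, het_sites, max_low_depth=2):
    """Replace a base with N where read depth is <= max_low_depth or the
    site is heterozygous. `depths` and `het_sites` are hypothetical inputs:
    a per-site depth list and a set of 0-based heterozygous positions."""
    masked = []
    for i, base in enumerate(seq):
        if depths[i] <= max_low_depth or i in het_sites:
            masked.append("N")
        else:
            masked.append(base)
    return "".join(masked)


def filter_short_sequences(alignment, min_fraction=0.5, gap_chars="-"):
    """Keep aligned sequences whose non-gap length is at least
    min_fraction of the total alignment length.

    Assumption: "length" means the count of characters not in gap_chars;
    the source does not say whether Ns should also be excluded."""
    if not alignment:
        return {}
    aln_len = len(next(iter(alignment.values())))
    kept = {}
    for name, seq in alignment.items():
        non_gap = sum(1 for c in seq if c not in gap_chars)
        if non_gap >= min_fraction * aln_len:
            kept[name] = seq
    return kept


masked = mask_sequence("ACGTAC", [5, 2, 3, 1, 10, 4], {4})
toy_alignment = {"s1": "ACGTACGT", "s2": "AC------", "s3": "ACGT----"}
retained = filter_short_sequences(toy_alignment)
```

In the toy alignment, `s2` covers only 2 of 8 columns and is dropped, while `s3` sits exactly at the 50% threshold and is kept, matching the "less than 50%" wording.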

We selected a subset of 703 genes, which had been identified by Boutte et al. (2019) as producing the best resolved milkweed phylogenies based on bootstrap support across the gene trees, for further analysis. For the complete data set of 62 species, we first estimated the 703 gene trees using Neighbor Joining on uncorrected distances (the proportion of observed differences in the aligned sequences) as implemented in the ape package (Paradis and Schliep, 2018) in R v. 3.5.1 (R Core Team, 2013). Using these estimated gene trees, we then identified the samples to be included in species tree analyses using PickMe. To determine whether the gene tree inference method affected the sample selection results, we also used the GTR+Gamma model in RAxML v. 8.2.12 (Stamatakis, 2014) to estimate the initial gene trees. For the set of samples identified as reliable by PickMe, we realigned the sequences and then removed small alignments (< 100 bp) following Boutte et al. (2019). We then used IQ-Tree v. 1.5.4 (Nguyen et al., 2014; Chernomor et al., 2016) to select the best model of molecular evolution for the retained alignments and inferred the gene tree for each locus using the same parameters as Boutte et al. (2019). Using ASTRAL-II v. 4.10.12 (Mirarab and Warnow, 2015) with default parameters, we inferred a species tree and calculated local posterior probability support (Sayyari and Mirarab, 2016). We calculated gene concordance factors using the method of Minh et al. (2020), implemented in IQ-Tree v. 2.1.2 (Nguyen et al., 2014; Chernomor et al., 2016).
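The uncorrected distance used for the initial Neighbor Joining trees is the proportion of sites that differ between two aligned sequences. The sketch below computes it in Python for illustration only; the analyses themselves were run with the ape package in R, and the source does not name the specific functions called. Skipping any site where either sequence has a gap, N, or other ambiguity code (pairwise deletion) is an assumption, as are the function names.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def p_distance(seq_a, seq_b):
    """Proportion of differing sites among sites where both sequences
    carry an unambiguous base (assumed pairwise deletion)."""
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in VALID_BASES and b in VALID_BASES:
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        return float("nan")  # no comparable sites
    return differences / compared


def distance_matrix(alignment):
    """Return a symmetric dict of pairwise p-distances keyed by name pairs."""
    names = sorted(alignment)
    dist = {(name, name): 0.0 for name in names}
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        dist[(a, b)] = d
        dist[(b, a)] = d
    return dist


toy_alignment = {"s1": "ACGT", "s2": "ACGA", "s3": "AGNT"}
toy_distances = distance_matrix(toy_alignment)
```

A matrix like `toy_distances` is the input a neighbor joining implementation would take to build each gene tree.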

## Usage Notes

The uploaded README describes the uploaded data files.

## Funding

National Science Foundation, Award: DMS 1616186

National Science Foundation, Award: DEB 1457510

National Science Foundation, Award: DEB 1457473