Data for: PickMe: Sample selection for species tree reconstruction using coalescent weighted quartets

Rusinko, Joseph1; Cai, Yu1; Crysler, Allison1; Thompson, Katherine2; Boutte, Julien3; Fishbein, Mark4; Straub, Shannon1

Published Aug 06, 2021; Updated Jun 23, 2025 on Dryad. https://doi.org/10.5061/dryad.3r2280ggv

Abstract

After collecting large data sets for phylogenomics studies, researchers must decide which genes or samples to include when reconstructing a species tree. Incomplete or unreliable data sets make the empiricist's decision more difficult. Researchers rely on ad hoc strategies to maximize sampling while ensuring sufficient data for accurate inferences. An algorithm called PickMe formalizes the sample selection process, assuming that the samples evolved under the Tree Multispecies Coalescent model. We propose a Bayesian framework for selecting samples for species tree analysis. Given a collection of gene trees, we compute a posterior probability for each quartet, describing the likelihood that the species tree displays this topology. From this, we assign individual samples reliability scores computed as the average of a scaled version of the posterior probabilities. PickMe uses these weights to recommend which samples to include in a species tree analysis. Analysis of simulated data showed that including the samples suggested by PickMe produced species trees closer to the true species trees than both unfiltered data sets and data sets with ad hoc gene occupancy cut-offs applied. To further illustrate the efficacy of this tool, we apply PickMe to gene trees generated from target capture data from milkweeds. PickMe indicates more samples could have reliably been included in a previous milkweed phylogenomic analysis than the authors analyzed without access to a formal methodology for sample selection. Using simulated and empirical data, we also compare PickMe to existing sample selection methods. Inclusion of PickMe will enhance phylogenomics data analysis pipelines by providing a formal structure for sample selection.

Description of the data and file structure

Data was collected for the analysis of the evolutionary relationships among milkweeds. The remaining data was used to test the PickMe algorithm for sample selection in the context of phylogenomic analysis.

Supplemental Materials:

- SupplA1A7.pdf: Contains the supplemental appendices A1-A7 referenced in the published article.

Data Descriptions

- Milkweed-Sequence-Files.zip: Contains sequence data for the analysis. All sequences have been referenced on GenBank.

- estimated-gene-trees-NJ-Uncorrected and estimated-gene-trees-RAxML: Contain all estimated Milkweed gene trees as described in the associated article. Sample names were cleaned up for the main manuscript. A text file is provided for matching the sample names in the trees to those in the manuscript.

- OldSpeciesTree.cf.tree: The species tree referenced in the paper, based on the dataset used in Boutte et al. (2019).

- new_species_Tree.cf.tree: The new Milkweed phylogeny, including the additional samples selected by PickMe, as described in the article.

- Asclepias_sp_R1.fq.gz and Asclepias_sp_R2.fq.gz: Illumina sequence data for the sample "Asclepias sp.", which was determined to have been involved in a lab error.

Files and variables

Data Descriptions

- gates_newPickMeOut.csv: The output from running PickMe on Estimated-gene-tree-Gates.tre, which contains the estimated trees provided by Dan Gates for the method comparison.

- PickMe_all_sp_ML_trees_species_tree.tre: The species tree referenced in the paper that included all available samples, with exclusion using PickMe.

- PickMe_all_sp_ML_trees: The species tree referenced in the paper that included all available samples, without exclusion using PickMe.

- species_names_in_trees_differing_from_the_manuscript_text.txt: Indexing file that matches the species names published in the manuscript with the specific sample names used in the analysis.

Code/software

Software Descriptions

- The sample selection software PickMe is available at PhyloPickMe.jl on GitHub and can also be installed through the Julia package PhyloPickMe.

- OccupancyDistribution.R: Contains the code used to sample occupancy rates in the Species Tree Analysis simulation.

- MissingSamplesForSim.R: The script used to drop samples from the DNA data, based on supplemental data from Sayyari et al. (2017), for the comparison between PickMe and occupancy-based sample selection.

- DropTipsFromPickMe.R: Takes a set of DNA files and the results of a PickMe run as input, writing out revised DNA sequences that exclude all samples identified by PickMe as "VeryBad" or "Bad" (for a single dataset).

- PickMeSimDropTipsPickMe.R: Used to generate DNA sequence data by removing samples classified as "Bad" or "VeryBad" in the simulation.

- PickMeSimDropTipsOccupancy.R: Used to generate DNA sequence data by removing samples with a gene occupancy rate below a certain threshold.

- OccAstral.sh: Script to run ASTRAL on the gene trees after samples were removed based on occupancy rates. A similar script was used to compute ASTRAL for the PickMe-dropped trees and for the estimated trees before any samples were dropped.

- PickMeRFAnalysis.R: Contains code to compare the results of PickMe to dropping samples based on occupancy rate. For ease of use, the script reads stored data from OccDropData.RData instead of recomputing distances.

- Kappa.jl: Runs PickMe on collections of estimated trees for the Kappa simulation.

- CreateMissingTaxaforKappa.R: Drops samples from gene trees according to a uniform gene occupancy distribution.

- CreateTaxaErrorsKappa.R: Generates "random" taxa that appear arbitrarily on gene trees for the Kappa simulation.

- KappaAnalysisforResubmission.R: Tests the efficacy of PickMe in identifying random taxa at a range of cutoff values.

- Slurm*arrayJob.txt and RAxML*_occupancy script: Representative scripts used for estimating gene trees with RAxML on a cluster.
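The occupancy-based filtering that several of the scripts above perform can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' R code: the function names, the gene-to-samples mapping, and the 0.5 threshold are all hypothetical and chosen only to show the idea of dropping samples whose gene occupancy rate falls below a cutoff.

```python
from collections import Counter


def occupancy_rates(gene_samples):
    """Return, for each sample, the fraction of genes in which it appears."""
    n_genes = len(gene_samples)
    # Count each sample at most once per gene.
    counts = Counter(s for samples in gene_samples.values() for s in set(samples))
    return {s: c / n_genes for s, c in counts.items()}


def drop_low_occupancy(gene_samples, threshold):
    """Remove every sample whose occupancy rate is below `threshold`."""
    rates = occupancy_rates(gene_samples)
    keep = {s for s, r in rates.items() if r >= threshold}
    return {g: [s for s in samples if s in keep]
            for g, samples in gene_samples.items()}


# Hypothetical toy input: three genes and the samples recovered for each.
genes = {
    "gene1": ["A", "B", "C"],
    "gene2": ["A", "B"],
    "gene3": ["A", "C", "D"],
}

# Sample "D" appears in only one of three genes, so it is dropped at 0.5.
filtered = drop_low_occupancy(genes, threshold=0.5)
```

The same structure would apply to label-based filtering: replacing the threshold test with a check against a set of sample labels (for example "Bad" or "VeryBad") gives a filter driven by a classification rather than by occupancy.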

Access information

Data was derived from the following source: https://datadryad.org/stash/dataset/doi:10.6076/D14599

Sayyari, Erfan; Whitfield, James B.; Mirarab, Siavash (2017). Fragmentary gene sequences negatively impact gene tree and species tree reconstruction [dataset]. Dryad Digital Repository. https://doi.org/10.6076/D14599

Additional Iochroma data was provided by Dan Gates.

Change Log

14 October, 2024

- Added gatesnew_PickMeOut. No changes in the results; the output file is included for reference.

- Added Estimated-gene-tree-Gates.tre and PicKMeSimData.tar.zst. These files were made available so that reviewers could check the simulation study and the empirical comparison in the article. Both data sets contain data of which the authors of this paper are not the primary owners. This material will be removed or replaced with references upon acceptance.

- Added a variety of R, Bash, and Julia scripts that were used in the simulation and results for the paper, provided primarily for transparency.

30 January, 2025

- Removed Estimated-gene-tree-Gates.tre and PicKMeSimData.tar.zst. Both data sets contained data of which the authors of this paper are not the primary owners.

17 June, 2025

- Added SupplementalMaterialsA1-A7.pdf, which contains the supplemental appendices referenced in the published article.

We obtained targeted sequence data for 763 putatively single-copy nuclear loci for samples of 59 North American milkweed species, three African outgroup species (Asclepias physocarpa, A. fruticosa, and A. fornicata), and one additional outgroup, Pergularia daemia, using the target enrichment baits of Weitemier et al. (2014) (see Supplemental Material). Data for 32 of these samples and orthologs from the genome sequence of Asclepias syriaca (Weitemier et al., 2019) were included in the analyses of Boutte et al. (2019), and nuclear sequence data for the additional 30 samples were generated using the DNA sequencing and assembly methods described therein. Boutte et al. (2019) had excluded the 30 newly analyzed samples based on an ad hoc minimum gene recovery criterion of 600 genes (79%), with the goal of high gene occupancy for all samples in species tree analyses. For the analyses conducted here, we masked assembled sequences with Ns at sites with very low read depth (≤ 2 reads) and at heterozygous sites (i.e., intra-individual SNPs). For each gene, we aligned masked sequences using MAFFT version 7.245 with default parameters (Katoh and Standley, 2013) and removed sequences with less than 50% of the total alignment length (i.e., Type II missing data; Hosner et al., 2016) to reduce gene tree error, following Sayyari et al. (2017) and Mirarab (2019).
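The fragment filter described above (removing sequences that cover less than 50% of the total alignment length) can be illustrated with a short Python sketch. It is not the pipeline used for the data set. The function names and the toy alignment are hypothetical, and the sketch assumes that sequence length is measured as the number of non-gap characters, which the text does not specify. The last line checks the arithmetic behind the 600-gene (79%) recovery criterion.

```python
def ungapped_length(seq, gap_chars="-"):
    """Count alignment columns in which the sequence has a non-gap character."""
    return sum(1 for ch in seq if ch not in gap_chars)


def drop_fragments(alignment, min_fraction=0.5):
    """Keep sequences covering at least `min_fraction` of the alignment length."""
    if not alignment:
        return {}
    # All sequences in an alignment share the same length.
    aln_len = len(next(iter(alignment.values())))
    return {name: seq for name, seq in alignment.items()
            if ungapped_length(seq) >= min_fraction * aln_len}


# Hypothetical toy alignment of length 10.
aln = {
    "sample1": "ACGTACGTAC",
    "sample2": "ACGT------",  # 4 of 10 columns: removed
    "sample3": "ACGTA-----",  # 5 of 10 columns: kept
}
kept = drop_fragments(aln)

# The ad hoc recovery criterion: 600 of 763 loci, roughly 79%.
min_gene_fraction = 600 / 763
```

Whether masked Ns should count toward sequence length is a design choice; here they would count, because only "-" is treated as a gap. Passing gap_chars="-N" to ungapped_length would exclude them.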

For further analysis, we selected a subset of 703 genes, which had been identified by Boutte et al. (2019) as producing the best-resolved milkweed phylogenies based on bootstrap support across the gene trees. For the complete data set of 63 species, we first estimated the 703 gene trees using Neighbor-Joining on uncorrected distances (the proportion of observed differences in the aligned sequences) as implemented in the ape package (Paradis and Schliep, 2018) in R v. 3.5.1 (R Core Team). Using these estimated gene trees, we identified the samples to be included in species tree analyses using PickMe. To determine whether the gene tree inference method affected the sample selection results, we also estimated the initial gene trees using the GTR+Gamma model in RAxML v. 8.2.12 (Stamatakis, 2014). For the set of samples identified as reliable by PickMe, we realigned the sequences and then removed small alignments (< 100 bp) following Boutte et al. (2019). We then used IQ-Tree v. 1.5.4 (Nguyen et al., 2014; Chernomor et al., 2016) to select the best model of molecular evolution for the retained alignments and inferred the gene tree for each locus using the same parameters as Boutte et al. (2019). Using ASTRAL-II v. 4.10.12 (Mirarab and Warnow, 2015) with default parameters, we inferred a species tree and calculated local posterior probability support (Sayyari and Mirarab, 2016). We calculated gene concordance factors using the method of Minh et al. (2020), implemented in IQ-Tree v. 2.1.2 (Nguyen et al., 2014; Chernomor et al., 2016). For comparison, we repeated for the full data set the gene and species tree analyses performed on the PickMe-reliable subset of samples, using identical methods.
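The uncorrected distance used for the Neighbor-Joining gene trees is the proportion of observed differences between two aligned sequences. A minimal Python sketch of that quantity follows. It is an illustration only and not the ape implementation: the function name is hypothetical, and the sketch assumes that columns where either sequence has a gap, an N, or another ambiguity code are skipped for that pair, which may differ from how ape handles missing data.

```python
def p_distance(seq_a, seq_b, valid="ACGT"):
    """Proportion of differing sites among columns where both bases are valid."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip columns with gaps, masked Ns, or ambiguity codes (an assumption).
        if a in valid and b in valid:
            compared += 1
            differing += a != b
    if compared == 0:
        return float("nan")  # no comparable sites
    return differing / compared


# Eight comparable columns, two of which differ.
d = p_distance("ACGTACGTNN", "ACGTTCGA--")
```

Computing this value for every pair of sequences in an alignment gives the distance matrix that a Neighbor-Joining implementation takes as input.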
