Placing a new species on an existing phylogeny is increasingly relevant to several applications. Placement can be used to update phylogenies in a scalable fashion and can help identify unknown query samples using (meta-)barcoding, skimming, or metagenomic data. Maximum likelihood (ML) methods of phylogenetic placement exist, but these methods do not scale to reference trees with many thousands of leaves, which limits their ability to benefit from the dense taxon sampling of modern reference libraries. They also rely on assembled sequences for the reference set and aligned sequences for the query. Thus, ML methods cannot analyze datasets where the reference consists of unassembled reads, a scenario relevant to emerging applications of genome-skimming for sample identification. We introduce APPLES, a distance-based method for phylogenetic placement. Compared to ML, APPLES is an order of magnitude faster and more memory efficient, and unlike ML, it is able to place on large backbone trees (tested for up to 200,000 leaves). We show that using dense references improves accuracy substantially, so that APPLES on dense trees is more accurate than ML on sparser trees, where ML can run. Finally, APPLES can accurately identify samples without assembled references or aligned queries using k-mer-based distances, a scenario that ML cannot handle. APPLES is publicly available at github.com/balabanmetin/apples.
Assembly-free phylogenetic placement analysis on Lice
This is a set of 61 genome-skims by Boyd et al. (2017), including 45 known lice species (some represented multiple times) and 7 undescribed species. We generate lower coverage skims of 0.1Gb or 0.5Gb by randomly subsampling the reads from the sequence read archives (SRA) provided by the original publication (NCBI BioProject PRJNA296666). We use BBTools (Bushnell, 2014) to filter subsampled reads for adapters and contaminants and to remove duplicated reads. Because of the large size of the read data, we include genome sketches generated by Skmer in this dataset. Since this dataset is not assembled, the coverage of the genome-skims is unknown; Skmer estimates the coverage to be between 0.2X and 1X for 0.1Gb samples (and 5 times that coverage with 0.5Gb). This dataset also includes an ML concatenation tree previously published by Boyd et al. (2017), the scripts used in the data preparation, and the placement trees output by APPLES.
lice.tar.gz
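The random subsampling step described above can be illustrated with a minimal Python sketch. It is not the procedure or tool used to build this dataset: the function name `subsample_reads`, the in-memory list of reads, and the reading of 0.1Gb as a target number of bases are all assumptions made for illustration.

```python
import random


def subsample_reads(reads, target_bases, seed=0):
    """Randomly pick reads until their total length reaches target_bases.

    Hypothetical helper: `reads` is a list of read sequences held in memory,
    which is only practical for small illustrative inputs.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    order = list(range(len(reads)))
    rng.shuffle(order)                 # visit the reads in random order
    picked, total = [], 0
    for idx in order:
        if total >= target_bases:      # stop once the target is reached
            break
        picked.append(reads[idx])
        total += len(reads[idx])
    return picked


if __name__ == "__main__":
    toy_reads = ["ACGT" * 25 for _ in range(1000)]   # 1000 reads of 100 bp
    skim = subsample_reads(toy_reads, target_bases=10_000)
    print(len(skim), sum(len(r) for r in skim))
```

For real SRA-sized inputs, a streaming subsampler would be needed instead of loading every read into a list.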
Simulated gene alignments based on GTR model
This package includes a 101-taxon dataset, previously made available by Mirarab and Warnow (2015). Sequences were simulated under the General Time Reversible (GTR) plus Γ model of site rate heterogeneity using INDELible (Fletcher and Yang, 2009) on gene trees that were simulated using SimPhy (Mallo et al., 2016) under the coalescent model evolving on species trees generated under the Yule model. We took all 20 replicates of this dataset with mutation rates between 5 × 10⁻⁸ and 2 × 10⁻⁷, and for each replicate, randomly selected five estimated gene trees among those with 20% RF distance between estimated and true gene tree. Thus, we have a total of 100 backbone trees. The package includes the estimated trees, the leave-one-out phylogenetic placement experiment files that made it into the paper, and the scripts used in generating the data and running the experiments.
gtr.tar.bz2
Full RNAsim simulation data
Guo et al. (2009) designed a complex model of RNA evolution that does not make the usual i.i.d. assumptions of sequence evolution. Instead, it uses models of the energy of the secondary structure to simulate RNA evolution by a mutation-selection population genetics model. This model is based on an inhomogeneous stochastic process without a global substitution matrix. This is an RNASim dataset of one million 227 sequences (with E. coli SSU rRNA used as the root), which consists of a multiple sequence alignment and the true phylogeny.
full-RNAsim-simulation-files.tar.bz2
RNASim-AE: Estimated alignment dataset
This is the Alignment Error (RNASim-AE) dataset. Mirarab et al. (2015) used PASTA to estimate alignments on subsets of the RNASim dataset with up to 200,000 sequences. This dataset contains their reported alignments with 200,000 or 10,000 sequences (taking only replicate 1 in this case), along with the experiment data and scripts for this dataset.
estimated-alignment-data.tar.bz2
RNASim-QS: Query scalability Dataset
We first randomly subsampled the full RNASim dataset to create a dataset of size 500. Then for k = 1 to 49,152 queries (choosing all k = 3 × 2^i, 0 ≤ i ≤ 14) we created 5 replicates of k query sequences, again randomly subsampling from the full alignment with one million sequences. This dataset includes backbone trees, backbone alignments, query alignments, placement trees, time measurements, and the scripts used in this experiment.
query-scalability.tar.bz2
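As a small illustration of the query-set sizes described above, the following Python sketch enumerates k = 3 × 2^i for 0 ≤ i ≤ 14 and draws replicate query sets at random. The function names, the seed handling, and the toy identifier pool are assumptions for illustration, not the scripts shipped in the archive. How the k = 1 case and any exclusion of backbone sequences were handled is not specified here and is not modeled.

```python
import random


def query_set_sizes(max_exponent=14):
    """Return the sizes k = 3 * 2**i for 0 <= i <= max_exponent."""
    return [3 * 2**i for i in range(max_exponent + 1)]


def draw_query_replicates(sequence_ids, k, n_replicates=5, seed=0):
    """Draw n_replicates random query sets of size k without replacement.

    Hypothetical helper: each replicate is sampled independently from the
    same pool of sequence identifiers.
    """
    rng = random.Random(seed)
    return [rng.sample(sequence_ids, k) for _ in range(n_replicates)]


if __name__ == "__main__":
    sizes = query_set_sizes()
    print(sizes[0], sizes[-1], len(sizes))

    # Toy pool standing in for the full alignment's sequence names.
    pool = [f"seq{i}" for i in range(1000)]
    replicates = draw_query_replicates(pool, k=sizes[2])
    print([len(r) for r in replicates])
```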
RNASim-VS: Varied Size Dataset
We randomly subsampled the full RNASim dataset to create 5 replicates of datasets of size (n): 500, 1000, 5000, 10000, 50000, and 100000, and 1 replicate (due to size) of size 200000. For replicates that contain at least 5000 species, we removed sites that contain gaps in 95% or more of the sequences in the alignment. This dataset includes backbone alignment, backbone trees, query sequences, and scripts used in performing the experiment.
variable-size.tar.bz2
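The gap-filtering step described above (removing sites that contain gaps in 95% or more of the sequences) can be sketched in Python as follows. This is an illustrative sketch only: the function name `drop_gappy_sites`, the dictionary representation of the alignment, and the use of "-" as the only gap character are assumptions, not the code in the archive.

```python
def drop_gappy_sites(alignment, threshold=0.95, gap_chars="-"):
    """Remove columns whose gap fraction is at least `threshold`.

    `alignment` maps sequence names to aligned strings of equal length.
    """
    if not alignment:
        return {}
    names = list(alignment)
    n_seqs = len(names)
    n_sites = len(alignment[names[0]])
    if any(len(alignment[name]) != n_sites for name in names):
        raise ValueError("all sequences must have the same aligned length")

    keep = []
    for j in range(n_sites):
        gaps = sum(alignment[name][j] in gap_chars for name in names)
        # Keep the site only if fewer than threshold * n_seqs rows are gaps.
        if gaps < threshold * n_seqs:
            keep.append(j)
    return {name: "".join(alignment[name][j] for j in keep) for name in names}


if __name__ == "__main__":
    toy = {"a": "A-C-", "b": "A-G-", "c": "AT--"}
    print(drop_gappy_sites(toy))
```

In the toy alignment, only the last column is gapped in every sequence, so it is the only one removed.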
Varied diameter RNASim dataset
To evaluate the impact of the evolutionary diameter (i.e., the largest distance between any two leaves in the backbone), we also created datasets with low, medium, and high diameters. We sampled the largest five clades of size at most 250 from each of the 10 replicates used for the heterogeneous dataset. Among these 50 clades, we picked the bottom, middle, and top five clades in diameter, which had diameters in [0.3, 0.4] (mean: 0.36), [0.5, 0.52] (mean: 0.51), and [0.65, 1.07] (mean: 0.82), respectively.
varied-diameter.tar.bz2
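To make the notion of diameter used above concrete, here is a minimal Python sketch that computes the largest path distance between any two leaves of a tree. It assumes the tree is given as a list of (node, node, branch length) edges with non-negative lengths. The edge-list representation and the function name `tree_diameter` are hypothetical and are not taken from the scripts in the archive.

```python
from collections import defaultdict


def tree_diameter(edges):
    """Return the largest path length between two nodes of a weighted tree.

    With non-negative branch lengths this maximum is attained between two
    leaves, so it matches the leaf-to-leaf diameter.
    """
    adjacency = defaultdict(list)
    for u, v, length in edges:
        adjacency[u].append((v, length))
        adjacency[v].append((u, length))

    def farthest_from(start):
        # Iterative traversal that accumulates path lengths from `start`.
        dist = {start: 0.0}
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor, length in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + length
                    stack.append(neighbor)
        end = max(dist, key=dist.get)
        return end, dist[end]

    # Two passes: the farthest node from any start is one end of a
    # diameter path, and the farthest node from that end gives its length.
    one_end, _ = farthest_from(edges[0][0])
    _, diameter = farthest_from(one_end)
    return diameter


if __name__ == "__main__":
    toy_edges = [
        ("root", "x", 0.1),
        ("x", "A", 0.2),
        ("x", "B", 0.3),
        ("root", "C", 0.5),
    ]
    print(round(tree_diameter(toy_edges), 6))
```

In the toy tree, the two most distant leaves are B and C, separated by a path of length 0.3 + 0.1 + 0.5.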