Data from: APPLES: Scalable distance-based phylogenetic placement with or without alignments

Balaban, Metin; Sarmashghi, Shahab; Mirarab, Siavash

Published Oct 07, 2019 on Dryad. https://doi.org/10.5061/dryad.78nf7dq

Data files

Oct 07, 2019 version files: 12.27 GB

estimated-alignment-data.tar.bz2 (434.54 MB)
full-RNAsim-simulation-files.tar.bz2 (234.05 MB)
gtr.tar.bz2 (19.53 MB)
heterogeneous-dataset.tar.gz (50.50 MB)
lice.tar.gz (9.56 GB)
Online_Appendix.pdf (1.61 MB)
query-scalability.tar.bz2 (846.38 MB)
variable-size.tar.bz2 (1.11 GB)
varied-diameter.tar.bz2 (11.87 MB)

Abstract

Placing a new species on an existing phylogeny has increasing relevance to several applications. Placement can be used to update phylogenies in a scalable fashion and can help identify unknown query samples using (meta-)barcoding, skimming, or metagenomic data. Maximum likelihood (ML) methods of phylogenetic placement exist, but these methods are not scalable to reference trees with many thousands of leaves, limiting their ability to enjoy benefits of dense taxon sampling in modern reference libraries. They also rely on assembled sequences for the reference set and aligned sequences for the query. Thus, ML methods cannot analyze datasets where the reference consists of unassembled reads, a scenario relevant to emerging applications of genome-skimming for sample identification. We introduce APPLES, a distance-based method for phylogenetic placement. Compared to ML, APPLES is an order of magnitude faster and more memory efficient, and unlike ML, it is able to place on large backbone trees (tested for up to 200,000 leaves). We show that using dense references improves accuracy substantially so that APPLES on dense trees is more accurate than ML on sparser trees, where it can run. Finally, APPLES can accurately identify samples without assembled reference or aligned queries using k-mer-based distances, a scenario that ML cannot handle. APPLES is available publicly at github.com/balabanmetin/apples.

Assembly-free phylogenetic placement analysis on Lice

This is a set of 61 genome-skims by Boyd et al. (2017), including 45 known lice species (some represented multiple times) and 7 undescribed species. We generate lower-coverage skims of 0.1 Gb or 0.5 Gb by randomly subsampling the reads from the sequence read archives (SRA) provided by the original publication (NCBI BioProject PRJNA296666). We use BBTools (Bushnell, 2014) to filter the subsampled reads for adapters and contaminants and to remove duplicated reads. Because of the large size of the read data, we include the genome sketches generated by Skmer in this dataset. Since this dataset is not assembled, the coverage of the genome-skims is unknown; Skmer estimates the coverage to be between 0.2X and 1X for the 0.1 Gb samples (and 5 times that coverage for the 0.5 Gb samples). This dataset also includes an ML concatenation tree previously published by Boyd et al. (2017), the scripts used in the data preparation, and the placement trees output by APPLES.

lice.tar.gz
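
The random subsampling of reads to a fixed number of bases, as described above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the description does not say which tool was used for subsampling, and the function name `subsample_reads` and its parameters are hypothetical. Reads are held in memory as plain strings, which would not be practical for real SRA files.

```python
import random


def subsample_reads(reads, target_bases, seed=0):
    """Pick reads at random, without replacement, until at least
    target_bases bases have been kept (or the reads run out).

    Hypothetical helper for illustration; not the authors' pipeline.
    """
    rng = random.Random(seed)
    order = list(range(len(reads)))
    rng.shuffle(order)  # random order gives sampling without replacement

    kept, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        kept.append(reads[idx])
        total += len(reads[idx])
    return kept
```

For a 0.1 Gb skim, `target_bases` would be 100,000,000 under this sketch.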

Simulated gene alignments based on GTR model

This package includes a 101-taxon dataset previously made available by Mirarab and Warnow (2015). Sequences were simulated under the General Time Reversible (GTR) model plus the Γ model of site rate heterogeneity using INDELible (Fletcher and Yang, 2009) on gene trees that were simulated using SimPhy (Mallo et al., 2016) under the coalescent model evolving on species trees generated under the Yule model. We took all 20 replicates of this dataset with mutation rates between 5 × 10^-8 and 2 × 10^-7 and, for each replicate, randomly selected five estimated gene trees among those with 20% RF distance between the estimated and true gene trees. Thus, we have a total of 100 backbone trees. The package includes the estimated trees, the leave-one-out phylogenetic placement experiment files that made it into the paper, and the scripts used in generating the data and running the experiments.

gtr.tar.bz2

Full RNAsim simulation data

Guo et al. (2009) designed a complex model of RNA evolution that does not make the usual i.i.d. assumptions of sequence evolution. Instead, it uses models of the energy of the secondary structure to simulate RNA evolution by a mutation-selection population genetics model. This model is based on an inhomogeneous stochastic process without a global substitution matrix. This is an RNASim dataset of one million 227 sequences (with E. coli SSU rRNA used as the root), which consists of a multiple sequence alignment and the true phylogeny.

full-RNAsim-simulation-files.tar.bz2

RNASim Heterogeneous dataset

We first randomly subsampled the full dataset to create 10 datasets of size 10,000. Then, we chose the largest clade of size at most 250 from each replicate; this gives us 10 backbone trees of mean size 249.

heterogeneous-dataset.tar.gz

RNASim-AE: Estimated alignment dataset

This is the Alignment Error (RNASim-AE) dataset. Mirarab et al. (2015) used PASTA to estimate alignments on subsets of the RNASim dataset with up to 200,000 sequences. This dataset contains their reported alignments with 200,000 or 10,000 sequences (taking only replicate 1 in this case) and the experiment data and scripts for this dataset.

estimated-alignment-data.tar.bz2

RNASim-QS: Query scalability Dataset

We first randomly subsampled the full RNASim dataset to create a dataset of size 500. Then, for k = 1 to 49,152 queries (choosing all k = 3 × 2^i, 0 <= i <= 14), we created 5 replicates of k query sequences, again randomly subsampling from the full alignment with one million sequences. This dataset includes backbone trees, backbone alignments, query alignments, placement trees, time measurements, and the scripts used in this experiment.

query-scalability.tar.bz2
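
As a quick check on the query-set sizes of the form k = 3 × 2^i for 0 <= i <= 14 given above, the following Python snippet enumerates them. The variable name `query_sizes` is illustrative only and does not come from the scripts in the archive.

```python
# Query-set sizes of the form k = 3 * 2**i for 0 <= i <= 14.
query_sizes = [3 * 2**i for i in range(15)]

# Smallest size, largest size, and number of distinct sizes.
print(query_sizes[0], query_sizes[-1], len(query_sizes))  # 3 49152 15
```

The largest value, 3 × 2^14 = 49,152, matches the upper bound stated in the description.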

RNASim-VS: Varied Size Dataset

We randomly subsampled the full RNASim dataset to create 5 replicates of datasets of size (n): 500, 1,000, 5,000, 10,000, 50,000, and 100,000, and 1 replicate (due to size) of size 200,000. For replicates that contain at least 5,000 species, we removed sites that contain gaps in 95% or more of the sequences in the alignment. This dataset includes backbone alignments, backbone trees, query sequences, and the scripts used in performing the experiment.

variable-size.tar.bz2
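
The site-filtering rule described above (removing sites with gaps in 95% or more of the sequences) can be sketched as follows. This is a minimal illustration, not the script shipped in the archive: the function name `drop_gappy_sites`, the dictionary representation of the alignment, and the use of "-" as the only gap character are all assumptions.

```python
def drop_gappy_sites(alignment, threshold=0.95, gap_chars="-"):
    """Remove alignment columns whose gap fraction is >= threshold.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Hypothetical helper for illustration only.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    n = len(seqs)
    # Keep a column only if fewer than `threshold` of its entries are gaps.
    keep = [
        j for j in range(length)
        if sum(s[j] in gap_chars for s in seqs) / n < threshold
    ]
    return {name: "".join(seq[j] for j in keep) for name, seq in alignment.items()}
```

With 20 sequences, for example, a column with 19 gaps (95%) is dropped while a column with 18 gaps (90%) is kept.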

Varied diameter RNASim dataset

To evaluate the impact of the evolutionary diameter (i.e., the largest distance between any two leaves in the backbone), we also created datasets with low, medium, and high diameters. We sampled the largest five clades of size at most 250 from each of the 10 replicates used for the heterogeneous dataset. Among these 50 clades, we picked the bottom, middle, and top five clades in diameter, which had diameter in [0.3, 0.4] (mean: 0.36), [0.5, 0.52] (mean: 0.51), and [0.65, 1.07] (mean: 0.82), respectively.

varied-diameter.tar.bz2
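
The diameter defined above, the largest path distance between any two leaves, can be computed with two farthest-node searches on the tree. The sketch below is illustrative only: it assumes the tree is given as a non-empty list of (node, node, branch_length) edges with non-negative branch lengths, and the names `farthest` and `tree_diameter` are hypothetical rather than taken from the archive's scripts.

```python
from collections import defaultdict


def farthest(adj, start):
    """Return (node, distance) for the node farthest from `start`."""
    best_node, best_dist = start, 0.0
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if dist > best_dist:
            best_node, best_dist = node, dist
        for nbr, length in adj[node]:
            if nbr != parent:  # a tree has no cycles, so skipping the parent suffices
                stack.append((nbr, node, dist + length))
    return best_node, best_dist


def tree_diameter(edges):
    """Largest path length in a tree given as (u, v, branch_length) edges."""
    adj = defaultdict(list)
    for u, v, length in edges:
        adj[u].append((v, length))
        adj[v].append((u, length))
    start = next(iter(adj))
    end, _ = farthest(adj, start)     # first sweep: find one end of a longest path
    _, diameter = farthest(adj, end)  # second sweep: measure that path
    return diameter
```

With non-negative branch lengths, the longest path found this way has the same length as the longest leaf-to-leaf path, so the returned value agrees with the definition used for this dataset.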