Phylogenetic double placement of mixed samples
Cite this dataset
Balaban, Metin; Mirarab, Siavash (2023). Phylogenetic double placement of mixed samples [Dataset]. Dryad. https://doi.org/10.6076/D1QW25
Abstract
Motivation
Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. This simple yet fundamental problem (which we call phylogenetic double-placement) has enjoyed surprisingly little attention in the literature. As genome skimming (low-pass sequencing of genomes at low coverage, precluding assembly) becomes more prevalent, this problem finds wide-ranging applications in areas as varied as biodiversity research, food production and provenance, and evolutionary reconstruction.
Results
We introduce a model that relates distances between a mixed sample and reference species to the distances between constituents and reference species. Our model is based on Jaccard indices computed between each sample represented as k-mer sets. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that decomposes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we are able to solve this optimization problem numerically. We test the resulting method, called MIxed Sample Analysis tool (MISA), on a varied set of simulated and biological datasets. Despite all the assumptions used, the method performs remarkably well in practice.
README: Data from: Phylogenetic double placement of mixed samples
Citation
Balaban, M., & Mirarab, S. (2020). Phylogenetic double placement of mixed samples. Bioinformatics (Oxford, England), 36(1), i335–i343. doi:10.1093/bioinformatics/btaa489
Description of the data and file structure
In all the datasets, files called *results*.csv have the following columns:
- 1st column: query gives the query name
- 2nd column: is one of
  - alien is when both parents are removed from ref
  - partial is when one parent is removed from ref
  - present is when neither parent is removed from ref
- 3rd column: the name of the method
- 4th column: Either Primary or Secondary, for the two placements; primary is always the one with lower error
- 5th column: Placement error in edges
- [optional] 6th column: the k value used
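The column layout above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset's own scripts: it assumes the files are comma-separated and follow the column order listed above, and it skips any row whose error field is not numeric (for example, a header row, if one exists). The function name mean_error_by_method and the sample rows are made up for illustration only.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_error_by_method(rows):
    """Average placement error per (condition, method, placement) group.

    Each row is expected to hold: query, condition, method,
    Primary/Secondary, error, and optionally the k value.
    """
    groups = defaultdict(list)
    for row in rows:
        if len(row) < 5:
            continue  # not a data row
        _query, cond, method, placement, error = row[:5]
        try:
            err = float(error)
        except ValueError:
            continue  # non-numeric error field, e.g. a header row
        groups[(cond, method, placement)].append(err)
    return {key: mean(vals) for key, vals in groups.items()}


# Made-up rows for illustration; these are NOT values from the dataset.
sample = io.StringIO(
    "q1,present,MISA,Primary,0\n"
    "q1,present,MISA,Secondary,1\n"
    "q2,present,MISA,Primary,2\n"
)
summary = mean_error_by_method(csv.reader(sample))
```

To use it on a real file, pass csv.reader(open(path, newline="")) instead of the in-memory sample.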
Columbicola (Lice) dataset (simulated mixture)
To evaluate the accuracy of our method on genome skimming data, we use a set of 61 genome skims by Boyd et al. (2017) (PRJNA296666), including 45 known lice species (some represented multiple times) and seven undescribed species. We use randomly subsampled genome skims of 4 Gb. We use BBTools (Bushnell, 2014) to filter subsampled reads for adapters and contaminants and to remove duplicated reads. Then, we create five replicates, each containing 20 organisms sampled from the full dataset at random. For each replicate, we simulate five mixtures, with the two constituents A and B chosen uniformly at random. We simulate mixtures by simply combining preprocessed genome skims of the two constituents. The exact coverage of the genome skims is unknown but is estimated by Skmer to range between 4X and 15X.
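One way to read "simply combining preprocessed genome skims" is plain concatenation of the two read files. The sketch below illustrates that reading for bz2-compressed FASTQ files; it is an assumption about the procedure, not the authors' script, and the function name combine_skims is hypothetical. It also assumes each input file ends with a newline, so that records from the two files do not run together.

```python
import bz2
import shutil


def combine_skims(path_a, path_b, out_path):
    """Write the reads of two bz2-compressed FASTQ files into one mixed file.

    The inputs are decompressed and streamed in order (A, then B) into a
    single bz2-compressed output, without loading them fully into memory.
    """
    with bz2.open(out_path, "wb") as out:
        for path in (path_a, path_b):
            with bz2.open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

Since k-mer based tools such as Skmer treat the reads as an unordered collection, the order in which the two constituents are written should not matter for this kind of simulation.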
The following archives are provided:
- Each SRR*.fastq.bz2 gives the preprocessed genome skim of one lice sample. These are the genome skims of lice used in this study, adapted from Boyd et al. In contrast to the original genomes, these files are preprocessed using BBTools.
- gold.tree: The reference tree of the samples, used as the gold standard
- lice.tar.gz: Once you untar the file, all the files are under the oasis/projects/nsf/uot138/balaban/mixture/ folder. These files are related to actual leave-out experiments with different values of k (e.g., 21, 27, ..., 31). Recall that for each value of k, we do 5 replicates of subsampling of the backbone, and for each of those, we have 5 replicates of queries. Under this folder, we have the following files.
- lice/ktest/all_results.csv: A summary of all placement accuracy results for all methods across all tests with k=21 and k=31
- lice/ktest/additivity_eror.sh: A small tester script used to find additivity error
- lice/ktest/[k]/skmer:
  - diagreport.txt: Error using APPLES various criteria (FM, etc.)
  - dist.mat: Skmer distance matrix
  - jaccard.txt: similarity matrix according to Jaccard
  - meta_backbone.tree: backbone tree
  - library: includes a CONFIG file giving skmer configuration. In addition, for each skim, we have:
    - *.dat: the skmer estimation of parameters such as coverage, length, etc.
    - *.hist: repeat spectra
    - *.msh: mash sketch
- lice/ktest/[k]/exp-data/[sample replicate]:
  - species.txt: list of species included in this sample replicate
  - diameter.txt: Diameter of the tree
  - true.tree: true tree in newick format
  - meta_backbone.tree: true tree with branch lengths recomputed
  - queries/[query rep]:
    - things.txt: name of query genomes (mixture) in this replicate
    - dist.mat and/or dist.txt: the distance from the mixture to each reference
    - Three folders:
      - alien is when both parents are removed from ref
      - partial is when one parent is removed from ref
      - present is when neither parent is removed from ref
    - Each of these folders includes these files:
      - results_[method].csv: placement error of different methods
      - [method].nwk: results of all methods in newick format
      - backbone.tree: the backbone tree used in analyses
      - baseline.*: the best
- lice/scripts/: helper scripts used to run analyses, packaged for future reference
  - extract_error_from_jplace.py: given a jplace file, extracts the error field output by APPLES
  - misa-lice.sh: runs MISA on the lice dataset
  - j2d.py: translates Jaccard to phylogenetic distance
  - push_backbones.sh: creates the backbone for each replicate
  - reference-skim-parallel.sh: runs skmer to create the skmer libraries
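For orientation, a Jaccard index between two k-mer sets is commonly translated to a genomic distance using the relation D = 1 - (2J / (1 + J))^(1/k), which applies when coverage is high and sequencing error is ignored; a Jukes-Cantor correction can then turn D into a phylogenetic distance. The sketch below shows this textbook relation only. It is not taken from j2d.py, which may differ (for example, Skmer additionally corrects for coverage and sequencing error), and both function names are hypothetical.

```python
import math


def jaccard_to_distance(jaccard, k):
    """Estimate per-site genomic distance from a k-mer Jaccard index.

    Uses D = 1 - (2J / (1 + J)) ** (1 / k), assuming complete k-mer sets
    (high coverage) and no sequencing error.
    """
    if not 0.0 < jaccard <= 1.0:
        raise ValueError("Jaccard index must be in (0, 1]")
    return 1.0 - (2.0 * jaccard / (1.0 + jaccard)) ** (1.0 / k)


def jukes_cantor(d):
    """Apply the Jukes-Cantor correction to a raw (Hamming-like) distance."""
    if not 0.0 <= d < 0.75:
        raise ValueError("distance must be in [0, 0.75) for Jukes-Cantor")
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)
```

Note that identical k-mer sets (J = 1) give a distance of zero, and the distance grows as the Jaccard index shrinks.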
Yeast dataset (real hybridization)
In addition to simulated mixtures, we create a dataset of real hybrid yeast species. We select representative genomes for eight non-hybrid Saccharomyces species with assemblies available on NCBI. We also create a second, extended dataset that includes seven more species from the genera Naumovozyma, Nakaseomyces, and Candida (see Supplementary Table S2 for accession numbers). We curate four assembled and two unassembled strains of hybrid yeast species, some of which were previously analyzed by Langdon et al. (2018). The unassembled hybrid strains muri (Krogerus et al., 2018) and YMD3265 are subsampled from NCBI SRA to 100Mb and filtered for contaminants in the same fashion as the previous dataset. We do not include strains such as Saccharomyces bayanus, which are conjectured to be hybrids of three species (Libkind et al., 2011). For each hybrid species, the hypothesized ancestors are known from the literature (Krogerus et al., 2018; Langdon et al., 2018, 2019) and NCBI Taxonomy annotation, and we use these postulated ancestors as the ground truth.
The archive yeast.tar.gz is provided. It contains all experiment intermediate and output files, scripts, and Skmer sketches for all k=[21,23,25,27,29,31]. All paths below are prefixed by oasis/projects/nsf/uot138/balaban/mixture/yeast/ and use the following placeholders:
- [k] being the k-mer size, k=[21,23,25,27,29,31]
- [query] being one of the genomes
- [cond] being either present (both ancestors present) or partial (one ancestor present) or alien (no ancestor present)
- [db] being either base for the smaller dataset of relevant yeast, or extended for the larger dataset with all the yeasts
- [method] being one of the methods, APPLES, MISA, or TOP2
- [data type] being one of assembly for assemblies and genome-skim for genome skims
The files provided include:
- ktest: Each experiment directory for parameters:
  - ktest/all_results.csv: the errors of methods across all the analyses
  - ktest/meta_backbone.tree: please ignore this file. Backbone trees specific to each k are given under skmer library.
  - ktest/[k]/exp-data/all_results.csv: the error values for this particular value of k
  - ktest/[k]/exp-data/[data type]/[query]/dist.*.mat or dist.*txt: gives the full distance matrix from this query to all references
  - ktest/[k]/exp-data/[data type]/[query]/[query].fna or [query].fastq: The genome in fna or genome skims in fastq formats
  - ktest/[k]/exp-data/[data type]/[query]/things.txt: name of query genomes (mixture) in this replicate
  - ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/results_[method].csv: gives the error for a condition
  - ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/[method].nwk or [method].jplace: gives the actual result of each method in newick or jplace formats
  - ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/backbone.tree: the backbone tree after removing queries
  - ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/true.tree: the tree with correct placements marked for queries
  - ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/log.out or log.err: the log file giving details of each run
  - ktest/[k]/skmer: Has the skmer library used in the analyses, including
    - the .dat files (library info),
    - library config (CONFIG),
    - the mash sketches (.msh),
    - reference trees (meta_backbone.tree),
    - distance matrices (ref-dist-mat.txt)
- genomes: Yeast genome assemblies.
  - For each genome, we give the .fna file
  - For hybrid genomes (named in genomes/hybrids.txt) we also give the names of the ancestors (genomes/[genome]/things.txt). For non-hybrids (genomes/nonhybrids.txt) this is meaningless.
  - genomes/nonhybrids_and_outlier lists non-hybrids including the outgroups (the extended set mentioned above), which are species not in the Saccharomyces genus.
  - The genomes are also available at doi: 10.5281/zenodo.6974987
- SRA-subsample: Genome created by subsampling SRAs for the genome assemblies; here,
  - dist-[query].txt gives the distance matrix obtained
  - misa.jplace: the MISA results in jplace format
  - log.out, [query].log and log.err: log files of the experiment
  - fastq and meta_backbone.tree files give the input data (subsampled reads and the backbone tree)
Drosophila dataset (simulated mixture)
We use a set of 14 Drosophila assemblies published by Miller et al. (2018) (Supplementary Table S1) to evaluate the accuracy of our approach in an ideal setting where the mixed sample consists of the concatenation of the assemblies. We test 20 simulated mixtures of randomly chosen species in three scenarios where none, one, or both of the constituents are present in the reference library.
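The three scenarios differ only in which constituents are left in the reference library. The sketch below shows one way to derive the three reference sets for a mixture of species a and b; it is an illustration under stated assumptions, not the authors' code. In particular, which constituent is removed in the partial scenario is an arbitrary choice here (the first one), and the function name reference_sets is hypothetical.

```python
def reference_sets(species, a, b):
    """Build the reference species list for each leave-out scenario.

    present: both constituents stay in the reference set
    partial: one constituent is removed (here, arbitrarily, a)
    alien:   both constituents are removed
    """
    if a == b or a not in species or b not in species:
        raise ValueError("a and b must be two distinct reference species")
    return {
        "present": list(species),
        "partial": [s for s in species if s != a],
        "alien": [s for s in species if s not in (a, b)],
    }
```

The scenario names match the present, partial, and alien conditions used in the results files.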
The following files are provided under oasis/projects/nsf/uot138/balaban/mixture/drosophila in drosophila.tar.gz: all experiment intermediate and output files, scripts, and Skmer sketches for all k. In the paths below:
- k is one of 21, 23, 25, 27, 29, or 31
- cond being either present (both ancestors present) or partial (one ancestor present) or alien (no ancestor present)
- method being one of the methods, APPLES, MISA, or TOP2; note that baseline also represents APPLES
The archive has the following subdirectories:
- assembly: Drosophila genomes published by Miller et al. (2018).
- topo.tree: The gold standard phylogeny for Drosophila (i.e., the backbone tree).
- ktest: Each experiment directory for parameters:
  - ktest/all_results.csv: the errors of methods across all the analyses
  - ktest/dist.mat: please ignore. The distance matrices for each analysis are given below.
  - ktest/[k]/exp-data/all_results.csv: the error values of the analyses for this particular k
  - ktest/[k]/exp-data/[query]/all_results.csv: error values pertaining to this query
  - ktest/[k]/exp-data/[query]/species.txt: list of all the species, same order
  - ktest/[k]/exp-data/[query]/dist.*.mat or dist.*txt: gives the full distance matrix from this query to all references
  - ktest/[k]/exp-data/[query]/things.txt: name of query genomes (mixture) in this replicate
  - ktest/[k]/exp-data/[query]/[cond]/results_[method].csv: gives the error for a condition
  - ktest/[k]/exp-data/[query]/[cond]/[method].nwk or [method].jplace: gives the actual result of each method in newick or jplace formats
  - ktest/[k]/exp-data/[query]/[cond]/backbone.tree: the backbone tree after removing queries
  - ktest/[k]/exp-data/[query]/[cond]/log.out or log.err: the log file giving details of each run
  - ktest/[k]/skmer: Has the skmer library used in the analyses, including:
    - the library info such as coverage (.dat),
    - the mash sketches (.msh),
    - library config (CONFIG),
    - reference trees (meta_backbone.tree),
    - the FASTME log file (dist.mat_fastme_stat.txt),
    - distance matrices (ref-dist-mat.txt)
Sharing/Access information
See more on:
Funding
National Science Foundation, Award: NSF-1815485
National Science Foundation, Award: NSF-1845967