Phylogenetic double placement of mixed samples
Data files
Nov 17, 2023 version; files total 109.01 GB
- drosophila.tar.gz: 5.70 GB
- gold.tree: 2.04 KB
- lice.tar.gz: 24.30 GB
- README.md: 11.99 KB
- SRR3161912.fastq.bz2: 1.44 GB
- SRR3161913.fastq.bz2: 1.36 GB
- SRR3161914.fastq.bz2: 1.44 GB
- SRR3161915.fastq.bz2: 1.54 GB
- SRR3161916.fastq.bz2: 1.51 GB
- SRR3161917.fastq.bz2: 1.75 GB
- SRR3161918.fastq.bz2: 1.49 GB
- SRR3161919.fastq.bz2: 1.50 GB
- SRR3161920.fastq.bz2: 1.74 GB
- SRR3161921.fastq.bz2: 1.57 GB
- SRR3161922.fastq.bz2: 1.62 GB
- SRR3161923.fastq.bz2: 1.61 GB
- SRR3161924.fastq.bz2: 2.02 GB
- SRR3161925.fastq.bz2: 1.31 GB
- SRR3161926.fastq.bz2: 1.90 GB
- SRR3161927.fastq.bz2: 1.73 GB
- SRR3161928.fastq.bz2: 1.35 GB
- SRR3161929.fastq.bz2: 1.27 GB
- SRR3161930.fastq.bz2: 1.56 GB
- SRR3161931.fastq.bz2: 1.32 GB
- SRR3161932.fastq.bz2: 1.45 GB
- SRR3161933.fastq.bz2: 1.36 GB
- SRR3161934.fastq.bz2: 1.47 GB
- SRR3161935.fastq.bz2: 1.40 GB
- SRR3161936.fastq.bz2: 1.67 GB
- SRR3161937.fastq.bz2: 1.65 GB
- SRR3161938.fastq.bz2: 1.49 GB
- SRR3161939.fastq.bz2: 1.46 GB
- SRR3161940.fastq.bz2: 1.46 GB
- SRR3161941.fastq.bz2: 1.68 GB
- SRR3161942.fastq.bz2: 1.89 GB
- SRR3161943.fastq.bz2: 1.69 GB
- SRR3161944.fastq.bz2: 1.52 GB
- SRR3161945.fastq.bz2: 1.69 GB
- SRR3161946.fastq.bz2: 1.72 GB
- SRR3161947.fastq.bz2: 1.75 GB
- SRR3161948.fastq.bz2: 1.90 GB
- SRR3161949.fastq.bz2: 1.50 GB
- SRR3161950.fastq.bz2: 1.66 GB
- SRR3161951.fastq.bz2: 1.50 GB
- SRR3161952.fastq.bz2: 1.68 GB
- SRR3161953.fastq.bz2: 1.06 GB
- SRR3161954.fastq.bz2: 1.50 GB
- SRR3161955.fastq.bz2: 1.37 GB
- SRR3161956.fastq.bz2: 1.18 GB
- SRR3161957.fastq.bz2: 1.49 GB
- yeast.tar.gz: 7.75 GB
Abstract
Motivation
Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. This simple yet fundamental problem (which we call phylogenetic double-placement) has enjoyed surprisingly little attention in the literature. As genome skimming (low-pass sequencing of genomes at low coverage, precluding assembly) becomes more prevalent, this problem finds wide-ranging applications in areas as varied as biodiversity research, food production and provenance, and evolutionary reconstruction.
Results
We introduce a model that relates distances between a mixed sample and reference species to the distances between constituents and reference species. Our model is based on Jaccard indices computed between each sample represented as k-mer sets. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that decomposes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we are able to solve this optimization problem numerically. We test the resulting method, called MIxed Sample Analysis tool (MISA), on a varied set of simulated and biological datasets. Despite all the assumptions used, the method performs remarkably well in practice.
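To make the notion of a Jaccard index between k-mer sets concrete, the following is a minimal Python sketch. It is not the MISA or Skmer implementation: the function names are illustrative, k-mers are not canonicalized, and the mixture is idealized as the plain union of the k-mer sets of its two constituents.

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq (no canonicalization)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mixture_jaccard(seq_a, seq_b, seq_ref, k):
    """Jaccard index between an idealized mixture of A and B and a reference.

    Assumption: the mixture's k-mer set is simply the union of the
    constituents' k-mer sets (no sequencing error, full coverage).
    """
    mixed = kmer_set(seq_a, k) | kmer_set(seq_b, k)
    return jaccard(mixed, kmer_set(seq_ref, k))
```

With real genome skims, the k-mer sets would come from reads rather than complete sequences, so coverage and sequencing error would also affect the observed index.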
Citation
Balaban, M., & Mirarab, S. (2020). Phylogenetic double placement of mixed samples. Bioinformatics (Oxford, England), 36(1), i335–i343. doi:10.1093/bioinformatics/btaa489
Description of the data and file structure
In all the datasets, files called *results*.csv have the following columns:
- 1st column: query, the query name
- 2nd column: the condition, one of
  - alien: both parents are removed from the reference set
  - partial: one parent is removed from the reference set
  - present: neither parent is removed from the reference set
- 3rd column: the name of the method
- 4th column: either Primary or Secondary, for the two placements; primary is always the one with lower error
- 5th column: placement error in edges
- [optional] 6th column: the k value used
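As an illustration of this column layout, here is a hedged Python sketch that reads such a file and averages the placement error per condition, method, and placement. The column labels (`condition`, `placement`, `error`), the comma delimiter, and the handling of a possible header row are assumptions; check one of the actual files before relying on them.

```python
import csv
from collections import defaultdict

# Labels for the columns described above; the names themselves are illustrative.
COLUMNS = ["query", "condition", "method", "placement", "error", "k"]


def read_results(handle):
    """Parse rows of a *results*.csv file from an open text handle."""
    rows = []
    for rec in csv.reader(handle):
        if not rec:
            continue
        row = dict(zip(COLUMNS, (field.strip() for field in rec)))
        try:
            row["error"] = float(row["error"])
        except (KeyError, ValueError):
            continue  # skip a header line or a malformed row
        rows.append(row)
    return rows


def mean_error(rows):
    """Mean placement error keyed by (condition, method, placement)."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        key = (row["condition"], row["method"], row["placement"])
        totals[key][0] += row["error"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

A typical call would be `mean_error(read_results(open(path, newline="")))`, where `path` points to one of the results files.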
Columbicola (Lice) dataset (simulated mixture)
To evaluate the accuracy of our method on genome skimming data, we use a set of 61 genome skims by Boyd et al. (2017) (PRJNA296666), including 45 known lice species (some represented multiple times) and seven undescribed species. We use randomly subsampled genome skims of 4 Gb. We use BBTools (Bushnell, 2014) to filter subsampled reads for adapters and contaminants and to remove duplicated reads. Then, we create five replicates, each containing 20 organisms sampled from the full dataset at random. For each replicate, we simulate five mixtures with the two constituents, A and B, chosen uniformly at random. We simulate mixtures by simply combining the preprocessed genome skims of the two constituents. The exact coverage of the genome skims is unknown but is estimated by Skmer to range between 4X and 15X.
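The sampling design described above can be sketched in Python as follows. This is an illustration only, not the script used in the study: the function names, the seed, and the assumption that A and B are drawn from the 20 organisms of the replicate are ours. Combining two skims is shown as plain concatenation of uncompressed read files.

```python
import random
import shutil


def draw_design(samples, n_replicates=5, backbone_size=20, n_mixtures=5, seed=0):
    """Draw replicates of organisms and, within each, pairs (A, B) to mix.

    Assumption: A and B are two distinct organisms of the same replicate.
    """
    rng = random.Random(seed)
    design = []
    for _ in range(n_replicates):
        backbone = rng.sample(samples, backbone_size)
        mixtures = [tuple(rng.sample(backbone, 2)) for _ in range(n_mixtures)]
        design.append({"backbone": backbone, "mixtures": mixtures})
    return design


def mix_skims(path_a, path_b, out_path):
    """Simulate a mixed sample by concatenating two read files byte for byte."""
    with open(out_path, "wb") as out:
        for path in (path_a, path_b):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

For example, `draw_design(sample_names)` with the 61 sample names yields 5 replicates of 20 organisms with 5 mixtures each, that is, 25 simulated mixtures in total.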
The following archives are provided:
- Each SRR*.fastq.bz2 file gives the preprocessed genome skim of one lice sample. These are the genome skims of lice used in this study, adapted from Boyd et al. Unlike the original data, these files have been preprocessed using BBTools.
- gold.tree: The reference tree of the samples, used as the gold standard
lice.tar.gz
Once you untar the file, all the files are under the oasis/projects/nsf/uot138/balaban/mixture/ folder.
These files are related to actual leave-out experiments with different values of k (e.g., 21, 27, ..., 31).
Recall that we do 5 replicates of backbone subsampling and, for each replicate, we have 5 replicates of queries.
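Because the archives are large, it can be convenient to extract only a subset of the files. The following Python sketch uses the standard `tarfile` module to extract only members whose names end with a given suffix; the function name and the example suffix are illustrative, and the archive is still read from start to end.

```python
import tarfile


def extract_matching(archive_path, suffix, dest="."):
    """Extract only regular files whose name ends with `suffix`.

    Returns the list of extracted member names.
    """
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:  # streams through the archive once
            if member.isfile() and member.name.endswith(suffix):
                tar.extract(member, path=dest)
                extracted.append(member.name)
    return extracted
```

For instance, `extract_matching("lice.tar.gz", "all_results.csv", dest="lice_results")` would pull out only the summary tables, keeping their full paths under the destination folder.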
Under this folder, we have the following files.
- lice/ktest/all_results.csv: A summary of all placement accuracy results for all methods across all tests with k=21 and k=31
- lice/ktest/additivity_eror.sh: A small tester script used to find additivity error
- lice/ktest/[k]/skmer:
  - diagreport.txt: Error under the various APPLES criteria (FM, etc.)
  - dist.mat: Skmer distance matrix
  - jaccard.txt: similarity matrix according to Jaccard
  - meta_backbone.tree: backbone tree
  - library: includes a CONFIG file giving the skmer configuration. In addition, for each skim, we have:
    - *.dat: the skmer estimation of parameters such as coverage, length, etc.
    - *.hist: repeat spectra
    - *.msh: mash sketch
- lice/ktest/[k]/exp-data/[sample replicate]:
  - species.txt: list of species included in this sample replicate
  - diameter.txt: Diameter of the tree
  - true.tree: true tree in newick format
  - meta_backbone.tree: true tree with branch lengths recomputed
  - queries/[query rep]:
    - things.txt: name of query genomes (mixture) in this replicate
    - dist.mat and/or dist.txt: the distance from the mixture to each reference
    - Three folders:
      - alien: both parents are removed from the reference set
      - partial: one parent is removed from the reference set
      - present: neither parent is removed from the reference set

      Each of these folders includes these files:
      - results_[method].csv: placement error of different methods
      - [method].nwk: results of all methods in newick format
      - backbone.tree: the backbone tree used in analyses
      - baseline.*: the best
- lice/scripts/: helper scripts used to run analyses, packaged for future reference
  - extract_error_from_jplace.py: given a jplace file, extracts the error field output by APPLES
  - misa-lice.sh: runs MISA on lice
  - j2d.py: translates Jaccard to phylogenetic distance
  - push_backbones.sh: creates the backbone for each replicate
  - reference-skim-parallel.sh: runs skmer to create the skmer libraries
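When collecting per-condition results across the lice experiments, it may help to recover the experimental factors from each file path. The sketch below assumes the layout lice/ktest/[k]/exp-data/[sample replicate]/queries/[query rep]/[condition]/results_[method].csv, which is our reading of the listing above; the field names in the pattern are illustrative, so verify the layout against the extracted archive.

```python
import re

# Assumed path layout; the group names are illustrative labels.
LICE_RESULT = re.compile(
    r"lice/ktest/(?P<k>\d+)/exp-data/(?P<sample_rep>[^/]+)/queries/"
    r"(?P<query_rep>[^/]+)/(?P<cond>alien|partial|present)/"
    r"results_(?P<method>[^/]+)\.csv$"
)


def parse_lice_result_path(path):
    """Return a dict of factors encoded in a results path, or None if no match."""
    match = LICE_RESULT.search(str(path))
    return match.groupdict() if match else None
```

Paths that do not follow the assumed layout, such as the top-level all_results.csv, return None and can be skipped.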
 
Yeast dataset (real hybridization)
In addition to simulated mixtures, we create a dataset of real hybrid yeast species. We select representative genomes for eight non-hybrid Saccharomyces species with assemblies available on NCBI. We also create a second, extended dataset that includes seven more species from the genera Naumovozyma, Nakaseomyces, and Candida (see Supplementary Table S2 for accession numbers). We curate four assembled and two unassembled strains of hybrid yeast species, some of which were previously analyzed by Langdon et al. (2018). The unassembled hybrid strains muri (Krogerus et al., 2018) and YMD3265 are subsampled from NCBI SRA to 100 Mb and filtered for contaminants in the same fashion as the previous dataset. We do not include strains such as Saccharomyces bayanus, which is conjectured to be a hybrid of three species (Libkind et al., 2011). For each hybrid species, the hypothesized ancestors are known from the literature (Krogerus et al., 2018; Langdon et al., 2018, 2019) and NCBI Taxonomy annotation, and we use these postulated ancestors as the ground truth.
The archive yeast.tar.gz is provided. It contains all intermediate and output files of the experiments, scripts, and Skmer sketches for all k=[21,23,25,27,29,31].
All paths are prefixed by oasis/projects/nsf/uot138/balaban/mixture/yeast/. The paths listed below use the following placeholders:
- [k]: the k-mer size, one of 21, 23, 25, 27, 29, or 31
- [query]: one of the genomes
- [cond]: present (both ancestors present), partial (one ancestor present), or alien (no ancestor present)
- [db]: base for the smaller dataset of relevant yeast, or extended for the larger dataset with all the yeasts
- [method]: one of the methods, APPLES, MISA, or TOP2
- [data type]: assembly for assemblies or genome-skim for genome skims
The files provided include:
- ktest: the experiment directories, one per parameter setting:
  - ktest/all_results.csv: the errors of methods across all the analyses
  - ktest/meta_backbone.tree: please ignore this file. Backbone trees specific to each k are given under the skmer library.
  - ktest/[k]/exp-data/all_results.csv: the error values for this particular value of k
  - ktest/[k]/exp-data/[data type]/[query]/dist.*.mat or dist.*txt: gives the full distance matrix from this query to all references
  - ktest/[k]/exp-data/[data type]/[query]/[query].fna or [query].fastq: the genome in fna format or the genome skim in fastq format
  - ktest/[k]/exp-data/[data type]/[query]/things.txt: name of query genomes (mixture) in this replicate
  - ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/results_[method].csv: gives the error for a condition
  - ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/[method].nwk or [method].jplace: gives the actual result of each method in newick or jplace format
  - ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/backbone.tree: the backbone tree after removing queries
  - ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/true.tree: the tree with correct placements marked for queries
  - ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/log.out or log.err: the log file giving details of each run
  - ktest/[k]/skmer: has the skmer library used in the analyses, including
    - the .dat files (library info),
    - library config (CONFIG),
    - the mash sketches (.msh),
    - reference trees (meta_backbone.tree),
    - distance matrices (ref-dist-mat.txt)
- genomes: Yeast genome assemblies.
  - For each genome, we give the .fna file
  - For hybrid genomes (named in genomes/hybrids.txt) we also give the names of the ancestors (genomes/[genome]/things.txt). For non-hybrids (genomes/nonhybrids.txt) this file is not meaningful.
  - genomes/nonhybrids_and_outlier lists non-hybrids including the outgroups (the extended set mentioned above), which are species not in the Saccharomyces genus.
  - The genomes are also available at doi: 10.5281/zenodo.6974987
- SRA-subsample: Genomes created by subsampling SRAs for the genome assemblies; here,
  - dist-[query].txt gives the distance matrix obtained
  - misa.jplace: the MISA results in jplace format
  - log.out, [query].log and log.err: log files of the experiment
  - fastq and meta_backbone.tree files give the input data (subsampled reads and the backbone tree)
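Files in jplace format are JSON documents, so they can be inspected with the Python standard library. The sketch below relies only on the generic jplace keys (`fields`, `placements`, `p`, and `n` or `nm`); it does not assume which fields MISA or APPLES write, nor how the two placements of a mixed query are encoded, so inspect the `fields` entry of an actual file first.

```python
import json


def read_placements(handle):
    """Return, for each placement record, its query names and placement rows.

    Each placement row is mapped to a dict keyed by the file's own `fields`.
    """
    doc = json.load(handle)
    fields = doc["fields"]
    records = []
    for placement in doc["placements"]:
        names = placement.get("n") or [nm[0] for nm in placement.get("nm", [])]
        if isinstance(names, str):  # some writers store a single name as a string
            names = [names]
        rows = [dict(zip(fields, values)) for values in placement["p"]]
        records.append({"names": names, "placements": rows})
    return records
```

A call such as `read_placements(open("misa.jplace"))` would then list the placement edges and branch lengths reported for each query.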
 
Drosophila dataset (simulated mixture)
We use a set of 14 Drosophila assemblies published by Miller et al. (2018) (Supplementary Table S1) to evaluate the accuracy of our approach in an ideal setting where the mixed sample consists of the concatenation of the assemblies. We test 20 simulated mixtures of randomly chosen species in three scenarios where none, one, or both of the constituents are present in the reference library.
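The three scenarios can be illustrated with a short Python sketch that derives the reference set for each condition from the full reference list and the two constituents of a mixture. The function name and the choice of which constituent is removed in the partial condition are assumptions made for illustration.

```python
def reference_sets(references, a, b):
    """Reference libraries for the present, partial, and alien conditions.

    present: neither constituent is removed
    partial: one constituent is removed (here, arbitrarily, `a`)
    alien:   both constituents are removed
    """
    refs = list(references)
    return {
        "present": refs,
        "partial": [r for r in refs if r != a],
        "alien": [r for r in refs if r not in (a, b)],
    }
```

With 14 reference assemblies, this gives libraries of 14, 13, and 12 species for the present, partial, and alien conditions, respectively.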
The archive drosophila.tar.gz is provided; all files are under oasis/projects/nsf/uot138/balaban/mixture/drosophila.
It contains all intermediate and output files of the experiments, scripts, and Skmer sketches for all combinations of the following parameters:
- [k]: one of 21, 23, 25, 27, 29, or 31
- [cond]: present (both ancestors present), partial (one ancestor present), or alien (no ancestor present)
- [method]: one of the methods, APPLES, MISA, or TOP2; note that baseline also represents APPLES
The archive has the following subdirectories:
- assembly: Drosophila genomes published by Miller et al. (2018).
- topo.tree: The gold standard phylogeny for Drosophila (i.e., the backbone tree).
- ktest: the experiment directories, one per parameter setting:
  - ktest/all_results.csv: the errors of methods across all the analyses
  - ktest/dist.mat: please ignore. The distance matrices for each analysis are given below.
  - ktest/[k]/exp-data/all_results.csv: the error values of the analyses for this particular k
  - ktest/[k]/exp-data/[query]/all_results.csv: error values pertaining to this query
  - ktest/[k]/exp-data/[query]/species.txt: list of all the species, same order
  - ktest/[k]/exp-data/[query]/dist.*.mat or dist.*txt: gives the full distance matrix from this query to all references
  - ktest/[k]/exp-data/[query]/things.txt: name of query genomes (mixture) in this replicate
  - ktest/[k]/exp-data/[query]/[cond]/results_[method].csv: gives the error for a condition
  - ktest/[k]/exp-data/[query]/[cond]/[method].nwk or [method].jplace: gives the actual result of each method in newick or jplace format
  - ktest/[k]/exp-data/[query]/[cond]/backbone.tree: the backbone tree after removing queries
  - ktest/[k]/exp-data/[query]/[cond]/log.out or log.err: the log file giving details of each run
  - ktest/[k]/skmer: has the skmer library used in the analyses, including:
    - the library info such as coverage (.dat),
    - the mash sketches (.msh),
    - library config (CONFIG),
    - reference trees (meta_backbone.tree),
    - the FASTME log file (dist.mat_fastme_stat.txt),
    - distance matrices (ref-dist-mat.txt)
Sharing/Access information
See more on:
