Phylogenetic double placement of mixed samples

Published Nov 17, 2023 on Dryad. https://doi.org/10.6076/D1QW25

Abstract

Motivation

Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. This simple yet fundamental problem (which we call phylogenetic double-placement) has enjoyed surprisingly little attention in the literature. As genome skimming (low-pass sequencing of genomes at low coverage, precluding assembly) becomes more prevalent, this problem finds wide-ranging applications in areas as varied as biodiversity research, food production and provenance, and evolutionary reconstruction.

Results

We introduce a model that relates distances between a mixed sample and reference species to the distances between constituents and reference species. Our model is based on Jaccard indices computed between each sample represented as k-mer sets. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that decomposes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we are able to solve this optimization problem numerically. We test the resulting method, called MIxed Sample Analysis tool (MISA), on a varied set of simulated and biological datasets. Despite all the assumptions used, the method performs remarkably well in practice.

Citation

Balaban, M., & Mirarab, S. (2020). Phylogenetic double placement of mixed samples. Bioinformatics (Oxford, England), 36(1), i335–i343. doi:10.1093/bioinformatics/btaa489

Description of the data and file structure

In all the datasets, files called *results*.csv have the following columns:

1st column: query gives the query name,
2nd column: the condition, which is one of
- alien is when both parents are removed from ref
- partial is when one parent is removed from ref
- present is when neither parent is removed from ref
3rd column: the name of the method
4th column: Either Primary or Secondary, for the two placements; primary is always the one with lower error
5th column: Placement error in edges
[optional] 6th column: the k value used
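
As a minimal sketch of how these columns might be consumed, the Python snippet below averages the placement error per condition and method. It assumes comma-separated rows and tolerates a header line by skipping any row whose fifth field is not numeric. The column names (query, cond, method, rank, error, k) are hypothetical labels chosen here for illustration, and the example rows are made up.

```python
import csv
import io
from collections import defaultdict

# Hypothetical labels for the columns described above (assumption: not taken from the files).
COLUMNS = ["query", "cond", "method", "rank", "error", "k"]

def mean_error(handle):
    """Return {(cond, method): mean placement error} from an open results CSV."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.reader(handle):
        if not row:
            continue
        record = dict(zip(COLUMNS, (field.strip() for field in row)))
        try:
            error = float(record["error"])
        except (KeyError, ValueError):
            continue  # skip a header line or a malformed row
        key = (record["cond"], record["method"])
        sums[key] += error
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Made-up rows for illustration only; they are not taken from the dataset.
example = io.StringIO(
    "q1,alien,MISA,Primary,1\n"
    "q1,alien,MISA,Secondary,3\n"
    "q1,present,APPLES,Primary,0\n"
)
summary = mean_error(example)
```

To use it on a real file, pass an open file handle (for example, one opened on an all_results.csv) instead of the in-memory example.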

Columbicola (Lice) dataset (simulated mixture)

To evaluate the accuracy of our method on genome skimming data, we use a set of 61 genome skims by Boyd et al. (2017) (PRJNA296666), including 45 known lice species (some represented multiple times) and seven undescribed species. We use randomly subsampled genome skims of 4 Gb. We use BBTools (Bushnell, 2014) to filter subsampled reads for adapters and contaminants and remove duplicated reads. Then, we create five replicates, each containing 20 organisms sampled from the full dataset at random. For each replicate, we simulate five mixtures with A and B chosen uniformly at random. We simulate mixtures by simply combining preprocessed genome skims of the two constituents. The exact coverage of the genome skims is unknown but is estimated to range between 4X and 15X by Skmer.
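
The exact commands used to combine the skims are not given here. As a minimal sketch, assuming that "combining" means writing the reads of one constituent after the reads of the other, a mixture could be built from two of the bz2-compressed FASTQ files as follows. The function name and any file names passed to it are hypothetical.

```python
import bz2
import shutil

def mix_skims(path_a, path_b, out_path):
    """Write the reads of two bz2-compressed FASTQ files, one after the other, to out_path.

    Assumes each input ends with a newline, so that records stay on separate lines.
    """
    with bz2.open(out_path, "wb") as out:
        for path in (path_a, path_b):
            with bz2.open(path, "rb") as reads:
                shutil.copyfileobj(reads, out)

# Example call with hypothetical file names:
# mix_skims("constituent_A.fastq.bz2", "constituent_B.fastq.bz2", "mixture.fastq.bz2")
```

Streaming with shutil.copyfileobj avoids loading a full 4 Gb skim into memory.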

The following archives are provided:

Each SRR*.fastq.bz2 gives the preprocessed genome skim of one lice sample. These are the genome skims of lice used in this study, adapted from Boyd et al. In contrast to the original genomes, these files are preprocessed using BBTools.
gold.tree: The reference tree of the samples, used as the gold standard

`lice.tar.gz`

Once you untar the file, all the files are under the oasis/projects/nsf/uot138/balaban/mixture/ folder.
These files come from the leave-out experiments run with different values of k (e.g., 21, 27, …, 31).
Recall that for each value of k, we create 5 replicate subsamples of the backbone and, for each of these, 5 replicate queries.
Under this folder, we have the following files.

lice/ktest/all_results.csv: A summary of all placement accuracy results for all methods across all tests with k=21 and k=31
lice/ktest/additivity_eror.sh: A small tester script used to find additivity error
lice/ktest/[k]/skmer:
- diagreport.txt: Error under the various APPLES criteria (FM, etc.)
- dist.mat: Skmer distance matrix
- jaccard.txt: Jaccard similarity matrix
- meta_backbone.tree: backbone tree
- library: includes a CONFIG file giving the Skmer configuration. In addition, for each skim, we have:
  - *.dat: the Skmer estimates of parameters such as coverage, length, etc.
  - *.hist: repeat spectra
  - *.msh: mash sketch
lice/ktest/[k]/exp-data/[sample replicate]:
- species.txt: list of species included in this sample replicate
- diameter.txt: Diameter of the tree
- true.tree: true tree in newick format
- meta_backbone.tree: true tree with branch lengths recomputed
- queries/[query rep]:
  - things.txt: name of query genomes (mixture) in this replicate
  - dist.mat and/or dist.txt: the distance from the mixture to each reference
  - Three folders:
    1. alien is when both parents are removed from ref
    2. partial is when one parent is removed from ref
    3. present is when neither parent is removed from ref
  - Each of these folders includes these files:
    - results_[method].csv: placement error of different methods
    - [method].nwk: results of all methods in newick format
    - backbone.tree: the backbone tree used in analyses
    - baseline.*: the best
lice/scripts/: helper scripts used to run analyses, packaged for future reference
- extract_error_from_jplace.py: given jplace, extracts the error field output by APPLES
- misa-lice.sh: runs misa on lice
- j2d.py: translates Jaccard to phylogenetic distance
- push_backbones.sh: creates the backbone for each replicate
- reference-skim-parallel.sh: runs Skmer to create the Skmer libraries
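
The contents of j2d.py are not reproduced here. As a rough sketch of what a Jaccard-to-distance translation can look like, the Python function below applies the standard k-mer transformation D = 1 - (2J/(1+J))^(1/k), which is the uncorrected form that Skmer builds on. It is an assumption that j2d.py uses this form; the sketch ignores the coverage and sequencing-error corrections that Skmer applies to genome skims, and the function name is hypothetical.

```python
def jaccard_to_distance(jaccard, k):
    """Convert a Jaccard index between two k-mer sets into a per-site distance.

    Uses D = 1 - (2J / (1 + J)) ** (1 / k), without any correction for
    low coverage or sequencing error (assumption: assembly-like input).
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("Jaccard index must lie in [0, 1]")
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return 1.0 - (2.0 * jaccard / (1.0 + jaccard)) ** (1.0 / k)

# Identical k-mer sets give distance 0; disjoint sets give distance 1.
same = jaccard_to_distance(1.0, 31)
disjoint = jaccard_to_distance(0.0, 31)
```

For a fixed Jaccard index, a larger k yields a smaller distance, which is one reason the experiments are repeated across several values of k.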

Yeast dataset (real hybridization)

In addition to simulated mixtures, we create a dataset of real hybrid yeast species. We select representative genomes for eight non-hybrid Saccharomyces species with assemblies available on NCBI. We also create a second, extended dataset that includes seven more species from the genera Naumovozyma, Nakaseomyces, and Candida (see Supplementary Table S2 for accession numbers). We curate four assembled and two unassembled strains of hybrid yeast species, some of which were previously analyzed by Langdon et al. (2018). Unassembled hybrid strains muri (Krogerus et al., 2018) and YMD3265 are subsampled from NCBI SRA to 100 Mb and filtered for contaminants in the same fashion as the previous dataset. We do not include strains such as Saccharomyces bayanus, which are conjectured to be a hybrid of three species (Libkind et al., 2011). For each hybrid species, the hypothesized ancestors are known from the literature (Krogerus et al., 2018; Langdon et al., 2018, 2019) and NCBI Taxonomy annotation, and we use these postulated ancestors as the ground truth.

The archive yeast.tar.gz is provided. It contains all intermediate and output files of the experiments, scripts, and Skmer sketches for all k=[21,23,25,27,29,31].

All paths below are prefixed by oasis/projects/nsf/uot138/balaban/mixture/yeast/ and use the following placeholders:

[k] being the k-mer size, one of 21,23,25,27,29,31,
[query] being one of the genomes,
[cond] being either present (both ancestors present) or partial (one ancestor present) or alien (no ancestor present),
[db] being either base for the smaller dataset of relevant yeast, or extended for the larger dataset with all the yeasts,
[method] being one of the methods, APPLES, MISA, or TOP2,
[data type] being one of assembly for assemblies and genome-skim for genome skims.

The files provided include:

ktest: one experiment directory per parameter setting:
- ktest/all_results.csv: the errors of methods across all the analyses
- ktest/meta_backbone.tree: please ignore this file. Backbone trees specific to each k are given under skmer library.
- ktest/[k]/exp-data/all_results.csv: the error values for this particular value of k
- ktest/[k]/exp-data/[data type]/[query]/dist.*.mat or dist.*txt: gives the full distance matrix from this query to all references
- ktest/[k]/exp-data/[data type]/[query]/[query].fna or [query].fastq: The genome in fna or genome skims in fastq formats
- ktest/[k]/exp-data/[data type]/[query]/things.txt: name of query genomes (mixture) in this replicate
- ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/results_[method].csv: gives the error for a condition
- ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/[method].nwk or [method].jplace: gives the actual result of each method in newick or jplace formats
- ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/backbone.tree: the backbone tree after removing queries
- ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/true.tree: the tree with correct placements marked for queries
- ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/log.out or log.err: the log file giving details of each run
- ktest/[k]/skmer: Has the skmer library used in the analyses, including:
  - the library info (.dat),
  - library config (CONFIG),
  - the mash sketches (.msh),
  - reference trees (meta_backbone.tree),
  - distance matrices (ref-dist-mat.txt)
genomes: Yeast genome assemblies.
- For each genome, we give the .fna file
- For hybrid genomes (named in genomes/hybrids.txt) we also give the names of the ancestors (genomes/[genome]/things.txt). For non-hybrids (genomes/nonhybrids.txt) this is meaningless.
- genomes/nonhybrids_and_outlier lists non-hybrids including the outgroups (the extended set mentioned above), which are species not in the Saccharomyces genus.
- The genomes are also available at doi: 10.5281/zenodo.6974987
SRA-subsample: Genome skims created by subsampling SRAs for the genome assemblies; here,
- dist-[query].txt gives the distance matrix obtained
- misa.jplace: the MISA results in jplace format
- log.out, [query].log and log.err: log files of the experiment
- fastq and meta_backbone.tree files give the input data (subsampled reads and the backbone tree)
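
Because the experimental parameters are encoded in the directory names, it can be convenient to recover them from a file path. The following Python sketch parses the ktest/[k]/exp-data/[data type]/[query]/[db]/[cond]/results_[method].csv pattern listed above with a regular expression. The function name and the example path are hypothetical, and the exact spelling of method names inside file names is not assumed.

```python
import re

# Mirrors the path pattern described above; group names are illustrative labels.
RESULT_PATH = re.compile(
    r"ktest/(?P<k>\d+)/exp-data/(?P<data_type>[^/]+)/(?P<query>[^/]+)/"
    r"(?P<db>[^/]+)/(?P<cond>alien|partial|present)/results_(?P<method>[^/]+)\.csv$"
)

def parse_result_path(path):
    """Return the parameters encoded in a results path as a dict, or None if it does not match."""
    match = RESULT_PATH.search(path.replace("\\", "/"))
    return match.groupdict() if match else None

# Hypothetical path for illustration; the query name is made up.
parsed = parse_result_path(
    "ktest/31/exp-data/assembly/someQuery/base/partial/results_MISA.csv"
)
```

All values are returned as strings, so k needs an explicit int() conversion before numeric use.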

Drosophila dataset (simulated mixture)

We use a set of 14 Drosophila assemblies published by Miller et al. (2018) (Supplementary Table S1) to evaluate the accuracy of our approach in an ideal setting where the mixed sample consists of the concatenation of the assemblies. We test 20 simulated mixtures of randomly chosen species in three scenarios where none, one, or both of the constituents are present in the reference library.

The archive drosophila.tar.gz is provided; all files are under oasis/projects/nsf/uot138/balaban/mixture/drosophila.
It contains all intermediate and output files of the experiments, scripts, and Skmer sketches for all values of k. The paths below use the following placeholders:

[k] being one of 21,23,25,27,29, or 31,
[cond] being either present (both ancestors present) or partial (one ancestor present) or alien (no ancestor present),
[method] being one of the methods, APPLES, MISA, or TOP2; note that baseline also represents APPLES

The archive has the following subdirectories:

assembly: Drosophila genomes published by Miller et al. (2018).
topo.tree: The gold standard phylogeny for Drosophila (i.e. backbone tree.)
ktest: one experiment directory per parameter setting:
- ktest/all_results.csv: the errors of methods across all the analyses
- ktest/dist.mat: please ignore. The distance matrices for each analysis are given below.
- ktest/[k]/exp-data/all_results.csv: the error values of the analyses for this particular k
- ktest/[k]/exp-data/[query]/all_results.csv: error values pertaining to this query
- ktest/[k]/exp-data/[query]/species.txt: list of all the species, same order
- ktest/[k]/exp-data/[query]/dist.*.mat or dist.*txt: gives the full distance matrix from this query to all references
- ktest/[k]/exp-data/[query]/things.txt: name of query genomes (mixture) in this replicate
- ktest/[k]/exp-data/[query]/[cond]/results_[method].csv: gives the error for a condition
- ktest/[k]/exp-data/[query]/[cond]/[method].nwk or [method].jplace: gives the actual result of each method in newick or jplace formats
- ktest/[k]/exp-data/[query]/[cond]/backbone.tree: the backbone tree after removing queries
- ktest/[k]/exp-data/[query]/[cond]/log.out or log.err: the log file giving details of each run
- ktest/[k]/skmer: Has the skmer library used in the analyses, including:
  - the library info such as coverage (.dat),
  - the mash sketches (.msh),
  - library config (CONFIG),
  - reference trees (meta_backbone.tree),
  - the FASTME log file (dist.mat_fastme_stat.txt),
  - distance matrices (ref-dist-mat.txt)
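
The [method].jplace files are JSON documents, so they can be inspected with the Python standard library. The sketch below assumes the general jplace layout (a top-level "fields" list and a "placements" list whose entries carry names under "n" or "nm" and rows under "p"). How MISA records its two placements within this layout is not described here and should be checked against the files; the example document is made up.

```python
import json

def read_placements(jplace_text):
    """Return a list of (name, {field: value}) pairs, one per placement row."""
    doc = json.loads(jplace_text)
    fields = doc["fields"]
    rows = []
    for placement in doc["placements"]:
        names = placement.get("n")
        if names is None:
            # "nm" holds [name, multiplicity] pairs in the jplace format
            names = [name for name, _ in placement.get("nm", [])]
        elif isinstance(names, str):
            names = [names]
        for name in names:
            for values in placement["p"]:
                rows.append((name, dict(zip(fields, values))))
    return rows

# Made-up jplace content for illustration; it is not taken from the dataset.
example = json.dumps({
    "version": 3,
    "tree": "((A:1{0},B:1{1}):1{2},C:1{3}){4};",
    "fields": ["edge_num", "distal_length", "pendant_length"],
    "placements": [{"n": ["mix"], "p": [[0, 0.1, 0.2], [3, 0.05, 0.3]]}],
})
rows = read_placements(example)
```

Each returned row maps the field names declared in the file to the values of one placement, so the code does not depend on a particular field order.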

Sharing/Access information

See more on:

Phylogenetic double placement of mixed samples

Data files