Alignment-free methods for polyploid genomes: quick and reliable genetic distance estimation
Cite this dataset
Van Wallendael, Acer (2021). Alignment-free methods for polyploid genomes: quick and reliable genetic distance estimation [Dataset]. Dryad. https://doi.org/10.5061/dryad.fqz612jss
Polyploid genomes pose several inherent challenges to population genetic analyses. While alignment-based methods are fundamentally limited in their applicability to polyploids, alignment-free methods bypass most of these limits. We investigated the use of Mash, a k-mer analysis tool that uses the MinHash method to reduce complexity in large genomic datasets, for basic population genetic analyses of polyploid sequences. We measured the degree to which Mash correctly estimated pairwise genetic distance in simulated haploid and polyploid short-read sequences with various levels of missing data. Mash-based estimates of genetic distance were comparable to alignment-based estimates and were less affected by missing data. We also used Mash to analyze publicly available short-read data for three polyploid species and one diploid species, then compared the Mash results to published results. For both simulated and real data, Mash accurately estimated pairwise genetic differences for polyploids as well as diploids, and ran as much as 476 times faster than alignment-based methods, though we found that Mash genetic distance estimates could be biased by per-sample read depth. Mash may be a particularly useful addition to the toolkit of polyploid geneticists for rapid confirmation of alignment-based results and for basic population genetics in reference-free systems or those with only poor-quality sequence data available.
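To illustrate the idea behind a MinHash-based distance, the following is a minimal Python sketch, not Mash's actual implementation. It builds a bottom-s sketch from the k-mers of a sequence, estimates the Jaccard index between two sketches, and converts that estimate j to a distance with the Mash distance formula D = -(1/k) ln(2j / (1 + j)). The function names, the choice of hash function, and the default values k=21 and s=1000 are illustrative assumptions. Unlike Mash, this sketch does not use canonical (strand-independent) k-mers.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-mers in seq (strand-specific, for simplicity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=21, s=1000):
    """Bottom-s MinHash sketch: the s smallest 64-bit hashes of the k-mers."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(km.encode(), digest_size=8).digest(), "big"
        )
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:s]


def jaccard_estimate(sketch_a, sketch_b, s=1000):
    """Estimate the Jaccard index from two bottom-s sketches.

    The s smallest hashes of the merged sketches form a sample of the
    union; the fraction of that sample present in both sketches estimates
    the Jaccard index of the underlying k-mer sets.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k=21):
    """Convert a Jaccard estimate j into a Mash-style distance."""
    if j <= 0.0:
        return 1.0  # no shared k-mers: report the maximum distance
    if j >= 1.0:
        return 0.0  # identical sketches
    return -math.log(2.0 * j / (1.0 + j)) / k
```

In use, each sample's reads would be sketched once, and all pairwise distances would then be computed from the sketches alone, which is where most of the speed advantage over alignment comes from.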
Simulated reads for haploid and polyploid samples for genetic distance estimation.
We simulated phylogenetic trees using toytree (Eaton, 2020) in Python 3.7 for 50 individuals using a tree height of 1e6, then simulated SNP loci following these trees by generating sequences in ipcoal (McKenzie & Eaton, 2020). For haploid data, we simulated 1000 loci of 100 bp each for 50 individuals, diverging with an effective population size of 1e5 and a recombination rate of 1e-9. We simulated polyploid data in an allotetraploid scenario, wherein an ancestrally diverged lineage hybridizes, then diversifies into population groups. To accomplish this, we built a phylogenetic tree with a single deep node (700,000 generations) and two symmetrical trees with shallower nodes (Figure 1), which represent the mean locus tree after polyploidization. We simulated 1000 loci for each of 30 individuals using the parameters listed above, then combined loci from symmetrical branches to create homeologous loci. This simulation method does not account for incomplete lineage sorting, wherein individual locus trees do not match the true population genetic history, but it effectively simulates the problem of similar homeologous loci that can confound alignment-based methods. To simulate missing data, we used custom scripts in R to randomly remove sequence data for 5-75% of reads. We compared Mash and alignment-based analyses for each level of missing data.
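The missing-data step can be sketched as follows. This is a hypothetical Python illustration, not the authors' R script (the actual scripts are in the repository linked in this record). It assumes that "removing sequence data" means dropping whole reads from a FASTA file; the function names and the seed parameter are invented for the example.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_reads(records, fraction, seed=0):
    """Randomly remove the given fraction of reads, keeping original order."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so each missing-data level is reproducible
    n_keep = len(records) - round(len(records) * fraction)
    keep_idx = sorted(rng.sample(range(len(records)), n_keep))
    return [records[i] for i in keep_idx]
```

Calling drop_reads on the same records with fractions from 0.05 to 0.75 would produce one reduced read set per missing-data level, each of which could then be passed to both the Mash and the alignment-based analyses.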
Files are in FASTA format. Associated processing scripts are available at https://github.com/avanwallendael/mash_sim.