Data from: A k-mer-based approach for phylogenetic classification of taxa in environmental genomic data
Data files
Jun 21, 2023 version; files: 25.93 MB
Abstract
In the age of genome sequencing, whole-genome data are readily and frequently generated, leading to a wealth of new information that can be used to advance various fields of research. New approaches, such as alignment-free phylogenetic methods that utilize k-mer-based distance scoring, are becoming increasingly popular given their ability to rapidly generate phylogenetic information from whole-genome data. However, these methods have not yet been tested using environmental data, which are often highly fragmented and incomplete. Here we compare the results of one alignment-free approach (which utilizes the D2 statistic) to traditional multi-gene maximum likelihood trees in three algal groups that have high-quality genome data available. In addition, we simulate lower-quality, fragmented genome data using these algae to test the method's robustness to genome quality and completeness. Finally, we apply the alignment-free approach to environmental metagenome-assembled genome data from unclassified Saccharibacteria and Trebouxiophyte algae, and to single-cell amplified data from uncultured marine stramenopiles, to demonstrate its utility with real datasets. We find that in all instances, the alignment-free method produces phylogenies that are comparable to, and often more informative than, those created using the traditional multi-gene approach. The k-mer-based method performs well even when a substantial amount of data is missing, including marker genes traditionally used for tree reconstruction. Our results demonstrate the value of alignment-free approaches for classifying novel, often cryptic or rare, species that may not be culturable or are difficult to access using single-cell methods but fill important gaps in the tree of life.
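To make the idea of k-mer-based distance scoring concrete, the following is a minimal sketch of the basic D2 statistic, which counts shared k-mers between two sequences as the inner product of their k-mer count vectors. It is an illustration only and not the pipeline used to generate the files in this dataset: the function names (`kmer_counts`, `d2`, `d2_distance`), the choice of k, and the cosine-style normalization used to turn the score into a distance are all assumptions made for this example, and the study may rely on a different variant or normalization of D2.

```python
from collections import Counter
import math


def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence (hypothetical helper)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Basic D2 statistic: sum over k-mers of count_a(w) * count_b(w)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for k-mers that are absent from seq_b.
    return sum(n * counts_b[w] for w, n in counts_a.items())


def d2_distance(seq_a, seq_b, k):
    """Assumed normalization: 1 minus the cosine similarity of the count vectors."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    dot = sum(n * counts_b[w] for w, n in counts_a.items())
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # A sequence shorter than k has no k-mers; treat it as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy sequences, not taken from the dataset.
    a = "ACGTACGTAC"
    b = "ACGTTCGTAC"
    print(d2(a, b, 3), round(d2_distance(a, b, 3), 3))
```

A pairwise matrix of such distances over a set of genomes can then be passed to a distance-based tree-building method such as neighbor joining. Because the score depends only on k-mer counts, it can still be computed when a genome is fragmented or when particular marker genes are absent.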