Intraspecific genome SNP frequencies comparison
Citation
Zhao, Kuan; Xu, Jian-ping (2023), Intraspecific genome SNP frequencies comparison, Dryad, Dataset, https://doi.org/10.5061/dryad.kh18932b1
Abstract
Genome sequence analyses can provide crucial information for understanding the origin and spread of infectious diseases, population history, speciation, and taxonomy. In the class Agaricomycetes, to which most mushroom-forming fungi belong, most species have so far been defined based on morphological, ecological, and/or molecular features, but no defined threshold exists for any type of feature that can be applied across multiple genera, families, and orders. In this study, we investigated genome-wide single nucleotide polymorphism (SNP) frequencies within species to understand the patterns of variation in both the nuclear and mitochondrial genomes of species with currently available whole-genome sequences. In total, our analyses included 398 nuclear and 106 mitochondrial publicly available genomes of Agaricomycetes. The SNP frequencies among nuclear genomes within individual species ranged from 0.00% to 7.69%, while the intraspecific SNP frequencies among mitochondrial genomes ranged from 0.00% to 4.41%. Spearman's non-parametric rank correlation test showed a weak but statistically significant positive correlation between the paired nuclear and mitochondrial genome datasets. Overall, we observed a significantly higher SNP frequency in the nuclear genome than in the mitochondrial genome between strains within most species. Interestingly, across the broad Basidiomycetes, the ratios of mitochondrial genome SNPs to nuclear genome SNPs between pairs of strains within each species were highly similar, with a mean of 0.24. We discuss the implications of these results for Agaricomycetes systematics and the implementation of genome sequence-based species delimitation in fungi.
Methods
The assembled nuclear and mitochondrial (mt) genome data of Agaricomycetes were downloaded from the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/genome/?term=) and the Joint Genome Institute (JGI, https://mycocosm.jgi.doe.gov/mycocosm/home) genome databases, covering records available up to August 31, 2022. For each analyzed genome, the sequencing technology used, assembled genome size, sequencing read coverage depth, number of scaffolds and/or contigs, N50 (the minimum scaffold/contig length needed to cover 50% of the genome), L50 (the number of contigs required to reach N50), the mitogenome size, and the related references were retrieved when available. Species with at least two nuclear or two mt genomes were selected for further analyses.
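The N50 and L50 definitions above can be illustrated with a minimal Python sketch. It is not part of the dataset; the function name `n50_l50` and the example lengths are hypothetical, and the input is assumed to be a plain list of scaffold or contig lengths in base pairs.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold/contig lengths in bp.

    N50: length of the shortest sequence in the smallest set of longest
         sequences that together cover at least 50% of the assembly.
    L50: number of sequences in that set.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Hypothetical assembly with five contigs (300 bp in total).
example_lengths = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(example_lengths)
```

For the hypothetical lengths shown, the two longest contigs (100 + 80 = 180 bp) are the first to cover half of the 300 bp total, so N50 is 80 and L50 is 2.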
Usage notes
Genome-wide SNPs within individual species were identified with the alignment-based program MUMmer 3.23. In each pairwise comparison, the longer assembly (larger genome and better-assembled genome with fewer scaffolds) served as the reference. Our alignments used the following commands: the “--mum -p” parameters for aligning each pair of assembled genomes and identifying overlapping regions between the two assemblies (maxgap=500, mincluster=100), followed by “delta-filter -1” to filter out repeated alignments, and then “show-snps -CHITrl” to detect base substitutions. Insertions and deletions (InDels) in those overlapping regions were excluded from SNP frequency calculations.
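The workflow described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the authors' script: it assumes the aligner is MUMmer's `nucmer` (the record names only the options), that maxgap and mincluster are passed through `-g` and `-c`, that the tab-delimited `show-snps` output lists the reference base and query base in its second and third columns, and that SNP frequency is the number of substitutions divided by the aligned length. The record does not state the denominator, and all function and file names here are hypothetical.

```python
BASES = {"A", "C", "G", "T"}


def build_mummer_commands(reference, query, prefix):
    """Build (without running) the three command lines for one genome pair.

    Assumption: nucmer is the aligner; delta-filter writes to stdout, so the
    caller would redirect its output to a filtered delta file.
    """
    return [
        ["nucmer", "--mum", "-g", "500", "-c", "100",
         "-p", prefix, reference, query],
        ["delta-filter", "-1", prefix + ".delta"],
        ["show-snps", "-CHITrl", prefix + ".filtered.delta"],
    ]


def count_substitutions(show_snps_text):
    """Count base substitutions in tab-delimited show-snps output.

    Assumed columns: position in reference, reference base, query base,
    position in query, followed by further fields. Rows where either base
    is not A/C/G/T (for example '.' marking an InDel) are skipped.
    """
    count = 0
    for line in show_snps_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # blank or malformed line
        ref_base, qry_base = fields[1].upper(), fields[2].upper()
        if ref_base in BASES and qry_base in BASES and ref_base != qry_base:
            count += 1
    return count


def snp_frequency_percent(n_snps, aligned_bases):
    """SNP frequency as a percentage of aligned bases (assumed denominator)."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return 100.0 * n_snps / aligned_bases


# Hypothetical three-row output: two substitutions and one InDel row.
sample_output = (
    "10\tA\tG\t12\t5\t10\t1000\t900\t1\t1\tref1\tqry1\n"
    "20\tC\t.\t21\t3\t20\t1000\t900\t1\t1\tref1\tqry1\n"
    "35\tt\tc\t36\t8\t35\t1000\t900\t1\t1\tref1\tqry1\n"
)
n_snps = count_substitutions(sample_output)
frequency = snp_frequency_percent(n_snps, aligned_bases=800)
```

Keeping command construction separate from parsing means the counting logic can be checked on saved `show-snps` output without MUMmer installed.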
Funding
Natural Sciences and Engineering Research Council of Canada, Award: RGPIN-2020-05732