Data from: Extensive intragenomic variation in the Internal Transcribed Spacer (ITS) region, the universal barcode in Kingdom Fungi
Bradshaw, Michael (2022), Data from: Extensive intragenomic variation in the Internal Transcribed Spacer (ITS) region, the universal barcode in Kingdom Fungi, Dryad, Dataset, https://doi.org/10.5061/dryad.g79cnp5t7
Fungi are among the most biodiverse organisms in the world, with a conservatively estimated 2.2–3.8 million species, of which only ~150,000 have been described so far. Accurate species identification is critical for studies on fungal ecology and evolution. Due to their cryptic, and often microscopic, nature, fungal species are often identified with molecular markers. The internal transcribed spacer (ITS) rDNA region is widely accepted as the universal barcode for fungi. However, several recent studies have uncovered intragenomic sequence variation within the ITS in multiple fungal species. Here, we mined the genomes of 2,414 fungal species to determine the prevalence of intragenomic variation in Kingdom Fungi and its consequences for studies of fungal ecology and evolution. We found that the genomes of 641 species, about one-quarter of the 2,414 species examined, contained multiple ITS copies and that 419 (17%) contained variation among copies. We found that intragenomic variation and potential ‘pseudogenes’ (highly divergent ITS copies) are common in species throughout the Kingdom. The high prevalence of ITS intragenomic variation in fungi has wide implications for taxonomy, environmental DNA studies and barcode-based species diversity estimates.
From September to December 2021, a GenBank genome assembly was mined for every fungal species with a publicly available assembly. Data mining was accomplished as follows:
(1) A list of all taxa with publicly available assemblies was compiled.
(2) For each taxon, GenBank was searched for a fully annotated ITS region.
(3) The GenBank accession identified in step (2) was searched against GenBank with blastn to confirm that the taxon was identified correctly.
(4) The ITS region from step (2) was trimmed to include only nucleotides present in the ITS1+5.8S+ITS2 region.
(5) The length of the ITS region and its GenBank accession were recorded.
(6) The trimmed ITS region was used as a BLAST query against a genome assembly of the taxon of interest. If multiple assemblies were available for a given taxon, the assembly with the smallest number of scaffolds/contigs was evaluated first; if no ITS region was located, or if it was very fragmented, other assemblies were checked.
(7) The results of the assembly BLAST search were downloaded into Geneious Version 2021.2.2 and aligned.
(8) ITS copies from the genome assembly that were more than ~50 bp shorter than the ITS region length recorded in step (5) were discarded to eliminate short contigs and to keep the data consistent.
(9) Alignments for these taxa are available in Supplementary File 1 in both .geneious and .fasta file formats.
(10) The number of ITS copies in the assembly, and the identical sites (%) and pairwise identity (%) among the copies, were calculated in Geneious and recorded.
(11) The ITS accessions used to query the assemblies were downloaded into Geneious and their GC content was recorded.
(12) The remaining data for the assemblies were recorded from GenBank (Taxa ID, assembly method, sequencing technology used, genome coverage, contigs, scaffolds, assembly GC content (%), assembly release date, and genome size).
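The length filter in step (8) and the statistics in steps (10) and (11) can be approximated outside Geneious. The following Python sketch is illustrative only and is not the code used to produce the dataset. The function names are hypothetical, the 50 bp cutoff is treated as exact even though the methods give it as approximate, and gaps are counted as ordinary characters when comparing columns, which may differ from how Geneious treats gaps, so values may not match the recorded ones exactly.

```python
from itertools import combinations


def filter_short_copies(copies, reference_length, max_shortfall=50):
    """Step (8): drop copies more than max_shortfall bp shorter than the reference.

    copies maps a copy name to its sequence; gap characters are ignored
    when measuring length. The exact cutoff is an assumption.
    """
    minimum = reference_length - max_shortfall
    return {
        name: seq
        for name, seq in copies.items()
        if len(seq.replace("-", "")) >= minimum
    }


def identical_sites_percent(aligned):
    """Step (10): percentage of alignment columns where all copies agree."""
    columns = list(zip(*aligned))
    identical = sum(1 for column in columns if len(set(column)) == 1)
    return 100.0 * identical / len(columns)


def pairwise_identity_percent(aligned):
    """Step (10): percentage of matching positions over all pairs of copies."""
    matches = 0
    total = 0
    for first, second in combinations(aligned, 2):
        for a, b in zip(first, second):
            total += 1
            if a == b:
                matches += 1
    return 100.0 * matches / total


def gc_percent(sequence):
    """Step (11): GC content of a sequence, ignoring gap characters."""
    bases = sequence.upper().replace("-", "")
    return 100.0 * (bases.count("G") + bases.count("C")) / len(bases)


# Toy data, invented for illustration; not taken from the dataset.
toy_alignment = ["ACGTACGT", "ACGTACGA", "ACGTAC-T"]
toy_copies = {"copy1": "A" * 600, "copy2": "A" * 540}

if __name__ == "__main__":
    print(sorted(filter_short_copies(toy_copies, reference_length=600)))
    print(identical_sites_percent(toy_alignment))
    print(pairwise_identity_percent(toy_alignment))
    print(gc_percent(toy_alignment[0]))
```

All aligned sequences are assumed to have the same length, as they would when exported from a single alignment.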
The alignment files can be opened in Geneious (.geneious files) or in any .fasta file reader (.fasta files).
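For readers without Geneious, the .fasta alignments can also be loaded with a few lines of Python. This is a minimal sketch using only the standard library; the function names and the example file name are hypothetical, and records that share a header would overwrite one another.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a dict of header -> sequence.

    Gap characters in aligned sequences are kept as-is.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


def read_fasta(path):
    """Read a FASTA file from disk and return header -> sequence."""
    with open(path) as handle:
        return parse_fasta(handle)


# Example usage with a hypothetical file name:
# alignment = read_fasta("example_taxon_alignment.fasta")
# for header, sequence in alignment.items():
#     print(header, len(sequence))
```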