Extensive intragenomic variation in the internal transcribed spacer (ITS) region of fungi

Cite this dataset

Bradshaw, Michael (2022). Extensive intragenomic variation in the internal transcribed spacer (ITS) region of fungi [Dataset]. Dryad. https://doi.org/10.5061/dryad.g79cnp5t7

Abstract

Fungi are among the most biodiverse organisms in the world. Accurate species identification is imperative for studies on fungal ecology and evolution. The internal transcribed spacer (ITS) rDNA region has been widely accepted as the universal barcode for fungi. However, several recent studies have uncovered intragenomic sequence variation within the ITS in multiple fungal species. Here, we mined the genomes of 2414 fungal species to determine the prevalence of intragenomic variation and found that the genomes of 641 species, about one-quarter of the 2414 species examined, contained multiple ITS copies. Of those 641 species, 419 (~65%) contained variation among copies, revealing that intragenomic variation is common in fungi. We proceeded to show how these copies could result in the erroneous description of hundreds of fungal species and skew studies evaluating eDNA, especially when making diversity estimates. Additionally, many genomes were found to be contaminated, especially those of unculturable fungi.

Methods

Data were mined from GenBank between September and December 2021, using at least one genome assembly for every fungal species with a publicly available assembly. For each assembly, the ITS copies were extracted, aligned, and analyzed for variation. Detailed methods are as follows:

(1) A list of all taxa with publicly available assemblies was compiled.

(2) For each taxon, GenBank’s nucleotide database was searched for a fully annotated ITS region.

(3) The GenBank accession identified in (2) was searched against GenBank with blastn to confirm that the taxon was identified correctly.

(4) The ITS region from (2) was trimmed to include only nucleotides present in the ITS1+5.8S+ITS2 region.

(5) A genome assembly was chosen for each fungal species on GenBank. If multiple assemblies for a given taxon were available, the assembly with the smallest number of scaffolds/contigs was evaluated first.

(6) Each genome assembly was searched on GenBank (blastn) using the trimmed ITS region as the query. For example, in Supplementary Dataset 1, the assembly in column A (‘Assembly Reference’) was searched with the accession in column E (‘GenBank Accession Number of ITS Region used to blast assembly’). If no ITS region was located, or if it was highly fragmented, other assemblies were checked.

(7) The results of the assembly blast were downloaded into Geneious version 2021.2.2 and aligned.

(8) ITS copies from the genome assembly that were more than ~50 bases shorter than the ITS region determined in step (4) were discarded to eliminate short contigs and to keep the data consistent.

(9) Alignments for these taxa are available here in both .geneious and .fasta file formats.

(10) The number of ITS copies in the assembly, identical site % and pairwise identity % among the different copies were calculated in Geneious and recorded.

(11) The ITS accessions used to blast the assemblies were downloaded into Geneious and their GC content was recorded.

(12) The remaining data from the assemblies were recorded from GenBank (Taxa ID, assembly method, sequencing technology used, genome coverage, contigs, scaffolds, assembly GC content (%), assembly release date, and genome size).
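
The length filter in step (8) and the statistics in steps (10) and (11) were computed in Geneious; the Python sketch below only approximates them for readers working from the .fasta alignments. All function names are hypothetical, the 50-base cutoff is taken from step (8), and the treatment of gaps (gap-versus-gap positions are skipped in the pairwise comparison) and of ambiguity codes is an assumption that may differ from Geneious.

```python
from itertools import combinations


def filter_short_copies(copies, reference_length, max_shortfall=50):
    """Keep copies that are at most `max_shortfall` bases shorter than the reference.

    Gap characters are ignored when measuring length (assumption).
    """
    minimum = reference_length - max_shortfall
    return [c for c in copies if len(c.replace("-", "")) >= minimum]


def identical_sites_percent(aligned):
    """Percent of alignment columns in which every sequence has the same character."""
    columns = list(zip(*aligned))
    identical = sum(1 for column in columns if len(set(column)) == 1)
    return 100.0 * identical / len(columns)


def pairwise_identity_percent(aligned):
    """Percent of matching positions over all sequence pairs and columns.

    Positions where both sequences have a gap are skipped (assumption).
    """
    matches = 0
    total = 0
    for a, b in combinations(aligned, 2):
        for x, y in zip(a, b):
            if x == "-" and y == "-":
                continue
            total += 1
            if x == y:
                matches += 1
    return 100.0 * matches / total


def gc_content_percent(sequence):
    """Percent G+C among unambiguous A/C/G/T bases."""
    s = sequence.upper()
    acgt = sum(s.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (s.count("G") + s.count("C")) / acgt


# Toy alignment of three ITS copies (made-up sequences, not from the dataset).
toy_alignment = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
print(identical_sites_percent(toy_alignment))
print(pairwise_identity_percent(toy_alignment))
print(gc_content_percent("ACGT"))
```

The functions expect sequences of equal length in a consistent letter case, as in an exported alignment.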

Usage notes

The alignments can be opened in Geneious or any .fasta file reader.
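
For readers without Geneious, the .fasta alignments can also be read with a few lines of Python. The following is a minimal sketch, not part of the dataset: the function name `read_fasta` and the file name in the usage comment are hypothetical, and the parser assumes standard FASTA layout (a `>` header line followed by one or more sequence lines).

```python
def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file.

    Sequence lines belonging to one record are joined; blank lines are skipped.
    """
    records = []
    header = None
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(parts)))
                header = line[1:]
                parts = []
            elif header is not None:
                parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


# Usage (hypothetical file name):
# for name, sequence in read_fasta("example_alignment.fasta"):
#     print(name, len(sequence))
```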