Do pseudogenes pose a problem for metabarcoding marine animal communities?

Schultz, Jessica A.1 ; Hebert, Paul D. N.1

Research facility: University of Guelph

Published Jun 13, 2022; Updated Aug 18, 2023 on Dryad. https://doi.org/10.5061/dryad.1jwstqjvt

Data files

Jun 13, 2022 version files 1.09 MB

Table_S1_List_of_COI_NUMTs_in_sequenced_marine_genomes.csv

1.05 MB
Table_S2_Summary_of_COI_NUMTs_by_species.csv

44.71 KB

Abstract

Because DNA metabarcoding typically employs sequence diversity among mitochondrial amplicons to estimate species composition, nuclear mitochondrial pseudogenes (NUMTs) can inflate diversity. This study quantifies the incidence and attributes of NUMTs derived from the 658 bp barcode region of cytochrome c oxidase I (COI) in 156 marine animal genomes. NUMTs were examined to ascertain if they could be recognized by their possession of indels or stop codons. In total, 309 NUMTs  150 bp were detected, with an average of 1.98 per species (range = 0–33) and a mean length of 391 bp  200 bp. Among this total, 75 (23.4%) lacked indels or stop codons. NUMTs appear to pose the greatest interpretational risk when short (< 313 bp) amplicons are used, such as in eDNA studies, dietary analyses, or processed fish identification. Employing the standard amplicon length (313 bp) for marine metabarcoding, NUMTs could potentially inflate the OTU count by 21% above the true species count while also raising intraspecific variation at COI by 15%. However, when both amplicon length and position are considered, inflation in OTU counts and in barcode variation were just 9% and 10%, respectively, suggesting NUMTs will not seriously distort biodiversity assessments. There was a weak positive correlation between genome size and NUMT count but no variation among phyla or trophic groups. Until bioinformatic advances improve NUMT detection, the best defense involves targeting long amplicons and developing reference databases that include both mitochondrial sequences and their NUMT derivatives.

Data collection

We examined the incidence of COI NUMTs in the genomes of marine animals on the NCBI genome browser (Clark et al. 2016). To identify candidate genomes, we compared taxonomic names in the World Register of Marine Species (WoRMS; Horton et al. 2020) with the NCBI genome browser (https://www.ncbi.nlm.nih.gov/genome/browse). All genomes for marine invertebrates were downloaded together with those for at least one species per order of marine vertebrates, selected haphazardly. When more than one genome was available for a species, the reference genome (if available) or the most recent assembly was selected. In addition, we downloaded the COI sequence from the mitochondrial genome of each species and used AliView (Larson 2014) to extract the 658 bp recovered by primers targeting the barcode region (Hebert et al. 2003). When available, the reference sequence for the full COI gene was also retained. When a COI sequence was unavailable on GenBank, the Barcode of Life Database (BOLD; Ratnasingham and Hebert 2007) was searched for a sequence.

NUMT search and identification

We conducted BLAST searches for mitochondrial COI against the genome sequence available for each species using the 658 bp barcode region as the query. Using Geneious Prime (version 2020.2.1), we conducted a BLASTn search with a maximum of 1000 hits and a maximum expectation value of e = 0.0001 to generate a list of hits. We excluded BLASTn hits < 150 bp in length, or those with both 100% coverage and ≥ 99.8% ID as these likely represented a mitochondrial sequence inadvertently included in the nuclear assembly.

The remaining hits were considered putative COI NUMTs and summary information (hit length, GC content, query coverage, percent similarity, e-value) were exported to Excel (Supp. Data Table S1). Using the BLASTn alignments in Geneious Prime (version 2020.2.1), each hit was individually aligned with the mitochondrial COI sequence for that species to visually search for any insertions and deletions. Each sequence was then translated using the appropriate mitochondrial code to determine if premature stop codons were present. To determine the correct reading frame, we tested each codon position until no premature stop codons appeared in the COI reference sequence. The presence of indels or premature stop codons (IPSCs) at any position along the sequence was recorded for all hits ≥ 150 bp.

Since using a longer COI query length could reveal additional NUMTs beyond the 658 bp barcode region, we conducted a second BLASTn search among the invertebrates in our dataset using the full-length (1500 bp) COI sequence when available (Supp. Figure S1). We used the same BLASTn parameters and strategy for identifying IPSCs as for the 658 bp query and retained all hits >150 bp. This analysis made it possible to ascertain if certain regions of COI were more prone to incorporation into NUMTs by mapping hits to the reference COI sequence and then quantifying the coverage at each nucleotide position. We then plotted the coverage for all species in a particular phylum to determine the frequency with which each nucleotide position of COI appeared in a NUMT.

NUMT diagnosis

In addition to examining all NUMTs ≥ 150 bp for diagnostic features, we quantified the incidence of NUMTs that presented an interpretational threat by ascertaining the proportion of hits lacking diagnostic features at four sequence lengths (150, 313, 500, 650 bp) commonly used in marine barcoding, metabarcoding, and eDNA approaches (Hebert, Ratnasingham, & Dewaard, 2003; Leray et al., 2013; Ratnasingham & Hebert, 2013; Shokralla et al., 2015; van der Loos & Nijland, 2020; S. Zhang, Zhao, & Yao, 2020). HTS platforms have the potential to recoverall NUMTs, but only those with IPSCs in the target region will be diagnosed. For instance, platforms that generate 150 bp reads will capture all NUMTs 150 bp, but they will only be recognizable as NUMTs if they possess IPSCs within the first 150 bp. Using results from the 658 bp query, we therefore considered hits diagnosable if they contained IPSCs in the target sequence region (i.e., the first 150, 313, 500, or 650 bp). The mean number of both diagnosable and non-diagnosable hits per species was compared among the four sequence length categories using a Friedman test with a Bonferroni correction of k = 2 and adjusted α of 0.025 (see Supp. Info Table S3 for a summary of statistical tests used). The median total number hits per species was compared to the median number of undiagnosable hits per species within each length category using Sign tests (Bonferroni correction: k = 4, adjusted α = 0.0123). In addition, we examined the impact of NUMT divergence values on metabarcoding results, employing a fixed sequence divergence threshold for delineating OTUs. We chose a 3% divergence threshold because of its widespread use in marine metazoan surveys (Leray and Knowlton 2015, 2017, Cahill et al. 2018). However, we also report results with 2% and 4% thresholds as some studies have employed them. For instance, the BIN system uses 2.2% for initial clustering as part of the RESL clustering approach (Ratnasingham and Hebert 2013), and 4% is sometimes used as an upper bound in Bayesian approaches (e.g., Hao et al. 2011, Leray and Knowlton 2017). Divergence values are only important for non-diagnosable NUMTs that are not excluded from downstream analysis. We therefore ascertained the proportion of undiagnosable NUMTs that would either inflate the OTU count (>2, 3, or 4% divergence) or intraspecific barcode variation (<2, 3, or 4% divergence).

Because the 313 bp Leray fragment (Leray et al. 2013) is commonly used in marine metabarcoding applications, we further examined the number of sequences likely to appear in studies targeting this fragment when both the length and position of the NUMT in the barcode region are considered. The Leray fragment is located near the 3’ terminus of the 658 bp barcode amplified by Folmer (Folmer et al. 1994) or Geller (Geller et al. 2013) primers. Because it begins at approximately the 345th bp of COI, we reasoned that a NUMT is likely to be amplified by the Leray primers if: 1) the starting base pair of the NUMT is at least 10 bp before the 345th bp of COI to allow binding of the forward primer; 2) it is long enough to span the Leray fragment (313 bp); and 3) it includes an additional 10 bp to allow binding of the reverse primer. Because our query length was restricted to 658 bp, we not could not ascertain if a hit meeting the first two criteria also met the third criterion, but this should often have been the case. We quantified the number of NUMTs that met these criteria and that lacked an IPSC to determine the proportion that could pose a problem for studies targeting this region.

Patterns of NUMT abundance among species

We examined the relationship between the number of hits (greater than or equal to 150 bp) and genome size, the quality of the assembly (contig N50), and genome coverage reported on NCBI using Spearman’s rank correlations (Bonferroni correction: k = 3, adjusted α = 0.0167). In addition, we examined if species in certain phyla or those with differing ecological traits were more likely to possess NUMTs. For these comparisons, we used results from the COI barcode query length (658 bp) and included all hits greater than or equal to 150 bp. We compared the average number of total hits per species among phyla, the average number of undiagnosable hits among phyla, and the average number of total hits among trophic groups using Kruskal-Wallis rank sum tests (Bonferroni correction: k = 3; adjusted α = 0.0167).

Ecological information was primarily compiled from the Encyclopedia of Life (http://eol.org; accessed 12 Sept 2020), but additional information was obtained from the primary literature to fill gaps (see Supp. Data Table S2 for specific references). We recognized six trophic categories (predator/carnivore, grazer/herbivore, parasite, suspension feeder, omnivore, other) on the basis of adult feeding habits. The ‘suspension feeder’ category included passive suspension feeders, active filter feeders, and mucous net feeders. The ‘omnivore’ category included species with more than one equally prevalent trophic level or guild. ‘Other’ included a chemosymbiotroph, a surface deposit feeder, and two detritivores.

The data used for analysis are contained in two .csv files, "Table S1 - List of COI NUMTs in sequenced marine genomes.csv" and "Table S2 - Summary of COI NUMTs by species.csv." Table S1 is a table of all of the COI hits found in each genome, including the datafields exported by Geneious Prime. Table S2 contains summary information for each organism included in the analysis, including GenBank accession numbers or Barcode Index Numbers, as applicable, for all genomes and COI query sequences employed in our analysis. A description of the data columns in each file are as follows.

Table S1 - List of COI NUMTs in sequenced marine genomes.csv

• Name – identifier for the hit

• #.Nucleotides - number of nucleotides in the BLASTn hit

• #.Sequences – number of sequences aligned in the BLASTn search

• %.Identical.Sites – percent of identical sites

• %.Pairwise.Identity – perrcent of pairwise ID

• GC – GC content in percent

• Bit-Score – bit score

• Created.Date – date and time of the BLASTn search

• Database – name of the NCBI database (species name)

• Description – description of the NCBI database

• E.Value - “Expect value” represents the number of hits with at least this score that you would expect purely by chance, given the size of the database and query sequence. The lower the E-value, the more likely that the hit is real.

• Free.end.gaps

• Grade - a percentage calculated by Geneious by combining the query coverage, e-value and identity values for each hit with weights 0.5, 0.25 and 0.25 respectively. See the Geneious Prime User Manual for more details.

• Hit.end – the starting nucleotide of the hit within the genome assembly

• Hit.start – the ending nucleotide of the hit within the genome assembly

• Max.Sequence.Length – the maximum sequence length of the hit

• Mean.Coverage – the mean coverage of the hit

• Min.Sequence.Length – the minimum sequence length of the hit

• Modified – modified date and time

• Molecule.Type – molecule type

• Query – name of the query sequence

• Query.coverage – coverage of the query sequence

• Query.end – starting nucleotide of the query alignment with the assembly

• Query.start – ending nucleotide of the query alignment with the assembly

• Sequence – nucleotide sequence of the hit

• Sequence.Length – length of the sequence

• Topology – sequence topology

• URN - URN

• Hit.length – the hit end mimus the hit start

• Indels – indicates whether or not an insertion or deletion of any length was present along the hit (0 = no, 1 = yes, n/a = not evaluated for hits < 100 bp)

• Stop.codons - indicates whether or not a premature stop codon was present along the hit (0 = no, 1 = yes, n/a = not evaluated for hits < 100 bp)

• Phylum – the phylum of the organism

• CO1.query.length.category – the category of COI reference sequence length used in the BLASTn search (658 – the barcode region, 1000 – 100 bp, 1500 – the full-length mtCOI sequence)

• CO1.query.length – the exact length of the COI reference sequence used in the BLASTn search

• Hit.length.category – the category of hit length used in the analysis

• indel.or.stop – indicates whether or not an insertion or deletion of any length or a premature stop codon was present along the hit (0 = no, 1 = yes, , n/a = not evaluated for hits < 100 bp)

• ipsc.first150 – indicates whether or not an indel or premature stop codon was present in the first 150 bp of the hit (0 = no, 1 = yes, n/a = not evaluated for hits < 150 bp)

• ipsc.first300 – indicates whether or not an indel or premature stop codon was present in the first 313 bp of the hit (0 = no, 1 = yes, n/a = not evaluated for hits < 313 bp)

• ipsc.first450 – indicates whether or not an indel or premature stop codon was present in the first 500 bp of the hit (0 = no, 1 = yes, n/a = not evaluated for hits < 500 bp)

• ipsc.first600 – indicates whether or not an indel or premature stop codon was present in the first 650 bp of the hit (0 = no, 1 = yes, n/a = not evaluated for hits < 650 bp)

• Divergence – divergence between the hit and the COI reference sequence for that organism in percent

Table S2 - Summary of COI NUMTs by species.csv.

• group – the organism groups as designated by NCBI

• type - whether the organism is an invertebrate or a vertebrate

• phylum – phylum of the organism

• class - taxonomic class of the organism (n/a = not designated)

• order – order of the organism (n/a = not designated)

• family - taxonomic family of the organism (n/a = not designated)

• genus - genus of the organism

• organism.name – the species name as listed

• accepted.as – the accepted species name for the assembly

• common.name – the common name of the organism (n/a = not designated)

• size.Mb –genome size in Mb

• contig.N50 – genome quality in Mb

• coverage - genome coverage of the assembly (n/a = data not available)

• whole.genome – NCBI accession number for the whole genome assembly

• mtCOI – the NCBI accession number or BOLD reference for the mtCOI reference used in the BLASTn search

• date.of.download – date the assembly and COI reference sequence were downloaded

• trophic.group – trophic category of the adult phase of the organism

• hermaphroditism.present – indicates whether or not hermaphroditism is present at any phase of the organisms life history (0 = no, 1 = yes)

• sexual.reproduction.present - indicates whether or not sexual reproduction is present at any phase of the organisms life history (0 = no, 1 = yes)

• asexual.reproduction.present - indicates whether or not asexual reproduction is present at any phase of the organisms life history (0 = no, 1 = yes)

• colonial - indicates whether or not colonialism is present at any phase of the organisms life history (0 = no, 1 = yes)

• endothermy - indicates whether or not the organism is endothermic, exothermic or an invertebrate

• Additional life history references – references used to determine life history traits, other than those cited in the main text (n/a = no additional references were used).

• hits.150bp – the total number of BLASTn hits (potential NUMTs) greater than or equal to 150bp found using the 658bp COI query

• content.150bp - the sum of the sequence lengths of all BLASTn hits (potential NUMTs) greater than or equal to 150bp found using the 658bp COI query

• hits.per.100Mb – the number of BLASTn hits corrected for genome size (per 100 Mb)

• hits.150bp.no.ipsc – the total number of BLASTn hits greater than or equal to 150bp found using the 658bp COI query that contained no indels or premature stop codons

Do pseudogenes pose a problem for metabarcoding marine animal communities?

Data files

Abstract

Methods

Usage notes

Works referencing this dataset