Optimal sequence similarity thresholds for clustering of molecular operational taxonomic units in DNA metabarcoding studies
Data files
Oct 24, 2021 version files 161.70 MB
- Arth02_r140.ecopcr.gz (2.37 MB)
- Bact02_r140.ecopcr.gz (110.12 MB)
- Coll01_r140.ecopcr.gz (6.78 MB)
- Euka02_r140.ecopcr.gz (8.01 MB)
- Fung02_r140.ecopcr.gz (9.53 MB)
- Inse01_r140.ecopcr.gz (12.92 MB)
- Olig01_r140.ecopcr.gz (7.06 MB)
- Sper01_r140.ecopcr.gz (4.92 MB)
Mar 30, 2022 version files 172.94 MB
- Arth02_r140.ecopcr.gz (2.37 MB)
- Bact02_r140.ecopcr.gz (110.12 MB)
- COI_BF1-BR2_r140.ecopcr.gz (11.24 MB)
- Coll01_r140.ecopcr.gz (6.78 MB)
- Euka02_r140.ecopcr.gz (8.01 MB)
- Fung02_r140.ecopcr.gz (9.53 MB)
- Inse01_r140.ecopcr.gz (12.92 MB)
- Olig01_r140.ecopcr.gz (7.06 MB)
- Sper01_r140.ecopcr.gz (4.92 MB)
Abstract
Clustering approaches are pivotal for handling the many sequence variants obtained in DNA metabarcoding datasets, and have therefore become a key step of metabarcoding analysis pipelines. Clustering often relies on a sequence similarity threshold to gather sequences into Molecular Operational Taxonomic Units (MOTUs), each ideally representing a homogeneous taxonomic entity, e.g. a species or a genus. However, the choice of the clustering threshold is rarely justified, and its impact on MOTU over-splitting or over-merging is tested even less often. Here, we evaluated clustering threshold values for several metabarcoding markers under different criteria: limitation of MOTU over-merging, limitation of MOTU over-splitting, and trade-off between over-merging and over-splitting. We extracted sequences from a public database for nine markers, ranging from generalist markers targeting Bacteria or Eukaryota, to more specific markers targeting a class or a subclass (e.g. Insecta, Oligochaeta). Based on the distributions of pairwise sequence similarities within species and within genera, and on the rates of over-splitting and over-merging across different clustering thresholds, we were able to propose threshold values that minimize the risk of over-splitting, minimize the risk of over-merging, or offer a trade-off between the two risks. For generalist markers, high similarity thresholds (0.96-0.99) are generally appropriate, while more specific markers require lower values (0.85-0.96). These results do not support the use of a fixed clustering threshold. Instead, we advocate a careful examination of the most appropriate threshold based on the research objectives, the potential costs of over-splitting and over-merging, and the features of the studied markers.
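To make the notions of over-splitting and over-merging concrete, the following is a minimal sketch and not the procedure used in the study. It assumes a toy similarity measure (fraction of identical positions), single-linkage clustering, and two illustrative definitions: a species is over-split if its sequences fall into more than one MOTU, and a MOTU is over-merged if it contains more than one species. The function names, the example sequences, and these definitions are all hypothetical.

```python
from itertools import combinations


def similarity(a, b):
    """Toy similarity (assumption): fraction of identical positions."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cluster(seqs, threshold):
    """Single-linkage clustering with union-find.

    Two sequences end up in the same MOTU if they are connected by a
    chain of pairs whose similarity is >= threshold.
    Returns one MOTU label per input sequence.
    """
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(seqs))]


def split_merge_rates(motus, species):
    """Return (over-splitting rate, over-merging rate).

    Over-splitting rate: share of species spread over more than one MOTU.
    Over-merging rate: share of MOTUs containing more than one species.
    Both definitions are illustrative assumptions.
    """
    motus_per_species = {}
    species_per_motu = {}
    for m, s in zip(motus, species):
        motus_per_species.setdefault(s, set()).add(m)
        species_per_motu.setdefault(m, set()).add(s)
    over_split = sum(len(v) > 1 for v in motus_per_species.values()) / len(motus_per_species)
    over_merged = sum(len(v) > 1 for v in species_per_motu.values()) / len(species_per_motu)
    return over_split, over_merged


# Hypothetical example: four sequences from three species.
seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "CCCCCCCCCC"]
species = ["sp_A", "sp_A", "sp_B", "sp_C"]

for threshold in (0.90, 0.95):
    rates = split_merge_rates(cluster(seqs, threshold), species)
    print(threshold, rates)
```

In this toy example, the lower threshold merges sp_A and sp_B into one MOTU (no over-splitting, some over-merging), while the higher threshold separates every sequence (sp_A becomes over-split, no over-merging), which illustrates the trade-off described in the abstract.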