Environmental niche models improve species identification in DNA barcoding
Data files
Sep 23, 2024 version files 465.28 KB
- Appendix_S1_Simulated_datasets.zip (183.89 KB)
- Appendix_S2_Empirical_datasets.zip (138.94 KB)
- Appendix_S3_Simulated_datasets_-_EnvSelected.zip (85.42 KB)
- Appendix_S4_Empirical_datasets_-_EnvSelected.zip (53.26 KB)
- README.md (3.76 KB)
Abstract
Over the last two decades, DNA barcoding has substantially advanced global biodiversity research. However, processes such as hybridization, introgression, and incomplete lineage sorting limit the resolving power of barcode sequences and can lead to misidentifications when identification relies on those sequences alone. Here, we propose a new Niche-model-Based Species Identification (NBSI) method based on the idea that species distribution information is a potential complement to DNA barcoding species identifications. NBSI infers species membership by combining niche modeling predictions with traditional DNA barcoding identifications. Systematic tests across diverse scenarios show significant improvements in species identification success rates under the NBSI framework, with the largest increase being from 4.7% (95% CI: 3.51%-6.25%) to 94.8% (95% CI: 93.19%-96.06%). Clear improvements were also observed when applying NBSI to potentially ambiguous sequences whose genetic nearest neighbors belong to another species or more than two species, a situation that commonly occurs for species represented by single or short DNA barcodes. These results support our assertion that environmental variables are valuable complements to DNA sequence data for species identification, because they help avoid misidentifications inferred from genetic information alone. The NBSI framework is currently implemented as a new R package, “NicheBarcoding”, which is open source under the GNU General Public License and freely available from https://CRAN.R-project.org/package=NicheBarcoding.
README: Environmental niche models improve species identification in DNA barcoding
https://doi.org/10.5061/dryad.rbnzs7h96
Description of the data and file structure
The software package and the supplementary materials of this research article:
Environmental Niche Models Improve Species Identification in DNA Barcoding
Files and variables
File: Appendix_S1_Simulated_datasets.zip
Description: All generated simulated datasets.
- Files starting with "Sim_": Simulated datasets generated with 20, 50, or 100 species, each dataset containing 100 individuals in total. The fundamental coalescent parameter θ was set to 0.05, 0.01, and 0.2 for the respective datasets. Consequently, each species contains 5 individuals in the 20-species dataset, 2 individuals in the 50-species dataset, and 1 individual in the 100-species dataset.
- Files starting with "SimAmbSeqs_": Based on the "Sim_*" files, potentially ambiguous sequences whose genetic nearest neighbors belong to another species or more than two species are extracted from each simulated dataset, forming new corresponding simulated datasets.
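The ambiguity criterion described above can be illustrated with a minimal Python sketch. It is not the code used to build these files: the uncorrected p-distance, the handling of ties, the function names `p_distance` and `flag_ambiguous`, and the toy records are all assumptions made for illustration. A sequence is flagged when its nearest neighbors are not exclusively conspecific, which covers both a nearest neighbor from another species and ties that span several species.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences.
    Assumption: an uncorrected p-distance; the study may use another metric."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def flag_ambiguous(records):
    """records: list of (seq_id, species, sequence) tuples (hypothetical layout).
    Returns the ids whose nearest neighbors are not exclusively conspecific."""
    ambiguous = []
    for i, (seq_id, species, seq) in enumerate(records):
        # Distance from this sequence to every other sequence in the dataset.
        dists = [(p_distance(seq, other_seq), other_sp)
                 for j, (_, other_sp, other_seq) in enumerate(records) if j != i]
        if not dists:
            continue
        nearest = min(d for d, _ in dists)
        # Species of all neighbors tied at the minimum distance.
        neighbor_species = {sp for d, sp in dists if d == nearest}
        if neighbor_species != {species}:
            ambiguous.append(seq_id)
    return ambiguous


# Invented toy data: s2 (sp_A) and s3 (sp_B) share an identical barcode.
records = [
    ("s1", "sp_A", "ACGTACGT"),
    ("s2", "sp_A", "ACGTACGA"),
    ("s3", "sp_B", "ACGTACGA"),
    ("s4", "sp_B", "TTGTACCA"),
    ("s5", "sp_B", "TTGTACCA"),
]
print(flag_ambiguous(records))  # ['s1', 's2', 's3']
```

Under this reading, a species represented by a single individual is always flagged, because its nearest neighbor necessarily belongs to another species.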
File: Appendix_S2_Empirical_datasets.zip
Description: All collected empirical datasets.
- Files starting with "Emp_": Empirical datasets (1-6) used for testing the NBSI method in the study.
- Files starting with "EmpAmbSeqs_": Based on the "Emp_*" files, potentially ambiguous sequences whose genetic nearest neighbors belong to another species or more than two species are extracted from each empirical dataset, forming new corresponding empirical datasets.
File: Appendix_S3_Simulated_datasets_-_EnvSelected.zip
Description: The variables retained for niche modeling in every single test of simulated datasets.
- Files starting with "EnvSelected_": To limit the impact of multicollinearity when testing the NBSI method on simulated datasets, we removed redundant variables using the variance inflation factor (VIF) method. These files record the variables retained in each model-building instance for the corresponding simulated dataset.
- Files starting with "EnvSelected_AmbSeqs_": The same VIF-based selection applied when testing the NBSI method on simulated datasets composed of the ambiguous sequences. These files record the variables retained in each model-building instance for the corresponding ambiguous-sequence dataset.
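For readers unfamiliar with VIF-based variable selection, the following is a minimal Python sketch of one common stepwise variant. It is an illustration under stated assumptions, not the procedure used to produce these files: the threshold of 10 is a widely used rule of thumb rather than a value given here, the variable names (`bio1`, `bio1_dup`, `bio12`) and their values are invented, and all function names are hypothetical. It relies on the identity that the VIF of variable j equals the j-th diagonal element of the inverse correlation matrix.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centred = [[x - m for x in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(x * x for x in c)) for c in centred]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centred, norms)]
            for ci, ni in zip(centred, norms)]


def invert(matrix):
    """Gauss-Jordan matrix inverse with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfectly collinear variables")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each variable: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[i][i] for i in range(len(columns))]


def vif_select(data, threshold=10.0):
    """Drop the variable with the largest VIF until all VIFs are <= threshold.
    data: dict mapping variable name -> list of values.
    Assumption: threshold=10 is a rule of thumb, not taken from the study."""
    kept = dict(data)
    while len(kept) > 1:
        names = list(kept)
        scores = vif([kept[name] for name in names])
        worst = max(range(len(names)), key=lambda i: scores[i])
        if scores[worst] <= threshold:
            break
        del kept[names[worst]]
    return list(kept)


# Invented values: bio1_dup is a near-duplicate of bio1, bio12 is unrelated.
env = {
    "bio1": [1, 2, 3, 4, 5, 6, 7, 8],
    "bio1_dup": [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.9, 8.1],
    "bio12": [5, 3, 8, 1, 9, 2, 7, 4],
}
selected = vif_select(env)
print(selected)
```

In this toy example one of the two near-duplicate variables is removed and `bio12` is retained, which is the kind of per-run outcome the "EnvSelected_" files record.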
File: Appendix_S4_Empirical_datasets_-_EnvSelected.zip
Description: The variables retained for niche modeling in every single test of empirical datasets.
- Files starting with "EnvSelected_": To limit the impact of multicollinearity when testing the NBSI method on empirical datasets, we removed redundant variables using the variance inflation factor (VIF) method. These files record the variables retained in each model-building instance for the corresponding empirical dataset.
- Files starting with "EnvSelected_AmbSeqs_": The same VIF-based selection applied when testing the NBSI method on empirical datasets composed of the ambiguous sequences. These files record the variables retained in each model-building instance for the corresponding ambiguous-sequence dataset.
Code/software
File: NicheBarcoding_0.0.1.8000.tar.gz
Description: The source code of the R package "NicheBarcoding". Users can directly install and use it via RStudio.