Environmental DNA datasets of the Colombian Amazon and Orinoco basins


Martinelli Marín, Daniela; Lasso, Carlos A.; Caballero, Susana (2022), Environmental DNA datasets of the Colombian Amazon and Orinoco basins, Dryad, Dataset,


The massive loss of biodiversity in recent years has driven the development of rapid, cost-effective, non-invasive, and efficient sampling alternatives, such as environmental DNA (eDNA). With this method, a single water sample can be used to evaluate a community's diversity and to detect low-abundance, cryptic, and threatened species. In this study, environmental DNA was used to determine the diversity of aquatic, semi-aquatic, and terrestrial vertebrates in the Colombian Amazon and Orinoco basins, across four main subregions: Bojonawi Natural Reserve and adjacent areas (Vichada Department), Sierra de la Macarena National Park and Tillavá (Meta Department), Puerto Nariño and adjacent areas (Amazonas Department), and the Municipality of Solano (Caquetá Department). A total of 709 OTUs were identified across all locations. The Orinoco River showed the highest number of fish genera (68), and the Guayabero River the highest number of tetrapod genera (13). New taxonomic records were found at all locations, mainly in the Bita, Orinoco, and Tillavá rivers, which showed the largest amount of fish diversity not recorded by traditional surveys. Likewise, two vulnerable fish species and three vulnerable mammal species were identified, as well as four threatened mammal species, including the giant otter (Pteronura brasiliensis), the giant anteater (Myrmecophaga tridactyla), the two subspecies of the Amazon river dolphin (Inia geoffrensis geoffrensis and Inia geoffrensis humboldtiana), and the tucuxi (Sotalia fluviatilis). It is essential to improve current DNA sequence databases for the Neotropics and to standardize the methodology according to the animal group of interest, so that future studies can maximize the efficiency of environmental DNA analyses.


Thirty locations were sampled following the methodologies used by the NatureMetrics laboratory (NatureMetrics, 2019) and the studies of Lozano and Caballero (2020) and Caballero et al. (2021a). At each location, up to seven water subsamples (1 L each) were collected in a plastic bottle and poured into a bucket lined with a plastic bag. The bottle and the plastic bags had been thoroughly sterilized beforehand with 90% ethanol. Every water sample was taken wearing sterile gloves to avoid contamination with human DNA, and the plastic bag was changed after each sampling event to prevent water from different sampling locations from mixing. Water was collected every 10-20 m along a linear transect, travelled by boat on rivers, by canoe on lakes and lagoons, and on foot along streams, and the coordinates of each sampled point were recorded with a GPS unit (Garmin eTrex 12-channel GPS). Once all the subsamples for a location had been taken, filtration began using a NatureMetrics eDNA collection kit. A 60 mL syringe filled with the collected water was attached to a filter disk with a 0.8 μm pore size, and water was pushed through until the filter disk clogged and no more water could pass. The syringe was then detached, and a smaller syringe containing a preservation buffer was used to protect the filter and prevent DNA degradation. Each filter was stored in an envelope labelled with its field information and kept cool in a styrofoam cooler with ice packs.

Filters were transported to NatureMetrics (Egham, Surrey, England), where the laboratory procedures took place. DNA was extracted using the Qiagen DNeasy Blood & Tissue Kit (see the manufacturer's instructions), with some steps modified to increase DNA yield. The DNA was then purified using the DNeasy PowerClean Pro Cleanup kit to remove PCR inhibitors. DNA extracted from each filter was amplified in 12 replicate PCRs targeting the mitochondrial 12S rRNA gene, to target fish, as part of the eDNA Survey - Vertebrates pipeline (Milan et al. 2020). Tails complementary to the Illumina Nextera index primers were added at the 5′ end of the primers. The amplification mixture for each replicate contained 1X DreamTaq PCR Master Mix (Thermo Scientific), 0.4 μM of each tailed primer, 1 μL of DNA, and PCR-grade water (Thermo Scientific) up to a total reaction volume of 10 μL. All PCRs were run alongside both a negative control and a positive control sample (a mock community of known composition, made up of species not expected to occur in Colombia). PCR conditions consisted of an initial denaturation at 95°C for 2 min; 10 cycles of 20 s at 95°C, a 30 s touchdown annealing step starting at 60°C (-0.5°C per cycle), and 40 s at 72°C; 35 cycles of 20 s at 95°C, 30 s at 55°C, and 40 s at 72°C; and a final elongation step at 72°C for 5 min. Amplification success was determined by gel electrophoresis. Amplicons were pooled and purified with MagBind TotalPure NGS (Omega Biotek) magnetic beads at a ratio of 0.8:1 (beads:DNA) to remove primer dimers, and then quantified using a Qubit high-sensitivity kit according to the manufacturer's protocol. All purified index PCRs were pooled into a final library at equal concentrations. The final library was sequenced using an Illumina MiSeq V2 kit at 12 pM with a 10% PhiX spike-in.
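The thermocycling programme described above can be written out cycle by cycle to check the annealing temperatures and the total programmed hold time. The following Python lines are a minimal sketch, not the laboratory's own protocol file: the function names are hypothetical, the first touchdown cycle is assumed to anneal at 60°C with the 0.5°C decrement applied from the second cycle onward, and ramping time between steps is ignored.

```python
def annealing_temperatures(start_c=60.0, step_c=0.5, touchdown_cycles=10,
                           plateau_c=55.0, plateau_cycles=35):
    """Return the annealing temperature (deg C) for every PCR cycle.

    Assumption: cycle 1 of the touchdown phase anneals at start_c and each
    later touchdown cycle is step_c cooler than the one before it.
    """
    touchdown = [start_c - step_c * i for i in range(touchdown_cycles)]
    plateau = [plateau_c] * plateau_cycles
    return touchdown + plateau


def programmed_seconds(n_cycles, denature_s=20, anneal_s=30, extend_s=40,
                       initial_denature_s=120, final_extend_s=300):
    """Sum the programmed hold times, ignoring ramping between steps."""
    return initial_denature_s + n_cycles * (denature_s + anneal_s + extend_s) + final_extend_s


temps = annealing_temperatures()
total_s = programmed_seconds(len(temps))

print(len(temps))    # number of cycles in the programme
print(temps[:10])    # touchdown phase annealing temperatures
print(total_s / 60)  # programmed hold time in minutes
```

Under these assumptions the last touchdown cycle anneals at 55.5°C, so the fixed 55°C phase continues the same 0.5°C step.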
Sequence data were processed using a custom bioinformatics pipeline that used USEARCH v11 for quality filtering, dereplication, and taxonomic assignment (≥ 80% agreement in the overlap). Forward and reverse primers were trimmed from the merged sequences using cutadapt 2.3 (Martin, 1994; Mathon et al. 2021), and sequences were retained if the trimmed length was between 80 and 120 bp. These sequences were quality filtered to retain only those with an expected error rate per base of 0.01 or below, and were dereplicated by sample, retaining singletons. Unique reads from all samples were denoised in a single analysis with UNOISE (Dal Pont et al. 2021), requiring retained ZOTUs (zero-radius OTUs) to have a minimum abundance of 8. ZOTUs were clustered at 99% similarity. An OTU-by-sample table was generated by mapping all dereplicated reads for each sample to the OTU representative sequences at an identity threshold of 97%.
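The length and quality criteria above can be illustrated with a short Python sketch. It is not the USEARCH implementation: the function names are hypothetical, and the expected number of errors is computed with the standard Phred definition, in which a base with quality score Q has error probability 10^(-Q/10). The abundance step only mimics the minimum-abundance requirement and does not perform UNOISE denoising.

```python
from collections import Counter


def expected_errors(phred_scores):
    """Expected number of errors in a read: sum of per-base error probabilities."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def passes_read_filters(sequence, phred_scores,
                        min_len=80, max_len=120, max_error_rate=0.01):
    """Apply the 80-120 bp length window and the <= 0.01 per-base expected error rate."""
    if len(sequence) != len(phred_scores):
        raise ValueError("sequence and quality scores must have the same length")
    if not (min_len <= len(sequence) <= max_len):
        return False
    return expected_errors(phred_scores) / len(sequence) <= max_error_rate


def abundant_uniques(sequences, min_abundance=8):
    """Dereplicate sequences and keep unique ones seen at least min_abundance times."""
    counts = Counter(sequences)
    return {seq: n for seq, n in counts.items() if n >= min_abundance}


# Hypothetical reads: a clean 100 bp read, a noisy 100 bp read, a short read.
good = passes_read_filters("A" * 100, [30] * 100)
noisy = passes_read_filters("A" * 100, [15] * 100)
short = passes_read_filters("A" * 60, [40] * 60)
kept = abundant_uniques(["ACGT"] * 8 + ["TTGA"] * 7)

print(good, noisy, short)
print(kept)
```

A 100 bp read at Q30 throughout has about 0.1 expected errors (0.001 per base) and passes, whereas the same read at Q15 has about 0.03 expected errors per base and is discarded.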

Taxonomic information was added to each OTU by means of sequence similarity searches against the NCBI nt database (GenBank) and with PROTAX (Somervuo et al. 2016; Lozano and Caballero, 2021). Identifications from either source were accepted, and these were consistent at the taxonomic level at which they were made. Species-level and genus-level assignments were automatically retained if supported by unambiguous matches to reference sequences at ≥99% or ≥95% similarity, respectively. Where there were equally good matches to multiple species, public records from GBIF were used to assess which hits were most likely to be present in Colombia. This allowed numerous uncertain sequences to be resolved to species level. OTUs that were ≥99% similar and had similar co-occurrence patterns were combined with LULU (Frøslev et al. 2017), and the OTU table was then filtered to remove low-abundance OTUs from each sample (<0.05% or <10 reads). Finally, human, known food fish, and livestock sequences were removed, along with OTUs that could only be identified above order level.
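The similarity cut-offs and the per-sample abundance filter described above can be sketched in Python as follows. The function names and the example read counts are hypothetical. The sketch assumes that an OTU is dropped from a sample when it falls below either threshold (less than 0.05% of that sample's reads, or fewer than 10 reads), which is one reading of the text. It does not reproduce PROTAX, the GBIF check, or LULU.

```python
def rank_from_identity(percent_identity, species_min=99.0, genus_min=95.0):
    """Return the lowest rank retained automatically for a given match identity.

    Returns None when the match is below the genus cut-off, meaning the OTU
    would need a higher-level assignment by other means.
    """
    if percent_identity >= species_min:
        return "species"
    if percent_identity >= genus_min:
        return "genus"
    return None


def filter_low_abundance(sample_counts, min_fraction=0.0005, min_reads=10):
    """Keep OTUs with at least min_reads reads and at least min_fraction of the sample.

    sample_counts maps OTU identifiers to read counts for a single sample.
    """
    total = sum(sample_counts.values())
    if total == 0:
        return {}
    return {otu: n for otu, n in sample_counts.items()
            if n >= min_reads and n / total >= min_fraction}


# Hypothetical read counts for one sample (50,059 reads in total).
sample = {"OTU_1": 50000, "OTU_2": 20, "OTU_3": 9, "OTU_4": 30}
filtered = filter_low_abundance(sample)

print(rank_from_identity(99.4), rank_from_identity(96.0), rank_from_identity(90.0))
print(filtered)
```

In this example 0.05% of the sample is about 25 reads, so OTU_2 (20 reads) is removed by the proportional threshold and OTU_3 (9 reads) by the absolute one.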


Universidad de los Andes