Soundscapes and artificial intelligence provide powerful tools to track biodiversity recovery in tropical forests
Data files
Sep 07, 2023 version files 2.08 GB
- 100_S100_L001_R1_001.fastq.gz
- 100_S100_L001_R2_001.fastq.gz
- 101_S101_L001_R1_001.fastq.gz
- 101_S101_L001_R2_001.fastq.gz
- 102_S102_L001_R1_001.fastq.gz
- 102_S102_L001_R2_001.fastq.gz
- 103_S103_L001_R1_001.fastq.gz
- 103_S103_L001_R2_001.fastq.gz
- 104_S104_L001_R1_001.fastq.gz
- 104_S104_L001_R2_001.fastq.gz
- 105_S105_L001_R1_001.fastq.gz
- 105_S105_L001_R2_001.fastq.gz
- 106_S106_L001_R1_001.fastq.gz
- 106_S106_L001_R2_001.fastq.gz
- 107_S107_L001_R1_001.fastq.gz
- 107_S107_L001_R2_001.fastq.gz
- 108_S108_L001_R1_001.fastq.gz
- 108_S108_L001_R2_001.fastq.gz
- 109_S109_L001_R1_001.fastq.gz
- 109_S109_L001_R2_001.fastq.gz
- 110_S110_L001_R1_001.fastq.gz
- 110_S110_L001_R2_001.fastq.gz
- 111_S111_L001_R1_001.fastq.gz
- 111_S111_L001_R2_001.fastq.gz
- 112_S112_L001_R1_001.fastq.gz
- 112_S112_L001_R2_001.fastq.gz
- 113_S113_L001_R1_001.fastq.gz
- 113_S113_L001_R2_001.fastq.gz
- 114_S114_L001_R1_001.fastq.gz
- 114_S114_L001_R2_001.fastq.gz
- 115_S115_L001_R1_001.fastq.gz
- 115_S115_L001_R2_001.fastq.gz
- 116_S116_L001_R1_001.fastq.gz
- 116_S116_L001_R2_001.fastq.gz
- 117_S117_L001_R1_001.fastq.gz
- 117_S117_L001_R2_001.fastq.gz
- 118_S118_L001_R1_001.fastq.gz
- 118_S118_L001_R2_001.fastq.gz
- 119_S119_L001_R1_001.fastq.gz
- 119_S119_L001_R2_001.fastq.gz
- 120_S120_L001_R1_001.fastq.gz
- 120_S120_L001_R2_001.fastq.gz
- 121_S121_L001_R1_001.fastq.gz
- 121_S121_L001_R2_001.fastq.gz
- 122_S122_L001_R1_001.fastq.gz
- 122_S122_L001_R2_001.fastq.gz
- 123_S123_L001_R1_001.fastq.gz
- 123_S123_L001_R2_001.fastq.gz
- 124_S124_L001_R1_001.fastq.gz
- 124_S124_L001_R2_001.fastq.gz
- 125_S125_L001_R1_001.fastq.gz
- 125_S125_L001_R2_001.fastq.gz
- 126_S126_L001_R1_001.fastq.gz
- 126_S126_L001_R2_001.fastq.gz
- 127_S127_L001_R1_001.fastq.gz
- 127_S127_L001_R2_001.fastq.gz
- 128_S128_L001_R1_001.fastq.gz
- 128_S128_L001_R2_001.fastq.gz
- 129_S129_L001_R1_001.fastq.gz
- 129_S129_L001_R2_001.fastq.gz
- 130_S130_L001_R1_001.fastq.gz
- 130_S130_L001_R2_001.fastq.gz
- 131_S131_L001_R1_001.fastq.gz
- 131_S131_L001_R2_001.fastq.gz
- 132_S132_L001_R1_001.fastq.gz
- 132_S132_L001_R2_001.fastq.gz
- 133_S133_L001_R1_001.fastq.gz
- 133_S133_L001_R2_001.fastq.gz
- 134_S134_L001_R1_001.fastq.gz
- 134_S134_L001_R2_001.fastq.gz
- 135_S135_L001_R1_001.fastq.gz
- 135_S135_L001_R2_001.fastq.gz
- 136_S136_L001_R1_001.fastq.gz
- 136_S136_L001_R2_001.fastq.gz
- 137_S137_L001_R1_001.fastq.gz
- 137_S137_L001_R2_001.fastq.gz
- 138_S138_L001_R1_001.fastq.gz
- 138_S138_L001_R2_001.fastq.gz
- 139_S139_L001_R1_001.fastq.gz
- 139_S139_L001_R2_001.fastq.gz
- 140_S140_L001_R1_001.fastq.gz
- 140_S140_L001_R2_001.fastq.gz
- 141_S141_L001_R1_001.fastq.gz
- 141_S141_L001_R2_001.fastq.gz
- 145_S145_L001_R1_001.fastq.gz
- 145_S145_L001_R2_001.fastq.gz
- 146_S146_L001_R1_001.fastq.gz
- 146_S146_L001_R2_001.fastq.gz
- 147_S147_L001_R1_001.fastq.gz
- 147_S147_L001_R2_001.fastq.gz
- 148_S148_L001_R1_001.fastq.gz
- 148_S148_L001_R2_001.fastq.gz
- 149_S149_L001_R1_001.fastq.gz
- 149_S149_L001_R2_001.fastq.gz
- 150_S150_L001_R1_001.fastq.gz
- 150_S150_L001_R2_001.fastq.gz
- 151_S151_L001_R1_001.fastq.gz
- 151_S151_L001_R2_001.fastq.gz
- 152_S152_L001_R1_001.fastq.gz
- 152_S152_L001_R2_001.fastq.gz
- 153_S153_L001_R1_001.fastq.gz
- 153_S153_L001_R2_001.fastq.gz
- 154_S154_L001_R1_001.fastq.gz
- 154_S154_L001_R2_001.fastq.gz
- 155_S155_L001_R1_001.fastq.gz
- 155_S155_L001_R2_001.fastq.gz
- 156_S156_L001_R1_001.fastq.gz
- 156_S156_L001_R2_001.fastq.gz
- 157_S157_L001_R1_001.fastq.gz
- 157_S157_L001_R2_001.fastq.gz
- 158_S158_L001_R1_001.fastq.gz
- 158_S158_L001_R2_001.fastq.gz
- 159_S159_L001_R1_001.fastq.gz
- 159_S159_L001_R2_001.fastq.gz
- 160_S160_L001_R1_001.fastq.gz
- 160_S160_L001_R2_001.fastq.gz
- 161_S161_L001_R1_001.fastq.gz
- 161_S161_L001_R2_001.fastq.gz
- 162_S162_L001_R1_001.fastq.gz
- 162_S162_L001_R2_001.fastq.gz
- 163_S163_L001_R1_001.fastq.gz
- 163_S163_L001_R2_001.fastq.gz
- 164_S164_L001_R1_001.fastq.gz
- 164_S164_L001_R2_001.fastq.gz
- 165_S165_L001_R1_001.fastq.gz
- 165_S165_L001_R2_001.fastq.gz
- 166_S166_L001_R1_001.fastq.gz
- 166_S166_L001_R2_001.fastq.gz
- 167_S167_L001_R1_001.fastq.gz
- 167_S167_L001_R2_001.fastq.gz
- 168_S168_L001_R1_001.fastq.gz
- 168_S168_L001_R2_001.fastq.gz
- 169_S169_L001_R1_001.fastq.gz
- 169_S169_L001_R2_001.fastq.gz
- 170_S170_L001_R1_001.fastq.gz
- 170_S170_L001_R2_001.fastq.gz
- 171_S171_L001_R1_001.fastq.gz
- 171_S171_L001_R2_001.fastq.gz
- 172_S172_L001_R1_001.fastq.gz
- 172_S172_L001_R2_001.fastq.gz
- 173_S173_L001_R1_001.fastq.gz
- 173_S173_L001_R2_001.fastq.gz
- 174_S174_L001_R1_001.fastq.gz
- 174_S174_L001_R2_001.fastq.gz
- 175_S175_L001_R1_001.fastq.gz
- 175_S175_L001_R2_001.fastq.gz
- 176_S176_L001_R1_001.fastq.gz
- 176_S176_L001_R2_001.fastq.gz
- 177_S177_L001_R1_001.fastq.gz
- 177_S177_L001_R2_001.fastq.gz
- 178_S178_L001_R1_001.fastq.gz
- 178_S178_L001_R2_001.fastq.gz
- 179_S179_L001_R1_001.fastq.gz
- 179_S179_L001_R2_001.fastq.gz
- 180_S180_L001_R1_001.fastq.gz
- 180_S180_L001_R2_001.fastq.gz
- 181_S181_L001_R1_001.fastq.gz
- 181_S181_L001_R2_001.fastq.gz
- 182_S182_L001_R1_001.fastq.gz
- 182_S182_L001_R2_001.fastq.gz
- 183_S183_L001_R1_001.fastq.gz
- 183_S183_L001_R2_001.fastq.gz
- 184_S184_L001_R1_001.fastq.gz
- 184_S184_L001_R2_001.fastq.gz
- 185_S185_L001_R1_001.fastq.gz
- 185_S185_L001_R2_001.fastq.gz
- 186_S186_L001_R1_001.fastq.gz
- 186_S186_L001_R2_001.fastq.gz
- 187_S187_L001_R1_001.fastq.gz
- 187_S187_L001_R2_001.fastq.gz
- 188_S188_L001_R1_001.fastq.gz
- 188_S188_L001_R2_001.fastq.gz
- 189_S189_L001_R1_001.fastq.gz
- 189_S189_L001_R2_001.fastq.gz
- 97_S97_L001_R1_001.fastq.gz
- 97_S97_L001_R2_001.fastq.gz
- 98_S98_L001_R1_001.fastq.gz
- 98_S98_L001_R2_001.fastq.gz
- 99_S99_L001_R1_001.fastq.gz
- 99_S99_L001_R2_001.fastq.gz
- BIOINFORMATIC_INFORMATION.xlsx
- Nocturnal_Arthropod_Communities.xlsx
- README.md
Abstract
Tropical forest recovery is fundamental to addressing the intertwined climate and biodiversity loss crises. While regenerating trees sequester carbon relatively quickly, the pace of biodiversity recovery remains contentious. Here, we use bioacoustics and meta-barcoding to measure forest recovery post-agriculture in a global biodiversity hotspot in Ecuador. We show that the community composition, and not species richness, of vocalizing vertebrates identified by experts reflects the restoration gradient. Two automated measures – an acoustic index model and a bird community derived from an independently developed Convolutional Neural Network – correlated well with restoration (adj-R² = 0.62 and 0.69, respectively). Importantly, both measures reflected composition of non-vocalizing nocturnal insects identified via meta-barcoding. We show that such automated monitoring tools, based on new technologies, can effectively monitor the success of forest recovery, using robust and reproducible data. Crucially, this will help ensure that forest restoration efforts result in resilient, biodiverse tropical forests and not simply ‘carbon farms’.
README: DNA Metabarcoding RAW FASTQ data for "Soundscapes and artificial intelligence provide powerful tools to track biodiversity recovery in tropical forests"
This dataset contains demultiplexed Illumina FASTQ files, a sample sheet for processing these files, and an Excel file listing all detected BINs with identification success rates and taxonomy retrieved from a sequence BLAST against BOLD (www.boldsystems.org).
Demultiplexed RAW FASTQ Illumina sequence data files used for analysis of the dataset.
Use the "BIOINFORMATIC_INFORMATION.xlsx" file to run primer trimming in your bioinformatic pipeline. It lists each FASTQ file together with the corresponding primer sequences. To generate an annotated OTU table, follow the bioinformatics methods described below.
Primers used: CO1, Leray et al., 2013 (https://frontiersinzoology.biomedcentral.com/articles/10.1186/1742-9994-10-34)
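Before trimming, the forward (R1) and reverse (R2) files of each sample need to be matched up. The following is a minimal sketch of that bookkeeping step, assuming only the Illumina-style file naming visible in the file list above; the function name `pair_fastq_files` is hypothetical and is not part of the published pipeline.

```python
import re
from collections import defaultdict

# Illumina-style name: <sample>_S<index>_L<lane>_R<read>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>\d+)_S\d+_L\d{3}_R(?P<read>[12])_001\.fastq\.gz$"
)


def pair_fastq_files(filenames):
    """Return {sample_id: (R1_file, R2_file)} for samples with both reads."""
    by_sample = defaultdict(dict)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match:  # silently skip non-FASTQ files such as README.md
            by_sample[match.group("sample")][match.group("read")] = name

    pairs = {}
    # Sort numerically so that sample 97 comes before sample 100.
    for sample in sorted(by_sample, key=int):
        reads = by_sample[sample]
        if "1" in reads and "2" in reads:
            pairs[sample] = (reads["1"], reads["2"])
    return pairs


files = [
    "100_S100_L001_R1_001.fastq.gz",
    "100_S100_L001_R2_001.fastq.gz",
    "97_S97_L001_R2_001.fastq.gz",
    "97_S97_L001_R1_001.fastq.gz",
    "README.md",
]
pairs = pair_fastq_files(files)
for sample, (r1, r2) in pairs.items():
    print(sample, r1, r2)
```

The resulting sample IDs can then be looked up in BIOINFORMATIC_INFORMATION.xlsx to find the primer sequences for each pair; the column layout of that spreadsheet is not described here, so the lookup itself is left out.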
Excel File containing detected BINs
Within the file "Nocturnal_Arthropod_Communities.xlsx" all of the detected BINs (Barcode Index Numbers) along with Hit-%-identity scores and taxonomy are listed. Presence of each BIN in the samples is indicated by read numbers.
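Because presence is encoded as read numbers, a common first step is to convert the table to presence/absence. The sketch below assumes the spreadsheet has already been loaded into a nested dictionary of the form `{BIN: {sample: reads}}`; the BIN identifiers, sample names, and read counts shown are made up for illustration, and the actual column layout of the Excel file may differ.

```python
def presence_absence(read_counts):
    """Convert {bin: {sample: reads}} into {bin: {sample: 0 or 1}}."""
    return {
        bin_id: {sample: int(reads > 0) for sample, reads in samples.items()}
        for bin_id, samples in read_counts.items()
    }


def bins_per_sample(read_counts):
    """Count how many BINs were detected (reads > 0) in each sample."""
    richness = {}
    for samples in read_counts.values():
        for sample, reads in samples.items():
            richness[sample] = richness.get(sample, 0) + int(reads > 0)
    return richness


# Hypothetical BIN IDs and read numbers, for illustration only.
example = {
    "BOLD:AAA0001": {"sample_97": 1520, "sample_98": 0},
    "BOLD:AAA0002": {"sample_97": 12, "sample_98": 340},
}
print(presence_absence(example))
print(bins_per_sample(example))
```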
Bioinformatics methods:
Paired-end reads were merged using the -fastq_mergepairs utility of the USEARCH suite v11.0.667_i86linux32 (Edgar, 2010) with the following parameters: -fastq_maxdiffs 99, -fastq_pctid 75, -fastq_trunctail 0. Adapter sequences were removed using CUTADAPT [75] (single-end mode, with default parameters). All sequences that did not contain the appropriate adapter sequences were filtered out in this step using the --discard-untrimmed parameter. The remaining pre-processing steps (quality filtering, dereplication, chimera filtering, and pre-clustering) were carried out using the VSEARCH suite v2.9.1 [76]. Quality filtering was performed using the --fastq_filter VSEARCH utility (parameters: --fastq_maxee 1, --minlen 300). Sequences were dereplicated with --derep_fulllength (parameters: --sizeout, --relabel Uniq), first at the sample level, and then at the combined dataset level after concatenating all sample files into one large FASTA file, which was also filtered for singletons (sequences occurring only once in the entire dataset and a priori considered as noise; parameters: --minuniquesize 2, --sizein, --sizeout, --fasta_width 0). To save processing power, a pre-clustering step (at 98% identity) was employed before chimera filtering using the --cluster_size VSEARCH utility with the centroids algorithm (parameters: --id 0.98, --strand plus, --sizein, --sizeout, --fasta_width 0, --centroids). Chimeric sequences were then detected and filtered out from the resulting file using the VSEARCH --uchime_denovo utility (parameters: --sizein, --sizeout, --fasta_width 0, --nonchimeras).
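To make the quality-filtering criterion concrete, the sketch below computes the expected number of errors of a read, i.e. the sum of the per-base error probabilities 10^(-Q/10), and applies the two thresholds given above (at most 1 expected error, at least 300 bases). It illustrates the criterion only and does not reimplement VSEARCH; the function names are hypothetical, and Phred+33 quality encoding is an assumption.

```python
def expected_errors(quality_string, phred_offset=33):
    """Sum of per-base error probabilities, 10 ** (-Q / 10).

    Assumes Phred+33 ASCII encoding of the FASTQ quality line.
    """
    return sum(10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string)


def passes_quality_filter(sequence, quality_string, max_ee=1.0, min_len=300):
    """Keep a read if it is long enough and has at most max_ee expected errors."""
    if len(sequence) < min_len:
        return False
    return expected_errors(quality_string) <= max_ee


# 'I' encodes Q40 (error probability 1e-4); '#' encodes Q2 (about 0.63).
good_read = ("A" * 300, "I" * 300)
noisy_read = ("A" * 300, "I" * 298 + "##")
short_read = ("A" * 299, "I" * 299)

for name, (seq, qual) in [("good", good_read), ("noisy", noisy_read), ("short", short_read)]:
    print(name, round(expected_errors(qual), 3), passes_quality_filter(seq, qual))
```

Two low-quality bases are enough to push an otherwise clean read over the threshold, which is why a maximum of 1 expected error is a fairly strict setting.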
A custom Perl script obtained from the authors of VSEARCH (see https://github.com/torognes/vsearch/wiki/VSEARCH-pipeline) was then used to regenerate the concatenated FASTA file, but without the previously detected chimeric sequences. The resulting chimera-filtered file was then used to cluster the reads into OTUs using SWARM v3.1.0 [77] (parameters: -d 13 -z). The value for the d parameter was chosen based on the experiments of Antich et al. [78]. The OTU representative sequences were then sorted using VSEARCH (parameters: --fasta_width 0 --sortbysize) and an OTU table was constructed from the resulting FASTA file using the VSEARCH utility --usearch_global (parameters: --strand plus --sizein --sizeout --fasta_width 0). To reduce the risk of false positives, a cleaning step was employed that excluded read counts in the OTU table constituting < 0.01% of the total number of reads in the sample. Reads were additionally removed based on negative control samples: if the number of reads for an OTU in any sample was less than the maximum among negative controls, those reads were excluded from further analysis. OTU representative sequences were blasted (parameters: program: Megablast; maximum hits: 1; scoring (match mismatch): 1-2; gap cost (open extend): linear; max E-value: 10; word size: 28; max target seqs 100) against (1) a custom database downloaded from GenBank (a local copy of the NCBI nucleotide database downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/), and (2) a custom database built from data downloaded from BOLD (www.boldsystems.org) [79,80], including taxonomy and BIN information, by means of Geneious (v.10.2.5 – Biomatters, Auckland, New Zealand). All available Animalia data was downloaded from the BOLD database on 29 July 2022 using the available public data API (http://www.boldsystems.org/index.php/resources/api) in a combined TSV file format.
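The two cleaning rules for the OTU table can be sketched as follows. This is an illustration under stated assumptions, not the script used for the dataset: the table is assumed to be a nested dictionary `{otu: {sample: reads}}`, sample totals are computed before any cleaning, the negative-control maximum is taken per OTU, and the OTU and sample names are invented.

```python
def clean_otu_table(table, negative_controls, min_fraction=0.0001):
    """Zero out read counts that fail either cleaning rule.

    Rule 1: count is below min_fraction (0.01%) of the sample's total reads.
    Rule 2: count is below the OTU's maximum count among negative controls
            (per-OTU maximum is an assumption of this sketch).
    Negative-control columns are dropped from the returned table.
    """
    samples = {s for counts in table.values() for s in counts}
    totals = {s: sum(c.get(s, 0) for c in table.values()) for s in samples}

    cleaned = {}
    for otu, counts in table.items():
        neg_max = max((counts.get(nc, 0) for nc in negative_controls), default=0)
        cleaned[otu] = {}
        for sample, reads in counts.items():
            if sample in negative_controls:
                continue
            too_rare = reads < min_fraction * totals[sample]
            below_control = reads < neg_max
            cleaned[otu][sample] = 0 if (too_rare or below_control) else reads
    return cleaned


# Invented counts: NC is a negative control.
otu_table = {
    "OTU_1": {"S1": 50000, "S2": 3, "NC": 0},
    "OTU_2": {"S1": 4, "S2": 49997, "NC": 0},
    "OTU_3": {"S1": 8, "S2": 600, "NC": 10},
}
cleaned_table = clean_otu_table(otu_table, negative_controls={"NC"})
for otu, counts in cleaned_table.items():
    print(otu, counts)
```

In this example the counts 3 and 4 fall below 0.01% of their sample totals, while the count 8 passes that threshold but is removed because the same OTU reached 10 reads in the negative control.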
The combined TSV file was then filtered to keep only the records that: (1) had a sequence (field 72, “nucleotides”); (2) had a sequence that did not consist exclusively of one or more “-” (hyphens); (3) had a sequence that did not contain non-IUPAC characters; (4) belonged to COI (the pattern “COI-5P” in either field 70 (“markercode”) or field 80 (“marker_codes”)); (5) had an available BIN (field 8, “bin_uri”). In (5), an exception was made in cases where the species belonging to that record did not occur with a BIN elsewhere in the dataset. In other words, “BIN-less” records were kept if their species were also completely BIN-less in the dataset. The dataset was then filtered to include only South American records, in the following way: (1) records were kept that contained, in field 55 (“country”), the South American country names: Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Falkland Islands, French Guiana, Guiana, Paraguay, Peru, Suriname, Uruguay, and Venezuela; (2) records were additionally kept if their latitude (field 47, “lat”) was between -58.4 and 17 and their longitude (field 48, “lon”) was between -85.8 and -30.3. These values were found by taking the extreme north (Punta Gallinas), south (Cape Froward), east (Ponta do Seixas), and west (Punta Parinas) points of the continent. As a buffer, 500 km were added due north, south, east, and west, respectively, of those geographic points using the “Measure on Map” function of SunEarthTools.com. It was then noted that a large part of the dataset, thus filtered, also held records from several Central American countries, in particular Costa Rica, whose biodiversity on BOLD dwarfs that of all South American countries. Thus, a decision was made to additionally include all remaining records from Costa Rica.
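The geographic filter described above can be sketched as a single predicate. The country list and coordinate limits are copied from the text; everything else is an assumption of this sketch: records are represented as dictionaries keyed by the BOLD field names “country”, “lat” and “lon”, the limits are treated as inclusive, and records with missing coordinates are kept only if their country name matches.

```python
SOUTH_AMERICAN_COUNTRIES = {
    "Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador",
    "Falkland Islands", "French Guiana", "Guiana", "Paraguay", "Peru",
    "Suriname", "Uruguay", "Venezuela",
}
LAT_RANGE = (-58.4, 17.0)   # extreme south/north points plus 500 km buffer
LON_RANGE = (-85.8, -30.3)  # extreme west/east points plus 500 km buffer


def _to_float(value):
    """Parse a coordinate; return None for empty or malformed values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def keep_record(record):
    """Return True if a record passes the country or bounding-box rule."""
    country = (record.get("country") or "").strip()
    if country in SOUTH_AMERICAN_COUNTRIES or country == "Costa Rica":
        return True
    lat, lon = _to_float(record.get("lat")), _to_float(record.get("lon"))
    if lat is None or lon is None:
        return False
    return (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
            and LON_RANGE[0] <= lon <= LON_RANGE[1])


# Illustrative records, not taken from the BOLD download.
records = [
    {"country": "Ecuador", "lat": "", "lon": ""},
    {"country": "Panama", "lat": "9.0", "lon": "-79.5"},
    {"country": "Costa Rica", "lat": "", "lon": ""},
    {"country": "Mexico", "lat": "19.4", "lon": "-99.1"},
]
kept = [r["country"] for r in records if keep_record(r)]
print(kept)
```

The Panama record shows why Central American records entered the filtered dataset: its coordinates fall inside the buffered bounding box even though its country name is not on the list.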
Finally, a FASTA file annotated with a Process ID (field 1, “processid”), BIN (field 8), taxonomy (fields 10, 12, 14, 16, 18, 20, 22 - “phylum_name”, “class_name”, “order_name”, “family_name”, “subfamily_name”, “genus_name”, “species_name”), geo location data (fields 47, 48, 55), and GenBank ID (field 71, “genbank_accession”) was created from the filtered combined TSV file, and then converted into a BLAST database using Geneious v10.2.6 (Biomatters, Auckland, New Zealand). The results were exported and further processed according to methods described by Uhler et al. [65]. Briefly, the resulting CSV files, which included the OTU ID, BOLD Process ID, BIN, Hit-%-ID value (percentage of overlap similarity (identical base pairs) of an OTU query sequence with its closest counterpart in the database), Grade-%-ID value (combining query coverage, E-value and identity values for each hit with weights of 0.5, 0.25 and 0.25 respectively, allowing determination of the longest, highest-identity hits), the length of the top BLAST hit sequence, as well as the phylum, class, order, family, genus and species information for each detected OTU, were exported from Geneious and combined with the OTU table generated by the bioinformatic pre-processing pipeline. As an additional measure of control other than BLAST, the OTUs were classified into taxa using the Ribosomal Database Project (RDP) naïve Bayesian classifier [81] trained on a cleaned COI dataset of Arthropods and Chordates (plus outgroups; see Porter & Hajibabaei [82]). OTUs were also annotated with the taxonomic information from the NCBI (downloaded from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/), followed by the creation of a taxonomic consensus between BOLD, NCBI and RDP.
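The text does not specify how the taxonomic consensus between BOLD, NCBI and RDP was computed. Purely as an illustration of the idea, the sketch below uses a simple majority rule that is an assumption of this example: a name is accepted at a rank if at least two of the three sources agree, and the consensus stops at the first rank without agreement. The function name and the example assignments are hypothetical.

```python
from collections import Counter

RANKS = ("phylum", "class", "order", "family", "genus", "species")


def consensus_taxonomy(assignments, min_votes=2):
    """Build a rank-by-rank consensus from {source: {rank: name}}.

    Assumed rule (not from the source text): accept a name when at least
    min_votes sources agree, and stop at the first rank without agreement.
    """
    consensus = {}
    for rank in RANKS:
        votes = Counter(a[rank] for a in assignments.values() if a.get(rank))
        name, count = votes.most_common(1)[0] if votes else ("", 0)
        if count < min_votes:
            break
        consensus[rank] = name
    return consensus


# Hypothetical assignments for one OTU.
otu_assignments = {
    "BOLD": {"phylum": "Arthropoda", "class": "Insecta", "order": "Lepidoptera",
             "family": "Erebidae", "genus": "Eucereon", "species": ""},
    "NCBI": {"phylum": "Arthropoda", "class": "Insecta", "order": "Lepidoptera",
             "family": "Erebidae", "genus": "", "species": ""},
    "RDP": {"phylum": "Arthropoda", "class": "Insecta", "order": "Lepidoptera",
            "family": "Noctuidae", "genus": "Hypocladia", "species": ""},
}
consensus = consensus_taxonomy(otu_assignments)
print(consensus)
```

Here the three sources agree down to order, two of three agree on family, and no two agree on genus, so the consensus ends at family level.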
Methods
DNA was extracted from 200-µL aliquots using the DNeasy blood & tissue kit (Qiagen) following the manufacturer’s instructions. Multiplex PCR was performed using 5 µL of extracted genomic DNA, Plant MyTAQ (Bioline, Luckenwalde, Germany) and high-throughput sequencing (HTS)-adapted mini-barcode primers targeting the mitochondrial CO1-5P region (mlCOIintF: 5’-GWACWGGWTGAACWGTWTAYCCYCC-3’; dgHCO2198: 5’-TAAACTTCAGGGTGACCAAARAAYCA-3’; following Leray et al., 2013; also see Morinière et al.; Morinière et al.). Amplification success and fragment length were determined using gel electrophoresis. The amplified DNA was cleaned and each sample was resuspended in 50 µL of molecular water. Illumina Nextera XT (Illumina Inc., San Diego, USA) indices were ligated to the samples in a second PCR, conducted at the same annealing temperature as in the first but with only seven cycles. Ligation success was confirmed by gel electrophoresis. DNA concentrations were measured using a Qubit fluorometer (Life Technologies, Carlsbad, USA), and the samples were then combined into 40-µL pools containing equimolar concentrations of 100 ng each. The pooled DNA was purified using MagSi-NGSprep Plus beads (Steinbrenner Laborsysteme GmbH, Wiesenbach, Germany). The final elution volume was 20 µL. HTS was performed on an Illumina MiSeq using v3 chemistry (2 × 300 bp, 600 cycles, maximum of 25 million paired-end reads).