DNA metabarcoding data from faecal samples of the lesser (Myotis blythii) and the greater (Myotis myotis) mouse-eared bats from Bulgaria
Data files
Jul 14, 2023 version files 3.55 MB
-
mmyotis-mblythii-metabarcoding-data-bulgaria.xlsx
3.54 MB
-
README.md
2.56 KB
Jul 25, 2023 version files 1.15 GB
-
2023_Data_Paper_raw_FASTQ_files.zip
1.14 GB
-
mmyotis-mblythii-metabarcoding-data-bulgaria.xlsx
3.54 MB
-
README.md
5.02 KB
Abstract
A comprehensive understanding of trophic interactions in terrestrial ecosystems is crucial for ecological research and conservation. Recent advances in non-invasive methods, such as DNA metabarcoding, have enabled researchers to collect vast amounts of data on wild animal diets. However, sharing this data and metadata effectively and transparently presents new challenges. To address this, a new type of scholarly journal publication has emerged that aims to describe datasets rather than report research investigations. In this paper, we present a dataset of consumed prey species and parasites based on the metabarcoding of 113 faecal samples from the greater and lesser mouse-eared bats (Myotis myotis and Myotis blythii), along with a detailed description of the data sampling, laboratory analysis, and bioinformatics pipeline. Our dataset comprises 1018 unique Barcode Index Numbers (BINs) from 12 Classes and 43 Orders. In addition, we provide interactive Krona charts to visually summarize the taxonomic relationships and relative read abundance of the consumed prey species and parasites. This data can be used for meta-analysis, exploring new predator-prey and host-parasite interactions, studying inter and intraspecific ecological interactions, and informing protected area management, among other applications. By sharing this dataset, we hope to encourage other researchers to use it to answer additional ecological questions and advance our understanding of trophic interactions in terrestrial ecosystems.
Methods
Sampling
Geographic Coverage
Faecal samples were collected from individual bats at the entrance of the Orlova Chucka cave, Pepelina, Dve Mogili District, Bulgaria (N 43.593240 E 25.960108). The cave is inhabited by 15 bat species all year round. In summer, however, it is predominantly occupied by mixed maternity colonies of M. myotis and M. blythii, as well as Rhinolophus euryale and Rhinolophus mehelyi (Borissov 2010). Mouse-eared bats are highly mobile with a hunting range of about 23 km around the cave (Egert-Berg et al 2018, Stidsholt et al. 2023). While we collected the faecal samples at the cave entrance, our effective sampling area thus matches the foraging area of the bats, covering an area of approximately 1600 km2. Notably, a large proportion of the foraging grounds of the bats are in protected areas including NATURA 2000 sites and the Rusenski Lom Natural Park (Borissov 2009). The preferred foraging sites of the mouse-eared bats in the study area consist of small-scale agricultural areas, forests, open grasslands, karstic areas, and riverine habitats.
Temporal Coverage
Samples were collected from June to August 2017 and 2018. Our period covers the lactation and post-lactation period of the female bats, during which they have to forage more actively to provide enough nutrition to both themselves and the pup.
Sampling Methodology
Bats were captured in the morning (when returning to the roost after foraging) with a harp trap placed in front of the cave entrance. We emptied the trap every 5 to 10 minutes to minimise defecation in the trap, and thus potential cross-contamination between individuals by faeces attached to the fur. However, we could not fully prevent bats from defecating in the trap, therefore, a small proportion of cross-contamination between the different individuals might have occurred. After being removed from the trap, bats were placed in individual cotton bags until they defecated. Prior to data collection, the bags were brushed from previous guano and washed at 90°C with bleach. After the bats had defecated in the bags, they were measured, sexed, and identified to species level following Dietz et al. 2004. To avoid misidentification, however, we used a conservative approach and only sampled individuals that could be clearly identified based on morphological measurements, identifying individuals as M. myotis if the length of the upper jaw (i.e., from the canine to the third molar, CM3) was >9.4 mm and the forearm length (FA) >61 mm, and as M. blythii for CM3 <9.0 mm and FA <59 mm. The guano pellets were placed in 2 ml Eppendorf tubes with 98% ethanol, which were subsequently stored in a freezer at -18°C until further treatment. Bat catching and sample collection were performed under permit granted by the Ministry of Environment and Waters, Bulgaria and under the control of the Regional Environment and Water Inspection Ruse (permit number 696/19.01.2017).
Laboratory procedures
DNA extraction, amplification, and metabarcoding
DNA metabarcoding was conducted at the AIM Lab (AIM—Advanced Identification Methods GmbH, Leipzig, Germany). Genomic data was extracted using the Quick-DNA Fecal/Soil Microbe 96 Kit (Zymo Research Corporation, Irvine CA, USA) and following the manufacturer's instructions. To control for artifacts arising from lab contamination, we ran 6 empty vials as negative control samples through the lab procedure: 2 before extraction, and 2 before each of the two rounds of PCR. These negative control samples were processed in the same way as the faecal samples. Further laboratory analyses were carried out as per the methods described in Uhler et al. (2022) using high-throughput sequencing (HTS)-adapted mini-barcode primers (mlCOIintF, dgHCO, Leray et al. 2013) targeting the mitochondrial CO1-5P region. HTS was performed on an Illumina MiSeq (llumina Inc., San Diego, USA) “v3 chemistry” (2 × 300 base pairs, 600 cycles, maximum of 25 million paired end reads).
Bioinformatics
Preprocessing of raw Illumina reads
From each sample, paired-end reads were merged using the -fastq_mergepairs utility of USEARCH v11.0.667 (Edgar, 2010) with the following parameters: -fastq_maxdiffs 99, ‑fastq_pctid 75, -fastq_trunctail 0. Next, adapter sequences were removed using CUTADAPT (Martin, 2011) (single-end mode, with default parameters). Reads that did not contain the appropriate adapter sequences were filtered out in this step using CUTADAPT's --discard-untrimmed option. The remaining pre-processing steps (quality filtering, dereplication, chimera filtering, and clustering) were carried out using the VSEARCH suite v2.9.1 (Rognes et al., 2016).
Quality filtering was performed using --fastq_filter, allowing a maximum of 1 expected error along the length of the sequence and a minimum read length of 300 bases (parameters: --fastq_maxee 1, --minlen 300). This was followed by dereplication on the sample level using --derep_fulllength, keeping only a single copy of each unique sequence (parameters: --sizeout, --relabel Uniq). Cleaned and dereplicated sample files were concatenated into one large FASTA file, which was then dereplicated again, and also filtered for sequences occurring only once in the entire dataset (singletons) with the parameters --minuniquesize 2, --sizein , --sizeout , --fasta_width 0.
To save processing power, a clustering step (at 98% identity) was employed before chimera filtering using the VSEARCH utility --cluster_size and the centroids algorithm (parameters: --id 0.98, --strand plus, --sizein, -- sizeout, --fasta_width 0, --centroids). Chimeric sequences were then detected and filtered out from the resulting file using the VSEARCH -- uchime_denovo utility (parameters: --sizein, --sizeout, --fasta_width 0, -- nonchimeras). Next, a perl script obtained from the authors of VSEARCH (https://github.com/torognes/vsearch/wiki/VSEARCH-pipeline) was used to regenerate the concatenated FASTA file, but without the subsequently detected chimeric sequences. The resulting chimera-filtered file was then used to cluster the reads into operational taxonomic units (OTUs) using SWARM v.3.1.0 (Mahé et al. 2021; parameters: -d13 -z). The value for the d parameter was chosen based on the results for the mitochondrial cytochrome c oxidase subunit I (COI) mini-barcode (Leray et al. 2013) from in silico experiments performed by Antich et al. (2021). The representative sequences of each OTU cluster were then sorted using VSEARCH (parameters: --fasta_width 0 -- sortbysize). An OTU table was constructed from the resulting FASTA file using the VSEARCH utility --usearch_global (parameters: --strand plus -- sizein --sizeout --fasta_width 0).
To reduce the risk of false positives, a cleaning step was employed that excluded read counts in the OTU table constituting <0.01% of the total number of reads in the sample. OTUs were additionally removed from the results based on negative control samples. If the number of reads for the OTU in any sample was less than the maximum for that OTU among negative controls, those reads were excluded from further analysis.
BLAST, reference database construction, and annotation
OTU representative sequences were blasted with the program Megablast (parameters: maximum hits: 1; scoring (match mismatch): 1-2; gap cost (open extend): linear; max E-value: 10; word size: 28; max target seqs 100) against (1) a custom database downloaded from GenBank (a local copy of the NCBI nucleotide database downloaded from ftp:// ftp.ncbi.nlm.nih.gov/blast/db/), and (2) a custom database built from data downloaded from BOLD (www.boldsystems.org) (Ratnasingham & Hebert, 2007, 2013) including taxonomy and BIN information. BLAST searches were performed using the GUI software suite Geneious (v.10.2.5 – Biomatters, Auckland, New Zealand).
All available Animalia data was downloaded from the BOLD database on 29 July 2022 using the available public data API (http:// www.boldsystems.org/index.php/resources/api) in a combined TSV file format. The combined TSV file was then filtered to keep only the records that: (1) had a sequence (field 72, “nucleotides”); (2) had a sequence that did not hold exclusively one or more “-” (hyphens); had a sequence that did not contain non-IUPAC characters; (3) belonged to COI (the pattern “COI-5P” in either field 70 (“markercode”) or field 80 (“marker_codes”)); 5) had an available BIN (field 8, “bin_uri”). In (5), an exception was made in cases where the species belonging to that record did not occur with a BIN elsewhere in the dataset. In other words, “BIN-less” records were kept if their species were also completely BIN-less in the dataset. The dataset was then filtered to include only records from a custom European BOLD BLAST database.
Finally, a FASTA file annotated with (1) a Process ID (field 1, “processid”), (2) BIN (field 8), (3) taxonomy (fields 10, 12, 14, 16, 18, 20, 22 - “phylum_name”, “class_name”, “order_name”, “family_name”, “subfamily_name”, “genus_name”, “species_name”), (4) geolocation data (fields 47, 48, 55), and (5) GenBank ID (field 71, “genbank_accession”) was created from the filtered combined TSV file. This FASTA file was then converted into a BLAST database using Geneious v10.2.6 (Biomatters, Auckland, New Zealand). The results were exported and further processed according to methods described by Uhler et al. (2022).
Briefly, the resulting CSV files containing BLAST results were exported from Geneious and combined with the OTU table generated by the bioinformatic pre-processing pipeline. The CSVs included: (1) OTU ID; (2) BOLD Process ID; (3) BIN; (4) Hit-%-ID value (the percentage of identical base pairs of the OTU query sequence with its closest counterpart in the reference database); (5) Grade-%-ID value (a value that combines query coverage, E-value and Hit-%-ID with weights of 0.5, 0.25 and 0.25 respectively); (6) length of the top BLAST hit sequence; (7) phylum, class, order, family, genus and species for each detected OTU.
As an additional measure of control other than BLAST, the OTUs were classified into taxa using the Ribosomal Database Project (RDP) naïve Bayesian classifier (Wang et al., 2007), which was trained on a cleaned COI dataset of Arthropods and Chordates (plus outgroups; see Porter & Hajibabei, 2018). OTUs were also annotated with the taxonomic information from the NCBI (downloaded from https:// ftp.ncbi.nlm.nih.gov/pub/taxonomy/), followed by the creation of a taxonomic consensus between BOLD, NCBI and RDP to facilitate assessment of the resulting matches across the three reference databases. To create the taxonomic consensus, we first adjusted the taxonomic depths of each hit from the three reference databases based on its Grade-%-ID value (>97% for species, >95% for genus, >90% for family, >85% for order, >80% for class, and >75% for phylum). In cases where a taxonomically identical match was found in all three reference databases (BOLD BLAST, NCBI BLAST, and RDP classifier), the OTUs were assigned the taxonomic score "A". Where BOLD & NCBI agreed, but RDP disagreed, the OTUs were assigned the score "B". This was in most cases the result of certain taxa either missing or not being represented with sufficient numbers in the RDP classifier's training set. Finally, where NCBI & RDP agreed, but BOLD disagreed, the OTUs got the score "C". A score of "C" commonly occurs in cases where BOLD cannot resolve a species due to a phenomenon commonly referred to as "BIN sharing". For the purposes of constructing the consensus, in every case of a BIN that is shared between 2 or more species in the database, we disregarded the species-level information given by the BOLD BLAST result. In this way, we gave precedence to a species-level annotation with a score of "C" (by means of NCBI and RDP) over a hypothetical genus-level annotation with a score of "A". We treated cases of identifications to different taxonomic levels across the three references in the same way, i.e., a lower score (consensus level) was preferred if it meant an increase to the taxonomic resolution.
BOLD taxonomy was then used to create Krona charts. These interactive HTML charts were created by means of KronaTools v2.7 (Ondov et al. 2011) (https://github.com/ marbl/Krona/wiki/KronaTools). Krona charts are a variation of a sunburst diagram, a pie-chart-like visualization, which is commonly used to plot hierarchical data in a way that emphasizes their taxonomic relationships and relative abundance. A Krona chart shows hierarchy through a series of concentric rings, where each ring corresponds to a level in the hierarchy, and each ring is segmented proportionally to represent read abundance. Where multiple OTUs were identified to the same taxon, read counts were summed over all those OTUs. A set of charts was created: one for each individual sample, one summed over all samples, as well as one each summed over M. myotis- or M. blythii-derived samples, respectively. First, a custom script was used to extract from the final Excel results table only the OTU table counts and associated taxonomic annotations. Then, intermediate sample count (.TAX) files for KronaTools were created using a bash script obtained from https:// github.com/GenomicaMicrob/OTUsamples2krona. The charts were created by the same script using the command “ktImportText [SAMPLE.TAX] -n SAMPLE -o SAMPLE.html”.