Data from: Using spatial-temporal filtering and improved barcoding tools to improve the ecological relevance of pollen meta-barcoding
Data files
Apr 30, 2026 version files 1.39 GB
-
README.md
5.29 KB
-
RMBL_Pollen.zip
1.39 GB
Abstract
DNA metabarcoding has been successful for the rapid identification of species in complex ecological assemblages, including identifying interspecific interactions among species. However, advances in metabarcoding within the plant kingdom have been hampered due to a lack of universal gene regions that work across all taxa, which limit the applications of eDNA and metagenomics in ecology more broadly. To circumvent these limitations, we propose a holistic spatio-temporal approach that combines multi-gene barcoding with existing plant occurrence databases, species distribution models, and phenological analyses to generate a shortened list of candidate species to increase metabarcoding accuracy and computing efficiency. To validate the accuracy and efficiency of our methodological framework, we compared the results of the DNA metabarcoding from pollen loads of several species of wild bumble bees to in-depth, long-term field observations of bee-plant interactions, along with expert-led pollen identification. We show that DNA metabarcoding of the plant species included in bumble bee pollen loads was most accurate when combined with a candidate taxa list of plant species flowering in the community when the bumble bees were foraging, which improved the accuracy and taxonomic precision of 77.5% of samples. With the recent proliferation of species occurrence and phenology data in tandem with advances in computing and software development, we believe that spatio-temporal filtering provides a simple approach for interpreting metagenomic studies globally. Additionally, we demonstrate that the Angiosperms 353 probes (developed for phylogenomics) offer significant promise for metagenomics projects globally, including metabarcoding to reveal species interactions within complex communities. Further, our approach demonstrates that integrating DNA metabarcoding is most accurate and powerful when combined with local ecological data.
https://doi.org/10.5061/dryad.0k6djhb7v
Description of the data and file structure
Please refer to the methods, the data on DRYAD are images and the scoring of pollen.
Files and variables
pollen_clusteR.zip
Contents
rsconnect/— Directory for the Shiny app used to score pollen grains..git/— Git directory containing version control for the analysis.cluster_analysis.png— Results of cluster analysis (1).pollen_clusteR.Rmd— R Markdown document containing clustering analysis code.cluster_analysis_pollen.png— Results of cluster analysis (2).pollen_clusteR.pdf— Rendered Markdown document.Data/— Data used for analyses inpollen_clusteR.Rmd.
Data: Existing Pollen Reference Slides
- Genus: The genus which the species is in.
- Epithet: The species on the reference slide.
- Authority: The scientific authority for the species.
- Family: The family which the species is in, sensu Angiosperm Phylogeny Group IV.
- Type: Whether the reference object is an image or physical slide.
- Prepared: Individual who prepared the slide (Paul J. CaraDonna or Heather Marie Briggs).
- Date: Year the slide was originally prepared.
- Site: Collection site for the pollen on the slide.
- State: U.S. state where the plant pollen source was located.
- County: County where the plant pollen source was located.
- Voucher: Indicates whether a herbarium voucher exists for the plant from which the pollen was sampled.
Data: pollen_morpho_all.csv
- Genus: The genus which the species is in.
- Epithet: The species on the reference slide.
- Voucher: Whether an herbarium voucher exists for the plant the pollen was sampled from; the collector is cited when applicable.
Data: RMBL_Pollen.zip
Data: Pollen morpho matrix.2021-04-29.csv
- Genus: The genus which the species is in.
- Epithet: The species on the reference slide.
- Voucher: Whether an herbarium voucher exists for the plant the pollen was sampled from; the collector is cited when applicable.
- Outline: The overall shape of the pollen grain (polar or equatorial view).
- Polarity: Orientation of the pollen grain (e.g., isopolar, heteropolar).
- Dispersal_Unit: Whether the grain is dispersed singly (monad) or in groups (dyad, tetrad, polyad).
- Aperturate: Indicates presence or absence of apertures.
- Ornamentation: Surface texture (e.g., reticulate, striate, psilate).
- Saccate: Boolean; indicates presence of air sacs (sacci).
- Operculate: Boolean; indicates presence of an operculum (lid-like structure).
- Annulate: Boolean; indicates whether apertures are ringed with a thickened margin.
- Aperture_Type: Morphological type of aperture (e.g., colpus, pore, colporus).
- Aperture_Number: Number of apertures present.
- l_polar_lat: Polar axis length (µm), averaged across grains.
- l_equi_lat: Equatorial diameter (µm), averaged across grains.
Data: Label_box.xlsx
- No.: Slide number in the collection.
- Family: Family, sensu Angiosperm Phylogeny Group IV.
- Taxon: Species from which the pollen was collected.
- Voucher: Herbarium specimen, collector’s number.
Pollen_Photos_RMBL.zip
Images of pollen voucher specimens. Each image is accompanied by a corresponding .xml metadata file documenting the image.
├── Apiaceae
│ ├── Cymopterus lemmonii
│ ├── Heracleum sphondylium
│ ├── Ligusticum porteri
│ └── Osmorhiza depauperata
├── Asteraceae
│ ├── Achillea millefolium
│ ├── Senecio wootonii
│ ├── Symphyotrichum foliaceum
│ └── Taraxacum officinale
├── Boraginaceae
│ ├── Lappula squarrosa
│ ├── Mertensia ciliata
│ └── Mertensia fusiformis
├── Brassicaceae
│ ├── Boechera drummondii
│ ├── Cardamine cordifolia
│ ├── Draba spectabilis
│ ├── Erysimum capitatum
│ └── Thlapsi arvense
├── Caprifoliaceae
│ └── Lonicera involucrata
├── Caryophyllaceae
│ └── Stellaria longifolia
├── Fabaceae
│ ├── Lathyrus leucanthus
│ ├── Lupinus argenteus
│ ├── Lupinus crassus
│ ├── Trifolium hybridum
│ ├── Trifolium pratense
│ └── Vicia americana
├── Gentianaceae
│ └── Frasera speciosa
├── Geraniaceae
Sharing/Access information
All novel sequencing data are located at NCBI Short Read Archive (SRA) under PRJNA1093153 (metagenomic) and PRJNA1083211 (reference).
Code/Software
All code and software are from three major languages: R (mostly run from RStudio), 'bash/shell scripts' (for bioinformatics), and Python (also for bioinformatics). All bash and Python code was run from the command line. Two primary computers were used for nearly all analysis; both ran stable releases of Ubuntu 20+, one computer initially used at the beginning of the project ran Windows 10 (only used for R work).
To improve metabarcoding reliability and efficiency, we suggest creating a regional list of candidate species using digital collections gleaned from herbaria, survey work, and community science (Figure 1). This list can further be refined using species distribution models and temporal filtering to limit the impact of spatial and taxonomic biases in the species list and account for spatial variations in niche availability throughout the study area. The final list is then used to inform the collection of plant samples to create a library and inform metabarcoding. We apply this methodological framework to the metabarcoding of corbiculae pollen loads of bumble bees and compare the accuracy of our metabarcoding approach both prior to and after applying a spatial and temporal filtering to pollen identification conducted by experts and field observations.
System background
To test the effectiveness of our methodological approach, we applied it to identify the plant species found in the corbiculae (pollen loads) of queen bumble bees (Bombus Latreille) collected from the Rocky Mountain Biological Laboratory in Colorado, USA (RMBL; 38°57.5” N, 106°59.3” W (WGS 84), 2900 m.a.s.l.). We collected pollen loads from wild foraging queen bees between May and July of 2015 at six permanent study sites (Ogilvie & CaraDonna, 2022). To harvest the pollen loads, we captured queens in an insect net, transferred them into a restraining device (Kearns et al., 2001), collected a pollen load from one leg, and then released them. We collected 64 corbiculae pollen loads from queens of several common wild bumble bee species: Bombus appositus, B. bifarius, B. californicus, B. flavifrons, B. nevadensis, and B. rufocinctus (Pyke 1982; Ogilvie & CaraDonna 2022). At the six study sites, bumblebee abundance and interactions with flowering plants were monitored for one hour at weekly intervals. The abundance of flowers visited by bumble bees within belt transects spread over the three vegetation types (0.5 x 40 m transects in each vegetation type, 60 m2 total area per site). We used six years (2015-2020) of observational data on Bombus flower visits to identify the plant taxa most frequently visited by queens across all years and to compare with metagenomic data.
Survey database to generate a regional taxa list
We first generated a short list of potential candidate species. We downloaded from the Botanical Information and Ecology Network ‘BIEN’ (Maitner, 2022) all records adjacent to the field sites to develop an ecologically relevant list of vascular plant species, with expected biotic pollination, which may be present at the study area. To reduce the number of species to include in the reference sequence databases, we then generated Species Distribution Models (SDMs) for these taxa to predict their distribution throughout the study area. To minimize the number of species for which SDM’s were to be generated, BIEN was queried at a distance of up to 100km from our study area, and all plant species records were downloaded. To account for the stochasticity of botanical collecting and offset the number of records associated with the research station, this data set was bootstrap resampled 250 times, with 90% of samples selected, to create a testing dataset. The median of the logistic regression assessing the probability of occurrence of a species record as a function of distance from the study area was used as a threshold distance, under which to include species as candidates for distribution modelling.
Spatial filtering via species distribution modelling
To determine which clades to include in the reference sequence database, we used Species Distribution Modelling. We used all occurrence records from BIEN (n = 23,919) within a 50km border of the ecoregion, Omernik level 3, which includes the study area (No. 21 “Southern Rockies”) to construct the species distribution model (Omernik, 1987). These records were copied into two, initially identical, sets, one for generating machine learning models (ML; Random Forest, and Boosted Regression Tree’s), and the other for Generalised Linear (GLM) and Generalized Additive Models (GAM) (Barbet-Massin et al., 2012). Ensembled predictions have been shown to outperform their constituent models, on average, and to reduce the ecological signal to the analytical noise of individual runs (Araujo & New (2007)). No single method of producing SDMs has been shown to universally outperform others when faced with a large and diverse number of applications, in our case, a great number of species with different biology and ecology (Elith* et al. (2006), Qiao et al., 2015). In the spirit of these findings, multiple families of models, which can be generated together as they have similar requirements regarding the number and ratios of Presence to Absence records, were ensembled together (Barbet-Massin et al., 2012).
We then generated 4,029 absence points, locations where the focal taxon is anticipated missing, through a random stratification of 19% of the land cover in the area and included them in (Land Management, 2019). To achieve a larger absence data set, we generated 1,000 pseudo-absence records for each taxon by randomly selecting coordinates located at least 10km away from an occurrence record. For ML models, these pseudo-absences were reduced so that the ratio of presence to absence records were balanced (Barbet-Massin et al., 2012). To achieve this, we removed absence records inside of 10% of the mean sample value of any predictor variable, and the presence records; the required number of absence records were then randomly sampled. To predict the potential distribution of each species, we used 26 environmental variables at 30m resolution, six related to climate, five soil, four topographic, four related to cloud cover, with the remaining reflecting assorted abiotic parameters (Wilson & Jetz, 2016; Wang et al., 2016; Hengl et al., 2017; Robinson et al., 2014). These publicly available data sets were selected as they pertain to a wide range of variables interacting with plant physiology. For linear regression models, these predictors underwent both vifstep (theta = 10, max observations = 12,500) and vifcor (theta = 0.7, max observations = 12,500) to detect highly correlated variables, and collinear features were removed, leaving 16 variables (Naimi et al., 2014). Modelling: Random Forest and Boosted Regression Trees were sub-sampled with 30% test and two replicates each before a weighted ensemble based on True Skill Statistics (tss) (Naimi & Araujo, 2016). Generalised linear models (GLM) and Generalised additive models (GAM) with 30% sub sampling and three replicates each were also ensembled using the tss (Naimi & Araujo, 2016). TSS was chosen as the ensemble criterion as it has been shown to work across a wide range of species occurrences and prevalence (Allouche et al., 2006). The results of these models were extracted on a cell-by-cell basis to a polygon feature derived from a minimum-spanning tree, which encompasses the study sites, and species from either ensemble with greater than 50% mean habitat suitability across all cells were considered present for further purposes (Prim (1957)). A total of 535 species were modelled using Generalized Linear Models and Generalized Additive Models, and 534 species were modelled using Random Forest and Boosted Regression Trees. To evaluate the accuracy of the species distribution models, additional presence records from GBIF (n = 61,789) and AIM (n = 12,730) were used as test and training sets (n = 74,519) for logistic regression (Occdownload Gbif.Org (2021), Land Management (2019)). Additional novel absence records were generated from the AIM data set to create a data set where each species has balanced presence and absence. Eleven or more paired presence and absence records were required for this testing, resulting in 334 species being included in the logistic regression (Mdn = 110.0, x̄ = 223.1, max = 1568 record pairs used) with a 70% test split (Kuhn, 2022).
Temporal Filtering of the species list
For assignment of reads to ecologically probabilistic species subsequent to BLAST, flowering time was used as a filter. To estimate the duration of dates in which plant species were flowering, Weibull estimates of several phenological parameters, all spatially modelled taxa were developed (Belitz et al., 2020; Pearse et al., 2017). Only BIEN records that occurred in the Omernik Level 4 Ecoregions within 15km of the study area (n = 5 Level 4 Ecoregions, or conditionally 6 ecoregions if enough records were not found in the nearest 5), and which were from herbarium records, were included. To remove temporally irrelevant herbarium records, i.e., material collected during times which flowering is impossible at the study area due to snow cover, we used the SnowUS data set (Iler et al. (2021), Tran et al. (2019)) from 2000-2017 were analyzed for the first three days of contiguous snow absence, and the first three days of contiguous snow cover in fall. Herbarium records after the 3rd quantile for melt, and the 1st quantile for snow cover of these metrics were removed. Species with > 10 records had their Weibull distributions generated for the date when 10% of individuals had begun flowering, when 50% were flowering, and when 90% of individuals had flowered. We used the initiation and cessation dates, respectively, as effective start and ends of flowering. These estimates were compared to a long-term observational study of flowering phenology 1974-2012 (CaraDonna et al., 2014), and the floral abundance data from 2015, using Kendall’s tau.
Microscopic pollen identification
To qualitatively identify and quantitatively note the plant species present in corbiculae loads, microscopy was used. A pollen reference library of fuchsin-jelly stained grains, which may be present in corbiculae loads of slides, was assembled from slides previously prepared by the authors (n = 21) and other researchers (n = 38) (Beattie, 1971; Brosi & Briggs, 2013). Using five years of observational data on Bombus Queen Bee foraging at these studies sites (Ogilvie & CaraDonna (2022)), as well as the RMBL Vascular Plant Checklist (Frase & Buck (2007)), an additional 62 voucher slides for species were prepared and imaged at 400x (Leica DMLB, Leica MC170 HD Camera, Leica Application Suite V. 4.13.0) from non-accessioned herbarium collections to supplement the number of species and clades covered.
We used clustering techniques to supplement our subjective opinions of which plant taxa were distinguishable via light microscopy, and to develop a dichotomous key to pollen morphotypes. Ten readily discernible categorical traits were collected from each specimen in the image collection. These traits were transformed using Gower distances and clustered using Divisive Hierarchical clustering techniques (Maechler et al., 2022). Using the cluster dendrogram, elbow plot, and heatmaps (Hennig, 2020), of these results, morphological groups of pollen which could not be resolved via microscopy were delineated, and a dichotomous key was prepared. This key was then used to identify the pollen grains sampled from corbiculae loads to morphotypes in a consistent manner.
To prepare the pollen slides from corbiculae, all corbiculae loads were broken apart and rolled using dissection needlepoints to increase the heterogeneity of the samples. Circa 0.5mm2 of pollen was placed onto a ~4mm2 fuchsin jelly cube (Beattie, 1971) atop a graticulated microscope slide, with 20 transects and 20 rows (400 quadrants) (EMS, Hartfield, PA). The jelly was melted, with stirring, until pollen grains were homogeneously spread across the microscope slide. Slides were sealed with Canada Balsam (Rublev Colours, Willits, CA), followed by sealing with clear nail polish to prevent oxidation; all samples are noted in.
To identify the pollen present in corbiculae loads, light microscopy at 400x (Zeiss Axioscope A1) was used. In initial sampling in three transects, each pollen grain was identified to a morphotype and counted; an additional two transects were scanned for morphotypes unique to that slide. If either transect contained a unique morphotype than all grains in that transect were also identified and counted. Subsequent to the first round of sampling, non-parametric species richness rarefaction curves (Oksanen et al., 2022) and non-parametric species diversity rarefaction curves were used to assess the completeness of sampling (Chao et al., 2014; Hsieh et al., 2020). Slides not approaching the asymptote of the rarefaction curve were then re-sampled and analysed iteratively for up to a total of seven transects.
Metagenomics: additional plant tissue collection and extraction
Using five years (2015-2020) of observational data on Bombus queen interactions with flowering plants at these studies sites, we identified the plant taxa most frequently visited by queens across all years. In order to capture more variability inherit in the 353 loci, we sequenced the 12 most visited taxa twice, using samples collected from one site within the Gunnison Basin River Drainage and one individual collected from another more distal population. In addition, we included a congener - or a species from a closely related genus to serve as an outgroup for all 12 taxa. We also sequenced another 15 taxa of plants commonly visited by Bombus workers, based on the abundances and immediate access to plant tissue, in the aforementioned data set. Plant collections were identified typically using a combination of dichotomous keys and primary literature as required (Flora of North America Editorial Committee (1993+), Hitchcock & Cronquist (2018), Ackerfield (2015), Lesica et al. (2012), Cronquist et al. (1977+), Allred & Ivey (2012), Jepson flora project (2020), Mohlenbrock (2002)). Plant genomic DNA was isolated from ~ 1 cm2 of leaf tissue from silica-gel dried or herbarium material using a modified cetyltrimethylammonium (CTAB) protocol (Doyle & Doyle, 1987) that included two chloroform washes. DNA was quantified using a Nanodrop 2000 (Thermo Fisher Scientific, Waltham, Massachusetts, USA) and Qubit fluorometer (Thermo Fisher Scientific).
Pollen DNA extraction
To extract genomic DNA from pollen, a CTAB-based protocol was modified from Lahlamgiahi et al. and Guertler et al. (2014, 2014). A SDS extraction buffer (350µL, 100mM Tris-HCl, 50 mM EDTA, 50 mM NaCl, 10% SDS v/v., pH 7.5) was added to the sample, followed by vortexing to allow dissolution of corbiculae. Pollen grains were then macerated with Kontes Pellet Pestles, and the tip of these washed with 130 µL of the SDS extraction buffer, samples were then incubated for 1 hour at 30°C. This was followed by the addition of 10% CTAB solution (450ul, of 20 mM Tris-Cl pH. 8.0, 1.4 M NaCl, 10 mM EDTA pH 7.5, 10% CTAB, 5% PVP, ~85% Deionized water) and RNAse (10 uL of 10 mg/mL) and samples were incubated for 40 minutes at 37°C, on a heat block (Multi-Blok, Thermo Fisher Scientific, Waltham Massachusetts) set to 40°C. After 20 minutes incubation, Proteinase K (15 µL of 20mg/ml) and DTT (12.5 µL of 1M in water) were added, and the samples were further incubated at 60°C for 1 hour. Samples were then incubated overnight at 40°C. 500 µL of Phenol-Chloroform-Isoamyl alcohol (25:24:1) were added, vortexed, and centrifuged at 10,000 rpm for 10 minutes, and the aqueous phase was pipetted to a 1.5 ml centrifuge tube.
To precipitate the DNA, chilled Isopropyl alcohol & 3 mM Sodium acetate (5:1) equivalent to 32 of the volume of sample were added, with 1 hour of chilling at -20°C, followed by 10 minutes of centrifuging at 13,000 rpm. The supernatant was pipetted to a new 1.5 ml centrifuge tube, and 70% EtOH (400 µL) were added before chilling at -20°C for 20 minutes, followed by centrifugation at 13,000 rpm for 10 minutes. Both tubes were then washed with 75% EtOH (400 µL), inverted, centrifuged at 13,000 rpm for 4 minutes, and the solution discarded, then washed with 95% EtOH (400 µL), inverted, centrifuged at 13,000 rpm for 4 minutes, and the solution discarded. Pellets were dried at room temperature overnight before resuspension in nuclease-free H2O. Extractions were assessed using a Nanodrop 2000 (Thermo Fisher Scientific) and Qubit fluorometer (Thermo Fisher Scientific). DNA extracts were then cleaned using 2:1 v./v. Sera-Mag beads (Cytiva, Little Chalfont, UK) to solute ratio following the manufacturer’s protocol, eluted in 0.5x TE, and the eluent allowed to reduce by half volume in ambient conditions. DNA was quantified using a Qubit fluorometer.
Library preparation & Bait capture (Barcoding)
Sequence library preparation was performed using the NEBNext Ultra II FS-DNA Library Prep Kit for Illumina (New England BioLabs, Ipswich, Massachusetts, USA) using slightly modified manufacturer's recommendation. Fragmentation was performed at ½ volume of reagents and ¼ enzyme mix for 40 minutes at 37°C, with an input of 500 ng cleaned DNA. Adapter Ligation and PCR enrichment were performed with ½ volumes, while cleanup of products was performed using SPRI beads (Beckman Coulter, Indianapolis, Indiana, USA) and recommended volumes of 80% v./v. ethanol washes. The exception was the herbarium specimens, which were not fragmented and only end-repaired, with similar library preparation of all samples. Products were analysed on 4% agarose gels and a Qubit fluorometer. Libraries were pooled and enriched with the Angiosperms 353 probe kit V.4 (Arbor Biosciences myBaits Target Sequence Capture Kit) by following the manufacturer’s protocol and Brewer et al. 2019. Sequencing was performed using an Illumina mi-Seq with 150-bp end reads (NUSeq Core, Chicago, Illinois).
Bioinformatics
Sequences were processed using Trimmomatic, which removed sequence adapters, clipped the first 3 bp, discarding reads less than 36 bp, and removing reads if their average PHRED score dropped beneath 20 over a window of 5 bp (Bolger & Giorgi, 2014; Tange, 2021). Contigs generated were mapped to a reference with HybPiper with using target files created by M353 (Johnson et al., 2016; McLay et al., 2021). A custom Kraken2 database was created by downloading representative species indicated as being present in
the study area by the spatial analyses from the Sequence Read Archive (SRA) NCBI (Wood et al., 2019). These sequences were processed in the same manner as our novel sequences. The Kraken2 database was built using default parameters. Kraken2 was run on sequences using default parameters. Following Kraken2, Bracken was used to classify sequences to terminal taxa (Lu et al., 2017). Finally, all reads which could be classified by these databases were passed to a local BLAST database. A local NCBI database was built using the same processed novel and downloaded sequences as the previous database (Camacho et al., 2009).
Comparison of Sequence classifications pre- and post-processing
Using wilcox_effsize, with a one-sided hypothesis of greater, we have strong evidence of (p < 1e-04, wilcoxsign_test) an effect size of 0.732, 95% CI [0.57, 0.84, n = 40, bootstrap replicates = 1000] (Kassambara (2023), Hothorn et al. (2006)).
