The apportionment of dietary diversity in wildlife
Data files
Jul 01, 2025 version files 1.06 MB
-
bold-specimendata-DS-YNPBP-R3.csv
387.83 KB
-
bold-trnL-DS-YNPBP-R3.fas
495.36 KB
-
README.md
10.69 KB
-
YNPP6_completeDB_20241108.fasta
167.63 KB
Abstract
Evaluating species’ roles in food webs is critical for advancing ecological theories on competition, coexistence, and biodiversity, but complicated by pronounced dietary variability within species and overlap across species. We combined dietary DNA metabarcoding, GPS tracking, and a machine-learning algorithm to cluster and compare dietary profiles within and among five migratory large-herbivore species from Yellowstone National Park. Interspecific niche partitioning was weak, but statistically significant (PERMANOVA: pseudo-F4,498 = 14.7, R2 = 0.11, P ≤ 0.001), such that some diet profiles from different species were as similar as those from within one species. Instead of affirming species’ identity as a primary determinant of diet composition, we found three statistically different clusters of diet profiles—one concentrated on graminoids and forbs, another on forbs and deciduous shrubs, and a third on gymnosperms—each including samples from all herbivore species. Clusters did not reflect traditional diet classification schemes such as the grazer-browser continuum that is often used to distinguish species by percent grass consumption or use of grassland habitat in African savannas. Instead, clusters in Yellowstone reflected seasonal dietary variation within species that often equaled or exceeded niche differences between species, contributing to our growing understanding of why environmental variability may favor generalist foraging strategies at temperate latitudes whereas specialized grazer and browser guilds appear to predominate in tropical savannas. Data-driven strategies that untangle complex trophic networks without relying on a priori groupings offer promising new insights into wildlife diets, with potential applications in resource management and environmental monitoring.
Local reference library files:
This dataset includes the specimen data (bold-specimendata-DS-YNPBP-R3.csv) and trnL input fasta (bold-trnL-DS-YNPBP-R3.fas) file that were used to create the local DNA reference library for Yellowstone National Park, as well as the output fasta (YNPP6_completeDB_20241108.fasta) for the complete local reference library (generated from obitools_step2_local_reference_library_2025.Rmd).
Both the input and and output files for the local reference library are in FASTA format where a sequence begins with a single-line description (specimen ID and plant taxonomy), followed by lines of sequence data for that taxon.
All columns in bold-specimendata-DS-YNPBP-R3.csv are outlined below (cells where information wasn't recorded for a specimen are labeled with 'n/a'):
- Sample ID = internal identifier for the sample being sequenced
- Project Code = unique identifier for the data project
- Process ID = unique code automatically generated by BOLD systems for each new record added to project
- Field ID = identifier for specimen assigned in the field
- Phylum = scientific name of collected specimen identified to phylum
- Class = scientific name of collected specimen identified to class
- Order = scientific name of collected specimen identified to order
- Family = scientific name of collected specimen identified to family
- Subfamily = scientific name of collected specimen identified to subfamily
- Genus = scientific name of collected specimen identified to genus
- Species = scientific name of collected specimen identified to species
- Subspecies = scientific name of collected specimen identified to subspecies
- Identifier = Full name of primary individual who assigned the specimen to a taxonomic group
- Catalog Num = identifier for specimen assigned by formal collection upon accessioning (museum ID)
- rbcL Seq. Length = sequence length (bps) of rbcL locus for specimen
- rbcL Trace Count = number of trace files for rbcL locus per specimen
- rbcL Accession = GenBank accession number for rbcL specimen record
- matK Seq. Length = sequence length (bps) of matK locus for specimen
- matK Trace Count = number of trace files for matK locus per specimen
- matK Accession = GenBank accession number for matK specimen record
- trnL-F Seq. Length = sequence length (bps) of trnL-F locus for specimen
- trnL-F Trace Count = number of trace files for trnL-F locus per specimen
- trnL-F Accession = GenBank accession number for trnL-F specimen record
- trnH-psbA Seq. Length = sequence length (bps) of trnH-psbA locus for specimen
- trnH-psbA Trace Count =number of trace files for trnH-psbA locus per specimen
- trnH-psbA Accession = GenBank accession number for trnH-psbA specimen record
- Image Count = number of images associated with specimen on BOLD systems
- Barcode Compliant = barcode index number marked as compliant if they contain at least one sequence that meets BOLD systems standards
- Contamination = indicates specimen flagged for contamination
- Stop Codon = indicates presence of stop codon in loci
- Flagged Record = indicates specimen or sequence that was flagged as an issue
- Identification = taxonomic assignment of specimen
- Life Stage = life stage of specimen
- Extra Info = extra information about specimen
- Voucher Type = indicates special case for accessioning process
- Institution = Full name of institution that has physical possession of the voucher specimen
- Collectors = The full or abbreviated names of the individuals or team responsible for collecting the sample in the field
- Collection Date = Date of specimen collection
- Country/Ocean = Country that specimen was collected
- State/Province = State that specimen was collected
- Region = region that specimen was collected
- Lat = latitude that specimen was collected (Decimal degrees)
- Lon = longitude that specimen was collected (Decimal degrees)
- Elev = elevation that specimen was collected (m)
- Habitat = habitat classification that specimen was collected
- Collection Notes = Additional collection notes
We also provide Python notebooks, R notebooks, and input/output files used to quantify dietary variation within and among populations of five large-herbivore species (pronghorn, bighorn sheep, mule deer, elk, bison) in Yellowstone National Park, USA. We use these as interactive notebooks for our analyses; steps 1-4 were run on our high-performance cluster (HPC) but we have made comments in each notebook to describe the appropriate file paths at each step. We use RMarkdown files with the intention of having users run code in chunks and understand outputs at each step.
First, global (step 1) and local (step 2) reference libraries are built for the trnL-P6 locus. Raw sequence reads from large-herbivore fecal samples are then cleaned and prepared (step 3) for taxonomy assignment (step 4). The taxonomies assigned using the local and global reference libraries are combined (step 5) and then analyses are conducted to quantify similarities and differences among all dietary profiles—regardless of consumer species, season, and geography—and evaluate how consistently each of these traditional grouping variables is associated with major axes of dietary variation in the community (step 6a-6b).
This dataset contains all code associated with:
- Creating global reference library for the trnL locus in plants (obitools_step1_global_reference_library_2025.Rmd)
- Creating local Yellowstone National Park reference library for the trnL locus in plants (obitools_step2_local_reference_library_2025.Rmd)
- Preparing and cleaning sequence reads from fecal samples (obitools_step3_prepare_sequence_reads_2025.Rmd)
- Assigning taxonomy to cleaned sequence reads using the local and global reference libraries (obitools_step4a_taxonomy_assignment_local_2025.Rmd and obitools_step4b_taxonomy_assignment_global_2025.Rmd)
- Combining the local and global reference library taxonomy assignment outputs (R_step5_combine_local_and_global_library_outputs.Rmd)
- Data analyses in R for the main text (R_step6a_data_analysis_main_text.Rmd) and supplement (R_step6b_data_analysis_supplement.Rmd)
Sharing/Access information
Illumina sequence data and sample metadata are available at NCBI (BioProject accession number: PRJNA780500).
Code/Software
These coding steps are designed to follow on from one another. The files created in steps 1, 2, and 3 will be used in step 4a-b. The files created in step 4a-b will be used in step 5. The files created in step 5 will be used in step 6a-b. All code is annotated.
Steps 1-4 require the following python packages:
- cutadapt
- obitools
Steps 5-6 require the following R packages:
- datawizard
- ggrepel
- glue
- textshape
- stringr
- remotes
- scales
- knitr
- here
- ggplot2
- dplyr
- raster
- maptools
- ggsn
- sf
- gdalUtilities
- ggspatial
- rgeos
- sp
- tidyverse
- phyloseq
- vegan
- ggalt
- elevatr
- terra
- tidyterra
- ggnewscale
- car
- speedyseq
- emmeans
- multcomp
- microViz
- permute
- lattice
- vegetarian
- cluster
- MASS
- indicspecies
- data.table
obitools_step1_global_reference_library_2025.Rmd - code to build a global reference database for plants. We use the ecoPCR program in obitools to simulate a PCR and to extract all sequences from the EMBL that may be amplified in silico by the two primers (GGGCAATCCTGAGCCAA and CCATTGAGTCTCTGCACCTATC) used for PCR amplification.
The list of steps for building this reference database are:
- Download the whole set of EMBL/ENA sequences from Globus
- Download the NCBI taxonomy
- Format sequences into the ecoPCR format
- Use ecoPCR to simulate amplification and build a reference database based on putatively amplified barcodes together with their recorded taxonomic information
obitools_step2_local_reference_library_2025.Rmd - code to build a local Yellowstone National Park reference database for plants. We use the ecoPCR program in obitools to simulate a PCR. All local barcode sequences can be found on BOLD in Dataset YNPBP-R3 and can be amplified in silico by the two primers (GGGCAATCCTGAGCCAA and CCATTGAGTCTCTGCACCTATC) used for PCR amplification. The code results in the creation of the file "YNPP6_completeDB_20241108.fasta" which is included in this dataset.
The list of steps for building this reference database are:
- Extract trnL-P6 from BOLD sequences
- Format sequences into the ecoPCR format
- Use ecoPCR to simulate amplification and build a reference database based on putatively amplified barcodes together with their recorded taxonomic information
obitools_step3_prepare_sequence_reads_2025.Rmd - code to clean and prepare raw sequence reads from large-herbivore fecal samples to determine their diets.
The following steps are taken:
- Remove primers from forward and reverse reads using cutadapt
- Recover full sequence reads from forward and reverse reads
- Remove unaligned sequence records
- Dereplicate reads into uniq sequences
- Denoise the sequence dataset
- Clean the sequences for PCR/sequencing errors
obitools_step4a_taxonomy_assignment_local_2025.Rmd and obitools_step4b_taxonomy_assignment_global_2025.Rmd - code to assign taxonomy to sequences using global and local reference libraries in order to get the complete list of species associated to each sample. Taxonomic assignment of sequences requires a reference database compiling all possible species to be identified in the sample. Assignment is then done based on sequence comparison between sample sequences and reference sequences.
The following steps are taken for both global and local reference libraries:
- Assign each sequence to a taxon
- Generate the final result table
R_step5_combine_local_and_global_library_outputs.Rmd - R code to combine local and global reference library outputs.
The following steps are taken:
- Subset databases to perfect matches (100% matches)
- Generate summary statistics for subset databases
- Make output files required to create a phyloseq object
- Build the physeq object, filter, and rarefy for further analyses
R_step6a_data_analysis_main_text.Rmd and R_stepb_data_analysis_supplement.Rmd - R code for all analyses conducted on this comparative dietary dataset.
Main analyses performed include:
- Calculation of grass, graminioid, and herbaceous plant relative read abundance in herbivore diets
- Calculation of dietary Bray-Curtis dissimilarity
- Calculation of dietary richness
- Classification of dietary clusters using Partitioning Around Medoids algorithm
- Identification of indicator species associated with each dietary cluster
We obtained high-resolution diet profiles from fecal samples associated with GPS-collared individuals from five large-herbivore species that migrated over a ~100-km2 area. We processed a total of 697 samples from Yellowstone (inclusive of the 503 samples in this dataset and 194 separate samples from a different sampling scheme) with 38 total extraction blanks to monitor for laboratory cross-contamination. To identify dietary DNA sequences with the plant trnL-P6 marker using an Illumina MiSeq, we developed two reference libraries. The ‘local’ library comprised 334 unique trnL-P6 sequences from 882 vouchered specimens representing 75 plant families. We also built a global library using all trnL-P6 sequences available from ENA in EMBL-EBI using obitools v1.2.13. The global library was built using 26,101 unique trnL-P6 sequences obtained from 223,916 accessions representing at least 641 plant families.
To ensure that both per-base and per-sequence quality scores exceeded Q20, we first ran FastQC on the demultiplexed Illumina sequences obtained for each sample. We used cutadapt to remove primers from forward and reverse reads, allowing a maximum of 2 mismatches and no indels. We assembled sequence reads using the illuminapairedend in obitools using a minimum alignment score of 40. We used the obiuniq command to tally and merge identical sequences across samples, allowing us to quantify relative read abundance (RRA). We discarded sequences that occurred <2 times overall or that were outside the broad range of expected sequence lengths for the marker (<8 bp or >300 bp) using obigrep. We identified putative PCR artifacts as sequences that were highly similar to another sequence (1 bp difference) but with a much lower abundance (0.05%) in the majority of samples in which they occurred and discarded these sequences using the obiclean command. We used our local and global reference libraries to identify each unique plant DNA sequence with the ecoTagcommand in obitools and exported these outputs for further analysis.
Using these outputs, we crosschecked local reference library matches with global reference library matches. At this stage, we eliminated sequences that did not perfectly match a sequence in our local or global reference library (100% identity). We filtered the dataset to include the subset of 503 samples obtained from herds that included GPS collared individuals, as opposed to unaffiliated groups. All samples in this subset achieved >1000 sequence reads that passed quality controls. We thus rarified samples to an even sequencing depth for all downstream analyses and the final dataset comprised 503 samples and included 910 plant taxa; 95.7% of plant taxa were identified to family, 73% to genus, and 44% to species. We used these taxonomic assignments to characterize plant functional groups using growth form classifications of the United States Department of Agriculture (USDA) Plants Database.
