The apportionment of dietary diversity in wildlife

Hoff, Hannah 1 ; Littleford-Colquhoun, Bethan 1 ; Kartzinel, Rebecca1; Anderson, Heidi2; Geremia, Chris 2 ; McGarvey, Lauren2; Segal, Carlisle3; Kartzinel, Tyler 1

Published Jul 01, 2025 on Dryad. https://doi.org/10.5061/dryad.mgqnk99b8

Data files

Jul 01, 2025 version files 1.06 MB

bold-specimendata-DS-YNPBP-R3.csv

387.83 KB
bold-trnL-DS-YNPBP-R3.fas

495.36 KB
README.md

10.69 KB
YNPP6_completeDB_20241108.fasta

167.63 KB

Abstract

Evaluating species’ roles in food webs is critical for advancing ecological theories on competition, coexistence, and biodiversity, but complicated by pronounced dietary variability within species and overlap across species. We combined dietary DNA metabarcoding, GPS tracking, and a machine-learning algorithm to cluster and compare dietary profiles within and among five migratory large-herbivore species from Yellowstone National Park. Interspecific niche partitioning was weak, but statistically significant (PERMANOVA: pseudo-F_4,498 = 14.7, R² = 0.11, P ≤ 0.001), such that some diet profiles from different species were as similar as those from within one species. Instead of affirming species’ identity as a primary determinant of diet composition, we found three statistically different clusters of diet profiles—one concentrated on graminoids and forbs, another on forbs and deciduous shrubs, and a third on gymnosperms—each including samples from all herbivore species. Clusters did not reflect traditional diet classification schemes such as the grazer-browser continuum that is often used to distinguish species by percent grass consumption or use of grassland habitat in African savannas. Instead, clusters in Yellowstone reflected seasonal dietary variation within species that often equaled or exceeded niche differences between species, contributing to our growing understanding of why environmental variability may favor generalist foraging strategies at temperate latitudes whereas specialized grazer and browser guilds appear to predominate in tropical savannas. Data-driven strategies that untangle complex trophic networks without relying on a priori groupings offer promising new insights into wildlife diets, with potential applications in resource management and environmental monitoring.

Local reference library files:
This dataset includes the specimen data (bold-specimendata-DS-YNPBP-R3.csv) and trnL input fasta (bold-trnL-DS-YNPBP-R3.fas) file that were used to create the local DNA reference library for Yellowstone National Park, as well as the output fasta (YNPP6_completeDB_20241108.fasta) for the complete local reference library (generated from obitools_step2_local_reference_library_2025.Rmd).

Both the input and and output files for the local reference library are in FASTA format where a sequence begins with a single-line description (specimen ID and plant taxonomy), followed by lines of sequence data for that taxon.

All columns in bold-specimendata-DS-YNPBP-R3.csv are outlined below (cells where information wasn't recorded for a specimen are labeled with 'n/a'):

Sample ID = internal identifier for the sample being sequenced
Project Code = unique identifier for the data project
Process ID = unique code automatically generated by BOLD systems for each new record added to project
Field ID = identifier for specimen assigned in the field
Phylum = scientific name of collected specimen identified to phylum
Class = scientific name of collected specimen identified to class
Order = scientific name of collected specimen identified to order
Family = scientific name of collected specimen identified to family
Subfamily = scientific name of collected specimen identified to subfamily
Genus = scientific name of collected specimen identified to genus
Species = scientific name of collected specimen identified to species
Subspecies = scientific name of collected specimen identified to subspecies
Identifier = Full name of primary individual who assigned the specimen to a taxonomic group
Catalog Num = identifier for specimen assigned by formal collection upon accessioning (museum ID)
rbcL Seq. Length = sequence length (bps) of rbcL locus for specimen
rbcL Trace Count = number of trace files for rbcL locus per specimen
rbcL Accession = GenBank accession number for rbcL specimen record
matK Seq. Length = sequence length (bps) of matK locus for specimen
matK Trace Count = number of trace files for matK locus per specimen
matK Accession = GenBank accession number for matK specimen record
trnL-F Seq. Length = sequence length (bps) of trnL-F locus for specimen
trnL-F Trace Count = number of trace files for trnL-F locus per specimen
trnL-F Accession = GenBank accession number for trnL-F specimen record
trnH-psbA Seq. Length = sequence length (bps) of trnH-psbA locus for specimen
trnH-psbA Trace Count =number of trace files for trnH-psbA locus per specimen
trnH-psbA Accession = GenBank accession number for trnH-psbA specimen record
Image Count = number of images associated with specimen on BOLD systems
Barcode Compliant = barcode index number marked as compliant if they contain at least one sequence that meets BOLD systems standards
Contamination = indicates specimen flagged for contamination
Stop Codon = indicates presence of stop codon in loci
Flagged Record = indicates specimen or sequence that was flagged as an issue
Identification = taxonomic assignment of specimen
Life Stage = life stage of specimen
Extra Info = extra information about specimen
Voucher Type = indicates special case for accessioning process
Institution = Full name of institution that has physical possession of the voucher specimen
Collectors = The full or abbreviated names of the individuals or team responsible for collecting the sample in the field
Collection Date = Date of specimen collection
Country/Ocean = Country that specimen was collected
State/Province = State that specimen was collected
Region = region that specimen was collected
Lat = latitude that specimen was collected (Decimal degrees)
Lon = longitude that specimen was collected (Decimal degrees)
Elev = elevation that specimen was collected (m)
Habitat = habitat classification that specimen was collected
Collection Notes = Additional collection notes

We also provide Python notebooks, R notebooks, and input/output files used to quantify dietary variation within and among populations of five large-herbivore species (pronghorn, bighorn sheep, mule deer, elk, bison) in Yellowstone National Park, USA. We use these as interactive notebooks for our analyses; steps 1-4 were run on our high-performance cluster (HPC) but we have made comments in each notebook to describe the appropriate file paths at each step. We use RMarkdown files with the intention of having users run code in chunks and understand outputs at each step.

First, global (step 1) and local (step 2) reference libraries are built for the trnL-P6 locus. Raw sequence reads from large-herbivore fecal samples are then cleaned and prepared (step 3) for taxonomy assignment (step 4). The taxonomies assigned using the local and global reference libraries are combined (step 5) and then analyses are conducted to quantify similarities and differences among all dietary profiles—regardless of consumer species, season, and geography—and evaluate how consistently each of these traditional grouping variables is associated with major axes of dietary variation in the community (step 6a-6b).

This dataset contains all code associated with:

Creating global reference library for the trnL locus in plants (obitools_step1_global_reference_library_2025.Rmd)
Creating local Yellowstone National Park reference library for the trnL locus in plants (obitools_step2_local_reference_library_2025.Rmd)
Preparing and cleaning sequence reads from fecal samples (obitools_step3_prepare_sequence_reads_2025.Rmd)
Assigning taxonomy to cleaned sequence reads using the local and global reference libraries (obitools_step4a_taxonomy_assignment_local_2025.Rmd and obitools_step4b_taxonomy_assignment_global_2025.Rmd)
Combining the local and global reference library taxonomy assignment outputs (R_step5_combine_local_and_global_library_outputs.Rmd)
Data analyses in R for the main text (R_step6a_data_analysis_main_text.Rmd) and supplement (R_step6b_data_analysis_supplement.Rmd)

Sharing/Access information

Illumina sequence data and sample metadata are available at NCBI (BioProject accession number: PRJNA780500).

Code/Software

These coding steps are designed to follow on from one another. The files created in steps 1, 2, and 3 will be used in step 4a-b. The files created in step 4a-b will be used in step 5. The files created in step 5 will be used in step 6a-b. All code is annotated.

Steps 1-4 require the following python packages:

cutadapt
obitools

Steps 5-6 require the following R packages:

datawizard
ggrepel
glue
textshape
stringr
remotes
scales
knitr
here
ggplot2
dplyr
raster
maptools
ggsn
sf
gdalUtilities
ggspatial
rgeos
sp
tidyverse
phyloseq
vegan
ggalt
elevatr
terra
tidyterra
ggnewscale
car
speedyseq
emmeans
multcomp
microViz
permute
lattice
vegetarian
cluster
MASS
indicspecies
data.table

obitools_step1_global_reference_library_2025.Rmd - code to build a global reference database for plants. We use the ecoPCR program in obitools to simulate a PCR and to extract all sequences from the EMBL that may be amplified in silico by the two primers (GGGCAATCCTGAGCCAA and CCATTGAGTCTCTGCACCTATC) used for PCR amplification.

The list of steps for building this reference database are:

Download the whole set of EMBL/ENA sequences from Globus
Download the NCBI taxonomy
Format sequences into the ecoPCR format
Use ecoPCR to simulate amplification and build a reference database based on putatively amplified barcodes together with their recorded taxonomic information

obitools_step2_local_reference_library_2025.Rmd - code to build a local Yellowstone National Park reference database for plants. We use the ecoPCR program in obitools to simulate a PCR. All local barcode sequences can be found on BOLD in Dataset YNPBP-R3 and can be amplified in silico by the two primers (GGGCAATCCTGAGCCAA and CCATTGAGTCTCTGCACCTATC) used for PCR amplification. The code results in the creation of the file "YNPP6_completeDB_20241108.fasta" which is included in this dataset.
The list of steps for building this reference database are:

Extract trnL-P6 from BOLD sequences
Format sequences into the ecoPCR format
Use ecoPCR to simulate amplification and build a reference database based on putatively amplified barcodes together with their recorded taxonomic information

obitools_step3_prepare_sequence_reads_2025.Rmd - code to clean and prepare raw sequence reads from large-herbivore fecal samples to determine their diets.
The following steps are taken:

Remove primers from forward and reverse reads using cutadapt
Recover full sequence reads from forward and reverse reads
Remove unaligned sequence records
Dereplicate reads into uniq sequences
Denoise the sequence dataset
Clean the sequences for PCR/sequencing errors

obitools_step4a_taxonomy_assignment_local_2025.Rmd and obitools_step4b_taxonomy_assignment_global_2025.Rmd - code to assign taxonomy to sequences using global and local reference libraries in order to get the complete list of species associated to each sample. Taxonomic assignment of sequences requires a reference database compiling all possible species to be identified in the sample. Assignment is then done based on sequence comparison between sample sequences and reference sequences.
The following steps are taken for both global and local reference libraries:

Assign each sequence to a taxon
Generate the final result table

R_step5_combine_local_and_global_library_outputs.Rmd - R code to combine local and global reference library outputs.
The following steps are taken:

Subset databases to perfect matches (100% matches)
Generate summary statistics for subset databases
Make output files required to create a phyloseq object
Build the physeq object, filter, and rarefy for further analyses

R_step6a_data_analysis_main_text.Rmd and R_stepb_data_analysis_supplement.Rmd - R code for all analyses conducted on this comparative dietary dataset.
Main analyses performed include:

Calculation of grass, graminioid, and herbaceous plant relative read abundance in herbivore diets
Calculation of dietary Bray-Curtis dissimilarity
Calculation of dietary richness
Classification of dietary clusters using Partitioning Around Medoids algorithm
Identification of indicator species associated with each dietary cluster

We obtained high-resolution diet profiles from fecal samples associated with GPS-collared individuals from five large-herbivore species that migrated over a ~100-km² area. We processed a total of 697 samples from Yellowstone (inclusive of the 503 samples in this dataset and 194 separate samples from a different sampling scheme) with 38 total extraction blanks to monitor for laboratory cross-contamination. To identify dietary DNA sequences with the plant trnL-P6 marker using an Illumina MiSeq, we developed two reference libraries. The ‘local’ library comprised 334 unique trnL-P6 sequences from 882 vouchered specimens representing 75 plant families. We also built a global library using all trnL-P6 sequences available from ENA in EMBL-EBI using obitools v1.2.13. The global library was built using 26,101 unique trnL-P6 sequences obtained from 223,916 accessions representing at least 641 plant families.

To ensure that both per-base and per-sequence quality scores exceeded Q20, we first ran FastQC on the demultiplexed Illumina sequences obtained for each sample. We used cutadapt to remove primers from forward and reverse reads, allowing a maximum of 2 mismatches and no indels. We assembled sequence reads using the illuminapairedend in obitools using a minimum alignment score of 40. We used the obiuniq command to tally and merge identical sequences across samples, allowing us to quantify relative read abundance (RRA). We discarded sequences that occurred <2 times overall or that were outside the broad range of expected sequence lengths for the marker (<8 bp or >300 bp) using obigrep. We identified putative PCR artifacts as sequences that were highly similar to another sequence (1 bp difference) but with a much lower abundance (0.05%) in the majority of samples in which they occurred and discarded these sequences using the obiclean command. We used our local and global reference libraries to identify each unique plant DNA sequence with the ecoTagcommand in obitools and exported these outputs for further analysis.

Using these outputs, we crosschecked local reference library matches with global reference library matches. At this stage, we eliminated sequences that did not perfectly match a sequence in our local or global reference library (100% identity). We filtered the dataset to include the subset of 503 samples obtained from herds that included GPS collared individuals, as opposed to unaffiliated groups. All samples in this subset achieved >1000 sequence reads that passed quality controls. We thus rarified samples to an even sequencing depth for all downstream analyses and the final dataset comprised 503 samples and included 910 plant taxa; 95.7% of plant taxa were identified to family, 73% to genus, and 44% to species. We used these taxonomic assignments to characterize plant functional groups using growth form classifications of the United States Department of Agriculture (USDA) Plants Database.

The apportionment of dietary diversity in wildlife

Data files

Abstract

README: The apportionment of dietary diversity in wildlife

Sharing/Access information

Code/Software

Methods