Data from: Ultraconserved element phylogenomics and biogeography of the agriculturally important mason bee subgenus Osmia (Osmia)
Data files
Aug 07, 2024 version files 1.14 GB
-
BioGeoBEARS.zip
-
Bulk-Assembly-Contigs.zip
-
Divergence-Dating.zip
-
DNA-Matrices-and-Partition-Files.zip
-
README.md
-
Tree-and-Log-Files.zip
-
UCE-Alignments-Untrimmed.zip
-
uce-baits-hymV2-bee-ant-uniques.fasta
-
uce-contigs-unaligned.fasta
Abstract
One of the most important non‐Apis groups of bees for agriculture is the mason bee subgenus Osmia Panzer (Osmia), or Osmia s.s. (Hymenoptera: Megachilidae). Out of the 29 known species, four have been developed as managed pollinators of orchards. In addition, the group is important as a source of non‐native pollinators, given that several species have been introduced into new areas. Osmia s.s. occurs naturally throughout the northern temperate zone with greatest species richness in Europe and Asia. Here, we integrate phylogenomic data from ultraconserved elements (UCEs), near complete taxon sampling, and a diversity of analytical approaches to infer the phylogeny, divergence times, and biogeographic history of Osmia s.s. We also demonstrate how mitochondrial sequence data can be extracted from ultraconserved element data and combined with sequences from public repositories in order to test the phylogeny, examine species boundaries and identify specimen‐associated, non‐bee DNA. We resolve the phylogeny of Osmia s.s. and show strong support that Nearctic Osmia ribifloris is the sister group to the rest of the subgenus. Biogeographic analyses indicate that the group originated during the Late Miocene in the West Nearctic plus East Palearctic region following dispersal from the East Palearctic to the West Nearctic across the Bering land bridge prior to its closure 5.5–4.8 Ma. The mitochondrial DNA results reveal potential taxonomic synonymies involving Osmia yanbianensis and Osmia opima, and Osmia rufina, Osmia rufinoides, and Osmia taurus.
README: Data from: Ultraconserved element phylogenomics and biogeography of the agriculturally important mason bee subgenus Osmia (Osmia)
https://doi.org/10.5061/dryad.7d7wm37t4
UCE phylogenomics and biogeography of the agriculturally important mason bee subgenus Osmia (Osmia)
Michael G. Branstetter1, Andreas Müller2, Terry L. Griswold1, Michael C. Orr3, Chao-Dong Zhu3
1U.S. Department of Agriculture, Agricultural Research Service, Pollinating Insects Research Unit, 5310 Old Main Hill, Utah State University, Logan, UT 84322, USA; ORCID-iD: https://orcid.org/0000-0002-3734-6166
2ETH Zurich, Institute of Agricultural Sciences, Biocommunication and Entomology, Schmelzbergstrasse 9/LFO, 8092 Zurich, Switzerland
3Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, 1 Beichen West Road, Chaoyang District, Beijing, 100101, P. R. China.
Description of the data and file structure
Files in Dryad Package:
- Bulk-Assembly-Contigs: Contains bulk assemblies for ALL samples included in the publication. Data were assembled using Trinity and SPAdes. Taxon names have been updated to the final published names.
- uce-contigs-unaligned.fasta: Monolithic fasta file output by phyluce after running the get_fastas_from_match_counts script. Contains all UCE contigs, untrimmed and unaligned. Taxon names have been updated to the final published names.
- UCE-Alignments-Untrimmed: All UCE alignments before trimming (Gblocks or Spruceup). Loci aligned using MAFFT-LINSi algorithm. Taxon names have been updated to the final published names.
- DNA-Matrices-and-Partition-Files: Concatenated DNA matrices and associated partition files. Taxon names have been updated to the final published names.
- Tree-and-Log-Files: Tree files and analysis log files. Taxon names have been updated to the final published names in the tree files but not the log files. File names start with the figure number used in the main publication.
- BioGeoBEARS: Input files and script files for BioGeoBEARS. Taxon names represent final published names.
- Divergence-Dating: Input files for BEAST2 and MCMCTree divergence dating analyses. Taxon names have not been updated to the final published names.
- uce-baits-hymV2-bee-ant-uniques.fasta: UCE bait file used to enrich UCE loci and to extract out UCE contigs from the bulk set of contigs.
Additional Notes
- All sample names include an extraction number (e.g. BLX#). This can be used to match up samples with additional information in the published tables.
- The matrix and tree file names include information about the dataset and analysis. The file name usually includes the number of samples (e.g. 59t), the percent taxon completeness (e.g. 75p), information about locus filtering (e.g. clock500r), the analysis program (e.g. iqtree), and partitioning scheme used (e.g. swsc). Additional information about the analysis can be found in the paper and associated supplemental files.
Methods
See accompanying paper for more details.
UCE Molecular Data Generation
To generate a genome-scale dataset of Osmia s.s we used ultra-conserved element (UCE) phylogenomics (Branstetter et al., 2017; Faircloth et al., 2012), an approach that combines the targeted enrichment of thousands of nuclear, UCE loci with multiplexed next-generation sequencing. Enrichment was performed using a recently published bait set that targets loci shared across all Hymenoptera (“hym-v2-bee-ant-specific”; Grab et al., 2019). This bait set is a subset of the principal Hymenoptera bait set (Branstetter et al., 2017) and includes probes synthesized from one ant and one bee genome only. It includes 9,068 unique baits targeting 2,545 UCE loci, plus an additional 264 baits targeting 12 exons from 7 “legacy” markers (ArgK, CAD, EF1aF1 and F2, LwRh, NaK, PolD1, and TopI) that have been used extensively in Sanger sequencing-based studies. The bait set is currently available for purchase from Arbor Biosciences as a catalog kit.
For all newly sequenced specimens, we extracted DNA using either the Qiagen DNeasy Blood and Tissue kit (48 specimens), or the Zymo Quick-DNA Miniprep Plus Kit (6 specimens). All extractions were performed using mounted museum specimens that had been collected between 1971-2017. For most specimens we removed and destructively sampled one to two legs, leaving the rest of the specimen intact and able to serve as a “same specimen” DNA voucher. For a couple of smaller species, we performed whole body, non-destructive extractions, in which the specimens were removed from their pin and soaked in proteinase-K. Following the soak, the specimens were rinsed in 95% ethanol, dried, and re-mounted. For each extraction, we followed the manufacturer’s extraction protocol, except for the following modifications (applies to both kit types): (1) the proteinase-K digestion was performed overnight for 12-16 hours; (2) the elution buffer was warmed to 60º C prior to elution; (3) the elution buffer was allowed to incubate in the column for five minutes; and (4) DNA was eluted two times using 60-75 uL of fresh elution buffer each time. After extraction, the concentration of eluted DNA was measured using a Qubit 3.0 fluorometer (Thermo Fisher Scientific) and 1-50 ng of extracted DNA was used for library preparation and target enrichment.
Each DNA sample was sheared using a Qsonica Q800R2 acoustic sonicator, with the target fragment size range being 400-600 bp. Having larger fragment sizes in the sequencing pool improves the amount of flanking DNA that is sequenced, which can improve UCE contig lengths following assembly. For recently collected, high quality samples, the sonicator was run for 90-120 seconds (shearing time) at 25% amplitude and with a 10-10 second on-off pulse. For older samples that likely had more degraded DNA, we adjusted the shearing times to between 30-60 seconds. Following sonication, fragmented DNA was purified at 3x volume using a homemade SPRI-bead substitute (Rohland & Reich, 2012). Illumina sequencing libraries were then generated using Kapa Hyper Prep kits and custom, 8 bp, dual-indexing adapters (Glenn et al., 2019). All library preparation steps were performed at quarter volume of the manufacturer’s recommendations, except for PCR, which was done at the full 50 uL volume. Limited cycle PCR was performed for 12 cycles on most samples, but for lower quality samples we increased the number of cycles to 14-16 cycles, usually as a re-PCR of the of the pre-PCR library. Each amplified library was cleaned using 1.0-1.2x SPRI beads in order to remove contaminants and select out small fragments below 200 bp, including possible adapter dimer. The concentration of the final cleaned library was measured using Qubit.
Enrichment was performed using the bee-ant UCE bait set described above and by following either a standard UCE protocol (enrichment protocol v1.5 available at ultraconserved.org), based on Blumenstiel et al. (2010), or by using a modified protocol in which we followed Arbor Biosciences v3.02 protocol for enrichment day 1 (more efficient than the standard protocol), and the standard UCE protocol for day 2. In either case, we combined up to 10 samples into equimolar enrichment pools and used 500 ng of pooled DNA for enrichment. On the first day of enrichment, the pooled DNA samples were combined with biotinylated RNA-baits and a set of blocking agents that reduce non-specific binding (salmon sperm, adapter blockers, Cot-1). We used a 20k probe, custom order of the probe set (diluted 1 ul bait to 4 uL H2O) and our own blocking reagents, which included chicken Cot-1 DNA (Applied Genetics Laboratories, Inc.), rather than the human Cot-1 DNA that comes with Arbor Bioscience’s kits. We performed the enrichment incubation at 65ºC for 24 hours using strip tubes and a PCR thermal cycler. For the second day of enrichment we used 50 uL of streptavidin beads per sample and performed on-bead PCR following the three heated (65ºC) wash steps. The enriched pools were amplified for 18 cycles and the resulting products were cleaned with SPRI beads at 1x volume. Following enrichment, each enrichment pool was quantified using qPCR and pooled together into a final sequencing pool at equimolar concentrations. Sequencing pools were either sequenced locally on an in-lab Illumina MiniSeq (2x125) or sent to the University of Utah Genomics Core for sequencing on an Illumina HiSeq 2500 (2x125, v4 chemistry).
Sequence Processing and Matrix Generation
The raw sequence data were demultiplexed and converted to FASTQ format using BCL2FASTQ (Illumina). The resulting FASTQ reads were then cleaned and processed using the software package PHYLUCE v1.6 (Faircloth, 2016) and associated programs. Within the PHYLUCE environment, the raw reads were trimmed for adapter contamination using ILLUMIPROCESSOR v2.0 (Faircloth, 2013), which incorporates TRIMMOMATIC (Bolger et al., 2014). Trimmed reads were assembled de novo into contigs using both TRINITY v2013-02-25 (Grabherr et al., 2011) and SPADES (Bankevich et al., 2012) for comparative purposes. To identify and extract UCE contigs from the bulk set of contigs we used a PHYLUCE program (match_contigs_to_probes) that uses LASTZ v1.0 (Harris, 2007) to match probe sequences to contig sequences and create a relational database of hits. For this step, we adjusted the min-identity and min-coverage settings to 70 and 75, respectively, which we have found to recover the highest number of UCE loci in bees when using the hym-v2-bee-ant UCE probes and bait file.
After extracting UCE contigs, we aligned each UCE locus using MAFFT v7.130b (Katoh & Standley, 2013), trying both the default algorithm setting in PHYLUCE (FFT-NS-i) and the slower but probably more accurate L-INS-i algorithm. To remove poorly aligned regions, we trimmed internal and external regions of alignments using GBLOCKS (Talavera & Castresana, 2007), with reduced stringency parameters (b1:0.5, b2:0.5, b3:12, b4:7). Using PHYLUCE, we then concatenated all loci into a supermatrix and inferred a preliminary tree using IQ-TREE v2.0-rc1 (Minh et al., 2020). The resulting tree showed that several of the older, lower quality samples had exaggerated terminal branch lengths, likely resulting from difficulties in aligning shorter sequence fragments to more complete ones, an issue that is seemingly common in species-level target enrichment datasets containing old samples.
To remove these poorly aligned sequences, we used the automated trimming program SPRUCEUP (Borowiec, 2019) on the complete set of UCE alignments. This program identifies outlier alignment sequences in individual samples, rather than alignment columns, and trims the sequences based on user defined cutoffs applied to all samples or to specific samples only. In the configuration file, we set SPRUCEUP to use the Jukes-Cantor-corrected distance calculation, a window size of 20 base pairs (bp), an overlap size of 15 bp, a lognormal distribution, and a cutoff value of 0.98. Additionally, we used manual cutoffs for the samples with the most exaggerated terminal branch lengths (Osmia_maxillaris_BLX43:0.039, Osmia_melanocephala_BLX45:0.035, Osmia_emarginata_infuscata_BLX38:0.04, Osmia_mustelina_umbrosa_BLX47:0.047, Osmia_cornuta_BLX35:0.037, Osmia_cerinthidis_cerinthidis_BLX32:0.04, Osmia_cerinthidis_cerinthidis_BLX176:0.05, Osmia_rufina_BLX368:0.04, Osmia_bicornis_globosa_BLX174:0.037). We arrived at these settings over several rounds of testing, with the objective being to do minimal trimming for most species and more aggressive trimming for the problematic samples (trimming results show in Fig. 1).
Following trimming with SPRUCEUP, we used PHYLUCE to re-trim all alignments with GBLOCKS (same settings as above) and then to filter the alignments for taxon occupancy, requiring loci to have 0%, 50%, 75%, 90%, 95%, 98%, and 100% taxon completeness. For each resulting set of filtered loci, we used PHYLUCE to generate descriptive statistics for the data and to concatenate the loci into a supermatrix for analysis. For each supermatrix we performed an unpartitioned phylogenetic analysis with IQ-TREE v2 (Minh et al., 2020), selecting GTR+F+G4 as the model of sequence evolution and performing 1,000 Ultrafast Bootstrap replicates (UFB; Hoang et al., 2018) and 1,000 replicates of the SH-like approximate likelihood ratio test (SH-aLRT; Guindon et al., 2010). For most analyses we used the GTR+F+G4 model only and did not perform model testing. This was done because using more complex models does not negatively impact phylogenetic results (Abadi et al., 2019), even when a less complex model might be favored. We also avoided using the proportion of invariant sites (+I) model with gamma distributed rates (+G) for among-site rate heterogeneity because the two parameters cannot be optimized together (Yang, 2006). Based on a review of the matrix statistics and the results of the unpartitioned analyses, we selected the 75% complete locus set as the focal set for additional analyses.