Data from: Ultraconserved elements improve resolution of marmot phylogeny and offer insights into biogeographic history
Data files
Mar 22, 2023 version files 150.28 MB
-
all-individuals-all-loci.fasta
150.28 MB
-
README.md
842 B
Abstract
Marmots (Marmota spp.) comprise a lineage of large-bodied ground squirrels that diversified rapidly in the Pleistocene, when the planet quickly transitioned to a drier, colder, and highly seasonal climate—particularly at high latitudes. Fossil evidence indicates the genus spread from North America, across Beringia, and into the European Alps over the course of only a few million years, beginning in the late Pliocene. Marmots are highly adapted to survive long and severely cold winters, and this likely favored their expansion and diversification over this time period. Previous phylogenetic studies have identified two major subgenera of marmots, but the timing of important speciation events and some species relationships have been difficult to resolve. Here we use ultraconserved elements and mitogenomes, with samples from all 15 extant species, to more precisely retrace how and when marmots came to inhabit a vast Holarctic range. Our results indicate marmots arose in North America in the mid Miocene (~16.3 Mya) and dispersed across the Bering Land Bridge in the late Pliocene (~3-4 Mya); in addition, our fossil-calibrated timeline is suggestive of the rise and spread of open grasslands as being particularly important to marmot diversification. The woodchuck (M. monax) and the Alaska marmot (M. broweri) are found to be more closely related to the Eurasian species than to the other North American species. Paraphyly is evident in the bobak marmot (M. bobak) and the hoary marmot (M. caligata), and in the case of the latter the data are highly suggestive of a second, cryptic species in the Cascade Mountains of Washington.
We obtained frozen tissues from 56 individuals representing all 15 currently recognized extant marmot species. Genomic DNA was extracted following the PureGene kit protocol for animal tissue (Gentra Systems Inc.). Sequence capture of ultraconserved elements (UCEs) was performed on a pilot dataset of 15 individuals according to the original UCE capture and sequencing protocols (Faircloth et al. 2012). Briefly, genomic libraries were prepared using KAPA library kits (Kapa Biosystems) and barcoded adapters for multiplexing (Faircloth & Glenn 2012). Enrichment was performed using the UCE Tetrapods 5kv1 probe set (http://ultraconserved.org/), and post-enrichment libraries were pooled and sequenced on one lane of an Illumina HiSeq 2000 instrument with 150 bp paired-end reads at the University of California Los Angeles Neuroscience Genomics Core. An additional 41 individuals were then sent to RAPiD Genomics (Gainesville, FL, USA) for library prep and sequencing using the same lab protocols. In the second batch, targeted reads were sequenced on an Illumina HiSeq 4000 platform with 150 bp paired-end reads. All indexed reads were demultiplexed with the BCLtofastq script from Illumina CASAVA v1.8.4.
We followed the phyluce v1.6.2 bioinformatics pipeline (Faircloth 2016) to process raw reads. Our newly generated reads were supplemented with additional raw UCE sequence data from ten Marmota individuals, which were provided by Bryan S. McLean as part of a recent phylogenetic study (library preparation and sequencing methods described in Mclean et al. 2019; Table S1). Adapter sequences and low-quality base calls were removed using Illumiprocessor (Faircloth 2013), which is a wrapper for Trimmomatic (Bolger et al. 2014). The cleaned reads were assembled using velvet v1.2.10 (Zerbino and Birney 2008), with optimal kmer values for each taxon estimated with VelvetOptimiser v2.2.6 (Gladman and Seeman 2012). To supplement our dataset, all five published Marmota genomes (M. himalayana, M. monax, M. marmota, M. flaviventris, M. vancouverensis) were retrieved from GenBank; for outgroups, we also retrieved genomes for the Cape ground squirrel (Xerus inauris, GenBank accession GCA_004024805), the arctic ground squirrel (Urocitellus parryii; GCA_003426925), Gunnison’s prairie dog (Cynomys gunnisoni; GCA_011316645), the Daurian ground squirrel (Spermophilus dauricus; GCA_002406435), and the thirteen-lined ground squirrel (Ictidomys tridecemlineatus; GCA_016881025). Contigs containing UCE matches were extracted from both the velvet assemblies and the downloaded genomes using the phyluce_assembly_match_contigs_to_probes script, and each UCE locus was aligned with MAFFT 6.240 (Katoh and Standley 2013) using the default settings and edge trimmed with GBlocks (Castresana 2000). The dataset was then reduced to a 75% complete matrix by removing any loci with data from fewer than 75% of individuals.
To partition the dataset, we first used the entropy-based method SWSC-en, which is designed specifically for UCE loci (Tagliacollo and Lanfear 2018). Most UCEs are split into three partitions: one containing the middle, highly conserved core of the UCE; and two flanking ends containing the more variable positions outside the core. Because SWSC-en partitions each locus individually, the resulting partitions were input into PartitionFinder2 for further consolidation (Lanfear et al. 2017). In this step, for example, the middle partitions from many different loci might be combined into one. PartitionFinder2 was also used to identify the best substitution model for each partition based on AICc score. Maximum-likelihood trees were estimated for the concatenated dataset using IQ-TREE (Nguyen et al. 2015), with branch supports obtained using the ultrafast bootstrap method (UFboot; Hoang et al. 2018) and the SH-like approximate likelihood ratio test (SH-aLRT; Anisimova et al. 2011), both with 1000 replicates.
We also used a coalescent-based method to estimate a phylogeny. Gene trees are weighted equally when imputing a species tree, but many UCE loci contained only weak phylogenetic signal; to account for this, we sorted the UCE loci by number of parsimony-informative sites using the phyluce_align_get_informative_sites script, and the dataset was reduced to the top quartile of loci. A maximum-likelihood gene tree was then generated for each UCE locus in the reduced dataset using IQ-TREE, with branch supports obtained using UFboot and SH-aLRT. Because not all loci are expected to contain strong phylogenetic signal for every node in the tree, all branches with less than 10% bootstrap support were collapsed (Zhang et al. 2019). These gene trees were used to estimate species trees using ASTRAL-III (Rabiee et al. 2019). Because we found evidence in our IQ-TREE analysis that some species did not form monophyletic groups, we left individuals as OTUs (operational taxonomic unit) rather than categorizing individuals by species (i.e., we did not use the -a flag option).