Supplementary material: Ultraconserved elements improve the resolution of difficult nodes within the rapid radiation of neotropical sigmodontine rodents (Cricetidae: Sigmodontinae)
Parada, Andrés; D'Elía, Guillermo; Hanson, John (2021), Supplementary material: Ultraconserved elements improve the resolution of difficult nodes within the rapid radiation of neotropical sigmodontine rodents (Cricetidae: Sigmodontinae), Dryad, Dataset, https://doi.org/10.5061/dryad.15dv41nw9
Sigmodontine rodents (Cricetidae, Sigmodontinae) represent the second largest muroid subfamily and the most species-rich group of New World mammals, encompassing above 410 living species and ca. 87 genera. Even with advances on the clarification of sigmodontine phylogenetic relationships that have been made recently, the phylogenetic relationships among the 11 main group of genera (i.e., tribes) remain poorly resolved, in particular among those forming the large clade Oryzomyalia. This pattern has been interpreted as consequence of a rapid radiation upon the group entrance into South America. Here, we attempted to resolve phylogenetic relationships within Sigmodontinae using target capture and high-throughput sequencing of ultraconserved elements (UCEs). We enriched and sequenced UCEs for 56 individuals and collected data from four already available genomes. Analyses of distinct data sets, based on the capture of 4,634 loci, resulted in a highly resolved phylogeny consistent across different methods. Coalescent species-tree based approaches, concatenated matrices, and Bayesian analyses recovered similar topologies that were congruent at the resolution of difficult nodes. We recovered good support for the intertribal relationships within Oryzomyalia; for instance, the tribe Oryzomyini appears as the sister taxa of the remaining oryzomyalid tribes. The estimates of divergence times agree with results of previous studies. We inferred the crown age of the sigmodontine rodents at the end of Middle Miocene, while the main lineages of Oryzomyalia appear to have radiated in a short interval during the Late Miocene. Thus, the collection of a genomic scale data set with a wide taxonomic sampling, provided resolution for the first time of the relationships among the main lineages of Sigmodontinae. We expect the phylogeny presented here will become the backbone for future systematic and evolutionary studies of the group.
Taxon Sampling and Target Capture of UCEs
UCEs for 60 species of muroids were gathered. Of these, 53 belong to Sigmodontinae, including representatives of all sigmodontine tribes, but Wiedomyini. The other seven species belong to three of the other four subfamilies of Cricetidae, Cricetinae (2), Neotominae (1), and Tylomyinae (1), and to the families Muridae (2) and Spalacidae (1). Details of the specimens analyzed, including their taxonomic classification and museum or collection number or genome assembly are provided in Supplementary Table S1.
Data for 56 of the 60 analyzed specimens were gathered in this study as follows. Genomic DNA was extracted using the Qiagen DNeasy Tissue and Blood kit. Samples were quantified using the Qubit BR dsDNA kit (Thermo Fisher Inc). Library prep was performed using the KAPA HyperPrep Kit and 0.8-1.2ug of DNA. Libraries were enriched for UCE loci using the UCE-5Kv1 probe set (Mycroarray Inc), and sequencing was performed using Illumina Technology using 2x250 paired-end sequencing.
UCEs from the cricetine Cricetulus griseus, the murids Mus musculus and Rattus norvergicus, and the spalacid Nannospalax galili, were collected using in silico alignment of a 5,060 UCEs probe set (Faircloth et al. 2012) to genomes available from NCBI (Supplementary Table S1), extracting the matched region and 300 bp of flanking nucleotides on each side.For all specimens, a coverage of 67% and identity of 80% -during the ‘run_multiple_lastzs’ phase- were used as in Esselstyn et al. (2017), following the tutorial available at https://phyluce.readthedocs.io/en/latest/tutorial-three.html.
UCEs Matrix Construction
After sequencing, low-quality bases and adapters were removed with Trimmomatic 0.36 (Bolger et al. 2014). The Python package PHYLUCE 1.5.0 (Faircloth 2016) produced different sets of alignments. Clean reads were assembled with Trinity v2.4.0 (Grabherr et al. 2011) prior to extracting the contigs matching UCEs. Each locus was aligned with MAFFT (Katoh and Standley 2013). Trimming allowed alignments with missing nucleotides at the flanks of each alignment only if at least 65% of taxa contained data. Uncertain alignment regions were trimmed with Gblocks (Castresana 2000). Concatenated alignments were created with AMAS (Borowiec 2016). Three data set were assembled: a) one 60-species data set containing UCEs that were present in at least 70% of all analyzed species that includes 2,985 loci; two 42-genera reduced data sets consisting of one species per genus with b) a minimum of taxon occupancy of 70% that includes 2,905 loci and c) of 90% that includes 215 loci (see Supplementary Table S2a-c).
Each of the three data sets was subjected to Maximum Likelihood (ML) analysis using the concatenated matrices, as well as coalescent based analyses, after obtaining gene trees via ML. Maximum Likelihood inference was performed with IQ-TREE 1.5.5 (Nguyen et al. 2015) considering a partition for each locus (Chernomor et al., 2016) and with the model of substitution selected for each locus via Bayesian information criterion (BIC) with ModelFinder (Kalyaanamoorthy et al. 2017) using script available at https://github.com/drelo/model-partition-IQTREE. Selected models are listed in Supplementary Table S2a-c). To assess branch support, we used the ultrafast bootstrap approximation (UF) with 5000 pseudoreplicates (Minh et al. 2013) resampling partitions and then sites within resampled partitions.
ML gene trees for each locus included in the three datasets were inferred with 200 standard non-parametric bootstrap (option -b) with IQ-TREE. A species tree was inferred using gene trees with ASTRAL-III (Zhang et al. 2017). We summarized the support on each species trees with multi-locus bootstrapping (hereafter BS, Seo 2008) considering Gene+site resampling (-b -g arguments) and the quartet score (QS), which is “the fraction of the induced quartet trees in the input set that are in the species tree” (see details at https://github.com/smirarab/ASTRAL). We plotted tanglegrams to compare topologies recovered in ASTRAL and IQTREE analyses of the 60-species data set and the BEAST analysis (‘initial’ analysis, see below) using the function ‘cophylo’ from the R package phytools (Revell, 2012).
Divergence Time Analyses
Given that not all loci could be included in the dating analysis due to constrains in calculous capacity, informative and low variance loci were filtered before the divergence-time inference. A “gene shopping” procedure was employed to select candidate genes or 'good genes' among the data set with 2,905 genes, following a procedure detailed by Smith et al. (2018) that identifies informative and low variance genes, using the scripts available at https://github.com/FePhyFoFum/SortaDate. These analyses considered the species tree obtained with ASTRAL for the data set of 2,905 genes. Loci are sorted by clock-likeness, discernible information content and least topological conflict. Candidates were selected according to two criteria: a) by bipartition concordance, then root-to-tip variance, and then tree length (“order 3,1,2" option) and b) by root-to-tip variance, then bipartition concordance, and tree length (“order 1,3,2" option). Using the sorted list of genes derived from the two mentioned criteria, we constructed four data sets. The first three data sets were constructed from the list gathered with “order 3,1,2": 1) the first 17 genes, 2) the first 25 genes, and 3) the second 25 genes (26 to 50). Data set 4) was constructed using the first 25 genes sorted with “order 1,3,2."
Data set 1 was subjected to a first (I) Bayesian analyses in BEAST2 2.5.0 (Bouckaert et al. 2014). A Birth-Death process and an initial random tree was used as priors. The selected substitution models were implemented in BEAUti (see Supplementary Table S2). Runs were performed under an uncorrelated lognormal relaxed-clock model for each partition. The fossil estimate of the split of Mus-Rattus (sensu Kimura et al. 2015) was employed as a calibration point; we set it with a normal prior distribution (mean = 11.77 million years (Ma), standard deviation = 0.36), providing a minimum bound for the distribution such that the 5 % quantile corresponds to the minimum age of the fossil, while the 95 % interval accounts both for the uncertainty of the fossil age and for the incompleteness of the fossil record. Three independent runs of 6 × 10^8 generations -sampled every 15.000 generations- were performed. The other three data sets were the base of four additional dating analysis that include the following departures from the analysis of the first data set described above. In addition to the Mus-Rattus split, all used two other calibration points: the crown age of Phyllotini, based on the fossil record of Kraglievchimys formosus (Barbière et al. 2019) and the crown age of Sigmodontalia based on the fossil record of Prosigmodon oroscoi (Jacobs and Lindsay 1981; Peláez-Campomanes and Martin 2005). These priors were incorporated to the study as lognormal distributions for the age of the most recent common ancestor of Phyllotini and Sigmodontalia, respectively, with mean, standard deviation and offset of 0.001, 0,58, 2.9 and 0.03, 0.55, 4,27, respectively. Data set 2 (genes 1 to 25 with criterion 3,1,2) was used in two analyses: II with a relaxed clock and III with a strict clock. Data set 3 (genes 26 to 50 with criterion 3,1,2) was included in analysis IV with a strict clock. Finally, data set 4 (genes 1 to 25 with criterium 1,3,2) was used in analysis V with a strict clock. Analyses II, III and V were run with two independent runs of 8 x 10⁸ generations that were sampled every 50.000 generations; while analysis IV used four independent runs of 9 x 10⁷ generations that were sampled every 50.000 generations. A summary of the different data sets and analysis is presented in Supplementary Table S2.
For all analyses, convergence to stable values was checked with Tracer v.1.7 (Rambaut et al. 2018) obtaining an effective sample size (ESS) greater than 200 for all parameters. Tree and log files (108003 trees after a 10 % burn-in) were combined using LogCombiner. Trees then were compiled into maximum clade credibility (MCC) trees using TreeAnnotator to display mean node ages and highest posterior density (HPD) intervals (95 % upper and lower) for each node.
Captions of Figures and Tables
Supplementary Figures S1-S11
Supplementary Table S1
List of taxa included in the present study.
Supplementary Table S2
Best-fitting model for each UCE locus for the 60-species data set (with 70% taxon occupancy, 2a) and 42-genera data sets with 70% (2b) and 90% taxon occupancy (2c). Columns are abbreviated as follows: Var = variable sites; gap = gaps. BIC = Bayesian Information Criterion. Details of the data sets constructed via the 'gene shopping' procedure (2d) and the different dating analyses (2e) are provided in this Table (see also Online Appendix for details).
Supplementary Table S3
Genera and species richness of each sigmodontine tribe. Proportion of genera and species sampled per sigmodontine tribe in our 60-species data set (of which 53 are sigmodontine species).
Supplementary Table S4
Comparison of the crown age estimates for selected nodes obtained in the present study and those of selected previous dating analyses of Sigmodontinae. The table summarizes results for analyses of I-IV (see Online Appendix for details and Table S2). Ages (million years) are mean node heights. Mean age or Mean age and highest posterior density intervals at 95% (upper and lower).
Online Appendix 1
Detailed Materials and Methods
Three data sets:
- 60-species (70% taxon occupancy) = 2,958 UCEs
- 42-genera (90% taxon occupancy) = 215 UCEs
- 42-genera (70% taxon occupancy) = 2,905 UCEs
Zip files contain data as described below.
ALIGNS. 3 data sets, occupancy as indicated in each folder.
BEST TREES. trees from ML gene-tree analysis. 3 data sets, occupancy as indicated in each folder.
CONCATENATED. *.fasta files + partition file accompanying each file.
TREES3METHODS. Consisting of results from 3 analyses IQTREE, ASTRAL & BEAST.
- -ASTRAL 2 trees per data set, with standard bootstrap and quartet support (BS y QS).
- -BEAST MCC trees of analyses I-IV.
- -IQTREE 3 trees, one for each data set.
Fondo Nacional de Desarrollo Científico y Tecnológico, Award: 3150604
Fondo Nacional de Desarrollo Científico y Tecnológico, Award: 1180366