Nuclear phylogeography of the temperate tree species Chiranthodendron pentadactylon (Malvaceae): Quaternary relicts in Mesoamerican cloud forests
Cite this dataset
Hernández-Langford, Diana Gabriela; Siqueiros-Delgado, María Elena; Ruíz-Sánchez, Eduardo (2020). Nuclear phylogeography of the temperate tree species Chiranthodendron pentadactylon (Malvaceae): Quaternary relicts in Mesoamerican cloud forests [Dataset]. Dryad. https://doi.org/10.5061/dryad.dfn2z34w9
The Mexican hand tree or Canac (Chiranthodendron pentadactylon) is a temperate tree species of cloud and pine-oak forests of southern Mexico and Guatemala. The characteristic hand-shaped flower is used in folk medicine and has constituted the iconic symbol of the Sociedad Botánica de México since 1940. Here, the evolutionary history was estimated through phylogeographic analyses of nuclear DNA sequences obtained through restriction site associated DNA sequencing and ecological niche modeling. Total genomic DNA was extracted from leaf samples obtained from a representative number (5 to 10 per sampling site) of individuals distributed along the species geographic range. In Mexico, population is comprised by spatially isolated individuals which may follow the trends of cloud forest fragmentation. Whether in Guatemala, Chiranthodendron may constitute a canopy dominant species near the Acatenango volcano. The distributional range encompasses geographic provinces separated by the Isthmus of Tehuantepec.
The objectives of the study were to: Estimate the genetic structure to define whether the observed range disjunction exerted by the Isthmus of Tehuantepec translates into separate populations. To relate population divergence timing and demographic trends to historical climate change, and to test hypotheses related to Pleistocene refugia.
Patterns of genetic diversity indicated high levels of genetic differentiation between populations separated by the Isthmus. The western and eastern population diverged approximately 0.873 MYA. Demographic analyses supported a simultaneous split from an ancestral population and rapid expansion from a small stock approximately 0.2 MYA corresponding to a glacial period. The populations have remained stable since the LIG (130 KYA). SDM predicted a decrease in potential distribution in the LIG and an increase during LGM (22 KYA), Mid-Holocene (6 KYA) and present times.
Divergence estimations support the hypothesis that populations represent Quaternary relict elements of a species with broader and northernmost distribution. Pleistocene climatic shifts exerted major influence on the distribution of populations allowing dispersion during episodes of suitable climatic conditions and structuring during the first interglacial with a time period length of 100 KYR and the vicariant influence of the Isthmus. Limited demographic expansion and population connectivity during the LGM supports the moist forest hypothesis model.
C. pentadactylon is a tree of up to 30 m tall (Fig. 1A) with a minimum generative growth time of more than 3 years . It is resistant to low temperatures (down to 5ºC) and dry conditions . The dichogamous red flowers (Fig. 1B) can be cross-pollinated or geitonogamously pollinated . Pollination is mediated by birds  and bats . The flowering period spans from November to April and the fruiting period from April to May . Fruits are dehiscent capsules (Fig. 1C) and the strophiolated seeds  dispersal is potentially mediated by birds .
Sampling and DNA sequencing
Leaf samples from 5 to 10 individuals per sampling site found along the range of C. pentadactylon in Mexico and Guatemala (Fig. 2) were collected and stored and dried with silica gel at room temperature until further processing. Two additional leaf samples from Fremontodendron californicum and F. mexicanum were collected from the Rancho Santa Botanic Garden in Claremont, California to be incorporated as outgroup species. No special permission was required to collect the samples. Taxon identification and geographic corroboration were performed during fieldwork after reviewing voucher specimens deposited in the herbarium of the Universidad Autónoma de Aguascalientes (Herbario UAA), the Instituto de Ecología AC at Pátzcuaro, the Facultad de Agronomía de la Universidad de San Carlos de Guatemala (AGUAT), the Escuela de Biología, USAC (Herbario BIGU), the Missouri Botanical Garden (MO) and the Chicago Natural History Museum. Each later defined population has an accompanying voucher deposited at the Herbarium UAA (Additional file 1).
Total genomic DNA was extracted following the cetyltrimethylammonium bromide (CTAB) protocol  with an additional incubation step with the enzyme pectinase to remove the excess of carbohydrates. Thereupon, a RAD sequencing library was prepared at Rancho Santa Ana Botanic Garden (Claremont, California) and at the University of California Riverside as described by Etter et al  and modified by Medina  using SbfI-HF restriction enzyme (New England Biolabs). Samples were pooled into a multiplexed library and sequenced on an Ilumina NextSeq500 platform to generate 150-bp single-end readings.
The raw sequence data were analyzed and assembled de novo in the software pipeline ipyrad v0.7.28 , which identifies two sequences as orthologous as determined by a specified degree of similarity (0.90). Filtering parameters were set to trim all reads to a length of 140 bp. Reads were clustered allowing a maximum number of ambiguous base calls (Q < 20) of 0 and a maximum depth of coverage of 10k. To date, karyotype information and ploidy level have not been determined, therefore loci containing more than two alleles were discarded to exclude potential paralogs.
When clustered across samples a minimum number of samples per locus were set to 66 to maximize the representativeness of the total samples per locus. The samples had an average of 565 K reads that passed quality filtering. These clustered into an average of 8837 clusters per sample with a mean depth of 4526 producing a mean of 4075 consensus sequences per sample and a maximum total of 70 loci were recovered (Additional file 2).
Loci retrieved from the RADSeq genome-partitioning approach are mainly from the nuclear genome . However, 5 loci were highly similar to mitochondrial loci from closely related species as identified through the basic local alignment search tool (BLAST) option from the National Center for Biotechnology Information (NCBI) database and were discarded from further analysis, resulting in a total of 65 nuclear loci. Mitochondrial DNA substitution rate is at least 5 times slower than nuclear genes  and thence may not provide the enough resolution to detect infraspecific historical processes. Additional data files recovered from the alignment include the variant call format (vcf) file and the SNPs file.
Genetic diversity and genetic differentiation
The genetic diversity was estimated through the nucleotide diversity (Pi) measure proposed by Nei (1987) and implemented by the software DnaSP v6 . The estimation was based on the sequence variation information stored in the variant call format (vcf) retrieved from the ipyrad alignment.
The underlying degree of genetic structure was determined through the Bayesian clustering approach implemented by the software STRUCTURE v2.3.4 (Pritchard et al, 2012). The substructure was estimated from the set of 65 nuclear loci. Posterior probabilities for the ancestry model and allele frequency model parameters were estimated under the admixture model and the correlated allele frequencies model respectively. The MCMC process estimation was conducted using a burn-in of 1,000,000 followed by 100,000 chains for each of the 10 replicas assigned to a range of 1 to 9 K assumed groups. The most probable K value was estimated using the Evanno method  implemented in STRUCTURE HARVESTER  and the corresponding individual membership coefficients matrix was used as input file in STRUCTURE PLOT  to visualize the degree of genetic differentiation among groups.
Genetic structure was further evaluated through a global analysis of molecular variance (AMOVA) using the software Arlequin v3.5 . The analysis was performed based on a weighted average over all loci and confidence intervals for the fixation index were calculated by bootstrapping with 20,000 replicates.
Phylogenetic network and divergence times
To visualize the phylogenetic relationships among and within populations, a phylogenetic network was constructed using the Neighbor-Net algorithm implemented by the software SplitsTree5 . The relationships were estimated using the linked SNPs file. The Neighbor-Net method consists of the agglomeration of weighted collection of splits or partition of the set of taxa which constitute the building blocks of a phylogenetic tree and provides the visualization of the space of feasible trees. Thereof, constitutes a useful tool for the representation of the relationships of taxa in which the underlying evolutionary history may not be treelike due to processes such as recombination, hybridization, gene conversion and gene transfer .
To relate the genetic divergence among populations to pre-Pleistocene and Pleistocene events, calibrated time trees were calculated under the multispecies coalescent model implemented in StarBEAST2 . StarBEAST2 models the incomplete lineage sorting process between and within species with no recent history of gene flow  and allows variation in substitution rates of different genes and species by using gene tree relaxed clock models . The time trees were calculated from a selection of 36 polymorphic nuclear loci (Additional file 3) of approximately 140 bp. The best DNA substitution model (Table 1 in Additional file 4) and gamma rate heterogeneity for each locus were determined using jModelTest v2.1.10  through the implementation of the Akaike information criterion (AIC). The corresponding site models were set with four rate categories and an uncorrelated lognormal molecular clock was established for the clock model. The constant populations model was selected for the population model parameter and the Calibrated Yule model for the tree prior. The time trees were calibrated using a secondary calibration point due to the absence of paleontological information to constrain a minimum age of divergence for C. pentadactylon. This calibration point derives from a dated phylogeny for the Malvaceae family, which was used to infer divergence dates and diversification rates within the Theobroma genus . The resulting estimated age of divergence between C. pentadactylon and its sister genus Fremontodendron was on the order of 5 MYA. This was set as the median age for the informative lognormal prior with a standard deviation of 0.01. The MCMC chain length was 180,000,000 and logged every 5,000. The log files were analyzed with Tracer v1.7.1 for parameter convergence. The burn-in was set to remove ten percent of the tree files before the generation of a maximum clade credibility tree with median heights in TreeAnnotator.
Demographic and evolutionary history
To assess changes in population size over time and determine their correspondence to pre-Pleistocene or Pleistocene events, a Coalescent Bayesian Skyline Plot (BSP) was estimated for each group defined in STRUCTURE and later defined as separate populations, using BEAST v.2.5.2. Linked SNPs data were set as input files. The best substitution model for each set of SNPs was determined using jModelTest through the implementation of the AIC (Table 2 in Additional file 4). The corresponding site models were set with four rate categories and a strict molecular clock was established for the clock model. The time scale was calibrated with the divergence age of 0.873 MYA previously estimated and a standard deviation of 0.2 under a lognormal distribution. The MCMC chain length was set to 10,000,000 and to 15,000,000 for the western and eastern population respectively and logged every 1000. The log files were analyzed in Tracer for parameter convergence.
The population history of C. pentadactylon was inferred from an approximate Bayesian computation (ABC) in which evolutionary scenarios were simulated and compared through posterior probabilities using the DIYABC v2.1.0  software. The linked SNPs data file was used to simulate 1 million datasets per scenario. The evolutionary scenarios were built considering the genetic structure and StarBEAST2 analyses: a) Two populations (Pop1 and Pop2) have diverged simultaneously from an ancestral population at t1 (scenario1). b) Pop2 (Eastern population) diverged from Pop1 (Western population) at t1 (scenario 2). c) Pop1 diverged from Pop2 at t1 (scenario 3). d) Pop2 derived from few western immigrants at t2 and diverged at t1 (scenario 4). e) Pop1 derived from few eastern immigrants at t2 and diverged at t1 (scenario 5).
Posterior probabilities of scenarios were assessed using a polychotomic weighted logistic regression on 1% of the simulated datasets. The fit between the simulated and observed datasets was evaluated through the implementation of a model checking procedure based on a principal component analysis (PCA). Confidence in scenario choice was assessed by means of the simulation of 500 pseudo-observed datasets (PODs) under each scenario to estimate Type I and Type II error rates.
The potential distribution of C. pentadactylon for the present day, Mid-Holocene (Mid-HLC, 6 KYR), Last Glacial Maximum (LGM, 22 KYR) and Last Interglacial (LIG, 130 KYR) was predicted using Maxent 3.4.1 . The georeferenced sampling locations were used as occurrence data in addition to data retrieved (15 March 2017) from the Global Biodiversity Information Facility (GBIF; Chiranthodendron pentadactylon Larreat in GBIF Secretariat (2019). GBIF Backbone Taxonomy. Checklist dataset https://doi.org/10.15468/39omei), TROPICOS (https://www.tropicos.org/Name/3900594, Missouri Botanical Garden, St. Louis, MO, USA; available from: http://www.tropicos.org) and REMIB (http://www.conabio.gob.mx/remib/doctos/remib_esp.html) databases and represented 152 unique localities after data cleansing including the removal of duplicate points. The models were built from 9 climate layers selected through the jackknife method. These had a 2.5 arc minute spatial resolution and were acquired from the WorldClim database . Layers for the LGM and the Mid-HLC were obtained from two paleoclimate models (CCSM and MIROC) that simulate differing climatic predictions . Distribution models were built with 10 replicates using the default settings and 25% of the occurrence data was selected for model validation.
Consejo Nacional de Humanidades, Ciencias y Tecnologías, Award: 358420