Unraveling evolutionary pathways: Allopolyploidization and introgression in polyploid Prunus (Rosaceae)
Data files
Aug 26, 2025 version files 648.65 MB
-
1_Prunus_SCN1806_ref.fasta
5.63 MB
-
2_Prunus_SCN_MO.phy
263.40 MB
-
3_Prunus_SCN_RT.phy
364.03 MB
-
4_CDS77_ref.fasta
68.02 KB
-
5_Prunus_CDS.phy
7.64 MB
-
6_Maddenia_cp.phy
6.50 MB
-
Maddenia_cp_IQTREE.log
43.61 KB
-
Maddenia_cp_IQTREE.tre
2.11 KB
-
Maddenia_cp_RAxML.log
25.56 KB
-
Maddenia_cp_RAxML.tre
2.97 KB
-
Prunus_CDS_ASTRAL.log
20.21 KB
-
Prunus_CDS_ASTRAL.tre
4.85 KB
-
Prunus_CDS_IQTREE.log
36.75 KB
-
Prunus_CDS_IQTREE.tre
5.75 KB
-
Prunus_CDS_RAxML.log
38.52 KB
-
Prunus_CDS_RAxML.tre
8.07 KB
-
Prunus_MO_ASTRAL.log
15.86 KB
-
Prunus_MO_ASTRAL.tre
4.80 KB
-
Prunus_MO_IQTREE.log
110.13 KB
-
Prunus_MO_IQTREE.tre
6.08 KB
-
Prunus_MO_RAxML.log
359.02 KB
-
Prunus_MO_RAxML.tre
8.09 KB
-
Prunus_RT_ASTRAL.log
14.97 KB
-
Prunus_RT_ASTRAL.tre
4.74 KB
-
Prunus_RT_IQTREE.log
154.83 KB
-
Prunus_RT_IQTREE.tre
6.08 KB
-
Prunus_RT_RAxML.log
492.40 KB
-
Prunus_RT_RAxML.tre
8.10 KB
-
README.md
3.15 KB
Abstract
Allopolyploidization, resulting from hybridization and subsequent whole‐genome duplication (WGD), is a fundamental mechanism driving evolutionary diversification across various lineages within the Tree of Life. The polyploid Prunus (Rosaceae), significant for its economic and agricultural value, provides an ideal model for investigating the evolutionary dynamics associated with allopolyploidy. In this study, we utilized deep genome skimming (DGS) data to demonstrate a comprehensive analytical framework for elucidating the underlying allopolyploidy that includes a newly adapted tool (DGS‐Tree2GD) tailored explicitly for accurately detecting WGD events. Additionally, we introduced two methods to evaluate the contribution of incomplete lineage sorting (ILS) to lineage diversification. Phylogenomic discordance analyses revealed that allopolyploidization, rather than ILS, played a dominant role in the origin and dynamics of polyploid Prunus. Moreover, we inferred that the uplift of the Himalayas from the Middle to Late Miocene was a key driver in the rapid diversification of the Maddenia clade, an endemic group in East Asia. This geological event facilitated extensive hybridization and allopolyploidization, particularly the introgression between the Himalayas–Hengduan and Central–Eastern China clades. This case study demonstrates the robustness and efficacy of our analytical approach in precisely identifying WGD events and elucidating the evolutionary mechanisms underlying allopolyploidization in polyploid Prunus.
https://doi.org/10.5061/dryad.fbg79cp46
The package contains the datasets and analysis output files.
Description of the data and file structure
Alignments:
Data 1. 1_Prunus_SCN1806_ref.fasta = 1806 single copy nuclear reference genes filtered from three genomes (P. dulcis, P. armeniaca, and P. persica)
Data 2. 2_Prunus_SCN_MO.phy = concatenated alignments of 1622 SCN genes from “Monophyletic Outgroup” (MO) dataset
Data 3. 3_Prunus_SCN_RT.phy = concatenated alignments of 2268 SCN genes from “RooTed ingroup” (RT) dataset
Data 4. 4_CDS77_ref.fasta = plastid coding genes reference genes
Data 5. 5_Prunus_CDS.phy = concatenated alignments of plastid coding genes (CDSs)
Data 6. 6_Maddenia_cp.phy = concatenated alignments of the Maddenia-plastome dataset
Trees:
Prunus_MO_ASTRAL.tre = the phylogenetic tree estimated using ASTRAL-III based on the MO dataset
Prunus_MO_IQTREE.tre = the phylogenetic tree estimated using IQ-TREE2 based on the MO dataset
Prunus_MO_RAxML.tre = the phylogenetic tree estimated using RAxML based on the MO dataset
Prunus_RT_ASTRAL.tre = the phylogenetic tree estimated using ASTRAL-III based on the RT dataset
Prunus_RT_IQTREE.tre = the phylogenetic tree estimated using IQ-TREE2 based on the RT dataset
Prunus_RT_RAxML.tre = the phylogenetic tree estimated using RAxML based on the RT dataset
Prunus_CDS_ASTRAL.tre = the phylogenetic tree estimated using ASTRAL-III based on the CDS dataset
Prunus_CDS_IQTREE.tre = the phylogenetic tree estimated using IQ-TREE2 based on the CDS dataset
Prunus_CDS_RAxML.tre = the phylogenetic tree estimated using RAxML based on the CDS dataset
Maddenia_cp_IQTREE.tre = the phylogenetic tree estimated using IQ-TREE2 based on the Maddenia-plastome dataset
Maddenia_cp_RAxML.tre = the phylogenetic tree estimated using RAxML based on the Maddenia-plastome dataset
Log files:
Prunus_MO_ASTRAL.log = the phylogenetic tree estimated using ASTRAL-III based on the MO dataset
Prunus_MO_IQTREE.log = the phylogenetic tree estimated using IQ-TREE2 based on the MO dataset
Prunus_MO_RAxML.log = the phylogenetic tree estimated using RAxML based on the MO dataset
Prunus_RT_ASTRAL.log = the phylogenetic tree estimated using ASTRAL-III based on the RT dataset
Prunus_RT_IQTREE.log = the phylogenetic tree estimated using IQ-TREE2 based on the RT dataset
Prunus_RT_RAxML.log = the phylogenetic tree estimated using RAxML based on the RT dataset
Prunus_CDS_ASTRAL.log = the phylogenetic tree estimated using ASTRAL-III based on the CDS dataset
Prunus_CDS_IQTREE.log = the phylogenetic tree estimated using IQ-TREE2 based on the CDS dataset
Prunus_CDS_RAxML.log = the phylogenetic tree estimated using RAxML based on the CDS dataset
Maddenia_cp_IQTREE.log = the phylogenetic tree estimated using IQ-TREE2 based on the Maddenia-plastome dataset
Maddenia_cp_RAxML.log = the phylogenetic tree estimated using RAxML based on the Maddenia-plastome dataset
Taxon sampling, DNA extraction, and sequencing
Based on the recently proposed phylogenetic backbone of Prunus s.l. (Hodel et al., 2023), we performed comprehensive sampling across all major clades to investigate the evolutionary origins and diversification patterns of the polyploid Prunus. The Maddenia group, in particular, was extensively sampled due to its suitability as a model system for investigating rapid radiation involving allopolyploidization and hybridization. We collected a total of 41 samples representing five morphologically delineated Maddenia species: Prunus fujianensis and P. gongshanensis (seven individuals), P. himalayana and P. hypoxantha (five), and P. hypoleuca (17). This targeted sampling strategy enabled us to capture the comprehensive morphological, genetic, and geographic diversity observed within this group (Su et al., 2021). To address the lack of fossil records in Prunus s.l., we expanded our phylogenetic analysis to include 30 additional species from the subfamily Amygdaloideae, thus improving the robustness of evolutionary inference and temporal calibration. Rosa rugosa Thunb. from subfamily Rosoideae served as the outgroup (Xiang et al., 2017). In total, our dataset comprised 114 samples, including 90 DGS and 24 RNA-Seq datasets (Table S3). The breadth and depth of this dataset facilitate a detailed exploration of polyploid evolution and its associated complexities within Prunus s.l.
Genomic DNA was extracted from silica-gel dried leaves and herbarium specimens using a modified CTAB protocol (Doyle and Doyle, 1987; Li et al., 2013) at the Institute of Botany, Chinese Academy of Sciences (IBCAS). After quality control, DNA libraries were prepared with the NEBNext® Ultra™ II DNA Library Prep Kit and sequenced on the Illumina MiSeq (Kunming Institute of Botany, Kunming) or BGISEQ-500 (Frasergen, Wuhan) platforms. The raw data are archived in the NCBI Sequence Read Archive (SRA) under BioProject PRJNA1031385 (Table S3), with vouchers deposited in the PE and WUK (herbaria codes follow Thiers, 2017).
Plastome assembly, annotation, and alignment
The raw data were initially processed to trim low-quality reads and adapters using Trimmomatic v.0.39 (Bolger et al., 2014). The resulting clean reads were then quality-checked with FastQC v.0.12.1 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Following quality assessment, the plastid genomes were assembled using GetOrganelle v.1.7.7.0 (Jin et al., 2020) and annotated using the Plastid Genome Annotator (PGA, Qu et al., 2019). All plastomes assembled during this study were submitted to GenBank (Table S3).
In this study, we employed two alternative strategies for extracting plastid CDSs across different data types to generate the “plastid CDS” dataset. Geneious Prime v.2023.2 (Kearse et al., 2012) was utilized to extract the plastid CDSs from the assembled complete plastomes. Concurrently, we employed HybPiper v.2.1.6 (Johnson et al., 2016) to assemble the plastid CDSs from RNA-Seq data, with the closely related genomes downloaded from GenBank used as references. The plastid CDSs were aligned using MAFFT v.7.520 (Nakamura et al., 2018) and trimmed poorly aligned regions using trimAl v.1.4.1 (Capella-Gutiérrez et al., 2009). Then those sequences were concatenated with AMAS v.1.0 (Borowiec, 2016). Given the reduced site variation within the Maddenia group, we generated a “Maddenia-plastome” dataset specifically to infer the maternal phylogeny.
Single-copy nuclear genes development and assembly
Considering the absence of a published polyploid Prunus genome, we selected three well-assembled genomes within Prunus s.l. as reference genomes to balance phylogenetic representativeness and data quality. After evaluating assembly quality and annotation completeness, we selected Prunus dulcis (Mill.) D.A.Webb (GCA902201215.1), P. armeniaca (GCA903112645.1), and P. persica (GCA000346465.2) as representative reference genomes. Coding sequences were then extracted from each of these genomes for downstream analyses. We used MarkerMiner v.1.0 (Chamala et al., 2015) to identify SCN genes as phylogenetic markers, leveraging its built-in “Fvesca” database (derived from the strawberry genome) as the reference. Strawberry (Fragaria vesca L.) is a close relative of Prunus, so using the “Fvesca” reference increased the likelihood of detecting homologous genes in our taxa (Morales-Briones et al., 2018, 2022; Liston et al., 2021). We set the minimum transcript length parameter (minTranscriptLen) to 600 bp to ensure that retrieved sequences were sufficiently long, leaving all other parameters at their default values.
We implemented the HybPiper v.2.1.6 pipeline (Johnson et al., 2016) to assemble SCN genes from the clean reads. This was done with a coverage cutoff value set to 5 to ensure that only reads with sufficient depth were assembled, thus enhancing the reliability of the assembled sequences. We used specific commands to calculate sequence length statistics (Table S4) and generated a heatmap to visualize the assembly results (Figure S1). Although SCN genes were initially identified from the three designated reference genomes as strictly single-copy using MarkerMiner, their status as single-copy may not hold across all taxa—particularly within polyploid Prunus, where complex polyploidization events have likely introduced additional gene copies. As a result, some SCN genes may actually represent paralogs in these lineages, potentially confounding downstream phylogenetic analyses. To address this, we implemented a rigorous orthology inference following the assembly step, allowing us to identify and retain true orthologs.
Orthology inference and dataset generation
Identifying potential paralogs is essential for accurate phylogenetic inferences to avoid the confusion caused by paralogous genes. Building on the methods from Yang and Smith (2014), Morales-Briones et al. (2022), and Liu et al. (2022), we employed two orthology inference strategies for nuclear genes assembled from DGS reads. These strategies generated two SCN datasets—the “Monophyletic Outgroup” (MO) dataset and “RooTed ingroup” (RT) dataset—with Rosa rugosa as the outgroup. Under the MO approach, homologs were detected by identifying a monophyletic, non-repeating outgroup, rooting the gene trees, and removing redundant paralog branches, retaining only the largest non-repeating ingroup branch that met the minimum size requirement. In contrast, the RT method split branches containing non-repeating ingroups and removed outgroups to maximize orthologs, before reintroducing MO outgroups for downstream analysis. Detailed procedures for this step are provided in Method S1.
Integrative phylogenetic strategies: concatenation- and coalescent-based analyses
We used concatenation- and coalescent-based phylogenetic methods on the plastid CDS dataset and two distinct SCN datasets to achieve accurate phylogenetic inference. The selection of optimal partitioning schemes and evolutionary models was conducted using PartitionFinder2 (Lanfear et al., 2017), guided by the corrected Akaike information criterion model (AICc), and incorporated linked branch lengths for model testing. To accommodate the varying complexities and sizes of the datasets, we implemented two distinct search strategies: the ‘greedy’ algorithm for its efficiency in larger datasets and ‘rcluster’ for its capability to manage computational demands effectively (Lanfear et al., 2012, 2014). The determined partitioning schemes and evolutionary models were applied for the subsequent phylogenetic inference.
We generated a concatenated supermatrix by combining all aligned gene sequences, and then the ML trees based on this supermatrix were inferred utilizing IQ-TREE2 v.2.2.2.7 (Minh et al., 2020), with 1,000 replicates each for UFBoot assessment (Minh et al., 2013) and the SH-aLRT test (Anisimova and Gascuel, 2006) to evaluate tree topology reliability. Furthermore, RAxML v.8.2.13 (Stamatakis, 2014) was utilized as a supplementary tool for ML tree inference, employing the best-fit substitution model. This was accompanied by the generation of 200 bootstrap replicates for the SCN datasets and 100 replicates for both the “plastid CDS” and “Maddenia-plastome” datasets.
To account for gene tree discordance arising from ILS, we inferred species trees using the coalescent-based method ASTRAL-III (Zhang et al., 2018). This approach integrates individual gene trees derived from plastid CDS and SCN datasets using the MultiSpecies Coalescent (MSC) model, thus reducing bias introduced by conflicting signals. This integrative analytical framework enabled robust reconstruction of evolutionary relationships, reflecting both nuclear (biparentally inherited) and plastid (maternally inherited) evolutionary histories within Prunus s.l..
To account for methodological biases, we combined concatenation and coalescent-based approaches, each with different assumptions. Concatenation may overestimate support under ILS or hybridization, while coalescent methods, like ASTRAL-III, better accommodate gene tree discordance caused by ILS but rely on accurate gene trees. By integrating methods with different assumptions, we aimed to reduce bias and enhance the reliability of our evolutionary inference.
Inference of global split networks
Increasing evidence suggests that web-like patterns of evolution instead of simple tree-like ones can explain how different groups are related, especially when complex processes like hybridization and polyploidy play a role. Given the frequent observation of reticulate signals within Prunus s.l., particularly among polyploid Prunus, we utilized SplitsTree v.4.19.2 (Huson and Bryant, 2006) to estimate a split network from the MO dataset (parameters detailed in Method S2).
Detecting and visualizing nuclear gene tree discordance
Evolutionary processes such as ILS, gene flow, or gene duplication, along with variable evolutionary rates, can cause discrepancies among gene trees. To elucidate the discrepancies between gene trees and species trees, we utilized phyparts v.0.0.1 (Smith et al., 2015) to perform an in-depth analysis of conflicting and concordant bipartitions. Phyparts directly compares nodes and branching patterns of individual gene trees with those of the species tree, identifying conflicting and concordant bipartitions. It takes gene trees and concatenation/coalescent-based tree based on the dataset as input files. However, incomplete gene trees (due to missing data) or branches with insufficient phylogenetic signal may reduce the number of usable gene trees for specific nodes. Consequently, fewer genes than the original input set contributed to the analyses of some nodes. Using these data, we calculated Internode Certainty All (ICA) scores (Salichos et al., 2014), quantifying the extent of phylogenetic discordance at each node. Furthermore, Quartet Sampling (QS) analysis (Pease et al., 2018) was employed to assess branch support within the species tree, which can easily distinguish between true phylogenetic conflict and low-information scenarios. We performed the QS analysis with 100 replicates and a log-likelihood cutoff of 2, which yielded four pivotal metrics for each node. The results of the QS analysis were visualized utilizing the plot_QC_ggtree.R script (https://github.com/ShuiyinLIU/QS_visualization). Interpretation of consensus patterns followed guidelines established by Pease et al. (2018).
Incomplete Lineage Sorting analyses
We employed two methods to investigate the role of ILS in the phylogenetic discordance between gene trees and the species tree. First, the Coalescent Simulation (CoSi) analysis provided a global overview of ILS impact across the phylogeny. Second, the Mutation Calculation based on Coalescent Model (MuCCo) analysis assessed the influence of ILS at specific nodes, offering insights into its contribution to phylogenetic incongruences. We adapted and refined the scripts for both methods to accommodate our DGS data. The updated scripts, along with detailed procedural documentation, are available on GitHub (https://github.com/PhyloAI/Coalescent-Simulation-Analysis; https://github.com/PhyloAI/Mutation-Calculation-based-on-Coalescent-Model).
In the CoSi analysis, subtrees were extracted from gene trees, followed by the inference of a species tree employing ASTRAL-III with these subtrees. The resulting species tree was converted into an ultrametric tree, which was then used as the basis for simulating 10,000 gene trees under the MSC model using the “sim.coaltree.sp” function in the R package Phybase v. 2.0 (Liu and Yu, 2010). The topological variance between empirical gene trees and their simulated counterparts was estimated using DendroPy v.4.5.2 (Sukumaran and Holder, 2010), while the overlap between their distributions, visualized in a column chart, was employed to evaluate the explanatory power of the coalescent model for gene tree discordance and to assess the plausibility of ILS as a contributing factor.
In the MuCCo analysis, we further explored the contribution of ILS by calculating the population mutation parameter “theta” (Cai et al., 2021). Initially, branch lengths of the species tree were converted from coalescent units—reflecting effective population size (Ne) and divergence time—to mutation units, representing substitutions per site, using RAxML v.8.2.13 (Stamatakis, 2014), with a fixed topology derived from our ASTRAL species tree. The theta value for each internal branch was calculated as the ratio of its mutation-unit length (RAxML) to coalescent-unit length (ASTRAL), where theta > 0.1 indicates significant ILS (Cai et al., 2021). This threshold reflects scenarios where high ancestral Ne or rapid speciation events promote ILS, consistent with theoretical expectations (Maddison and Knowles, 2006).
SNP calling and gene flow analyses
Due to the absence of a complete reference genome for the polyploid Prunus, we selected the latest available genome assembly of P. mume Siebold & Zucc. (GCF000346735.1) as the reference. The clean reads were mapped to the reference genome using BWA v.0.7.17 (Li and Durbin, 2009). BAM files were then sorted, duplicates marked, and indexed using SAMtools v.1.19 (Danecek et al., 2021) and Picard v.3.1.1 (http://broadinstitute.github.io/picard/). For genetic variation detection and the combination of genomic variant call format (gVCF) files into a single VCF file, we used GATK4 v.4.5.0.0 (Van der Auwera and O’Connor, 2020). Variant filtering was performed using “VariantFiltration” and “SelectVariants”, as well as “VCFtools” with the parameter “--max-alleles 2” to retain only strictly biallelic SNPs and exclude multi-allelic sites (Danecek et al., 2011).
To investigate gene flow dynamics among closely related species within the polyploid Prunus, we performed analyses using the Dsuite package (Malinsky et al., 2021). Although D-statistics assume the absence of ancestral population structure, our analyses were conducted at a deep phylogenetic scale, minimizing the potential impact of this assumption. To enhance robustness, we also performed complementary analyses, which yielded consistent signals of gene flow.
To enhance the reliability and reduce potential biases, we selected appropriate outgroups for these analyses: P. serotina Ehrh. for the Maddenia group and Lyonothamnus floribundus A.Gray for the broader analyses within Prunus s.l. We calculated the f4-ratio statistics for all possible trios of species, indicative of gene flow, using the “Dtrios” command on a filtered VCF file and an ASTRAL-derived species tree. The Benjamini-Hochberg (BH) correction was applied using the “p.adjust” function in R to control the false discovery rate. The adjusted results were visualized using the Ruby script “plot_d.rb.” Additionally, we employed the heuristic “Fbranch” approach to examine gene flow in specific lineages on the species tree, with results visualized using the “dtools.py” script.
Flow cytometric analyses
To assess genome size and ploidy variation within polyploid Prunus, we performed DNA flow cytometry following the protocol developed by Zhang and Feng (2023). Fresh leaf tissues were processed using PVPK12-mGB2 buffer, stained with Propidium Iodide (PI), and subsequently analyzed on an LSRFortessa flow cytometer (BD, USA). We recorded fluorescence data from at least 5,000 nuclei per sample, repeating the measurement three times to ensure reliability and reproducibility. Flow cytometry data were analyzed using the FlowJo software (Loureiro et al., 2006).
Multiple methods for whole-genome duplication analyses
Given the increased number of paralogs in polyploid Prunus, we used the “wgd_map” pipeline to identify potential WGD events (Yang et al., 2018; Morales-Briones et al., 2022). The process began by extracting rooted orthogroups from the final homolog trees using the “extract_clades.py” script. We then applied the “map_dups.py” script to map these orthogroups onto the species tree, allowing us to document the proportion of gene duplications at the nodes corresponding to the MRCA of the involved clades.
We also adapted the scripts of Tree2GD to accommodate DGS data (https://github.com/PhyloAI/DGS-Tree2GD). All polyploid Prunus samples and an outgroup, Lyonothamnus floribundus, were used for this analysis. We extracted 29,705 genes from the P. mume genome (GCF000346735.1) and assembled these nuclear coding sequences (nuclear CDS) using HybPiper v.2.1.7 (Johnson et al., 2016), creating a new dataset for WGD analysis. The nuclear CDS were translated using TransDecoder v.5.7.1 (https://github.com/TransDecoder/TransDecoder) with default settings. Diamond v.2.1.9 (Buchfink et al., 2021) was then used to build an index and perform all-against-all sequence alignment for each pair of species, using the “more-sensitive” model and filtering results with an e-value greater than 1e-10. Homologous sequences were predicted using phyloMCL v.2.0 (Zhou et al., 2020), with species less than four discarded, resulting in the retention of 59,212 gene clusters. We used the dolloparsimony script (Chen et al., 2022) to map the genes in each gene family to the species tree, calculating the total number of gene families at each node and counting newly obtained and lost gene families. Multiple CDS alignments for each gene cluster were performed using MUSCLE v.5.1 (Edgar, 2004) with “--cds2tree”. Each phylogenetic gene tree was inferred using IQ-TREE2 v.2.3.4 (Minh et al., 2020) with the “GTR” model and 1,000 bootstraps. To detect retained GD on lineages, gene trees were reconciled with a reference species tree by Tree2GD v.1.0.40 (Chen et al., 2022). Nodes with GD values > 300, and a percentage of GD with both lineages A and B (both taxa A and B having two copies, ABAB type of GD) > 50% were proposed as WGD events (Zhang et al. 2023).
Allopolyploidy analyses
In our analyses, we observed that the node corresponding to the MRCA of the polyploid Prunus displayed a significant frequency of gene duplication, which strongly suggests potential gene duplication event at this node. We performed GRAMPA v.1.4.0 (Gene tree Reconciliation Algorithm for MUL-trees and Polyploid Analysis, Thomas et al., 2017) to investigate the polyploidy modes associated with inferred gene duplication events. GRAMPA evaluates whether a gene duplication event is autopolyploid (within-species duplication) or allopolyploid (hybridization between distinct lineages) by reconciling gene trees with species trees under different polyploidization hypotheses. The software simulates scenarios and identifies the optimal multi-labeled trees with the lowest reconciliation score, displaying the offspring and its parental lineages as sister groups. The parameter “h1” was used to explore the mode of polyploidization (auto-/allo-) and to identify the potential parental lineages involved. Additionally, given the uncertainty about the presence of allo-, auto-, or non-polyploid events in the Maddenia group, we tested all possible nodes comprehensively following the strategies of Morales-Briones et al. (2022).
Historical biogeographic analyses
To clarify the spatiotemporal dynamics within polyploid Prunus, we estimated divergence times using ML trees derived from nuclear and plastid CDS datasets. Uneven taxon sampling, varying evolutionary rates, and unequal distribution of fossil calibrations can introduce errors in dating analysis. To improve accuracy, we employed a two-step strategy: First, we selected eight fossil records (F1–F8) and one secondary calibration node (C1) to date the subfamily Amygdaloideae (details in Method S3 and Table S5). Then, focusing on the Maddenia group, we used the estimated dates from the stem node of polyploid Prunus as a secondary calibration point, along with the fossil record of P. wutuensis (F2).
Considering the computational burden in dating analysis, we used the approximate likelihood calculation applied by the program MCMCTREE in PAML v.4.10.7 (Yang, 2007). We opted for the “independent rates” clock model and the “HKY85” model for nucleotide substitution and configured the parameters as follows: initiating with a burn-in period of 1,000,000 iterations, sampling frequency was set to every 10 iterations and the total number of samples collected was specified as 500,000. The assessment of stationarity and convergence across these runs was conducted, ensuring that the effective sample size (ESS) for all parameters is larger than a minimum threshold of 200.
In the second step, we utilized the tribe Lyonothamneae as the outgroup in the SCN genes dataset and the tribe Sorbarieae as the outgroup in the plastid CDS dataset. The estimated stem ages of polyploid Prunus—83.98–64.47 Ma in the SCN dataset and 86.84–52.36 Ma in the plastid CDS dataset—served as the secondary calibration points. To ensure methodological consistency, we applied the same parameters as those used in the first step.To estimate the ancestral geographic origin of the polyploid Prunus, we employed the BioGeoBEARS v.1.1.1 (Matzke, 2018) implemented in the Reconstruct Ancestral State in Phylogenies (RASP) v.4.1 (Yu et al., 2020). We defined six distinct biogeographic regions based on the distribution patterns in the polyploid Prunus as follows: (A) East Asia, (B) North & South America, (C) West Asia, (D) Europe, (E) Africa, and (F) Australasia. Furthermore, to investigate the origin and dispersal of the Maddenia group, we integrated species distribution ranges with the geomorphological regionalization scheme of China proposed by Wang et al. (2020) and defined three geographic areas, including (A) Southeast China Region (Southeastern China Low Mountains-Hills-Plains); (B) Southwest China Region (Tibetan Plateau High and Extremely High Mountains-Basins-Valleys); (C) Central China Region (Southwestern China Middle and Low Mountains-Plateaus-Basins). In these analyses, the “DIVALIKE” model was selected due to its ability to accommodate vicariance, dispersal, and extinction events.
Diversification rates analyses
To estimate the speciation and extinction rate dynamics over time within the Maddenia group, we employed RevBayes v.1.2.1 (Höhna et al., 2016) under the EBD model. The EBD model is a robust framework for analyzing diversification processes, where speciation and extinction rates are allowed to vary episodically through time and are modeled as piecewise-constant rates (Stadler, 2011; Höhna, 2015), making it particularly suitable for capturing complex evolutionary dynamics such as rapid radiations or declines triggered by historical events. The magnitude and timing of rate changes were then visualized using the RevGadgets v.1.0.0 (Tribble et al., 2022). These analyses utilized both the nuclear MO and plastid CDS datasets to explore how divergent phylogenies might affect the assessment of diversification rates.
