Phylogenomic conflict analyses in the apple genus Malus s.l. reveal widespread hybridization and allopolyploidy driving diversification, with insights into the complex biogeographic history in the Northern Hemisphere
Data files
May 24, 2023 version files 483.75 MB
-
153_genes_supporting_Malus_coronaria_bipartition_B_BS10_speciestree-astral2.tre
-
153_genes_supporting_Malus_coronaria_bipartition_B_IQ-TREE2.tre
-
153_genes_supporting_Malus_coronaria_bipartition_B_RAxML.tre
-
188_genes_supporting_Malus_sikkimensis_bipartition_A_BS10_speciestree-astral2.tre
-
188_genes_supporting_Malus_sikkimensis_bipartition_A_IQ-TREE2.tre
-
188_genes_supporting_Malus_sikkimensis_bipartition_A_RAxML.tre
-
393_genes_supporting_Malus_sikkimensis_bipartition_B_BS10_speciestree-astral2.tre
-
393_genes_supporting_Malus_sikkimensis_bipartition_B_IQ-TREE2.tre
-
393_genes_supporting_Malus_sikkimensis_bipartition_B_RAxML.tre
-
427_genes_supporting_Malus_coronaria_bipartition_A_BS10_speciestree-astral2.tre
-
427_genes_supporting_Malus_coronaria_bipartition_A_IQ-TREE2.tre
-
427_genes_supporting_Malus_coronaria_bipartition_A_RAxML.tre
-
Data_1._Maleae_nuclear.fasta
-
Data_10._Results_of_phyckle_analysis.153_genes_supporting_Malus_coronaria_bipartition_B.fasta
-
Data_10._Results_of_phyckle_analysis.153_genes_supporting_Malus_coronaria_bipartition_B.partitions.txt
-
Data_11._Results_of_phyckle_analysis.188_genes_supporting_Malus_sikkimensis_bipartition_A.fasta
-
Data_11._Results_of_phyckle_analysis.188_genes_supporting_Malus_sikkimensis_bipartition_A.partitions.txt
-
Data_12._Results_of_phyckle_analysis.393_genes_supporting_Malus_sikkimensis_bipartition_B.fasta
-
Data_12._Results_of_phyckle_analysis.393_genes_supporting_Malus_sikkimensis_bipartition_B.partitions.txt
-
Data_2._Malus_SCN_50%-sample_dataset.fasta
-
Data_2._Malus_SCN_50%-sample_dataset.partitions.txt
-
Data_3._Malus_SCN_80%-sample_dataset.fasta
-
Data_3._Malus_SCN_80%-sample_dataset.partitions.txt
-
Data_4._Malus_SCN_all-sample_dataset.fasta
-
Data_4._Malus_SCN_all-sample_dataset.partitions.txt
-
Data_5._Malus_80_plastid_coding_genes.fasta
-
Data_5._Malus_80_plastid_coding_genes.partitions.txt
-
Data_6._40-taxa_sampling_for_HyDe_analysis.fasta
-
Data_6._40-taxa_sampling_for_HyDe_analysis.partitions.txt
-
Data_7._27-taxa_sampling(Maleae)_for_PhyloNetworks_analysis.partitions.fasta
-
Data_7._27-taxa_sampling(Maleae)_for_PhyloNetworks_analysis.partitions.txt
-
Data_8._14-taxa_sampling(Malus)_for_PhyloNetworks_analysis.fasta
-
Data_8._14-taxa_sampling(Malus)_for_PhyloNetworks_analysis.partitions.txt
-
Data_9._Results_of_phyckle_analysis.427_genes_supporting_Malus_coronaria_bipartition_A.fasta
-
Data_9._Results_of_phyckle_analysis.427_genes_supporting_Malus_coronaria_bipartition_A.partitions.txt
-
exclude_short_sequences.py
-
Malus_80_plastid_coding_genes_RAxML.tre
-
Malus_SCN_50%-sample_dataset_BS10_speciestree-astral2.tre
-
Malus_SCN_50%-sample_dataset_IQ-TREE2.tre
-
Malus_SCN_50%-sample_dataset_RAxML.tre
-
Malus_SCN_80%-sample_dataset_BS10_speciestree-astral2.tre
-
Malus_SCN_80%-sample_dataset_IQ-TREE2.tre
-
Malus_SCN_80%-sample_dataset_RAxML.tre
-
Malus_SCN_all-sample_dataset_BS10_speciestree-astral2.tre
-
Malus_SCN_all-sample_dataset_IQ-TREE2.tre
-
Malus_SCN_all-sample_dataset_RAxML.tre
-
README.txt
Abstract
Phylogenomic evidence from an increasing number of studies has demonstrated that different data sets and analytical approaches often reconstruct strongly supported but conflicting relationships. In this study, 785 single-copy nuclear genes and 75 complete plastomes were used to infer the phylogenetic relationships and estimate the historical biogeography of the apple genus Malus sensu lato, an economically important lineage disjunctly distributed in the Northern Hemisphere and involved in known and suspected hybridization and allopolyploidy events. The nuclear phylogeny recovered the monophyly of Malus s.l. (including Docynia); however, the genus was supported to be biphyletic in the plastid phylogeny. An ancient chloroplast capture event in the Eocene in western North America best explains the cytonuclear discordance. Our conflict analysis demonstrated that ILS, hybridization, and allopolyploidy could explain the widespread nuclear gene tree discordance. One deep hybridization event (Malus doumeri) and one recent event (Malus coronaria) were detected in Malus s.l. Furthermore, our historical biogeographic analysis integrating living and fossil data supported a widespread East Asian-western North American origin of Malus s.l. in the Eocene, followed by several extinction and dispersal events in the Northern Hemisphere. We also propose a general workflow for assessing phylogenomic discordance and biogeographic analysis using deep genome skimming datasets.
Methods
Taxon sampling, library preparation, and deep genome skimming sequencing
Taxon sampling is designed to resolve the phylogenetic placements and relationships of major clades within Malus s.l. For the convenience of discussion, we followed the widely accepted taxonomic system proposed by Phipps et al. (1990), in which they recognized five sections. Due to the great economic importance of Malus, numerous hybrid cultivars have been used as crops and ornamentals, which have hybridized either within one section or between sections, and this made it difficult to infer the phylogenetic relationships between cultivated and wild species. Hence, we excluded the widely recognized artificial hybrid species in Malus, e.g., M. × asiatica Nakai, M. × astracanica hort. ex Dum.Cours., M. × cerasifera Spach, M. × dawsoniana Rehder, M. × domestica (= M. pumila Mill.), M. × floribunda Siebold ex Van Houtte, M. × halliana, M. × micromalus, M. × heterophylla Spach, M. × magdeburgensis Schoch ex Rehder, M. × prunifolia (Willd.) Borkh., M. × sargentii Rehder, M. × scheideckeri Späth ex Zabel, M. × soulardii (L.H.Bailey) Britton, M. × spectabilis (Aiton) Borkh., and M. × sublobata Rehder. We hence sampled 39 ingroup individuals representing 18 wild species (out of ca. 2415), representing all five sections and Docynia within Malus s.l. In addition, due to the potential biphyly of Malus based on complete plastomes21,103, we sampled 38 outgroups across the tribes Maleae and Gillenieae to resolve the Malus phylogeny and identify possible events of cytonuclear discordance (Supplementary Table S1). We investigated 77 individuals in total, of which 27 species of deep genome skimming data were generated for this study (Supplementary Table S1).
Total genomic DNAs were extracted from silica-gel dried leaves or herbarium specimens using a modified CTAB (mCTAB) method104 in the lab of the Institute of Botany, Chinese Academy of Science (IBCAS) in China. The libraries were prepared in the lab of Novogene, Beijing, China using NEBNext® UltraTM II DNA Library Prep Kit, and then paired-end reads of 2 × 150 bp were generated on the NovoSeq 6000 Sequencing System (Novogene, Beijing; 5-10 G data for each sample: Supplementary Table S1).
Single-copy nuclear marker development
The SCN marker development followed the pipeline in Liu et al. (2021). Briefly, the coding regions of Malus domestica (GenBank assembly accession: GCA_000148765.2) were first input into MarkerMiner v.1.0105 to identify the putative single-copy genes. The resulting genes were then filtered by successively BLASTing106-108 against six available genomes [Malus baccata (accession: GCA_006547085.1), M. domestica, Pyrus betulifolia Bunge (accession: GCA_007844245.1), P. bretschneideri Rehder (accession: GCA_000315295.1), P. ussuriensis Maxim. × P. communis L. (accession: GCA_008932095.1), and P. pyrifolia (Burm.f.) Nakai (accession: GCA_016587475.1)] in Geneious Prime109, with the parameters settings in the Megablast program110 as a maximum of 60 hits, a maximum E-value of 1 × 10-10, a linear gap cost, a word size of 28, and scores of 1 for match and -2 for mismatch in alignments. We first excluded genes with mean coverage > 1.1 for alignments, which generally indicate potential paralogy of the genes and/or the presence of highly repeated elements in the sequences. The remaining alignments were further visually examined to exclude those genes receiving multiple hits with long overlapping but different sequences during the BLAST. It should be noted that the alignments with mean coverage between 1.0 and 1.1 were generally caused by the presence of tiny pieces of flanking intron sequences in the alignments. These fragments were still accepted as a SCN gene here. After filtering, the remaining genes were used as references in the following gene assembly. The baits in silico could be available from the Dryad Digital Repository (Data 1): https://doi.org/10.5061/dryad.2jm63xsq5.
Data processing and the assembly of single-copy nuclear genes
Read processing and assembly followed the pipeline in Liu et al. (2021). Generally, we used Trimmomatic v. 0.39111 for quality trimming and adapter clipping with the parameters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36). Then, the results were quality-checked with FastQC v. 0.11.9112. The number of clean reads after trimming was also calculated here for comparison (Supplementary Table S1). We used the HybPiper pipeline v. 1.3.1113 for targeting SCN genes with default settings; BWA v. 0.7.1114 to align and distribute reads to target genes; SPAdes v. 3.15.0115 with a coverage cutoff value of 8 to assemble reads to contigs; and Exonerate v. 2.2.0116 to align assembled contigs to target sequences and determine exon-intron boundaries. Python and R scripts included in the HybPiper pipeline113 were used to retrieve the recovered gene sequences, summarize and visualize the recovery efficiency.
Nuclear datasets construction and phylogenetic analysis
Sequences for each SCN were aligned in MAFFT v. 7.480117 with options “--localpair --maxiterate 1000”. Due to the variable sequencing coverage of each sample in this study, we employed three steps to remove the poorly aligned regions. We used trimAL v. 1.2118 to trim the alignment of each SCN, in which all columns with gaps in more than 20% of the sequences or with a similarity score lower than 0.001 were removed. Given the low-quality assembly in some sequences, Spruceup119 was used to discover, visualize, and remove outlier sequences in the concatenated multiple sequence alignments with the window size 50 and overlap 25. Because the Spruceup algorithm works better, the more data it has, we concatenated all the SCN gene alignments with AMAS v. 1.0120 before running Spruceup. We also used AMAS v. 1.0120 to split the processed/trimmed alignment back into single-locus alignments. The resulting alignments for each SCN were trimmed again using trimAL v. 1.2118 with the same parameters described above. Thirdly, we excluded the sequences less than 250 bp in each alignment with our customized python script (exclude_short_sequences.py, which can be available from Dryad Digital Repository https://doi.org/10.5061/dryad.2jm63xsq5) for decreasing the effect of missing data, because the short sequences in each alignment have few informative sites for the following coalescent-based species tree inference. The resulting SCN genes were used to infer individual ML gene trees using RAxML 8.2.12121 with a GTRGAMMA model and the option “-f a” and 200 BS replicates to assess clade support for each SCN. TreeShrink v. 1.3.9122 was used for detecting abnormally long branches in each tree with the default false positive error rate α = 0.05 and per-species mode. The shrunk trees and sequences have been used for the following phylogenetic inference, and hereafter these resulted sequences were referred to as “clean genes”.
We generated three different datasets to reconstruct the phylogeny to account for the effect of missing data in each SCN gene: (1) 50%-sample dataset: each SCN gene with at least 900 bp and more than 50% samples (≥ 39 individuals); (2) 80%-sample dataset: each SCN gene with at least 900 bp and more than 80% samples (≥ 62 individuals); (3) all-sample dataset: each SCN with at least 900 bp and more than 100% samples (77 individuals). These three datasets can be available from the Dryad Digital Repository (Data 2, 3, 4): https://doi.org/10.5061/dryad.2jm63xsq5. We used both concatenated and coalescent-based methods for phylogenetic inference of each dataset. We used PartitionFinder2123,124 to estimate the best-fit partitioning schemes and/or nucleotide substitution models under the corrected Akaike information criterion (AICc) and linked branch lengths, as well as with rcluster125 algorithm options for the nuclear dataset. The resulting partitioning schemes and evolutionary models were used for the following Maximum Likelihood (ML) tree using IQ-TREE2 v. 2.1.3126 with 1000 SH-aLRT and the ultrafast bootstrap replicates and RAxML 8.2.12121 with GTRGAMMA model for each partition and clade support assessed with 200 rapid bootstrap (BS) replicates. The shrunk trees from TreeShrink122 were used to estimate a coalescent-based species tree with ASTRAL-III (Zhang et al., 2018) using local posterior probabilities (LPP)127 to assess clade support. Each of the gene trees was rooted, and low support branches (≤ 10) were collapsed using Newick Utilities128 or phyx129 since collapsing gene tree nodes with BS support below a threshold value will help to improve accuracy130. In total, nine phylogenies were generated for topological comparison, and these nine trees are available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.2jm63xsq5.
Detecting and visualizing nuclear gene tree discordance
To explore the discordance among gene trees, we employed phyparts v. 0.0.129 to calculate the conflicting/concordant bipartitions by comparing the nuclear gene trees against the ML tree inferred from RAxML with a BS threshold of 50 (i.e., gene-tree branches/nodes with less than 50% BS were considered uninformative) for filtering out poorly supported branches, thus alleviating noise in the results of the conflict analysis29. We also used the internode certainty all (ICA) value that resulted from phyparts to quantify the degree of conflict on each node of a species tree given individual gene trees131. Phyparts results were visualized with phypartspiecharts.py (by Matt Johnson, available from https://github.com/mossmatters/MJPythonNotebooks/blob/master/phypartspiecharts.py). Furthermore, in order to distinguish lack of support from conflicting support in the species tree, we conducted Quartet Sampling (QS)41 analysis with 100 replicates and the log‐likelihood cutoff 2. The QS method subsamples quartets from the input tree and alignment to assess the confidence, consistency, and informativeness of internal tree relationships, and the reliability of each terminal branch, and then four values are given in this analysis: QC = Quartet Concordance, QD = Quartet Differential, QI = Quartet Informativeness, and QF = Quartet Fidelity. The QS result was visualized with plot_QC_ggtree.R (by ShuiyinLIU, available from https://github.com/ShuiyinLIU/QS_visualization). Both the phyparts and QS results can provide alternative evidence for evaluating the discordance between gene trees.
Comparing the nine topologies inferred above, we found two species (Malus coronaria and M. kansuensis) with conflicting phylogenetic positions among trees (referring to the result below). We used the “alternative relationship test” in phyckle28 to investigate conflicting bipartitions and discover the gene trees supporting each bipartition. Considering the complex origin of Malus coronaria, we employed the two bipartitions supported by Phylonetworks analysis (see Fig. 4 & Table 1 in Results). Two or more user-specified alternative bipartitions could be used as a constraint to infer gene trees. Arbitrarily, we set the cutoff of ∆lnL > 100 as the outlier genes132. The resulting gene dataset supporting each bipartition was then used to estimate phylogenetic inference based on the concatenated (IQ-TREE2 and RAxML) and coalescent-based methods (ASTRAL-III) mentioned above. The resulting 12 trees could be available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.2jm63xsq5.
Coalescence simulation and phylogenetic network estimation
To measure the goodness-of-fit of the coalescent model with ILS explaining the gene tree discordance adequately, we performed a coalescent simulation analysis following the methods in previous studies34,133-135. Briefly, if the simulated gene trees based on the coalescent model correspond well to the empirical gene trees, the gene tree discordance may be explained by ILS. Given the calculation of gene tree distances, we subsampled 27 species representing each major clade in the phylogenetic framework of the tribe Maleae. The function “sim.coaltree.sp” implemented in the R package Phybase v. 1.5136 has been used to simulate 10,000 gene trees with the multispecies coalescent (MSC) model.
Hybrid Detector (HyDe) can be used to detect hybridization using phylogenetic invariants arising under the coalescent model with hybridization137. We sampled 40 taxa, including all 39 Malus s.l. individuals and one outgroup (Pyrus communis) to detect the possible hybridization events within Malus s.l. This dataset can be available from the Dryad Digital Repository (Data 6): https://doi.org/10.5061/dryad.2jm63xsq5. The γ denotes the inheritance probability of parent 1 (P1), while 1 - γ would be the probability of the hybrid population being sister to parent 2 (P2). Generally, significant γ values close to 0.5 indicate a recent hybridization event; significant γ values closer to 0 or 1 indicate an ancient hybridization event remained in the extant species. We herein set the γ threshold at 0.3 and 0.7, which followed convention132.
To explore the possibility of reticulation as a cause of discordance in the apples and their allies, we employed the Species Networks applying Quartets (SNaQ) method43 implemented in the software PhyloNetworks 0.14.044, which explicitly accommodates introgression/gene flow and ILS. Given the computational limitation of PhyloNetworks, we used two datasets to test for hybridization events, i.e., 27-taxa sampling at the tribe level of Maleae and 14-taxa sampling at the genus level of Malus. These two datasets are available from the Dryad Digital Repository (Data 7 & 8): https://doi.org/10.5061/dryad.2jm63xsq5. Considering that Malus members have been reported to have hybridized with many genera in Maleae, e.g., Aria (Pers.) Host and Torminaria M.Roem.5, we sampled 27 individuals, including five species representing each major clade in Malus s.l. and 22 outgroup species in Maleae. This taxon sampling scheme represents a reasonable compromise between taxonomic coverage and computational cost. The 27-taxa dataset construction followed the method described above (i.e., Nuclear dataset construction and phylogenetic analysis). The best trees generated from RAxML were used to estimate the quartet concordance factors (CFs), representing the proportion of genes supporting each possible relationship between each set of four species. The resulting CFs and the ASTRAL species tree were used as initial input to run SNaQ analysis (h = 0), and the resulting best network was used as starting topology to run the next h value (h + 1), and so on. We investigated h values ranging from 0 to 5 with 50 runs in each h for estimating the best phylogenetic network. Each run generated a pseudo-deviance score: a value for fitting the network to the data, and estimated the inheritance probabilities (i.e. the proportion of genes contributed by each parental population to a hybrid taxon) for each network. Similarly, we also sampled 14 species, including 13 Malus species and one outgroup (Pyrus communis), to test the hybridization events among Malus members. The method followed that of the 27-taxa sampling dataset mentioned above. The best network was visualized using Dendroscope v 3.7.4138.
Plastome assembly, annotation, phylogenetic analysis, and cytonuclear discordance
A two-step strategy was used for obtaining high-quality chloroplast genomes. NOVOPlasty v. 4.3.1139 was applied first to assemble the plastomes with high-quality raw data. Then we used the successive assembly approach140, combining the reference-based and the de novo assembly methods to assemble the remaining low-quality samples. With the de novo assembly and a seed-and-extend algorithm, NOVOPlasty was the least laborious approach and resulted in accurate plastomes; however, this program needs sufficient high-quality raw reads without gaps to cover the whole plastome. The whole plastomes assembled from NOVOPlasty then could be used as references for assembling the remaining samples. The successive method provided an excellent approach to obtaining relatively accurate and nearly complete plastomes with or without gaps from lower-coverage raw data. Due to the sensitivity of Bowtie2 v. 2.4.2141 to the reference, this successive method needs a closely related reference sequence with increased time and RAM requirements. Several recent studies have described the procedure in detail21,60,103,140,142,143. All assembled plastomes have been submitted to GenBank with the accession numbers listed in Supplementary Table S1.
The assembled plastid genomes from the low-coverage and high-coverage datasets were annotated using PGA144 with a closely related plastome (MN062004: Malus ioensis) downloaded from GenBank as the reference, and the results of automated annotation were checked manually. The coding sequences of plastomes were translated into proteins to manually check the start and stop codons in Geneious Prime109. The custom annotations in the GenBank format were converted into the FASTA format, and five-column feature tables file required by NCBI submission using GB2sequin145.
Given the considerable variation among plastid introns at the level of Maleae, we extracted 80 coding genes (CDSs) using Geneious Prime109, and these CDSs were aligned by MAFFT v. 7.475117 with default parameters, respectively. This dataset with 77 samples and 80 plastid coding genes is available from the Dryad Digital Repository (Data 5): https://doi.org/10.5061/dryad.2jm63xsq5. The best-fit partitioning schemes and/or nucleotide substitution models for each dataset were estimated using PartitionFinder2123,124, under the corrected Akaike information criterion (AICc) and linked branch lengths, as well as with rcluster125 algorithm options. The partitioning schemes and evolutionary model for each subset were used for the downstream phylogenetic analysis. Like the nuclear analysis, we estimated the ML tree by IQ-TREE2 v. 2.1.3126 with 1000 SH-aLRT and the ultrafast bootstrap replicates and RAxML 8.2.12121 with GTRGAMMA model for each partition and clade support assessed with 200 rapid BS replicates. We also used ASTRAL-III130 for estimating a coalescent-based species tree. These three trees are available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.2jm63xsq5. Cytonuclear discordance at various levels was detected in Malus s.l. (Fig. 7); coalescent simulation was also employed for evaluating the importance of ILS in explaining cytonuclear discordance35,135. We used the R package Phybase v. 1.5 to simulate 10,000 gene trees based on the ASTRAL species tree from SCN genes under MSC model. We used Phyparts v. 0.0129 to explore conflicts among the simulated gene trees and plastid tree, and the proportion of gene tree concordance compared to the plastid genome tree was used to qualify the discordance.
Dating and ancestral area reconstruction
We aim to estimate the age of divergence of the three major clades identified from hundreds of SCN genes and 80 plastid coding genes in Malus s.l. The fossils used in this study are listed in Table 3. Malus obensis and the clade Malus doumeri - Docynia delavayi were grouped together because of the fruit similarity. Malus parahupehensis was thought to be similar to the living species M. hupehensis; however, this fossil species showed more similarities to M. doumeri because of the dentate margin and parallel, craspedodromous venation15. Thus, we grouped M. parahupehensis, M. doumeri, M. obensis, and Docynia delavayi as monophyletic. The earliest fossil, Malus kingiensis, was found in the middle Eocene from the western Kamchatka peninsula and can not be assigned to any living lineages of Malus s.l. based on the morphology; thus, this fossil species likely represented the stem clade of Malus doumeri and Docynia delavayi. However, only one fossil, Malus antiqua, was found in Europe, and this fossil with deeply lobed leaves was similar to the Mediterranean species. Therefore, we grouped this fossil with the two Mediterranean living species (M. florentina and Eriolobus trilobatus). Five fossil species were described from North America, especially the abundant fossil records in western North America. More or less deeply lobed leaves characterized all these leaf fossils, significantly distinct from the living western North American species (Malus fusca). They showed more similarities to the eastern North American and Mediterranean species, and these species were thus grouped together. Additionally, leaf fossil of Malus or Pyrus from the Republic site, Washington146 was used to constrain the divergence between Malus and Pyrus at 46-44 Mya.
We ran the dating analyses based on the SCN and plastid datasets to test the divergence time differences with significant cytonuclear discordance. Given the intensive computational burden of dating analysis using BEAST2, we employed a 19-taxa dataset with only one individual for each species in Malus s.l. and Pyrus communis as the outgroup, and this dataset is available from the Dryad Digital Repository (Data 13): https://doi.org/10.5061/dryad.2jm63xsq5. The divergence time estimation was run under a GTR model with a gamma rate inferred from PartitionFinder2123,124, an uncorrelated lognormal relaxed clock147, and the fossilized birth-death model148,149. Markow Chain Monte Carlo (MCMC) chains were run for 100,000,000, sampling every 20,000 generations in five parallel jobs. We used the LogCombiner v1.10 to combine log and tree files from the five independent runs of BEAST. The MCMC trace file was analyzed in Tracer v1.7.1150. Maximum credibility trees were generated in TreeAnnotator v1.10, and FigTree v1.4.4 visualized the MCC tree.
To test the ancestral areas of three major clades of Malus, we conducted the ancestral area construction using BioGeoBEARS v. 1.1.1151 implemented in RASP v. 4.2152. Geological evidence suggests that an aridity barrier existed from the western-most part of China to the eastern Asian coast from the Paleogene to the Miocene. It has been hypothesized to have acted as a climate barrier between these two regions153; thus, we subdivided East Asia into the northern and the southern areas154-159. Six biogeographic areas were defined across the distribution of Malus s.l.: (A), Southern East Asia; (B), Northern East Asia; (C), Europe and Central Asia; (D), Mediterranean; (E), eastern North America; (F), western North America. The MCC tree summarized by TreeAnnotator was used as input of RASP. The maxarea was set to six, i.e., the number of potential areas of a hypothetical ancestor was restricted to a maximum of six regions. The model with the highest AICc_wt value has been chosen as the best model.