Data from: Unraveling the web of life: Incomplete lineage sorting and hybridization as primary mechanisms over polyploidization in the evolutionary dynamics of pear species

Data files

Aug 29, 2025 version files 59.08 MB

Analyses_data.zip

59.08 MB
README.md

3 KB

Abstract

The traditional Tree of Life (ToL) model is increasingly challenged by the Web of Life (WoL) paradigm, which offers a more accurate depiction of organismal phylogeny, particularly in light of the incongruences often observed between gene and species trees. However, the absence of a standardized method for resolving evolutionary mechanisms—such as Incomplete Lineage Sorting (ILS), hybridization, introgression, polyploidization, and whole-genome duplication—remains a significant obstacle in defining the WoL. Characterized by extensive hybridization events, the pear genus Pyrus provides an ideal model for exploring these complexities. In this study, we present a Step-by-Step Exclusion approach for investigating the evolutionary pathways of Pyrus, and our results demonstrate that: 1) ILS, rather than polyploidization, plays a dominant role in the origination of Pyrus; 2) the two subgenera of Pyrus followed independent evolutionary paths, influenced by geographical barriers formed through the uplift of the Tibetan Plateau and increased aridity in Central Asia; 3) both ILS and hybridization have driven the diversification of subg. Pashia, while hybridization alone has shaped the reticulate evolution of subg. Pyrus; 4) the establishment of the Silk Road during the Han Dynasty facilitated genetic exchange between subg. Pyrus and subg. Pashia. The SSE approach offers a versatile framework for studying the evolutionary mechanisms underlying the WoL paradigm.

Unraveling the Web of Life: Incomplete lineage sorting and hybridization as primary mechanisms over polyploidization in the evolutionary dynamics of pear species

This package contains the data and software outputs - Analyses_data.zip

Description of the Data and file structure

Analyses_data.zip

ALIGNMENTS

Data 1. Pyrus_nuclear.fasta = 801 nuclear single-copy reference genes filtered from thirteen genomes

Data 2. Pyrus_77_plastid_coding_genes.fasta = concatenated alignments of 77 plastid coding genes of Pyrus
Pyrus_77_plastid_coding_genes.partitions.txt = partition file for concatenated alignments of 77 plastid coding genes of Pyrus

Data 3. Pyrus_771_MO_genes_dataset.fasta = concatenated alignments of 771 MO ortholog genes dataset
Pyrus_771_MO_genes_dataset.partitions.txt = partition file for concatenated alignments of 771 MO ortholog genes dataset

Data 4. Pyrus_905_RT_genes_dataset.fasta = concatenated alignments of 905 RT ortholog genes dataset
Pyrus_905_RT_genes_dataset.partitions.txt = partition file for concatenated alignments of 905 RT ortholog genes dataset

Data 5. Maleae 15-taxa sampling for PhyloNetworks analysis.fasta = concatenated alignments of Maleae 15-taxa sampling dataset
Maleae 15-taxa sampling for PhyloNetworks analysis.partitions.txt = partition file for concatenated alignments of Maleae 15-taxa sampling dataset

Data 6. Subgenus Pyrus 16-taxa sampling for PhyloNetworks analysis.fasta = concatenated alignments of Subgenus Pyrus 16-taxa sampling dataset
Subgenus Pyrus 16-taxa sampling for PhyloNetworks analysis.partitions.txt = partition file for concatenated alignments of Subgenus Pyrus 16-taxa sampling dataset

Data 7. Subgenus Pashia 15-taxa sampling for PhyloNetworks analysis.fasta = concatenated alignments of Subgenus Pashia 15-taxa sampling dataset
Subgenus Pashia 15-taxa sampling for PhyloNetworks analysis.partitions.txt = partition file for concatenated alignments of Subgenus Pashia 15-taxa sampling dataset

TREES

Pyrus_77_plastid_coding_genes_RAxML.tre = the plastid topology estimated by RAxML
Pyrus_77_plastid_coding_genes_IQ-TREE2.tre = the plastid topology estimated by IQ-TREE2
Pyrus_77_plastid_coding_genes_BS10_speciestree-astral2.tre = the species tree inferred from the plastid gene trees with collapsed low-supported (10%) branches

Pyrus_771_MO_genes_dataset_RAxML.tre = the MO ortholog topology estimated by RAxML
Pyrus_771_MO_genes_dataset_IQ-TREE2.tre = the MO ortholog topology estimated by IQ-TREE2
Pyrus_771_MO_genes_dataset_BS10_speciestree-astral2.tre = the species tree inferred from the MO ortholog trees with collapsed low-supported (10%) branches

Pyrus_905_RT_genes_dataset_RAxML.tre = the RT ortholog topology estimated by RAxML
Pyrus_905_RT_genes_dataset_IQ-TREE2.tre = the RT ortholog topology estimated by IQ-TREE2
Pyrus_905_RT_genes_dataset_BS10_speciestree-astral2.tre = the species tree inferred from the RT ortholog trees with collapsed low-supported (10%) branches

Materials and Methods

In the following section, we provide a brief overview of the materials and methods used in this project; additional details can be found in the supplementary methods.

1 | Taxon Sampling, DNA Extraction, and Sequencing

A comprehensive phylogenetic analysis of Pyrus was conducted, including all seven subsections defined by Phipps et al. (1990), the most detailed global taxonomic framework for pear species. Due to the self-incompatibility and hybrid formation of Pyrus, species like P. sinkiangensis were excluded. To accurately estimate divergence times, 41 outgroups from 26 genera in the apple tribe Maleae were selected, with a focus on close relatives, such as Malus and Sorbus sensu lato. Gillenia trifoliata from the tribe Gillenieae was used as an outgroup for rooting the Maleae phylogeny. Of the 92 samples analyzed, 58 came from Whole Genome Sequencing (WGS) and/or Deep Genome Skimming (DGS) data from the NCBI Sequence Read Archive (SRA), while 34 DGS (2 × 150bp) were generated in this study (see Table S1 for accession details).

DNA was extracted from silica-gel dried leaves and herbarium specimens using a modified CTAB method (mCTAB, Li et al., 2013). The quality of DNA was checked via agarose gel electrophoresis, and high-quality samples were sent to Novogene in Beijing, where libraries were prepared with the NEBNext^®^ Ultra^™^ II DNA Library Prep Kit and sequenced on the Illumina NovaSeq Platform (2 × 150bp).

2 | Reads Processing, Plastome Assembly, and Annotation

Raw sequencing reads were processed using Trimmomatic v. 0.39 (Bolger et al., 2014) to remove the adapter and low-quality sequences. Quality control of the trimmed reads was performed with FastQC v. 0.11.9 (Andrews, 2010) to ensure the reliability and accuracy of the data for subsequent analyses.

For plastome assembly, we employed the Successive Approach combining Reference-based and De novo assembly (SARD; Liu et al., 2023), this method has been successfully utilized in various lineages of angiosperms (e.g., Liu et al., 2021, 2022, 2023; Jin et al., 2023, 2024; Wang et al., 2024). The assembled plastomes were annotated using the Plastid Genome Annotator (PGA) (Qu et al., 2019). Visualization of the chloroplast genomes was performed in Geneious Prime (Kearse et al., 2012) to verify the start and stop codons of each coding gene, and manual corrections were made where necessary. All assembled plastomes were submitted to GenBank, and their respective accessions are listed in Table S1.

3 | Single-copy Nuclear Marker Development and Sequence Assembly

We used the well-designed 801 Single-Copy Nuclear (SCN) genes from previous Maleae studies (Jin et al., 2023, 2024). Nuclear loci were assembled using HybPiper v. 2.0.1 (Johnson et al., 2016), an integrated bioinformatics suite. Lineage-specific SCN sequences were extracted from NGS reads using default parameters. The ‘hybpiper assemble’ command mapped and sorted trimmed reads against SCN gene references with BWA v. 0.7.17 (Li and Durbin 2009) and SAMtools v. 1.17 (Li et al., 2009). Sorted reads were assembled into contigs using SPAdes v. 3.15.5 (Bankevich et al., 2012) with a coverage cutoff of 5. Gene recovery efficiency across species was summarized and visualized using ‘hybpiper stats’ and ‘hybpiper recovery_heatmap’ commands. The ‘hybpiper paralog_retriever’ command was used to detect and exclude paralogs or chimeric sequences that could affect orthology inference.

4 | Orthology Inference for the Nuclear Genes

To accurately estimate phylogeny through orthologs, we applied the tree-based orthology inference method for SCN genes (Yang and Smith, 2014), generating two ortholog datasets: Monophyletic Outgroup (MO) and RooTed Ingroup (RT). Procedures are outlined in Morales-Briones et al. (2022). Uneven sequencing coverage led to outlier loci and short sequences, potentially affecting phylogenetic inference. To address this, our PhyloAI team developed a pipeline to refine these sequences, as detailed in the Supplementary Methods.

5 | Multiple Inference Methods for Phylogenetic Analyses

This study used multiple phylogenetic methods to estimate the plastid/nuclear phylogeny of Pyrus, including concatenation- and coalescent-based approaches. For concatenated supermatrices derived from nuclear datasets, orthologs were combined using AMAS v. 1.0 (Borowiec, 2016). The optimal partitioning scheme and evolutionary models were determined with PartitionFinder2 (Stamatakis, 2006; Lanfear et al., 2017) using AICc and the rcluster algorithm. Maximum Likelihood (ML) inference was performed with IQ-TREE2 v. 2.1.3 (Minh et al., 2020), which included 1000 replicates for the SH approximate likelihood ratio test and ultrafast bootstrap, and RAxML v. 8.2.12 (Stamatakis, 2014), utilizing the GTRGAMMA model along with 200 bootstrap replicates.

The boundaries of two inverted repeats in each plastome were verified using Geneious Prime’s Repeat Finder plugin (Kearse et al., 2012), removing one duplicate. The 77 Plastid Coding Sequences (plastid CDSs) were extracted, aligned with MAFFT v. 7.475, and concatenated in AMAS v. 1.0. The resulting supermatrix underwent PartitionFinder2 for partitioning and nucleotide model selection, with the greedy algorithm applied. ML phylogenetic inference followed the nuclear analysis parameters.

A coalescent-based approach using ASTRAL-III (Zhang et al., 2018) was applied to the nuclear and plastid CDS datasets. Individual gene trees were estimated with RAxML v. 8.2.12 (Stamatakis, 2014) and refined by collapsing branches with bootstrap values below 10. The refined gene trees were then integrated into a species tree using ASTRAL-III. All nine resultant trees and related log files are available at the Dryad Digital Repository (https://doi.org/10.5061/dryad.3ffbg79r8) for further analysis.

6 | Detecting and Visualizing Nuclear Gene Tree Discordance

We used several methods to evaluate gene tree congruence with the inferred phylogeny. First, we applied phyparts (Smith et al., 2015) to examine phylogenetic conflict by mapping gene trees to the target tree and quantifying discordant and congruent bipartitions. Both gene trees and the target tree were rooted using the ‘pxrr’ command in phyx (Brown et al., 2017), followed by a full concordance analysis (-a 1) in phyparts. Nodes with bootstrap support below 50% were excluded. A rapid concordant analysis (-a 0) was also conducted to mitigate the impact of missing taxa due to uneven nuclear gene recovery. The results were combined and visualized as pie charts, displaying the proportion of discordant and congruent topologies for each node. Additionally, we calculated the internode certainty all (ICA) value to summarize node inconsistency.

We also used Quartet Sampling (QS, Pease et al., 2018) to assess conflicting support at weakly supported nodes. QS, performed with 100 replicates and a log-likelihood threshold of 2, evaluates the reliability of internal tree relationships and terminal branches.

7 | SNP Calling and Gene Flow Analyses

The latest high-quality genome assembly of Pyrus pyrifolia (GCA_016587475.1) was downloaded as the reference for single nucleotide polymorphism (SNP) calling. Clean reads from each sample were mapped to this reference using BWA v. 0.7.17 (Li and Durbin 2009), and the aligned results were processed into BAM files with SAMtools v. 1.6 (Li et al., 2009). Duplicate reads were marked, and variants were called with GATK (McKenna et al., 2010) using the ‘MarkDuplicates’ and ‘HaplotypeCaller’ functions. Haplotypes were combined with ‘CombineGVCFs’ and ‘GenotypeGVCFs’ to generate genotype files (gVCF). Two-step filtering was applied: initial filtering with GATK and secondary filtering with VCFTOOLS (Danecek et al., 2011) to select final variant sites.

We calculated the f₄-ratio using Dsuite v. 0.5 (Malinsky et al., 2021) to explore gene flow between species. SNPs served as input data, with an ASTRAL-derived species tree as the guiding tree. Visualization was done using the ‘plot_f4ratio.rb’ Ruby script. Parallel analyses with different outgroups (Malus or Sorbus) showed high consistency in results regardless of the outgroup selection.

8 | Incomplete Lineage Sorting Analyses

To investigate the role of incomplete lineage sorting (ILS) in shaping the evolutionary trajectory of Pyrus, we employed two approaches integrating the population mutation parameter theta (Cai et al., 2021) and coalescent simulation (Liu et al., 2022). In the first approach, theta was calculated by dividing the branch length in mutation units, inferred by IQ-TREE, by the length in coalescent units from ASTRAL-III. The ASTRAL-III tree was used as a fixed topology to ensure consistency between trees. Additionally, we examined the correlation between branch lengths and ICA values from ASTRAL-III to assess ILS’s impact; a strong positive correlation suggests ILS as a driver of tree conflicts (Zhou et al., 2022).

In the second approach, coalescent simulations were conducted using a dataset of 29 high-quality samples. Gene trees were extracted from these samples, and a species tree was inferred using ASTRAL-III. Phybase v. 1.5 (Liu and Yu, 2010) was used to simulate 10,000 gene trees under the multi-species coalescent model. The distances between simulated gene trees, empirical gene trees, and the species tree were calculated using DendroPy v. 4.5.2 (Sukumaran and Holder, 2010) and visually compared. The disparity in distance distributions was analyzed to assess ILS’s contribution to gene tree incongruence.

9 | Polyploidy Analyses

We integrated multiple sources of evidence to explore the potential impact of polyploidy and whole-genome duplication (WGD) on phylogenetic discrepancies in Pyrus. First, a comprehensive literature review was conducted to gather chromosome-related data, including ploidy levels for all Pyrus species, primarily sourced from the International Plant Chromosome Number (IPCN) database (http://legacy.tropicos.org/Project/IPCN). Additionally, Smudgeplot (Ranallo-Benavidez et al., 2020) was used to infer ploidy by analyzing k-mers within sequencing reads, allowing for the identification of potential polyploid species.

To investigate the occurrence of polyploidization and WGD events in the deep phylogeny of Pyrus, we employed a methodology outlined by Morales-Briones et al. (2022). This involved extracting rooted ortholog trees from homolog trees, applying a filtering criterion of a minimum 50% bootstrap value per tree. Gene duplications identified in these trees were mapped onto a rooted species tree, with the proportions of gene duplications at nodes corresponding to the most recent common ancestor (MRCA) carefully documented. The computational scripts for this analysis are available at https://bitbucket.org/blackrim/clustering.

10 | Inference of Global Split Networks

SplitsTree is an ideal tool for computing global split networks, particularly for deriving unrooted phylogenetic networks from molecular sequence data. Using methods such as split decomposition, neighbor-net, and consensus networks, we employed SplitsTree v. 4.19.0 (Huson and Bryant 2006) to investigate the complex evolutionary trajectory of Pyrus. Our primary dataset consisted of well-aligned SCN genes from the MO dataset, which included 50 Pyrus samples and a Malus outgroup. This investigation entailed an in-depth inference of the implicit network, employing parameters such as uncorrected_P distances, the EqualAngle network construction algorithm, and the NeighborNet method.

11 | Phylogenetic Network Analyses

To explore the complex reticulate evolutionary processes in Pyrus, we used the Species Networks applying Quartets (SNaQ) algorithm within PhyloNetworks (Solís-Lemus et al., 2017). This software, built on maximum pseudolikelihood methods, efficiently infers phylogenetic networks from multi-locus datasets, particularly when scaling the number of taxa or hybridization events. We considered gene flow and ILS as potential sources of discordance in gene trees. To manage computational demands, three datasets were created, each with fewer than 20 samples: (1) “Maleae 15-taxa data,” testing hybridization origins with 15 species from genera like Malus and Sorbus, (2) “Pyrus 16-taxa data,” representing 15 P. subg. Pyrus species with a Malus outgroup, and (3) “Pashia 15-taxa data,” including 14 P. subg. Pashia taxa and a Malus outgroup. The selected species were either major lineages or known for cytonuclear discordance, ensuring a broad view of genetic diversity in Pyrus.

For each dataset, SCN gene trees were analyzed for quartet concordance factors (CFs) using the ‘readTrees2CF’ package. Species trees were reconstructed with ASTRAL-III (Zhang et al., 2018), and CFs were used to infer the optimal phylogenomic network. The maximum number of reticulation events (h_max) ranged from 0 to 6, with inheritance probabilities calculated for hybridization edges. The optimal network was identified by selecting the stable h_max value, where the pseudo-deviance score plateaued, indicating consistency across multiple runs.

12 | Dating Analysis and Ancestral Area Reconstruction

In the context of the Maleae tribe, and particularly the genus Pyrus, there has been a notable absence of temporal dating analyses, resulting in an imprecise determination of the stem age of Pyrus. To address this gap, we adopted a two-step strategy for the divergence time estimation within Pyrus. Despite the discovery of various Pyrus fossils across different epochs and localities (Table S2), most of these fossils as leaf specimens present a limitation, as leaf morphology alone is insufficient for accurate species identification. Additionally, the generic classification of some leaf fossils remains ambiguous, posing challenges in distinguishing between Malus, Pyrus, or other related genera within the Maleae tribe. We first used the MCMCTree, a program implemented in PAML v. 4.9j (Yang, 2007), to estimate the divergence times across the Maleae phylogenetic backbone, incorporating two fossil species. The inferred age stem age of Pyrus was used in our subsequent analyses. In the following step, we employed BEAST2 (Bouckaert et al., 2014) to refine our estimation of divergence times among Pyrus species. This analysis utilized the stem age estimated in the first step of Pyrus as a secondary calibration point, thereby enhancing the precision of our temporal divergence estimates within the Pyrus genus. The detailed parameter settings for these two software programs can be found in the Supplementary Methods.

The biogeographic analysis of Pyrus was conducted using the BioGeoBEARS v. 1.1.1 (Matzke, 2018), integrated within RASP v. 4.2 (Yu et al., 2015). The time tree, inferred by PAML after excluding all outgroup taxa, was employed as the input tree for subsequent analyses. In this study, based on the distribution patterns of the extant Pyrus species and the paleotectonic histories of continents, we categorized the geographic areas into three regions: (A) East Asia, (B) Central and West Asia, and (C) Europe and Northern Africa. During the analysis, a constraint was imposed wherein the maximum number of areas assignable to any given phylogenetic node was restricted to two. The selection of the optimal biogeographic model was based on the highest Akaike Information Criterion corrected for small sample sizes (AICc_wt) value, following the comprehensive evaluation of all models available in the BioGeoBEARS toolkit. This methodological approach facilitated a rigorous and data-driven determination of the most plausible biogeographic scenario for the Pyrus genus.