Data from: Advancing Pyrus phylogeny: Deep genome skimming-based inference coupled with paralogy analysis yields a robust phylogenetic backbone and an updated infrageneric classification of the pear genus (Maleae, Rosaceae)

Data files

Aug 14, 2025 version files 55.81 MB

Analyses_data.zip

55.80 MB
README.md

2.96 KB

Abstract

The lack of a robust phylogenetic backbone has posed significant challenges to proposing an infrageneric taxonomic classification of the pear genus, Pyrus, a widely distributed Eurasian lineage of Rosaceae. This issue has been exacerbated by limited informative loci and inaccessible taxon sampling. To address these limitations, we conducted extensive taxon sampling, encompassing 78 Pyrus ingroup individuals representing 32 species, along with 4 outgroup species. This comprehensive sampling strategy covers a wide range of morphological and geographical variations. To enable accurate phylogenomic inference, we assembled 801 single-copy nuclear genes and 72 plastid coding sequences from deep genome skimming (DGS) data. Additionally, we employed a tree-based method for nuclear orthology inference, which led to the generation of three orthologous datasets: one-to-one orthologs (1to1), monophyletic outgroups (MO), and rooted ingroups (RT). The results yielded from both nuclear and plastid analyses consistently support the monophyly of Pyrus, and two well-supported clades, the Occidental and Oriental clades, were recovered in nine nuclear and three plastid trees. Integrating evidence from morphology and phylogenomics, we propose an updated infrageneric classification of Pyrus, which consists of two subgenera: P. subg. Pyrus and P. subg. Pashia stat. nov. This revised classification provides a more robust framework for understanding the evolutionary relationships within the pear genus.

Phylogenomic analyses from hundreds of single-copy nuclear genes and plastomes via deep genome skimming data support an updated infrageneric classification of the pear genus Pyrus (Maleae, Rosaceae)

This package contains the data and software outputs - Analyses_data.tar.gz

Description of the Data and file structure

Analyses_data.zip

ALIGNMENTS

Data 1. Pyrus_nuclear.fasta = 801 nuclear single-copy reference genes filtered from thirteen genomes
Data 2. Pyrus_72_plastid_coding_genes.fasta = concatenated alignments of 72 plastid coding genes of Pyrus
Pyrus_72_plastid_coding_genes.partitions.txt = partition file for concatenated alignments of 72 plastid coding genes of Pyrus
Data 3. Pyrus_513_1to1_genes_dataset.fasta = concatenated alignments of 513 1to1 ortholog genes dataset
Pyrus_513_1to1_genes_dataset.partitions.txt = partition file for concatenated alignments of 513 1to1 ortholog genes dataset
Data 4. Pyrus_748_MO_genes_dataset.fasta = concatenated alignments of 748 MO ortholog genes dataset
Pyrus_748_MO_genes_dataset.partitions.txt = partition file for concatenated alignments of 748 MO ortholog genes dataset
Data 5. Pyrus_884_RT_genes_dataset.fasta = concatenated alignments of 884 RT ortholog genes dataset
Pyrus_884_RT_genes_dataset.partitions.txt = partition file for concatenated alignments of 884 RT ortholog genes dataset

TREES

Pyrus_72_plastid_coding_genes_RAxML.tre = the plastid topology estimated by RAxML
Pyrus_72_plastid_coding_genes_IQ-TREE2.tre = the plastid topology estimated by IQ-TREE2
Pyrus_72_plastid_coding_genes_BS10_speciestree-astral2.tre = the species tree inferred from the plastid gene trees with collapsed low-supported (10%) branches
Pyrus_513_1to1_genes_dataset_RAxML.tre = the 1to1 ortholog topology estimated by RAxML
Pyrus_513_1to1_genes_dataset_IQ-TREE2.tre = the 1to1 ortholog topology estimated by IQ-TREE2
Pyrus_513_1to1_genes_dataset_BS10_speciestree-astral2.tre = the species tree inferred from the 1to1 ortholog trees with collapsed low-supported (10%) branches
Pyrus_748_MO_genes_dataset_RAxML.tre = the MO ortholog topology estimated by RAxML
Pyrus_748_MO_genes_dataset_IQ-TREE2.tre = the MO ortholog topology estimated by IQ-TREE2
Pyrus_748_MO_genes_dataset_BS10_speciestree-astral2.tre = the species tree inferred from the MO ortholog trees with collapsed low-supported (10%) branches
Pyrus_884_RT_genes_dataset_RAxML.tre = the RT ortholog topology estimated by RAxML
Pyrus_884_RT_genes_dataset_IQ-TREE2.tre = the RT ortholog topology estimated by IQ-TREE2
Pyrus_884_RT_genes_dataset_BS10_speciestree-astral2.tre = the species tree inferred from the RT ortholog trees with collapsed low-supported (10%) branches

Taxon sampling, DNA extraction, and sequencing. — Multi-source genomic data have been collected to infer a comprehensive phylogenomic backbone. We acquired DGS data for wild pear species from public databases, namely the National Center for Biotechnology Information (NCBI) and the National Genomics Data Center (NGDC), encompassing 63 individuals, and also generated new data from silica-gel dried materials or museum/herbarium specimens for 15 individuals, exclusively for this study. In total, we sampled 78 individuals of Pyrus, representing 32 species. We sampled several individuals of each of the widespread species, i.e., P. betulifolia Bunge (4 individuals), P. calleryana (5), P. elaeagrifolia Pall. (3), P. pashia Buch.-Ham. ex D.Don (5), P. phaeocarpa Rehder (4), and P. ussuriensis (7). It is important to note that we aimed to propose an updated infrageneric classification of Pyrus in this study rather than focus on species delimitation. In addition, we chose four species (Crataegus iracunda Beadle, Cydonia oblonga Mill., Malus sylvestris Mill., Pourthiaea beauverdiana (C.K.Schneid.) Hatus.) from different genera of Maleae as the outgroup (Appendix 1).

In this study, we employed well-established procedures for DNA extraction and sequencing, referring to B.B. Liu & al. (2021, 2022). Briefly, DNA extraction was carried out from silica-gel dried leaves or herbarium specimens utilizing the modified cetyltrimethylammonium bromide (mCTAB) method (J.L. Li & al., 2013) in the laboratory of State Key Laboratory of Plant Diversity and Specialty Crops, Institute of Botany, Chinese Academy of Sciences. Verification of DNA quality was performed through agarose gel electrophoresis. Sequencing libraries were prepared using the NEBNext Ultra II DNA Library Prep Kit, following the protocol provided by the manufacturer (New England Biolabs, Ipswich, Massachusetts, U.S.A.). Subsequently, sequencing was conducted on the Illumina NovaSeq Platform at the Novogene lab in Beijing, China, with 150 bp paired-end reads. To ensure the accessibility of the raw data, we submitted the newly generated sequencing data to the NCBI Sequence Read Archive (SRA) under the BioProject PRJNA891949 with the corresponding SRA numbers (SRR22103100–SRR25212414) and the NGDC Genome Sequence Archive (GSA) under the BioProject CRA012315 with the corresponding GSA numbers (CRR855073–CRR855088). All details are listed in the supplement. Table S1.

Single-copy nuclear marker development, reads processing, and quality control. — Since our investigation focused on the pear genus in the apple tribe Maleae, we utilized our previously published SCN marker set specific to Maleae (Z.T. Jin & al., 2023) as the reference for subsequent analyses. The reference dataset, which contains 801 SCNs, was designed using MarkerMiner v.1.0 (Chamala & al., 2015) and is based on three Rosaceae genomes: Malus domestica (Suckow) Borkh. (accession number: GCF_002114115.1), Prunus persica (L.) Batsch (GCA_000346465.2), and Pyrus ussuriensis × P. communis (GCA_008932095.1). Though initially intended for retrieving SCN sequences from diverse taxa within the subfamily Amygdaloideae, this reference dataset includes a wide range of Maleae species, making it highly suitable for studying the pear genus. For detailed information on the comprehensive parameter settings and design process of the SCN markers, please refer to our previously published research paper on the genus Stranvaesia in the Rosaceae family (Z.T. Jin & al., 2023). The SCN gene reference (801) is available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.dr7sqvb43.

In this study, we conducted a quality evaluation of the newly generated sequencing data and those obtained from public databases using FastQC v.0.12.1 (Andrews, 2018). The purpose was to identify any residual sequencing adapters and to evaluate the quality of the reads. Subsequently, sequencing data that did not meet the required standards were trimmed using Trimmomatic v.0.39 (Bolger & al., 2014) with the following parameters: LEADING:3, TRAILING:3, SLIDINGWINDOW:4:15, MINLEN:36, and the trimmed data were rechecked using FastQC v.0.12.1 (Andrews, 2018) until all samples met the criteria for downstream analyses. The numbers of clean reads, bases, and data size for each sample can be found in the supplement. Table S1.

SCN sequence assembly and recovery efficiency. — HybPiper (Johnson & al., 2016) is a software package initially designed for recovering gene regions of interest from target enrichment sequencing data. However, recent studies (B.B. Liu & al., 2021, 2022; Su & al., 2021; Duan & al., 2023; Z.T. Jin & al., 2023) have demonstrated its versatility in capturing target sequences across various data types, such as RNA-Seq, DGS, and whole-genome sequencing (WGS) data. Therefore, we employed HybPiper v.2.0 (Johnson & al., 2016) with default settings to assemble SCN sequences from our DGS data. The process of target sequence recovery using HybPiper (Johnson & al., 2016) entails three steps as follows. Initially, BWA v.0.7.17 (Li & Durbin, 2009) and SAMtools v.1.17 (H. Li & al., 2009) were applied to search for reads against the input target DNA sequences and to sort them according to a prior mapping to a target locus. Notably, we adjusted the default coverage cutoff (--cov_cutoff) for SPAdes v.3.15.5 (Bankevich & al., 2012) to 5, allowing us to maximize the number of contigs retained for assembly (see also B.B. Liu & al., 2021). Subsequently, reads for each gene were assembled into contigs using SPAdes (Bankevich & al., 2012). Finally, multiple contigs matching the same target reference were merged into “stitched contigs” using Exonerate v.2.4.0 (Slater & Birney, 2005). To evaluate the recovery efficiency of the target sequence for each sample and to visualize the results, we utilized the post-processing commands “hybpiper stats” and “hybpiper recovery_heatmap” implemented in HybPiper (Johnson & al., 2016). Detailed information regarding the recovery statistics for each gene and the heatmap of genes from multiple samples can be found in the supplement. Table S2 and suppl. Fig. S1.

Tree-based orthology inference for SCN genes. — During the initial development of the SCN genes, we employed a stringent filtering process to preserve strict single-copy genes as target sequences. However, after executing the “hybpiper paralog_retriever” command of HybPiper (Johnson & al., 2016), we encountered a substantial number of genes displaying paralog warnings in some samples. To address this issue, we implemented a tree-based orthology inference method proposed by Yang & Smith (2014), and then implemented the scripts that had been adapted by Morales-Briones & al. (2022) for Hyb-Seq data. This method can effectively mitigate the impact of paralogs. The retrieved assemblies, devoid of putative chimeric sequences, were extracted and utilized for subsequent homolog tree inference. Each SCN sequence was aligned using MAFFT v.7.520 (Nakamura & al., 2018), and the resulting alignments underwent trimming using the “pxclsq” command within the software package Phyx v.1.0 (Brown & al., 2017) to eliminate positions with over 90% missing data. Next, homolog trees were constructed based on the pruned alignments using RAxML v.8.2.12 (Stamatakis, 2014) with a GTRCAT model and 100 bootstrap replicates. To ensure accuracy, we removed abnormally long branches using TreeShrink v.1.3.9 (Mai & Mirarab, 2018) with a parameter set to 0.01. Additionally, within each homolog tree, we clipped monophyletic and paraphyletic tips belonging to the same taxon, retaining only the tip with the maximum number of characters in the alignment. The resulting homolog trees were then employed for subsequent orthology inference.

To establish a robust nuclear phylogenetic framework, we generated three datasets using distinct orthology inference methods proposed by Yang & Smith (2014): one-to-one orthologs (1to1), monophyletic outgroup (MO), and rooted tree (RT). The number of orthologous genes varies substantially due to the different processing strategies employed for the homolog trees in these three inference methods. The 1to1 method employs stringent orthology screening criteria, eliminating all homolog trees containing paralogs and retaining only homologs with no duplicated taxa. On the other hand, the MO method retains the orthologs filtered by the 1to1 method and conducts further analysis on homolog trees containing paralogs. It selects homologs with monophyletic and non-repeating outgroups, reroots the tree, and prunes paralogs from root to tip. Only the trimmed homolog trees satisfying the minimum ingroup taxa count are recognized as orthologs. Similarly, the RT method also preserves orthologs filtered by the 1to1 method and extends the analysis to homolog trees with paralogs. These homolog trees are pruned by extracting ingroup clades and cutting paralogs from root to tip. Only the pruned homolog trees that meet the minimum ingroup taxa criterion are selected for subsequent analysis. For this study, we set the minimum number of ingroup taxa to 25 to maximize the number of obtained orthologs. Finally, we utilized a Python script to generate FASTA files from the trimmed trees. All scripts are available at https://bitbucket.org/dfmoralesb/target_enrichment_orthology.

Orthologous datasets construction and phylogenomic analyses. — To address the variation in recovery efficiency of SCN genes among samples, we conducted additional cleaning procedures for the three orthologous datasets using a pipeline that has been successfully implemented in a series of studies (B.B. Liu & al., 2021, 2022; Z.T. Jin & al., 2023). These steps aimed to eliminate ambiguous loci and ensure accurate phylogenetic inference in downstream phylogenomic analyses. Specifically, each ortholog was aligned using MAFFT v.7.520 (Nakamura & al., 2018) with the parameters “--localpair --maxiterate 1000”, and then trimmed using trimAL v.1.4.1 (Capella‐Gutiérrez & al., 2009) with parameters “-gt 0.8 -st 0.001”. Subsequently, the resulting trimmed genes were merged into a supermatrix using AMAS v.1.0 (Borowiec, 2016), and Spruceup v.1.0 (Borowiec, 2019) was utilized to remove outlier sequences based on the uncorrected p-distance calculation method. The resultant alignment was then separated into loci again using AMAS v.1.0 (Borowiec, 2016) and trimmed with trimAL v.1.4.1 (Capella‐Gutiérrez & al., 2009) using the same parameters. The resulting alignments longer than 250 bp were retained for inferring maximum likelihood (ML) trees with RAxML v.8.2.12 (Stamatakis, 2014) using the parameters “-m GTRGAMMA -N 100”. Next, we identified tips with long branches and removed them from the respective gene tree and alignment using TreeShrink v.1.3.9 (Mai & Mirarab, 2018). The resulting trees and sequences were then utilized for subsequent phylogenetic analyses.

By integrating multiple phylogenetic inference methods, including concatenated and coalescent-based methods, our objective was to accurately estimate the phylogeny of Pyrus based on the three nuclear data matrices. To achieve this, we combined all orthologous sequences for each dataset using AMAS v.1.0 (Borowiec, 2016). Subsequently, we employed PartitionFinder2 (Stamatakis, 2006; Lanfear & al., 2017) with the search scheme set to the rcluster algorithm (Lanfear & al., 2014) to determine the optimal partitioning scheme and evolutionary model for each partition. Using the estimated optimal partitioning scheme and evolutionary model, we inferred the ML phylogeny with IQ-TREE2 v.2.2.2.7 (Minh & al., 2020) using parameters “-B 1000 -alrt 1000”, as well as with RAxML v.8.2.12 (Stamatakis, 2014) using the GTRGAMMA model and 100 bootstrap (BS) replicates. For coalescent-based estimation, branches with support below 10 in each ortholog tree were collapsed using the “pxcolt” command in Phyx (Brown & al., 2017). The resulting pruned gene trees were then entered into ASTRAL-III v.5.7.8 (C. Zhang & al., 2018) to infer the species tree using default parameters.

Plastome coding sequences recovery and phylogenetic analysis. — We assembled CDSs for phylogenetic inference while taking into account uneven sequencing coverage (G.N. Liu & al., 2023b; Xu & al., 2023). Specifically, we retrieved plastome-associated reads that were generated during the extending iterations stage of Getorganelle v.1.7.6.1 (J.J. Jin & al., 2020) for each sample. Subsequently, these reads were utilized to assemble the plastid CDSs using HybPiper (Johnson & al., 2016). To facilitate the assembly process, we extracted the plastid CDSs of Pyrus betulifolia (GenBank accession number: ON478188) using Geneious Prime v.2023.0.1 (Kearse & al., 2012) as a reference in HybPiper (Johnson & al., 2016). We employed the “hybpiper stats” and “hybpiper recovery_heatmap” commands implemented in HybPiper to summarize and visualize the overall recovery efficiency. Due to the uneven recovery efficiency of plastid CDSs across samples, we kept only 72 plastid CDSs for the subsequent phylogenetic analyses. The final recovery statistic file and heatmap are available in suppl. Table S3 and the supplement Fig. S2, respectively.

We aligned all the 72 plastid CDSs using MAFFT v.7.520 (Nakamura & al., 2018) with the “auto” parameter. Next, the resulting alignments were trimmed with trimAL v.1.4.1 (Capella‐Gutiérrez & al., 2009) using the “automated” parameter and then merged with AMAS v.1.0 (Borowiec, 2016) for concatenation-based ML analyses. To estimate the best-fit partitioning schemes and nucleotide substitution models for each partition, we utilized PartitionFinder2 (Stamatakis, 2006; Lanfear & al., 2017) with the greedy algorithm (Lanfear & al., 2012). For the best ML trees, we employed IQ-TREE2 v.2.2.2.7 (Minh & al., 2020) and RAxML v.8.2.12 (Stamatakis, 2014) with the same parameter settings as used for the nuclear phylogeny. Similarly, the plastid coalescent-based tree was inferred using the same process as the nuclear species tree with ASTRAL-III (C. Zhang & al., 2018).

Data from: Advancing Pyrus phylogeny: Deep genome skimming-based inference coupled with paralogy analysis yields a robust phylogenetic backbone and an updated infrageneric classification of the pear genus (Maleae, Rosaceae)

Data files

Abstract

README: Advancing Pyrus phylogeny: Deep genome skimming-based inference coupled with paralogy analysis yields a robust phylogenetic backbone and an updated infrageneric classification of the pear genus (Maleae, Rosaceae)

Description of the Data and file structure

Analyses_data.zip

Methods

Works referencing this dataset