Data from: Nightmare or delight: taxonomic circumscription meets reticulate evolution in the phylogenomic era

Data files

Sep 20, 2023 version files 16.22 MB

Analyses_data.zip

16.22 MB
README.md

2.28 KB

Abstract

Phylogenetic studies in the phylogenomics era have demonstrated that reticulate evolution greatly impedes the accuracy of phylogenetic inference, and consequently can obscure taxonomic treatments. However, the systematics community lacks a broadly applicable strategy for taxonomic delimitation in groups characterized by pervasive reticulate evolution. The red-fruit genus, Stranvaesia, provides an ideal model to examine the influence of reticulation on generic circumscription, particularly where hybridization and allopolyploidy dominate the evolutionary history. In this study, we conducted phylogenomic analyses integrating data from hundreds of single-copy nuclear (SCN) genes and plastomes, and interrogated nuclear paralogs to clarify the inter/intra-generic relationship of Stranvaesia and its allies in the framework of Maleae. Analyses of phylogenomic discord and phylogenetic networks showed that allopolyploidization and introgression promoted the origin and diversification of the Stranvaesia clade, a conclusion further bolstered by cytonuclear and gene tree discordance. With a well-inferred phylogenetic backbone, we propose an updated generic delimitation of Stranvaesia and introduce a new genus, Weniomeles. This new genus is distinguished by its purple-black fruits, thorns trunk and/or branches, and a distinctive fruit core anatomy characterized by multilocular separated by a layer of sclereids and a cluster of sclereids at the top of the locules. Through this study, we highlight a broadly applicable workflow that underscores the significance of reticulate evolution analyses in shaping taxonomic revisions from phylogenomic data.

https://doi.org/10.5061/dryad.hx3ffbghm

This package contains the data and software outputs - Analyses_data.tar.gz

Description of the data and file structure

Analyses_data.zip

Alignments

Data 1. Stranvaesia_nuclear.fasta = 801 nuclear single-copy reference genes filtered from thirteen genomes

Data 2. Stranvaesia_78_plastid_coding_genes.fasta = concatenated alignments of 78 plastid coding genes of Stranvaesia
Stranvaesia_78_plastid_coding_genes.partitions.txt = partition file for concatenated alignments of 78 plastid coding genes of Stranvaesia

Data 3. Stranvaesia_426_SCN_genes_dataset.fasta = concatenated alignments of 426 SCN genes dataset
Stranvaesia_426_SCN_genes_dataset.partitions.txt = partition file for concatenated alignments of 426 SCN genes dataset

Data 4. 13-taxa_sampling(Stranvaesia)_for_PhyloNetworks_analysis.fasta = concatenated alignments of 13-taxa sampling (Stranvaesia) dataset
13-taxa_sampling(Stranvaesia)_for_PhyloNetworks_analysis.partitions.txt = partition file for concatenated alignments of 13-taxa sampling (Stranvaesia) dataset

Data 5. 11-taxa_sampling(Stranvaesia)_for_PhyloNetworks_analysis.fasta = concatenated alignments of 11-taxa sampling (Stranvaesia) dataset
11-taxa_sampling(Stranvaesia)_for_PhyloNetworks_analysis.partitions.txt = partition file for concatenated alignments of 11-taxa sampling (Stranvaesia) dataset

TREES

Stranvaesia_78_plastid_coding_genes_RAxML.tre = the plastid topology estimated by RAxML
Stranvaesia_78_plastid_coding_genes_IQ-TREE2.tre = the plastid topology estimated by IQ-TREE2
Stranvaesia_78_plastid_coding_genes_BS10_speciestree-astral2.tre = the species tree inferred from the plastid gene trees with collapsed low-supported (10%) branches

Stranvaesia_426_SCN_genes_dataset_RAxML.tre = the nuclear topology estimated by RAxML
Stranvaesia_426_SCN_genes_dataset_IQ-TREE2.tre = the nuclear topology estimated by IQ-TREE2
Stranvaesia_426_SCN_genes_dataset_BS10_speciestree-astral2.tre = the species tree inferred from the nuclear gene trees with collapsed low-supported (10%) branches

2.1. Taxon sampling, DNA extraction, and sequencing

To refine our understanding of the phylogenetic placement and generic circumscription of Stranvaesia within the apple tribe framework, we undertook a comprehensive taxon sampling of Stranvaesia and its allies. Based on the redefinition by Liu et al. (2019), our sample included six individuals from three accepted species: two from S. bodinieri, one from S. nussia (the type species), and three from S. oblanceolata. Given the unclear phylogenies between Stranvaesia and Photinia, our study also encompassed 19 individuals from Photinia, representing 16 species out of 20 currently recognized species (referring to Yu and Ku, 1974; Lu and Spongberg, 2003). Furthermore, to ascertain the phylogenetic position of the target lineages, we also selected 41 outgroup species spanning 19 genera. These encompassed one genus from the tribe Gillenieae, specifically Gillenia stipulata (Muhl. ex Willd.) Nutt. Additionally, they included 18 currently recognized genera from the tribe Maleae, Amelanchier Medik., Aronia Medik., Chaenomeles Lindl., Cormus Spach, Cotoneaster Medik., Crataegus L., Cydonia Mill., Dichotomanthes Kurz, Kageneckia Ruiz & Pav., Malacomeles (Decne.) Decne., Malus Mill., Pourthiaea Decne., Pseudocydonia (C.K.Schneid.) C.K.Schneid., Pyracantha M.Roem., Pyrus L., Rhaphiolepis Lindl., Sorbus L. sensu lato, and Vauquelinia Corrêa ex Bonpl. In total, our research sampled 66 individuals: eight from RNA-Seq data, 44 from DGS data, and 14 from WGS data. Out of these, 23 DGS data sets were sequenced specifically for this project, 21 were taken from our earlier study (Liu et al., 2022), and the remaining 22 were sourced from NCBI. The DGS raw data for the newly sequenced samples can be available from Genome Sequence Archive (GSA) under the BioProject CRA012248 and Sequence Read Archive (SRA) under BioProject PRJNA859408. The corresponding accession numbers and voucher information are detailed in Table S1.

Total genomic DNAs were extracted from silica-gel dried leaves or herbarium specimens using a modified CTAB (mCTAB) method (Li et al., 2013) in the lab of the Institute of Botany, Chinese Academy of Science (IBCAS) in China. The whole genomic libraries were prepared in the lab of Novogene, Beijing, China using NEBNext^® UltraTM II DNA Library Prep Kit, and then paired-end reads of 2 × 150 bp were generated on the NovoSeq 6000 Sequencing System (Novogene, Beijing).

2.2. Single-copy nuclear marker development

SCN marker development followed the pipeline presented in Liu et al. (2021). Briefly, the coding regions of three genomes retrieved from GenBank (Malus domestica (Suckow) Borkh., accession number: GCF_002114115.1; Prunus persica (L.) Batsch, accession number: GCA_000346465.2; and Pyrus ussuriensis Maxim. × P. communis L., accession number: GCA_008932095.1) were inputted into MarkerMiner v. 1.0 (Chamala et al., 2015) to identify the putative single-copy genes. The resulting genes were then filtered by successively BLASTing (Altschul et al., 1990, 1997; Camacho et al., 2009) against ten genomes of the family, viz., Cydonia oblonga Mill. (accession number: GCA_015708375.1), Dryas drummondii Richardson ex Hook. (accession number: GCA_003254865.1), Fragaria vesca L. (accession number: GCF_000184155.1), Geum urbanum L. (accession number: GCA_900236755.1), Gillenia trifoliata (L.) Moench (accession number: GCA_018257905.1), Malus domestica (accession number: GCF_002114115.1), Prunus persica (accession number: GCF_000346465.2), Purshia tridentata (Pursh) DC. (accession number: GCA_003254885.1), Pyrus ussuriensis×P. communis (accession number: GCA_008932095.1), and Rosa chinensis Jacq. (accession number: GCF_002994745.2) using Geneious Prime (Kearse et al., 2012), with the parameters settings in the Megablast program (Morgulis et al., 2008) as a maximum of 60 hits, a maximum E-value of 1 × 10^-10, a linear gap cost, a word size of 28, and scores of 1 for match and -2 for mismatch in alignments. We first excluded genes with mean coverage > 1.1 for alignments, which indicated potential paralogy and/or the presence of highly repeated elements in the sequences. The remaining alignments were further visually examined to exclude those genes receiving multiple hits with long overlapping but different sequences during BLASTing. It should be noted that the alignments with mean coverage between 1.0 and 1.1 were typically caused by tiny pieces of flanking intron sequences in the alignments. These fragments were considered SCN genes here. The resulting SCN gene reference is available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.hx3ffbghm.

2.3. Reads processing and assembly

We processed the raw reads by trimming low-quality bases and sequence adapters using Trimmomatic v. 0.39 (Bolger et al., 2014), with the average quality per base of a four-base sliding window below 15, and reads less than 36 bases removed. Adapters from the deep genome skimming (DGS) and genome resequencing (WGS) data were removed with TruSeq3-PE.fa as input adapter sequences, and those in the transcriptome (RNA-Seq) data were removed with NexteraPE-PE.fa (available from https://github.com/usadellab/Trimmomatic/tree/main/adapters). The trimmed data were then checked with FastQC v. 0.11.9 (Andrews, 2018) to ensure that all adapters were removed and qualified for downstream analysis.

Plastome assembly followed a two-step strategy proposed by Liu et al. (2019, 2023), integrating NOVOPlasty 3.6 (Dierckxsens et al., 2016) and a Successive Approach combining Reference-based and De novo assembly (hereafter referred to as "SARD approach") (Liu et al., 2023). The SARD approach can handle any amount of data to obtain high-quality plastomes. We generated 66 complete chloroplast genomes, 23 of which were newly assembled for this study, and the remaining 43 were downloaded from NCBI. The circularized plastomes were annotated using Geneious Prime (Kearse et al., 2012) with Photinia prunifolia (Hook. & Arn.) Lindl. (GenBank accession number MK920279) downloaded from NCBI as the reference for Photinia and Stranvaesia nussia (GenBank accession number MK920284) for Stranvaesia. We then manually checked each coding gene’s start and stop codon in all chloroplast genomes and removed incorrect annotations by translating the sequences into proteins. The final assembled chloroplast genome was converted into the format required by GenBank using GB2sequin (Lehwark and Greiner, 2019), and then submitted to NCBI with the corresponding accession number listed in Table S1.

We utilized the ‘hybpiper assemble’ command of HybPiper v. 2.0.1 (Johnson et al., 2016) to assemble the single-copy nuclear locus of each sample with the parameter “--cov_cutoff 5” based on the SCN gene reference mentioned above. We then summarized and visualized the recovery efficiency using the ‘hybpiper stats’ and ‘hybpiper recovery_heatmap’ commands. Because paralogous genes may impact phylogenetic inference, especially in groups with prevalent reticulate evolution, we performed a paralogous genes search using the post-processing command ‘hybpiper paralog_retriever’ in HybPiper v. 2.0.1 (Johnson et al., 2016) and used the genes without paralog warnings in the subsequent phylogenetic analyses. Due to differences in sequence recovery efficiency among samples in this study, we followed Liu et al. (2022)’s pipeline to further process the assembled SCN genes to remove outlier loci and short sequences, and to account for missing data. Briefly, each SCN gene was aligned by MAFFT v. 7.480 (Nakamura et al., 2018) and clipped by trimAL v. 1.2 (Capella‐Gutiérrez et al., 2009) to remove aligned columns with gaps in more than 20% of the sequences and retain sequences with average similarity more than 99.9%. The resulting SCN genes were then concatenated using AMAS v. 1.0 (Borowiec, 2016), and the concatenated genes were then used as input to run Spruceup (Borowiec, 2019) for removing outlier sequences. We also used AMAS v. 1.0 (Borowiec, 2016) to split the processed alignment back into single SCN gene alignments and retrimmed these alignments using trimAL v. 1.2 (Capella‐Gutiérrez et al., 2009) with the same parameters as above. Given the potentially limited informativeness in short sequences, we keep the aligned sequences with more than 250 bp length for downstream analysis using a python script (exclude_short_sequences.py, Liu et al., 2022). To remove possible erroneous sequences in the alignments, we used TreeShrink v. 1.3.9 (Mai and Mirarab, 2018) to detect and remove outlier tips with abnormally long branches in each SCN gene tree. The following phylogenetic analysis is based on these shrunk trees and sequences.

2.4. Phylogenetic analyses

We inferred the phylogenetic relationships of Stranvaesia in the context of Maleae using two sets of data, i.e., nuclear SCN genes and plastid coding sequences (CDSs). In this study, both concatenated and coalescent-based methods were carried out on each data type. All 78 plastid CDSs were extracted from 66 plastomes using Geneious Prime (Kearse et al., 2012). They were aligned by MAFFT v. 7.475 (Nakamura et al., 2018) independently with the “--auto” option and then concatenated by AMAS v. 1.0 (Borowiec, 2016). The best-fit partitioning schemes and/or nucleotide substitution models for downstream analysis were searched using PartitionFinder2 (Stamatakis, 2006; Lanfear et al., 2016), with parameters set to linked branch lengths, Corrected Akaike Information Criterion (AICc) and greedy (Lanfear et al., 2012) algorithm. We first estimated a maximum likelihood (ML) tree with IQ-TREE2 v. 2.1.3 (Minh et al., 2020) with 1000 SH-aLRT and the ultrafast bootstrap replicates, as well as collapsing near zero branches option using the best partitioning schemes and nucleotide substitution models inferred above. An alternative ML tree was inferred using RAxML 8.2.12 (Stamatakis, 2014) with the GTRGAMMA model for each partition and clade support assessed with 200 rapid bootstrap (BS) replicates. Considering possible conflict in evolutionary history among plastid genes, we estimated a coalescent-based species tree based on the 78 plastid CDSs. We inferred individual ML gene tree using RAxML with a GTRGAMMA model, and 200 BS replicates to assess clade support. After collapsing branches with support below 10 using phyx (Brown et al., 2017), all 78 gene trees were then used to estimate a species tree using ASTRAL-III (Zhang et al., 2018) with local posterior probabilities (LPP; Sayyari and Mirarab, 2016) to assess clade support. These three trees are available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.hx3ffbghm.

For the nuclear phylogenetic inference, we concatenated all the previously shrunk SCNs sequences with AMAS v. 1.0 (Borowiec, 2016), then used PartitionFinder2 (Stamatakis, 2006; Lanfear et al., 2016) to estimate the best-fit partitioning schemes and/or nucleotide substitution models under the corrected Akaike information criterion (AICc) and linked branch lengths, as well as with the rcluster algorithm (Lanfear et al., 2014). As described above in the plastid phylogenetic analysis, ML trees were estimated by IQ-TREE2 v. 2.1.3 (Minh et al., 2020) with 1000 SH-aLRT using UFBoot2 and collapsing near zero branches option and RAxML 8.2.12 (Stamatakis, 2014) with GTRGAMMA model for each partition and clade support assessed with 200 rapid bootstrap (BS) replicates. Individual ML gene tree were inferred using RAxML with a GTRGAMMA model, and 200 BS replicates to assess clade support. To decrease the systematic error from low-supported clades, we used phyx (Brown et al., 2017) to collapse the branches of these shrunk SCN gene trees with support below 10. The processed trees were then used to estimate a coalescent-based species tree with ASTRAL-III (Zhang et al., 2018) using local posterior probabilities (LPP) to assess clade support. These three trees are available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.hx3ffbghm.

2.5. Detecting and visualizing nuclear gene tree discordance

Due to the gene tree discordance detected in the inferred phylogeny, we used phyparts (Smith et al., 2015) and Quartet Sampling (QS: Pease et al., 2018) to evaluate gene tree conflict in the nuclear datasets. By comparing the gene trees against the ML tree inferred from RAxML, phyparts can count the number of concordant and conflicting gene trees at each node of the RAxML tree and yield an Internode Certainty (ICA) score reflecting the degree of conflict. We firstly ran a full concordance analysis (-a 1) with the branch/nodes bootstrap support (BS) of the gene tree inferred from RAxML lower than 50% as uninformative. However, the analysis assumes no missing taxa in the input gene trees. Due to the presence of uneven recovery efficiency of nuclear genes obtained from DGS data, we ran an additional quick concordant analyses (-a 0) without the support cutoff option (-s) and combine the information of both analyses. The result of phyparts was visualized as pie charts with a python script (available from https://bitbucket.org/dfmoralesb/target_enrichment_orthology/src/master/phypartspiecharts_missing_uninformative.py). Based on repeated subsampling of quartets from the input tree and alignment to generate counts of the three possible topologies (and uninformative replicates) and calculate the confidence, consistency, and informativeness of each internal branch, QS analysis can better address phylogenetic discordance with comprehensive and specific information on branch support. Therefore, QS analysis was used to evaluate gene tree conflict with paraments set to 100 replicates and the log-likelihood threshold of 2; this can gather more information on these strongly discordant nodes shown in phyparts. The QS result was visualized with plot_QC_ggtree.R (by Shuiyin Liu, available from https://github.com/ShuiyinLIU/QS_visualization).

2.6. Phylogenetic network analyses

Frequent hybridization events and rapid radiation characterize the intricate evolutionary history of Maleae. In order to decipher potential reticulation events that may account for discordance between the nuclear and plastid trees within Stranvaesia, we employed the phylogenomic network method SNaQ, which is implemented in PhyloNetworks (Solís-Lemus et al., 2017). This method pinpoints possible hybridization events and computes the inheritance probabilities γ and 1-γ, signifying the genetic material’s transmission from the two parental lineages to the hybrid. The optimal number of such hybridization events is identified using a pseudolikelihood method. While Solís-Lemus and Ané (2016) suggest that SNaQ can handle up to 35 species, our previous findings indicate that the estimated hybridization events differ when more than 20 species are sampled based on the same dataset. For optimal efficiency, computational effectiveness, and performance in phylogenetic network estimation, we recommend performing the SNaQ analyses with between 10 to 15 species.

To precisely estimate the netlike evolution during the origin and diversification of Stranvaesia and its allies, we adopted two sampling strategies aiming at covering all major clades in the Maleae tribe. Both strategies incorporated a core set of four Stranvaesia species (five individuals): S. bodinieri, S. lasiogyna (= Photinia lasiogyna (Franch.) C.K.Schneid.), S. nussia, and S. oblanceolata. The first strategy extends beyond these four core species to include seven closely associated genera: Chaenomeles, Cotoneaster, Cydonia, Malus, Photinia, Pourthiaea, and Pseudocydonia, using Gillenia as the outgroup (hereafter referred to as “13-Taxon Sampling SNaQ Analyses”). This approach aims to highlight potential reticulation events between Stranvaesia and its closely related genera. Conversely, the second strategy combines the core species with five more distantly related genera: Crataegus, Dichotomanthes, Pyrus, Rhaphiolepis, and Vaquelinia, once again using Gillenia as the outgroup (hereafter referred to as “11-Taxon Sampling SNaQ Analyses”). This dataset is curated to identify gene flow between Stranvaesia and these more remotely related genera.

The quartet concordance factors (CFs) represent the proportion of genes supporting each possible relationship with a given quartet (Larget et al., 2010), and we summarized the CFs based on all nuclear SCN gene trees estimated from RAxML 8.2.12 (Stamatakis, 2014). The species tree estimated with ASTRAL-III (Zhang et al., 2018) was used to generate an optimal starting SNaQ network with no hybridization edges (h_max = 0). The best network (h_max = 0) and the CFs were used to estimate the next optimal network (h_max = 1), and the resulting network was used to estimate the next optimal network (h_max = 2), and so on. We constructed seven optimal networks using h_max values ranging from 0 to 6 with 50 independent runs for each h_max. The pseudo-deviance score estimated from each run’s branch lengths and inheritance probabilities can be used to select the optimal phylogenetic network, with lower pseudo-deviance scores indicating a better fit (Solís-Lemus et al., 2017). We plotted h_max (0 to 6) against the log-likelihood score (i.e., network score) of the optimal network for each h_max value to assess the optimal number of hybridization edges. Plotting the decrease in the pseudo-deviance score and observing a leveling-off in the rate of change was used to determine the optimal h_max value.

2.7. Allopolyploidy analyses

GRAMPA utilizes the least common ancestor (LCA) mapping algorithm and the representation of multi-labeled trees (MUL-trees) to identify the most probable clade with a polyploid origin. In the absence of polyploidy, a singly-labeled tree would be more parsimonious than any MUL-tree. If a MUL-tree is most parsimonious, the parental lineage(s) involved in a genome doubling event can be inferred from the most parsimonious MUL-tree. Due to frequently reported polyploidy events in some genera in Maleae (Robertson et al., 1991; Liu et al., 2022), we ran the GRAMPA (Thomas et al., 2017) analysis to identify possible allopolyploid and/or autopolyploid scenarios involved in the origin of the Stranvaesia clade. We tested if the Stranvaesia clade was a result of allopolyploidization (“-h1” inputs), with the remaining nodes investigated as potential secondary parental branches (“-h2” inputs).

2.8. Dating analysis and ancestral area reconstruction

We estimated the divergence age and the ancestral areas of Stranvaesia in the framework of Maleae using the nuclear SCN datasets. Divergence times were estimated using treePL (Smith and O’Meara, 2012), which uses a penalized likelihood algorithm. The treePL program is suitable for dealing with large datasets with hundreds of taxa by combining two rounds of gradient-based optimization and a partial simulated annealing procedure to achieve speed optimization as well as avoid issues with local optima. We used the best ML tree inferred from RAxML based on the 426 nuclear SCN gene matrix as the phylogenetic backbone for the dating analysis.

A data-driven cross-validation analysis was carried out in treePL (Smith and O’Meara, 2012) to acquire the optimal value for smoothing parameter λ, which determines the appropriate level of rate heterogeneity. We tested 19 smoothing values in multiples of 10 from 1 × 10⁻¹² to 1 × 10⁶ and used 1 × 10⁻¹⁰as the best smoothing values for the following dating analysis. Fossils of Amelanchier peritula and A. scudderi have been discovered in the Florissant Formation, Colorado, USA, and they were dated to Chadronian in Late Eocene (37.2-33.9 million years ago [Ma]). We thus set the stem Amelanchier Medik. with a minimum age of 33.9 Ma and a maximum age of 37.2 Ma. The leaf fossil of Vauquelinia comptoniifolia from Green River Formation, Colorado, USA has been dated to Eocene (MacGinitie, 1969). We constrained this leaf fossil to 40.4 (the minimum age) and 46.2 Ma (the maximum age). Additionally, a leaf fossil of Malus or Pyrus from the Republic site, Washington was used to constrain the divergence between Malus and Pyrus at 46 (the maximum age)-44 (the minimum age) Ma (MacGinitie, 1969). The minimum and maximum ages of the stem of Gillenia Moench were constrained to 53.3 and 58.7 Ma, respectively (Zhang et al., 2017). The nuclear phylogeny inferred from RAxML 8.2.12 (Stamatakis, 2014) was applied to construct all dated bootstrap time trees in treePL (Smith and O’Meara, 2012). The resulting dated bootstrap time trees were then used to generate maximum credibility trees in TreeAnnotator v1.10 implemented in BEAST2 (Bouckaert et al., 2014), and the dated best time tree with confidence age intervals was visualized in FigTree v1.4.4.

Biogeographic analyses were conducted using the SCN data as input for BioGeoBEARS v. 1.1.1 (Matzke, 2018) implemented in RASP v. 4.2 (Yu et al., 2015). We delimit five geographical area units according to the distribution of Maleae: (A), East Asia; (B), Europe; (C), Central Asia; (D), North America; (E), South America. The dated best time tree summarized above by TreeAnnotator was used as input to score each taxon to these areas, and the maximum number of areas per node was set to five. We chose the model with the highest AICc_wt value as the best model.