Untying the Gordian Knot of plastid phylogenomic conflict: a case from ferns

Wang, Ting 1

Research facility: Phylogeny

Published Dec 17, 2021 on Dryad. https://doi.org/10.5061/dryad.fttdz08tz

Data files

Dec 17, 2021 version files 18.68 MB

datasets.rar

18.68 MB

Abstract

Phylogenomic studies based on plastid genome have resolved the recalcitrant relationships among various plants, yet the phylogeny of Dennstaedtiaceae at the taxonomic level remains unresolved due to conflicting plastid genes, limited molecular data and incomplete taxon sampling of previous studies. The present study generated 31 new plastid genomes of Dennstaedtiaceae (9 genera, 30 species) and combined 41 publicly available sequences of plastid genome (including 24 families, 26 genera, 41 species) to solve and explore the evolution of Dennstaedtiaceae. In order to minimize the impact of systematic errors on the resolution of phylogenetic inference, we applied six strategies to generate 30 datasets based on CDS, Spacer, and All datasets, and two tree inference methods (maximum-likelihood, ML; and multispecies coalescent, MSC) to comprehensively analyze the plastome-scale data. Besides, the phylogenetic signal among all loci was quantified for the controversial node using the ML framework, and the phylogenetic hypotheses among all datasets were tested. In the species tree based on different data sets and methods, obvious conflicts were detected at the base of the polypod ferns. Meanwhile, the topology of the “CDS-codon-align-rm3” (CDS removed the third codon) matrix was selected as the primary reference or summary tree due to its analysis results are consistent, and similar to the topological structure of the amino-acid matrix. The final phylogenetic tree supported Dennstaedtiaceae as the sister group to eupolypods, and Dennstaedtia (sen. lat.) can divided into smaller genera, which was also supported by geographical distribution and plastid structure. This robust reconstructed phylogenetic backbone established a framework for future studies on Dennstaedtiaceae classification, evolution and diversification. The present study suggests considering plastid phylogenomic conflict when using plastid genomes. From our results, reducing saturated genes or sites can effectively mitigate the tree conflicts of distantly related taxa. Moreover, amino acid sequences may verify the accuracy of nucleotide-based phylogeny.

Taxon Sampling and Sequencing

In this study, 31 new Dennstaedtiaceae plastomes belonging to 30 species of 9 genera were sequenced. Combined with publicly available complete plastome data of NCBI, plastome from 72 species of 47 genera and 25 families were analyzed. Detailed information on the studied taxa, arranged according to the PPG I (2016) classification, is provided in Supplementary Table S1.

Total DNA was extracted from fresh, young leaves using Plant Mini Kit (Qiagen, CA, USA), following the manufacturer’s protocol. DNA quality and integrity were evaluated on 1% agarose gels. DNA purity was determined with the NanoPhotometer® spectrophotometer (Implen, CA, USA), and DNA concentration was measured using the Qubit® DNA Assay Kit in a Qubit® 2.0 Fluorometer (Life Technologies, CA, USA).

Sequencing libraries were generated using the NEB Next® Ultra DNA Library Prep Kit for Illumina® (NEB, USA) following the manufacturer’s recommendations. Approximately 700 ng of high-quality DNA per sample was used for library preparation; index codes were added to attribute sequences to each sample. Briefly, the chip DNA was purified using the AMPure XP system (Beckman Coulter, Beverly, USA), and DNA fragments of a specified length were selected from the agarose gels. Subsequently, the size-selected, adaptor-ligated DNA was incubated with 3 µL of USER Enzyme (Uracil-Specific Excision Reagent; NEB, USA) at 37 °C for 15 min followed by 5 min at 95 °C. Then, PCR was performed with Phusion High-Fidelity DNA polymerase (Thermo Fisher Scientific), using the universal primers plus one index (X) primer. Finally, the PCR products were purified (AMPure XP system), and the quality was assessed on an Agilent Bioanalyzer 2100 system.

Subsequently, the clustering of the index-coded samples was performed on a cBot Cluster Generation System using the HiSeq 4000 PE Cluster Kit (Illumina) according to the manufacturer’s instructions. After cluster generation, the libraries were sequenced on an Illumina HiSeq 4000 platform, and 150-bp paired-end reads were generated (Yan et al., 2013). The raw data were filtered using the cluster based on default parameter (-L 5, -p 0.5, -N 0.1). High N rate reads (N content > 10%) and low-quality reads (50% of bases with Q ≤ 5) were removed from the raw data to obtain clean data.

Plastid Genome Assembly, Annotation and Comparison

The paired-end reads were filtered following the GetOrganelle pipe-line (https://github.com/Kinggerm/GetOrganelle) to obtain plastid-like reads. The filtered reads were assembled by SPAdes v3.10 (Bankevich et al. 2012), and the final “fastq” files were filtered using GetOrganelle to obtain pure plastid contigs. The filtered de Bruijn graphs were viewed and edited by Bandage (Wick et al. 2015), and finally, circular plastomes were obtained.

The assembled plastomes were annotated with PGA (Plastid Genome Annotator) (Qu et al. 2019), using the published plastome of Histiopteris incisa (MH319942) as the reference. The results were manually aligned to the reference genome using Geneious v11.1.5 (Kearse et al. 2012) to confirm the plastomes. Finally, 31 high-quality, complete plastid genome sequences were obtained. OrganellarGenomeDRAW (OGDRAW) v1.3.1 (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html) was used to visualize the structural features of the plastomes of 31 species (Greiner et al. 2019).

Sequence Alignment and Cleanup

Furthermore, the “get_annotated_regions_from_gb.py” (https://github.com/Kinggerm/PersonalUtilities/) script was used to automatically extract all CDS (coding regions) and intergenic spacer regions from a list of annotated files in GenBank-format and manually calibrate the results. The CDS and intergenic spacer regions were individually aligned using the L-INS-i method of MAFFT (Multiple Alignment using Fast Fourier Transform) v.7.475 (Katoh et al. 2002). Further, loci covering less than 55% of the species, and loci of CDS less than 100 bp or intergenic spacer regions less than 50 bp were removed to minimize the use of loci with limited information or present in relatively few species. Finally, 166 loci, including 81 CDS and 85 intergenic spacer regions, were obtained from 71 plastomes for downstream analysis.

Datasets Generation

The script “concatenate_fasta.py” (https://github.com/Kinggerm/PersonalUtilities/) was used to concatenate the alignments of each locus and create three basic datasets, the CDS, the Spacer (intergenic spacer regions), and the All (the concatenated CDS and Spacer) datasets. Furthermore, four strategies were applied to reduce the systematic error in the three basic datasets. The first strategy excluded ambiguously aligned regions using Gblocks v0.91b (Castresana 2000) with relaxed, default and strict parameters (“Allowed Gap Positions” = “With All/Half/None”), producing the “CDS-GB-all/half/none”, “Spacer-GB-all/half/none”, and “All-GB-all/half/none” datasets. The second and third strategies identified and excluded loci with high levels of excessive substitutional saturation (slope and R² values) and evolutionary distance (long-branch score) using TreSpEx v.1.1 (Struck 2014). Then, the density plots of the long-branch score, slope and R² values were generated with R v.3.2.2 (The R Core Team 2015). The distribution of the long-branch scores of CDS and Spacer showed a small shoulder at 0.45 and 0.90, respectively (Figs. S1a, d), corresponding to the removal of 13 loci of CDS and 9 loci of Spacer from CDS/Spacer/All datasets to form “CDS-LB”, “Spacer-LB”, and “All-LB” datasets. The 22 loci of CDS (0.50) and 18 loci of Spacer (0.344) located on the left “hump” of the R² distribution (Figs. S1b, e) were trimmed from CDS/Spacer/All datasets to form the “CDS-R²”, “Spacer-R²”, and “All-R²” datasets. Then, 9 loci of CDS (0.104) and 73 loci (0.30) of Spacer (Figs. S1c, f) located on the left “hump” of the slope distribution (Figs. S1b, c) were removed from CDS/Spacer/All datasets to form the “CDS-slope”, “Spacer-slope”, and “All-slope” datasets. The fourth strategy used TreSpEx v.1.1 (Struck 2014) to calculate the average bootstrap support (BS) of all nodes in the corresponding maximum likelihood tree generated from 166 loci (Table S2) and then removing loci with at least 75% ultrafast bootstrap (UFBoot) support, generating “CDS-BS75”, “Spacer-BS75”, and “All- BS75” datasets. The loci excluded from each dataset are listed in Supplementary Table S2.

Figure S1 Density plots of long-branch score, slope values and R² values for 166 loci generated using R. (a, d) Average patristic distance as a proxy for genes affected by long-branch attraction; (b, e) Slope of the linear regression between patristic and uncorrected pairwise distances; (d, f) R² of the linear regression between patristic and uncorrected pairwise distances. Red areas indicate deviations from the normal distribution.

Furthermore, in order to compare the effects of different alignment methods, we added “codon-aligned” and “homologous block searching” method by using MACSE 0.9b (Ranwez et al. 2011) and Homblocks (Bi et al. 2018), which produced “CDS-codon-align” and “All-Homblock” datasets from CDS and All datasets. Subsequently, a new custom script “remove_third_codon.py” was developed and used to remove the third-codon positions of the “CDS-codon-align” dataset to form “CDS-codon-align-rm3” datasets. Numerous studies have shown a significantly higher substitution rate at the third-codon position than the first- and second-codon positions (Bofkin et al. 2007; Mordecai et al. 2016; Katz 2020), which may cause site saturation and mislead phylogenetic reconstructions (Struck 2014; Breinholt et al. 2013).

Phylogenetic analysis in plants was carried out based on either nucleic acid or protein sequences. For gene sequences encoding proteins, analysis based on nucleic acid sequences with three times as many characters, can be compared with protein sequences, which seems more informative. However, for phylogenetic analysis involving distantly related taxa, the increased information content in nucleic acid is merely an illusion and, in most cases, a major liability (Gupta 1998). Further, to confirm the accuracy of the final topologies, we translated sers of nucleic acid sequences of 81 codon-alignment into amino acid for downstream phylogenetic analysis.

Phylogenetic Analyses

The maximum-likelihood (ML) and multispecies coalescent (MSC) methods were used to infer species and gene trees for both nucleic acids and amino acids datasets. For phylogenetic inference using the maximum-likelihood (ML) approach, two different methods were used. First, IQ-TREE v2.0.3 (Nguyen et al. 2015) in Model Finder was run with the –TEST and –AICc, and tree search options, using 1,000 bootstrap replicates in a single run (Chernomor et al. 2014; Kalyaanamoorthy et al. 2017). Then, RAxML v8.2.12 (Alexandros 2014) under the GTR + I + G substitution models was run. The support for the nodes in the inferred phylogeny was assessed following bootstrap analysis with 500 pseudo-replicates. In MSC method, gene trees for each dataset were inferred in IQ-TREE using the best-fit substitution model, followed by 1,000 independent likelihood searches from a random starting tree. These gene trees were then used as input for ASTRAL-II v4.11.1 (Siavash et al. 2015) to assess the bootstrap support values under the coalescent model.

Quantification of Phylogenetic Signals of Alternative Tree Topologies

The phylogenetic signal within the three sets of conflicting topologies (T1, T2, and T3) of Dennstaedtiaceae across the 30 datasets was evaluated following the methods by Smith et al. (2012), Shen et al. (2017), Walker et al. (2018) and Zhang et al. (2020). The differences in the site-wise log-likelihood scores (ΔSLS) and the gene-wise log-likelihood scores (ΔGLS) among T1, T2 and T3 for each dataset (Table S3) were calculated to quantify the phylogenetic signal for the three alternative phylogenetic hypotheses.

Generally, tiny subsets of large data matrices, especially genes with abnormal phylogenetic signals, may also drive the resolution of specific internodes and influence phylogenetic inference (Shen et al. 2017). Therefore, to reduce the conflict at the position of Dennstaedtiaceae in the three topologies, the abnormal loci in phylogenetic signaling analysis were identified and deleted (Shen et al. 2017; Walker et al. 2018; Zhang et al. 2020). The average ΔGLS was calculated for each gene in CDS, Spacer and All datasets, and the standard deviation was used to identify the outliers; loci with the average ΔGLS value greater than the upper bound or smaller than the lower bound of a Gaussian-like distribution were defined as the outlier loci. The lower and upper bound were determined as follows:

Upper bound = min(max(x), μ + 3*σ)

Lower bound = max(min(x), μ + 3*σ)

where max(x), min(x), μ, and σ indicate the maximum, minimum, average, and standard deviation, respectively, for a set of ΔGLS values (Shen et al. 2017). Subsequently, six outlier loci were removed from the CDS dataset to generate the “CDS-outlier” dataset (Fig. 4a), five from the Spacer dataset to generate the “Spacer-outlier” dataset (Fig. 4b) and ten from the All dataset to generate the “All-outlier” dataset (Fig. 4c). Phylogenetic trees were then reconstructed using IQ-TREE and ASTRAL as described previously, and phylogenetic signal was recalculated to compare the effect of loci removal.

Topology Testing

The approximately unbiased (AU) test (Shimodaira 2002), Kishino–Hasegawa (KH) test (Kishino et al. 1989), Shimodaira–Hasegawa (SH) test (Shimodaira et al. 1999; Goldman et al. 2000), and weighted Shimodaira–Hasegawa (WSH) test (Shimodaira 1993; Shimodaira 1998; Shimodaira et al. 1999; Buckley et al. 2001) in CONSEL v1.20 (Hidetoshi et al. 2001) were employed to test which topology was statistically better among the three for each dataset. These tests were conducted using the multi-scale bootstrap technique based on the site-wise log-likelihood scores, calculated in RAxML (option -f G).

Figure 4 Distribution of phylogenetic signal for the three alternative topologies of phylogenetic position of Dennstaedtiaceae based on gene-wise log-likelihood scores (ΔGLS) across the (a) CDS, (b) Spacer, and (c) All datasets.

Untying the Gordian Knot of plastid phylogenomic conflict: a case from ferns

Data files

Abstract

Methods

Works referencing this dataset