Mitochondrial genome evolution in Annelida: A systematic study on conservative and variable gene orders and the factors influencing its evolution
Data files
Apr 28, 2023 version files 31.75 MB
-
OnlineAppendices.zip
18.13 MB
-
README.md
55.36 KB
-
TestData_Compiling.zip
13.57 MB
Abstract
The mitochondrial genomes of Bilateria are relatively conserved in their protein-coding, rRNA and tRNA gene complement, but the order of these genes can range from very conserved to very variable depending on the taxon. The supposedly conserved gene order of Annelida has been used to support the placement of some taxa within Annelida. Recently, authors have cast doubts on the conserved nature of the annelid gene order. Various factors may influence gene-order variability including, among others, increased substitution rates, base composition differences, structure of non-coding regions, parasitism, living in extreme habitats, short generation times and biomineralization. However, these analyses were neither done systematically, nor based on well-established reference trees. Several focused on only a few of these factors and biological factors were usually explored ad-hoc without rigorous testing or correlation analyses. Herein, we investigated the variability and evolution of the annelid gene order and the factors that potentially influenced its evolution, using a comprehensive and systematic approach. The analyses were based on 170 genomes, including 33 previously unrepresented species. Our analyses included 706 different molecular properties, 20 life-history and ecological traits and a reference tree corresponding to recent improvements concerning the annelid tree. The results showed that the gene order with and without tRNAs is generally conserved. However, individual taxa exhibit higher degrees of variability. None of the analyzed life-history and ecological traits explained the observed variability across mitochondrial gene orders. In contrast, the combination and interaction of the best predicting factors for substitution rate and base composition explained up to 30% of the observed variability. Accordingly, correlation analyses of different molecular properties of the mitochondrial genomes showed an intricate network of direct and indirect correlations between the different molecular factors. Hence, gene order evolution seems to be driven by molecular evolutionary aspects rather than by life history or ecology. On the other hand, gene order variability does not predict difficulty in placing certain taxa within molecular phylogenetic studies. We also discuss the molecular properties of annelid mitochondrial genomes considering canonical views on gene evolution and potential reasons why they do not always fit to the observed patterns without nuisance.
Methods
Compilation and Annotation of Mitochondrial Genome and 18S Data
In addition to our own mitochondrial genomes, we retrieved mitochondrial genomes available from NBCI as of 17.11.2021 (online Appendix 1). We used tblastx with the COI barcode of Parergodrilus heideri (KY503040), and the records limited to Annelida (taxid:6340) and sequence length from 12,000 to 200,000 bp. If more than one mitochondrial genome per species was present, only the larger one was kept. Additionally, we retrieved 18S rRNA sequences from NCBI using MegaBlast and AF412810 to match the mitochondrial genomes. The following order was applied for the matching: first, if possible, that the 18S was from the same individual. This was the case mostly for our own sequences, but also for some mitochondrial genomes generated from genome skimming data such as Glyceridae and Aphroditiformia. If this was not possible, the sequence was taken from the same species. If this was also not possible and there was no species already representing the genus with both a mitochondrial genome and 18S sequence, an 18S sequence was taken from another species of the same genus. Finally, the mitochondrial genomes of Cirriformia cf. tentaculata, Perinereis aibuhitensis, Ramisyllis multicaudata and Trypanosyllis sp. were excluded as these species had either too short or very divergent 18S sequences, making the alignment problematic and in two cases introduced extremely long branches in the tree. In total, 168 mitochondrial genomes and 18S sequences from 69 annelid families were included in the following analyses, of which 33 were new mitochondrial genomes and 32 new 18S sequences generated in this study. Due to redundancy, lack of matching 18S sequence,s or too divergent 18S sequences, 75 mitochondrial genomes found with the tblastx search were not included. To these datasets, we added two outgroup taxa, Lineus viridis (Nemertea) and Terebratulina retusa (Brachiopoda) as they exhibit the ancestral lophotrochozoan mitochondrial gene order as shown by Bernt, M., Bleidorn, C., et al. (2013).
All 170 mitochondrial genomes were annotated for protein-coding, rRNA, and tRNA genes using MITOS2 (Al Arab, M., Höner zu Siederdissen, C., et al. 2017, Donath, A., Jühling, F., et al. 2019), RefSeq 63 Metazoa as the reference dataset, and the invertebrate mitochondrial genetic code (NCBI code table 5). Finally, all genomes were manually investigated to detect problematic issues and possible genes not found by MITOS2. Specifically, the positions of the mitochondrial rRNAs and tRNAs were checked. The gene order as well as the sequences of the individual mitochondrial genes were compiled using custom-made shell scripts (available at GitHub "CompileDatasets.sh", and "CheckFileNames.sh").
Generate Reference Tree for the Macroevolutionary Analyses using a Constraint Tree and Nuclear 18S Data
We wanted to compare the evolution of the mitochondrial genome in comparison to the nuclear genome including substitution rates along branches derived from the nuclear genome. We did this to avoid tautological reasoning inherent in comparing the evolution of the mitochondrial genome against a tree that would derive branch lengths and topology from the same data source. Hence, we recovered a constraint tree (online Appendix 2) combining the results from recent, mostly phylogenomic studies (i.e., Erséus, C., Williams, B.W., et al. , Struck, T.H., Schult, N., et al. 2007, Zhong, M., Hansen, B., et al. 2011, Struck, T.H., Golombek, A., et al. 2015, Anderson, F.E., Williams, B.W., et al. 2017, Zhang, Y., Sun, J., et al. 2018, Phillips, A.J., Dornburg, A., et al. 2019, Struck, T.H. 2019, Tilic, E., Sayyari, E., et al. 2020). The backbone of this constraint tree was based on the large-scale phylogenomic studies shown in Struck, T.H. (2019), while the relationships within specific groups such as Crassiclitellata or Sabellida were based on studies targeting these groups (e.g., Anderson, F.E., Williams, B.W., et al. 2017, Tilic, E., Sayyari, E., et al. 2020). However, this constraint tree still has some unresolved nodes and does not provide branch length as it is a composite tree derived from multiple different studies. As the available phylogenomic data is very sparse and does not match the representative mitochondrial genome dataset, we used nuclear 18s rRNA sequences as a proxy for the nuclear genome to resolve these nodes and obtain branch lengths. Due to the high availability of the 18S data, we could match nuclear data very well to our mitochondrial genome dataset. The 18S sequences were aligned using MAFFT with automatic selection of the best alignment method, which was FFT-NS-i with the iterative refinement method (max. 2 iterations) (Katoh, K., Kuma, K.-i., et al. 2005). The 3’ and 5’ ends of the alignment were trimmed using AliView (Larsson, A. 2014), such that fewer than 10 sequences had no information on the first and last column of the alignment. Potentially non-homologous positions were masked using AliScore and AliCut with gaps treated as ambiguous data and default settings (Kück, P., Meusemann, K., et al. 2010). Next, we used IQtree (Nguyen, L.-T., Schmidt, H.A., et al. 2015) using the constraint tree and automatic model selection (-m MFP) for both the unmasked and masked datasets. The model chosen according to the Bayesian information criterion (BIC) was TIM2e+R10 for the masked dataset and TN+F+I+G4 for the unmasked one. For the tree obtained using the masked or unmasked dataset, an ultrametric tree with relative ages was generated using chronos of the APE package (Paradis, E. and Schliep, K. 2018) in R (R Core Team 2020) with a lambda of 1 and a correlated model. Herein, we will refer to these four trees (masked, unmasked, and ultrametric) based on 18S data as the reference trees (e.g., online Appendix 3 & 4).
Determine the Different Structural and Sequence-based Properties of Each Species for Different Parts of the Mitochondrial Genomes
For the complete mitochondrial genome sequences, we determined genome size, frequencies of nucleotides, AT and purines, and the AG, AT, CT, and GC skew values using BaCoCa.v1.109 (Kück, P. and Struck, T.H. 2014) as well as normalized relative composition frequency variability (nRCFV) values adjusted for the number of positions, character states and taxa using NuclRCFVReader (J. Fleming pers. comm., available at GitHub). Additionally, we added the number of duplicated genes, of introns, and of intergenic regions as well as the average size of the intergenic regions using a custom-made shell script (available at GitHub "RetrieveIntergenicParts.sh").
For each protein-coding gene, alignments were generated with Mega 11.0.10 (Tamura, K., Stecher, G., et al. 2021) using MUSCLE, the invertebrate mitochondrial code (NCBI code table 5) and default settings to obtain both an amino acid alignment and a nucleotide alignment based on the amino acid one. For each rRNA gene, a nucleotide alignment was generated using also Mega 11.0.10 without codons and default settings. Three supermatrices were generated containing nucleotide alignments; one with all genes, one with only the protein-coding genes, and one with only rRNA genes. Additionally, one supermatrix was generated containing all amino-acid alignments of protein-coding genes. FASconCAT-G v1.05 was used to generate the supermatrices (Kück, P. and Longo, G. 2014). For each gene and supermatrix, to obtain tree-based measurements and get pairwise patristic distances, branch lengths were estimated using a constraint tree based on the 18S reference tree (i.e., if species were lacking in a dataset, they were excluded from the tree) and IQtree (Nguyen, L.-T., Schmidt, H.A., et al. 2015) with automatic model selection (-m MFP for the single genes and –m MFP+MERGE for the supermatrices). For the supermatrices, we implemented the Partitionfinder option of IQtree. The selected models are shown in online Appendix 5. Using BaCoCa.v1.109, we determined the frequencies of nucleotides, gaps, AT and purines or of amino acids, gaps hydrophobic, polar, positively charged, neutral and negatively-charged, and the AG, AT, CT, and GC skew values for nucleotides. Normalized RCFV values were obtained using NuclRCFVReader or ProtRCFVReader, respectively. We also calculated the evolutionary rates, long branch (LB) scores, and tip-to-root distances using TreSpEx v1.2 using the function e and the IQtree trees (Struck, T.H. 2014).
For the gene order, we conducted distance analyses using CRex (Bernt, M., Merkle, D., et al. 2007) and mapped the gene order evolution on the 18S reference tree using TreeRex (Bernt, M., Merkle, D., et al. 2008). Each gene order started with COX1 at position 1 and then the gene order was manually aligned by including not applicable (NA) for genes in the missing region of incomplete mitochondrial genomes and a gap (-) for the most likely position in complete genomes as well as by removing duplicated genes. The aligned gene orders with and without tRNAs (with NA and gaps removed again) were uploaded to CRex and analyses were run using default settings with the following exception: "remove duplicates” was disabled. The distances matrices "common interval", "breakpoints" and "reversal distance" (RD) were obtained. For the TreeRex analyses, the species were sorted based on their phylogenetic affiliation, and the aligned gene order without tRNAs was modified in the following way to reduce missingness in the dataset. Lost genes (i.e., gaps) and lacking genes (i.e., NA) were included in accordance with the closest relatives. The TreeRex analysis was run using all three consistency methods (strong, weak, and parsimonious weak), the 18S reference tree, and the modified aligned gene order file without tRNAs. For the gene orders both with and without tRNAs and without the outgroup gene orders, we also calculated rearrangement frequencies (RF) for the individual genes and rearrangement scores (RS) for the individual gene orders using qMGR (Zhang, J., Kan, X., et al. 2020). To this end, the numbers of the different gene orders were counted using "Count_SequenceOrders.R" and the most common gene order was used as the ground pattern.
Correlation Analyses of Mitochondrial Properties without Reference Tree
We analyzed the correlation of the three gene order distance matrices with and without tRNA to each other in R. We visualized the correlation using ggscatter with a regular line, confidence interval, and a correlation coefficient based on the Pearson method. We conducted visual inspection of the data normality using Q-Q plots (quantile-quantile plots) as there were too many data points for the Shapiro-Wilk normality test. As normality could not be assumed, we also determined the Spearman's ρ rank correlation coefficient. Additionally, using only the reversal distances (RD) we compared the datasets with and without tRNAs to one another. We generated boxplots of all pairwise RDs as well as the mean of each species and plotted the mean RD values of the two datasets in comparison to each other. Moreover, we compared the mean RD values to the mean positional differences of both the nucleotide and the amino acid dataset given both the Maximum Likelihood (ML) and Bayesian trees. The mean positional difference reflects the difference in placement of species in the unconstrained mitochondrial phylogenetic analyses (see below) to the 18S reference tree. To obtain it, the following procedure was used. First, for both the unconstrained and reference tree, the pairwise distances based on equal-spaced cladograms were calculated. This meant that each branch had a length of 1 and, hence, the pairwise distance reflected the number of branches connecting two species in a tree. Second, the absolute difference between the two distance matrices of the unconstrained and reference trees was calculated. Third, the mean value of these differences was estimated for each species. This mean value captures not only the position of the species itself, but also how other species have changed their position in relation to this species. Accordingly, the higher the value, the more the position of the species has changed in comparison to other species in the tree. For example, a value of 2 means that the paths connecting one species to the others are on average two branches longer in either the reference or the unconstrained tree, while a value of 6 means that it is six branches. The mean RD and mean positional difference are plotted against each other. Additionally, we compared the amino acid and nucleotide datasets to each other for the ML and Bayesian trees. To conduct these analyses we used the libraries "ape", "dplyr", "ggplot2", "ggpubr", “tidyr", "TreeDist", and "TreeTools" (Wickham, H. 2016, Smith, M.R. 2019, Kassambara, A. 2020, Smith, M.R. 2020, Wickham, H. 2020, Wickham, H., François, R., et al. 2020) (see R scripts “ExploreReverseDistance.R” and “CorrelationsGeneOrder.r” on GitHub).
The matrix of compiled sequence properties contained 170 species and 706 variables. Before conducting the correlation analyses, we explored the distribution of the sequence properties across species and genes using the R script “ExplorationDataAnalyses.R” (see GitHub) with the libraries "dplyr", "ggplot2", "tidyr", "ggpubr", "ggpmisc" and "data.table" (Wickham, H. 2016, Dowle, M. and Srinivasan, A. 2020, Kassambara, A. 2020, Wickham, H. 2020, Wickham, H., François, R., et al. 2020, Aphalo, P.J. 2021). For the correlation analyses, we first calculated the correlation coefficients of each variable to the others. Second, we retrieved all correlated pairs, which were highly correlated (R2 >0.5 or <-0.5), and reduced them to one variable. The resulting matrix had 80 variables. Third, to reduce the number of variables further, we conducted a hierarchical clustering analysis using the "average" method and the correlation coefficients of the remaining variables. We also generated a corellogram, where all groups of clustered variables with cutoff values of 80% of the maximal height were determined and reduced to one variable. The final resulting matrix had 25 variables. Fourth, we repeated the hierarchical clustering analysis combined with a corellogram. For conducting these steps, we used the libraries "corrplot", "corrr", "data.table", "dplyr", "fastcluster", "ggpubr", "graphics", "Hmisc" and "tidyverse" (Müllner, D. 2013, Wei, T. and Simko, V. 2017, Wickham, H., Averick, M., et al. 2019, Dowle, M. and Srinivasan, A. 2020, E, H.J.F. 2020, Kuhn, M., Jackson, S., et al. 2020) (see R script “CorrelationsSeqProperties_Final.R” on GitHub). Finally, we retrieved all groups of correlated or clustered variables using a custom-made shell script (see GitHub “RetrieveCorrelatedGroups.sh"). Using the libraries “gplots” and “RColorBrewer” (Neuwirth, E. 2014, Warnes, G.R., Bolker, B., et al. 2020), we generated heatmaps with breaks set at 0.001, 0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9 (see GitHub “GenerateHeatmap.R”).
Macroevolutionary Analyses based on Reference Tree
Using the masked, ultrametric 18S reference tree, we assessed if different mitochondrial properties and life history traits predict the gene order rearrangements considering the phylogeny. First, we tested if the gene orders themselves show phylogenetic signal. We tested both the actual gene orders with and without tRNAs as categorical characters and the mean RD values of each species as continuous characters. Then we used phylogenetic least square regression (pgls) where each of the 706 variables was tested against the mean RD values of both gene orders with and without tRNA as response variables. In all cases, Pagel’s λ was determined using a maximum likelihood approach. When this resulted in optimization problems, λ was set to 1. This was the case for 12 variables when the mean RD values of the gene order with tRNAs was the response and for 86 variables when tRNAs were excluded. All variables which were significant predictors at α=0.001 were kept. For these significant variables, we conducted correlation analyses including hierarchical clustering using the "average" method and generating correlation networks with R2 connection thresholds at 0.7 and corellograms (see R script "pgls_all.R" on GitHub).
We scored several life history traits including those previously suggested as potentially important for genome evolution (see Introduction). Life history data were collected from the literature and further completed based on the authors’ observations (online Appendix 33). We also used a pgls approach to test if life history traits were correlated with RD values (both with and without tRNAs). Taxa for which data for certain traits were missing or found to be polymorphic were excluded from the analyses for the corresponding trait. For conducting these steps, we used the R libraries “caper”, "corrplot", "corrr", "Hmisc" and “tidyverse” (Orme, D., Freckleton, R., et al. 2018) (see R script "pgls_with_tRNA.R" and "pgls_without_tRNA.R" on GitHub).
Phylogenetic Reconstruction
Finally, we reconstructed the phylogeny based on mitochondrial sequence data using the concatenated nucleotide and amino acid sequence data both with and without the constraint tree with Maximum Likelihood; and using the nucleotide and amino acid data without the constraint tree with Bayesian approaches. The alignments of the individual protein-coding and rRNA genes were masked with AliScore and AliCut treating gaps as ambiguous data and concatenated using FASconCAT-G v1.05. For both the nucleotide and the amino acid supermatrix, a ML tree was reconstructed using IQtree v. 1.6.12 with automatic model selection and merging of partitions (MFP+MERGE) to implement a partitioned ML analysis and 1000 rapid bootstrap replicates (Kalyaanamoorthy, S., Minh, B.Q., et al. 2017, Hoang, D.T., Chernomor, O., et al. 2018). The selected model schemes are given in online Appendix 5. For analyses applying constraints, the constraint tree (online Appendix 2) was provided again. For the Bayesian analyses, both the nucleotide and the amino acid datasets were each run with two chains for 40,000 generations each using PhyloBayes-MPI v1.8 (Lartillot, N., Rodrigue, N., et al. 2013). The CAT-GTR model with a discrete gamma distribution with 4 categories was applied implementing site-heterogeneous substitution model that better captures the complexity of the substitution pattern across sites and genes. Convergence and the burnin of the chains were determined using bpcomp, tracecomp and Tracer v1.7.1 (Rambaut, A., Drummond, A.J., et al. 2018). The burnin for both was reached after 10,000 generations, but in both cases the two chains did not converge on the same optimum. Accordingly, we summarized only the trees of the better chain sampling every 10th generation and a burnin of 10,000 generations.