Supporting trees and alignments for the publication: Cryptic and abundant marine viruses at the evolutionary origins of Earth’s RNA virome

Cite this dataset

Wainaina, James et al. (2022). Supporting trees and alignments for the publication: Cryptic and abundant marine viruses at the evolutionary origins of Earth’s RNA virome [Dataset]. Dryad.


Whereas DNA viruses are known to be abundant, diverse, and commonly key ecosystem players, RNA viruses are relatively understudied outside disease settings. Here, we analyzed ≈28 terabases of Global Ocean RNA sequences to expand Earth’s RNA virus catalogues and their taxonomy, investigate their evolutionary origins, and assess their marine biogeography from pole to pole. Using new approaches to optimize discovery and classification, we identified RNA viruses that necessitate substantive revisions of taxonomy (doubling phyla and adding >50% new classes) and evolutionary understanding. “Species”-rank abundance determination revealed that viruses of new phyla “Taraviricota”, a missing link in early RNA virus evolution, and “Arctiviricota” are widespread and dominant in the oceans. These efforts provide foundational knowledge critical to integrating RNA viruses into ecological and epidemiological models.


To generate phylogenetic trees for each network-derived major cluster, the sequences from each cluster were aligned separately using the E-INS-i strategy over 1,000 iterations in MAFFT v7.017 (Katoh and Standley, 2013). Aligned sequences were then trimmed using trimAl (Capella-Gutiérrez et al., 2009), removing sites with more than 20% gaps. Prior to phylogenetic analysis, sequences were screened for possible recombination events using 3Seq (Boni et al., 2007), with a recombination event called at a Bonferroni-corrected p-value cutoff of 0.05. Recombinant sequences were excluded from phylogenetic analyses. For each cluster, the best-fitting evolutionary model was first selected using ModelFinder (Kalyaanamoorthy et al., 2017). A maximum-likelihood phylogenetic tree was then generated in IQ-TREE (Nguyen et al., 2015), with bootstrap support estimated from 1,000 iterations.
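The gap-based trimming rule described above (sites with more than 20% gaps are removed) can be illustrated with a minimal Python sketch. This is not trimAl and does not reproduce its implementation; the function name `trim_gappy_columns`, its parameters, and the toy alignment are assumptions made only for illustration.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.2, gap_chars="-"):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    `alignment` maps sequence names to aligned strings of equal length.
    Illustrative only: this is a hypothetical helper, not trimAl itself.
    """
    if not alignment:
        return {}
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(alignment)
    n_cols = lengths.pop()

    kept_columns = []
    for col in range(n_cols):
        gaps = sum(1 for seq in alignment.values() if seq[col] in gap_chars)
        # Keep the column unless MORE than the allowed fraction is gaps.
        if gaps / n_seqs <= max_gap_fraction:
            kept_columns.append(col)

    return {
        name: "".join(seq[i] for i in kept_columns)
        for name, seq in alignment.items()
    }


# Toy alignment (made-up sequences): the third column is 80% gaps.
toy = {
    "seq_a": "MK-LV",
    "seq_b": "MK-LV",
    "seq_c": "MKALV",
    "seq_d": "M--LV",
    "seq_e": "MK-L-",
}
trimmed = trim_gappy_columns(toy)
```

In this toy case only the third column is dropped; columns with exactly 20% gaps are retained, because the stated rule removes sites with more than 20% gaps.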

To generate the global phylum-level phylogenetic tree, we used an approach that combined consensus sequences [used for highly divergent sequences (Grandi et al., 2020; Grandi et al., 2018; Vargiu et al., 2016; Chen et al., 2015; Fernández-Caso et al., 2019; Alipour et al., 2013; Zhang and Firestein, 2002)] and individual sequences in the alignment. Each consensus sequence was generated by first aligning the individual sequences per megataxon, then obtaining the consensus sequence of the alignment using Geneious v8.1.9. The number of ambiguous residues (i.e., ‘X’s) within each consensus sequence was then determined, and each consensus sequence composed of >20% ambiguous sites was replaced by the individual sequences within the megataxon to preserve the quality of the alignment (Wiens, 2006). Almost all of the new megataxa had >20% ambiguous sites and hence, for consistency, they were all represented by their individual sequences. Subsequent alignment, trimming, and phylogenetic inference were as described above, the only modification being the use of the -gappyout option during trimming.
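The consensus-replacement rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workflow run in Geneious: the function names, the choice to compute the ambiguous fraction over the full consensus length, and the example sequences are all hypothetical.

```python
def ambiguous_fraction(consensus, ambiguous="X"):
    """Return the fraction of positions in `consensus` that are ambiguous.

    Assumption: the fraction is taken over the full sequence length.
    """
    if not consensus:
        raise ValueError("consensus sequence is empty")
    return sum(1 for residue in consensus if residue in ambiguous) / len(consensus)


def choose_representatives(consensus, members, max_ambiguous=0.2):
    """Pick the sequences that represent one megataxon in the alignment.

    If more than `max_ambiguous` of the consensus is ambiguous, fall back
    to the individual member sequences; otherwise use the consensus alone.
    """
    if ambiguous_fraction(consensus) > max_ambiguous:
        return dict(members)
    return {"consensus": consensus}


# Made-up example data for a single hypothetical megataxon.
members = {"member_1": "MKALVTTAGT", "member_2": "MKSLVQEAGT"}
poor = choose_representatives("MKXLVXXAGT", members)   # 30% ambiguous
good = choose_representatives("MKXLVTTAGT", members)   # 10% ambiguous
```

Here `poor` falls back to the two member sequences, while `good` keeps the single consensus sequence.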


  1. K. Katoh, D. M. Standley, MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
  2. S. Capella-Gutiérrez, J. M. Silla-Martínez, T. Gabaldón, trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 25, 1972–1973 (2009).
  3. M. F. Boni, D. Posada, M. W. Feldman, An exact nonparametric method for inferring mosaic structure in sequence triplets. Genetics. 176, 1035–1047 (2007).
  4. S. Kalyaanamoorthy, B. Q. Minh, T. K. F. Wong, A. von Haeseler, L. S. Jermiin, ModelFinder: fast model selection for accurate phylogenetic estimates. Nature Methods. 14, 587–589 (2017).
  5. L.-T. Nguyen, H. A. Schmidt, A. von Haeseler, B. Q. Minh, IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
  6. N. Grandi, M. P. Pisano, M. Demurtas, J. Blomberg, G. Magiorkinis, J. Mayer, E. Tramontano, Identification and characterization of ERV-W-like sequences in Platyrrhini species provides new insights into the evolutionary history of ERV-W in primates. Mob. DNA. 11, 6 (2020).
  7. N. Grandi, M. Cadeddu, J. Blomberg, J. Mayer, E. Tramontano, HERV-W group evolutionary history in non-human primates: characterization of ERV-W orthologs in Catarrhini and related ERV groups in Platyrrhini. BMC Evol. Biol. 18, 6 (2018).
  8. L. Vargiu, P. Rodriguez-Tomé, G. O. Sperber, M. Cadeddu, N. Grandi, V. Blikstad, E. Tramontano, J. Blomberg, Classification and characterization of human endogenous retroviruses; mosaic forms are common. Retrovirology. 13, 7 (2016).
  9. M. Chen, Y. Ma, C. Yang, L. Yang, H. Chen, L. Dong, J. Dai, M. Jia, L. Lu, The combination of phylogenetic analysis with epidemiological and serological data to track HIV-1 transmission in a sexual transmission case. PLoS One. 10 (2015).
  10. B. Fernández-Caso, J. Á. Fernández-Caballero, N. Chueca, E. Rojo, A. de Salazar, L. García Buey, L. Cardeñoso, F. García, Infection with multiple hepatitis C virus genotypes detected using commercial tests should be confirmed using next generation sequencing. Sci. Rep. 9, 9264 (2019).
  11. A. Alipour, S. Tsuchimoto, H. Sakai, N. Ohmido, K. Fukui, Structural characterization of copia-type retrotransposons leads to insights into the marker development in a biofuel crop, Jatropha curcas L. Biotechnol. Biofuels. 6 (2013).
  12. X. Zhang, S. Firestein, The olfactory receptor gene superfamily of the mouse. Nat. Neurosci. 5 (2002).
  13. J. J. Wiens, Missing data and the design of phylogenetic analyses. J. Biomed. Inform. 39 (2006).


Gordon and Betty Moore Foundation, Award: 3790

National Science Foundation, Award: OCE#1829831

The Ohio Supercomputer and Ohio State University’s Center of Microbiome Science

Ramon-Areces Foundation Postdoctoral Fellowship

Laulima Government Solutions, LLC prime contract with the U.S. National Institute of Allergy and Infectious Diseases (NIAID), Award: HHSN272201800013C

National Science Foundation, Award: ABI#1759874

National Science Foundation, Award: DBI# 2022070