
Truly ubiquitous CRESS DNA viruses scattered across the eukaryotic tree of life

Cite this dataset

Zhao, Lele; Lavington, Erik; Duffy, Siobain (2021). Truly ubiquitous CRESS DNA viruses scattered across the eukaryotic tree of life [Dataset]. Dryad.


Until recently, most viruses detected and characterized were of economic significance, associated with agricultural and medical diseases. This was certainly true for the eukaryote-infecting circular Rep (replication-associated protein)-encoding single-stranded DNA (CRESS DNA) viruses, which were thought to be a relatively small group of viruses. With the explosion of metagenomic sequencing over the past decade and increasing use of rolling-circle replication for sequence amplification, scientists have identified and annotated copious numbers of novel CRESS DNA viruses, many without known hosts but found in association with eukaryotes. Similar advances in cellular genomics have revealed that many eukaryotes have endogenous sequences homologous to viral Reps, which not only provide “fossil records” to reconstruct the evolutionary history of CRESS DNA viruses but also reveal potential host species for viruses known by their sequences alone. Rep is a conserved protein that all CRESS DNA viruses use to assist rolling-circle replication, and it is known to be endogenized in a few eukaryotic species (notably tobacco and water yam). A systematic search for endogenous Rep-like sequences in GenBank’s non-redundant eukaryotic database was performed using tBLASTn. We used relaxed search criteria to capture integrated Rep sequences within eukaryotic genomes, identifying 93 unique species with an endogenized fragment of Rep in their nuclear (78 species), plasmid (1 species), mitochondrial (6 species) or chloroplast (8 species) genomes. These species come from 19 different phyla, scattered across the eukaryotic tree of life. The topology of the tree of exogenous and endogenous CRESS DNA viral Reps suggests potential hosts for one family of uncharacterized viruses and supports a primarily fungal host range for genomoviruses.


The endogenous viral sequences were extracted directly from the tBLASTn results and are therefore amino acid sequences translated from nucleotide sequences. GenBank accessions and genomic coordinates are available in the corresponding label files (.txt) and in the supplementary table of the manuscript.
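As an illustration of what "translated from nucleotide sequences" means here, the following is a minimal sketch of six-frame translation, the kind of conceptual translation tBLASTn applies to nucleotide subjects before comparing them to a protein query. It assumes the standard genetic code; organellar hits may call for a different translation table. The function names are illustrative and are not part of the dataset or of BLAST.

```python
# Minimal six-frame translation sketch (standard genetic code assumed).
# All names below are illustrative; none come from the dataset or from BLAST.

BASES = "TCAG"
# Amino acids for all 64 codons, ordered by first, second, third base in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame (+1..+3, -1..-3) to the translated amino acid string."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
        print(frame, protein)
```

A hit's genomic coordinates and frame identify which slice of the nucleotide record, read in which direction, yields the amino acid sequence in the alignment.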

The endogenous viral sequences were combined with the viral query sequences used to identify them, aligned using MUSCLE, and trimmed with trimAl (details in the Methods section of the manuscript).

Phylogenetic trees were built from the trimmed alignments using PhyML 3.0 under the CRESS+G+F model.

The trees can be viewed in FigTree, and the label files can be imported using FigTree's "import annotation" option.
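For users who want to work with the label files outside FigTree, the sketch below reads one into a dictionary keyed by tip label. It assumes the files are tab-delimited text with a header row and the tip label in the first column, which is the layout FigTree's annotation import expects; the actual column names in this dataset are not documented here, so the ones in the example (and the placeholder accession) are hypothetical.

```python
# Sketch: load a tab-delimited annotation file into {tip_label: {column: value}}.
# Assumes a header row and tip labels in the first column (an assumption about
# the label files, not something stated in the dataset description).
import csv
import io


def read_annotations(handle):
    """Parse tab-delimited annotations from an open text handle."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    table = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        table[row[0]] = dict(zip(header[1:], row[1:]))
    return table


if __name__ == "__main__":
    # Hypothetical column names and placeholder values, for illustration only.
    example = io.StringIO(
        "taxon\taccession\tstart\tend\n"
        "seq_1\tXX000000.1\t100\t400\n"
    )
    print(read_annotations(example))
```

To read a real file, pass an open handle, for example `open(path, newline="")`, in place of the in-memory example.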

The CRESS model is also provided here. The derivation of the CRESS model is available here: doi:


National Science Foundation, Award: DEB1240049