
Resurrection of a global, metagenomically defined gokushovirus

Cite this dataset

Kirchberger, Paul (2020). Resurrection of a global, metagenomically defined gokushovirus [Dataset]. Dryad.


Gokushoviruses are single-stranded, circular DNA bacteriophages found in metagenomic datasets from diverse ecosystems worldwide, including human gut microbiomes. Despite their ubiquity and abundance, little is known about their biology or host range: isolates are exceedingly rare, known only from three obligate intracellular bacterial genera. By synthesizing circularized phage genomes from prophages embedded in diverse enteric bacteria, we produced gokushoviruses in an experimentally tractable model system, allowing us to investigate their features and biology. We demonstrate that virions can reliably infect and lysogenize hosts by hijacking a conserved chromosome-dimer resolution system. Sequence motifs required for lysogeny are detectable in other metagenomically defined gokushoviruses; however, we show that even partial motifs enable phages to persist cytoplasmically without leading to collapse of their host culture. This ability to employ multiple, disparate survival strategies is likely key to the long-term persistence and global distribution of Gokushovirinae.


Detection and phylogeny of gokushovirus prophages and their hosts. Using blastp, we queried the NCBI nr database (April 2019) with the Chlamydia Phage 4 major capsid protein VP1 (NCBI Gene ID 3703676) and downloaded the complete genomes of 95 strains within the Enterobacteriaceae containing sequences returning E-values less than 0.0001. Chromosome contigs containing the VP1 gene were visually inspected in Geneious R9 for the presence of prophage insertion boundaries by searching for identical 17-bp sequences within 5-kb regions upstream and downstream of the VP1 gene. Prophage genes were annotated with GLIMMER3 (Delcher et al., 2007) using default settings, specifying a minimum gene length of 110 bp and a maximum overlap of 50 bp. Initial alignments of prophage regions (ranging in size from 4047 to 4692 bp) were made with ClustalO 1.2.4 using standard settings, and were refined manually to accommodate hypervariable regions and the phage insertion sites at the 3’ and 5’ ends of the alignment. Average nucleotide identity at each position in the alignment was calculated and visualized using Geneious R9. Maximum likelihood phylogenetic trees of enterobacterial prophages were generated with RAxML 8.0.26 (Stamatakis, 2014) using the GTR+GAMMA substitution model and 100 fast-bootstrap replicates and visualized with FigTree 1.4.3.
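
The boundary search described above was done by visual inspection in Geneious. As a rough illustration only, the same idea (identical 17-bp sequences shared between the 5-kb regions on either side of the VP1 gene) could be sketched in Python as follows. The function names and the coordinate convention (0-based, end-exclusive gene coordinates on the forward strand) are assumptions made for this sketch and are not part of the dataset.

```python
def flanking_regions(contig: str, gene_start: int, gene_end: int,
                     window: int = 5000) -> tuple[str, str]:
    """Return the regions up to `window` bp upstream and downstream of a gene.

    Coordinates are assumed to be 0-based and end-exclusive (hypothetical
    convention chosen for this sketch).
    """
    upstream = contig[max(0, gene_start - window):gene_start]
    downstream = contig[gene_end:gene_end + window]
    return upstream, downstream


def shared_kmers(upstream: str, downstream: str,
                 k: int = 17) -> dict[str, tuple[list[int], list[int]]]:
    """Map each k-mer found in both flanks to its start positions in each flank."""

    def index(seq: str) -> dict[str, list[int]]:
        # Record every start position of every k-mer in the sequence.
        positions: dict[str, list[int]] = {}
        for i in range(len(seq) - k + 1):
            positions.setdefault(seq[i:i + k], []).append(i)
        return positions

    up = index(upstream.upper())
    down = index(downstream.upper())
    # Candidate insertion boundaries are k-mers present on both sides.
    return {kmer: (up[kmer], down[kmer]) for kmer in up.keys() & down.keys()}
```

Any k-mer returned here is only a candidate boundary; low-complexity repeats can also match, so hits would still need manual review, as in the original workflow.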

To evaluate the distribution of prophage hosts across the broad diversity of E. coli, we produced core genome alignments of prophage hosts and representative genomes from the Escherichia coli reference (ECOR) collection (Ochman and Selander, 1984) based on protein families satisfying a 30% amino-acid identity cutoff (USEARCH 11, Edgar, 2010), which were aligned in MUSCLE 3.8.31 (Edgar, 2004) as implemented in the BPGA 1.3 pipeline (Chaudhari, 2016). The maximum likelihood phylogenetic tree of core genome alignments was built with IQTree 1.6.2 (Nguyen, 2015), using the JTT substitution model (Jones et al., 1992) and 100 bootstrap replicates.

Phylogenetic analysis of Gokushovirinae. We downloaded a total of 1284 metagenome-assembled genomes (MAGs) of microviruses, which were then reannotated in GLIMMER3 (Delcher et al., 2007) using default settings, with a minimum gene length of 110 bp and a maximum overlap of 50 bp. We recovered homologues to the conserved major capsid protein VP1 and replication initiation protein VP4 in the set of metagenome-assembled microviruses using PSI-BLAST searches and querying with VP1 and VP4 proteins from detected enterobacterial gokushoviruses, gokushovirus genomes of Chlamydia, Spiroplasma and Bdellovibrio, and Bullavirinae phage phiX174. After individual protein alignments using Clustal Omega 1.2.4 (standard settings), we concatenated the VP1 and VP4 alignments, and removed all sites with >10% gaps to decrease the number of spuriously aligned sites using Geneious R9. The initial phylogenetic tree of all microviruses was built with IQTree 1.6.2 using the LG+F+R10 substitution model as determined by ModelFinder (Kalyaanamoorthy et al., 2017), and branch support was tested using 1000 ultra-fast bootstrap replicates (Hoang et al., 2018) and 1000 SH-aLRT tests. Collapsing all branches with <95% bootstrap support and <80% SH-aLRT support yielded a single, well-supported clade containing all known Gokushovirinae, and all subsequent alignments and phylogenetic trees were refined by including only those genomes represented in this clade, with branch support assessed with 100 bootstrap replicates.
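
The gap-filtering step above was performed in Geneious R9. For readers who want to reproduce the idea outside Geneious, a minimal sketch of removing alignment columns with more than 10% gaps is given below. The function name, the dictionary representation of the alignment, and the choice of "-" and "." as gap characters are assumptions of this sketch, not details taken from the original analysis.

```python
def drop_gappy_columns(alignment: dict[str, str],
                       max_gap_fraction: float = 0.10,
                       gap_chars: str = "-.") -> dict[str, str]:
    """Remove alignment columns whose gap fraction exceeds `max_gap_fraction`.

    `alignment` maps sequence names to aligned sequences of equal length
    (a hypothetical in-memory representation used for this sketch).
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(seq) for seq in seqs}) > 1:
        raise ValueError("all aligned sequences must have the same length")

    n_seqs = len(seqs)
    # Keep a column when its gap fraction is at or below the threshold,
    # i.e. only columns with MORE than 10% gaps are dropped by default.
    keep = [
        i for i, column in enumerate(zip(*seqs))
        if sum(char in gap_chars for char in column) / n_seqs <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in zip(names, seqs)}
```

With ten sequences, a column containing one gap (exactly 10%) is retained and a column containing two gaps is removed, which matches the ">10% gaps" wording of the text.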

Recovering prophages from metagenomes. To assemble prophages from metagenomic datasets, we downloaded SRA files from BioProjects PRJEB29491 (viral human, Moreno-Gallego et al., 2019), PRJNA362629 (cellular bovine, unpublished), PRJNA290380 (cellular human, Kostic et al., 2015), PRJNA352475 (cellular human, Ferretti et al., 2018), PRJEB6456 (cellular human, Bäckhed et al., 2015), PRJNA385126 (viral human, Stockdale et al., 2018), PRJEB7774 (cellular human, Feng et al., 2016), and PRJNA545408 (viral human, Shkoporov et al., 2019). We performed initial trimming and quality filtering with BBDuk (Bushnell, 2014a) with options ktrim=r k=23 mink=11 hdist=1 tbe tbo. Reads having a minimum nucleotide sequence identity of 50% to sequences of enterobacterial prophages, as determined by BBMap (Bushnell, 2014b), were assembled into contigs using MEGAHIT 1.1.3 (Li et al., 2015) implemented with default settings, and those contigs >1000 bp were retained.
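
The final retention step (keeping assembled contigs longer than 1000 bp) is simple to reproduce. The sketch below shows one possible way to do it on FASTA-formatted assembler output; the function names are hypothetical, and the tool actually used for this filtering is not stated in the text.

```python
from typing import Iterable, Iterator


def read_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records: Iterable[tuple[str, str]],
                   min_length: int = 1000) -> list[tuple[str, str]]:
    """Keep contigs strictly longer than `min_length` bp (">1000 bp" in the text)."""
    return [(header, seq) for header, seq in records if len(seq) > min_length]
```

In practice `read_fasta` could be given an open file handle for the assembler's contig file, since iterating over a text file yields its lines.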


National Institute of General Medical Sciences, Award: R35GM118038