
Prophage-DB: A comprehensive database to explore diversity, distribution, and ecology of prophages

Cite this dataset

Dieppa-Colón, Etan; Martin, Cody; Anantharaman, Karthik (2024). Prophage-DB: A comprehensive database to explore diversity, distribution, and ecology of prophages [Dataset]. Dryad.



Viruses that infect prokaryotes (phages) constitute the most abundant group of biological agents, playing pivotal roles in microbial systems. They are known to impact microbial community dynamics, microbial ecology, and evolution. Efforts to document the diversity, host range, infection dynamics, and effects of bacteriophage infection on host cell metabolism are still at the surface level. Among phages, some adopt the lysogenic mode of infection, where the genome integrates into the host cell genome, forming a prophage. Prophages enable viral genome replication without host cell lysis and often contribute novel and beneficial traits to the host genome. Despite their importance, research on prophages is limited. Current phage research predominantly focuses on lytic phages, leaving a significant gap in knowledge regarding prophages, including their biology, diversity, and ecological roles.


To bridge this gap, we created Prophage-DB, a database of prophages. To build it, we identified lysogenic viruses in genomes from three publicly available databases. We applied several state-of-the-art tools in our pipeline to annotate these viruses, cluster them, classify them taxonomically, and detect their auxiliary metabolic genes (AMGs). With this approach, we identified over 350,000 prophages and 35,000 AMGs.


By summarizing the collected information, we have created a database with extensive metadata on phage and host taxonomy, host information, and auxiliary metabolic genes. We identified numerous phages from a wide variety of archaeal and bacterial hosts, spanning a wide environmental distribution. In addition, the auxiliary metabolic genes identified here, together with their host and environmental context, will improve our understanding of these genes. We expect this comprehensive prophage database to be a valuable resource for advancing prophage research, offering insights into viral taxonomy, host relationships, auxiliary metabolic genes, and environmental distribution. Its use promises to contribute towards understanding microbial ecosystems and unlocking the mysteries of microbial dark matter.

README: Prophage-DB: A comprehensive database to explore diversity, distribution, and ecology of prophages

This dataset contains prophage sequences (available as .fna files) identified from prokaryotic genomes in three public databases: the Genome Taxonomy Database (GTDB, release 207), the National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database (accessed March 2023), and the Searchable Planetary-scale mIcrobiome REsource (SPIRE). The prokaryotic genomes downloaded from these databases included both archaeal and bacterial representative genomes (SPIRE also included data from unknown hosts).


Prophage identification from the downloaded representative genomes was carried out using VIBRANT (v1.2.1) with its default arguments (minimum scaffold length = 1,000 base pairs; minimum number of open reading frames (ORFs, or proteins) per scaffold = 4). The identified prophages are provided in .fna format within the three .tar.gz files listed in the next section.
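As a rough illustration of the step above, the following sketch prints a VIBRANT command line with the two default thresholds written out explicitly. The helper name vibrant_command and the file paths are hypothetical; the flags (-i, -folder, -l, -o) follow VIBRANT's command-line help as we understand it, so please confirm them with `VIBRANT_run.py -h` for your installed version. This is not necessarily the exact invocation used to build Prophage-DB.

```shell
# Hypothetical helper: print a VIBRANT command for one genome file.
# $1 = input genome FASTA (nucleotide), $2 = output folder.
# -l 1000 and -o 4 restate the defaults described above
# (minimum scaffold length and minimum ORFs per scaffold).
vibrant_command() {
  printf 'VIBRANT_run.py -i %s -folder %s -l 1000 -o 4\n' "$1" "$2"
}
```

For example, `vibrant_command genome.fna vibrant_out` prints a line that can be reviewed and then pasted into a terminal where VIBRANT is installed.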

We used skani (v0.2.1) to cluster the viruses. The identified prophages (i.e., the VIBRANT output nucleotide files for phages) were used as input. We performed all-to-all comparisons with skani's default arguments, except for the minimum alignment fraction, which was set to 85 (--min-af 85). After obtaining average nucleotide identity (ANI) and alignment fraction values, we removed viral sequences for which ANI was 100 and both the query and the subject had an alignment fraction of at least 85.
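The removal rule described above can be sketched with awk. The helper below is hypothetical and assumes the pairwise results have been arranged as a tab-separated table with five columns (reference name, query name, ANI, reference alignment fraction, query alignment fraction); skani's actual output layout may differ, so adjust the column numbers accordingly. Which member of an identical pair to remove is not stated in this README; the sketch arbitrarily keeps the first and drops the second.

```shell
# Hypothetical helper: list sequence names to remove as exact duplicates.
# Reads tab-separated rows from stdin in the ASSUMED layout:
#   ref_name  query_name  ANI  AF_ref  AF_query
# A header row, if present, is ignored because its ANI field is not numeric.
duplicates_to_remove() {
  awk -F'\t' '
    $1 != $2 && $3 + 0 >= 100 && $4 + 0 >= 85 && $5 + 0 >= 85 {
      # Keep the first name of the pair and drop the second,
      # unless the first has itself already been dropped.
      if (!($1 in drop)) drop[$2] = 1
    }
    END { for (name in drop) print name }
  ' | LC_ALL=C sort
}
```

Typical use would be `duplicates_to_remove < pairs.tsv > to_remove.txt`, where pairs.tsv is a placeholder file name.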

Taxonomic assignment of viral sequences was carried out using the annotate module of geNomad (v1.7.0). In addition, we used CheckV (v1.0.1) to assess viral genome quality, completeness, and contamination.
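For readers who rerun CheckV on their own sequences, the following sketch tallies sequences per quality tier. It assumes a tab-separated summary table with a header row containing a column named checkv_quality, which is our understanding of CheckV's quality summary output; the helper name and file name are hypothetical. In Prophage-DB itself, the CheckV results are already included in metadata.xlsx.

```shell
# Hypothetical helper: count sequences per CheckV quality tier.
# $1 = tab-separated table whose header includes a "checkv_quality" column
#      (ASSUMED column name; adjust if your file differs).
tally_checkv_quality() {
  awk -F'\t' '
    # Locate the quality column by name in the header row.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "checkv_quality") col = i; next }
    # Count each data row under its quality tier (skipped if column not found).
    col { count[$col]++ }
    END { for (q in count) print q "\t" count[q] }
  ' "$1" | LC_ALL=C sort
}
```

Calling `tally_checkv_quality quality_summary.tsv` prints one line per tier with its count, sorted by tier name.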

Description of the data and file structure

Prophage-DB contains a total of 356,776 prophage sequences (323,608 sequences from bacterial hosts, 21,226 sequences from unknown hosts, and 11,942 sequences from archaeal hosts). These sequences are available in three files, one per host group (archaeal_host_prophages.tar.gz, bacterial_host_prophages.tar.gz, unknown_host_prophages.tar.gz). The metadata file (metadata.xlsx) contains the metadata collected from GTDB and SPIRE, together with the geNomad results, the CheckV results, and auxiliary metabolic gene information. Each column is described in the metadata file. Data that appears as NA was not available in the original metadata files or was not produced by the software used.

This database contains three compressed files (.tar.gz format):

archaeal_host_prophages.tar.gz, bacterial_host_prophages.tar.gz, unknown_host_prophages.tar.gz

To open these files use the following commands in Unix-based systems and Windows (10 or later):

  • tar -xzf archaeal_host_prophages.tar.gz
  • tar -xzf bacterial_host_prophages.tar.gz
  • tar -xzf unknown_host_prophages.tar.gz
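Because the archives hold many sequences, it can be convenient to extract each one into its own directory instead of the current one. The sketch below wraps the tar command above with the -C option; the helper name extract_archive and the destination directory names are hypothetical.

```shell
# Hypothetical helper: extract a .tar.gz archive into a dedicated directory.
# $1 = archive path, $2 = destination directory (created if missing).
extract_archive() {
  mkdir -p "$2" || return 1
  tar -xzf "$1" -C "$2"
}
```

For example, `extract_archive archaeal_host_prophages.tar.gz archaeal_host_prophages` places the extracted files under archaeal_host_prophages/. To inspect an archive without extracting it, `tar -tzf archaeal_host_prophages.tar.gz` lists its contents.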

Extracting the .tar.gz archives yields .fna files, which are FASTA files containing the prophage nucleotide sequences.

To view the files you can use pre-installed text editors such as nano, vim, TextEdit or Notepad (Windows). Example:

  • nano filename.fna
  • notepad filename.fna
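Large FASTA files can be slow to open in a text editor. As an alternative on Unix-based systems, the sketch below summarizes a file from the command line; it relies only on the FASTA convention that each record starts with a header line beginning with ">". The helper names are hypothetical.

```shell
# Hypothetical helpers for a quick look at a FASTA (.fna) file.

# Print the number of sequences, i.e. the number of header lines.
count_sequences() {
  grep -c '^>' "$1"
}

# Print the first N header lines (N defaults to 5).
list_headers() {
  grep '^>' "$1" | head -n "${2:-5}"
}
```

For example, `count_sequences filename.fna` prints the number of records in the file and `list_headers filename.fna 10` shows its first ten sequence names.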

Access information

Prokaryotic genomes were obtained from the following sources: GTDB (release 207), NCBI RefSeq (accessed March 2023), and SPIRE.

Software used in our study

Publication: Kieft, K., Zhou, Z. & Anantharaman, K. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome 8, 90 (2020). Link to software:

Publication: Skani enables accurate and efficient genome comparison for modern metagenomic datasets. Nat. Methods 20, 1633–1634 (2023). Link to software:

Publication: Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nat. Biotechnol. 1–10 (2023) doi:10.1038/s41587-023-01953-y. Link to software:

Publication: Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578–585 (2021). Link to software:


NIH Common Fund