Metatranscriptomic analysis uncovers prevalent viral ORFs compatible with mitochondrial translation
Abstract
RNA viruses are ubiquitous components of the global virosphere, yet relatively little is known about their genetic diversity or the cellular mechanisms by which they exploit the biology of their diverse eukaryotic hosts. A hallmark of (+)ssRNA (positive single-stranded RNA) viruses is the ability to remodel host endomembranes for their own replication. However, the subcellular interplay between RNA viruses and host organelles that harbor gene expression systems, such as mitochondria, is complex and poorly understood. Here we report the discovery of 763 new virus sequences belonging to the family Mitoviridae by metatranscriptomic analysis, the identification of previously uncharacterized mitovirus clades, and a putative new viral class. With this expanded understanding of the diversity of mitovirus and encoded RNA-dependent RNA polymerases (RdRps), we annotate mitovirus-specific protein motifs and identify hallmarks of mitochondrial translation, including mitochondrion-specific codons. This study expands the known diversity of mitochondrial viruses and provides additional evidence that they co-opt mitochondrial biology for their survival.
Metatranscriptomic studies have rapidly expanded the cadre of known RNA viruses, yet our understanding of how these viruses navigate the cytoplasmic milieu of their hosts to survive remains poorly characterized. In this study, we identify and assemble 763 new viral sequences belonging to the Mitoviridae, a family of (+)ssRNA viruses thought to interact with and remodel host mitochondria. We exploit this genetic diversity to identify new clades of Mitoviridae, annotate clade-specific sequence motifs that distinguish the mitoviral RdRp, and reveal patterns of RdRp codon usage consistent with translation on host cell mitoribosomes. These results serve as a foundation for understanding how mitoviruses co-opt mitochondrial biology for their proliferation.
General Information
Dataset Overview
For a comprehensive overview and the methodology used to generate and process all data in this repository, please see the corresponding open-access publication (listed under Related Publication below).
Corresponding Author Information
Name: Samantha C Lewis
ORCID: https://orcid.org/0000-0001-6306-443X
Affiliations:
Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
Innovative Genomics Institute, Berkeley, CA, USA
Helen Wills Neuroscience Institute, Berkeley, CA, USA
Department of Nutritional Sciences and Toxicology, University of California, Berkeley, CA, USA
email: samlewis@berkeley.edu
Related Publication
Begeman, A., Babaian, A., Lewis, S. C. (2023). Metatranscriptomic analysis uncovers prevalent viral ORFs compatible with mitochondrial translation. mSystems 8:e01002-22. https://doi.org/10.1128/msystems.01002-22
Funding Information
This work was supported by the Shurl and Kay Curci Foundation, National Institutes of Health grants 5T32GM007232-38 and R00GM129456 to Samantha C. Lewis, and a National Science Foundation Graduate Research Fellowship to Adam Begeman. Computing resources were provided by the University of British Columbia Community Health and Wellbeing Cloud Innovation Centre, powered by AWS.
Software Used and Python Packages Required
- GraphPad Prism Version 10.1.0
- Cytoscape Version 3.8.2
- Clustal Omega
- Serratus
- SRA Toolkit
- DIAMOND Version 2.0.6
- NCBI ORFfinder Version 0.4.3
- NCBI BLAST (accessed 23 Feb. 2021)
- Bowtie2 Version 2.4.5
- Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST)
- FastTree Version 2.1.11
- iTOL
- R Version 4.1.2
- Python Version 3.7
- MEME
- Jalview
- AlphaFold/ColabFold
- PyMOL
- RNAfold
Description of Directories and File Structure
Begemanetal_mSystems_2023_Raw_Data
The compressed master directory contains three sub-directories: the code used to generate the figures and analyses for the paper, the code used to assemble novel sequences, and all newly discovered mitovirus sequences.
|- Code and Data for Figures/
|- Code to Assemble Sequences/
|- Newly Discovered Mitovirsues/
Sub-directories
Code and Data for Figures
All code and data used to generate the figures for the manuscript.
|---Figure 02
|---Figure 03
|   |---Panel 3C
|   |---Panel 3B
|---Figure 04
|   |---Panel 4A
|   |---Panel 4B
|---Figure 05
|   |---prediction_ERR3412979_228_4_03457
|   |---MEME Motifs
|---Figure 06
|   |---Panel 6B
|   |---Panel 6C-D
|---Figure 07
|   |---Panel 7A
|   |---Panel 7B
|   |---Panel 7C
|---Figure S1
|   |---prediction_ERR3412979_228_4_03457
|---Figure S2
|---Figure S3
|   |---Panel S3B
|   |---Panel S3A
|---Figure S4
|   |---Panel S4A
|   |---Panel S4B
|   |---Panel S4C
|---Tables
Figure 02
Data for the diversity of sequencing sample collection sites. Map of all sample collection sites that resulted in the identification of new putative mitoviruses.
- Figure02_metadata.csv: Data file corresponding to metadata for SRA projects with the following column labels (NOTE: N/A values represent unavailable data):
- sra: SRA accession code
- org: organization responsible for SRA run
- date published: date SRA project was published
- date collected: date SRA data was collected
- size (bases): size of the sequencing run in bases
- location: location the samples were collected
- env context: context of where the sample was collected
- organism: organism of the sample
- type: type of sequencing performed
- positive hits: number of mitoviruses found in the sequencing run
- bases: number of bases in sequencing run
- log2: the log base 2 of the bases
- cir dim: the corresponding diameter of the circle in the figure based on the log2 value
- log10: the log base 10 of the bases
- cir dim: the corresponding diameter of the circle in the figure based on the log10 value
- location: location of where the sample was collected
- hits/size: number of mitoviruses found relative to the size of the sequencing run
- Mag: percent of magenta based on hits/size
- Cy: percent of cyan based on hits/size
- absolute Mag: percent of magenta based on just hits
- absolute Cy: percent of cyan based on just hits
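The derived columns above (log2, log10, hits/size) can be recomputed from the raw columns. The following is a minimal sketch, not the authors' code: the function name `derived_columns` and the example values are hypothetical, and it assumes that hits/size is simply `positive hits` divided by `bases`. The circle diameter and color percentages are not reproduced because their scaling is not specified here.

```python
import math

def derived_columns(bases, positive_hits):
    """Recompute log-scaled run size and size-normalized hit count.

    Assumption: hits/size = positive hits / bases (not stated explicitly
    in this README).
    """
    if bases <= 0:
        raise ValueError("bases must be positive")
    return {
        "log2": math.log2(bases),
        "log10": math.log10(bases),
        "hits/size": positive_hits / bases,
    }

# Hypothetical run: 2**30 bases with 4 putative mitovirus hits.
row = derived_columns(bases=2**30, positive_hits=4)
print(row)
```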
Figure 03
Data for the discovery and characterization of novel putative mitovirus sequences.
- Panel 3B
- all_mitovirus_rdrps_and_all_outgroups.fasta: fasta file of all newly discovered mitovirus RDRPs including 10 representative RDRPs from each outgroup (Levivirus, Narnavirus, and Ourmiavirus)
- new_final_tree_file.tre: phylogenetic tree file for new mitoviruses identified, generated using FastTree.
- Panel 3C
- at_content.pzfx: Prism file for quantification of AT content in fungal mitochondrial genomes, reference mitovirus genomes, newly identified mitovirus genomes, and representative outgroups (Narnavirus, Levivirus, and Ourmiavirus).
- fungi_mito_ref_at_content.csv: data for AT content for fungal mitochondrial genomes with the following columns.
- Column 1: NCBI accession
- Column 2: Percent of genome that is AT.
- leviviridae_cds_at_content.csv: data for AT content for levivirus genomes with the following columns.
- Column 1: NCBI accession
- Column 2: Percent of genome that is AT.
- mitovirus_cds_at_content.csv: data for AT content for new mitovirus genomes with the following columns.
- Column 1: new mitovirus name
- Column 2: Percent of genome that is AT.
- mitovrius_ref_at_content.csv: data for AT content for reference mitovirus genomes with the following columns.
- Column 1: NCBI accession
- Column 2: Percent of genome that is AT.
- narnavirus_cds_at_content.csv: data for AT content for narnavirus genomes with the following columns.
- Column 1: NCBI accession
- Column 2: Percent of genome that is AT.
- ourmiavirus_cds_at_content.csv: data for AT content for ourmiavirus genomes with the following columns.
- Column 1: NCBI accession
- Column 2: Percent of genome that is AT.
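For readers who want to recompute AT content from the FASTA files, the following is a minimal sketch under stated assumptions. The helper names `read_fasta` and `at_content` are hypothetical, and the sketch assumes ambiguous bases (such as N) are excluded from the denominator; the original analysis may have handled them differently.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def at_content(sequence):
    """Percent of unambiguous bases (A, C, G, T/U) that are A or T/U."""
    seq = sequence.upper().replace("U", "T")
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * at / (at + gc)

# Hypothetical two-record FASTA, given as a list of lines.
example = [">seq1", "AATT", "GC", ">seq2", "AUGC"]
for name, seq in read_fasta(example):
    print(name, round(at_content(seq), 2))
```

To run this on a file from the repository, pass an open file handle to `read_fasta` in place of the example list.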
Figure 04
Data for sequence similarity networks of reference and putative new mitovirus RdRps.
- Panel 4A
- 81562_all_mitovirus_rdrps_and_all_outgroups_20_cutoff_fasta_repnode-.90_ssn.xgmml: Sequence similarity network of all newly discovered mitovirus RDRPs and RDRPs from each outgroup (Levivirus, Narnavirus, and Ourmiavirus) generated using EFI-EST with an E value cutoff of 1x10^-5.
- all_mitovirus_rdrps_and_all_outgroups.fasta: fasta file of all newly discovered mitovirus RDRPs including 10 representative RDRPs from each outgroup (Levivirus, Narnavirus, and Ourmiavirus).
- Cytoscape_Network.sif: Cytoscape file used to visualize sequence similarity network.
- Panel 4B
- 81625_all_pub_and_new_mitoviruses_60_evalue_60_cutoff_fasta_repnode-.90_ssn.xgmml: Sequence similarity network of reference and new mitovirus RDRPs generated using EFI-EST with an E value cutoff of 1x10^-60.
- all_pub_and_new_mitoviruses.fasta: fasta file of newly discovered and reference mitovirus RDRPs.
- Cytoscope_network.sif: Cytoscape file used to visualize sequence similarity network.
Figure 05
Data for mitovirus-conserved protein motifs.
- MEME Motifs: directory of all discovered MEME protein motifs for the mitovirus RdRp.
- prediction_ERR3412979_228_4_03457: directory of the ColabFold output for the AlphaFold protein structure prediction of new mitovirus ERR3412979_228_4.
- ERR3412979_288_4_colored.pse: PyMOL visualization file of the AlphaFold-predicted new mitovirus RdRp structure with motifs color coded.
Figure 06
Data for analysis of mitovirus non-canonical codon usage.
- Panel 6B:
- uga per length.pzfx: Prism file for depicting the number of UGA codons per length of RDRP in new and reference mitoviruses as well as outgroups (Levivirus, Narnavirus, and Ourmiavirus).
- leviviridae_mito_spec_codons.csv: Data quantifying the open reading frames in levivirus genomes when using the standard and mitochondrial codon tables with the following labels:
- name: NCBI accession
- cds_mito: sequence of open reading frame using mitochondrial codon table
- cds_mito_length: length of that open reading frame
- cds_std: sequence of open reading frame using standard codon table
- cds_std_length: length of that open reading frame
- number_of_mito_codons: number of UGA codons in that open reading frame
- levivirus_uga_per_length.csv: data quantifying the number of UGA codons per length in levivirus open reading frames with the following labels.
- name: NCBI accession
- length: length of RDRP
- ugas: number of UGA codons
- uga per length: number of UGA codons per length
- mito_specific_codon_only_mitovirus_analysis.csv: Data quantifying the open reading frames in new mitovirus genomes when using the standard and mitochondrial codon tables with the following labels:
- column 1: index
- sra: sra project sequence was assembled from
- node: node of assembly sequence was in
- codon_table: codon table used to translate open reading frame
- sequence: sequence of assembled genome
- cds_mito: sequence of open reading frame using mitochondrial codon table
- cds_mito_len: length of that open reading frame
- cds_std: sequence of open reading frame using standard codon table
- cds_std_len: length of that open reading frame
- number_of_mito_codons: number of UGA codons in that open reading frame
- narnavirus_mito_spec_codons.csv: Data quantifying the open reading frames in narnavirus genomes when using the standard and mitochondrial codon tables with the following labels:
- name: NCBI accession
- cds_mito: sequence of open reading frame using mitochondrial codon table
- cds_mito_length: length of that open reading frame
- cds_std: sequence of open reading frame using standard codon table
- cds_std_length: length of that open reading frame
- number_of_mito_codons: number of UGA codons in that open reading frame
- narnavirus_uga_per_length.csv: data quantifying the number of UGA codons per length in narnavirus open reading frames with the following labels.
- name: NCBI accession
- length: length of RDRP
- ugas: number of UGA codons
- uga per length: number of UGA codons per length
- our_mitoviruses_uga_per_length.csv: data quantifying the number of UGA codons per length in new mitovirus open reading frames with the following labels.
- name: NCBI accession
- length: length of RDRP
- ugas: number of UGA codons
- uga per length: number of UGA codons per length
- ourmiavirus_mito_spec_codons.csv: Data quantifying the open reading frames in ourmiavirus genomes when using the standard and mitochondrial codon tables with the following labels:
- name: NCBI accession
- cds_mito: sequence of open reading frame using mitochondrial codon table
- cds_mito_length: length of that open reading frame
- cds_std: sequence of open reading frame using standard codon table
- cds_std_length: length of that open reading frame
- number_of_mito_codons: number of UGA codons in that open reading frame
- ourmiavirus_uga_per_length.csv: data quantifying the number of UGA codons per length in ourmiavirus open reading frames with the following labels.
- name: NCBI accession
- length: length of RDRP
- ugas: number of UGA codons
- uga per length: number of UGA codons per length
- ref_mitovirus_mito_spec_codons.csv: Data quantifying the open reading frames in reference mitovirus genomes when using the standard and mitochondrial codon tables with the following labels:
- name: NCBI accession
- cds_mito: sequence of open reading frame using mitochondrial codon table
- cds_mito_length: length of that open reading frame
- cds_std: sequence of open reading frame using standard codon table
- cds_std_length: length of that open reading frame
- number_of_mito_codons: number of UGA codons in that open reading frame
- ref_mitoviruses_uga_per_length.csv: data quantifying the number of UGA codons per length in reference mitovirus open reading frames with the following labels.
- name: NCBI accession
- length: length of RDRP
- ugas: number of UGA codons
- uga per length: number of UGA codons per length
- Panel 6C-D:
- RdRp_sizes.pzfx: Prism file for depicting the sizes of the viral RdRps when translated in the standard codon table and when translated using the mitochondrial codon table.
- leviviridae_rdrp_size.csv: Data for length of levivirus RdRps with the following labels.
- name: NCBI accession
- size: length of RdRp
- mito_specific_codon_only_mitovirus_analysis.csv: Data for length of new mitovirus RdRps in standard and mitochondrial codon tables with the following labels.
- sra: SRA project accession
- node: node from assembly
- codon_table: codon table used to identify RdRp
- sequence: full assembled sequence
- cds_mito: open reading frame using mitochondrial codon table
- cds_mito_len: length of open reading frame using mitochondrial codon table
- rdrp_mito_len: length of translated RdRp using mitochondrial codon table
- cds_std: open reading frame using standard codon table
- cds_std_len: length of open reading frame using standard codon table
- rdrp_std_len: length of translated RdRp using standard codon table
- number_of_mito_codons: number of UGA codons in open reading frame
- narnavirus_rdrp_size.csv: Data for length of narnavirus RdRps with the following labels.
- name: NCBI accession
- size: length of RdRp
- ourmiavirus_rdrp_size.csv: Data for length of ourmiavirus RdRps with the following labels.
- name: NCBI accession
- size: length of RdRp
- ref_mitovirus_mito_spec_codons.csv: Data for length of reference mitovirus RdRps in standard and mitochondrial codon tables with the following labels.
- name: NCBI accession
- cds_mito: open reading frame using mitochondrial codon table
- cds_mito_len: length of open reading frame using mitochondrial codon table
- cds_std: open reading frame using standard codon table
- cds_std_len: length of open reading frame using standard codon table
- number_of_mito_codons: number of UGA codons in open reading frame
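The Figure 06 files compare each open reading frame under the standard and mitochondrial codon tables and count in-frame UGA codons. The following is a minimal sketch of that comparison, not the authors' pipeline. It assumes the mitochondrial table is NCBI translation table 4, in which UGA encodes tryptophan rather than stop, and it assumes "UGA per length" means UGA codons divided by the ORF length in codons. All function names are hypothetical.

```python
STANDARD_STOPS = {"TAA", "TAG", "TGA"}
MITO_STOPS = {"TAA", "TAG"}  # assumed NCBI table 4: TGA (UGA) encodes Trp

def codons(cds):
    """Split a coding sequence into complete in-frame codons (DNA alphabet)."""
    seq = cds.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def orf_length(cds, stops):
    """Number of codons read from the first codon up to the first stop codon."""
    length = 0
    for codon in codons(cds):
        if codon in stops:
            break
        length += 1
    return length

def count_uga(cds):
    """Number of in-frame UGA codons read before the first mitochondrial stop."""
    n = orf_length(cds, MITO_STOPS)
    return sum(1 for codon in codons(cds)[:n] if codon == "TGA")

def uga_per_length(cds):
    """UGA codons per codon of the ORF read with the mitochondrial table."""
    n = orf_length(cds, MITO_STOPS)
    if n == 0:
        raise ValueError("empty open reading frame")
    return count_uga(cds) / n

# Hypothetical sequence: ATG TGG TGA GCT TAA
cds = "ATGTGGTGAGCTTAA"
print(orf_length(cds, STANDARD_STOPS))  # truncated at the in-frame UGA
print(orf_length(cds, MITO_STOPS))      # reads through UGA to UAA
print(count_uga(cds), uga_per_length(cds))
```

An ORF that is markedly longer under the mitochondrial table than under the standard table, as in this toy sequence, is the pattern the `cds_mito_len` and `cds_std_len` columns are meant to expose.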
Figure 07
Data for codon usage bias of mitovirus sequences.
- Panel 7A:
- CpMV1_Codon_Frequency.csv: Codon frequency of representative mitovirus CpMV1 with its host mitochondrial and nuclear codon frequencies with the following labels.
- Mitovirus Frequency: Codon frequency of mitovirus
- Host Mitochondrial Frequency: Codon frequency of host mitochondria
- Host Nuclear Frequency: Codon frequency of host nucleus
- Representative_Mitovirus_Codon_Frequency.csv: Codon frequency of representative new mitovirus with its host mitochondrial and nuclear codon frequencies with the following labels.
- New Mitovirus Frequency: Codon frequency of mitovirus
- Mitochondrial Frequency: Codon frequency of fungal mitochondria
- Nuclear Frequency: Codon frequency of fungal nucleus
- Panel 7B:
- Dataset_Correlation.csv: Pairwise correlation of reference codon frequencies of fungal mitochondrial, fungal nuclear, plant mitochondrial, plant nuclear, metazoan mitochondrial, metazoan nuclear, and bacterial genomes.
- Panel 7C:
- Heatmap_Callout_Data.csv: Data for subset of new mitovirus sequences with high metazoan mitochondrial codon frequency correlations with the following labels.
- Column 1: Name of new mitovirus sequence
- mito: pairwise correlation of genome with reference metazoan mitochondrial genomes
- nuc: pairwise correlation of genome with reference metazoan nuclear genomes
- Heatmap_Data.csv: Data for new mitovirus pairwise correlation of codon frequencies with each reference genome sets with the following labels.
- Column 1: Name of new mitovirus sequences
- fungi-mito-r2: pairwise correlation between mitovirus codon frequency and reference fungal mitochondrial genomes
- fungi-genome-r2: pairwise correlation between mitovirus codon frequency and reference fungal nuclear genomes
- viridiplantae-mito-r2: pairwise correlation between mitovirus codon frequency and reference plant mitochondrial genomes
- viridiplantae-genome-r2: pairwise correlation between mitovirus codon frequency and reference plant nuclear genomes
- metazoan-mito-r2: pairwise correlation between mitovirus codon frequency and reference metazoan mitochondrial genomes
- metazoan-genome-r2: pairwise correlation between mitovirus codon frequency and reference metazoan nuclear genomes
- bacteria-r2: pairwise correlation between mitovirus codon frequency and reference bacterial genomes
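The Figure 07 files pair codon frequencies with correlation values in columns ending in `-r2`. The following is a minimal sketch of how such values could be computed. It is an illustration under assumptions, not the authors' method: it takes codon frequency to be the fraction of all 64 codons in a coding sequence, and it takes `r2` to be the squared Pearson correlation between two frequency vectors. The function names are hypothetical.

```python
from itertools import product

# All 64 codons in a fixed order, so frequency vectors are comparable.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def codon_frequencies(cds):
    """Fraction of each of the 64 codons among complete in-frame codons."""
    seq = cds.upper().replace("U", "T")
    counts = dict.fromkeys(CODONS, 0)
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:  # skip codons containing ambiguous bases
            counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous codons found")
    return [counts[c] / total for c in CODONS]

def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length vectors."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return (sxy * sxy) / (sxx * syy)

# Hypothetical viral and reference coding sequences.
virus = codon_frequencies("ATGAAATTTTGATTA")
reference = codon_frequencies("ATGAAAAAATTTTGA")
print(pearson_r2(virus, reference))
```

Using a fixed codon order for every sequence keeps the vectors aligned, so the same function can compare a virus against mitochondrial, nuclear, or bacterial reference frequencies.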
Figure S1
Data for AlphaFold prediction of representative mitovirus RdRp.
- prediction_ERR3412979_228_4_03457: directory of the ColabFold output for the AlphaFold protein structure prediction of new mitovirus ERR3412979_228_4.
- ERR3412979_288_4_colored.pse: PyMOL visualization file of the AlphaFold-predicted new mitovirus RdRp structure with motifs color coded.
Figure S2
Data for representative multiple sequence alignment of mitoviral RdRp and closest evolutionary neighbors.
- representative_alignment.phy: Representative multiple sequence alignment generated using Clustal Omega between three representative mitovirus RdRps and representative RdRps from outgroups (Levivirus, Narnavirus, and Ourmiavirus)
Figure S3
Data for Narnavirus and protist virus codon usage correlations.
- Panel S3A:
- Narnavirus_Codon_Frequency.csv: Codon frequency of representative narnavirus with its host mitochondrial and nuclear codon frequencies with the following labels.
- Narnavirus Frequency: Codon frequency of narnavirus
- Mitochondrial Frequency: Codon frequency of host mitochondria
- Nuclear Frequency: Codon frequency of host nucleus
- Panel S3B:
- CV_analysis.xlsx: Data for codon frequency analysis of example virus and its individual host with the following labels.
- Codon: trinucleotide codon
- CV Frequency: codon frequency of virus
- cryptosporidium Frequency: codon frequency of host
- Bacteria Frequency: codon frequency of bacterial genomes
- Fungi Nuclear Frequency: codon frequency of fungal nuclear genomes
- Fungi Mito Frequency: codon frequency of fungal mitochondrial genomes
- Mammal Nuclear Frequency: codon frequency of mammalian nuclear genomes
- Mammal Mito Frequency: codon frequency of mammalian mitochondrial genomes
- Metazoan Nuclear Frequency: codon frequency of metazoan nuclear genomes
- Metazoan Mito Frequency: codon frequency of metazoan mitochondrial genomes
- Plant Nuclear Frequency: codon frequency of plant nuclear genomes
- Plant Mito Frequency: codon frequency of plant mitochondrial genomes
- ESV_analysis.xlsx: Data for codon frequency analysis of example virus and its individual host with the following labels.
- Codon: trinucleotide codon
- ESV Frequency: codon frequency of virus
- Eimeria Stiedai Frequency: codon frequency of host
- Bacteria Frequency: codon frequency of bacterial genomes
- Fungi Nuclear Frequency: codon frequency of fungal nuclear genomes
- Fungi Mito Frequency: codon frequency of fungal mitochondrial genomes
- Mammal Nuclear Frequency: codon frequency of mammalian nuclear genomes
- Mammal Mito Frequency: codon frequency of mammalian mitochondrial genomes
- Metazoan Nuclear Frequency: codon frequency of metazoan nuclear genomes
- Metazoan Mito Frequency: codon frequency of metazoan mitochondrial genomes
- Plant Nuclear Frequency: codon frequency of plant nuclear genomes
- Plant Mito Frequency: codon frequency of plant mitochondrial genomes
- GLV_analysis.xlsx: Data for codon frequency analysis of example virus and its individual host with the following labels.
- Codon: trinucleotide codon
- GLV Frequency: codon frequency of virus
- Giardia Lamblia Frequency: codon frequency of host
- Bacteria Frequency: codon frequency of bacterial genomes
- Fungi Nuclear Frequency: codon frequency of fungal nuclear genomes
- Fungi Mito Frequency: codon frequency of fungal mitochondrial genomes
- Mammal Nuclear Frequency: codon frequency of mammalian nuclear genomes
- Mammal Mito Frequency: codon frequency of mammalian mitochondrial genomes
- Metazoan Nuclear Frequency: codon frequency of metazoan nuclear genomes
- Metazoan Mito Frequency: codon frequency of metazoan mitochondrial genomes
- Plant Nuclear Frequency: codon frequency of plant nuclear genomes
- Plant Mito Frequency: codon frequency of plant mitochondrial genomes
- TVV_analysis.xlsx: Data for codon frequency analysis of example virus and its individual host with the following labels.
- Codon: trinucleotide codon
- TVV Frequency: codon frequency of virus
- Tri_vag Frequency: codon frequency of host
- Bacteria Frequency: codon frequency of bacterial genomes
- Fungi Nuclear Frequency: codon frequency of fungal nuclear genomes
- Fungi Mito Frequency: codon frequency of fungal mitochondrial genomes
- Mammal Nuclear Frequency: codon frequency of mammalian nuclear genomes
- Mammal Mito Frequency: codon frequency of mammalian mitochondrial genomes
- Metazoan Nuclear Frequency: codon frequency of metazoan nuclear genomes
- Metazoan Mito Frequency: codon frequency of metazoan mitochondrial genomes
- Plant Nuclear Frequency: codon frequency of plant nuclear genomes
- Plant Mito Frequency: codon frequency of plant mitochondrial genomes
Figure S4
Data for extended codon usage analysis.
- Panel S4A
- Correlation_Data.csv: Data for pairwise codon frequency correlations for reference mitoviruses, new mitoviruses, and narnaviruses with the following labels.
- fungi_mito_r2: reference mitovirus pairwise codon frequency correlation with fungal mitochondrial genomes.
- fungi-mito-r2 sra: new mitovirus pairwise codon frequency correlation with fungal mitochondrial genomes.
- fungi_mito_r2 narna: narnavirus pairwise codon frequency correlation with fungal mitochondrial genomes.
- fungi_genome_r2: reference mitovirus pairwise codon frequency correlation with fungal nuclear genomes.
- fungi-genome-r2 sra: new mitovirus pairwise codon frequency correlation with fungal nuclear genomes.
- fungi_genome_r2 narna: narnavirus pairwise codon frequency correlation with fungal nuclear genomes.
- viridiplantae_mito_r2: reference mitovirus pairwise codon frequency correlation with plant mitochondrial genomes.
- viridiplantae-mito-r2 sra: new mitovirus pairwise codon frequency correlation with plant mitochondrial genomes.
- viridiplantae_mito_r2 narna: narnavirus pairwise codon frequency correlation with plant mitochondrial genomes.
- viridiplantae_genome_r2: reference mitovirus pairwise codon frequency correlation with plant nuclear genomes.
- viridiplantae-genome-r2 sra: new mitovirus pairwise codon frequency correlation with plant nuclear genomes.
- viridiplantae_genome_r2 narna: narnavirus pairwise codon frequency correlation with plant nuclear genomes.
- metazoan_mito_r2: reference mitovirus pairwise codon frequency correlation with metazoan mitochondrial genomes.
- metazoan-mito-r2 sra: new mitovirus pairwise codon frequency correlation with metazoan mitochondrial genomes.
- metazoan_mito_r2 narna: narnavirus pairwise codon frequency correlation with metazoan mitochondrial genomes.
- metazoan_genome_r2: reference mitovirus pairwise codon frequency correlation with metazoan nuclear genomes.
- metazoan-genome-r2 sra: new mitovirus pairwise codon frequency correlation with metazoan nuclear genomes.
- metazoan_genome_r2 narna: narnavirus pairwise codon frequency correlation with metazoan nuclear genomes.
- bacteria_r2: reference mitovirus pairwise codon frequency correlation with bacterial genomes.
- bacteria-r2 sra: new mitovirus pairwise codon frequency correlation with bacterial genomes.
- bacteria_r2 narna: narnavirus pairwise codon frequency correlation with bacterial genomes.
- Panel S4B
- metazoan_callout_data.csv: Data for subset of new mitovirus sequences with high metazoan mitochondrial codon frequency correlations with the following labels.
- Column 1: Name of new mitovirus sequence
- mito: pairwise correlation of genome with reference metazoan mitochondrial genomes
- nuc: pairwise correlation of genome with reference metazoan nuclear genomes
- Panel S4C
- Plant_Genome_Mitochondrial_Correlation.csv: Data for the pairwise codon frequency correlation between plant mitochondrial and nuclear genomes with the following labels.
- Genome Frequency: Codon frequency for nuclear codons
- Mitochondrial Frequency: Codon frequency for mitochondrial codons
Tables
Extended data and supplementary tables
- Supplemental_Table_01.xlsx: Extended data for SRA search with the following sheets.
- Sheet Metadata: metadata for the Excel workbook with the following labels.
- Sheet Name: name of sheet
- Discription: description of sheet
- SRA Runs Searched: Table of SRA runs searched by Serratus with the following labels.
- Run: SRA run
- ReleaseDate: release date
- LoadDate: date loaded onto Serratus
- spots: number of spots
- bases: number of bases
- spots_with_mates: number of spots with mates
- avgLength: average length of spot
- size_MB: size in MB
- AssemblyName: name of assembly
- download_path: path used to download data
- Experiment: SRA Experiment
- LibraryName: name of library
- LibraryStrategy: strategy of sequencing
- LibrarySelection: how sequences were selected
- LibrarySource: type of source material sequenced (e.g., genomic, transcriptomic)
- LibraryLayout: paired or unpaired sequencing
- InsertSize: size of insert
- InsertDev: standard deviation of insert size
- Platform: sequencing platform
- Model: model of sequencing platform
- SRAStudy: SRA study
- BioProject: Bioproject ID
- Study_Pubmed_id: pubmed ID
- ProjectID: project ID
- Sample: Sample ID
- BioSample: Bio sample ID
- SampleType: type of sample
- TaxID: taxonomy ID
- ScientificName: Scientific name of species
- SampleName: name of sample
- CenterName: name of data depositor
- Submission: submission ID
- dbgap_study_accession: dbgap study ID
- Consent: Consent to use sequences status
- RunHash: Run Hash
- ReadHash: Read Hash
- SRA Runs with Mitovirus Reads: SRA runs with positive IDs for mitovirus with the following column labels.
- Experiment: SRA experiment ID
- Bucket Alignment: Serratus bucket alignment metadata
- Family: family of virus
- Score: Serratus score
- pctid: percent identity to known viruses
- alns: number of sequences with alignments
- Column7: Serratus average column score
- SRA Runs Analyzed: SRA runs assembled and used in this study with the following labels.
- sra: SRA ID
- org: Organization that collected sample
- date collected: Date the sample was collected
- size (bases): Size of sequencing run
- location: Location of the sample collection
- env context: environmental context of sample
- organism: organism the sample was collected from
- type: type of sequencing performed
- mitoviruses Found: number of new mitoviruses found
- publication (if available): publication for corresponding sequencing data
- Supplemental_Table_02.xlsx: Extended data for new identified mitoviruses with the following sheets.
- New Mitovirus Sequences: Full Data on new mitovirus sequences with following column labels.
- sra: SRA run ID
- node: Assembly node
- codon-table: codon table open reading frame is in
- sequence: sequence of assembled virus
- cds: open reading frame
- protein: protein sequence
- length: length of protein sequence
- start: start position on genome of open reading frame
- stop: end position on genome of open reading frame
- fungi-mito-r2: pairwise codon frequency correlation with fungal mitochondrial genomes
- fungi-genome-r2: pairwise codon frequency correlation with fungal nuclear genomes
- mammalia-mito-r2: pairwise codon frequency correlation with mammalian mitochondrial genomes
- mammalia-genome-r2: pairwise codon frequency correlation with mammalian nuclear genomes
- viridiplantae-mito-r2: pairwise codon frequency correlation with plant mitochondrial genomes
- viridiplantae-genome-r2: pairwise codon frequency correlation with plant nuclear genomes
- metazoan-mito-r2: pairwise codon frequency correlation with metazoan mitochondrial genomes
- metazoan-genome-r2: pairwise codon frequency correlation with metazoan nuclear genomes
- bacteria-r2: pairwise codon frequency correlation with bacterial genomes
- at-content: percent of genome that is AT
- query name: name of new virus
- subject name: blast subject name
- subject length: blast subject length
- alignment length: blast alignment length
- percent identity: blast percent identity
- identical matches: blast number of identical matches
- mismatches: blast number of mismatches
- positive: blast positive value
- gaps: blast gaps
- percent positive: blast percent positive
- query start: blast query start
- query end: blast query end
- subject start: blast subject start
- subject end: blast subject end
- e value: blast e value
- bitscore: blast bitscore
- subject title: blast subject title
- subject taxid: blast subject tax ID
- subject sci name: blast subject scientific name
- subject common name: blast subject common name
- subject blast name: blast subject classification
- mitovirus?: True/False, whether the identified virus is a mitovirus
- rpm: reads per million
- rpkm: reads per kilobase per million mapped reads
- Misannotated NCBI Mitoviruses: Identified NCBI mitoviruses that were misannotated with the following column labels.
- NCBI Subject: NCBI ID of the misannotated sequence
- mitovirus?: True/False, whether it is a mitovirus
- New Mitovirus Sequences: Full data on new mitovirus sequences, with the following column labels.
- Supplemental_Table_03.xlsx: Extended data on codon frequency analysis and reference genomes used with the following sheet names.
- NewMitovirus UGA Codon Analysis: UGA codon analysis for newly identified mitovirus sequences with the following column labels
- sra: SRA project acquisition
- node: node from assembly
- codon_table: codon table used to identify RdRp
- sequence: full assembled sequence
- cds_mito: open reading frame using mitochondrial codon table
- cds_mito_len: length of open reading frame using mitochondrial codon table
- cds_std: open reading frame using standard codon table
- cds_std_len: length of open reading frame using standard codon table
- number_of_mito_codons: number of UGA codons in open reading frame
- RefMitovirus UGA Codon Analysis: UGA codon analysis for reference mitovirus sequences with the following column labels
- name: NCBI name
- cds_mito: open reading frame using mitochondrial codon table
- cds_mito_len: length of open reading frame using mitochondrial codon table
- cds_std: open reading frame using standard codon table
- cds_std_len: length of open reading frame using standard codon table
- number_of_mito_codons: number of UGA codons in open reading frame
- Outgroup UGA Codon Analysis: UGA codon analysis for outgroups (Narnavirus, Levivirus, Ourmaivirus) with the following column labels
- NCBI Acc: NCBI name
- RdRp Size: size of the RdRp
- Fungi Mito Codon Frequency: Codon frequency data for reference fungal mitochondrial genomes, with the following column labels
- Codon: trinucleotide codon
- Count: number found in genomes
- Frequency: frequency of that codon in genomes
- Fungi Nuclear Codon Frequency: Codon frequency data for reference fungal nuclear genomes, with the following column labels
- Codon: trinucleotide codon
- Frequency: frequency of that codon in genomes
- Mammal Mito Codon Frequency: Codon frequency data for reference mammalian mitochondrial genomes, with the following column labels
- Codon: trinucleotide codon
- Count: number found in genomes
- Frequency: frequency of that codon in genomes
- Mammal Nuclear Codon Frequency: Codon frequency data for reference mammalian nuclear genomes, with the following column labels
- Codon: trinucleotide codon
- Frequency: frequency of that codon in genomes
- Plant Mito Codon Frequency: Codon frequency data for reference plant mitochondrial genomes, with the following column labels
- Codon: trinucleotide codon
- Count: number found in genomes
- Frequency: frequency of that codon in genomes
- Plant Nuclear Codon Frequency: Codon frequency data for reference plant nuclear genomes, with the following column labels
- Codon: trinucleotide codon
- Frequency: frequency of that codon in genomes
- Metazoan Mito Codon Frequency: Codon frequency data for reference metazoan mitochondrial genomes, with the following column labels
- Codon: trinucleotide codon
- Count: number found in genomes
- Frequency: frequency of that codon in genomes
- Metazoan Nuclear Codon Frequency: Codon frequency data for reference metazoan nuclear genomes, with the following column labels
- Codon: trinucleotide codon
- Frequency: frequency of that codon in genomes
- Bacteria Codon Frequency: Codon frequency data for reference bacterial genomes, with the following column labels
- Codon: trinucleotide codon
- Frequency: frequency of that codon in genomes
- Narnavirus Codon Analysis: Codon frequency analysis for narnaviruses with the following column labels
- NCBI ID: NCBI name
- CDS: open reading frame
- protein: protein translation
- length: length of protein
- fungi_mito_r2: pairwise codon frequency correlation with fungal mitochondrial genomes
- fungi_genome_r2: pairwise codon frequency correlation with fungal nuclear genomes
- mammalia_mito_r2: pairwise codon frequency correlation with mammalian mitochondrial genomes
- mammalia_genome_r2: pairwise codon frequency correlation with mammalian nuclear genomes
- viridiplantae_mito_r2: pairwise codon frequency correlation with plant mitochondrial genomes
- viridiplantae_genome_r2: pairwise codon frequency correlation with plant nuclear genomes
- metazoan_mito_r2: pairwise codon frequency correlation with metazoan mitochondrial genomes
- metazoan_genome_r2: pairwise codon frequency correlation with metazoan nuclear genomes
- bacteria_r2: pairwise codon frequency correlation with bacterial genomes
- NewMitovirus UGA Codon Analysis: UGA codon analysis for newly identified mitovirus sequences with the following column labels
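
The UGA codon analysis sheets above pair an open reading frame found under a mitochondrial codon table (`cds_mito`) with one found under the standard table (`cds_std`), plus a count of UGA codons. The following is a minimal Python sketch of that kind of comparison, not the authors' code. It assumes the mitochondrial table differs from the standard one only in reading UGA as tryptophan rather than stop (as in NCBI translation table 4; this README does not state which table was used), and it defines an ORF as the longest forward-strand stretch from ATG to the first in-frame stop. The names `longest_orf`, `count_uga`, `STOPS_STD` and `STOPS_MITO` are hypothetical.

```python
# Stop codons under the standard table, and under an assumed mitochondrial
# table in which TGA (UGA) encodes tryptophan instead of stop.
STOPS_STD = {"TAA", "TAG", "TGA"}
STOPS_MITO = {"TAA", "TAG"}


def longest_orf(seq, stops):
    """Return the longest ATG-to-stop ORF in the three forward frames.

    Only complete ORFs (terminated by a stop codon) are considered.
    Returns an empty string if none is found.
    """
    seq = seq.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in stops:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


def count_uga(cds):
    """Count in-frame UGA (TGA) codons in a coding sequence."""
    cds = cds.upper().replace("U", "T")
    return sum(1 for i in range(0, len(cds) - 2, 3) if cds[i:i + 3] == "TGA")


if __name__ == "__main__":
    toy = "ATGAAATGAGGGTAA"  # made-up sequence with one internal UGA
    cds_std = longest_orf(toy, STOPS_STD)
    cds_mito = longest_orf(toy, STOPS_MITO)
    print(len(cds_std), len(cds_mito), count_uga(cds_mito))
```

A sequence whose ORF is much longer under the mitochondrial table than under the standard table, with one or more in-frame UGA codons, is the pattern these sheets are meant to capture.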
Code to Assemble Sequences
Linux code used to assemble sequences from SRA sequencing runs
- Mitovirus-Code-main: directory of Linux command line tools and code used for assembling SRA runs
- mitovrius_pipeline_v2: entire pipeline used for assembly
- working_pipeline.sh: shell script for running pipeline, see corresponding readme
- README.md: README describing how the pipeline should be run
Newly Discovered Mitoviruses
Compiled files for all newly discovered mitovirus sequences
- final_mito_list.csv: Full data on new mitovirus sequences, with the following column labels.
- sra: SRA run ID
- node: Assembly node
- codon-table: codon table under which the open reading frame was identified
- sequence: sequence of assembled virus
- cds: open reading frame
- protein: protein sequence
- length: length of protein sequence
- start: start position on genome of open reading frame
- stop: end position on genome of open reading frame
- fungi-mito-r2: pairwise codon frequency correlation with fungal mitochondrial genomes
- fungi-genome-r2: pairwise codon frequency correlation with fungal nuclear genomes
- mammalia-mito-r2: pairwise codon frequency correlation with mammalian mitochondrial genomes
- mammalia-genome-r2: pairwise codon frequency correlation with mammalian nuclear genomes
- viridiplantae-mito-r2: pairwise codon frequency correlation with plant mitochondrial genomes
- viridiplantae-genome-r2: pairwise codon frequency correlation with plant nuclear genomes
- metazoan-mito-r2: pairwise codon frequency correlation with metazoan mitochondrial genomes
- metazoan-genome-r2: pairwise codon frequency correlation with metazoan nuclear genomes
- bacteria-r2: pairwise codon frequency correlation with bacterial genomes
- at-content: percent of genome that is AT
- query name: name of new virus
- subject name: blast subject name
- subject length: blast subject length
- alignment length: blast alignment length
- percent identity: blast percent identity
- identical matches: blast number of identical matches
- mismatches: blast number of mismatches
- positive: blast positive value
- gaps: blast gaps
- percent positive: blast percent positive
- query start: blast query start
- query end: blast query end
- subject start: blast subject start
- subject end: blast subject end
- e value: blast e value
- bitscore: blast bitscore
- subject title: blast subject title
- subject taxid: blast subject tax ID
- subject sci name: blast subject scientific name
- subject common name: blast subject common name
- subject blast name: blast subject classification
- mitovirus?: True/False, whether the identified virus is a mitovirus
- mitovirus_non_coding_region_sequences.fasta: FASTA file of the non-coding regions of the new mitovirus sequences
- mitoviruses_cds_sequences.fasta: FASTA file of the CDS regions of the new mitovirus sequences
- mitoviruses_full_sequences.fasta: FASTA file of the full new mitovirus sequences
- mitoviruses_protein_sequences.fasta: FASTA file of the protein sequences of the new mitoviruses
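
The `*-r2` columns of `final_mito_list.csv` report a pairwise codon frequency correlation between each viral ORF and a set of reference genomes, and `at-content` reports the percentage of A and T bases. The sketch below shows one way such values could be computed in Python. It is an illustration under stated assumptions, not the authors' implementation: it assumes the correlation is a squared Pearson coefficient taken over the frequencies of all 64 codons, which this README does not specify, and the function names `codon_frequencies`, `r_squared` and `at_content` are hypothetical.

```python
from collections import Counter
from itertools import product

# All 64 trinucleotide codons in a fixed order.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(cds):
    """Return the relative frequency of each of the 64 codons in a CDS.

    Codons containing characters other than A, C, G, T are ignored.
    """
    cds = cds.upper().replace("U", "T")
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in CODONS)
    return {c: (counts[c] / total if total else 0.0) for c in CODONS}


def r_squared(freq_a, freq_b):
    """Squared Pearson correlation between two codon frequency tables."""
    xs = [freq_a[c] for c in CODONS]
    ys = [freq_b[c] for c in CODONS]
    n = len(CODONS)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant table
    return sxy * sxy / (sxx * syy)


def at_content(seq):
    """Percentage of bases in a sequence that are A or T (U counts as T)."""
    seq = seq.upper().replace("U", "T")
    return 100.0 * (seq.count("A") + seq.count("T")) / len(seq)


if __name__ == "__main__":
    viral = codon_frequencies("ATGTTATGATTAAAATAA")      # made-up CDS
    reference = codon_frequencies("ATGTTAAAATTATGGTAA")  # made-up reference
    print(r_squared(viral, reference), at_content("ATGTTATGATTAAAATAA"))
```

In practice the reference table would come from the `Codon` and `Frequency` columns of the reference sheets in Supplemental_Table_03.xlsx rather than from a single toy sequence.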
This dataset contains the raw data and analyses used to produce the figures in the referenced manuscript by Begeman et al. A detailed description of the methods can be found in the manuscript. The data and analyses on which the manuscript figures are based are provided in the following formats: .csv (tabular), .txt (text; code), .tre (phylogenetic trees), .fasta (sequence alignment), .sif (Cytoscape analyses), .eps or .png (image files), .pdb (protein structure), and .phy (3D protein structure model). Tables are provided as Excel files.