Single-cell RNA-seq of the rare virosphere reveals the native hosts of giant viruses in the marine environment
Data files
Feb 29, 2024 version files 147.13 MB
Abstract
Giant viruses (phylum Nucleocytoviricota) are globally distributed in aquatic ecosystems. They play significant roles as evolutionary drivers of eukaryotic plankton and regulators of global biogeochemical cycles. However, we lack knowledge about their native hosts, hindering our understanding of their lifecycle and ecological importance. Here, we used single-cell RNAseq and samples from an induced E. huxleyi bloom during a mesocosm experiment to link giant viruses with their protist hosts. We observe active giant virus infections in multiple host lineages, including members of the algal groups Chrysophycae and Prymnesiophycae, as well as heterotrophic flagellates in the class Katablepharidaceae. Katablepharids were infected with a rare Imitevirales-07 giant virus lineage expressing cell fate regulation genes. Analysis of the temporal dynamics of this host-virus interaction indicated a role for the Imitevirales-07 in the collapse of the host Katablepharid population. Our results demonstrate that single-cell RNA-seq can be used to identify previously undescribed host-virus interactions and study their ecological relevance.
README: Single-cell RNA-seq of the rare virosphere reveals the native hosts of giant viruses in the marine environment
Supplementary Files used in the project. These are the main intermediate files that can help reproduce the data. For a detail description on how to reproduce these files and using them, Go to the GitHub site: https://github.com/vardilab/host-virus-pairing
Description of the data and file structure
Fromm_2023_Data_Availability
Sequences
GVDB.markergenes.90.fna # De-duplicated database of NCLDV marker genes
Transcripts_Cells_Ehux-EhV.95.fasta # The host-virus reference, based on the single-cell transcriptomes of infected cells, to which we added genes from EhV and E. huxleyi.
Transcripts_Katablepharidacea.fasta # Transcripts assembled from a highly infected subpopulation of Katablepharidacea
Blast_results
first_cells.transcripts.edit.metaPR2.csv # Blastn results of single-cell transcripts assembled from highly infected cells (~970) detected against metapr2 database.
first_cells.transcripts.edit.PR2.csv # Blastn results of single-cell transcripts assembled from highly infected cells (~970) detected against pr2 database.
all_cells.transcripts.edit.metaPR2.csv # Blastn results of single-cell transcripts assembled from all cells (~28,000) detected against metapr2 database.
all_cells.transcripts.edit.PR2.csv # Blastn results of single-cell transcripts assembled from all cells (~28,000) detected against metapr2 database.
cells.filtered.blastx.csv # Blastx results of single-cell assembled transcripts against refseq database (to find viral transcripts).
DATA-SPECIFIC INFORMATION FOR: all_cells.transcripts.edit.metaPR2.csv
1. Number of variables: 12
2. Number of rows: 1409286
3. Variable List:
* qseqid: query or source (gene) sequence id
* sseqid: subject or target (reference genome) sequence id
* pident: percentage of identical positions
* length: alignment length (sequence overlap)
* mismatch: number of mismatches
* gapopen: number of gap openings
* qstart: start of alignment in query
* qend: end of alignment in query
* sstart: start of alignment in subject
* send: end of alignment in subject
* evalue: expect value
* bitscore: bit score
DATA-SPECIFIC INFORMATION FOR: all_cells.transcripts.edit.PR2.csv
1. Number of variables: 12
2. Number of rows: 1549087
3. Variable List:
* qseqid: query or source (gene) sequence id
* sseqid: subject or target (reference genome) sequence id
* pident: percentage of identical positions
* length: alignment length (sequence overlap)
* mismatch: number of mismatches
* gapopen: number of gap openings
* qstart: start of alignment in query
* qend: end of alignment in query
* sstart: start of alignment in subject
* send: end of alignment in subject
* evalue: expect value
* bitscore: bit score
DATA-SPECIFIC INFORMATION FOR: first_cells.transcripts.edit.metaPR2.csv
1. Number of variables: 12
2. Number of rows: 56763
3. Variable List:
* qseqid: query or source (gene) sequence id
* sseqid: subject or target (reference genome) sequence id
* pident: percentage of identical positions
* length: alignment length (sequence overlap)
* mismatch: number of mismatches
* gapopen: number of gap openings
* qstart: start of alignment in query
* qend: end of alignment in query
* sstart: start of alignment in subject
* send: end of alignment in subject
* evalue: expect value
* bitscore: bit score
DATA-SPECIFIC INFORMATION FOR: first_cells.transcripts.edit.PR2.csv
1. Number of variables: 12
2. Number of rows: 67864
3. Variable List:
* qseqid: query or source (gene) sequence id
* sseqid: subject or target (reference genome) sequence id
* pident: percentage of identical positions
* length: alignment length (sequence overlap)
* mismatch: number of mismatches
* gapopen: number of gap openings
* qstart: start of alignment in query
* qend: end of alignment in query
* sstart: start of alignment in subject
* send: end of alignment in subject
* evalue: expect value
* bitscore: bit score
DATA-SPECIFIC INFORMATION FOR: cells.filtered.blastx.csv
1. Number of variables: 10
2. Number of rows: 2021
3. Variable List:
* qseqid: query or source (gene) sequence id
* sseqid: subject or target (reference genome) sequence id
* pident: percentage of identical positions
* bitscore: bit score
* domain: domain of life the of subject
* phylum: phylum the of subject
* family: family the of subject
* genus: genus the of subject
* species: species the of subject
* cell barcode: cell barcode of query
UMI_tables_first_cells
data_raw.pickle.gz # Data combined from all UMI tables of fastq files mapped to the NCLDV marker gene database
UMI_tables_scatterplot
data_raw.pickle.gz # Data combined from all UMI tables of fastq files mapped to the host-virus reference
data.pickle.gz # Preprocessed ata combined from all UMI tables of fastq files mapped to the host-virus reference
metadata_dimentionality_reduction_1_1.2_.pickle.gz # Metadata of preprocessed data, after dimensionality reduction.
Access information
The PR2 database can be accessed from here: https://github.com/pr2database/pr2database
Methods
Mesocosm core setup and sampling procedure
Samples were obtained during the AQUACOSM VIMS-Ehux mesocosm experiment in Raunefjorden near Bergen, Norway (60°16′11N; 5°13′07E), in May 2018. Seven bags were filled with 11m3 water from the fjord, containing natural plankton communities. Algal blooms were induced by nutrient addition and monitored for 24 days, as previously described23. 10 samples were collected from four bags, as follows: From bag 3, on days 15 and 20 (named B3T15, B3T20 correspondingly). From bag 4, on days 13, 15,19, and 20 (named B4T13, B4T15, B4T19, and B4T20, correspondingly). From bag 6, on day 17 (named B6T17). From bag 7, on days 16, 17, and 18 (named B7T16, B7T17, and B7T18, correspondingly).
Samples were initially filtered as follows: 2 liters of water were filtered with a 20 µm mesh and collected in a glass bottle. The cells were then concentrated through gentle gravity filtration on a 3 µm polycarbonate filter (Whatman), mounted on a reusable bottle top filter holder (Thermo Fischer). The biomass on the filter was regularly resuspended by gentle pipetting.
For samples B7T16, B7T18, B4T15, B3T15, B6T17, B7T17, and B4T19, the 2 liters of seawater were concentrated down to 100 ml, distributed in two 50 ml tubes, which corresponds to a 200 times concentration. For B4T13, the concentration factor was 140 times. For B4T20 and B3T20, the concentration factor was 100 times. The different concentration factors are explained by filter clogging and various field constraints, including processing time. For all samples except B3T20, the 50 ml tubes were centrifuged for 4 min at 2500g, after which the supernatant was discarded. Pellets corresponding to the same day and same bag were pooled and resuspended in a final volume of 200 µl of chilled PBS. 1800 µl of pre-chilled high-performance liquid chromatography (HPLC) grade 100% methanol was added drop by drop to the concentrated biomass. For B3T20, the concentrated biomass was centrifuged for 4 min at 2500g, resuspended in 100 µl of chilled PBS, to which 900 µl of chilled HPLC grade 100% methanol was added. Then, samples were incubated for 15 minutes on ice and stored at -80°C until further analysis.
Library preparation and RNA-seq sequencing using 10X Genomics
For analysis by 10X Genomics, tubes were defrosted and gently mixed, and 1.7 ml of the samples were transferred into an Eppendorf Lowbind tube and centrifuged at 4°C for 3 min at 3000g. The PBS/methanol mix was discarded and replaced by 400 µl of PBS. Cell concentration was measured using an iCyt Eclipse flow cytometer (SONY) based on forward scatter. Cell concentration ranged from 1044 cells ml-1 to 9855 cells ml-1. All concentrations were brought to 1000 cells ml-1 to target 7000 cells recovery, according to the 10X Genomics Cell Suspension Volume Calculator Table provided in the user guide. The cellular suspension was loaded onto Next GEM Chip G targeting 7000 cells and then ran on a Chromium Controller instrument to generate GEM emulsion (10x Genomics). Single-cell 3' RNA-seq libraries were generated according to the manufacturer's protocol (10x Genomics Chromium Single Cell 3' Reagent Kit User Guide v3/v3.1 Chemistry) on different occasions: B4T19 and B7T17 in January 2020 and B3T15, B3T20, B4T13, B4T15, B4T20, B6T17, B7T16, and B7T18 in August 2020 with 12 cycles for cDNA amplification and 15 cycles for library amplification. Library concentrations and quality were measured using the Qubit dsDNA High Sensitivity Assay kit (Life Technologies, Carlsbad, CA). Libraries were pooled according to targeted cell number, aiming for a minimum of 20,000 reads per cell. Pooled libraries were sequenced using the NextSeq® 500 High Output kit (75 cycles).
Bioinformatic pipeline
A step-by-step description of the bioinformatic pipeline from this step onward, including all in-house scripts used, is detailed in the GitHub repository under github.com/vardilab/host-virus-pairing.
Detection of infected cells in the single-cell RNA-seq data using a custom viral genes database
To detect viral transcripts, a reference was built from a database of highly conserved genes6 from all NCLDV in the Giant Virus Database9, such as family B DNA polymerase, RNA polymerase subunits, and the major capsid protein. The genes were clustered using CD-HIT v. 4.6.6 at 90% nucleotide identity To remove redundancy43. From this database of 34866 genes, a reference was created using the 10X Genomics Cell Ranger mkref command. The Cell Ranger Software Suite (v. 5.0.0) was used to perform barcode processing (demultiplexing) and single-cell unique molecular identifier (UMI) counting on the raw reads from 47391 cells using the count script (default parameters), with the deduplicated NCLDV database as a reference. For downstream analysis, 972 cells that highly expressed multiple NCLDV genes and were considered "highly infected" were selected. These 'highly infected' cells were selected based on the following criteria: (a) cell expresses in total ≥10 viral UMIs22,24, (b) expression of more than one viral gene (>1), (c) expression of at least one gene with a UMI count greater than one (>1). Cell selection was wrapped using an in-house script (choose_cells.py).
Identifying the taxonomy of individual cells by sequence homology to ribosomal RNA
Raw reads from each cell were pulled by the cell's unique barcode identifier using seqtk v. 1.2. Reads were then trimmed (command: trim_galore --phred33 -j 8 --length 36 -q 5 --stringency 1 --fastqc -e 0.1), and poly-A was removed (command: trim_galore --polyA -j 1 --length 36), using TrimGalore (v. 0.6.5), a Cutadapt wrapper 44. Trimmed reads from each cell were assembled using rnaSPAdes 3.1545 with kmer 21,33. Raw reads pulling, trimming, and assembly was wrapped using an in-house script (assemble_cells.sh). To identify the taxonomy of the cells, assembled contigs from each cell were matched against 18S rRNA sequences from the Protist Ribosomal Reference (PR2)46 and metaPR247. To remove redundancy, the sequences in each database were clustered using CD-HIT v. 4.6.6 at 99% identity43. Contigs were filtered using SortMeRNA v. 4.3.648 with default parameters against the PR2 database and then aligned to the PR2 and metaPR2 databases using Blastn49, at 99% identity, E-value ≤ 10-10 and alignment length of at least 100 bp. Contigs were ranked by their bitscore, and only the best hit was kept for each contig. Each contig was assigned to one of the following taxonomic groups that were prevalent in the sample: the classes Bacillariophyta, Prymnesiophyceae, Chrysophyceae, MAST-3, and Katablepharidaceae, the divisions Pseudofungi, Lobosa (Amoebozoa), Ciliphora (Ciliates), Dinoflagellata and Cercozoa. Contigs that matched other groups were assigned as "other eukaryotes". Contigs that matched more than one of these taxonomic groups were considered non-specific or chimeric and were therefore ignored. This downstream analysis of Blast result was wrapped using an in-house script (Sankey_wrapper_extended.ipynb). To avoid detection of doublets and predators, Cells that transcribe 18S rRNA transcripts homologous to more than one taxonomic group were conservatively omitted. Of the 972 infected cells detected, 418 (43%) were omitted because we could not assemble specific 18s rRNA contigs from them or because their identity was ambiguous. None of the cells that were assigned "other eukaryotes" had contigs with conflicting annotations (contigs matching different classes).
Identifying the infecting virus using a homology search against a custom protein database
To identify transcripts derived from giant viruses, reads from the detected 972 infected cells were compared to a custom protein database using a translated alignment approach. To ensure that as many giant viruses as possible were represented, a database was constructed by combining RefSeq v. 20750 with all predicted proteins in the Giant Virus Database9. The proteins were then masked with tantan51 (using the -p option) and generated the database with the lastdb command (using parameters -c, -p). To identify the infecting virus, the raw sequencing reads in each of the 972 single-cell transcriptomes were compared to the constructed database using LASTAL v. 95952 (parameters -m 100, -F 15, -u 2) with best matches retained. The same procedure was done for the assembled transcripts from each cell to identify viral transcripts. The results were analyzed at different taxonomic levels, consistent with the Giant Virus Database (for giant viruses) or NCBI taxonomy33(everything else).
754 Cells whose best matching virus was coccolithovirus were omitted from the downstream analysis since EhV-infected cells were already reported to be abundant in the algal bloom25, and our analysis aims to explore other host-virus pairs.
Plotting host-virus pairs in a Sankey plot for host cells and their infecting giant viruses
Of the 218 cells detected as infected by viruses other than EhV, 71 were selected that could be identified using assembled 18S rRNA transcripts and have at least 10 reads aligned to one of the virus families (Supplementary Data 1). Only links representing at least 10% of the aligned reads in each cell are shown in order to highlight the strong links. The Sankey plot was constructed using Holoviews v. 1.15.4; see sankey_wrapper.ipynb in the GitHub repository.
Phylogenetic trees of viral and host marker genes
For phylogenetic analysis, 31 cells were chosen based on a strong correlation (≥90% of viral reads matched one virus family) between the host and a virus.
To obtain reference 18S rRNA sequences to include in a phylogeny, all transcripts assembled from these cells were compared to the PR2 database46 using BLASTN v. 2.9.0+ (parameters -perc_identity 95, -evalue 10-10, -max_target_seqs 20, -max_hsps 1). Sequences shorter than 1000 bp were removed from the reference, and the remainder of the sequences were de-replicated with cd-hit v. 4.743 (-c 0.99) to prevent the inclusion of excessive nearly identical references. Sequences were aligned with Muscle553 (default parameters), and diagnostic trees were created with FastTree 2.1.1054 for quick visualization of trees and for pruning long branches. The final phylogenetic trees were constructed with IQ-TREE v. 2.1.252 (parameters -m GTR+F+G4 -alrt 1000 -T AUTO --runs 10). To identify major capsid protein sequences in the single-cell transcriptomes, proteins were first predicted using FragGeneScanRs v. 1.1.056 (parameters -t, illumina_10). The resulting protein sequences were compared to MCP proteins in the Giant Virus Database with BLASTP v. 2.12.0+ (parameters -evalue 10-3, -max_target_seqs 20, -max_hsps 1) as well as to a custom MCP HMM that were previously designed6 using hmmsearch in the HMMER3 v. 3.3.2 package57 (E-value ≤ 10-3). The results of these searches were manually inspected, and sequences were subsequently aligned with Muscle 5 (default parameters). Similarly, as with the 18S rRNA sequences, diagnostic trees were first made with FastTree 2.1.10 and pruned long branches before making a final tree with IQ-TREE v. 2.1.2 (parameters m LG+F+G4 -alrt 1000 -T AUTO --runs 10). Cells for which transcripts are present in both viral and host trees were denoted (Supplementary Data 4). All the codes used to produce the trees are wrapped in the folder "marker_gene_trees" in the GitHub repository.
Single-cell RNA-seq data alignment to a custom reference
A new host-virus reference database was curated from the transcriptome of the infected cells (Fig. 2). Repetitive sequences were removed using BBduk (BBtools 38.90)58. An Additional long repetitive sequence was removed manually. A database of E. huxleyi and EhV genes, which were shown to be abundant in the samples25, was also added to this reference to specifically detect E. huxleyi cells and to avoid a non-specific alignment of reads from these cells to other contigs. For EhV, the predicted CDSs in the EhVM1 were used as a reference59. For the host, an integrated transcriptome reference of E. huxleyi was used as a reference60. Viral transcripts in the database were identified using a homology search against a custom protein database as described above. A reference was created from the database using the Cell Ranger mkref command. Raw reads were aligned to this reference database using 10X Genomics Cell Ranger v. 5.0.0 count analysis.
Preprocessing of transcript abundance and dimensionality reduction
A total of 28,656 cells from the 10 samples were initially aligned to the reference database. Cells with zero UMIs and cells with the lowest 1% number of UMIs, as compared with the distribution of transcripts per cell in the entire dataset, were removed for downstream analyses. To prevent cases of doublet or multiplet cells, which can be biological (cell digestion) or technical (fused cells), cells with the highest 1% number of UMIs were also removed. The raw UMIs of 28,015 cells were further preprocessed using the Python package scprep v. 1.0.10: Low expressing genes were filtered with filter.filter_rare_genes and min_cells=2. This number was chosen because we did not want to include genes mapped to only one cell, but we also did not want to exclude low-expressed genes, as they might represent gene expression of low-abundant organisms. Expression was normalized by cell library size with normalize.library_size_normalize, and the data was scaled with transform.sqrt. Preprocessing was wrapped in an in-house script; see 00.01.filter_normalize_scale_single_cell_data.py in the GitHub repository.
To represent the cells in two dimensions based on their gene expression profiles, dimensionality reduction was performed using scprep v. 1.1.0 package PCA (method='svd', eps=0.1) and Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction was conducted using the UMAP method in the manifold package of the Python library scikit-learn v. 0.24.1 (minimum distance=0.4 spread=2, number of neighbors=7). Dimensionality reduction was wrapped in an in-house script (00.02.dimentionality_reduction_single_cell_data.py).
Assigning taxonomy to each detected cell using rRNA homology search
To identify the taxonomy of each detected cell, reads from each cell were assembled independently. The taxonomy of the cells was determined by 18S rRNA homology to one of the following groups, which were abundant in the population: the classes Bacillariophyta (diatoms), Prymnesiophyceae, Chrysophyceae, MAST-3 and Katablepharidaceae, the divisions, Ciliphora (Ciliates), Dinoflagellata and Cercozoa. Other taxonomic groups were clustered under "Other eukaryotes". 16,358 cells were identified this way, and 11,657 cells that could not be identified were excluded from the plot for convenience. Cells with 18S rRNA contigs homologous to more than one taxonomic group were also conservatively omitted. As described above, cells expressing at least 10 viral UMIs were considered infected1,2. This section was wrapped in a Jupyter notebook (Coexpression_wrapper_extended.ipynb).
Identifying the Leucocryptos host and its virus using homology search
To better identify the detected Katablepharidaceae cells and to identify their infecting virus, 26 infected Katablepharidaceae cells from bag #4, day 20, were selected. Reads from these cells were retrieved using the unique molecular identifier and then trimmed using TrimGalore v. 0.6.5, a Cutadapt wrapper44. Trimming was wrapped in an in-house script; see pull_trim_clean.sh in the GitHub repository. Trimmed read files from all these cells were concatenated into one file and assembled altogether using rnaSPAdes v. 3.1545. To identify the specific Katablepharidaceae host, assembled contigs were matched against the PR2 rRNA database using blastn at 90% identity, E-value ≤ 10-10, and alignment length ≥ 100bp. Contigs best matched to an unknown Katablepharidaceae (>99% nucleotide identity), but after removing unidentified genera, these contigs best matched (>95% nucleotide identity) the Katablepharidaceae species Leucocryptos marina. Transcripts that matched classes other than Katablepharidaceae were matched against the entire NCBI database using the NCBI web server61. They, too, mostly matched Katablepharidaceae genes, specifically 28S rRNA or internal transcribed spacer (ITS) sequences (Supplementary Data 3). To identify the specific infecting virus, transcripts were matched against an NCLDV gene marker database6 at 90% identity, E-value ≤ 10-10, and alignment length ≥ 100bp. After finding homology to Leucocryptos and the virus GVMAG-M-3300020187-271, gene expression was calculated using RSEM v.1.3.162 (rsem-calculate-expression -p 10 --bowtie2 --fragment-length-mean 58). The genomic features of the virus were taken from Schulz (2020)1, and the viral genome was plotted using ShinyCircos v. 2.063. Gene expression in the plot is measured in expected counts after log 2 transformation. The relative abundance data in Fig. 4 was obtained from an 18S rRNA amplicon sequencing on a size fraction of 2-20µm in bag #4 during the mesocosm experiment23. Days 19, 22, and 23 were sampled twice; all other days were sampled once. In Fig. 4c, relative abundance is calculated per taxa as a fraction of all amplicon sequencing variants (ASV), excluding metazoans. Fig. 4d shows the fraction of Katablepharidaceae out of all ASVs matching Katablepharidaceae (excluding metazoans). E. huxleyi abundance was measured by flow cytometry based on high side scatter and high chlorophyll signals. These data were obtained from the source data of the same study23.
Phylogenetic tree of Katablepharidaceae ASVs and 18S rRNA genes
To verify the taxonomy of the ASVs, A phylogenetic tree was constructed of 89 ASVs identified as Katablepharidaceae, selected 18S rRNA sequences of Katablepharidaceae and other species from the PR2 database, and the longest single cell assembled contig from the infected Katablepharidaceae cells. Sequences were aligned with ClustalOmega v. 1.2.4 (default parameters)64. A diagnostic tree was first made with FastTree 2.1.1054 for pruning long branches before making the final tree with IQ-TREE55. All but three ASVs and one PR2 sequence clustered together with the assembled Leucocryptos transcript, verifying the taxonomy of 97% of the ASVs used in the relative abundance analysis (Extended Data Fig. 4).
Phylogenetic trees of viral heat-shock proteins and metacaspase
To examine the evolutionary history of the heat-shock proteins encoded in GVMAG-M-3300020187-27, phylogenetic trees of these proteins were constructed together with homologs present in eukaryotes, bacteria, archaea, and other giant viruses. For this, a custom database of proteins from reference genomes was compiled from EggNOG v. 5.065 (eukaryotes), bacteria and archaea (the Genome Taxonomy Database v. 95)66, and other giant viruses (the Giant Virus Database9). For bacterial and archaeal genomes in the GTDB, proteins were predicted first with Prodigal v. 2.6.367 using default parameters. Proteins were searched against Pfam models for each protein using hmmsearch with the noise cutoff (--cut_nc) and subsequently aligned sequences with ClustalOmega v. 1.2.3 (default parameters). Phylogenetic trees were constructed using IQ-TREE v. 2.1.255 (parameters m TEST -bb 1000 -T 6 --runs 10) using ultrafast bootstraps and with the best model determined with ModelFinder68. Substation matrixes used for the phylogenetic trees: Bax-1 - VT+F+R7; Metacaspase - VT+R7; HSP90 - LG+F+R10; HPS70 - LG+F+R10.