Skip to main content

Supplementary datasets for: Large-scale genome sequencing reveals the driving forces of viruses in microalgal evolution

Cite this dataset

Nelson, David (2020). Supplementary datasets for: Large-scale genome sequencing reveals the driving forces of viruses in microalgal evolution [Dataset]. Dryad.


Microalgae are integral primary producers for global ecosystems whose genomes can be mined for ecological insights, but representative genome sequences are lacking for many phyla. We cultured and sequenced 107 microalgae species from 11 different phyla indigenous to varied geographies and climates. This genome collection was used to resolve genomic differences between saltwater and freshwater microalgae. Freshwater species showed domain-centric ontology enrichment for nuclear and nuclear membrane functions, while saltwater species were enriched in organellar and cellular membrane functions. Marine species contained significantly more viral families in their genomes (p-value = 8 x 10(-4)). Viral sequences were identified from Chlorovirus, Coccolithovirus, Pandoravirus, Marseillevirus, Tupanvirus, and others integrated into algal genomes. Algal, viral-origin sequences were found to be expressed and to code for a wide variety of functions. Our results clarify the poorly characterized occurrences of viral elements in algal genomes and define a unified adaptive strategy for algal halotolerance.



Microalgal strain selection and cultivation

Cultivation, DNA extraction, and sequencing of isogenic microalgae was done in several international culture collections and sequencing centers; UTEX (Austin, TX, USA), Bigelow laboratories (NMCA culture collection center, East Boothbay, ME, USA), New York University Abu Dhabi Center for Genomics and Systems Biology (Abu Dhabi, UAE), Admera Health LCC (South Plainfield, NJ, USA), and Novogene (HK).

The UTEX strains were grown on slants using one of the following media as appropriate: BG11 Medium, Bristol Medium, Cyandidium Medium, f/2 Medium, Modified Artificial Seawater Medium, Modified Bold's 3N Medium, Porphyridium Medium, Proteose Medium, Soil Extract Medium, Trebouxia Medium, or Volvox Medium with 1.5% agar as described on the UTEX website (; grown under cool white fluorescent lights at 20⁰C on a 12 hour light cycle. For species isolated and cultured at NYUAD, f/2 medium (Lananan et al., 2013), or Tris-minimal medium (, was used (

The species chosen for sequencing were intended to represent as many microalgae phyla and as many different environments as possible. We sequenced representatives from 11 phyla (see Table S1). Most of the species were from the Chlorophyta or the Ochrophyta phyla. The project designations were algallCODE phase II (n=107, this manuscript), algallCODE phase I (n=22), NCBI-hosted (n=43), and Phytozome-hosted (n=2). Individual strain cultures were selected as representative species for their lineages or as standards to confirm workflow reproducibility (see Table S1; Dataset S1).

We emphasized maximizing the sample size of each saltwater and freshwater species (Fig. 1; Table S1). Of our initial effort to culture >150 species, 24 failed to produce sufficient biomass, six were contaminated, and 3 yielded reads inadequate for an assembly matching the expected size (Dataset S1). The de novo assemblies from the final batch of 107 sequenced species were combined with publicly-available algal genomes for downstream analyses, including coding sequence (CDS) predictions (Dataset S2), hidden Markov model (HMM)-based functional predictions, including viral and protein family domain identification, hierarchical bi-clustering (Fig. 1), enrichment analyses (Fig. 5), principal component analyses (Fig. 5), and ternary graph-based analyses (Fig. S10, Dataset S12).

The natural habitats of these microalgae include a range of diverse geographic locations and all climatic zones, with various temperatures, wind speeds, precipitation, and solar radiation. To allow the study of their evolution, we included species from different types of environments (from the arctic to the tropics) and both salt- and fresh-water habitats (Fig. 1A; Table S1). The freshwater species sequenced in this project included members of the Chlorophyta and Ochrophyta; most of the Haptophytes, Rhodophytes, and Myzozoa we sequenced were saltwater species. Most of the UTEX accessions were freshwater species (28/40); most of the NCMA accessions were saltwater species (50/57). Alexandrium andersonii, a mixotrophic dinoflagellate (1.7Gb), Heterocapsa arctica, an arctic dinoflagellate (1.3 Gb), Lingulodinium polyedra, a red-tide dinoflagellate (1.2 Gb), Amphidinium gibbosum (1.1Gb), and Karena brevis (1.0 Gb) were the largest de novo-assembled genomes in this work (Table S2).

Long-read assemblies, including those from other studies (i.e., Chromochloris zofingiensis (Roth et al., 2017), Thalassiosira pseudonana (Armbrust et al., 2004), and Chlamydomonas reinhardtii (Merchant et al., 2007)), were used to validate our high-throughput short-read assembly process. Four subtropical axenic isolates (from the United Arab Emirates) were sequenced for this study using long-read technologies, including 10x Genomics (Pleasanton, CA, USA) linked-reads, and Pacific Biosciences (Menlo Park, CA, USA) Sequel long reads. Long reads were used to validate assemblies, viral element insertions, and to resolve repeat-containing regions (Ummat and Bashir, 2014; Vondrak et al., 2019). Our results indicated that the CDSs that provided the foundational information for the comparative analyses in this manuscript were reliably determined using short reads (Illumina HiSeq X or Novoseq6000). For example, the Chromochloris zofingiensis genome is the highest quality algal genome published to date (Roth et al., 2017) and has 33,513 exons; our short-read assembly for Chromochloris zofingiensis had 33,910 exons. Other reference microalgae used in this study as resequencing standards included Thalassiosira psuedonana (Armbrust et al., 2004), Chlamydomonas reinhardtii (Merchant et al., 2007), Scenedesmus sp., Guillardia theta (Curtis et al., 2012), Fragilariopsis cylindricus (Mock et al., 2017), Coccomyxa subellipsoidea (Blanc et al., 2012), and Bigelowiella natans (Curtis et al., 2012). A comparison of the assemblies generated from the monoculture and sequencing performed in this study and previous whole-genome sequencing projects is presented in Fig. S3, and QUAST and BUSCO assembly metrics are in Table S2. Hidden Markov models were used to predict structure and function from the whole-genome sequences (Fig. 1, B–D; Tables S3,5; Datasets S6, S7). The results for functional characterization using Enzyme Commission (EC) codes (Alborzi et al., 2017; Ryu et al., 2019), Kyoto Encyclopedia of Genes and Genomes (KEGG) designations (Porollo, 2014), and Gene Ontology (GO) terms (Hayes and Mamano, 2018; Teng et al., 2017) are in Tables S6,8.

DNA extraction

DNA was extracted from mature cultures with QIAGEN DNeasy Plant Maxi kits for HiSeqX 150x2 paired-end (short read) sequencing or QIAGEN MagAttract High Molecular Weight DNA Kits (48) for long-read sequencing. DNA was quantified and assayed for integrity as per the kit manufacturer's protocol. For HMW DNA extraction, briefly, DNA concentration was measured using a Qubit Fluorometer and checked for size by pulsed-field electrophoresis. A length-weighted mean of 50-70 kb was obtained, or the sample was rejected for sequencing. See Dataset S2 for FastQC reports and Fig. S1 for the gel images showing DNA integrity. Extracted DNA with low integrity was not included in library preparations. More than 30 cultures were grown whose DNA did not meet the quality threshold; in these instances, substitute strains were chosen. The final, sequenced strains are listed in Table S1.


Genomic DNA sequencing was performed with Illumina paired-end (Illumina, San Diego, CA, USA), PacBio Sequel (Pacific Biosciences, Menlo Park, CA, USA), and 10x Genomics linked-reads, where indicated, (10x Genomics, Pleasanton, CA, USA) to enable reliable coverage, contig assembly, and de novo genomic sequence assembly. For Illumina paired-end sequencing, Nextera 2x150 bp libraries (Illumina, San Diego, CA, USA) with approximately 72 million reads per sample passing quality filters (Dataset S2) were used for sequencing with a HiSeqX ( All reads are uploaded to the National Center for Biotechnology Information (NCBI) under the Bioproject accession PRJNA517804). The target coverage was 100x on a 100 Mbp genome. Quality control for library preparation for Illumina sequencing was done with Qubit, Tapestation (Fig. S1), and qPCR. Combining these technologies assisted the validation of VFAM placement within selected genomes and ensured reliable assembly.

De novo genome assembly

De novo assembly can produce variable output depending on the source species and software used; we used both ABySS 2.0 (Jackman et al., 2017) and the Platanus (Kajitani et al., 2019) pipelines for each species sequenced with short-reads (Illumina HiSeqX (Illumina, San Diego, CA, USA)). The ABySS 2.0 command was: 'unset SLURM_NTASKS && mkdir -p $READFILE && TMPDIR=/tmp ABySS-pe -j 18 lib=pe1 k=64 name=$READFILE pe1='$READFILE R1_001.fastq.gz $READFILE R2_001.fastq.gz' --directory=/ data/analysis/ABySS_pe/$READFILE'. The Platanus commands were: '/platanus assemble -o $READ.OUT -f $READ-1.trimmed $READ-2.trimmed -t 4 -m 72 2>assemble.$READ.log'. Details of all YML workflows used are in Dataset S3 and the Key Resources Table lists all essential software used in the creation and analysis of these genomes.

The output with the most single-copy, universally-conserved orthologs, according to “Based on evolutionary-informed expectations of the gene content of near-universal single-copy orthologs,” BUSCO, was chosen for subsequent analyses (Table S2, Dataset S2). This step produces some bias, as ABySS produced assemblies that were much closer in size to the estimated genome sizes from close relatives. For example, Alexandrium andersonii, a dinoflagellate expected to have a large (>1Gb) genome, yielded an ABySS assembly of 1.6Gb, whereas its Platanus assembly was only 302 Kb. The final selected assemblies are in Dataset S1; they are designated with '.AAC' extensions (algallCODE), and previously published genomes are included with a '.PRE' extension. BUSCO (Waterhouse et al., 2017) and QUAST metrics (Table S2) were used to select the higher quality assembly for downstream analyses (Dataset S1). All versions of assemblies are in Dataset S1. The genome assemblies presented in this work are draft versions; they may be updated or edited in subsequent versions or as per data repository regulations.

For ~ 50% of the assemblies, Platanus yielded higher BUSCO and QUAST scores. However, Platanus was unable to assemble several genomes (see: Platanus assemblies, Dataset S1), producing a final assembly significantly smaller than expected for the target species. For these unassembled genomes, and others assembled with Platanus having low BUSCO scores, the ABYSS 2.0 assembly was selected for downstream analysis. For long-read assemblies, PacBio Sequel reads, SMRT sequencing data or 10X Genomics Chromium linked reads were used. PacBio long reads were assembled using the Hierarchical Genome Assembly Process (HGAP) and long-read polishing ( to produce genome assemblies with >99% accuracy. 10X Genomics linked reads were assembled using Supernova (v2.0.1, using the megabubbles de novo assembly parameter.

Artificial degradation experiments

The genomes’ BUSCO scores varied by genome; thus, we assessed the robustness of our assemblies and downstream comparative genomics analyses through the downsampling of the assembled genomic reads. We downsampled FASTA records from the CDSs to 25 thousand randomly selected sequences; these were used as input for the same downstream analyses we applied to the original genome set. This experiment addresses the effects of genome size and possible quality degradation on our primary conclusions. We found that comparative analyses using these downsampled genomes could reproduce our principal findings regarding the differences of microalgae from different clades and environments.

The downsampling normalized for set numbers of FASTA records and addresses potential biases introduced by genome size or overall quality differences. Our approach is similar to the rarifying approach to genome normalization. From a computational perspective, the rarifying approach is essentially the opposite process but also aims to standardize input sequences in forcing sequences to a selected number by stripping away overflow sequences. Overall, this resulted in a 74% decrease in total sequences as input for our comparative pipelines. This reduction in data is equivalent to 26% of predicted sequences, well below our average BUSCO completeness scores (see Fig. S2). This technique has the added benefit of normalizing for genome size by feature scaling. For example, total gene counts are ignored, and ‘whale’ species (e.g., Dinoflagellates) have less impact on the conclusions. Gene content is summarized as a percentage of the genome, but fewer overall genes are used, and potentially essential genes are lost. All of the computational analyses involved in this downsampled genome set are included in Dataset S2.

Downsampled genomes could yield the conclusion that saltwater species had more VFAMs in their genomes on average (~2x, average= 946 VFAMs in saltwater species, 510 VFAMs in freshwater species (P=0.00096 in a one-tailed t-test)). The same unique GO term enrichment trends were seen after downsampling (see Fig. 5). For example, four GO terms involved in slowing cell cycle progression were uniquely enriched in the freshwater species’ downsampled genomes (FDR P-value<0.003 for all terms). The GO term ‘negative regulation of cell cycle’ was uniquely enriched in freshwater species in the original (FDR P-value=5.3e-3) and the downsampled genome sets (FDR P-value= 1.56e-3).

Saltwater species were uniquely enriched for membrane-related functions, including vesicle formation, in the original (FDR P-value=5.5e-3 (Table S9)), and downsampled (FDR P-value=2.46e-3) genome sets. The GO term GO:0015291 for active membrane transport activity was also uniquely enriched in the downsampled saltwater genome set (FDR P-value=2.38e-3), which is consistent with our conclusions of uniquely enriched membrane processes in saltwater species (Table S9). Finally, the 25% sequence retention downsampled genome set also showed less uniquely enriched GO terms in saltwater species than freshwater species (28 salt / 116 fresh unique GO terms in the downsampled set vs. 56 salt / 151 fresh unique GO terms in the original set). Our primary conclusions did not lose significance in these artificial data degradation experiments.

Contamination screening

All cultures in this study were routinely defined by their transfer cycle, examined by microscopy to ensure that they were unialgal and were the intended algal species and that there is no overt evidence of bacterial contamination. Axenic strains were regularly tested using an internal bacterial test media (f/2 + bactopeptone and methylamine HCl; Marine (MTM) and freshwater (FTM), both of which contain bactopeptone and malt extract protocols to ensure that they remain axenic. None of the assemblies included in the manuscript were determined to be non-axenic at the culturing phase.

In the computational phase, genomic assemblies were screened for bacterial contamination by examining G+C% distributions and BLAST hits to common contaminants. Assemblies without visible contamination, apparent from abnormal G+C% distributions or high numbers of BLASTP hits with contaminant species (Table S2, 'contamination report' sheet), were considered to be axenic. The BLASTP search also included the 62 previously published genomes included in the study, and all of these genomes were found to have multiple hits to bacteria; thus, genomes having more bacterial BLASTP hits than the well-curated Emiliania huxleyi genome (Martinez Martinez et al., 2011; Sheyn et al., 2018; Trainic et al., 2018; Vardi et al., 2012) were considered to be contaminated. This included Entomoneis spp, Prorocentrum minimum, Cladosiphon okamuranus, Coelasterella sp. M60, Rhodella maculata, Dunaliella sp. RO, Halamphora MG11, Aureococcus anophagefferens, Prymnesium polylepis, Chloromonas AAM2, and Rhodomonas salina.

The Diamond BlastP command was: 'diamond blastp --db uniref100 --query $GENOME.aa.fa --out $GENOME.diamond.blastp.txt --more-sensitive --threads 4 -k 1 --outfmt 6 qseqid qlen sseqid sallseqid slen qstart qend sstart send qseq sseq evalue bitscore score length pident nident mismatch positive gapopen gaps ppos qframe btop stitle salltitles qcovhsp qtitle'. The remaining assemblies were presumably axenic and used for downstream comparative analyses.


Genome annotation

Hidden Markov Models were used to identify repeats (Repeatmasker (Tempel, 2012), coding sequences (CDSs,, protein families (PFAMs, ), and viral families (VFAMs, (Skewes-Cox et al., 2014)). This workflow uses Bayesian classification architecture based on amino acid sequence alignment from a subset of model organisms; thus, the predictive capacity of this dataset is limited by the sample size used to create HMM models. An advantage to this type of predictive approach is that all samples are treated equally, and biases from different annotation methods are eliminated.

Overall, 102 marine strains, 62 freshwater strains, and ten strains that grow in both environments (i.e., euryhaline) were included in comparative analyses. Genomes assembled in this project (algallCODE, n=107), and the downloaded genomes (n=67), were assessed according to genome completeness using N50, total contig number, and integrity of genes from near-universally-conserved ortholog groups (BUSCO, Bench-Marking Single Copy Orthologs (Waterhouse et al., 2017)).

Annotations for the 174 genomes were based on the unsupervised machine learning algorithms in HMMer (, with the exception of Diamond BLASTP search results. For example, HMMsearch results for VFAM HMM alignments are in Table S4. The 174 combined genomes were used as input for RepeatMasker (Tempel, 2012) and SNAP ( RepeatMasker was run with a RepeatMasker Combined Database: Dfam Consensus-20170127 run with rmblastn version 2.6.0+. RepeatMasker results are in Dataset S4 with the VFAM results. SNAP was run with an Arabidopsis thaliana HMM for all species and provided the base CDSs. The SNAP command was: 'cd /snap/ && ./snap A.thaliana.hmm $GENOME.fa -aa $GENOME.aa.fa -tx $ -gff -quiet -name $GENOME > $GENOME.gff'. The SNAP-produced CDSs were used in downstream comparative analyses (e.g., PFAM prediction). Multiple newly sequenced genomes were predicted to be diploid or polyploid, and PFAM domains from identical sequences were not predicted to be duplication; however, heterozygous genes yielded additional PFAM counts.

HMMsearch was done with these CDSs against the PFAM-A-v31.1 database. The command was: 'hmmsearch --noali -E 0.000000001 --cpu 28 --domtblout $OUT $IN.aa.fa'. The HMMer user guide ( provides in-depth descriptions of the reported values, including sequence coordinates for alignment and matches with HMM models, E-values, percent identity, and bias. Protein families from HMMsearch (E-value < 1e-9, see Dataset S3) were grouped according to whether they belonged to freshwater or saltwater species (see Table S1), divided by the number of species in that group (freshwater n =102; saltwater n= 62), then compared across groups to determine the relative abundance of each PFAM domain in saltwater v. freshwater species. Euryhaline species were included as saltwater species where indicated. Exclusion analyses (e.g., ternary graphs)did not include euryhaline species.

Each subgroup of CDSs was individually used for the diamond blast command /software/diamond/diamond blastp --db /db/nrD.dmnd --query ./$seqs.aa.fa --out FUSTr-“$”.diamond.blastp.txt --more-sensitive --threads 24 -k 11 --outfmt 6 qseqid qlen sseqid sallseqid slen qstart qend sstart send qseq sseq evalue bitscore score length pident nident mismatch positive gapopen gaps ppos qframe btop stitle salltitles qcovhsp qtitle. These results were used to inform further investigations into inheritance patterns, including horizontal gene transfers (HGTs) for selected genes.

Homology analysis with the Basic Local Alignment Search Tool (BLAST, was performed with the accelerated Diamond Blast algorithm (Buchfink et al., 2015.). The Diamond BlastP command was: 'diamond blastp --db uniref100 --query $GENOME.aa.fa --out $GENOME.diamond.blastp.txt --more-sensitive --threads 4 -k 1 --outfmt 6 qseqid qlen sseqid sallseqid slen qstart qend sstart send qseq sseq evalue bitscore score length pident nident mismatch positive gapopen gaps ppos qframe btop stitle salltitles qcovhsp qtitle'. In a BLASTP search using a cutoff of 1e-10, species queried had from 10%-90% of their proteins with significant hits (See Dataset S11).

EC designations were obtained using PFAM associations described in the EC domain-miner manuscript (Alborzi et al., 2017). The association list is a downloadable file, and EC designations were obtained with the command: 'while read l; do grep -w $l $PFAM.txt > $EC.out' and the resultant EC designations are in Dataset S10. KEGG designations were obtained from EC IDs using EC2KEGG (Porollo, 2014). The EC codes were used to retrieve KEGG terms using EC2KEGG (Porollo, 2014). Briefly, the scripts in this package use 'libwww-perl' (for internet communication with KEGG,, Text-NSP (for computing the Fisher exact test,, and 'Statistics-Multtest' (for correcting p-values on multiple hypotheses testing, Shared and unique predicted KEGG pathways for saltwater and freshwater species, and an expanded description of the genes involved in these pathways is in Fig. S9.

Dataset S10 informed the pathway comparison in Fig. S9 using iPath3.0 ((Darzi et al., 2018),; EC numbers for PFAMs with higher multiplicity in either saltwater or freshwater species are in Table S7. Nodes in the iPath diagrams represent enzymatic reactions with corresponding EC designations. For the iPATH analysis, EC designations were uploaded to the iPath website (, and primary metabolism, secondary metabolism, and metabolism associated with microbes from diverse environments maps were used to generate the images in Fig. S9.

Domain-centric GO enrichment analysis

Gene Ontology (GO) names and enrichment scores were obtained using dcGO (, (Fang and Gough, 2013a, b)) from CDSs, EVOPs, or genes under positive selection with significant (e=1e-9) hits against the PFAM-A database. Algorithms used in dcGO convert Z-scores and FDR-corrected P-values were calculated based on hypergeometric distributions.

Shared and unique enriched GO terms were delineated with InteractiVenn ( All supplementary files with the '.ivenn' extension can be uploaded to, and gene lists corresponding to the displayed numbers are provided as copiable text (see Supplementary Datasets).

Validation using reference genomes

Cultures of microalgal species that have previously undergone whole-genome sequencing have provided much insight into microalgal evolution and ecology (see references in Table S1). Cultures were grown of reference species to validate the sequencing and assembly workflows. Estimates of size (Dataset S1), coding sequences (CDSs, Dataset S2,3), aligned homologs in known species, repeat composition (Dataset S4) were similar for our reconstructed genomic assemblies and the NCBI-hosted reference genomes (Fig. S2, Table 1). The similarity across sequencing and bioinformatics pipelines suggests reproducibility for whole-genome sequencing and their functional inferences. All genomes, including those sequenced in this project (algallCODE, .AAC), and those downloaded (previously sequenced, .PRE) were pooled for parallel CDS predictions (Dataset S1) and used for downstream analyses.

The genome of Chromochloris zofingiensis, sequenced because of its capacity for oil and pigment production, is regarded to be the highest quality genome currently available (Roth et al., 2017). We compared the functional predictions inferred from this genome with our re-sequenced de novo assembly for validation and found that their HMM profiles were identical in their regards to matches with the PFAM-A and the VFAM (viral family, available at, (Skewes-Cox et al., 2014)) databases. Other assemblies included in re-sequencing validation controls included Micromonas pusilla and Galdieria sulfaria (Fig. S3). Our assemblies were nearly identical in their HMM matches to public databases (Fig. 1, Datasets S3 and S5). The similarity of these HMM matches confirms the robustness of the technical aspects of our sequencing and genomic assembly and the use of HMM profiles for gene function prediction. All of the assemblies resulting from our resequencing yielded highly similar counts regarding gene content, G+C% and size estimates, and functional domain prediction, including BUSCO results and BLAST hits.

Amino acid residue comparisons

Translated coding sequences produced as described in the 'Genome Annotation' section were used. The SNAP program adds asterisks to indicate a peptide terminal; these and headers were removed, and the twenty amino acids were quantified and classified into salt or fresh based on their designations in Table S1. Counts for each amino acid were divided by the total number of residues within each species, then pooled counts were divided by the number of species in each group (salt n=100; fresh n=61). The compositions of multiple amino acids were significantly different between these two groups, as indicated in Fig. 5 legend. A two-tailed Student's t-test with Welch correction was used for each amino acid.

Dimension reduction analyses

PFAM dCNs from total CDSs, genes under positive selection, and endogenous, viral-origin PFAMs (EVOPs)) in the microalgal genomes were used as inputs for principal component analysis (PCA) in ClustVis ( Loadings and scores, relative to Fig. 2, are in Table S8. Marine and terrestrial species were observed to separate from PC1; thus, scores contributing to this separation were inferred to contribute to the phenotypes of the respective, clustered species.

T-distributed neighbor embedding (t-SNE) and uniform manifold approximation projection (UMAP (McInnes et al., 2018) ( used PFAM dCNs as in the PCAs using the TensorFlow projector online tool available at Briefly, PFAM counts were used as vectors for n=174 species in 4,217 different PFAMs. PFAM descriptions were used to trace accessions and functions. Embeddings in Fig. S5 were obtained using eight nearest neighbors in high dimensional space, according to (McInnes et al., 2018). For Fig. S5 panels B-D, the point with the largest text was selected to reveal the nearest neighbors, annotated with smaller text, in high-dimensional space.

Ternary graphs

Ternary plots were graphed in Plotly ( Briefly, PFAM counts were used as variable criteria in the groups described in Fig. S10. Values for the combined groups were weighted as proximity to a corner such that higher counts minimized distance. Genes with an expression value of zero in one out of three groups were found along edges and considered to be excluded from the other two groups.

Positive selection analysis

Sixty genomes from small microalgae, 30 freshwater, and 30 saltwater species, and their CDSs and amino acid sequences were used in the FUSTr workflow (Cole and Brewer, 2018) to calculate dS/dN ratios for genes and their phylogenies (see Dataset S13). This multi-software computational pipeline is similar to PAML in its scope of analysis. Briefly, FUSTr finds genes in transcriptomic datasets under strong positive selection and automatically detects isoform designation patterns in transcriptome assemblies to maximize phylogenetic independence in downstream analysis. Non-nucleotide characters were detected and removed from sequences. Header patterns are examined to identify isoforms and eliminate redundancies. Coding sequences are extracted from transcripts with Transdecoder v3.0.1 (Haas et al., 2013). For the purposes of this manuscript, the longest open reading frame (ORF) for the transcript under consideration was used. Then, homologies in peptide sequences are assessed with BLASTP in DIAMOND (v.0.9.10, (Buchfink et al., 2015)) with an e-value cutoff of 10−5. This homology matrix is parsed into putative gene families with SiLiX (v.1.2.11) transitive clustering. SiLiX has been shown to be faster, have more efficient memory allocation, and reduce domain chaining compared to other clustering algorithms such as MCL (Miele et al., 2011). Predicted peptides included in families had 35% minimum identity, 90% minimum overlap, at least 100 amino acids.

Multiple amino acid sequence alignments are generated by the software selected by MAFFT v7.221 (Katoh et al., 2017). Hanging sequence ends were removed with Trimal v1.4.1’s gappyout algorithm (Capella-Gutierrez et al., 2009). Phylogenies of each family’s untrimmed amino acid multiple sequence alignment are reconstructed using FastTree v2.1.9 (Price et al., 2010). Trimmed multiple sequence codon alignments are then generated by reverse translation of the amino acid alignment using the CDS sequences. The final output shows detected gene families that are under strong selection and the average dN/dS per family. Details per codon position within family alignments include: alpha mean posterior synonymous substitution rate at a site; beta mean posterior non-synonymous substitution rate at a site; mean posterior beta-alpha; posterior probability of negative selection at a site; posterior probability of positive selection at a site; Empirical Bayes Factor for positive selection at a site; potential scale reduction factor; and estimated effective sample site for the probability that beta exceeds alpha.

For our FUSTr analysis, a total of 4,403,010 CDSs were used as input; CDSs and predicted proteins from these genomes were organized into 2,991,159 families; 792 had >15 sequences per family. The FUSTr analysis detected 335 families under strong positive selection; 401,500 PFAMs (HMMsearch, e<0.0000000001, n = 60 species) were identified from genes in these families as being under strong positive selection. The FUSTr dataset generated for this project includes 792 phylogenetic trees and includes 13,841 VFAM-CDSs and other proteins with strong positive selection signatures.

Viral family sequence identification

A caveat in using expression-only (RNA) data to describe VFAMs is the possibility that the assembled RNA sequence was contamination and not expressed from the microalgal genome. Here, long-read sequencing technologies, including 10x Genomics (Pleasanton, CA, USA) linked-reads, and Pacific Biosciences (Menlo Park, CA, USA) Sequel long reads were combined to ensure scaffold integrity. In contigs with high coverage of long reads, VFAMS were interspersed with known microalgal genes in nuclear genome contigs (Fig. S6). Synteny was maintained in a manner that suggested integrations of VFAM sequences in core microalgal assembled contigs. In addition to our long-read validations, we used high-quality, previously published, microalgal genomes that were also created from long reads such as Chromochloris zofingiensis (Roth et al., 2017) and confirmed that VFAMs were integrated into these published chromosomes.

Viral family sequences within genomes were discovered using HMMs built from Markov clustering and multiple sequence alignments (Skewes-Cox et al., 2014) from all of the virally annotated proteins in RefSeq as of 2014. These VFAMs can detect viral sequences with high accuracy and did not cluster non-viral sequences into the Markov clusters in the published test sets. We used a strict 1e-9 E-value to call VFAM domains in the translated CDS (tCDS) from the microalgal genomes. Altogether, 91,757 distinct VFAM tCDSs were discovered and extracted using in-house scripts. The VFAM HMMsearch results, extracted tCDSs, and the scripts used to process them are in Datasets S3 and S6. Extracted tCDS were used as input for dimension reduction analyses, including those described in Figs. 2-4, BLAST, dcGO enrichment based on clade or environment (see Dataset S13), and DFAM searches to find transposable elements (Dataset S5). VFAM-CDSs were used as input for HMMsearch against the Dfam repeat database (DFAM.HMM) to search for transposable elements and 720 out of 91,757 of these sequences (0.78%) were predicted to code for transposon elements (E-value=1e-9).

After establishing the presence of VFAMs in microalgal chromosomes, we validated their presence in expressed genes. Transcriptomic data from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP, (Keeling et al., 2014)) was downloaded from VFAM predictions we performed on the MMETSP dataset are in Dataset S9. From HMM-based VFAM detection performed in the same manner as with the predicted CDSs from de novo and downloaded assemblies, MMETSP de novo transcriptomes were confirmed to contain substantial portions of VFAMs. Briefly, HMMsearch was done with transcriptome fasta files against the PFAM-A-v31.1 database. The command was: 'hmmsearch --noali -E 0.000000001 --cpu 28 --domtblout $OUT $IN.aa.fa'. The HMMer user guide ( describes the reported values, including sequence coordinates for alignment and matches with HMM models, E-values, percent identity, and bias.

Additional confirmation for the viral origins of VFAM HMM-identified sequences was carried out by examining BLAST results, extracting high-confidence hit sequences from public repositories, and performing alignments and maximum likelihood phylogenies (see Figs. S7-8). The BLAST command for the 91,757 VFAM-CDSs was ‘diamond blastp --db nrD.dmnd --query all_vfam_extracted.aa.fa --out all_vfam_extracted.diamond.blastp.txt --more-sensitive --threads 4 -k 10 --outfmt 6 qseqid qlen sseqid sallseqid slen qstart qend sstart send qseq sseq evalue bitscore score length pident nident mismatch positive gapopen gaps ppos qframe btop stitle salltitles qcovhsp qtitle”. Altogether, BLAST results were inspected to discern phylogenies for the viral sequences. The top 25 hits were analyzed for phylogeny and Biosample location. Further confirmation of the likelihood of HGT events was instilled if multiple high-confidence hits came from unrelated lineages but shared a similar habitat.

Alignments were performed using MUSCLE with 100 iterations and with using diagonal sequence alignments to detect islands of homologous sequence with greater distance. Maximum likelihood phylogenies using these alignments and a neighbor-joining tree construction method. The substitution model for branch reconstruction was the Whelan and Goldman (WAG) matrix (Whelan and Goldman, 2001) with a 2.0 transition/transversion ratio. Rate variation was included in likelihood calculations under four categories and a gamma distribution parameter of 1.0.

A caveat to this study is that we cannot conclusively attribute every VFAM as having been directly transferred from a virus to the algae containing that VFAM in its genome, although recent studies have shown a high likelihood of ultimate viral origin (i.e., de novo gene generation) (Legendre et al., 2018). Additional evidence supporting this scenario has been presented in (Graves et al., 1999; Hron et al., 2018; Romani et al., 2013; Schvarcz and Steward, 2018; Seligmann, 2018; Slimani et al., 2013; Sobhy, 2018; Sun et al., 2015; Watanabe et al., 2018).


New York University Abu Dhabi Institute, Award: 73 71210 CGSB9