A chromosome-level genome for the nudibranch gastropod Berghia stephanieae helps parse clade-specific gene expression in novel and conserved phenotypes
Data files
Dec 22, 2023 version files 8.74 GB
Abstract
How novel phenotypes originate from conserved genes, processes, and tissues remains a major question in biology. Research that sets out to answer this question often focuses on the conserved genes and processes involved, an approach that explicitly excludes the impact of genetic elements that may be classified as clade-specific, even though many of these genes are known to be important for novel, or clade-restricted, phenotypes. This is especially true for understudied phyla such as mollusks, where limited genomic and functional biology resources for members of this phylum has long hindered assessments of genetic homology and function. To address this gap, we constructed a chromosome-level genome for the gastropod Berghia stephanieae (Valdés, 2005) to investigate the expression of clade-specific genes across both novel and conserved tissue types in this species. The final assembled and filtered Berghia genome is comparable to other high quality mollusk genomes in terms of size (1.05 Gb) and number of predicted genes (24,960 genes), and is highly contiguous. The proportion of upregulated, clade-specific genes varied across tissues, but with no clear trend between the proportion of clade-specific genes and the novelty of the tissue. However, more complex tissue like the brain had the highest total number of upregulated, clade-specific genes, though the ratio of upregulated clade-specific genes to the total number of upregulated genes was low. Our results, when combined with previous research on the impact of novel genes on phenotypic evolution, highlight the fact that the complexity of the novel tissue or behavior, the type of novelty, and the developmental timing of evolutionary modifications will all influence how novel and conserved genes interact to generate diversity.
README
This README.txt file was generated on 2023-08-01 by Jessica Goodheart
GENERAL INFORMATION
- Title of Dataset: Data from: Phylogenomics suggests that larvae are ancestral in polyclads, but not homologous to the trochophore.
- Author Information Corresponding Investigator Name: Dr Jessica A Goodheart Institution: Division of Invertebrate Zoology, American Museum of Natural History, New York, NY, 10024, USA Email: jgoodheart@amnh.org <br> Co-investigator 1 Name: Robin A. Rio Institution: Bioengineering Department, Stanford University, Stanford, CA, USA <br> Co-investigator 2 Name: Neville F. Taraporevala Institution: Quinney College of Natural Resources, Utah State University, Logan, UT, USA <br> Co-investigator 3 Name: Rose A. Fiorenza Institution: Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, USA <br> Co-investigator 4 Name: Seth R. Barnes Institution: Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, USA <br> Co-investigator 5 Name: Kevin Morrill Institution: Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, USA <br> Co-investigator 6 Name: Mark Allan C. Jacob Institution: Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, USA <br> Co-investigator 7 Name: Cal Whitesel Institution: Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, USA <br> Co-investigator 8 Name: Park Masterson Institution: Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, USA <br> Co-investigator 9 Name: Grant O. Batzel Institution: Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, USA <br> Co-investigator 10 Name: Hereroa T. Johnston Institution: Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, USA <br> Co-investigator 11 Name: M. Desmond Ramirez Institution: Department of Biology, University of Massachusetts Amherst, Amherst, MA, USA <br> Co-investigator 12 Name: Paul S. Katz Institution: Department of Biology, University of Massachusetts Amherst, Amherst, MA, USA <br> Co-investigator 13 Name: Deirdre C. Lyons Institution: Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, USA Email: d1lyons@ucsd.edu
- Date of data collection: 2019-2022
- Recommended citation for this dataset: Goodheart et al. (2023), Data from: A chromosome-level genome for the nudibranch gastropod Berghia stephanieae helps parse clade-specific gene expression in novel and conserved phenotypes
DATA & FILE OVERVIEW
- Description of dataset
These data were generated to produce a genome for the nudibranch species Berghia stephanieae and investigate the expression of novel (or clade-restricted) genes in novel and conserved tissues. A major question in biology is how novel phenotypes originate from conserved genes, processes, and tissues. Research that sets out to answer this question often focuses on the conserved genes and processes involved, an approach that explicitly excludes the impact of genetic elements that may be classified as clade-specific, even though many clade-specific genes are known to be important for novel, or clade-restricted, phenotypes. This is especially true for understudied phyla such as mollusks, where limited genomic and functional biology resources for members of this phylum has long hindered assessments of genetic homology and function. To address this gap, we use a new chromosome-level genome for the gastropod Berghia stephanieae (Valdés, 2005) to investigate the expression of clade-specific genes across both novel and conserved tissue types in this species. The final assembled and filtered Berghia genome is comparable to other mollusk genomes in terms of size (1.05 Gb) and number of predicted genes (24,960 genes), and is highly contiguous. The proportion of clade-specific genes upregulated varied across tissues, with novel tissues like the distal ceras unexpectedly expressing a fairly average frequency of clade-specific genes compared to other tissues. By contrast, more complex tissue like the brain had the highest total number of upregulated, clade-specific genes, though the proportion of upregulated that were clade-specific was low. Our results, when combined with previous research on the impact of novel genes on phenotypic evolution, highlight the fact that the complexity of the novel tissue or behavior, type of novelty, and developmental timing of evolutionary modifications will all influence how novel and conserved genes interact to generate diversity.
- File List: <br> File 1 Name: 1_genome_filtering.zip File 1 Description: Important input, intermediate, and output files for our Berghia genome filtering analyses prior to annotation. File 1 Subdirectories: File 1.1: Berghia_Apr2021_purged_blobtools File 1.1 Description: Directory of Blobtools json files from analysis post-purgedups and prior to filtering. File 1.2: initial_assembly File 1.2 Description: Output files from the initial Dovetail WTDBG2 assembly, including asm.cns.fa, filtered.asm.cns.fa, hap.fa, and purged.fa. <br> File 2 Name: 2_repeatmasking.zip File 2 Description: Output files from both hard and soft masking repeatmodeler analysis for Berghia stephanieae. File 2 Subfiles: File 2.1: Bsteph_ref.fa.no_tes.fa, a reference proteome file from an assembled Berghia reference transcriptome using Illumina short read data. Files 2.2: Berghia_alltissues_onerep_trinity291_transdecoder_cdhit95_noaliens_fulltranscripts_novectors_nocontaminants.fasta.transdecoder., coding sequence and aa sequence files for the proteome predicted from RNAseq data.Files 2.3: Berghia_Apr2021_hirise_purged.filtered.fasta_(hard/soft). are the outputs from repeat identification analyses. Files 2.4: Berghia_Apr2021_hirise_purged.filtered.fasta.*masked are the hard masked and soft masked Berghia genomes used for RNAseq mapping and in the BRAKER2 analysis. <br> File 3 Name: 3_braker_prediction.zip File 3 Description: Various input and output files from the Berghia genome BRAKER2 run. File 3 Subfiles: File 3.1: Berghia_Apr2021_hirise_purged.filtered.fasta_edited.softmasked is the input genome for BRAKER2. File 3.2: mollusca_odb10_with_berghiabuscos.fasta contains the input protein sequences from Berghia. Files 3.3: hintsfile.gff and augustus.hints.* are the output gtf, gff3, aa, and codingseq files from the original unfiltered BRAKER2 run. Files 3.4: augustus.anysupport.* are the output gtf and gff3 files post-filtration. Files 3.5: augustus.hints.anysupport.* are the output aa and codingseq files post-filtration. Files 3.5: .proteins.v4.0.5.txt are the BUSCO output files for both the filtered and unfiltered predicted protein files.File 3.6: filter_seqs.sh is the script used to filter the BRAKER2 output.Files 3.7: Bs_augustus_RM_iso_.pdf are reports detailing the statistics for different sets of predictions filtered using File 3.6 (filter_seqs.sh). <br> File 4 Name: 4_functional_annotation.zip File 4 Description: Output files from functional annotation runs with the filtered data set, including BLASTP and InterProScan analyses. File 4 Subfiles: Files 4.1: berghia_RM_iso_anysupport_ipscan_2021_11.* are the output files from InterProScan. Files 4.2: blastp.*.outfmt6 contain the direct outputs from BLASTP runs comparing the Berghia proteome with databases (RefSeq and UniProt). File 4.3: Bs_protein-blasthit-ALL-info.txt is the full functional annotation table that combines data from the RefSeq and UniProt BLASTP runs. <br> File 5 Name: 5_OF_proteomes.zip File 5 Description: Input proteomes for our OrthoFinder analysis. <br> File 6 Name: 6_orthofinder_results.zip File 6 Description: Output files from our OrthoFinder analysis. File 6 Subdirectories: Subdirectory 6.1: Comparative_Genomics_Statistics/ includes output files describing the comparative genomics statistics from the analysis Subdirectory 6.2: Gene_Duplication_Events/ includes output files describing the gene duplications statistics from the analysis Subdirectory 6.3: Gene_Trees/ includes output gene tree files for each orthogroup Subdirectory 6.4: Orthogroup_Sequences/ includes output fasta files containing the sequences in each orthogroup. Subdirectory 6.5: Orthogroups/ includes output files describing the composition of the orthogroups from the analysis Subdirectory 6.6: Orthologues/ includes output files identifying orthologues from all vs. all comparisons Subdirectory 6.7: Phylogenetic_Hierarchical_Orthogroups/ includes output files identifying orthogroups for each node Subdirectory 6.8: Phylogenetically_Misplaced_Genes/ includes output files identifying phylogenetically "misplaced" genes from each proteome Subdirectory 6.9: Putative_Xenologs/ includes output files identifying putative xenologs from each proteome Subdirectory 6.10: Resolved_Gene_Trees/ includes the resolved output gene tree files for each orthogroup Subdirectory 6.11: Species_Tree/ includes the species tree files from the analysis <br> File 7 Name: 7_kinfin.zip File 7 Description: Contains the input and output files from our KinFin analysis. File 7 Subdirectories: Subdirectory 7.1: berghia_sept2023.kinfin_results/ includes output files from the KinFin analysis <br> File 8 Name: 8_differential_expression.zip File 8 Description: Contains the input and output files from our differential expression analyses across clades and tissues. File 8 Subdirectories: Subdirectory 8.1: inputs/ contains all of the inputs for these analyses, except the read counts files. Subdirectory 8.2: read_files/ contains all of the read counts files for each transcriptome dataset. Subdirectory 8.1: outputs/ contains all of the outputs for these analyses <br> File 9 Name: 9_final_genome_files.zip File 9 Description: Contains the final genome and proteome fasta files, along with gtf and gff annotation files. <br> File 10 Name: 10_scaffold_names.txt File 10 Description: Contains the sequence names that correspond to the final assigned chromosomes and scaffolds in the genome file.
METHODOLOGICAL INFORMATION
Sample preparation and genome sequencing
We isolated one Berghia juvenile from the Lyons lab culture prior to mating to minimize genomic contamination. While isolated, we fed the animal ~½ of a medium Exaiptasia diaphana (defined by Taraporevala et al. 2022) each day for 34 days. We then starved the animal for 44 days prior to shipping. To minimize residual food in the gut diverticula, cerata were removed with forceps and the remaining body was blotted on a kimwipe to remove excess water, then the animal was placed in a cryotube and flash frozen in liquid nitrogen and stored at -80 until shipping to Dovetail Genomics (now Catana Bio, Scotts Valley, CA). Dovetail Genomics used an input of ~101mg into a slow CTAB protocol to extract high molecular weight DNA. They measured the efficiency of DNA extraction using a Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA) High Sensitivity Kit. Overall, they obtained 12.1 ug of high molecular weight DNA. They then used a Mini Column for clean up and resuspended the pellet in 75 µl TE. They then quantified DNA samples using the Qubit. They constructed the PacBio SMRTbell library (~20kb) for PacBio Sequel using SMRTbell Express Template Prep Kit V 2.0 (PacBio, Menlo Park, CA, USA) using the manufacturer recommended protocol. They then bound the library to polymerase using the Sequel II Binding Kit 2.0 (PacBio) and loaded onto PacBio Sequel II (PacBio) on 8M SMRT cells (SRR25687008).
For scaffolding, Dovetail fixed chromatin in place with formaldehyde in the nucleus for extraction and analysis via Dovetail® Omni-C® proximity ligation. They then digested the fixed chromatin with DNAse I, repaired the chromatin ends and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter containing ends. After proximity ligation, they reversed crosslinks and purified the DNA. They treated purified DNA to remove biotin that was not internal to ligated fragments. They generated sequencing libraries using NEBNext Ultra enzymes and Illumina-compatible adapters. They then isolated biotin-containing fragments using streptavidin beads before PCR enrichment of each library. Technicians then sequenced the library on an Illumina HiSeqX platform to produce approximately 30x sequence coverage. They then used HiRise MQ>50 reads for scaffolding (see "read-pair" above for figures).
Short read RNA sample collection and sequencing
We obtained Berghia adult tissue samples, including the: (1) brain, (2 samples; SRR14337001-SRR14337002) (2) oral tentacles (3 samples; SRR12072210, SRR25598601-SRR25598601), (3) rhinophores (3 samples; SRR12072209, SRR25598592-SRR25598593), (4) foot (2 samples; SRR12072206, SRR25598598), (5) tail (2 samples; SRR12072205, SRR25598599), and (6) proximal (3 samples; SRR12072208, SRR25598594-SRR25598595) and (7) distal ceras (3 samples; SRR12072207, SRR25598596-SRR25598597). We also obtained earlier stage transcriptome data from: (8) multiple embryonic stages (bulk sample of 500-600 individual embryos from each time point reared at 27°C (12, 24, 36, 48, 60 hpo and 4, 7, and 9 dpo; SRR12072213) and (9) juveniles 15 dpo at 27°C (500 individuals from 3 egg masses laid the same day; SRR12072212). We starved adults for ~1 week prior to removal of some tissues (all but the brain) to reduce symbiont presence and minimize contamination. We extracted total RNA from most adult tissues (minus the brain) using the RNeasy Kit (QIAGEN, Redwood City, CA) and submitted the extracted total RNA to Novogene Ltd. (Sacramento, CA) for quality assessment, library preparation and sequencing (Illumina NovaSeq 6000; 150bp paired-end reads). We prepared the adult brain total RNA using the Clontech SmartSeq v4 Ultra-Low Input RNA Kit (Takara). We prepared libraries with the Nextera XT DNA Library Preparation Kit and 96-Sample Index Kit (Illumina, San Diego, CA) and quantified them using Qubit (ThermoFisher Scientific, Waltham, MA) and assessed quality using a Bioanalyzer (Agilent, Santa Clara, CA). We sequenced the brain sample on the Illumina NextSeq 500 (75bp paired-end reads) at the Genomics Resource Laboratory, University of Massachusetts, Amherst. For the first two samples (bulk embryonic stages and juveniles), total RNA was extracted using TRIzol (Ambion) following the standard protocol, quality was assessed using Tapestation (Agilent) and sent to the IGM UCSD Genomic Center for library preparation (TruSeq mRNA stranded library) and sequencing (Illumina NovaSeq 6000; 150bp paired-end reads).
Reference Transcriptome Construction
Berghia samples used for reference transcriptome construction included a subset of samples to minimize computational cost while maximizing read breadth. These included single samples from multiple adult tissues, selected at random, including the (1) brain (SRR12072211), (2) oral tentacle (SRR12072210), (3) rhinophore (SRR12072209), (4) foot (SRR12072206), (5) tail (SRR12072205), (6) proximal (SRR12072208) and (7) distal ceras (SRR12072207), as well as samples (8) embryos (SRR12072213), and (9) juveniles (SRR12072212). We merged all FASTQ output files for the above samples into two files (Read 1 and Read 2) for downstream analysis. We used default parameters for all programs unless otherwise specified. We trimmed and filtered reads using fastp (version 0.20.0; (Chen et al. 2018)), and assembled transcripts using Trinity (version 2.9.1; (Grabherr et al. 2011)). We predicted open reading frames (ORFs) with TransDecoder (version 5.5.0; (Haas et al. 2013)). Duplication levels were quite high (~56%), so we clustered predicted ORFs using CD-HIT-EST (version 4.8.1; (W. Li and Godzik 2006; Fu et al. 2012)) at 95% identity and word size of 11 (-c 0.95, -n 11). Post-clustering, we filtered transcripts with alien_index (https://github.com/josephryan/alien_index); based on an algorithm described in (Gladyshev, Meselson, and Arkhipova 2008)). We constructed alien index databases using previously constructed metazoan and non-metazoan databases (obtained from http://ryanlab.whitney.ufl.edu/downloads/alien_index) and all “Symbiodinium” sequences present on UniProt (UniProt Consortium 2008) as of 31 March 2020. We removed all sequences with an alien index greater than 45 from the transcriptome. We then compiled full transcripts for each predicted ORF sequence remaining from the assembled transcriptome using a custom Python script (full_transcripts.py). We then scanned the transcriptome for vectors and possible contaminants via the NCBI VecScreen (https://www.ncbi.nlm.nih.gov/tools/vecscreen/). We removed vectors using a small script (trim_adapters.pl) available through the Trinity Community Codebase (https://github.com/trinityrnaseq/trinity_community_codebase). We removed or trimmed sequences containing contamination using the Contaminants.txt file provided by NCBI and a custom script (remove_contamination.py). Custom scripts are available at https://github.com/lyons-lab/berghia_reference_transcriptome). We assessed transcriptome quality across all steps using BUSCO v5.1.2 (Seppey, Manni, and Zdobnov 2019; Simão et al. 2015; Manni et al. 2021) scores by comparing assembled transcripts to the metazoa_odb10 (C:98.1%[S:83.1%,D:15.0%],F:0.9%,M:1.0%,n:954) and mollusca_odb10 (C:93.4%[S:76.9%,D:16.5%],F:1.3%,M:5.3%,n:5295) databases.
Long read RNA sample collection and sequencing
We obtained Berghia adult tissue samples from animals starved for at least four weeks to minimize gut contaminants, including the (1) head (one animal), (2) oral tentacles (two animals), (3) rhinophores (three animals), (4) cerata (one animal), (5) mantle (one animal), and (6) homogenized mid-body tissue (one animal). We also collected two developmental samples, including (1) embryos from the trochophore (72 hours post oviposition; 300 animals) and eyed veliger stages (9-10 days post oviposition; 120 animals), and (2) post-metamorphic and post-feeding juveniles (34 animals). We extracted total RNA using the standard TRIzol Reagent (Life Technologies, Carlsbad, CA, USA) protocol, with some modifications: After the addition of chloroform, we centrifuged samples for 20 minutes at max speed (16,000 RCF) and precipitated samples in 100% isopropanol for ~1 hour at -20°C. We assessed total RNA sample quality on a 1% agarose gel and quantified the RNA in each sample with a Qubit 2.0 High Sensitivity kit (ThermoFisher Scientific, Waltham, MA). We then pooled developmental stages (embryos and juveniles; DEV), adult rhinophore and oral tentacle samples (RHOT), and adult mantle and cerata samples (MCE) in equivalent amounts. We sent these five total RNA samples (DEV, MCE, RHOT, head, mid-body) to the Roy J. Carver Biotechnology Center at the University of Illinois at Urbana-Champaign for IsoSeq library construction (5 libraries) and sequencing. They performed sequencing (on the five pooled libraries) on a single SMRT 8M cell with the PacBio Sequel II (PacBio, Menlo Park, CA) and a 30h movie. They then clustered the raw subreads using the IsoSeq v3 clustering workflow (https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md).
Genome assembly and scaffolding
Dovetail used 167 gigabase-pairs of PacBio CLR reads as an input to WTDBG2 v2.5 (Ruan and Li 2020) with genome size 2.0g, minimum read length 20000, and minimum alignment length 8192. Additionally, they enabled realignment with the -R option and set read type with the option -x sq. They then used BLASTn results of the WTDBG2 output assembly against the nt database as input for blobtools v1.1.1, and scaffolds identified as possible contamination were removed from the assembly. Finally, they used purge_dups v1.2.3 (Guan et al. 2020) to remove haplotigs and contig overlaps.
Dovetail used input de novo assembly and Dovetail Omni-C library reads (3 samples; SRR25687005-SRR25687007) as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies (Putnam et al. 2016). They aligned Dovetail Omni-C library sequences to the draft input assembly using BWA with default parameters (H. Li and Durbin 2010). They then analyzed separations of Dovetail Omni-C read pairs mapped within draft scaffolds by Hi-Rise to produce a likelihood model for genomic distance between read pairs, and used the model to identify and break putative misjoins, to score prospective joins, and make joins above a threshold.
We initially filtered the Berghia stephanieae genome with purge_dups v1.2.5 (Guan et al. 2020) to automatically identify and remove haplotigs and contig/scaffold overlaps from heterozygous sites. This full assembly has been accessioned at NCBI (ACCESSION NO). Following duplicate purging, we assessed completeness with BUSCO v5.1.2 (Simão et al. 2015; Seppey, Manni, and Zdobnov 2019; Manni et al. 2021) by comparing to metazoa_odb10 and mollusca_odb10. BUSCO further used the programs HMMER v3.1 (Mistry et al. 2013) and MetaEuk v4.a0f584d (Levy Karin, Mirdita, and Söding 2020) for gene prediction and analysis. We then used Nucleotide-Nucleotide BLAST 2.11.0+ (Camacho et al. 2009; Altschul et al. 1990) to compare our scaffolds to the nt database (downloaded April 2021), and mapped the original PacBio reads used for assembly via minimap2 v2.18-r1015 (H. Li 2018). With these results, we used BlobToolKit (Challis et al. 2020)(blobtools2 filter option) to remove additional scaffolds considered contamination. Scaffold selection for removal was based on GC content, PacBio read coverage results, BLASTn hits (we removed no-hit and bacterial contamination), and finally a minimum size threshold (150 kb). This size threshold was selected because it was the point at which the removal of sequences would not change the BUSCO score, as determined by the use of BlobToolKit Viewer v1.1 (Challis et al. 2020). Most removed sequences contained differences in GC content and coverage compared to those that were retained in the final annotated genome, in addition to their smaller size. We also performed a Nucleotide-Nucleotide BLAST 2.11.0+ to compare the removed sequences with the final genome to assess duplication rates. Of those sequences removed from the final genome, 92.8% hit to one of the final 18 scaffolds (98.7% of which with an e-value of 0.0), and 1.6% of removed sequences were obvious contaminants. To compare the Berghia genome with other available Mollusca genomes, we downloaded assembled genomes from NCBI using the datasets command line function (v.15.24.0)(Sayers et al. 2022) with flags for the taxon Mollusca and genomes at assembly levels “scaffold”, “chromosome”, or “complete”. We then assessed completeness of each genome with BUSCO v5.1.2 compared to the database metazoa_odb10, and assessed scaffold N50 using the command line tool n50 in SeqFu, a suite of FASTX utilities (Telatin, Fariselli, and Birolo 2021).
Gene prediction and annotation
We analyzed filtered genome scaffolds with RepeatModeler v2.0.1 (Flynn et al. 2020), which used Tandem Repeat Finder (TRF) v4.09 (Benson 1999), RECON v1.08 (Bao and Eddy 2002), RepeatScout v1.0.6 (Price, Jones, and Pevzner 2005), and RepeatMasker v4.1.2 (https://www.repeatmasker.org), to construct a de novo repeat library for Berghia stephanieae. This included the ---LTRStruct flag to run an LTR Structural Analysis, which used GenomeTools v1.6.1 (http://genometools.org), LTR_Retriever v2.9.0 (Ou and Jiang 2018), Ninja v1.10.2 (https://github.com/ninja-build/ninja), MAFFT v7.480 (Katoh and Standley 2013), and CD-HIT v4.8.1 (W. Li and Godzik 2006; Fu et al. 2012). We then used this species-specific library to detect repeat sequences (via both soft and hardmasking) with RepeatMasker in the Berghia genome.
Following repeat masking, we mapped all short RNA-seq reads (unfiltered) to the hardmasked version of the genome using two separate read mapping software programs, HiSat2 v2.2.1 (Kim et al. 2019) and STAR v2.7.9a (Dobin et al. 2013). This was intended to account for mapping bias in order to maximize the possibility for support in our gene annotations. For long read (IsoSeq) mapping, we first obtained Full Length Non-Concatemer (FLNC) reads from step three of the IsoSeq v3 workflow. These FLNC reads were mapped directly to the non-repeatmasked genome using minimap2 v2.22-r1105-dirty (H. Li 2018) with the recommended options according to PacBio (https://github.com/Magdoll/cDNA_Cupcake/wiki/Best-practice-for-aligning-Iso-Seq-to-reference-genome:-minimap2,-deSALT,-GMAP,-STAR,-BLAT). Post-mapping, sam output files were reformatted into bam files and indexed using samtools v1.11(H. Li et al. 2009). We used BRAKER v2.1.6 (Hoff et al. 2016; Brůna et al. 2021; Hoff et al. 2019) for preliminary gene prediction of the Berghia genome, which uses GeneMark-EP+ v4 (Brůna, Lomsadze, and Borodovsky 2020; Lomsadze et al. 2005), DIAMOND v2.0.8 (Buchfink, Xie, and Huson 2015), spaln v2.4.3 (Gotoh 2008; Iwata and Gotoh 2012), and Augustus v3.4.0 (Stanke et al. 2008). We used long and short read RNA-seq mapping results used as expression support input. We also used a protein hints file generated by combining the mollusca_odb10 database with B.stephanieae sequences identified as BUSCO hits from our initial mollusca_odb10 BUSCO run. We ran BRAKER with the additional flags --etpmode, --gff3 and --softmasking. After initial gene prediction, we generated a filtered predicted gene set using a script included with the BRAKER installation (selectSupportedSubsets.py) and the --anySupport flag to only include genes at least partially supported by hints. IsoSeq and short read RNA-sequencing data were mapped to both sets of gene models to assess the impact of filtering. Both unfiltered and filtered gene prediction results are provided in Dryad (DOI: 10.6076/D1BS33), but we only used the filtered set (braker_annotations_anysupport.gff3) in subsequent functional annotation and clade-specific gene analyses.
For functional annotation of predicted genes, we used Protein-Protein BLAST 2.11.0+ (BLASTP) and InterProScan v.5.52-86.0. For BLASTP analyses, we used an e-value cutoff of 1e-3 with -max_target_seqs of one against three databases: (1) UniProtKB/Swiss-Prot, (2) RefSeq, and (3) Trembl (all downloaded April 2021). We then combined the hits to all three databases in a single blast annotation file. For InterProScan analyses, we used the default parameters with some additional flags, including: -goterms to look up gene ontology, -dp to disable pre-calculated match lookup, and -t p to indicate protein sequences.
Assessment of clade-specific genes and their expression
To determine the distribution of clade-specific genes for Berghia, we created a proteome dataset containing 47 metazoan species using both published genomes and transcriptomes (Additional file 4: Table S3). Our final dataset included 36 molluscs, including fourteen nudibranchs (five from Anthobranchia and nine from Cladobranchia). We downloaded predicted proteomes from genome datasets from MolluscDB (Liu et al. 2021). We downloaded the transcriptomes from the NCBI Sequence Read Archive (SRA), which we then filtered with fastp v0.20.0 (Chen et al. 2018) and assembled with Trinity v2.9.1; (Grabherr et al. 2011). We predicted ORFs using Transdecoder v5.5.0 (https://github.com/TransDecoder/TransDecoder), with default parameters. Our final proteomes ranged from 17,606 to 72,541 proteins (x̅ = 33,691; Additional file 4: Table S3). We identified orthologous gene families among our metazoan proteomes using the OrthoFinder package (Emms and Kelly, 2019) with default parameters and a user generated species tree as input (Additional File 11). Our user generated species tree topology was based on recent metazoan phylogenies (Laumer et al. 2019; Hejnol et al. 2009; Kocot et al. 2017; Marlétaz et al. 2019), the MolluscDB phylogeny provided on the website (Liu et al. 2021), and recent Mollusca (Sigwart and Lindberg 2015) and nudibranch (Karmeinski et al. 2021; Goodheart et al. 2017, 2015) phylogenetic analyses. We then analyzed orthologous groups using the program KinFin v1.0.3 (Laetsch and Blaxter 2017) to determine which predicted genes in Berghia stephanieae are clade-specific (meaning that they only cluster with sequences from a particular clade). We used the default parameters, with additional flags (--infer_singletons --plot_tree -r phylum,class,order,superfamily).
To determine the expression patterns of clade-specific genes, we mapped our short read RNAseq data (from the brain, oral tentacles, rhinophores, foot, tail, and proximal and distal ceras) to the Berghia genome using STAR v2.7.9a (Dobin et al. 2013) with default parameters plus additional flags (--readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopassMode Basic --sjdbGTFfeatureExon 'CDS'). We counted reads using the command htseq-count from the HTSeq framework v1.99.2 (Putri et al. 2021), which is a Python package for analysis of high-throughput sequencing data. We analyzed counts using the DESeq function from DESeq2 v1.26.0 (Love, Huber, and Anders 2014) to perform differential analysis, and generated results using the results function with contrasts comparing each focal tissue with an average of all other tissues. We considered genes upregulated if the adjusted p-value (padj) was greater than 0.05 and log2FoldChange was greater than 2.
References
Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. 1990. “Basic Local Alignment Search Tool.” Journal of Molecular Biology. https://doi.org/10.1016/s0022-2836(05)80360-2.
Bao, Zhirong, and Sean R. Eddy. 2002. “Automated de Novo Identification of Repeat Sequence Families in Sequenced Genomes.” Genome Research 12 (8): 1269–76.
Benson, G. 1999. “Tandem Repeats Finder: A Program to Analyze DNA Sequences.” Nucleic Acids Research. https://doi.org/10.1093/nar/27.2.573.
Brůna, Tomáš, Katharina J. Hoff, Alexandre Lomsadze, Mario Stanke, and Mark Borodovsky. 2021. “BRAKER2: Automatic Eukaryotic Genome Annotation with GeneMark-EP+ and AUGUSTUS Supported by a Protein Database.” NAR Genomics and Bioinformatics 3 (1): lqaa108.
Brůna, Tomáš, Alexandre Lomsadze, and Mark Borodovsky. 2020. “GeneMark-EP+: Eukaryotic Gene Prediction with Self-Training in the Space of Genes and Proteins.” NAR Genomics and Bioinformatics 2 (2): lqaa026.
Buchfink, Benjamin, Chao Xie, and Daniel H. Huson. 2015. “Fast and Sensitive Protein Alignment Using DIAMOND.” Nature Methods 12 (1): 59–60.
Camacho, Christiam, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L. Madden. 2009. “BLAST : Architecture and Applications.” BMC Bioinformatics. https://doi.org/10.1186/1471-2105-10-421.
Challis, Richard, Edward Richards, Jeena Rajan, Guy Cochrane, and Mark Blaxter. 2020. “BlobToolKit - Interactive Quality Assessment of Genome Assemblies.” G3 10 (4): 1361–74.
Chen, Shifu, Yanqing Zhou, Yaru Chen, and Jia Gu. 2018. “Fastp: An Ultra-Fast All-in-One FASTQ Preprocessor.” Bioinformatics 34 (17): i884–90.
Dobin, Alexander, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras. 2013. “STAR: Ultrafast Universal RNA-Seq Aligner.” Bioinformatics 29 (1): 15–21.
Flynn, Jullien M., Robert Hubley, Clément Goubert, Jeb Rosen, Andrew G. Clark, Cédric Feschotte, and Arian F. Smit. 2020. “RepeatModeler2 for Automated Genomic Discovery of Transposable Element Families.” Proceedings of the National Academy of Sciences of the United States of America 117 (17): 9451–57.
Fu, Limin, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. 2012. “CD-HIT: Accelerated for Clustering the next-Generation Sequencing Data.” Bioinformatics 28 (23): 3150–52.
Gladyshev, Eugene A., Matthew Meselson, and Irina R. Arkhipova. 2008. “Massive Horizontal Gene Transfer in Bdelloid Rotifers.” Science 320 (5880): 1210–13.
Goodheart, Jessica A., Adam L. Bazinet, Allen G. Collins, and Michael P. Cummings. 2015. “Relationships within Cladobranchia (Gastropoda: Nudibranchia) Based on RNA-Seq Data: An Initial Investigation.” Royal Society Open Science 2 (9): 150196.
Goodheart, Jessica A., Adam L. Bazinet, Ángel Valdés, Allen G. Collins, and Michael P. Cummings. 2017. “Prey Preference Follows Phylogeny: Evolutionary Dietary Patterns within the Marine Gastropod Group Cladobranchia (Gastropoda: Heterobranchia: Nudibranchia).” BMC Evolutionary Biology 17 (1): 221.
Gotoh, Osamu. 2008. “A Space-Efficient and Accurate Method for Mapping and Aligning cDNA Sequences onto Genomic Sequence.” Nucleic Acids Research 36 (8): 2630–38.
Grabherr, Manfred G., Brian J. Haas, Moran Yassour, Joshua Z. Levin, Dawn A. Thompson, Ido Amit, Xian Adiconis, et al. 2011. “Full-Length Transcriptome Assembly from RNA-Seq Data without a Reference Genome.” Nature Biotechnology 29 (7): 644–52.
Guan, Dengfeng, Shane A. McCarthy, Jonathan Wood, Kerstin Howe, Yadong Wang, and Richard Durbin. 2020. “Identifying and Removing Haplotypic Duplication in Primary Genome Assemblies.” Bioinformatics 36 (9): 2896–98.
Haas, Brian J., Alexie Papanicolaou, Moran Yassour, Manfred Grabherr, Philip D. Blood, Joshua Bowden, Matthew Brian Couger, et al. 2013. “De Novo Transcript Sequence Reconstruction from RNA-Seq Using the Trinity Platform for Reference Generation and Analysis.” Nature Protocols 8 (8): 1494–1512.
Hejnol, Andreas, Matthias Obst, Alexandros Stamatakis, Michael Ott, Greg W. Rouse, Gregory D. Edgecombe, Pedro Martinez, et al. 2009. “Assessing the Root of Bilaterian Animals with Scalable Phylogenomic Methods.” Proceedings. Biological Sciences / The Royal Society 276 (1677): 4261–70.
Hoff, Katharina J., Simone Lange, Alexandre Lomsadze, Mark Borodovsky, and Mario Stanke. 2016. “BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS.” Bioinformatics 32 (5): 767–69.
Hoff, Katharina J., Alexandre Lomsadze, Mark Borodovsky, and Mario Stanke. 2019. “Whole-Genome Annotation with BRAKER.” Methods in Molecular Biology 1962: 65–95.
Iwata, Hiroaki, and Osamu Gotoh. 2012. “Benchmarking Spliced Alignment Programs Including Spaln2, an Extended Version of Spaln That Incorporates Additional Species-Specific Features.” Nucleic Acids Research 40 (20): e161.
Karmeinski, Dario, Karen Meusemann, Jessica A. Goodheart, Michael Schroedl, Alexander Martynov, Tatiana Korshunova, Heike Wägele, and Alexander Donath. 2021. “Transcriptomics Provides a Robust Framework for the Relationships of the Major Clades of Cladobranch Sea Slugs (Mollusca, Gastropoda, Heterobranchia), but Fails to Resolve the Position of the Enigmatic Genus Embletonia.” BMC Ecology and Evolution 21 (1): 226.
Katoh, Kazutaka, and Daron M. Standley. 2013. “MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability.” Molecular Biology and Evolution 30 (4): 772–80.
Kim, Daehwan, Joseph M. Paggi, Chanhee Park, Christopher Bennett, and Steven L. Salzberg. 2019. “Graph-Based Genome Alignment and Genotyping with HISAT2 and HISAT-Genotype.” Nature Biotechnology. https://doi.org/10.1038/s41587-019-0201-4.
Kocot, Kevin M., Torsten H. Struck, Julia Merkel, Damien S. Waits, Christiane Todt, Pamela M. Brannock, David A. Weese, et al. 2017. “Phylogenomics of Lophotrochozoa with Consideration of Systematic Error.” Systematic Biology 66 (2): 256–82.
Laetsch, Dominik R., and Mark L. Blaxter. 2017. “KinFin: Software for Taxon-Aware Analysis of Clustered Protein Sequences.” G3 7 (10): 3349–57.
Laumer, Christopher E., Rosa Fernández, Sarah Lemer, David Combosch, Kevin M. Kocot, Ana Riesgo, Sónia C. S. Andrade, Wolfgang Sterrer, Martin V. Sørensen, and Gonzalo Giribet. 2019. “Revisiting Metazoan Phylogeny with Genomic Sampling of All Phyla.” Proceedings. Biological Sciences / The Royal Society 286 (1906): 20190831.
Levy Karin, Eli, Milot Mirdita, and Johannes Söding. 2020. “MetaEuk-Sensitive, High-Throughput Gene Discovery, and Annotation for Large-Scale Eukaryotic Metagenomics.” Microbiome 8 (1): 48.
Li, Heng. 2018. “Minimap2: Pairwise Alignment for Nucleotide Sequences.” Bioinformatics 34 (18): 3094–3100.
Li, Heng, and Richard Durbin. 2010. “Fast and Accurate Long-Read Alignment with Burrows-Wheeler Transform.” Bioinformatics 26 (5): 589–95.
Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. 2009. “The Sequence Alignment/Map Format and SAMtools.” Bioinformatics 25 (16): 2078–79.
Liu, Fuyun, Yuli Li, Hongwei Yu, Lingling Zhang, Jingjie Hu, Zhenmin Bao, and Shi Wang. 2021. “MolluscDB: An Integrated Functional and Evolutionary Genomics Database for the Hyper-Diverse Animal Phylum Mollusca.” Nucleic Acids Research 49 (D1): D988–97.
Li, W., and A. Godzik. 2006. “Cd-Hit: A Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btl158.
Lomsadze, Alexandre, Vardges Ter-Hovhannisyan, Yury O. Chernoff, and Mark Borodovsky. 2005. “Gene Identification in Novel Eukaryotic Genomes by Self-Training Algorithm.” Nucleic Acids Research 33 (20): 6494–6506.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15 (12): 550.
Manni, Mosè, Matthew R. Berkeley, Mathieu Seppey, Felipe A. Simão, and Evgeny M. Zdobnov. 2021. “BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes.” Molecular Biology and Evolution 38 (10): 4647–54.
Marlétaz, Ferdinand, Katja T. C. A. Peijnenburg, Taichiro Goto, Noriyuki Satoh, and Daniel S. Rokhsar. 2019. “A New Spiralian Phylogeny Places the Enigmatic Arrow Worms among Gnathiferans.” Current Biology: CB 29 (2): 312–18.e3.
Mistry, Jaina, Robert D. Finn, Sean R. Eddy, Alex Bateman, and Marco Punta. 2013. “Challenges in Homology Search: HMMER3 and Convergent Evolution of Coiled-Coil Regions.” Nucleic Acids Research. https://doi.org/10.1093/nar/gkt263.
Ou, Shujun, and Ning Jiang. 2018. “LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons.” Plant Physiology 176 (2): 1410–22.
Price, Alkes L., Neil C. Jones, and Pavel A. Pevzner. 2005. “De Novo Identification of Repeat Families in Large Genomes.” Bioinformatics 21 Suppl 1 (June): i351–58.
Putnam, Nicholas H., Brendan L. O’Connell, Jonathan C. Stites, Brandon J. Rice, Marco Blanchette, Robert Calef, Christopher J. Troll, et al. 2016. “Chromosome-Scale Shotgun Assembly Using an in Vitro Method for Long-Range Linkage.” Genome Research. https://doi.org/10.1101/gr.193474.115.
Putri, Givanna H., Simon Anders, Paul Theodor Pyl, John E. Pimanda, and Fabio Zanini. 2021. “Analysing High-Throughput Sequencing Data in Python with HTSeq 2.0.” arXiv [q-bio.GN]. arXiv. http://arxiv.org/abs/2112.00939.
Ruan, Jue, and Heng Li. 2020. “Fast and Accurate Long-Read Assembly with wtdbg2.” Nature Methods 17 (2): 155–58.
Sayers, Eric W., Evan E. Bolton, J. Rodney Brister, Kathi Canese, Jessica Chan, Donald C. Comeau, Ryan Connor, et al. 2022. “Database Resources of the National Center for Biotechnology Information.” Nucleic Acids Research 50 (D1): D20–26.
Seppey, Mathieu, Mosè Manni, and Evgeny M. Zdobnov. 2019. “BUSCO: Assessing Genome Assembly and Annotation Completeness.” Methods in Molecular Biology 1962: 227–45.
Sigwart, Julia D., and David R. Lindberg. 2015. “Consensus and Confusion in Molluscan Trees: Evaluating Morphological and Molecular Phylogenies.” Systematic Biology 64 (3): 384–95.
Simão, Felipe A., Robert M. Waterhouse, Panagiotis Ioannidis, Evgenia V. Kriventseva, and Evgeny M. Zdobnov. 2015. “BUSCO: Assessing Genome Assembly and Annotation Completeness with Single-Copy Orthologs.” Bioinformatics 31 (19): 3210–12.
Stanke, Mario, Mark Diekhans, Robert Baertsch, and David Haussler. 2008. “Using Native and Syntenically Mapped cDNA Alignments to Improve de Novo Gene Finding.” Bioinformatics 24 (5): 637–44.
Telatin, Andrea, Piero Fariselli, and Giovanni Birolo. 2021. “SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files.” Bioengineering (Basel, Switzerland) 8 (5). https://doi.org/10.3390/bioengineering8050059.
UniProt Consortium. 2008. “The Universal Protein Resource (UniProt).” Nucleic Acids Research 36 (Database issue): D190–95.
Methods
Sample preparation and genome sequencing
We isolated one Berghia juvenile from the Lyons lab culture prior to mating to minimize genomic contamination. While isolated, we fed the animal ~½ of a medium Exaiptasia diaphana (defined by Taraporevala et al. 2022) each day for 34 days. We then starved the animal for 44 days prior to shipping. To minimize residual food in the gut diverticula, cerata were removed with forceps and the remaining body was blotted on a kimwipe to remove excess water, then the animal was placed in a cryotube and flash frozen in liquid nitrogen and stored at -80 until shipping to Dovetail Genomics (now Catana Bio, Scotts Valley, CA). Technicians at Dovetail Genomics used an input of ~101mg into a slow CTAB protocol, which yielded 12.1µg using the Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA) High Sensitivity Kit. They then used a Mini Column for clean up and resuspended the pellet in 75 µl TE. They then quantified DNA samples using the Qubit. They constructed the PacBio SMRTbell library (~20kb) for PacBio Sequel using SMRTbell Express Template Prep Kit V 2.0 (PacBio, Menlo Park, CA, USA) using the manufacturer recommended protocol. They then bound the library to polymerase using the Sequel II Binding Kit 2.0 (PacBio) and loaded onto PacBio Sequel II (PacBio) on 8M SMRT cells with approximately 160x sequence coverage (SRA Accession TBD).
Short read RNA sample collection and sequencing
We obtained Berghia adult tissue samples, including the: (1) brain, (2 samples; SRA Accessions TBD) (2) oral tentacles (3 samples; SRR12072210, SRA Accessions TBD), (3) rhinophores (3 samples; SRR12072209, SRA Accessions TBD), (4) foot (2 samples; SRR12072206, SRA Accessions TBD), (5) tail (2 samples; SRR12072205, SRA Accessions TBD), and (6) proximal (3 samples; SRR12072208, SRA Accessions TBD) and (7) distal ceras (3 samples; SRR12072207, SRA Accessions TBD). We also obtained earlier stage transcriptome data from: (8) multiple embryonic stages (bulk sample of 500-600 individual embryos from each time point reared at 27°C (12, 24, 36, 48, 60 hpo and 4, 7, and 9 dpo; SRR12072213) and (9) juveniles 15 dpo at 27°C (500 individuals from 3 egg masses laid the same day; SRR12072212). We starved adults for ~1 week prior to removal of some tissues (all but the brain) to reduce symbiont presence and minimize contamination. We extracted total RNA from most adult tissues (minus the brain) using the RNeasy Kit (QIAGEN, Redwood City, CA) and submitted the extracted total RNA to Novogene Ltd. (Sacramento, CA) for quality assessment, library preparation and sequencing (Illumina NovaSeq 6000; 150bp paired-end reads). We prepared the adult brain total RNA using the Clontech SmartSeq v4 Ultra-Low Input RNA Kit (Takara). We prepared libraries with the Nextera XT DNA Library Preparation Kit and 96-Sample Index Kit (Illumina, San Diego, CA) and quantified them using Qubit (ThermoFisher Scientific, Waltham, MA) and assessed quality using a Bioanalyzer (Agilent, Santa Clara, CA). We sequenced the brain sample on the Illumina NextSeq 500 (75bp paired-end reads) at the Genomics Resource Laboratory, University of Massachusetts, Amherst. For the first two samples (bulk embryonic stages and juveniles), total RNA was extracted using TRIzol (Ambion) following the standard protocol, quality was assessed using Tapestation (Agilent) and sent to the IGM UCSD Genomic Center for library preparation (TruSeq mRNA stranded library) and sequencing (Illumina NovaSeq 6000; 150bp paired-end reads).
Reference Transcriptome Construction
Berghia samples used for reference transcriptome construction included samples: single samples from multiple adult tissues, selected at random, including the (1) brain (SRR12072211), (2) oral tentacle (SRR12072210), (3) rhinophore (SRR12072209), (4) foot (SRR12072206), (5) tail (SRR12072205), (6) proximal (SRR12072208) and (7) distal ceras (SRR12072207), as well as samples (8) embryos (SRR12072213), and (9) juveniles (SRR12072212). We merged all FASTQ output files for a subset of short read RNA-seq samples into two files (Read 1 and Read 2) for downstream analysis. We used default parameters for all programs unless otherwise specified. We trimmed and filtered reads using fastp (version 0.20.0; [1]), and assembled transcripts using Trinity (version 2.9.1; [2]). We predicted open reading frames (ORFs) with TransDecoder (version 5.5.0; [3]). Duplication levels were quite high (~56%), so we clustered predicted ORFs using CD-HIT-EST (version 4.8.1; [4, 5]) at 95% identity and word size of 11 (-c 0.95, -n 11). Post-clustering, we filtered transcripts with alien_index (https://github.com/josephryan/alien_index); based on an algorithm described in [6]). We constructed alien index databases using previously constructed metazoan and non-metazoan databases (obtained from http://ryanlab.whitney.ufl.edu/downloads/alien_index) and all “Symbiodinium” sequences present on UniProt [7] as of 31 March 2020. We removed all sequences with an alien index greater than 45 from the transcriptome. We then compiled full transcripts for each predicted ORF sequence remaining from the assembled transcriptome using a custom Python script (full_transcripts.py). We then scanned the transcriptome for vectors and possible contaminants via the NCBI VecScreen (https://www.ncbi.nlm.nih.gov/tools/vecscreen/). We removed vectors using a small script (trim_adapters.pl) available through the Trinity Community Codebase (https://github.com/trinityrnaseq/trinity_community_codebase). We removed or trimmed sequences containing contamination using the Contaminants.txt file provided by NCBI and a custom script (remove_contamination.py). Custom scripts are available at https://github.com/lyons-lab/berghia_reference_transcriptome). We assessed transcriptome quality across all steps using BUSCO (version 4.0.5; [8, 9]) scores by comparing assembled transcripts to the metazoa_odb10 and mollusca_odb10 databases.
Long read RNA sample collection and sequencing
We obtained Berghia adult tissue samples from animals starved for at least four weeks to minimize gut contaminants, including the (1) head (one animal), (2) oral tentacles (two animals), (3) rhinophores (three animals), (4) cerata (one animal), (5) mantle (one animal), and (6) homogenized mid-body tissue (one animal). We also collected two developmental samples, including (1) embryos from the trochophore (72 hours post oviposition; 300 animals) and eyed veliger stages (9-10 days post oviposition; 120 animals), and (2) post-metamorphic and post-feeding juveniles (34 animals). We extracted total RNA using the standard TRIzol Reagent (Life Technologies, Carlsbad, CA, USA) protocol, with some modifications: After the addition of chloroform, we centrifuged samples for 20 minutes at max speed (16,000 RCF) and precipitated samples in 100% isopropanol for ~1 hour at -20°C. We assessed total RNA sample quality on a 1% agarose gel and quantified the RNA in each sample with a Qubit 2.0 High Sensitivity kit (ThermoFisher Scientific, Waltham, MA). We then pooled developmental stages (embryos and juveniles; DEV), adult rhinophore and oral tentacle samples (RHOT), and adult mantle and cerata samples (MCE) in equivalent amounts. We sent these five total RNA samples (DEV, MCE, RHOT, head, mid-body) to the Roy J. Carver Biotechnology Center at the University of Illinois at Urbana-Champaign for IsoSeq library construction (5 libraries) and sequencing. They performed sequencing (on the five pooled libraries) on a single SMRT 8M cell with the PacBio Sequel II (PacBio, Menlo Park, CA) and a 30h movie. They then clustered the raw subreads using the IsoSeq v3 clustering workflow (https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md).
Genome assembly and scaffolding
Bioinformaticians at Dovetail used 167 gigabase-pairs of PacBio CLR reads as an input to WTDBG2 v2.5 [10] with genome size 2.0g, minimum read length 20000, and minimum alignment length 8192. Additionally, they enabled realignment with the -R option and set read type with the option -x sq. They then used BLASTn results of the WTDBG2 output assembly (asm.cns.fa) against the nt database as input for blobtools v1.1.1, and scaffolds identified as possible contamination were removed from the assembly (filtered.asm.cns.fa). Finally, they used purge_dups v1.2.3 [11] to remove haplotigs and contig overlaps (purged.fa).
For scaffolding, Dovetail fixed chromatin in place with formaldehyde in the nucleus for extraction and analysis via Dovetail® Omni-C® proximity ligation. They then digested the fixed chromatin with DNAse I, repaired the chromatin ends and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter containing ends. After proximity ligation, they reversed crosslinks and purified the DNA. They treated purified DNA to remove biotin that was not internal to ligated fragments. They generated sequencing libraries using NEBNext Ultra enzymes and Illumina-compatible adapters. They then isolated biotin-containing fragments using streptavidin beads before PCR enrichment of each library. Technicians then sequenced the library on an Illumina HiSeqX platform to produce approximately 30x sequence coverage. They then used HiRise MQ>50 reads for scaffolding (see "read-pair" above for figures).
Bioinformaticians at Dovetail used input de novo assembly and Dovetail Omni-C library reads as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies [12]. They then aligned Dovetail Omni-C library sequences to the draft input assembly using BWA with default parameters [13, 14]. They then analyzed separations of Dovetail Omni-C read pairs mapped within draft scaffolds by Hi-Rise to produce a likelihood model for genomic distance between read pairs, and used the model to identify and break putative misjoins, to score prospective joins, and make joins above a threshold.
We initially filtered the Berghia stephanieae genome with purge_dups v1.2.5 [11] to automatically identify and remove haplotigs and contig/scaffold overlaps from heterozygous sites. Following duplicate purging, we assessed completeness with BUSCO v5.1.2 [9] by comparing to metazoa_odb10 and mollusca_odb10. BUSCO further used the programs HMMER v3.1 [15] and MetaEuk v4.a0f584d [16] for gene prediction and analysis. We then used Nucleotide-Nucleotide BLAST 2.11.0+ [17, 18] to compare our scaffolds to the nt database (downloaded April 2021), and mapped the original PacBio reads used for assembly via minimap2 v2.18-r1015 [19]. With these results, we used BlobToolKit (Challis et al. 2020)(blobtools2 filter option) to remove additional scaffolds considered contamination. Scaffold selection for removal was based on BLASTn hits (we removed no-hit and bacterial contamination) and a minimum size threshold (150 kb). This size threshold was selected because it was the point at which the removal of sequences would not change the BUSCO score, as determined by the use of BlobToolKit Viewer v1.1 [20]. The removed sequences contained differences in GC content and coverage among those that were retained in the final annotated genome. We also performed a Nucleotide-Nucleotide BLAST 2.11.0+ to compare the removed sequences with the final genome to assess duplication rates. Of those sequences removed from the final genome, 92.8% hit to one of the final 18 scaffolds (98.7% of which with an e-value of 0.0), and 1.6% of removed sequences were obvious contaminants.
Gene prediction and annotation
We analyzed filtered genome scaffolds with RepeatModeler v2.0.1 [21], which used Tandem Repeat Finder (TRF) v4.09 [22], RECON v1.08 [23], RepeatScout v1.0.6 [24], and RepeatMasker v4.1.2 (https://www.repeatmasker.org), to construct a de novo repeat library for Berghia stephanieae. This included the ---LTRStruct flag to run an LTR Structural Analysis, which used GenomeTools v1.6.1 (http://genometools.org), LTR_Retriever v2.9.0 [25], Ninja v1.10.2 (https://github.com/ninja-build/ninja), MAFFT v7.480 [26], and CD-HIT v4.8.1 [4, 5]. We then used this species-specific library to detect repeat sequences (via both soft and hardmasking) with RepeatMasker in the Berghia genome.
Following repeat masking, we mapped all short RNA-seq reads (unfiltered) to the hardmasked version of the genome using two separate read mapping software programs, HiSat2 v2.2.1 [27] and STAR v2.7.9a [28]. This was intended to account for mapping bias in order to maximize the possibility for support in our gene annotations. For long read (IsoSeq) mapping, we first obtained Full Length Non-Concatemer (FLNC) reads from step three of the IsoSeq v3 workflow. These FLNC reads were mapped directly to the non-repeatmasked genome using minimap2 v2.22-r1105-dirty [19] with the recommended options according to PacBio (https://github.com/Magdoll/cDNA_Cupcake/wiki/Best-practice-for-aligning-Iso-Seq-to-reference-genome:-minimap2,-deSALT,-GMAP,-STAR,-BLAT). Post-mapping, sam output files were reformatted into bam files and indexed using samtools v1.11[29]. We used BRAKER v2.1.6 [30–32] for preliminary gene prediction of the Berghia genome, which uses GeneMark-EP+ v4 [33, 34], DIAMOND v2.0.8 [35], spaln v2.4.3 [36, 37], and Augustus v3.4.0 [38]. We used long and short read RNA-seq mapping results used as expression support input. We also used a protein hints file generated by combining the mollusca_odb10 database with B.stephanieae sequences identified as BUSCO hits from our initial mollusca_odb10 BUSCO run. We ran BRAKER with the additional flags --etpmode, --gff3 and --softmasking. After initial gene prediction, we generated a filtered predicted gene set using a script included with the BRAKER installation (selectSupportedSubsets.py) and the --anySupport flag to only include genes at least partially supported by hints. Both unfiltered and filtered gene prediction results are provided in Dryad (Dryad DOI TBD), but we only used the filtered set (braker_annotations_anysupport.gff3) in subsequent functional annotation and clade-specific gene analyses.
For functional annotation of predicted genes, we used Protein-Protein BLAST 2.11.0+ (BLASTP) and InterProScan v.5.52-86.0. For BLASTP analyses, we used an e-value cutoff of 1e-3 with -max_target_seqs of one against three databases: (1) UniProtKB/Swiss-Prot, (2) RefSeq, and (3) Trembl (all downloaded April 2021). We then combined the hits to all three databases in a single blast annotation file. For InterProScan analyses, we used the default parameters with some additional flags, including: -goterms to look up gene ontology, -dp to disable pre-calculated match lookup, and -t p to indicate protein sequences. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) Expanse at the San Diego Supercomputer Center (SDSC) through allocation IDs TG-BIO210019 and TG-BIO210138 [39].
Assessment of clade-specific genes and their expression
To determine the distribution of clade-specific genes for Berghia, we created a proteome dataset containing 47 metazoan species using both published genomes and transcriptomes (Additional file 4: Table S3). Our final dataset included 36 molluscs, including fourteen nudibranchs (five from Anthobranchia and nine from Cladobranchia). We downloaded predicted proteomes from genome datasets from MolluscDB [40]. We downloaded the transcriptomes from the NCBI Sequence Read Archive (SRA), which we then filtered with fastp v0.20.0 [1] and assembled with Trinity v2.9.1; [2]. We predicted ORFs using Transdecoder v5.5.0 (https://github.com/TransDecoder/TransDecoder), with default parameters. Our final proteomes ranged from 17,606 to 72,541 proteins (x̅ = 33,691; Additional file 4: Table S3). We identified orthologous gene families among our metazoan proteomes using the OrthoFinder package (Emms and Kelly, 2019) with default parameters and a user generated species tree as input (Additional File 11). Our user generated species tree topology was based on recent metazoan phylogenies [41–44], the MolluscDB phylogeny provided on the website [40], and recent Mollusca [45] and nudibranch [46–48] phylogenetic analyses. We then analyzed orthologous groups using the program KinFin v1.0.3 [49] to determine which predicted genes in Berghia stephanieae are clade-specific (meaning that they only cluster with sequences from a particular clade). We used the default parameters, with additional flags (--infer_singletons --plot_tree -r phylum,class,order,superfamily).
To determine the expression patterns of clade-specific genes, we mapped our short read RNAseq data (from the brain, oral tentacles, rhinophores, foot, tail, and proximal and distal ceras) to the Berghia genome using STAR v2.7.9a [28] with default parameters plus additional flags (--readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopassMode Basic -sjdbGTFfeatureExon 'CDS'). We counted reads using the command htseq-count from the HTSeq framework v1.99.2 [50], which is a Python package for analysis of high-throughput sequencing data. We analyzed counts using the DESeq function from DESeq2 v1.26.0 [51] to perform differential analysis, and generated the results using the results function.