The iron-responsive genome of the chiton Acanthopleura granulata
Data files
Jan 12, 2021 version files 565.22 MB
-
GenomeAnalyses.zip
-
GenomeAnnotations.zip
-
GenomeAssembly.zip
-
Links_to_Raw_Data.zip
-
OtherProtocols.zip
-
README.docx
Abstract
Molluscs biomineralize structures that vary in composition, form, and function, prompting questions about the genetic mechanisms responsible for their production and the evolution of these mechanisms. Chitons (Mollusca, Polyplacophora) are a promising system for studies of biomineralization because they build a range of calcified structures including shell plates and spine- or scale-like sclerites. Chitons also harden the calcified teeth of their rasp-like radula with a coat of iron (as magnetite). Here we present the genome of the West Indian fuzzy chiton Acanthopleura granulata, the first from any aculiferan mollusc. The A. granulata genome contains homologs of many genes associated with biomineralization in conchiferan molluscs. We expected chitons to lack genes previously identified from pathways conchiferans use to make biominerals like calcite and nacre because chitons do not use these materials in their shells. Surprisingly, the A. granulata genome has homologs of many of these genes, suggesting that the ancestral mollusc may have had a more diverse biomineralization toolkit than expected. The A. granulata genome has features that may be specialized for iron biomineralization, including a higher proportion of genes regulated directly by iron than other molluscs. A. granulata also produces two isoforms of soma-like ferritin: one is regulated by iron and similar in sequence to the soma-like ferritins of other molluscs, and the other is constitutively translated and is not found in other molluscs. The A. granulata genome is a resource for future studies of molluscan evolution and biomineralization.
Methods
Specimen collection and preservation
We collected a single male specimen of Acanthopleura granulata from Harry Harris State Park in the Florida Keys (Special Activity License #SAL-17-1983-SR). We cut the majority of the foot into ~1 mm2 cubes and froze them at -80°C. We froze additional pieces of foot, girdle (dissected such that the tissue sample would not contain shell secretory tissue), ctenidia, gonad, and radula in RNAlater and stored them at -80°C as well.
Genome and transcriptome sequencing
We extracted high molecular weight DNA from frozen samples of foot tissue from A. granulata using a CTAB-phenol chloroform method (available on Dryad). We cleaned DNA for short read generation with the Zymo Clean and Concentrator Kit. For library preparation and sequencing, we sent cleaned DNA to the Genomics Services Lab at HudsonAlpha (Huntsville, AL), where it was sheared with a Covaris M220 to an average fragment size of 350 bp. These fragments were used to prepare an Illumina TruSeq DNA PCR-Free library, which was sequenced using one lane of an Illumina HiSeq X (2 X 150 bp paired-end reads).
For long-read sequencing, we cleaned DNA and enriched it for higher-molecular weight fragments by performing two sequential purifications using 0.4X AmPureXP magnetic beads. We generated long reads with four flow cells on an Oxford Nanopore Technologies GridION. We prepared two sequencing libraries with ligation kit LSK-108 and sequenced them on FloMin106 (R9.4.1) flow cells. We prepared the other two sequencing libraries with the updated ligation kit LSK-109 and sequenced them on R9.4.1RevD flow cells. We generated 2.19Gb, 4.41Gb, 7.87 Gb, and 8.4 Gb respectively across the four flow cells, for a total of 22.87 Gb, or >20x coverage with long-reads. Reads were basecalled with Guppy 4.0. We trimmed long reads with PoreChop (Wick 2018), which was set to remove chimeras (approximately 0.0005% of reads) and all residual adapter sequences.
To generate transcriptomes, we used the Omega Bio-tek EZNA Mollusc RNA Kit to extract RNA from girdle, ctenidia, gonad, foot, and four regions of radula (representative of visibly different stages of iron mineralization) of the same individual of A. granulata we used for genome sequencing. We synthesized and amplified complementary DNA (cDNA) from each tissue using the SmartSeq v4 Ultra Low-input RNA kit (Clontech) from 1 ng of input RNA with 17 cycles of PCR. We created eight dual-indexed sequencing libraries with the Illumina Nextera XT kit, using 1 ng of input cDNA. We sent the eight libraries to Macrogen (Seoul, South Korea) where they were pooled and sequenced on one lane of an Illumina HiSeq 4000 (2 x 100 bp paired-end reads). Using a similar approach, we generated transcriptomes from several additional species of chitons and from a variety of tissues: Chiton marmoratus (decalcified valve), Chiton tuberculatus (radula), Acanthopleura gemmata (girdle and mantle), Cryptoplax larvaeformis (girdle), Callochiton sp. (whole animal), Chaetopleura apiculata (girdle), Hanleya hanleyi (mantle), Leptochiton asellus (mantle) Nutallochiton mirandus (girdle), Tonicia schrammi (decalcified valve), Katharina tunicata (radula), and Lepidozona mertensi (radula).
Genome and transcriptome assembly and quality assessment
We initially assembled the chiton genome with MaSuRCA v. 3.3.5 (Zimin et al. 2013), which consolidates paired-end data into super reads and then uses long-read data to scaffold and gap-fill. This produced an assembly with 2,858 contigs. We filtered and collapsed heterozygous contigs with Redundans v. 0.14a (Pryszcz & Gabaldón 07 08, 2016), decreasing the assembly to 1,285 contigs. To ensure that no contigs were incorrectly removed, we verified that all pre-Redundans contigs mapped to the post-Redundans assembly with bowtie2 (Langmead & Salzberg 2012); all contigs mapped and thus non-redundant data were not deleted. To help decontaminate reads and contigs, we used the Blobtools2 Interface to create blob plots (Blaxter & Challis 2018). Because Blobtools uses the NCBI nucleotide database to determine the identity of each scaffold, and chordate sequences vastly outnumber molluscan sequences in NCBI, Blobtools identified a large proportion of scaffolds as chordate. We identified contaminants as sequences that differed from the majority of scaffolds in both GC content and coverage and used BLAST to verify these sequences as bacterial before removing them from the assembly.
We scaffolded this reduced assembly with one lane of Bionano SAPHYR optical mapping, using two enzymes (BssSI and DLE1) and Bionano Solve v3.4’s scaffolding software, which resulted in 87 scaffolds. We ran REAPR v. 1.0.18 (Hunt et al. 2013), which map short read data and collect mapping statistics simultaneously, to determine accuracy of the assembly overall relative to all short-read data generated, and found despite reducing heterozygosity in the final assembly, 85.31% of paired-end reads map perfectly back to the genome assembly, indicating a complete genome assembly relative to the paired-end data.
To assess our genome assembly, we ran QUAST v. 5.0.2 (Gurevich et al. 2013). We assessed genome completeness with BUSCO v. 4.0.2 (Simão et al. 2015), using the proportions of nuclear protein-coding genes thought to be single-copy in the genomes of diverse metazoans (Metazoa odb9 dataset) and estimating the proportion of those that were complete, duplicated, fragmented, and absent.
We assembled the eight A. granulata transcriptomes and the transcriptomes of other chiton tissues listed above with Trinity v. 2.84 (Grabherr et al. 2011), using the --trimmomatic and --normalize reads flags. We ran CD-Hit v. 4.8.1 (Fu et al. 2012) on each transcriptome separately to cluster isoforms. We also generated a composite transcriptome of all of the A. granulata tissues (eight total transcriptomes including four separate radula regions) by combining reads and then following the same process described above. We used this composite transcriptome for annotation.
Genome annotation
To annotate the A. granulata genome, we first generated a custom repeat library with RepeatModeler v. 2.0 (Smit & Hubley 2008-2015), which was used in all subsequent analyses. We trained MAKER v. 2.31.10 (Cantarel et al. 2008-1) on the composite transcriptome described above as well as predicted protein sequences from the tissues of other species of chitons listed above. Using the highest quality gene models from the first as a maker-input gff3 (AED <0.5), we ran a second round of MAKER. From these resulting gene models, we used those with an AED <0.25 to train Augustus v3.0.3 (Stanke et al. 2006): we extracted gene models from the genomic scaffolds along with 1,000 bp of flanking sequence on either side to ensure complete genes, and ran them through BUSCO to produce an Augustus model (.hmm) file. Separately, we ran PASA 2.4.1 (Haas et al. 2003) on our composite transcriptome to maximize mapping transcripts to the genome assembly. We were unable to use Evidence Modeler (EVM; Haas et al. 2008) to combine lines of evidence because high levels of alternative splicing caused EVM to consistently reduce the number of ‘passing’ gene models to under 5000. We instead combined results from PASA and a trained Augustus run using the intersect tool in BEDtools v. 2.29.2 (Quinlan & Hall 2010), which removed identical sequences. This yielded a set of 81,691 gene models. When we ran a BUSCO v. 3.9 analysis (Metazoa odb9 dataset), we found a 15.2% duplication rate. To decrease duplications caused by transcripts predicted for the same locus by both Augustus and PASA that varied in length (and thus were not removed by the BEDtools intersect tool), we clustered the first set of gene models using cdhit-EST v. 4.8.1 (Fu et al. 2012), which we ran with the slow-but-accurate (-g) flag and with a cluster threshold value of 0.8. This produced a set of 20,470 genes. All commands we used are available in Appendix 1.
To identify annotated proteins in A. granulata, we first used Transdecoder (Douglas 2018) to produce peptide files of predicted proteins. We ran Interproscan on the set of 20,470 genes referenced above to identify GO terms and Pfam matches for proteins where possible. We used GHOSTX in the Kaas pipeline (Moriya et al. 2007) to identify KEGG pathways via comparisons to all the available molluscan taxa (Lottia gigantea, Pomacea canaliculata, Crassostrea gigas, Mizuhopecten yessoensis, and Octopus bimaculoides). Finally, we looked for shared GO terms between specific taxa with OrthoVenn(Xu et al. 2019), comparing A. granulata to Lottia gigantea, Chrysomallon squaminiferon, Octopus bimaculoides, and Crassostrea gigas.
Hox gene annotation and genomic comparisons
We located the Hox cluster of A. granulata by first creating a BLAST database of the A. granulata scaffolds and then querying this database with available chiton Hox sequences (Wanninger & Wollesen 2019). We marked A. granulata sequences with a BLAST hit at e-value 1e-8 as potential Hox sequences. We found one clear match for each previously identified chiton Hox gene, all in a single cluster within one scaffold. To verify the absence of Post1, we queried the A. granulata database with Post1 sequences from five other molluscs (Wanninger & Wollesen 2019). All matched with low support to the existing A. granulata Post2 sequence, so we concluded that Post1 is absent from the A. granulata genome assembly. We further verified the absence by querying all A. granulata transcriptomes for Post1 sequences and returned only hits that matched already-annotated Hox sequences (highest sequence similarity to Post 2).
To graphically examine synteny between A. granulata and other molluscan genome assemblies, we loaded each assembly and annotation into the online COGE SynMap2 (Haug-Baltzell et al. 2017) server and compared A. granulata to eight other annotated genomes with default SynMap2 settings. These default settings are strict; to note two regions as syntenic, five or more genes must share an order with fewer than 20 additional genes between them. We exported dotplots for each pair of genomes to visualize syntenic regions (or lack thereof), where dots then represent shared regions of five or more orthologous genes without major changes in surrounding gene content. Scaffolds in each dotplot were sorted by length, but differing assembly qualities made some dotplots difficult to read due to a high number of very small scaffolds. We assessed heterozygosity of several molluscan genomes and A. granulata by downloading raw paired-end data when possible and using GenomeScope2 online (Vurture et al. 2017). We began with a k-mer of 21 for all taxa; if this model failed to converge, or if the homozygous and heterozygous peaks collapsed in the model, we re-ran the analysis with k-mer values of 17, 19, and 31. One taxon (Lottia) did not produce separate peaks at any k-mer value; for all others, we selected the optimal k-mer size based on the best-fitting model (minimizing error percentage).
To permit direct comparisons of repeat content within A. granulata and other molluscs, we ran RepeatModeler v2.2(Smit & Hubley 2015) on the scaffolds of a subset of genome assemblies and A. granulata. We used the same default parameters for each run and quantified the number of elements in each repeat family identified by RepeatModeler for each genome assembly we analyzed (LINEs, SINEs, etc.).
Orthology inference
To identify orthologous genes shared between A. granulata and other molluscs, we used OrthoFinder v. 2.3.7 (Emms & Kelly 2015). We analyzed three separate sets of data: 1) A. granulata and genomes of nineteen other lophotrochozoans, including fourteen other molluscs, two annelids, one brachiopod, one phoronid, and one nemertean; 2) A. granulata, a subset of the above molluscan genomes, and Lingula anatina for detailed comparisons of biomineralization genes, and; 3) A. granulata and an expanded set of data including both genomes and transcriptomes, including several transcriptomes from aculiferans other than A. granulata. For all three analyses we used the unclustered 81,691 gene set for A. granulata, knowing that duplicated gene models would cluster together. We removed sequences from our orthogroups that were identical to longer sequences where they overlapped, as well as fragmented sequences shorter than 100 amino acids. We retained orthogroups that had a minimum of four taxa, aligned the sequences within them with MAFFT (Katoh et al. 2002), and cleaned mistranslated regions with HmmCleaner (Di Franco et al. 2019). We used AlignmentCompare (https://github.com/kmkocot/basal_metazoan_phylogenomics_scripts_01-2015) to delete sequences that did not overlap with all other sequences by at least 20 AAs (starting with the shortest sequence meeting this criterion).
Phylogenetic analyses
For species tree reconstruction, in cases where two or more sequences were present for any taxon in a single-gene alignment, we used PhyloPyPruner 0.9.5 (https://pypi.org/project/phylopypruner/) to reduce the alignment to a set of strict orthologs. This tool uses single-gene trees to screen putative orthogroups for paralogy. To build single-gene trees based on orthologs, we trimmed alignments with BMGE v1.12.2 (Criscuolo & Gribaldo 2010) and constructed approximately maximum likelihood trees for each alignment with FastTree2(Price et al. 2010) using the “slow” and “gamma” options. We then used these alignments in PhyloPyPruner with the following settings: --min-len 100 --min-support 0.75 --mask pdist --trim-lb 3 --trim-divergent 0.75 --min-pdist 0.01 --trim-freq-paralogs 3 --prune MI. For datasets 1 (“genomes”) and 3 (“all_taxa”), only orthogroups sampled for at least 85% of the total number of taxa were retained for concatenation. For dataset 2 (“biomin_subset”), only orthogroups sampled for all eight taxa were retained. Phylogenetic analyses were conducted on the supermatrix produced by PhyloPyPruner v. 1.0 in IQ-TREE v. 1.6.12 (Nguyen et al. 2015) using the PMSF model (Wang et al. 2018) with a guide tree based on the LG model. Topological support was assessed with 1,000 rapid bootstraps.
Screening for known biomineralization genes
To identify biomineralization genes in the chiton genome, we began by running OrthoFinder v. 2.3.7 on the gene models of select genomes (corresponding to taxon set #2 described in the Orthology inference section above): Lingula anatina, Pinctada fucata, Lottia gigantea, Haliotis rufescenes, Scapharca broughtonii, Bathymodiolus platifrons, and Acanthopleura granulata. Once we obtained this set of orthogroups, we created a local BLAST protein database from the orthogroups and queried it for known biomineralization genes. We used a previously identified protein sequence for each gene of interest from NCBI (Supplementary Table 9) as a query and set an e-value cutoff of 1e-8 to identify the OrthoFinder orthogroup(s) that contain that particular biomineralization gene of interest. We aligned the putative matching orthogroup, the query sequence for the gene of interest, and the sequences of other orthogroups that had sequence similarity high enough to be pulled by our blast search with MAFFT with the default settings. We then constructed a phylogenetic tree in IQ-TREE v.1.3.11.1 to verify that the query sequence (the known protein sequence) fell within the orthogroup identified rather than within an orthogroup with a similar protein sequence due to conserved domains.
Silk-like proteins share similar amino acid composition throughout Metazoa, but the genes that code for them are difficult to identify in genomes because their highly-repetitive sequences are often missed by traditional gene annotation tools (McDougall et al. 2016). We looked for silk-like proteins with SilkSlider (McDougall et al. 2016), run with default settings but using SignalP v. 4.01 (Nielsen 2017), which identifies potential silk-like proteins by locating low-complexity repetitive domains and signal peptides. The 31 proteins identified as silk-like by SilkSlider were then uploaded to the SignalP 5.0 webserver (Almagro Armenteros et al. 2019) for further predictions of signal peptides associated with extracellular localization.
To locate and quantify iron response elements (IREs), we screened the 20,470-gene A. granulata gene model set using the SIREs 2.0 (Campillos et al. 2010) web server. We also ran SIRE on the subset of genomes used for biomineralization analyses (see OrthoFinder above) for comparison. We compensated for differences in annotation methods by first clustering all coding sequences from each genome with CD-Hit-EST (Fu et al. 2012) with a cluster threshold of 0.8 (to match the threshold value we used earlier to reduce redundancy in the annotations of the A. granulata genome). We then ran SIRE on each of these sets of predicted transcripts. We only accepted predicted IREs scored as “high quality” according to the SIRE metric (indicating both sequence and structural characteristics of a functional IRE). We pulled chiton genes containing a high quality IRE from the eight different tissue transcriptomes generated for genome annotation and assessed expression by mapping each back to the genome with Salmon v. 0.11.3 (Patro et al. 2017-4) to generate quantifications of reads per transcript, and running these quantifications through edgeR (Robinson et al. 2010) to account for transcript length (TPM) and permit direct comparisons of gene expression. We also separated 3’ and 5’ IREs by subsetting the high quality IREs based on whether the IRE was located at the beginning or end of the sequence. We made heatmaps with log-transformed data to compensate for outliers in expression levels with R package prettyheatmap (Kolde 2012). We then analyzed the GO terms enriched in the separate sets of 5’- and 3’-containing genes that were highly expressed in the radula with GOrilla (Eden et al. 2009), using the complete protein set as a background dataset and the sets of IRE-containing genes as the target list.
Usage notes
This repository contains supplemental data for:
Varney RM et al. 2020. The iron-responsive genome of the chiton Acanthopleura granulata. Genome Biology and Evolution.
############################
NCBI Accession numbers for the genome and all generated chiton transcriptomes are in a spreadsheet under Links_to_Raw_Data.
Configuration files for the genome assembly, as well as a copy of the final genome scaffolds, is under GenomeAssembly.
Multiple genome annotations can be found under GenomeAnnotations, a larger gene set that likely contains redundant isoforms and a more stringent gene set that may be missing some genes. Both are provided to provide either a highly-complete or closer-to-accurate option for the chiton gene set.
Laboratory protocol for DNA extraction is under OtherProtocols.
Raw input/output files from RepeatModeler, SilkSlider, SIREs, OrthoVenn, OrthoFinder, Phylogenetic analyses, KEGG, SwissProt. and GO annotations, as well as the phylogenetic analyses of each separate biomineralization gene investigated, are under GenomeAnalyses.