Genome assemblies and gene models of the zoantharians Palythoa mizigama and Palythoa umbrosa
Data files
Jul 19, 2024 version files 271.34 MB
- Pmiz_final.hm11.20240419.gtf.gz (5.87 MB)
- Pmiz_final.hm11.contigs.20240419.fa.gz (117.33 MB)
- Pmiz_final.hm11.transcript.20240419.fa.gz (14.03 MB)
- Pmiz_final.protein.20240419.fa.gz (8.52 MB)
- Pumb_final.hm11.20240419.gtf.gz (5.08 MB)
- Pumb_final.hm11.contigs.20240419.fa.gz (102.08 MB)
- Pumb_final.hm11.transcript.20240419.fa.gz (11.51 MB)
- Pumb_final.protein.longest.20240419.fa.gz (6.92 MB)
- README.md (1.86 KB)
Abstract
Anthozoan hexacorals are an important animal group in many marine environments and include at least ~3,500 extant species. Zoantharia is an order among the Hexacorallia (Anthozoa: Cnidaria) and is a sister group to a clade consisting of four orders: Actiniaria, Antipatharia, Corallimorpharia, and Scleractinia. Previously reported genomes from scleractinian corals and actiniarian sea anemones have illuminated part of the hexacorallian diversification. However, little is known about zoantharian genomes and the early evolution of hexacorals. We generated two Palythoa genomes from the order Zoantharia within Hexacorallia, providing novel insights into early hexacorallian evolution by comparing with genomes of diversified scleractinian corals and actiniarian sea anemones.
Draft genomes generated from ultra-low input PacBio sequencing totaled 373 Mbp and 319 Mbp for Palythoa mizigama (Pmiz) and Palythoa umbrosa (Pumb), respectively. A total of 30,394 and 24,800 protein-coding genes were predicted in the genomes of Pmiz and Pumb, respectively. Comparative genomic analyses identified 3,035 conserved gene families that were found in all analyzed hexacoral genomes. Some of the genes related to toxins, chitin degradation, and prostaglandin biosynthesis were expanded in these two Palythoa genomes, and many of them were arranged in tandem. Extensive gene family loss was not detected in the Palythoa lineage, and five of ten putatively lost gene families (GRIN1, TAAR7E, FEZF2, NRXN3, and SPATA7) likely had neuronal function, suggesting biased gene loss in Palythoa.
These first available gene sets (gene models) from zoantharians demonstrate genome conservation with restricted neuronal gene loss. Overall, our analyses imply that lineage-specific tandem duplication of enzyme genes has occurred in the genome evolution of Zoantharia.
README: Genome assemblies and gene models of the zoantharians Palythoa mizigama and Palythoa umbrosa
https://doi.org/10.5061/dryad.j0zpc86p9
We have submitted the genome assemblies and gene models of two Palythoa species (P. mizigama and P. umbrosa) as the following 8 files.
Genome assembly of the zoantharian Palythoa mizigama in FASTA format
File name: Pmiz_final.hm11.contigs.20240419.fa.gz
Gene models of the zoantharian Palythoa mizigama in General Transfer Format (GTF)
File name: Pmiz_final.hm11.20240419.gtf.gz
mRNA sequences of protein-coding genes of the zoantharian Palythoa mizigama in FASTA format
File name: Pmiz_final.hm11.transcript.20240419.fa.gz
Amino acid sequences of protein-coding genes of the zoantharian Palythoa mizigama in FASTA format
File name: Pmiz_final.protein.20240419.fa.gz
Genome assembly of the zoantharian Palythoa umbrosa in FASTA format
File name: Pumb_final.hm11.contigs.20240419.fa.gz
Gene models of the zoantharian Palythoa umbrosa in General Transfer Format (GTF)
File name: Pumb_final.hm11.20240419.gtf.gz
mRNA sequences of protein-coding genes of the zoantharian Palythoa umbrosa in FASTA format
File name: Pumb_final.hm11.transcript.20240419.fa.gz
Amino acid sequences of protein-coding genes of the zoantharian Palythoa umbrosa in FASTA format
File name: Pumb_final.protein.longest.20240419.fa.gz
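The FASTA files listed above are gzip-compressed and can be read without decompressing them to disk. The following is a minimal sketch using only the Python standard library; the function names `read_fasta` and `summarize` are illustrative and not part of the dataset, and the sketch assumes that the first whitespace-delimited token of each header line is the record ID.

```python
import gzip
from typing import Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    name = None
    chunks = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Assumption: the record ID is the first token of the header.
                tokens = line[1:].split()
                name = tokens[0] if tokens else ""
                chunks = []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarize(path: str) -> Tuple[int, int]:
    """Return (number of records, total sequence length) for a FASTA file."""
    count = 0
    total = 0
    for _, seq in read_fasta(path):
        count += 1
        total += len(seq)
    return count, total
```

For example, `summarize("Pmiz_final.hm11.contigs.20240419.fa.gz")` would return the number of contigs and the total assembly length, provided the file has been downloaded to the working directory.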
Sharing/Access information
Raw genomic DNA sequence data of Palythoa mizigama and Palythoa umbrosa have been submitted to the DDBJ Sequence Read Archive (DRA) under the accessions DRR546399-DRR546402 (BioProject ID: PRJDB18008). Mitochondrial and genomic assemblies have been deposited at DDBJ under the accessions AP031564, AP031565, BAACCD010000001-BAACCD010004032, and BAACCE010000001-BAACCE010002838.
Methods
Sample collection
The two target species of the genus Palythoa (P. mizigama and P. umbrosa) were collected from different sites in southern Japan. P. mizigama was sampled at Mizugama, Okinawa Island, Okinawa, Japan, on October 9, 2023. P. umbrosa was sampled at Nakano, Iriomote Island, Okinawa, Japan, on September 16, 2023. Each Palythoa polyp was placed in a 50 mL tube and chopped into small pieces with scissors. The fragmented tissue was kept in 3.3x PBS for 30 min and vortexed for 5 min during this period to dissociate it into individual cells. Dissociated cells were centrifuged at 1,000 g for 5 min. After the supernatant was discarded, the cells were quickly frozen in liquid nitrogen and stored in a -80 °C freezer until processing.
DNA isolation and sequencing
High-molecular-weight DNA was extracted from the dissociated cells using the innuPREP Plant DNA Kit I (Analytik Jena AG, Germany) following the manufacturer’s protocol. Sequencing libraries were constructed from the genomic DNA of each species according to the HiFi Ultra-Low Input DNA protocol from Pacific Biosciences (Menlo Park, California, USA). Library construction included a PCR-based whole-genome amplification step (Schneider et al. 2021), which may have introduced some amplification bias. Sequencing was performed on PacBio Sequel II systems (Pacific Biosciences, Menlo Park, CA, USA) using two SMRT (Single Molecule Real-Time) cells for each species.
Genome assembly and gene prediction
The mitochondrial genomes were first assembled from the HiFi reads with the MitoHiFi pipeline v3.2.1 (Uliano-Silva et al. 2023). HiFi reads that mapped to the mitochondrial genomes were removed from further steps. HiFi reads possibly originating from archaea, bacteria, viruses, plasmids, and UniVec Core sequences were identified with KRAKEN2 v2.1.3 (Wood et al. 2019) and its standard database (January 2024 update) and were also removed from further steps. Prior to nuclear genome assembly, k-mer profiling (k=41 and 61) was performed with Meryl v1.4.1 (Rhie et al. 2020), and the profiles were used for genome size estimation with GenomeScope v2.0 (Ranallo-Benavidez et al. 2020). Contaminant- and mitochondrion-free HiFi reads were then assembled with Hifiasm v0.19.8-r603 (Cheng et al. 2021). Primary contigs assembled with Hifiasm were used for further analyses. Error correction of the assembly was performed using Inspector v1.0.1 (Chen et al. 2021) with cleaned HiFi reads. Possible haplotypes were merged with HaploMerger2 v20180603 (Huang et al. 2017). This haplotype merging step was repeated several times together with error correction using Inspector. Repetitive elements in the assembly were identified de novo with RepeatModeler v2.0.4 (Flynn et al. 2020) with the -LTRStruct option. Repeats identified de novo and known repeats in the RepeatMasker database were concatenated, and the assembly was soft-masked using RepeatMasker v4.1.5 (Smit et al. 2015). Simple tandem repeats were further detected with tantan v49 (Frith 2011). BUSCO completeness of the genome assembly was assessed with compleasm v0.2.5 (Huang and Li 2023) and the “metazoa_odb10” database (n=954). To evaluate sequencing depth, cleaned HiFi reads were mapped to contigs with Minimap2 v2.26-r1175, and mean depth for each contig was calculated with BamDeal v0.27 (https://github.com/BGI-shenzhen/BamDeal?tab=readme-ov-file). GC content for each contig was calculated with SeqKit v2.8.0 (Shen et al. 2016). Depth and GC content were visualized using the “ggplot2” package (Ginestet 2011) in R v4.3.2 (R Foundation for Statistical Computing 2018).
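Per-contig GC content can also be recomputed directly from the contig sequences. The sketch below is not the SeqKit implementation: the function `gc_content` is a hypothetical helper, and it assumes that ambiguous bases such as N are excluded from the denominator, which may differ from how SeqKit handles them. Matching is case-insensitive, so lowercase bases in a soft-masked assembly are counted.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C among unambiguous A/C/G/T bases.

    Assumption: bases other than A, C, G, T (for example N) are ignored.
    Returns 0.0 for a sequence with no unambiguous bases.
    """
    upper = seq.upper()  # treat soft-masked (lowercase) bases like any other
    gc = upper.count("G") + upper.count("C")
    at = upper.count("A") + upper.count("T")
    total = gc + at
    return gc / total if total else 0.0
```

Applied to each contig sequence, this gives values that can be plotted against mean read depth per contig in the same way as described above.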
Gene prediction was performed with the BRAKER pipeline v3.0.8 (Hoff et al. 2019; Gabriel et al. 2021; Bruna et al. 2023). The soft-masked genome, proteomes retrieved from OrthoDB v11 (metazoa), and proteomes from selected cnidarians registered in RefSeq spanning five orders were used as input to the pipeline. Genes longer than 100 bp were retained. Each gene was named with the contig name followed by ‘g’ and a number assigned in order from one end of the contig, e.g., c0001.g001 is the first gene of contig c0001. Gene models were assessed using BUSCO v5.7.0 (Manni et al. 2021) with the “metazoa_odb10” database (2024-01-08, n=954). To check for possible contamination, gene models were also assessed with OMArk v0.3.0 (Nevers et al. 2024) and the OMAmer “Metazoa v2.0.0” database (n=3,244). Functional annotation of gene models was performed with BLASTP (E-value cutoff: 1e-3) (Camacho et al. 2009) against the Swiss-Prot database (7 June 2023).
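The gene naming scheme described above lets a gene ID be split back into its contig and its position index. The following is a minimal sketch; the pattern is inferred from the single example c0001.g001, so the regular expression and the helper name `parse_gene_id` are assumptions, and transcript or protein IDs in the files may carry additional suffixes that this pattern does not handle.

```python
import re
from typing import Tuple

# Assumed pattern: <contig>.g<number>, inferred from the example "c0001.g001".
GENE_ID_PATTERN = re.compile(r"^(?P<contig>.+)\.g(?P<index>\d+)$")


def parse_gene_id(gene_id: str) -> Tuple[str, int]:
    """Split a gene ID such as 'c0001.g001' into (contig name, gene index)."""
    match = GENE_ID_PATTERN.match(gene_id)
    if match is None:
        raise ValueError(f"unexpected gene ID format: {gene_id!r}")
    return match.group("contig"), int(match.group("index"))
```

Sorting genes by the returned tuple groups them by contig and orders them by their index along each contig, which can be convenient when looking for tandemly arranged genes.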
References
Bruna T, Lomsadze A, Borodovsky M. 2023. GeneMark-ETP: Automatic Gene Finding in Eukaryotic Genomes in Consistency with Extrinsic Data. bioRxiv:2023.01.13.524024.
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. BLAST+: architecture and applications. BMC bioinformatics 10:1-9.
Chen Y, Zhang Y, Wang AY, Gao M, Chong Z. 2021. Accurate long-read de novo assembly evaluation with Inspector. Genome biology 22:1-21.
Cheng H, Concepcion GT, Feng X, Zhang H, Li H. 2021. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature methods 18:170-175.
Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. 2020. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences 117:9451-9457.
Frith MC. 2011. A new repeat-masking method enables specific detection of homologous sequences. Nucleic acids research 39:e23-e23.
Gabriel L, Hoff KJ, Brůna T, Borodovsky M, Stanke M. 2021. TSEBRA: transcript selector for BRAKER. BMC bioinformatics 22:1-12.
Ginestet C. 2011. ggplot2: elegant graphics for data analysis. In: Oxford University Press.
Hoff KJ, Lomsadze A, Borodovsky M, Stanke M. 2019. Whole-genome annotation with BRAKER. Gene prediction: methods and protocols:65-95.
Huang N, Li H. 2023. compleasm: a faster and more accurate reimplementation of BUSCO. Bioinformatics 39:btad595.
Huang S, Kang M, Xu A. 2017. HaploMerger2: rebuilding both haploid sub-assemblies from high-heterozygosity diploid genome assembly. Bioinformatics 33:2577-2579.
Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. 2021. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Molecular biology and evolution 38:4647-4654.
Nevers Y et al. 2024. Quality assessment of gene repertoire annotations with OMArk. Nat. Biotechnol. doi: 10.1038/s41587-024-02147-w.
R Foundation for Statistical Computing R. 2018. R: a language and environment for statistical computing. RA Lang Environ Stat Comput.
Ranallo-Benavidez TR, Jaron KS, Schatz MC. 2020. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature communications 11:1432.
Rhie A, Walenz BP, Koren S, Phillippy AM. 2020. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome biology 21:1-27.
Schneider C et al. 2021. Two high-quality de novo genomes from single ethanol-preserved specimens of tiny metazoans (Collembola). Gigascience. 10. doi:10.1093/gigascience/giab035.
Shen W, Le S, Li Y, Hu F. 2016. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. Plos One 11:e0163962.
Smit A, Hubley R, Green P. 2015. RepeatMasker Open-4.0. 2013–2015. https://www.repeatmasker.org/RepeatMasker/
Uliano-Silva M, Ferreira JGR, Krasheninnikova K, Formenti G, Abueg L, Torrance J, Myers EW, Durbin R, Blaxter M. 2023. MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads. BMC bioinformatics 24:288.
Wood DE, Lu J, Langmead B. 2019. Improved metagenomic analysis with Kraken 2. Genome biology 20:1-13.