Data from: High-resolution chromosome-level genome of Scylla paramamosain provides molecular insights into adaptive evolution in crab
Data files
Nov 20, 2024 version files 1.27 GB
- README.md (768 B)
- SP2022.Chr.bed (23.67 KB)
- SP2022.fasta (1.21 GB)
- SP2022.genome.gff (40.89 MB)
- SP2022.genome.pep.fasta (14.54 MB)
Abstract
Crabs exhibit adaptability across diverse ecosystems, encompassing shallow coral reefs, hydrothermal vents, as well as freshwater, seawater, and terrestrial habitats. Our study presents a thorough genomic analysis of the mud crab leveraging ultralong sequencing technologies. We achieved a high-quality chromosome-level assembly, covering 1.21 Gb (scaffold N50 23.61 Mb) with 33,662 protein-coding genes. Anchoring contigs into 49 pseudochromosomes revealed Chromosome 06 as a significant sex chromosome. Notably, the Hox genes are implicated in abdomen limb morphology, with the Abd-A gene identified as a linchpin in pleopod development, shedding light on brachyurization. The fru gene, negatively regulated by novel-miR35, plays a critical role in ovarian maturation. Additionally, we elucidated neo-functionalization of the elovl6 gene and identified tissue-specific splice variants, suggesting a potential role for Elovl6 in long-chain polyunsaturated fatty acid biosynthesis. These findings contribute significantly to our understanding of crab adaptability and evolutionary dynamics, offering a robust foundation for future investigations.
https://doi.org/10.5061/dryad.kwh70rzc3
The sequences and the annotation files of the green mud crab genome. The dataset consists of SP2022.Chr.bed, SP2022.fasta, SP2022.genome.gff, and SP2022.genome.pep.fasta files.
Description of the data and file structure
The sequence files are in FASTA format and the annotation file is in GFF format. Together they represent the results of genome sequencing, assembly, and annotation of the green mud crab.
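As a quick check on a downloaded FASTA file, sequence lengths and an N50 value can be computed with a few lines of Python. This is a minimal sketch, not part of the dataset: the function names `read_fasta_lengths` and `n50` are hypothetical, and whether the result matches the scaffold N50 reported in the abstract depends on how that figure was calculated.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence_id: length} from FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            header = line[1:].split()
            current = header[0] if header else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def n50(lengths: List[int]) -> int:
    """Largest length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

A file can be streamed directly, for example `with open("SP2022.fasta") as fh: print(n50(list(read_fasta_lengths(fh).values())))`, so the 1.21 GB assembly does not need to be held in memory.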
Sharing/Access information
Links to other publicly accessible locations of the data:
Sample preparation and genomic DNA isolation
Genomic DNA for sequencing was extracted from the testis of a wild adult male green mud crab (S. paramamosain) with stage III testis, which was caught off the coast of Shantou, China. A Grandomics Genomic BAC-long DNA Kit was used to isolate ONT ultralong DNA according to the manufacturer's guidelines. The total DNA quantity and quality were evaluated using a NanoDrop One UV-Vis Spectrophotometer (Thermo Fisher Scientific, Waltham, MA) and a Qubit 3.0 Fluorometer (Invitrogen Life Technologies, Carlsbad, CA). Large DNA fragments were obtained through gel cutting with the BluePippin system (Sage Science, Beverly, MA). DNA qualification involved visual inspection for foreign matter, assessment of degradation and fragment size by 0.75% agarose gel electrophoresis, a purity check (OD260/280 between 1.8 and 2.0; OD260/230 between 2.0 and 2.2) using the NanoDrop, and precise quantification with the Qubit 3.0 Fluorometer.
Library construction and sequencing
Approximately 8-10 µg of genomic DNA (> 50 kb) was selected using the SageHLS HMW library system (Sage Science, Beverly, MA, USA) and processed with the Ligation Sequencing 1D Kit (Oxford Nanopore Technologies, Shanghai, China) following the manufacturer's instructions. Library construction and sequencing were conducted on the PromethION (Oxford Nanopore Technologies) at the Genome Center of Grandomics (Wuhan, China). After quality inspection, large DNA fragments were recovered using the BluePippin automatic nucleic acid recovery instrument. End repair, A-tailing, and adapter ligation were performed using the LSK109 ligation kit. The constructed DNA library was quantified precisely with Qubit. The library was loaded into a flow cell, transferred to the PromethION sequencer, and subjected to real-time single-molecule sequencing. On the ONT sequencing platform, base calling, the conversion of nanopore-generated signals to base sequences (69), was executed using the Guppy toolkit (Oxford Nanopore Technologies). Pass reads with a mean qscore_template value greater than or equal to 7 were obtained and directly used for subsequent assembly (https://github.com/nanoporetech/taiyaki).
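The pass/fail threshold described above can be illustrated with a short Python sketch. It assumes that the mean read quality is derived by averaging per-base error probabilities and converting the result back to a Phred score, and that quality strings use the Phred+33 encoding of FASTQ files. Both are assumptions for illustration, and the function names are hypothetical rather than part of Guppy.

```python
import math


def mean_qscore(quality: str, offset: int = 33) -> float:
    """Mean Phred quality of a read, averaged in error-probability space.

    Assumption: the mean is taken over per-base error probabilities,
    not over the Phred scores themselves.
    """
    if not quality:
        return 0.0
    error_probs = [10 ** (-(ord(char) - offset) / 10) for char in quality]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)


def is_pass_read(quality: str, threshold: float = 7.0) -> bool:
    """True when the read's mean quality reaches the threshold (7 in the text above)."""
    return mean_qscore(quality) >= threshold
```

Averaging in probability space penalises low-quality bases more strongly than a plain mean of Phred scores: a read with one Q0 base and one Q20 base has an arithmetic mean of Q10 but a probability-space mean of about Q3, so it would fail the threshold.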
Genome assembly, evaluation and correction
Reads that passed quality control were used for a third-generation-only assembly with NextDenovo (reads_cutoff:1k, seed_cutoff:28k) (https://github.com/Nextomics/NextDenovo.git). The NextCorrect module was used to correct the raw data, yielding a consensus sequence (CNS sequence) after 13 Gb of error correction. De novo assembly with the NextGraph module produced a preliminary genome assembly. ONT and PacBio HiFi third-generation data were then used with NextPolish (https://github.com/Nextomics/NextPolish.git) for genome correction. The corrected genome (polished genome) was obtained after three rounds of correction with both ONT and PacBio HiFi data. BWA-MEM with default parameters was used to align next-generation (short-read) data to the genome, and Pilon polishing was iterated three times to derive the final genomic sequence. GC depth analysis and BUSCO prediction (https://busco.ezlab.org/) were used to assess genome quality and completeness.
Chromosome anchoring by Hi-C sequencing
To anchor hybrid scaffolds onto chromosomes, genomic DNA for Hi-C library construction was extracted from green mud crab testes. Cell samples from the same tissue used for genomic DNA sequencing were employed for Hi-C library preparation. The process involved cross-linking cells with formaldehyde, lysing cells, resuspending nuclei, and subsequent steps leading to proximity ligation. After overnight ligation, cross-linking was reversed, and chromatin DNA manipulations were performed. DNA purification and shearing to 400 bp lengths were followed by Hi-C library preparation using the NEBNext Ultra II DNA Library Prep Kit for Illumina. Sequencing on the Illumina NovaSeq/MGI-2000 platform completed the Hi-C procedure. Briefly, cell samples were fixed with formaldehyde and subjected to lysis and extraction for sample quality assessment. After passing the quality test, the Hi-C fragment preparation process involved chromatin digestion, biotin labeling, end ligation, DNA purification, and library construction. The library was sequenced on the MGI-2000 platform, and data were processed to extract high-quality reads. Processing included trimming adapters, removing low-quality reads, and eliminating reads with an N content exceeding 5. Reads were aligned using Bowtie2 (70), and contig clustering was performed using LACHESIS software (71).
Gene annotation
Tandem repeats were annotated using GMATA (https://sourceforge.net/projects/gmata/?source=navbar) and Tandem Repeats Finder (TRF) (http://tandem.bu.edu/trf/trf.html), identifying simple sequence repeats (SSRs) and all tandem repeat elements. Transposable elements (TEs) were identified through a combined ab initio and homology-based approach, with RepeatMasker (https://github.com/rmhubley/RepeatMasker) used to search for known and novel TEs. Gene prediction employed three methods: GeMoMa (http://www.jstacs.de/index.php/GeMoMa) for homology-based prediction, PASA (https://github.com/PASApipeline/PASApipeline) for RNA-seq-based prediction, and Augustus (https://github.com/Gaius-Augustus/Augustus) for de novo prediction. EVidenceModeler (EVM) (http://evidencemodeler.github.io/) integrated the gene sets, which were then filtered to remove transposons and erroneous genes. UTRs and alternative splicing regions were determined using PASA based on RNA-seq assemblies. Functional annotation involved comparisons with public databases, including SwissProt, NR, KEGG, KOG, and Gene Ontology. InterProScan identified putative domains and GO terms. BLASTp (https://blast.ncbi.nlm.nih.gov/Blast.cgi) against public protein databases was used to assess gene function information. Noncoding RNA (ncRNA) prediction used tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/) and Infernal cmscan (http://eddylab.org/infernal/) for tRNAs and other noncoding RNAs. BUSCO was employed to evaluate the gene prediction by aligning annotated protein sequences to lineage-specific BUSCO databases.
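A simple way to inspect the resulting annotation is to tally feature types in the GFF file. The sketch below assumes that SP2022.genome.gff follows the standard nine-column, tab-separated GFF layout with the feature type in the third column. The function name and the feature names in the usage note are hypothetical and may not match the labels used in the file.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(lines: Iterable[str]) -> Counter:
    """Count GFF records per feature type (third column)."""
    counts: Counter = Counter()
    for line in lines:
        # Skip comments, directives, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        # Ignore malformed records that lack the nine standard columns.
        if len(columns) < 9:
            continue
        counts[columns[2]] += 1
    return counts
```

For example, `with open("SP2022.genome.gff") as fh: print(count_feature_types(fh).most_common())` lists the feature types in decreasing order of frequency. If the file labels gene records as `gene`, that count can be compared with the 33,662 protein-coding genes reported in the abstract.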
Evolutionary analysis
The evolutionary analysis entailed the examination of gene families, construction of phylogenetic trees, estimation of divergence times, exploration of gene expansion and contraction phenomena, identification of orthologous genes, scrutiny of positively selected genes, and investigation of whole-genome duplications. To ascertain homologous relationships between S. paramamosain and other animal species, protein sequences were acquired and aligned using OrthoMCL (https://orthomcl.org/orthomcl/). Initially, protein sets were gathered from 14 sequenced animal species, and the longest transcript for each gene was selected, excluding miscoded and prematurely terminated genes. Subsequently, pairwise alignment of these extracted protein sequences was conducted to identify conserved orthologs, employing BLASTp with an E-value threshold of ≤ 1 × 10^-5. Further identification of orthologous intergenome gene pairs, paralogous intragenome gene pairs, and single-copy gene pairs was achieved using OrthoMCL. Species-specific genes, encompassing S. paramamosain-specific unique genes and unclustered genes, were extracted. Functional annotation and enrichment tests of species-specific genes were performed utilizing information from homologs in the online Gene Ontology (http://www.geneontology.org/) and KEGG (Kyoto Encyclopedia of Genes and Genomes) (https://www.genome.jp/kegg/) databases.
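The E-value threshold applied to the pairwise alignments can be sketched as a filter over tabular BLAST output. The source does not state the output format that was used. The example assumes the common 12-column tabular layout, with query ID, subject ID, and E-value in columns 1, 2, and 11, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def filter_hits_by_evalue(
    lines: Iterable[str], max_evalue: float = 1e-5
) -> List[Tuple[str, str, float]]:
    """Keep (query, subject, evalue) for hits with E-value <= max_evalue.

    Assumption: 12-column tab-separated BLAST output with the E-value
    in the 11th column.
    """
    kept: List[Tuple[str, str, float]] = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 12:
            continue
        evalue = float(columns[10])
        if evalue <= max_evalue:
            kept.append((columns[0], columns[1], evalue))
    return kept
```

The filtered pairs would then be passed on to the clustering step. This sketch covers only the threshold and does not reproduce OrthoMCL's own normalisation or clustering.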
Building upon the orthologous gene sets identified with OrthoMCL, molecular phylogenetic analysis was executed using shared single-copy genes. Coding sequences were extracted from single-copy families, followed by multiple alignment of each ortholog group using MAFFT (https://mafft.cbrc.jp/alignment/software/). Gblocks was applied to eliminate poorly aligned regions, and the GTRGAMMA substitution model of RAxML (https://cme.h-its.org/exelixis/web/software/raxml/hands_on.html) was employed for phylogenetic tree construction with 1000 bootstrap replicates. The resulting tree file was visualized using FigTree/MEGA (http://tree.bio.ed.ac.uk/software/figtree/). The RelTime tool (https://www.megasoftware.net/) of MEGA-CC was then utilized to compute mean substitution rates along each branch and estimate species divergence times, with three fossil calibration times obtained from the TimeTree (http://www.timetree.org/) database serving as temporal controls, including the divergence times of S. paramamosain.
The detection of significant expansions or contractions in specific gene families, often indicative of adaptive divergence in closely related species, was carried out based on OrthoMCL results. CAFE (https://github.com/hahnlab/CAFE), employing a birth and death process to model gene gain and loss over a phylogeny, was used for this purpose. Furthermore, in accordance with the neutral theory of molecular evolution, the ratio of the nonsynonymous substitution rate (Ka) to the synonymous substitution rate (Ks) of protein-coding genes was calculated. The average Ka/Ks values were determined, and a branch-site likelihood ratio test using Codeml (http://abacus.gene.ucl.ac.uk/software/) from the PAML package was conducted to identify positively selected genes within the S. paramamosain lineage. Genes with a p value < 0.05 under the branch-site model were considered positively selected.
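The likelihood ratio test behind the p value cutoff can be written out in a few lines. This sketch assumes the test statistic 2ΔlnL is compared against a chi-square distribution with one degree of freedom, a common convention for the branch-site test that the text above does not spell out. The function name and the log-likelihood values in the usage note are hypothetical, not results from this study.

```python
import math


def lrt_pvalue(lnl_null: float, lnl_alt: float) -> float:
    """P value of a likelihood ratio test with one degree of freedom.

    The statistic is 2 * (lnL_alt - lnL_null). For a chi-square
    distribution with 1 degree of freedom, the upper-tail probability
    equals erfc(sqrt(statistic / 2)).
    """
    statistic = 2.0 * (lnl_alt - lnl_null)
    if statistic <= 0:
        # The alternative model fits no better than the null model.
        return 1.0
    return math.erfc(math.sqrt(statistic / 2.0))
```

With hypothetical log-likelihoods of -1000.0 for the null model and -997.0 for the alternative model, the statistic is 6.0 and `lrt_pvalue(-1000.0, -997.0)` falls below 0.05, so such a gene would be reported as positively selected under the criterion above. In practice the two log-likelihoods would be read from the Codeml output of each model.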
Whole-genome duplication events in the green mud crab genome were investigated using fourfold synonymous third-codon transversion (4DTv) and synonymous substitution rate (Ks) estimation. Initial steps involved extracting protein sequences and conducting all-vs.-all paralog analysis through self-BLASTp. After filtering by identity and coverage, the BLASTp results were analyzed with MCScanX (71), and the respective collinear blocks were identified. Subsequently, Ks and 4DTv were calculated for the syntenic block gene pairs using KaKs-Calculator (https://sourceforge.net/projects/kakscalculator2/), and potential WGD events in each genome were evaluated based on their Ks and 4DTv distributions.
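To make the 4DTv measure concrete, the following sketch computes an uncorrected 4DTv value for one pair of codon-aligned coding sequences. It assumes the standard genetic code and omits the multiple-substitution correction that analysis pipelines often apply. It is an illustration of the quantity, not the procedure used to produce the results, and the function name is hypothetical.

```python
# Codon prefixes whose third position is fourfold degenerate
# under the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def four_dtv(cds_a: str, cds_b: str) -> float:
    """Uncorrected 4DTv: transversions per shared fourfold-degenerate site.

    Both inputs must be codon-aligned sequences of equal length.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be codon-aligned and of equal length")
    sites = 0
    transversions = 0
    for start in range(0, len(cds_a), 3):
        codon_a = cds_a[start:start + 3].upper()
        codon_b = cds_b[start:start + 3].upper()
        # Only codons sharing the same fourfold-degenerate prefix are compared.
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        # Skip gaps and ambiguous bases at the third position.
        if codon_a[2] not in "ACGT" or codon_b[2] not in "ACGT":
            continue
        sites += 1
        # A transversion swaps a purine for a pyrimidine or vice versa.
        if (codon_a[2] in PURINES) != (codon_b[2] in PURINES):
            transversions += 1
    return transversions / sites if sites else float("nan")
```

Plotting the distribution of this value across all gene pairs in collinear blocks gives the kind of 4DTv profile used to look for peaks that may indicate duplication events.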