Data from: The spotted parrotfish genome provides evolutionary insight into the ecological adaptation of a keystone dietary specialist
Data files
Mar 04, 2024 version files 1.60 GB
- PF_annotation-run1_species_fgenesh_cds.fa (55.49 MB)
- PF_annotation-run1_species_fgenesh_mrna.fa (55.49 MB)
- PF_annotation-run1_species_fgenesh_proteins.fa (28.03 MB)
- PF_annotation-run1_species_fgenesh.gff3 (96.11 MB)
- PF_genome-assembly_final.fa (1.37 GB)
- README.md (1.72 KB)
Abstract
With over 600 valid species, the wrasses (family Labridae) are among the largest and most successful of the marine teleosts. They feature prominently on coral reefs where they are known not only for their impressive diversity in colouration and form, but also for their functional specialization and ability to occupy a wide variety of trophic guilds. Among the wrasses, the parrotfishes (tribe Scarini) display some of the most dramatic examples of trophic specialization. Using abrasion-resistant biomineralized teeth, parrotfishes are able to mechanically extract protein-rich micro-photoautotrophs growing in and amongst reef carbonate material, a dietary niche that is inaccessible to most other teleost fishes. This ability to exploit an otherwise untapped trophic resource is thought to have played a role in the diversification and evolutionary success of the parrotfishes. In order to better understand the key evolutionary innovations leading to the success of these dietary specialists, we sequenced and analysed the genome of a representative species, the spotted parrotfish (Cetoscarus ocellatus). We find significant expansion, selection, and duplication within several detoxification gene families and a novel poly-glutamine expansion in the enamel protein ameloblastin, and we consider their evolutionary implications. Our genome provides a useful resource for comparative genomic studies investigating the evolutionary history of this highly specialized teleostean radiation.
README: The spotted parrotfish genome provides evolutionary insight into the ecological adaptation of a keystone dietary specialist
This dataset contains the Cetoscarus ocellatus genome assembly and the accompanying annotation files. Assembly and annotation methods are described in the associated manuscript.
Description of the data and file structure
The file below is a FASTA file containing the assembled genome scaffolds:
- PF_genome-assembly_final.fa
The file below is an annotation file corresponding to the genome assembly, in standard GFF3 format:
- PF_annotation-run1_species_fgenesh.gff3
The files below are FASTA files for the annotated genes, extracted from the genome assembly, presented as coding sequences (CDS), mRNA sequences, and protein sequences respectively:
- PF_annotation-run1_species_fgenesh_cds.fa
- PF_annotation-run1_species_fgenesh_mrna.fa
- PF_annotation-run1_species_fgenesh_proteins.fa
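As a quick sanity check after downloading, the FASTA files can be summarised with a few lines of standard-library Python. The sketch below is not part of the dataset or the published pipeline; the function names `read_fasta` and `summarise` are hypothetical, and it assumes plain (uncompressed) FASTA in which the sequence ID is the header text up to the first whitespace.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarise(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of records, total sequence length)."""
    count = 0
    total = 0
    for _, seq in read_fasta(lines):
        count += 1
        total += len(seq)
    return count, total
```

Because an open file handle is an iterable of lines, a downloaded file can be passed directly, for example `summarise(open("PF_annotation-run1_species_fgenesh_proteins.fa"))`. Records are streamed one at a time, which keeps memory use modest for the annotation files; the 1.37 GB assembly will still hold one full scaffold in memory at a time.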
The annotation files were produced using FGENESH++ v7.2.2. The tab-delimited 'General Feature Format' annotation file (i.e. the 'gff3' file) is in a standard format for genome annotations; it contains information for every annotated feature in the associated reference genome. The contig/scaffold sequence names in the gff3 file correspond to the sequence names in the associated genome assembly file.
The protein, mRNA and CDS FASTA files are based on the annotations detailed in the gff3 file.
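The correspondence between the sequence names in the gff3 file and those in the assembly can be verified with a short script. This is a minimal sketch under stated assumptions, not a tool shipped with the dataset: the function names are hypothetical, the seqid is taken from the first of the nine tab-delimited GFF3 columns, and FASTA IDs are taken as the header text up to the first whitespace.

```python
from typing import Iterable, List, Set


def gff3_seqids(lines: Iterable[str]) -> Set[str]:
    """Collect the seqid (column 1) of every feature line in a GFF3 file."""
    ids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an optional embedded FASTA section ends the feature table
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        ids.add(line.split("\t", 1)[0])
    return ids


def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect sequence IDs from FASTA header lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def missing_seqids(gff_lines: Iterable[str], fasta_lines: Iterable[str]) -> List[str]:
    """Return seqids used in the GFF3 that have no matching FASTA record."""
    return sorted(gff3_seqids(gff_lines) - fasta_ids(fasta_lines))
```

Passing open handles for `PF_annotation-run1_species_fgenesh.gff3` and `PF_genome-assembly_final.fa` to `missing_seqids` should return an empty list if every annotated scaffold is present in the assembly.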
Sharing/Access information
All sequence data that were utilised to generate this genome are available on NCBI under BioProject PRJNA1081164.
Methods
The spotted parrotfish (Cetoscarus ocellatus) genome was sequenced and assembled to investigate the evolution of this coral reef fish group, and to provide genomic resources for studies on the Scarini. This genome was assembled using a combination of long-read, linked-read, and Hi-C data (the raw sequence data are available on the SRA database under BioProject accession PRJNA1081164). Assembly methods are outlined in the associated manuscript. Briefly, an initial de novo assembly of the PacBio long-read data was performed using Canu v.2.1.1 with default settings and an estimated genome size of 1.4 Gb (based on published labrid genomes). TELL-seq linked reads were used to scaffold the draft de novo long-read assembly and improve its contiguity using Long Ranger basic v2.2.2, ARCS v1.2, and LINKS v1.8.7. Finally, Hi-C reads were aligned to the ARCS/LINKS-scaffolded draft assembly. The genome was annotated using FGENESH++.