Supplementary data for: Chromosome-level genome assembly and circadian gene repertoire of the Patagonia blennie Eleginops maclovinus

Cite this dataset

Rivera-Colón, Angel (2023). Supplementary data for: Chromosome-level genome assembly and circadian gene repertoire of the Patagonia blennie Eleginops maclovinus [Dataset]. Dryad. https://doi.org/10.5061/dryad.qbzkh18nt

Abstract

This dataset contains the genome assembly and associated annotation of the Patagonian Blennie (Eleginops maclovinus), the closest extant taxon to the Antarctic notothenioid radiation. In addition to the characterization of the E. maclovinus genome, the dataset includes a description of circadian rhythm orthologs for E. maclovinus, other notothenioid taxa, and teleost outgroups, as well as a copy of the bioinformatic scripts used for the assembly, annotation, and other downstream analyses.

Methods

An E. maclovinus specimen was collected from Puerto Natales, Chile, in January 2018. High-molecular-weight (HMW) DNA was extracted and sequenced using the PacBio Sequel II platform and a Hi-C library. A contig-level genome assembly was first generated using wtdbg2 (a.k.a. redbean) v2.5 (Ruan & Li 2020), and scaffolded with juicer v1.6.2 (Durand et al. 2016). PacBio and Hi-C raw data are available under NCBI BioProject PRJNA857989. For annotation, the RNA-seq data generated by Bilyk et al. (2018) were aligned to the genome and processed using BRAKER v2.1.6 (Brůna et al. 2021). The generated annotation was then further processed using TSEBRA v1.0.1 (Gabriel et al. 2021). Using a custom Python script (see scripts section), we curated the TSEBRA output to guarantee consistency in the naming of genes and transcripts, and to incorporate gene names and descriptions based on the corresponding zebrafish orthologs.
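As a rough illustration of what making gene and transcript names consistent in a GTF file can involve, a minimal Python sketch follows. It is not the curation script distributed with this dataset. The `EMAC` prefix, the ID pattern, and the function name `rename_gtf_ids` are hypothetical, and the sketch assumes that each feature line carries standard `gene_id "..."` and `transcript_id "..."` attributes in the ninth column.

```python
import re

def rename_gtf_ids(lines, prefix="EMAC"):
    """Return GTF lines with gene/transcript IDs replaced by uniform names.

    Hypothetical naming scheme: <prefix>_g000001 for genes and
    <prefix>_g000001.t1 for their transcripts, numbered in order of
    first appearance in the input.
    """
    gene_map = {}   # old gene ID -> new gene ID
    tx_map = {}     # old transcript ID -> new transcript ID
    tx_count = {}   # new gene ID -> number of transcripts seen so far
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Pass through comments, blank lines, and malformed records untouched.
        if line.startswith("#") or len(fields) < 9:
            out.append(line)
            continue
        attrs = fields[8]
        gene = re.search(r'gene_id "([^"]+)"', attrs)
        tx = re.search(r'transcript_id "([^"]+)"', attrs)
        if gene:
            old_gene = gene.group(1)
            if old_gene not in gene_map:
                gene_map[old_gene] = f"{prefix}_g{len(gene_map) + 1:06d}"
            new_gene = gene_map[old_gene]
            attrs = attrs.replace(f'gene_id "{old_gene}"', f'gene_id "{new_gene}"')
            if tx:
                old_tx = tx.group(1)
                if old_tx not in tx_map:
                    tx_count[new_gene] = tx_count.get(new_gene, 0) + 1
                    tx_map[old_tx] = f"{new_gene}.t{tx_count[new_gene]}"
                attrs = attrs.replace(
                    f'transcript_id "{old_tx}"',
                    f'transcript_id "{tx_map[old_tx]}"',
                )
        fields[8] = attrs
        out.append("\t".join(fields) + "\n")
    return out
```

Keeping the old-to-new ID maps in dictionaries means every feature line of the same gene or transcript receives the same new name, which is the consistency property described above. Attaching zebrafish-derived gene names and descriptions would need an ortholog table and is not covered by this sketch.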

A conserved synteny analysis using synolog (Catchen et al. 2009; Small et al. 2016) was employed for the manual curation of the assemblies. For example, we identified misassemblies in structural variants limited to contig boundaries, or merged scaffolds belonging to the same chromosome. We used a custom Python program to propagate these changes through the constituent assembly files.

For the circadian rhythm comparative analysis, assemblies and annotations were downloaded from genomic databases (e.g., ENSEMBL, NCBI). Circadian gene orthologs were identified using synolog and extracted using custom Python scripts.
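The extraction step can be pictured with a short Python sketch that pulls a set of sequences out of a FASTA file by record ID. This is an illustrative assumption about the general approach, not the custom scripts included in the dataset. The function names `read_fasta` and `extract_records` are hypothetical, and the ortholog IDs are assumed to have been identified beforehand (in the study, with synolog).

```python
def read_fasta(handle):
    """Yield (record_id, sequence) pairs from an open FASTA text handle.

    The record ID is taken as the first whitespace-delimited token of
    the header line, without the leading '>'.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def extract_records(handle, wanted_ids):
    """Return {record_id: sequence} for the records whose ID is wanted."""
    wanted = set(wanted_ids)
    return {name: seq for name, seq in read_fasta(handle) if name in wanted}
```

Because `read_fasta` is a generator, the file is streamed one record at a time and only the requested sequences are kept in memory.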

A detailed step-by-step description of the methods is available in the associated publication (Cheng et al.).

Usage notes

All assembly and annotation files are gzipped, but are otherwise standard bioinformatic formats (i.e., FASTA for genome assembly and coding/amino acid sequences, GTF for annotation, AGP for scaffolding). In addition, bioinformatic scripts for data generation and analysis are in Python (*.py) or Bash (*.sh), but might require the installation of additional open-source software (e.g., wtdbg2, BRAKER).

See the following links for the FASTA (http://www.ncbi.nlm.nih.gov/blast/fasta.shtml), GTF (https://useast.ensembl.org/info/website/upload/gff.html), and AGP (https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/) file format specifications.
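Since the files are gzipped, they can be read directly in Python without decompressing them first. The following sketch summarizes an AGP file per scaffold. It assumes the tab-delimited column layout of the AGP specification (object name in column 1, object end in column 3, component type in column 5, with types `N` and `U` denoting gaps). The function names and any file path passed to them are hypothetical.

```python
import gzip
from collections import defaultdict

GAP_TYPES = {"N", "U"}  # AGP component types that denote gaps

def open_text(path):
    """Open a plain or gzipped file in text mode, based on its suffix."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

def summarize_agp(path):
    """Return {object: {"components": n, "gaps": n, "length": bp}}."""
    summary = defaultdict(lambda: {"components": 0, "gaps": 0, "length": 0})
    with open_text(path) as handle:
        for line in handle:
            # Skip header/comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            cols = line.rstrip("\n").split("\t")
            obj, obj_end, comp_type = cols[0], int(cols[2]), cols[4]
            entry = summary[obj]
            if comp_type in GAP_TYPES:
                entry["gaps"] += 1
            else:
                entry["components"] += 1
            # The largest object end coordinate gives the object length.
            entry["length"] = max(entry["length"], obj_end)
    return dict(summary)
```

The same `open_text` helper works for the gzipped FASTA and GTF files, since `gzip.open(path, "rt")` returns an ordinary text handle.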

File format specification

File suffix [1]    Description
*.fa               Genome assembly in nucleotide FASTA format.
*.agp              Assembly structure in AGP format.
*.gtf              Genome annotation in GTF format.
*.cds.fa           Genomic sequence for all annotated protein-coding genes in nucleotide FASTA format.
*.protein.fa       Protein sequence for all annotated protein-coding genes in amino acid FASTA format.

[1] Does not include the gzipped compression suffix (*.gz).

Funding

National Science Foundation, Award: 1645087

National Science Foundation, Award: 11-42158