A highly contiguous reference genome for the Steller's jay (Cyanocitta stelleri)
Data files
May 17, 2023 version files (404.83 MB total)
- bCyaSte1.0.p_RepeatLibrary.fa (1.24 MB)
- GCA_026167965.1_bCyaSte1.0.p_liftoff.gff (403.58 MB)
- README.md (620 B)
Abstract
The Steller's jay is a familiar bird of western forests from Alaska south to Nicaragua. Here, we report a draft reference assembly for the species generated from PacBio HiFi long-read and Omni-C chromatin-proximity sequencing data as part of the California Conservation Genomics Project (CCGP). Sequenced reads were assembled into 352 scaffolds totaling 1.16 Gb in length. Assembly metrics indicate a highly contiguous and complete assembly, with a contig N50 of 7.8 Mb, a scaffold N50 of 25.8 Mb, and a BUSCO completeness score of 97.2%. Repetitive elements span 16.6% of the genome, including nearly 90% of the W chromosome. Compared with high-quality assemblies from other members of the family Corvidae, the Steller's jay genome contains a larger proportion of repetitive elements than four crow species (Corvus), but a lower proportion than the California scrub-jay (Aphelocoma californica). This reference genome will serve as an essential resource for future studies on speciation, local adaptation, phylogeography, and conservation genetics in this species of significant biological interest.
Methods
We performed de novo repeat annotation of the draft Steller's jay reference assembly using RepeatModeler2 with the -LTRStruct option enabled to improve identification of LTR elements (Flynn et al. 2020). We next prioritized LTR and unclassified elements at least 1,000 bp in length for manual curation. For each LTR consensus sequence, we used blastn (Camacho et al. 2009) to identify other members of the same TE family in the genome, added 2,000 bp of flanking sequence to both ends of each blastn hit, aligned the extended sequences with MAFFT (Katoh & Standley 2013), and visualized the alignment in AliView (Larsson et al. 2014). We confirmed the completeness of LTR elements based on the presence of the canonical 5' TG and 3' CA dinucleotides at the LTR termini. A consensus sequence was then generated from the trimmed multiple sequence alignment using the cons tool in EMBOSS (Rice et al. 2000). For sequences labeled as unclassified, we used blastn to check for significant hits to protein-coding genes and removed from the repeat library any element that had high sequence similarity with a known gene for over 80% of its length. We also used TE-Aid (https://github.com/clemgoub/TE-Aid) to explore the structural properties of each element and the presence of open reading frames for the proteins expected in each class of TE. Finally, we used cd-hit-est (Li et al. 2006) to cluster sequences belonging to the same family within the Steller's jay repeat library following the 80-80-80 rule (Wicker et al. 2007). Specifically, this rule considers consensus sequences to belong to the same family if they are at least 80 bp in length and share >80% similarity over >80% of their length. This produced a final repeat library with consensus sequences for 436 elements.
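Three of the curation steps above reduce to simple checks that can be sketched in Python: extending a blastn hit by 2,000 bp of flank, testing an LTR for the canonical TG...CA termini, and applying the 80-80-80 rule to a pairwise alignment. This is an illustrative sketch, not the code used to build the library. The function names, the 0-based half-open coordinate convention, clamping the flank at sequence ends, and measuring coverage against the shorter sequence are all assumptions made for the example.

```python
FLANK = 2000          # bp of flank added to each side of a blastn hit (from the text)
MIN_LEN = 80          # 80-80-80 rule: minimum aligned length in bp
MIN_IDENTITY = 80.0   # 80-80-80 rule: percent identity threshold
MIN_COVERAGE = 80.0   # 80-80-80 rule: percent coverage threshold


def extend_hit(start: int, end: int, seq_len: int, flank: int = FLANK) -> tuple[int, int]:
    """Extend a 0-based, half-open hit interval by `flank` bp on each side.

    Assumption: the interval is clamped to [0, seq_len] when a hit lies
    close to the end of a scaffold.
    """
    return max(0, start - flank), min(seq_len, end + flank)


def has_canonical_termini(ltr: str) -> bool:
    """Return True if an LTR sequence begins with 5' TG and ends with 3' CA."""
    seq = ltr.upper()
    return len(seq) >= 4 and seq.startswith("TG") and seq.endswith("CA")


def same_family(aligned_len: int, identity_pct: float, shorter_len: int) -> bool:
    """Apply the 80-80-80 rule to one pairwise alignment.

    Assumption: coverage is the aligned length as a percentage of the
    shorter of the two consensus sequences.
    """
    if shorter_len <= 0:
        return False
    coverage_pct = 100.0 * aligned_len / shorter_len
    return (
        aligned_len >= MIN_LEN
        and identity_pct > MIN_IDENTITY
        and coverage_pct > MIN_COVERAGE
    )
```

For example, `extend_hit(500, 1500, 10000)` clamps the left flank at the start of the sequence, and `same_family(70, 95.0, 75)` fails on the length criterion even though identity and coverage are high.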