A high-quality genome assembly and annotation of the dark-eyed junco Junco hyemalis, a recently diversified songbird
Friis, Guillermo (2022), A high-quality genome assembly and annotation of the dark-eyed junco Junco hyemalis, a recently diversified songbird, Dryad, Dataset, https://doi.org/10.5061/dryad.c59zw3r87
The dark-eyed junco (Junco hyemalis) is one of the most common passerines of North America and has served as a model organism in studies of ecophysiology, behavior and evolutionary biology for over a century. It comprises at least six distinct, geographically structured forms of recent evolutionary origin that show remarkable variation in phenotypic traits, migratory behavior and habitat. Here we report a high-quality genome assembly and annotation of the dark-eyed junco generated using a combination of shotgun libraries and proximity ligation Chicago™ and Dovetail HiC™ libraries. The final assembly is 1,031,523,571 bp long, with 98.3% of the sequence located in 30 full or nearly full chromosome scaffolds, and an N50/L50 of 71.3 Mb/5 scaffolds. We identified 19,026 functional genes by combining gene prediction and similarity approaches, of which 15,967 were associated with GO terms. The genome assembly and the annotated gene set yielded completeness scores of 95.4% and 96.2%, respectively, when compared with the BUSCO avian dataset. This new assembly for J. hyemalis provides a valuable resource for genome evolution analysis, as well as for identifying functional genes involved in adaptive processes and speciation.
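For readers unfamiliar with the contiguity statistics quoted above, the following is a minimal sketch of how N50 and L50 are conventionally computed from a list of scaffold lengths. The function name and the toy lengths are illustrative assumptions and are not taken from the junco assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold lengths in bp.

    N50 is the length of the scaffold at which the running total,
    taken from longest to shortest, first reaches half of the
    assembly size; L50 is the number of scaffolds needed to get there.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count


# Toy lengths (hypothetical, not the junco scaffolds).
toy_lengths = [50, 40, 30, 20, 10]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)
```

With these toy values the total is 150 bp, so the halfway point of 75 bp is reached by the second-longest scaffold.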
Genome sequencing and assembly
A high-quality genome was produced by combining newly generated shotgun reads and sequence data from proximity ligation libraries. Preparation of the Chicago™ and HiC™ proximity ligation libraries, as well as scaffolding with the software pipeline HiRise (Putnam et al. 2016; https://dovetailgenomics.com), was conducted at Dovetail Genomics, LLC. The sequenced sample consisted of muscle tissue obtained from a female J. hyemalis carolinensis collected at Mountain Lake Biological Station in Pembroke, Virginia, USA (37.3751°N, 80.5228°W), and currently deposited at the Moore Laboratory of Zoology, Occidental College, Los Angeles, California (voucher number: MLZ:bird:69236). Briefly, a de novo draft assembly was first built using shotgun, paired-end libraries (mean insert size ~350 bp) and the Meraculous pipeline (Chapman et al. 2011). For the Chicago™ and the Dovetail HiC™ library preparation, chromatin was fixed with formaldehyde. Fixed chromatin was then digested with DpnII and free blunt ends were ligated. Crosslinks were reversed and the DNA was purified from protein. The resulting DNA was then sheared to a mean fragment size of ~350 bp, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. The libraries were sequenced on an Illumina HiSeq X platform. The shotgun reads, Chicago library reads, and Dovetail Hi-C library reads were then used as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies (Putnam et al. 2016).
The mitochondrial genome was assembled from the shotgun data using NOVOPlasty 2.7.2 (Dierckxsens et al. 2017). The NADH dehydrogenase subunit 2 (ND2) mitochondrial gene, sequenced for a previous study (Friis et al. 2016) and available in NCBI (GenBank accession no. KX461682.1), was used as the input seed sequence.
Identification of repetitive regions
We first created a repeat library for the junco genome by modelling ab initio repeats with RepeatModeler 1.0.11 (Smit and Hubley 2019) in scaffolds longer than 100 Kb, using default options. The resulting repeat library was merged with known bird repeat libraries from the RepBase database (RepBase-20181026) (Bao et al. 2015), Dfam_Consensus-20181026 and repeats from the zebra finch. We then used RepeatMasker 4.0.7 (Smit et al. 2015) to identify and mask repeat regions in the whole-genome assembly, and classified the repeat distribution by chromosome.
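The scaffold length filter applied before repeat modelling can be sketched as follows. This is a minimal illustration and not the authors' script: the helper names are hypothetical, the demo sequences are synthetic, and "100 Kb" is assumed here to mean 100,000 bp.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def long_scaffolds(handle, min_length=100_000):
    """Keep scaffolds strictly longer than min_length bp (assumed cutoff)."""
    return [(name, seq) for name, seq in read_fasta(handle)
            if len(seq) > min_length]


# Synthetic two-scaffold FASTA: one of 120,000 bp and one of 40 bp.
demo = io.StringIO(
    ">scaf_a\n" + "ACGT" * 30_000 + "\n>scaf_b\n" + "ACGT" * 10 + "\n"
)
kept = long_scaffolds(demo)
print([name for name, _ in kept])
```

Only the 120,000 bp scaffold passes the filter in this demo; a real assembly would be read from a file handle opened on the FASTA instead of the in-memory string.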
Gene prediction and functional annotation
Gene prediction was conducted using BRAKER v2.1.5 (Hoff et al. 2019) and GeMoMa v1.7.1 (Keilwagen et al. 2019). Using the repeat soft-masked genome assembly, we first trained Augustus with the conserved orthologous genes from BUSCO Aves_odb10 as proteins of short evolutionary distance (Stanke et al. 2006; Gremme et al. 2005; see Figure 3B from Hoff et al. 2019). The predicted proteins resulting from Augustus training were combined with homology-based annotations, using the zebra finch (GCF_008822105.2; Warren et al. 2010) and chicken (GCF_000002315.6; International Chicken Genome Sequencing Consortium 2004) annotated genes with the GeMoMa pipeline, to obtain the final reported gene models.
We applied a similarity-based search approach to conduct the functional annotation of the junco predicted proteins. We first used BLASTP against the UniProt SwissProt database and the annotated proteins from the zebra finch genome (Warren et al. 2010; UniProt Consortium 2019), with an E-value threshold of 10⁻⁵. We considered as positives only those hits covering at least 2/3 of the query sequence length or 80% of the total subject sequence. We also used InterProScan v5.31 (Jones et al. 2014) to identify specific protein-domain signatures in the predicted genes. The functional annotation, including Gene Ontology terms, was integrated from all searches, providing a curated set of junco coding genes (Fig. S2, Supp. Inf.). We used GenomeTools (Gremme et al. 2013) to calculate the number and mean length of genes, exons, introns and CDS (coding sequences) from the annotation file in general feature format (GFF).
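The hit-filtering rule described above can be expressed as a short sketch. It assumes that each BLASTP hit has already been parsed into alignment coordinates plus query and subject lengths. The `Hit` record, the function name, the inclusive treatment of the E-value threshold and the computation of coverage from alignment coordinates are all assumptions made for illustration, not details given in the original methods.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One parsed BLASTP hit (hypothetical record layout)."""
    query: str
    subject: str
    evalue: float
    qstart: int
    qend: int
    sstart: int
    send: int
    qlen: int
    slen: int


def is_positive(hit, max_evalue=1e-5, min_qcov=2 / 3, min_scov=0.8):
    """Apply the E-value cutoff and the query-or-subject coverage rule."""
    if hit.evalue > max_evalue:
        return False
    # Coverage is taken as the aligned span over the full sequence length.
    qcov = (abs(hit.qend - hit.qstart) + 1) / hit.qlen
    scov = (abs(hit.send - hit.sstart) + 1) / hit.slen
    return qcov >= min_qcov or scov >= min_scov


# Made-up hits for illustration only.
hits = [
    Hit("p1", "s1", 1e-20, 1, 250, 1, 250, 300, 1000),   # query coverage 0.83
    Hit("p2", "s2", 1e-3, 1, 300, 1, 300, 300, 300),     # E-value too high
    Hit("p3", "s3", 1e-10, 1, 100, 1, 100, 300, 110),    # subject coverage 0.91
    Hit("p4", "s4", 1e-10, 1, 100, 1, 100, 300, 1000),   # neither coverage met
]
positives = [h.query for h in hits if is_positive(h)]
print(positives)
```

Because the two coverage conditions are joined by "or", a short query fully contained in a long subject and a short subject mostly covered by a long query can both count as positives.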
Gene completeness assessment and genome synteny
We assessed gene completeness and gene annotation in the genome assembly using BUSCO (Benchmarking Universal Single-Copy Orthologs) v4.0.5 (--auto-lineage-euk option; Waterhouse et al. 2018). BUSCO evaluations were conducted using the 255 and 8,338 single-copy orthologous genes in Eukaryota_odb10 and Aves_odb10 datasets, respectively. In addition, we used MUMmer (Delcher et al. 2003) to explore synteny with the zebra finch (Taeniopygia guttata) genome v87 available in Ensembl (Yates et al. 2016).
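BUSCO summarizes its results as counts of complete single-copy, complete duplicated, fragmented and missing orthologs, with the completeness score being the complete fraction of the dataset. The sketch below shows that arithmetic only; the function name and the counts are hypothetical and are not results from the junco assembly.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of the dataset."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("BUSCO dataset is empty")

    def pct(count):
        return round(100 * count / total, 1)

    return {
        "complete": pct(single + duplicated),  # C = S + D
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
        "total": total,
    }


# Hypothetical counts for a 1,000-gene dataset (illustration only).
summary = busco_summary(single=940, duplicated=10, fragmented=20, missing=30)
print(summary)
```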
Bao, W., K.K. Kojima, and O. Kohany, 2015 Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6 (1):1-6.
International Chicken Genome Sequencing Consortium, 2004 Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432 (7018):695-716.
Delcher, A.L., S.L. Salzberg, and A.M. Phillippy, 2003 Using MUMmer to identify similar regions in large sequence sets. Current protocols in bioinformatics (1):10.13. 11-10.13. 18.
Dierckxsens, N., P. Mardulyn, and G. Smits, 2017 NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic Acids Research 45 (4):e18-e18.
Friis, G., P. Aleixandre, R. Rodríguez‐Estrella, A.G. Navarro‐Sigüenza, and B. Milá, 2016 Rapid postglacial diversification and long‐term stasis within the songbird genus Junco: phylogeographic and phylogenomic evidence. Molecular Ecology 25 (24):6175-6195.
Gremme, G., V. Brendel, M.E. Sparks, and S. Kurtz, 2005 Engineering a software tool for gene structure prediction in higher organisms. Information and Software Technology 47 (15):965-978.
Gremme, G., S. Steinbiss, and S. Kurtz, 2013 GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 10 (3):645-656.
Hoff, K.J., A. Lomsadze, M. Borodovsky, and M. Stanke, 2019 Whole-genome annotation with BRAKER, pp. 65-95 in Gene Prediction. Springer.
Jones, P., D. Binns, H.-Y. Chang, M. Fraser, W. Li et al., 2014 InterProScan 5: genome-scale protein function classification. Bioinformatics 30 (9):1236-1240.
Keilwagen, J., F. Hartung, and J. Grau, 2019 GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data. Methods in Molecular Biology (Clifton, NJ) 1962:161-177.
Putnam, N.H., B.L. O'Connell, J.C. Stites, B.J. Rice, M. Blanchette et al., 2016 Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Research 26 (3):342-350.
Smit, A., and R. Hubley, 2019 RepeatModeler-1.0.11. Institute for Systems Biology. http://www.repeatmasker.org/RepeatModeler/. Accessed 15.
Smit, A., R. Hubley, and P. Green, 2015 RepeatMasker Open-4.0. 2013–2015.
Stanke, M., O. Schöffmann, B. Morgenstern, and S. Waack, 2006 Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7 (1):62.
UniProt Consortium, 2019 UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research 47 (D1):D506-D515.
Warren, W.C., D.F. Clayton, H. Ellegren, A.P. Arnold, L.W. Hillier et al., 2010 The genome of a songbird. Nature 464 (7289):757.
Waterhouse, R.M., M. Seppey, F.A. Simão, M. Manni, P. Ioannidis et al., 2018 BUSCO applications from quality assessments to gene prediction and phylogenomics. Molecular Biology and Evolution 35 (3):543-548.
Yates, A., W. Akanni, M.R. Amode, D. Barrell, K. Billis et al., 2016 Ensembl 2016. Nucleic Acids Research:gkv1157.
The genome assembly provided here has also been deposited at GenBank, along with raw sequence data from the Chicago and Hi-C libraries, under accession QZWM00000000.2 (BioProject accession: PRJNA493001; BioSample accession: SAMN10120167).
Ministerio de Ciencia e Innovación, Award: CGL‐2011‐25866