Gasterosteus aculeatus gynogenetic reference genome and functional annotations version 1 and raw PacBio and Illumina data
Data files
Jan 20, 2023 version files 47.12 GB
-
016562_filtered_reads.fastq.gz
-
016563_filtered_reads.fastq.gz
-
016564_filtered_reads.fastq.gz
-
016565_filtered_reads.fastq.gz
-
016566_filtered_reads.fastq.gz
-
016567_filtered_reads.fastq.gz
-
016568_filtered_reads.fastq.gz
-
016569_filtered_reads.fastq.gz
-
016570_filtered_reads.fastq.gz
-
016571_filtered_reads.fastq.gz
-
016572_filtered_reads.fastq.gz
-
016573_filtered_reads.fastq.gz
-
016574_filtered_reads.fastq.gz
-
016575_filtered_reads.fastq.gz
-
016576_filtered_reads.fastq.gz
-
016578_filtered_reads.fastq.gz
-
016579_filtered_reads.fastq.gz
-
016580_filtered_reads.fastq.gz
-
016581_filtered_reads.fastq.gz
-
016582_filtered_reads.fastq.gz
-
016583_filtered_reads.fastq.gz
-
016584_filtered_reads.fastq.gz
-
016585_filtered_reads.fastq.gz
-
016586_filtered_reads.fastq.gz
-
016587_filtered_reads.fastq.gz
-
016588_filtered_reads.fastq.gz
-
016589_filtered_reads.fastq.gz
-
016590_filtered_reads.fastq.gz
-
016591_filtered_reads.fastq.gz
-
016592_filtered_reads.fastq.gz
-
016593_filtered_reads.fastq.gz
-
016594_filtered_reads.fastq.gz
-
016595_filtered_reads.fastq.gz
-
016596_filtered_reads.fastq.gz
-
016597_filtered_reads.fastq.gz
-
016598_filtered_reads.fastq.gz
-
121117_I235_FCD0WEEACXX_L4_SZABPI017450-17_1.fastq.gz
-
121117_I235_FCD0WEEACXX_L4_SZABPI017450-17_2.fastq.gz
-
121117_I235_FCD0WEEACXX_L4_SZAIPI017449-21_1.fastq.gz
-
121117_I235_FCD0WEEACXX_L4_SZAIPI017449-21_2.fastq.gz
-
121117_I235_FCD0WEEACXX_L8_SZAMPI017448-23_1.fastq.gz
-
121117_I235_FCD0WEEACXX_L8_SZAMPI017448-23_2.fastq.gz
-
Gynogen_pchrom_assembly_all.fasta.gz
-
Gynogen_pchrom_assembly_all.gff.gz
-
README.md
Abstract
Whole genome sequencing enables us to ask fundamental questions about the genetic basis of adaptation, population structure, and epigenetic mechanisms, but usually requires a suitable reference genome for making sense of the sequence data. While the availability of reference genomes has significantly improvement in both taxonomic coverage and overall quality, this poses a challenge for researchers in determining which reference genome best suits their data. Here we compare the use of two different reference genomes for the three-spined stickleback (Gasterosteus aculeatus), one novel genome from a European individual and the published reference genome of a North American individual. Specifically, we investigate the impact of using a local reference versus one generated from a differentiated population on several commonly used metrics in population genomics. Through mapping genome resequencing data of 60 sticklebacks from across Europe and North America, we confirmed genome quality is an important factor in choosing a reference genome. A local reference genome did offer increased mapping efficiency and genotyping accuracy, likely stemming from the higher similarity in genome sequence and synteny. Despite comparable distributions of the metrics generated across the genome using SNP data (i.e., π, Tajima’s D, and FST), window-based statistics using different references resulted in different outlier genes and enriched gene functions. In contrast, the marker-based analysis utilising DNA methylation distributions had a considerably higher overlap in outlier genes and functions when using different reference genomes. Overall, our results highlight how using a local reference genome can increase the resolution of genome scans when multiple similar-quality reference genomes are available. Such results have implications in the detection of signatures of selection.
Methods
The induction of the diploid gynogen individual followed the protocol established for three-spined sticklebacks (Samonte-Padilla et al., 2011). In brief, fertilised stickleback eggs were mixed with UV-irradiated sperm (2 minutes of exposure) and then exposed to a heat-shock treatment of 34°C for 4 minutes, 5 minutes post-fertilisation. The treatment caused the genetic inactivation of the sperm, resulting in homozygote maternal offspring that lack paternal alleles (Samonte-Padilla et al., 2011). In order to increase the likelihood of embryo development, two siblings from the same family were used for this process. After 5 months post-hatching, a fish was sacrificed. DNA was extracted using the Qiagen high molecular weight extraction kit following the manufacturer’s protocol. Sequencing was then conducted at the Beijing Genomics Institute (BGI), taking place on a PacBio platform. In total, 2,805,993 reads were generated of which 214,285 were discarded before further analysis given their short length. In addition, Illumina 6 paired-end sequencing libraries in a HiSeq2500 platform were constructed with insert sizes of about 170 base pairs (bp), 500 bp and 800 bp.
Genome assembly was performed using Canu v1.6 assembler (Koren et al., 2017), followed by an internal polishing step using Quiver. To hybrid polish the PacBio assembly, a total of 316,797,342 high-quality Illumina reads were mapped to the contigs using BWA-MEM v0.7.15 (Li, 2013) and the alignment was then used for further polishing using Pilon v1.22 (Walker et al., 2014). Illumina raw reads were trimmed, and low quality and adaptor sequences were removed using Cutadapt v1.13 (Martin, 2011). To evaluate the PacBio de novo contig-level assembly and search for potential misassemblies, we used FRCbam v1.3.0 (Vezzi, Narzisi, & Mishra, 2012), whereas its completeness in terms of core orthologous genes was assessed using BUSCO v3.0.1 (Simão, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015) and the ‘actinopterygii’ data set. The 32.1kb contig tig00001189_pilon mapped to the mitochondrial genome of the North American G. aculeatus (Peichel et al., 2017) and was trimmed and labelled the mitochondrial genome. The sequence was trimmed to only include the 15.8kb that aligned to the North American mitochondrial genome given the size of the mitochondrial genomes is moderately conserved even across long phylogenetic distances (Gissi, Iannelli, & Pesole, 2008). The excess 16.3kb of contig tig00001189_pilon were labelled and kept in the assembly, making up two additional contigs.
To scaffold the European G. aculeatus contig-level gynogen assembly into pseudo-chromosomes we used Chromosemble from the Satsuma2 package (Grabherr et al., 2010). Here, we ordered and oriented the contigs based on synteny with the North American G. aculeatus chromosome-level genome (Peichel et al., 2017), excluding the unmapped scaffolds. We tested for the effect of contig size on alignment rates, and only retained contigs with an alignment rate of 70% or above for further assembly (supplementary methods 1). Gynogen contigs not scaffolded onto a pseudo-chromosome were concatenated in size order into an unmapped scaffold for population genomic analyses, separating each contig by 1,000 N’s. Synteny between the European and North American G. aculeatus assemblies was calculated using the SatsumaSynteny2 function (Grabherr et al., 2010), and plotted using the Circos v0.69 visualisation tool (Krzywinski et al., 2009). It should be noted that the reference genomes have been created using distinctly different methodologies. As such, significant reference origin by population origin interactions that do not clearly exhibit a reciprocal effect in the test statistic likely include a genome quality effect.
Repetitive sequences were identified de novo in the European pseudo-chromosome assembly using Repeat Modeler v.2.0.1 while Repeat Masker v.4.1.1 (Smit, Hubley, & Green, 2015) was used to mask the genome using the three-spined stickleback and zebrafish libraries in two separately rounds. The results of each round were then analyzed together, complex repeats were separated, to produce the final repeat annotation. Genome annotation was performed on the European repeat-masked pseudo-chromosome genome assembly using MAKER2 v.2.31.9 (Holt & Yandell, 2011). Subsequent genome annotation was performed following a two-round approach. For the first round, the repeat annotation data (release 95) as well as G. aculeatus transcriptome and protein sequences from ENSEMBL and UniProt/SwissProt databases were used as evidence sets for the prediction of gene models, while est2genome and protein2genome setting were set as 1. For the second round, SNAP (Korf, 2004), with ADE of 0.25 and length of 50, and AUGUSTUS v.3.2.3 with default values (Stanke, Diekhans, Baertsch, & Haussler, 2008) were trained on the gene model predicted from the first round. Functional annotation was performed using BlastP against UniProt proteins with an E-value threshold of 1e-5, and InterProScan v.5.4-47 (P. Jones et al., 2014) was used for domain annotation. The resulting gene models were filtered to retain those with AED value of 0.5 or less, having PFAM annotations and significant hits to known proteins against UniProt DB (E-value 1e-5).