Data from: De novo genome assemblies of four rainbow trout genetic lines reveal structural variants in pursuit of a pangenome reference
Data files
Jun 17, 2026 version files 102.32 MB
- AR_merged_filtered_WR_is_Ref.vcf.gz (17.44 MB)
- KC_merged_filtered_WR_is_Ref.vcf.gz (12.67 MB)
- KC_merged_filtered.vcf.gz (16.93 MB)
- README.md (3.60 KB)
- SW_merged_filtered_WR_is_Ref.vcf.gz (18.77 MB)
- SW_merged_filtered.vcf.gz (19.02 MB)
- WR_merged_filtered.vcf.gz (17.49 MB)
Abstract
Rainbow trout (Oncorhynchus mykiss) exhibit extensive genomic diversity shaped by domestication, life history, and geographic origin. To advance the development of a comprehensive pangenome reference, we present new de novo genome assemblies of two genetically and ecologically distinct lines: Whale Rock (WR; wild, landlocked, Central California) and Keithley Creek (KC; wild, resident, interior Columbia Basin), along with previously published assemblies of Arlee (domesticated, Northern California) and Swanson (semi-domesticated, resident, Alaska). All assemblies provide nearly complete coverage of known genes (BUSCO 95.8–99.7 %) and are similar in genome size (~2.3 Gb), with scaffold N50 values between 3.4 Mb (KC) and 52.4 Mb (Swanson). Comparative whole-genome alignments reveal high sequence conservation (97–98 % identity) among assemblies, but also evidence of extensive structural variation of at least 50 bp in length. Structural variant (SV) profiling identified tens of thousands of deletions, insertions, and complex rearrangements, largely in noncoding sequences. In an initial assessment of the utility of having multiple de novo genome assemblies for rainbow trout, we found that two strains (Arlee and Swanson; domesticated) share SVs enriched in genes linked with growth, reproduction, and domestication, such as GTP binding and ECM-receptor interaction. In comparison, the other two strains (WR and KC; wild origin) share SVs associated with reproductive timing, such as the GnRH signaling pathway. Both Arlee and WR also have unique SVs potentially related to their geographic origin and unique life history. This dataset contains six VCF files with information on all the SVs detected in pairwise alignments using either the Arlee or WR genome assembly as the reference.
Dataset DOI: 10.5061/dryad.1rn8pk17m
Description of the data and file structure
Pairwise comparative whole-genome alignments of de novo genome assemblies from four rainbow trout genetic lines revealed extensive structural variation of at least 50 bp in length. Structural variant (SV) profiling identified tens of thousands of deletions, insertions, and complex rearrangements, largely in noncoding sequences. To identify SVs (≥50 bp in length) among rainbow trout strains, we used MUMmer (v3.23) and SyRI (v1.6.3) for whole-genome alignment and variant calling. NUCmer (NUCleotide MUMmer) version 3.1 was applied for pairwise alignments of genomes using the --maxmatch option with the following settings: minimum cluster length (-c) of 500 bp, break length (-b) of 500 bp, and minimum match length (-l) of 100 bp. For each line aligned to the Arlee or WR genome assembly as the reference, the chromosome-level VCF files generated by SyRI were compressed, indexed, and then concatenated with bcftools concat (v1.3.1). The six VCF files with filtered SVs for each line, based on alignment to Arlee or WR as the reference, were deposited in this database.
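The compress, index, and concatenate step described above can be sketched as a small command builder. This is only an illustration under stated assumptions: the description names bcftools concat but not the compression or indexing tools, so `bgzip` and `tabix` are assumed here, and the per-chromosome file names and the merged output name are hypothetical. The function only assembles command lines; it does not run them.

```python
import shlex
from pathlib import Path


def concat_commands(chrom_vcfs, merged_out):
    """Build commands to compress, index, and concatenate per-chromosome VCFs.

    Assumptions (not stated in the dataset description): bgzip is used for
    compression and tabix for indexing. Only `bcftools concat` is named there.
    """
    commands = []
    compressed = []
    for vcf in chrom_vcfs:
        gz = str(Path(vcf)) + ".gz"
        commands.append(["bgzip", "-f", str(vcf)])      # writes <vcf>.gz
        commands.append(["tabix", "-p", "vcf", gz])     # writes <vcf>.gz.tbi
        compressed.append(gz)
    # -O z requests bgzip-compressed VCF output; inputs are given in chromosome order.
    commands.append(["bcftools", "concat", "-O", "z", "-o", str(merged_out), *compressed])
    return commands


# Hypothetical file names, for illustration only.
cmds = concat_commands(["chr1.vcf", "chr2.vcf"], "KC_merged.vcf.gz")
for cmd in cmds:
    print(shlex.join(cmd))
```

Passing the inputs to `bcftools concat` in chromosome order keeps the merged file in the same order as the reference assembly.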
Files and variables
File: AR_merged_filtered_WR_is_Ref.vcf.gz
Description: Filtered SVs detected from alignment of the Arlee line genome to the WR reference genome.
File: KC_merged_filtered.vcf.gz
Description: Filtered SVs detected from alignment of the KC line genome to the Arlee reference genome.
File: SW_merged_filtered.vcf.gz
Description: Filtered SVs detected from alignment of the Swanson line genome to the Arlee reference genome.
File: SW_merged_filtered_WR_is_Ref.vcf.gz
Description: Filtered SVs detected from alignment of the Swanson line genome to the WR reference genome.
File: WR_merged_filtered.vcf.gz
Description: Filtered SVs detected from alignment of the WR line genome to the Arlee reference genome.
File: KC_merged_filtered_WR_is_Ref.vcf.gz
Description: Filtered SVs detected from alignment of the KC line genome to the WR reference genome.
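The file names above follow a simple pattern: a two-letter line code, `_merged_filtered`, and an optional `_WR_is_Ref` suffix when WR rather than Arlee is the reference. A minimal sketch of decoding that pattern is given below; the function name `describe_vcf` is hypothetical, and the mapping of codes to line names is taken from the file descriptions and abstract.

```python
# Two-letter codes used in the deposited file names.
LINE_NAMES = {
    "AR": "Arlee",
    "KC": "Keithley Creek",
    "SW": "Swanson",
    "WR": "Whale Rock",
}


def describe_vcf(filename):
    """Return (query line, reference line) for one of the deposited VCF names."""
    if not filename.endswith(".vcf.gz"):
        raise ValueError(f"not a compressed VCF name: {filename}")
    stem = filename[: -len(".vcf.gz")]

    # Files without the suffix were aligned to the Arlee reference.
    reference = "Arlee"
    if stem.endswith("_WR_is_Ref"):
        reference = "Whale Rock"
        stem = stem[: -len("_WR_is_Ref")]

    code, _, rest = stem.partition("_")
    if rest != "merged_filtered" or code not in LINE_NAMES:
        raise ValueError(f"unexpected file name pattern: {filename}")
    return LINE_NAMES[code], reference


for name in ["AR_merged_filtered_WR_is_Ref.vcf.gz", "KC_merged_filtered.vcf.gz"]:
    query, ref = describe_vcf(name)
    print(f"{name}: {query} aligned to {ref}")
```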
Code/software
Variant Call Format (VCF) is a text file format for representing genomic variation. VCF files are typically stored compressed, which facilitates data transfer and collaboration. To view or use VCF files, you need bioinformatics tools such as VCFtools, BCFtools, or SCI-VCF, which offer command-line functionality for analysis and manipulation. For visualization and graphical interfaces, software such as IGV (Integrative Genomics Viewer), VIVA, and the R package vcfR is suitable for summarizing, comparing, and inspecting variant data.
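For a quick look without installing any of the tools above, a compressed VCF can also be read with the Python standard library. The sketch below assumes only the standard tab-separated VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and makes no assumption about the INFO keys written by SyRI. The demo records are made up purely to exercise the code and are not taken from the dataset.

```python
import gzip
import tempfile
from collections import Counter
from pathlib import Path


def iter_vcf_records(path):
    """Yield one dict per data line of a gzip- or bgzip-compressed VCF."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip meta-information, header, and blank lines
            fields = line.rstrip("\n").split("\t")
            yield {
                "chrom": fields[0],
                "pos": int(fields[1]),
                "id": fields[2],
                "ref": fields[3],
                "alt": fields[4],
                "info": fields[7] if len(fields) > 7 else "",
            }


def count_per_chromosome(path):
    """Count variant records per CHROM value."""
    return Counter(rec["chrom"] for rec in iter_vcf_records(path))


# Demo on a tiny made-up file (illustrative records only).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.vcf.gz"
    with gzip.open(demo, "wt") as fh:
        fh.write("##fileformat=VCFv4.3\n")
        fh.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        fh.write("chr1\t100\tvar1\tN\t<DEL>\t.\tPASS\tEND=200\n")
        fh.write("chr1\t900\tvar2\tN\t<INS>\t.\tPASS\tEND=900\n")
        fh.write("chr2\t50\tvar3\tN\t<INV>\t.\tPASS\tEND=400\n")
    demo_counts = count_per_chromosome(demo)

print(dict(demo_counts))
```

To inspect a deposited file, call `count_per_chromosome("KC_merged_filtered.vcf.gz")` on a downloaded copy.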
Access information
Data was derived from the following sources:
- USDA_OmykA_1.1: GCA_013265735.3; https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_013265735.2/
- Omyk_2.0: GCA_025558465.1; https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_025558465.1/
- USDA_OmykWR_1.0: GCA_029834435.1; https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_029834435.1/
- USDA_OmykKC_1.0: GCA_034753235.1; https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_034753235.1/
To identify SVs (≥50 bp in length) among rainbow trout strains, we used MUMmer (v3.23) and SyRI (v1.6.3) for whole-genome alignment and variant calling. NUCmer (NUCleotide MUMmer) version 3.1 was applied for pairwise alignments of genomes using the --maxmatch option with the following settings: minimum cluster length (-c) of 500 bp, break length (-b) of 500 bp, and minimum match length (-l) of 100 bp. Settings were selected based on prior studies, and all values are reported in DNA base pairs (bp). Arlee was used as the primary reference genome in this study, except for three cases where the WR genome served as the reference to identify wild-specific syntenic blocks, WR unaligned/unique regions, and Arlee-unique SVs. For downstream analysis using DNAdiff (version 1.3), delta files were filtered using delta-filter with a minimum identity of 90% (-i 90) and a minimum alignment length of 100 bp (-l 100). The filtered delta files were used as input to dnadiff to obtain alignment statistics and genomic differences. For calling SVs, delta files were subsequently filtered with the -m option to retain 1-to-1 alignments, and coordinates were retrieved using show-coords with the -THrd options. SyRI was then run with the filtered delta and coordinate files, together with the reference and query FASTA sequences, for the detection and annotation of SVs. All SVs included in downstream analyses were ≥50 bp in length. SyRI commands were executed with at least two computational threads (--nc 2) and with logging configured for debugging (--log DEBUG). All SyRI analyses were conducted within an Anaconda environment for reproducibility and dependency management.
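The alignment and SV-calling steps above can be laid out as an ordered list of commands. The sketch below is a reconstruction from the description, not the authors' actual script: the flags are those quoted in the text, while the FASTA names, the output prefix, the intermediate file names, and the use of `-p`, `-d`, `-c`, `-r`, and `-q` to pass files are assumptions. Nothing is executed; each step is a command plus an optional file that would receive its standard output.

```python
import shlex


def sv_calling_steps(ref_fasta, qry_fasta, prefix):
    """Return (command, stdout_file_or_None) pairs for one pairwise comparison."""
    delta = f"{prefix}.delta"                 # written by nucmer -p <prefix>
    filtered = f"{prefix}.i90_l100.delta"     # identity/length filtered
    m_filtered = f"{prefix}.i90_l100.m.delta" # additionally filtered with -m
    coords = f"{prefix}.i90_l100.m.coords"
    return [
        # Pairwise alignment with the settings quoted in the description.
        (["nucmer", "--maxmatch", "-c", "500", "-b", "500", "-l", "100",
          "-p", prefix, ref_fasta, qry_fasta], None),
        # Keep alignments with >=90% identity and >=100 bp length.
        (["delta-filter", "-i", "90", "-l", "100", delta], filtered),
        # Alignment statistics and genomic differences.
        (["dnadiff", "-d", filtered, "-p", f"{prefix}.dnadiff"], None),
        # Further filtering with -m before SV calling, as described.
        (["delta-filter", "-m", filtered], m_filtered),
        # Tab-delimited coordinates for SyRI.
        (["show-coords", "-THrd", m_filtered], coords),
        # SV detection with two cores and debug logging.
        (["syri", "-c", coords, "-d", m_filtered, "-r", ref_fasta, "-q", qry_fasta,
          "--nc", "2", "--log", "DEBUG"], None),
    ]


# Hypothetical input names, for illustration only.
steps = sv_calling_steps("Arlee.fa", "KC.fa", "KC_vs_Arlee")
for cmd, out in steps:
    print(shlex.join(cmd) + (f" > {out}" if out else ""))
```

Swapping the first two arguments for the WR assembly and a different query would give the WR-as-reference comparisons.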
For each line aligned to Arlee or WR as the reference, the chromosome-level VCF files generated by SyRI were compressed, indexed, and then concatenated with bcftools concat (v1.3.1). To functionally annotate SVs, we used an in-house Python pipeline that annotates multiple VCF files in a single run using SnpEff v5.2. The script scanned a specified directory for indexed VCF files. For each input VCF, the pipeline produced an annotated VCF as well as complementary HTML and CSV summary reports. Annotations were performed using a custom SnpEff database derived from the O. mykiss Arlee reference genome assembly (GCF_013265735.2) and the corresponding GTF annotation file. SnpEff annotated the SVs as MODIFIER (non-coding/untranslated regions or RNA genes) or as functional with LOW (synonymous or other minor changes), MODERATE (changes in coding sequence), or HIGH (severe changes including stop/start gains and feature truncations) effects. To identify SVs with potential functional consequences, we used SnpSift (version 5.2) to filter high-impact insertions, deletions, inversions, and duplications across the three query genomes. The six VCF files with filtered SVs for each line, based on alignment to Arlee or WR as the reference, were deposited in this database.
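The in-house annotation pipeline itself is not part of this dataset. As a rough illustration of the steps described above, the sketch below finds VCF files that have an index next to them, builds one SnpEff command per file with HTML and CSV reports, and checks an INFO string for a HIGH-impact annotation. The function names, the `snpEff` wrapper command, the database name `Omyk_Arlee_custom`, the output naming, and the `.tbi`/`.csi` index extensions are all assumptions; the impact check relies on the standard `ANN` layout, in which the third pipe-separated field is the impact.

```python
import tempfile
from pathlib import Path


def find_indexed_vcfs(directory):
    """Return *.vcf.gz files in `directory` that have a .tbi or .csi index."""
    found = []
    for vcf in sorted(Path(directory).glob("*.vcf.gz")):
        if Path(str(vcf) + ".tbi").exists() or Path(str(vcf) + ".csi").exists():
            found.append(vcf)
    return found


def snpeff_command(vcf, database):
    """Return (command, annotated_vcf_path); SnpEff writes the VCF to stdout."""
    vcf = Path(vcf)
    stem = vcf.name[: -len(".vcf.gz")]
    cmd = [
        "snpEff", "ann",
        "-stats", str(vcf.parent / f"{stem}.html"),    # HTML summary report
        "-csvStats", str(vcf.parent / f"{stem}.csv"),  # CSV summary report
        database, str(vcf),
    ]
    return cmd, vcf.parent / f"{stem}.ann.vcf"


def has_high_impact(info):
    """True if any ANN entry in a VCF INFO string carries HIGH impact."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            for annotation in entry[len("ANN="):].split(","):
                parts = annotation.split("|")
                if len(parts) > 2 and parts[2] == "HIGH":
                    return True
    return False


# Demo with empty placeholder files; only the first has an index beside it.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["KC_merged_filtered.vcf.gz", "KC_merged_filtered.vcf.gz.tbi",
                 "SW_merged_filtered.vcf.gz"]:
        (Path(tmp) / name).touch()
    demo_found = [p.name for p in find_indexed_vcfs(tmp)]
    demo_cmd, demo_out = snpeff_command(Path(tmp) / demo_found[0], "Omyk_Arlee_custom")

print(demo_found)
```

Requiring an index before annotation mirrors the description, where only indexed VCF files in the directory are picked up.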
