Exome capture of Antarctic krill (Euphausia superba) for cost effective genotyping and population genetics with historical collections
Data files
May 26, 2025 version files 11.03 MB
- gene2phylo_config.yaml: 651 B
- README.md: 6.77 KB
- sankey_plots.zip: 10.17 MB
- skim2mito_config.yaml: 1.15 KB
- skim2mito_gene.fasta: 121.83 KB
- skim2mito_samples.csv: 12.10 KB
- skim2mito_seed.fasta: 167.90 KB
- skim2mito_shao_et_al.zip: 403.22 KB
- skim2rrna_config.yaml: 992 B
- skim2rrna_gene.fasta: 10.25 KB
- skim2rrna_samples.csv: 12.10 KB
- skim2rrna_seed.fasta: 10.68 KB
- skim2rrna_shao_et_al.zip: 121.02 KB
Abstract
Antarctic krill (Euphausia superba Dana) is a keystone species in the Southern Ocean ecosystem, with ecological and commercial significance. However, its vulnerability to climate change makes it urgent to investigate its potential to adapt to future environmental conditions. Historical museum collections of krill from the early 20th century represent an ideal opportunity to investigate how krill have changed over time due to predation, fishing, and climate change. However, there is currently no cost-effective method for implementing population-scale collection genomics for krill, given its genome size (48 Gbp). Here, we assessed the utility of two inexpensive methods for population genetics using historical krill samples, specifically low-coverage shotgun sequencing (i.e., “genome-skimming”) and exome capture. Two full-length transcriptomes were generated and used to identify 166 putative gene targets for exome capture bait design. A total of 20 historical krill samples were sequenced using both shotgun sequencing and exome capture. Mitochondrial and nuclear ribosomal sequences were assembled from both low-coverage shotgun data and off-target exome capture data, demonstrating that endogenous DNA sequences can be assembled from historical collections. Although mitochondrial and ribosomal sequences are variable across individuals from different populations, phylogenetic analysis does not identify any population structure. We find that exome capture provides approximately 4,500-fold enrichment of targeted genes, suggesting this approach can generate the sequencing depth required to call a significant number of variants. Unlocking historical collections for genomic analyses using exome capture will provide valuable insights into past and present biodiversity, resilience, and adaptability of krill populations to climate change.
https://doi.org/10.5061/dryad.v6wwpzh4p
Description of the data and file structure
This dataset includes the configuration files and reference data used to analyse genome skimming data from krill museum collections. It also includes mitochondrial and ribosomal sequences assembled from previously published raw sequence data. Finally, it includes Sankey plots that visualise where sequence data were retained or lost at each step of the pipeline.
Files and variables
File: gene2phylo_config.yaml
Description: Configuration file for the gene2phylo pipeline used in this study. This YAML file contains all parameter settings and input specifications required to reproduce the phylogenetic analyses presented in the paper. Key configuration options include input directory paths, sequence alignment parameters (realignment settings, missing data thresholds, trimming methods), outgroup specification, and output plot dimensions.
File: skim2mito_config.yaml
Description: Configuration file for the skim2mito pipeline used in this study. This YAML file specifies all parameters and settings required to extract, assemble, and analyse mitochondrial genomes from genome skimming data. The configuration includes sample sheet specifications, sequencing adapter sequences, read processing options (deduplication settings), mitochondrial genome assembly parameters (GetOrganelle reference database selection), gene annotation settings (MITOS reference database and genetic code), sequence alignment and trimming parameters, outgroup designation for phylogenetic analysis, and output plot dimensions.
File: skim2mito_samples.csv
Description: Sample sheet for the skim2mito pipeline containing metadata and file paths for all sequencing samples analysed in this study. The CSV file includes six columns: sample ID, forward read file path, reverse read file path, NCBI taxonomy ID (taxid), seed sequence reference, and target gene specification. Each row represents one paired-end sequencing sample with corresponding FASTQ file locations in the data directory. The taxonomy ID (6819) indicates the target taxonomic group for mitochondrial genome assembly. This sample sheet enables batch processing of multiple genome skimming datasets through the skim2mito pipeline.
Variables
- ID: Sample ID
- forward: Path to forward reads
- reverse: Path to reverse reads
- taxid: NCBI taxon ID
- seed: Path to GetOrganelle seed reference
- gene: Path to GetOrganelle gene reference
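The six variables listed above can be checked with a short script before running the pipeline. The following is a minimal sketch, assuming the CSV has a header row whose column names match the variable names exactly (ID, forward, reverse, taxid, seed, gene). The function name read_sample_sheet and the example row (sample name and file paths) are hypothetical and are not taken from the dataset.

```python
import csv
import io

# Column names as listed under "Variables"; assumed to match the CSV header.
EXPECTED_COLUMNS = ["ID", "forward", "reverse", "taxid", "seed", "gene"]


def read_sample_sheet(handle):
    """Read a sample sheet and check that the six documented columns exist."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [column for column in EXPECTED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


# Hypothetical one-row example; the sample name and paths are made up.
example = io.StringIO(
    "ID,forward,reverse,taxid,seed,gene\n"
    "sample_A,data/sample_A_1.fastq.gz,data/sample_A_2.fastq.gz,"
    "6819,skim2mito_seed.fasta,skim2mito_gene.fasta\n"
)
rows = read_sample_sheet(example)
print(len(rows), rows[0]["taxid"])
```

To check the real file, the same function could be called on an open file handle, for example with open("skim2mito_samples.csv", newline="") as handle.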
File: skim2mito_gene.fasta
Description: Reference gene dataset used for mitochondrial genome assembly and annotation with GetOrganelle. This FASTA file was generated using go_fetch.py (https://github.com/o-william-white/go_fetch), which downloads sequences of related taxa from NCBI.
File: skim2mito_seed.fasta
Description: Reference seed dataset used for mitochondrial genome assembly and annotation with GetOrganelle. This FASTA file was generated using go_fetch.py (https://github.com/o-william-white/go_fetch), which downloads sequences of related taxa from NCBI.
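A quick way to inspect a reference FASTA file such as the seed or gene dataset is to list its records and their lengths. This is a minimal sketch using only the Python standard library. The function name fasta_lengths and the example records are hypothetical, and the parser assumes plain, uncompressed FASTA with unique header lines.

```python
import io


def fasta_lengths(handle):
    """Return a dict mapping each FASTA header line to its sequence length."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # full header text, without the ">"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)  # sequences may span several lines
    return lengths


# Made-up records, for illustration only.
example = io.StringIO(">record_1 example\nACGTAC\nGT\n>record_2\nGGCC\n")
lengths = fasta_lengths(example)
print(len(lengths), lengths)
```

For a downloaded file, the same function could be applied to an open handle, for example with open("skim2mito_seed.fasta") as handle.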
File: skim2rrna_config.yaml
Description: Configuration file for the skim2rrna pipeline used in this study. This YAML file specifies all parameters and settings required to extract, assemble, and analyse ribosomal sequences from genome skimming data. The configuration includes sample sheet specifications, sequencing adapter sequences, read processing options (deduplication settings), gene annotation settings (barrnap kingdom), sequence alignment and trimming parameters, outgroup designation for phylogenetic analysis, and output plot dimensions.
File: skim2rrna_samples.csv
Description: Sample sheet for the skim2rrna pipeline containing metadata and file paths for all sequencing samples analysed in this study. The CSV file includes six columns: sample ID, forward read file path, reverse read file path, NCBI taxonomy ID (taxid), seed sequence reference, and target gene specification. Each row represents one paired-end sequencing sample with corresponding FASTQ file locations in the data directory. The taxonomy ID (6819) indicates the target taxonomic group for ribosomal gene assembly. This sample sheet enables batch processing of multiple genome skimming datasets through the skim2rrna pipeline.
Variables
- ID: Sample ID
- forward: Path to forward reads
- reverse: Path to reverse reads
- taxid: NCBI taxon ID
- seed: Path to GetOrganelle seed reference
- gene: Path to GetOrganelle gene reference
File: skim2rrna_gene.fasta
Description: Reference gene dataset used for ribosomal gene assembly and annotation with GetOrganelle. This FASTA file was generated using go_fetch.py (https://github.com/o-william-white/go_fetch), which downloads sequences of related taxa from NCBI.
File: skim2rrna_seed.fasta
Description: Reference seed dataset used for ribosomal gene assembly and annotation with GetOrganelle. This FASTA file was generated using go_fetch.py (https://github.com/o-william-white/go_fetch), which downloads sequences of related taxa from NCBI.
File: skim2rrna_shao_et_al.zip
Description: Zipped file containing ribosomal sequences assembled from the raw sequence data of Shao et al. (2023), https://doi.org/10.1016/j.cell.2023.02.005. The assembled sequences are in FASTA format, and samples are named following the sample names used by Shao et al. (2023). In total there are 78 FASTA files.
File: skim2mito_shao_et_al.zip
Description: Zipped file containing mitochondrial sequences assembled from the raw sequence data of Shao et al. (2023), https://doi.org/10.1016/j.cell.2023.02.005. The assembled sequences are in FASTA format, and samples are named following the sample names used by Shao et al. (2023). In total there are 78 FASTA files.
File: sankey_plots.zip
Description: Zipped file containing Sankey plots that visualise where sequence data were retained or lost at each step of the pipeline. The sample names are the same as those presented in our study. Each Sankey plot can be viewed by opening its HTML file, and the metadata used to build each plot are in the folder with the same sample name. In total there are 40 HTML files.
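The file counts stated for the three zip archives (78 FASTA files in each of the two Shao et al. archives, 40 HTML files in the Sankey archive) can be checked after download without unpacking them. The sketch below is one possible approach using the Python standard library. The function name count_members is hypothetical, and the file extensions (.fasta, .fa, .html) are assumptions about how the archive members are named.

```python
import io
import zipfile


def count_members(zip_source, suffixes):
    """Count files in a zip archive whose names end with one of `suffixes`."""
    with zipfile.ZipFile(zip_source) as archive:
        return sum(
            1
            for name in archive.namelist()
            if not name.endswith("/")          # skip directory entries
            and "__MACOSX/" not in name        # skip macOS metadata entries
            and name.lower().endswith(suffixes)
        )


# Self-contained demonstration on a small in-memory archive (made-up names).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo/sample_1.fasta", ">seq\nACGT\n")
    archive.writestr("demo/sample_2.fasta", ">seq\nACGT\n")
    archive.writestr("demo/sample_1.html", "<html></html>\n")
buffer.seek(0)
demo_fasta_count = count_members(buffer, (".fasta", ".fa"))
print(demo_fasta_count)
```

If the assumed extensions hold, count_members("skim2mito_shao_et_al.zip", (".fasta", ".fa")) and count_members("skim2rrna_shao_et_al.zip", (".fasta", ".fa")) would each be expected to return 78, and count_members("sankey_plots.zip", (".html",)) would be expected to return 40.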
