Data from: Refinement of the Antarctic fur seal (Arctocephalus gazella) reference genome increases continuity and completeness
Data files
Jul 17, 2024 version files 23.14 GB
- arcGaz4_h1_annotation.gff3.gz
- arcGaz4_h1_annotation.gtf.gz
- busco_gerp_fst.tsv.gz
- busco.tar.gz
- de_novo_genomes.tar.gz
- fst.tar.gz
- gerp.tar.gz
- go_terms.tar.gz
- md5.txt
- pinniped_set.hal
- README.md
- repo_content.txt
- rerooted.tree
- slim_zalcal_v1_on_arcgaz_anc_h1.psl.gz
- win_gerp_fst.tsv.gz
Abstract
The Antarctic fur seal (Arctocephalus gazella) is an important top predator and indicator of the health of the Southern Ocean ecosystem. Although abundant, this species narrowly escaped extinction due to historical sealing and is currently declining as a consequence of climate change. Genomic tools are essential for understanding these anthropogenic impacts and for predicting long-term viability. However, the current reference genome (“arcGaz3”) shows considerable room for improvement in terms of both completeness and contiguity. We therefore combined PacBio sequencing, haplotype-aware HiRise assembly and scaffolding based on Hi-C information to generate a refined assembly of the Antarctic fur seal reference genome (“arcGaz4_h1”). The new assembly is 2.53 Gb long, has a scaffold N50 of 55.6 Mb and includes 18 chromosome-sized scaffolds, which correspond to the 18 chromosomes expected in otariids. Genome completeness is greatly improved, with 23,408 annotated genes and a Benchmarking Universal Single-Copy Orthologs (BUSCO) score raised from 84.7% to 95.2%. We furthermore included the new genome in a reference-free alignment of the genomes of eleven pinniped species to characterize evolutionary conservation across the Pinnipedia using genome-wide genomic evolutionary rate profiling (GERP). We then implemented gene ontology (GO) enrichment analyses to identify biological processes associated with those genes showing the highest levels of either conservation or differentiation between the two major pinniped families, Otariidae and Phocidae. We show that processes linked to neuronal development, the circulatory system and osmoregulation are overrepresented in both conserved and differentiated regions of the genome.
README: Data from "Refinement of the Antarctic fur seal (Arctocephalus gazella) reference genome increases continuity and completeness"
https://doi.org/10.5061/dryad.g1jwstqzn
Overview:
This repository contains the data used in the study "Refinement of the Antarctic fur seal (Arctocephalus gazella) reference genome increases continuity and completeness" by Hench et al., which introduces the Antarctic fur seal reference genome version arcGaz4_h1.
The data set includes:
- the two initial de novo haplotype assemblies by Dovetail Genomics (de_novo_genomes.tar.gz)
- the genome annotation of the final assembly for the first haplotype in two formats ("arcGaz4_h1", arcGaz4_h1_annotation.gff3.gz and arcGaz4_h1_annotation.gtf.gz)
- a multi-species whole genome alignment of eleven pinniped species (pinniped_set.hal)
- the inferred neutral tree for the eleven aligned species (rerooted.tree)
- a one-on-one whole genome alignment of the Antarctic fur seal and the California sea lion (slim_zalcal_v1_on_arcgaz_anc_h1.psl.gz)
- BUSCO scoring results for the genome arcGaz4_h1 (busco.tar.gz)
- evolutionary conservation scores estimated using gerp++ (gerp.tar.gz)
- genetic differentiation between the aligned otariid and phocid genomes (fst.tar.gz)
- description of the gene ontology information linked to the BUSCO genes as available at the time of analysis (go_terms.tar.gz)
- a summary of the conservation and differentiation scores within 50kb sliding windows with 25mb increments along arcGaz4_h1 (win_gerp_fst.tsv.gz)
- a summary of the conservation and differentiation scores per complete BUSCO gene within arcGaz4_h1 (busco_gerp_fst.tsv.gz)
- an inventory of all files, when unzipping the zipped folders (repo_content.txt)
- check-sums for all data files (md5.txt)
- this README file (readme.md)
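As a quick sanity check on the annotation, the feature types in the GFF3 file can be tallied without unpacking it. The following is a minimal Python sketch, not part of the original analysis. It relies only on the general GFF3 convention of nine tab-separated columns with the feature type in column 3; the function name `count_feature_types` is made up for illustration, and which feature types actually occur in arcGaz4_h1_annotation.gff3.gz has not been checked here.

```python
import gzip
import os
from collections import Counter


def count_feature_types(gff3_path):
    """Count GFF3 records per feature type (third column)."""
    # Read gzip-compressed files in place; fall back to plain text otherwise.
    opener = gzip.open if str(gff3_path).endswith(".gz") else open
    counts = Counter()
    with opener(gff3_path, "rt") as handle:
        for line in handle:
            # Skip directives, comments and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            # Ignore anything that is not a regular 9-column record.
            if len(fields) < 9:
                continue
            counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    path = "arcGaz4_h1_annotation.gff3.gz"
    if os.path.exists(path):
        for feature_type, n in count_feature_types(path).most_common():
            print(feature_type, n)
```

The same function should also work on an already un-compressed copy of the file, since the opener is chosen from the file suffix.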
[ dryad 21.56 GiB ]
├─ pinniped_set.hal │ ███████████████████████████████████│ 93% 20.07 GiB
├─ de_novo_genomes.tar.gz │ █│ 6% 1.40 GiB
├─ gerp.tar.gz │ │ 0% 27.47 MiB
├─ slim_zalcal_v1_on_arcg │ │ 0% 23.91 MiB
├─ fst.tar.gz │ │ 0% 17.13 MiB
├─ arcGaz4_h1_annotation. │ │ 0% 5.16 MiB
├─ win_gerp_fst.tsv.gz │ │ 0% 4.55 MiB
├─ arcGaz4_h1_annotation. │ │ 0% 3.76 MiB
├─ busco.tar.gz │ │ 0% 1.26 MiB
├─ busco_gerp_fst.tsv.gz │ │ 0% 774.47 KiB
├─ go_terms.tar.gz │ │ 0% 384.27 KiB
├─ repo_content.txt │ │ 0% 8.33 KiB
├─ readme.md │ │ 0% 4.81 KiB
├─ md5.txt │ │ 0% 705 B
└─ rerooted.tree │ │ 0% 309 B
The integrity of the files can be checked using the check-sums provided in the file md5.txt with the following unix command:
md5sum -c md5.txt
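On systems without `md5sum`, the same check can be approximated in Python. The sketch below assumes that md5.txt follows the usual `md5sum` layout (one `<hash>  <filename>` pair per line) and that the data files sit next to it; the function names are illustrative only.

```python
import hashlib
import os
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(md5_file):
    """Return {filename: True | False | None}; None marks a missing file."""
    md5_file = Path(md5_file)
    results = {}
    for line in md5_file.read_text().splitlines():
        if not line.strip():
            continue
        # Assumed layout: "<hash>  <filename>" (a leading "*" marks binary mode).
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")
        target = md5_file.parent / name
        if not target.exists():
            results[name] = None
        else:
            results[name] = md5_of_file(target) == expected.lower()
    return results


if __name__ == "__main__":
    if os.path.exists("md5.txt"):
        for name, ok in verify_checksums("md5.txt").items():
            status = {True: "OK", False: "FAILED", None: "MISSING"}[ok]
            print(f"{name}: {status}")
```

Reading in chunks keeps memory use low, which matters for the roughly 20 GiB alignment file.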
The folders can be un-compressed using the unix command tar, for example:
tar -xf gerp.tar.gz
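To inspect an archive before unpacking it, or to unpack it from a script, Python's standard `tarfile` module can be used. This is only a sketch: the helper names `list_archive` and `extract_archive` are hypothetical, and the contents of the archives are documented in repo_content.txt rather than here.

```python
import os
import tarfile


def list_archive(archive_path):
    """Return (name, size in bytes) for every regular file in the archive."""
    # "r:*" lets tarfile detect the compression (gzip here) automatically.
    with tarfile.open(archive_path, "r:*") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


def extract_archive(archive_path, destination="."):
    """Unpack the archive into the destination directory."""
    with tarfile.open(archive_path, "r:*") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters;
            # "data" rejects unsafe members such as absolute paths.
            archive.extractall(destination, filter="data")
        else:
            archive.extractall(destination)


if __name__ == "__main__":
    if os.path.exists("gerp.tar.gz"):
        for name, size in list_archive("gerp.tar.gz"):
            print(f"{size:>12}  {name}")
        extract_archive("gerp.tar.gz")
```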
File types:
After unpacking, most files are plain text files regardless of their suffix (e.g. .tsv, .psl) and can be read by any text editor like any other .txt file.
The (non-txt) suffix was kept for consistency between the analysis and the repository.
An exception to this is the multi-species alignment pinniped_set.hal, which is in the "Hierarchical Alignment Format" (hal).
See Hickey et al. (2013) for a detailed description of the hal format:
Hickey, G. et al. (2013). HAL: a hierarchical format for storing and analyzing multiple genome alignments. Bioinformatics, 29(10), 1341–1342. https://doi.org/10.1093/bioinformatics/btt128
Files that are gzip-compressed (indicated by an additional .gz suffix) can be un-compressed using the unix command gunzip, for example:
gunzip busco_gerp_fst.tsv.gz
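Gzip-compressed tables can also be read in place, without creating an un-compressed copy. The Python sketch below previews the first rows of a .tsv.gz file; it assumes tab-separated plain text and makes no assumption about column names or the presence of a header line, since neither is described in this README. The function name `preview_tsv` is illustrative.

```python
import csv
import gzip
import os
from itertools import islice


def preview_tsv(path, n_rows=5):
    """Return the first n_rows of a gzip-compressed, tab-separated file."""
    # Text mode ("rt") decompresses and decodes on the fly.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return list(islice(reader, n_rows))


if __name__ == "__main__":
    path = "busco_gerp_fst.tsv.gz"
    if os.path.exists(path):
        for row in preview_tsv(path):
            print(row)
```

Because only the requested rows are read, this stays fast even for the larger tables.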
Relevant software:
For reading the files:
- GNU gzip
- GNU tar
Software originally used within the analysis is documented using conda environments and apptainer containers.
For details, please refer to the accompanying code repository at Zenodo (see below).
Sharing/Access information
In addition to this data repository, there exist two additional repositories linked to the same study:
- Analysis code is deposited at Zenodo at the repository 10.5281/zenodo.10979149
- The final genome assemblies are deposited at NCBI, under the accession numbers (PRJNA1099198, haplotype 1) and (PRJNA1099197, haplotype 2)