Data from: Optimizing exome captures in species with large genomes using species-specific repetitive DNA blocker

Kesälahti, Robert1; Kumpula, Timo1; Cervantes, Sandra1; Kujala, Sonja2; Mattila, Tiina1; Tyrmi, Jaakko3; Niskanen, Alina1; Rastas, Pasi4; Savolainen, Outi1; Pyhäjärvi, Tanja4

Published Nov 15, 2024 on Dryad. https://doi.org/10.5061/dryad.qfttdz0rw

Data files

Nov 15, 2024 version files (5.02 GB)

contigs.agp (1.18 MB)
liftover_gff3.awk (1.51 KB)
pinus_tabuliformis_v1.0_masked_reference_genome.fa.gz (5.02 GB)
README.md (2.62 KB)

Abstract

Large and highly repetitive genomes are common. However, research interests usually lie within the non-repetitive parts of the genome, as they are more likely functional and can be used to answer questions related to adaptation, selection, and evolutionary history. Exome capture is a cost-effective method for providing sequencing data from protein-coding parts of the genes. C0t-1 DNA blockers consist of repetitive DNA and are used in exome captures to prevent the hybridization of repetitive DNA sequences to capture baits or bait-bound genomic DNA. Universal blockers target repetitive regions shared by many species, while species-specific c0t-1 DNA is prepared from the DNA of the studied species, thus perfectly matching the repetitive DNA contents of the species. So far, the use of species-specific c0t-1 DNA has been limited to a few model species. Here, we evaluated the performance of blocker treatments in exome captures of Pinus sylvestris, a widely distributed conifer species with a large (> 20 Gbp) and highly repetitive genome. We compared treatment with a commercial universal blocker to treatments with species-specific c0t-1 (30,000 ng and 60,000 ng). Species-specific c0t-1 captured more unique exons than the initial set of targets, leading to increased SNP discovery and reduced sequencing of tandem repeats compared to the universal blocker. Based on our results, we recommend optimizing exome captures by using at least 60,000 ng of species-specific c0t-1 DNA. Species-specific c0t-1 DNA is relatively easy and fast to prepare and can also be used with existing bait set designs.


Description of the data and file structure

The original Pinus tabuliformis reference genome (v1.0; Niu et al., 2022) was masked to improve read mapping to the genome by correcting errors introduced during genome polishing, as many identical sequences were found at the ends of different chromosomes. To construct the masked version of the reference genome, chromosomes were first split back into contigs. Contigs were then aligned within chromosomes and between unplaced contigs using Minimap2 (Li, 2018). The alignments were then chained into longer ones using the ChainPaf module of Lep-Anchor (Rastas, 2020). Half of the aligning regions of > 10 kb were masked: for each alignment, the region in the shorter of the two contigs involved was masked.
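
The masking rule described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Alignment` record, the 0-based end-exclusive coordinates, the use of `N` as the masking character, and the reading of "> 10 kb" as applying to both aligned spans are all assumptions made for illustration.

```python
from typing import Dict, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one chained alignment between two contigs.

    Coordinates are assumed to be 0-based and end-exclusive.
    """
    a_name: str
    a_start: int
    a_end: int
    b_name: str
    b_start: int
    b_end: int


MIN_ALIGNED_LENGTH = 10_000  # the "> 10 kb" threshold from the description


def mask_shorter_contig(contigs: Dict[str, str], aln: Alignment,
                        mask_char: str = "N") -> None:
    """Mask the aligned region in the shorter of the two contigs, in place.

    Alignments at or below the length threshold are ignored.
    The masking character is an assumption, not taken from the dataset.
    """
    span_a = aln.a_end - aln.a_start
    span_b = aln.b_end - aln.b_start
    if min(span_a, span_b) <= MIN_ALIGNED_LENGTH:
        return
    # Choose the contig with the shorter total length.
    if len(contigs[aln.a_name]) <= len(contigs[aln.b_name]):
        name, start, end = aln.a_name, aln.a_start, aln.a_end
    else:
        name, start, end = aln.b_name, aln.b_start, aln.b_end
    seq = contigs[name]
    contigs[name] = seq[:start] + mask_char * (end - start) + seq[end:]
```

Masking only one copy of each duplicated region keeps the other copy available as a mapping target, which is the stated purpose of the masked reference.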

Files and variables

File: liftover_gff3.awk

Description: Lifts over genomic coordinates based on an AGP file. The script transforms genomic coordinates from the contig level to the chromosome level, or backwards. It can be used to convert the P. tabuliformis gene space annotation coordinates (gff3 file; available on https://figshare.com/articles/dataset/Pinus_tabuliformis_gene_space_annotation/16847146/1?file=31149952) to the contig level for the masked reference genome.

File: contigs.agp

Description: File describing the assembly of chromosomes from contigs in AGP format. Required by liftover_gff3.awk to lift over genomic coordinates.
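
To show how an AGP file supports coordinate conversion in both directions, here is a minimal Python sketch. It is not a port of liftover_gff3.awk; the names `Segment`, `parse_agp`, `contig_to_chrom` and `chrom_to_contig` are hypothetical, and the example AGP lines are invented. It assumes the standard tab-separated AGP columns with 1-based inclusive coordinates and treats any orientation other than `-` as forward.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    """One sequence-component line of an AGP file (1-based, inclusive)."""
    chrom: str
    chrom_start: int
    chrom_end: int
    contig: str
    contig_start: int
    contig_end: int
    orientation: str


def parse_agp(lines: Iterable[str]) -> List[Segment]:
    """Collect component lines, skipping comments and gap lines (types N and U)."""
    segments = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if f[4] in ("N", "U"):  # gap lines do not reference a contig
            continue
        segments.append(
            Segment(f[0], int(f[1]), int(f[2]), f[5], int(f[6]), int(f[7]), f[8])
        )
    return segments


def contig_to_chrom(segments: List[Segment], contig: str,
                    pos: int) -> Optional[Tuple[str, int]]:
    """Map a contig position to its chromosome position, or None if not found."""
    for s in segments:
        if s.contig == contig and s.contig_start <= pos <= s.contig_end:
            offset = pos - s.contig_start
            if s.orientation == "-":
                return s.chrom, s.chrom_end - offset
            return s.chrom, s.chrom_start + offset
    return None


def chrom_to_contig(segments: List[Segment], chrom: str,
                    pos: int) -> Optional[Tuple[str, int]]:
    """Map a chromosome position back to its contig position, or None if not found."""
    for s in segments:
        if s.chrom == chrom and s.chrom_start <= pos <= s.chrom_end:
            offset = pos - s.chrom_start
            if s.orientation == "-":
                return s.contig, s.contig_end - offset
            return s.contig, s.contig_start + offset
    return None


# Invented example: two contigs joined by a gap, the second one reversed.
example = [
    "chr1\t1\t100\t1\tW\tctg1\t1\t100\t+",
    "chr1\t101\t200\t2\tN\t100\tscaffold\tyes\tpaired-ends",
    "chr1\t201\t300\t3\tW\tctg2\t1\t100\t-",
]
segments = parse_agp(example)
print(contig_to_chrom(segments, "ctg2", 1))    # ('chr1', 300)
print(chrom_to_contig(segments, "chr1", 250))  # ('ctg2', 51)
```

The linear scan keeps the sketch short; for a file the size of contigs.agp, an index keyed by sequence name would be a natural refinement.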

File: pinus_tabuliformis_v1.0_masked_reference_genome.fa.gz

Description: The masked version of the P. tabuliformis reference genome in FASTA format.

Code/software

Any text editor is able to view the masked FASTA file, AGP file and the awk script.

awk script usage (requires awk to be installed):

#awk [-vinverse=1] -f liftover_gff3.awk contigs.agp chr_pos_file >chr_pos_file.liftover

#liftover genomic coordinates based on an agp file

#if using -vinverse=1, coordinates are mapped backwards

#coordinates not found from the agp are kept untouched

#the first part of the script reads the contigs.agp file used in the coordinate transformations

#and the second part edits your input file

#when using a gff3 file, the start and end position columns are $4 and $5
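
As an illustration of the behaviour described in these comments, the following Python sketch lifts the start and end columns of a single GFF3 line and leaves the line untouched when the coordinates cannot be mapped. It is not the awk script itself: `lift_gff3_line` and `toy_lift` are hypothetical names, and `toy_lift` hard-codes one invented reversed segment (chr1:201-300 corresponding to ctg2:1-100). The sketch swaps start and end when needed so that start <= end, but it does not adjust the strand column; whether the authors' script does so is not stated here.

```python
from typing import Callable, Optional, Tuple

# A lift function maps (sequence name, position) to a new pair, or None.
Lift = Callable[[str, int], Optional[Tuple[str, int]]]


def lift_gff3_line(line: str, lift: Lift) -> str:
    """Lift GFF3 columns 4 (start) and 5 (end); keep unmappable lines as-is."""
    if line.startswith("#") or not line.strip():
        return line  # comments, directives and blank lines pass through
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return line
    start = lift(fields[0], int(fields[3]))
    end = lift(fields[0], int(fields[4]))
    if start is None or end is None or start[0] != end[0]:
        return line  # coordinates not found are kept untouched
    lo, hi = sorted((start[1], end[1]))
    fields[0], fields[3], fields[4] = start[0], str(lo), str(hi)
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline


def toy_lift(seqid: str, pos: int) -> Optional[Tuple[str, int]]:
    """Invented mapping: chr1:201-300 is ctg2:1-100 in reverse orientation."""
    if seqid == "chr1" and 201 <= pos <= 300:
        return "ctg2", 300 - pos + 1
    return None


gene = "chr1\tsrc\tgene\t210\t250\t.\t+\t.\tID=g1"
print(lift_gff3_line(gene, toy_lift))
```

Passing the lift function as an argument keeps the GFF3 handling independent of how the AGP file is parsed, so the same line handler works for either direction of conversion.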

Access information

Data was derived from the following sources:

P. tabuliformis reference genome v1.0 https://www.ncbi.nlm.nih.gov/bioproject/PRJNA784915/