Data from: Improved genome assembly of the whiteleg shrimp Penaeus (Litopenaeus) vannamei using long- and short-read sequences from public databases
Data files
Mar 06, 2024 version files 2.42 MB
- 0.mtDNA_extraction.pl
- README.md
- Supplementary_Fig._1A.png
- Supplementary_Fig._1B.png
- Supplementary_Fig._2.png
- Supplementary_Figure_1.docx
- Supplementary_Figure_2.docx
- Supplementary_Table_1.xlsx
- Supplementary_Table_2.xlsx
- Supplementary_Table_3.xlsx
Abstract
A genome assembly contains the complete DNA sequence of a particular organism. This information is necessary to understand the organism's gene functions and the genetic variability of its populations. In this study, the genome of the Pacific whiteleg shrimp Penaeus (Litopenaeus) vannamei was assembled using public data from GenBank, the DNA sequence repository of the National Institutes of Health of the USA, which is publicly accessible worldwide. The three tables and two figures contain supplementary information for the article JOH-2023-155.R1. The information is relevant for the analysis of the new reference-guided genome assembly of the whiteleg shrimp. Supplementary Table 1 compares observed and expected chromosome sizes. The location of genetic markers in Supplementary Table 2 will be particularly relevant for future genome-wide association studies, which will look for associations between markers and/or genes and traits of interest for aquaculture, such as disease resistance, growth or fecundity. Supplementary Table 3 shows that many markers align to several parts of the genome, which indicates the large number of repeated regions in the shrimp's genome. Supplementary Figure 1 shows the results of genome size estimation based on counting k-mers (substrings of length k contained within a DNA sequence). Supplementary Figure 2 depicts the linear correlation between the observed and expected lengths of the assembled chromosomes. The Supplementary Materials 1 file contains the Perl script used to remove mitochondrial DNA sequences from the raw data; these sequences are not needed for the genome assembly and can potentially interfere with it.
README: Supplementary Tables. Improved genome assembly of the whiteleg shrimp Penaeus (Litopenaeus) vannamei using long- and short-read sequences from public databases
https://doi.org/10.5061/dryad.0k6djhb7n
Description of the data and file structure
The supplementary tables from the article are provided as Excel files.
Supplementary Table 1 compares the observed and expected sizes of the shrimp chromosomes, in nucleotides. The expected size of the chromosomes was estimated by dividing the calculated genome size (Zhang et al. 2019) by the physical size of each chromosome reported by Campos-Ramos (1997). The observed length of each chromosome was calculated by counting the number of nucleotides in the genome assembly contained in the fasta file (NCBI bioproject PRJNA1022566).
Supplementary Table 2 gives the location of each genetic marker (SNPs or microsatellites) relative to the shrimp chromosomes, derived from a Blast alignment. The information includes the ID of the marker, the number of the chromosome where the marker aligned to the genome assembly, the percentages of coverage and identity, the probability of a random alignment, the positions in base pairs of the aligned sequence in both the query and the reference chromosome, the strand on which the marker aligned (+: left to right; -: right to left), and the author who published the marker (some markers have no authorship, only a GenBank accession number). Only markers that aligned at a single position are shown. Data can be filtered by chromosome, position, author, etc.
Supplementary Table 3 contains the IDs of the markers that aligned more than once in the Blast alignment described for Supplementary Table 2. The markers are ordered from those with the most alignments down to those that aligned at two sites.
Supplementary Figure 1 is the output of the analysis with the GenomeScope software, which is available at https://github.com/tbenavi1/genomescope2.0
Supplementary Figure 2 is a linear regression between the observed and expected values in Supplementary Table 1.
Note: the figures are also uploaded as .png files for easier reuse.
Supplementary Materials 1 is the Perl script that removes mtDNA sequences from the raw data files.
Sharing/Access information
Links to other publicly accessible locations of the data:
- DOI: 10.1093/jhered/esae015
- https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1022566
To build the "genetic_markers.fasta" file for the alignment analysis of Supplementary Table 2, the DNA sequences of the markers have to be retrieved from GenBank using the accession numbers given in each author's published article; these articles are listed in the cited literature of the study.
Code/Software
Supplementary Tables 2 and 3. The published marker sequences (contained in a fasta format file “genetic_markers.fasta”) are aligned to the new assembly with the blastn program. To run it, the new assembly in fasta format (“new_reference-guided_genome.fasta”), in this case linked to bioproject PRJNA1022566 in GenBank, is first converted into a Blast database, and blastn is then run against that database with the desired parameters. The commands are:
makeblastdb -in new_reference-guided_genome.fasta -input_type fasta -dbtype nucl
blastn -db new_reference-guided_genome.fasta -query genetic_markers.fasta -out genetic_markers_out.txt -evalue 1e-50 -outfmt "7 qseqid sseqid pident qcovs qcovhsp length qstart qend sstart send"
Supplementary Materials 1. Run the script contained in the file “0.mtDNA_extraction.pl” in Linux. The script takes a one-column file of sequence IDs and removes the entries with those IDs from a fasta file.
Usage:
./0.mtDNA_extraction.pl [name of the fasta file, from which sequences are going to be removed] [name of the file with ids that are going to be removed]
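For readers who want to see the logic of this step without opening the Perl file, the following is a minimal Python sketch of ID-based filtering of a fasta file. It is not the authors' script: the function names `read_ids` and `drop_fasta_records` are hypothetical, and the sketch assumes that a record's ID is the header text between `>` and the first whitespace.

```python
from typing import Iterable, Iterator, Set


def read_ids(lines: Iterable[str]) -> Set[str]:
    """Collect one ID per line, skipping blank lines and any leading '>'."""
    return {line.strip().lstrip(">") for line in lines if line.strip()}


def drop_fasta_records(fasta_lines: Iterable[str],
                       ids_to_remove: Set[str]) -> Iterator[str]:
    """Yield the lines of every fasta record whose ID is not in ids_to_remove."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the record ID is the header up to the first whitespace.
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            keep = record_id not in ids_to_remove
        if keep:
            yield line
```

Both functions accept any iterable of lines, so open file handles can be passed directly and the filtered lines written to a new file as they are produced.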
Methods
Supplementary Table 1. The expected size of the chromosomes was estimated by dividing the calculated genome size (Zhang et al. 2019) by the physical size of each chromosome reported by Campos-Ramos (1997). The observed length of each chromosome was calculated by counting the number of nucleotides in the genome assembly fasta file.
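The nucleotide count per chromosome can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' procedure: the function name `fasta_lengths` is hypothetical, and the sketch assumes one fasta record per chromosome and counts every character on the sequence lines, including N.

```python
from typing import Dict, Iterable, Optional


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return the number of sequence characters in each fasta record."""
    lengths: Dict[str, int] = {}
    current: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the record ID is the header up to the first whitespace.
            fields = line[1:].split()
            current = fields[0] if fields else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths
```

Passing an open handle to the assembly file (for example “new_reference-guided_genome.fasta”) returns a dictionary that maps each record ID to its length.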
Supplementary Table 2. The location of each marker was obtained from a Blast alignment to the new reference-guided assembly. Only markers that aligned once to the genome were retained. The command was:
blastn -db new_reference-guided_genome.fasta -query genetic_markers.fasta -out genetic_markers_out.txt -evalue 1e-50 -outfmt "7 qseqid sseqid pident qcovs qcovhsp length qstart qend sstart send"
Supplementary Table 3. The markers listed in this table are those that aligned more than once in the Blast alignment described for Supplementary Table 2. The markers are ordered from those with the most alignments down to those that aligned at two sites.
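The split between single-hit markers (Supplementary Table 2) and multi-hit markers (Supplementary Table 3) can be sketched in Python from the tabular blastn output. This is an illustration under stated assumptions, not the authors' procedure: the names `parse_hits` and `split_by_hit_count` are hypothetical, every reported line is counted as one alignment, and the strand is inferred from the subject coordinates (in blastn output, sstart is greater than send for minus-strand hits).

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Column order requested with -outfmt "7 ..." in the blastn command.
FIELDS = ["qseqid", "sseqid", "pident", "qcovs", "qcovhsp",
          "length", "qstart", "qend", "sstart", "send"]


def parse_hits(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse tab-separated hit lines, skipping the '#' comment lines of format 7."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        # Assumption: strand is derived from the order of the subject coordinates.
        row["strand"] = "+" if int(row["sstart"]) <= int(row["send"]) else "-"
        hits.append(row)
    return hits


def split_by_hit_count(hits: List[Dict[str, str]]
                       ) -> Tuple[List[Dict[str, str]], List[Tuple[str, int]]]:
    """Return (hits of markers seen once, [(marker, n)] for markers seen more than once)."""
    counts = Counter(hit["qseqid"] for hit in hits)
    single = [hit for hit in hits if counts[hit["qseqid"]] == 1]
    # Most frequently aligned markers first; ties are ordered by marker ID.
    multi = sorted(((marker, n) for marker, n in counts.items() if n > 1),
                   key=lambda item: (-item[1], item[0]))
    return single, multi
```

Reading “genetic_markers_out.txt” with `parse_hits` and passing the result to `split_by_hit_count` gives the two groups. Any additional filtering on identity or coverage would need to be applied before counting.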
Supplementary Figure 1. The figures were built from the output of the GenomeScope software, which requires the raw sequencing reads used for the assembly and an arbitrarily chosen value of k (the k-mer length).
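To illustrate what counting k-mers means here, the sketch below builds a k-mer multiplicity histogram from a handful of reads. GenomeScope fits its model to a histogram of this kind, which in practice is produced by a dedicated k-mer counter rather than by Python. The function name `kmer_histogram` is hypothetical, and the sketch makes two simplifying assumptions: k-mers containing N are skipped, and a k-mer is not merged with its reverse complement.

```python
from collections import Counter
from typing import Dict, Iterable


def kmer_histogram(reads: Iterable[str], k: int) -> Dict[int, int]:
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    counts: Counter = Counter()
    for read in reads:
        read = read.strip().upper()
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            # Assumption: k-mers with ambiguous bases are ignored.
            if "N" not in kmer:
                counts[kmer] += 1
    # Second pass: count how many distinct k-mers share each multiplicity.
    return dict(Counter(counts.values()))
```

For the read "ACGTAC" and k = 2, the k-mers are AC, CG, GT, TA and AC, so three k-mers occur once and one occurs twice.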
Supplementary Figure 2. The figure was made with the observed and expected data in Supplementary Table 1.
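The regression behind this figure can be reproduced with an ordinary least-squares fit. The following is a minimal sketch using only the Python standard library; the function name `linear_fit` is hypothetical, and which variable goes on which axis (expected on x, observed on y, or the reverse) is an assumption to be checked against the figure.

```python
from typing import Sequence, Tuple


def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (slope, intercept, r_squared) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least two points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, r squared equals the squared Pearson correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared
```

The two input sequences would be the expected and observed chromosome sizes from Supplementary Table 1, listed in the same chromosome order.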
Supplementary Materials 1. Perl script that removes mtDNA sequences from the raw data files.