A genome assembly contains the complete DNA sequence of a particular organism. This information is necessary to understand the organism's gene functions and the genetic variability of its populations. In this study, the genome of the Pacific whiteleg shrimp Penaeus (Litopenaeus) vannamei was assembled using data from GenBank, the DNA sequence repository of the National Institutes of Health of the USA, which is publicly accessible worldwide.

The three tables and two figures contain supplementary information for the article JOH-2023-155.R1. The information is relevant for the analysis of the new reference-guided genome assembly of the whiteleg shrimp. Supplementary Table 1 compares observed to expected chromosome sizes. The location of genetic markers in Supplementary Table 2 will be particularly relevant for future genome-wide association studies, which will look for associations between markers and/or genes and traits of interest for aquaculture, such as disease resistance, growth or fecundity. Supplementary Table 3 shows that many markers align to several parts of the genome, indicating the large number of repeated regions in the shrimp's genome. Supplementary Figure 1 shows the results of genome size estimation based on counting k-mers (substrings of length k contained within a DNA sequence). Supplementary Figure 2 depicts the linear correlation between the observed and expected lengths of the assembled chromosomes. The Supplementary Materials 1 file contains the Perl script used to remove mitochondrial DNA sequences from the raw data; these sequences are not needed for the genome assembly and can interfere with it.

Supplementary Table 1. The expected size of the chromosomes was estimated by dividing the calculated genome size (Zhang et al. 2019) by the physical size of each chromosome reported by Campos (1997). The observed length of each chromosome was calculated by counting the number of nucleotides in the genome assembly FASTA file.
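
As an illustration of the nucleotide count described for Supplementary Table 1, the following is a minimal Python sketch that sums the sequence length of each record in a FASTA file. It is not the procedure used in the study; the function name fasta_lengths and the toy records chr1 and chr2 are hypothetical.

```python
import io


def fasta_lengths(handle):
    """Return {sequence_name: nucleotide_count} for a FASTA text stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence line: add its characters to the current record.
            lengths[name] += len(line)
    return lengths


# Toy two-record FASTA (hypothetical names and sequences, not real data).
example = io.StringIO(">chr1 toy\nACGTAC\nGT\n>chr2 toy\nACG\n")
lengths = fasta_lengths(example)
print(lengths)  # {'chr1': 8, 'chr2': 3}
```

For a real assembly the same function could be given an open file handle, for example open("new_reference-guided_genome.fasta").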

Supplementary Table 2. The location of each marker was obtained by a BLAST (blastn) alignment to the new reference-guided assembly. Retained markers were those that aligned only once to the genome. The command was:

blastn -db new_reference-guided_genome.fasta -query genetic_markers.fasta -out genetic_markers_out.txt -evalue 1e-50 -outfmt "7 qseqid sseqid pident qcovs qcovhsp length qstart qend sstart send"

Supplementary Table 3. The markers listed in this table are those that aligned more than once in the BLAST alignment described in Supplementary Table 2. The markers are ordered from those with the most alignment sites down to those that aligned to two sites.
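
The split between markers that aligned once (Supplementary Table 2) and markers that aligned more than once (Supplementary Table 3) can be sketched in Python as below. This is only an illustration, assuming that every non-comment line of the tabular output (-outfmt 7) counts as one alignment and that the first column is qseqid, as in the command above. The function names and the toy marker names m1, m2 and m3 are hypothetical, and the toy rows are not real results.

```python
import io
from collections import Counter


def count_hits(handle):
    """Count alignment lines per query in BLAST tabular output (-outfmt 7)."""
    counts = Counter()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # format 7 writes its comments on lines starting with '#'
        qseqid = line.split("\t")[0]  # first requested column is qseqid
        counts[qseqid] += 1
    return counts


def split_markers(counts):
    """Return (markers with one hit, [(marker, hits)] sorted by hits, descending)."""
    unique = sorted(m for m, n in counts.items() if n == 1)
    repeated = sorted(
        ((m, n) for m, n in counts.items() if n > 1),
        key=lambda item: (-item[1], item[0]),
    )
    return unique, repeated


# Toy output with made-up rows: m1 has 1 hit, m2 has 2 hits, m3 has 3 hits.
row = "{q}\t{s}\t100.0\t100\t100\t50\t1\t50\t10\t59\n"
toy = "# BLASTN\n" + "".join(
    row.format(q=q, s=s)
    for q, s in [("m1", "chr1"), ("m2", "chr1"), ("m2", "chr3"),
                 ("m3", "chr2"), ("m3", "chr4"), ("m3", "chr5")]
)
unique, repeated = split_markers(count_hits(io.StringIO(toy)))
print(unique)    # ['m1']
print(repeated)  # [('m3', 3), ('m2', 2)]
```

Markers with no alignment at all produce no data line in this output format, so they do not appear in either list.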

Supplementary Figure 1. The figure was built from the output of the GenomeScope software, which requires the raw sequencing reads used for the assembly and an arbitrary value of k (for the k-mers).
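
To make the idea of counting k-mers concrete, the following Python sketch counts every substring of length k in a set of reads and then tabulates how many distinct k-mers occur once, twice, and so on. This is a toy illustration rather than the GenomeScope workflow itself: the reads are made up, the function names are hypothetical, and, unlike dedicated k-mer counters, the sketch does not merge a k-mer with its reverse complement.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_histogram(counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return dict(sorted(Counter(counts.values()).items()))


# Toy reads (hypothetical); real input would be the raw sequencing reads.
reads = ["ACGTAC", "CGTACGA"]
counts = kmer_counts(reads, k=3)
histogram = kmer_histogram(counts)
print(histogram)  # {1: 1, 2: 4}
```

A histogram of this kind, computed over all reads, is the sort of summary from which genome size can be estimated.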

Supplementary Figure 2. The figure was made with the observed and expected data in Supplementary Table 1.
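
A linear correlation of this kind can be computed as sketched below in Python, using the Pearson coefficient and an ordinary least-squares line. The numbers are made-up placeholders, not values from Supplementary Table 1, and the function names are hypothetical.

```python
import math


def _sums(xs, ys):
    """Return the means and the centered sums of squares and cross-products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, sxy, sxx, syy


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    _, _, sxy, sxx, syy = _sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    mean_x, mean_y, sxy, sxx, _ = _sums(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Placeholder values only (arbitrary units); substitute the table columns.
expected = [10.0, 20.0, 30.0, 40.0]
observed = [12.0, 22.0, 32.0, 42.0]
r = pearson_r(expected, observed)
slope, intercept = least_squares(expected, observed)
print(r, slope, intercept)
```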

Supplementary Materials 1. Perl script containing the instructions to remove mtDNA sequences from the raw data files.
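
The Perl script itself is provided in the Supplementary Materials 1 file and is not reproduced here. Purely as an illustration of the general idea, and written in Python rather than Perl, the sketch below drops reads from a FASTQ stream when their identifiers are in a given set. It rests on assumptions that this README does not state: that the raw data are four-line FASTQ records and that the identifiers of the mitochondrial reads have already been determined by a separate step. The function name drop_reads and the read names r1 to r3 are hypothetical.

```python
import io


def drop_reads(fastq_in, fastq_out, ids_to_drop):
    """Copy four-line FASTQ records, skipping those whose ID is in ids_to_drop.

    Returns the number of records written.
    """
    kept = 0
    while True:
        header = fastq_in.readline()
        if not header.strip():
            break  # end of input (or a trailing blank line)
        seq = fastq_in.readline()
        plus = fastq_in.readline()
        qual = fastq_in.readline()
        # The read ID is the header without '@', up to the first whitespace.
        read_id = header[1:].split()[0]
        if read_id in ids_to_drop:
            continue  # assumed mitochondrial read: do not write it
        fastq_out.writelines([header, seq, plus, qual])
        kept += 1
    return kept


# Toy FASTQ with three made-up reads; r2 plays the role of an mtDNA read.
src = io.StringIO(
    "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n"
)
out = io.StringIO()
kept = drop_reads(src, out, {"r2"})
print(kept)  # 2
```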

Data from: Improved genome assembly of the whiteleg shrimp Penaeus (Litopenaeus) vannamei using long- and short-read sequences from public databases
