Skip to main content

Copy Number variants in Mallotus villosus

Cite this dataset

Cayuela, Hugo; Bernatchez, Louis (2021). Copy Number variants in Mallotus villosus [Dataset]. Dryad.


Taking advantage of recent developments allowing CNV analysis from RAD-seq data (10.1111/mec.15565), we investigated how variation in fitness-related traits, local environmental conditions and demographic history are associated with CNVs, and how subsequent copy number variation drives population genetic structure in a marine fish, the capelin (Mallotus villosus). We collected 1536 DNA samples from 35 sampling sites in the north Atlantic Ocean and identified 6620 putative CNVs. Raw sequencing data for GBS libraries have been published in a previous study (Cayuela et al. 2020,  10.1111/mec.15499) and are available under accession no. PRJNA631144. Here, we provide the files of normalized read depth for the 6620 putative CNVs and environmental data used in our CNV analyses.


Sampling area, phenotypic analyses

We sampled 1536 capelins from 35 spawning sites in the Northwest Atlantic, in Canadian and Greenland waters (Fig.1 and Supplementary material S1, Table S1). Three sites were sampled in the ARC lineage, four sites in GRE lineage, and 28 sites in the NWA lineage. Within the NWA lineage, sampling sites included 18 beach spawning sites, six demersal spawning sites, and four offshore sites. In the whole dataset, the median sample size was 46.5 (range: 19 to 50) individuals per site. The fish were collected and immediately frozen, and a piece of fin was preserved in RNAlater. The gonadosomatic index was measured at the laboratory on 843 individuals (302 females and 541 males) from the 18 beach spawning sites of the NWA lineage. Because male gonads were very small resulting in large measurement error of the gonadosomatic index (data not shown), we focused our analyses on females only.

DNA sequencing, genotyping, and discovery of putative CNVs

Both DNA extractions and library preparations were performed following protocols fully described in Cayuela et al. (2020). Libraries were size-selected using a BluePippin prep (Sage Science), amplified by PCR and sequenced on the Ion Proton P1v2 chip (single-end sequencing). Eighty-two individuals were sequenced per chip.Barcodes were removed using cutadapt (Martin 2011) and trimmed to 80 bp, allowing for an error rate of 0.2. They were then demultiplexed using the ‘process_radtags’ module of Stacks v1.48 (Catchen et al. 2013) and aligned to the capelin draft genome (Cayuela et al. 2020) assembly using bwa-mem (Li 2013) with default parameters. Next, aligned reads were processed with Stacks v.1.48 for SNP calling and genotyping. The ‘pstacks’ module was used with a minimum depth of 3 and up to 3 mismatches were allowed in the catalog creation. We then ran the ‘populations’ module to produce a vcf file that was further filtered using python ( and bash scripts. SNPs were kept if they displayed a read depth greater than 4 and less than 70 (Dorant et al. 2020). Then, we kept SNPs present in at least 70 % of individuals in each sampling location.

We identified putative CNVs using the HDplot approach proposed by McKinney et al. (2017). The duplicated loci detected by this method can be variant or invariant in copy number among the 35 spawning sites of our study area. A duplicated locus is considered as a CNV only if a correlation between the normalized read depth of a marker and a phenotypic trait or an environmental variable was detected. For this reason, we systematically use the term of “putative CNVs” when we refer to the markers identified using the HDplot method. We used a refined version of this approach (Dorant et al. 2020) that discriminates “singleton” SNPs (i.e., non-duplicated) from “duplicated” SNPs (i.e., combining duplicated SNPs and SNPs with a high coverage) using 5 parameters: (1) proportion of heterozygotes (PropHet), (2) Fis, (3) median of allele ratio for heterozygotes (MedRatio), (4) median of SNP read depth for heterozygotes (MedDepthHet) and (5) median of SNP read depth for homozygotes (MedDepthHom). The parameters were calculated using a custom python script (available at parsing the filtered VCF file. The five parameters were plotted pairwise to visualize their distribution across all loci (see Supplementary material, Figures S1 and S2). Based on the graphical demonstration by McKinney et al (2017) and the approach proposed by Dorant et al. (2020), we considered different combinations of parameters and graphically fixed the cut-off of the four categories of SNPs (singleton SNPs, duplicated SNPs, high coverage SNPs, and low confidence). We then kept one single marker per locus (the one with the highest minor allele frequency) and extracted the read depth of duplicated loci to construct the dataset of putative CNVs using vcftools. For more simplicity, the term “putative CNV” is used to define all loci classified as “duplicated” and “high coverage”. Following the procedure described in Dorant et al. (2020), read counts of putative CNV loci were normalized to account for differences in sequencing effort across all samples. Normalization was performed using the Trimmed mean of M-values method originally described for RNA-seq count normalization and implemented in the R package edgeR (Robinson & Oshlack 2010). The correction accounts for the fact that for an individual with a higher copy number at a given locus, that locus will contribute proportionally more to the sequencing library than it will for an individual with lower copy number at that locus. This procedure was applied to the whole dataset (i.e., the 1538 capelins from 35 spawning sites) to detect the putative CNVs present throughout the study area.

Usage notes

Three files are available:

The normalized read depth matrix : gdepth_normalized_CNVs.txt

The file with environmental (temperature, cholorophyll concentration, and salinity for the beach spawning sites within the NWA lineage: environment_data.txt

The file with GSI data for both sexes: GSI_data.txt