Copy number variation (CNV) makes a major contribution to overall genetic variation and is suspected to play an important role in adaptation. However, aside from a few model species, the extent of CNV in natural populations has seldom been investigated. Here, we report on CNV in the pea aphid Acyrthosiphon pisum, a powerful system for studying the genetic architecture of host plant adaptation and speciation thanks to multiple host races forming a continuum of genetic divergence. Recent studies have highlighted the potential importance of chemosensory genes, including the gustatory and olfactory receptor gene families (Grs and Ors, respectively), in the process of host race formation. We used targeted re-sequencing to achieve a very high depth of coverage, and thereby revealed the extent of CNV of 434 genes, including 150 chemosensory genes, in 104 individuals distributed across eight host races of the pea aphid. We found that CNV was widespread in our global sample, with a significantly higher occurrence in multigene families, especially in Ors, and a decrease in the probability of complete gene duplication or deletion (CDD) with increase in coding sequence length. Genes with CDD variants were usually more polymorphic for copy number, especially in the P450 gene family where toxin resistance may be related to gene dosage. We found that Grs were over-represented among genes discriminating host races, as were CDD genes and pseudogenes. Our observations shed new light on CNV dynamics and are consistent with CNV playing a role in both local adaptation and speciation.
01 - Output files from Picard "CalculateHsMetrics"
Raw coverage estimates per subtarget (also called "baits") for each individual, computed with the Picard tool "CalculateHsMetrics". Two files are available for each individual: 1) a "*.HsMetrics" file containing the global HsMetrics information (see http://picard.sourceforge.net/picard-metric-definitions.shtml#HsMetrics); 2) a "*.HsMetrics.Targets" file containing the coverage information per subtarget.
RawCoverageCount_PicardCalculateHsMetrics.rar
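For readers who want to inspect these files outside the R pipeline, the following is a minimal Python sketch of how a Picard-style metrics report can be read. It assumes the usual layout of such reports: "#" comment lines, then one tab-separated header row followed by data rows, with any later commented section (such as a histogram) ignored. The function name and the column names in the toy example are illustrative assumptions, not taken from the deposited files.

```python
import csv


def parse_picard_metrics(text):
    """Return the first tab-separated table of a Picard-style report as a list of dicts."""
    table = []
    for line in text.splitlines():
        is_comment = line.startswith("#")
        is_blank = not line.strip()
        if is_comment or is_blank:
            if table:
                # The first table has ended (blank line or a new commented section).
                break
            continue
        table.append(line)
    # csv.DictReader accepts any iterable of lines; the first line is the header.
    return list(csv.DictReader(table, delimiter="\t"))


# Toy input with made-up column names; real files have more columns and rows.
example = (
    "## METRICS CLASS\tillustrative\n"
    "name\tmean_coverage\n"
    "bait_1\t812.4\n"
    "bait_2\t640.0\n"
)
rows = parse_picard_metrics(example)
mean_cov = {r["name"]: float(r["mean_coverage"]) for r in rows}
print(mean_cov)
```

The same function can be applied to a per-subtarget file by passing the file content as a string; check the actual header row of the file before relying on any column name.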
02 - R matrix of normalized read count data
Matrix of normalized read count data in R format. A few formatting steps were performed to produce it (see the main MBE paper and the script "./01_GetData/script.R" in the pipeline): 1) normalisation of the read counts; 2) removal of targets with more than 5% of reads with a PHRED score < 10. Note that these are the data BEFORE the square-root and polynomial transformations.
NormalizedCoverageData.Rdata
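The quality filter described above (targets with more than 5% of reads below PHRED 10 are removed) can be sketched as follows. This is an illustrative Python version, not the authors' R code; the function name, the input dictionaries, and the decision to drop targets without reads are assumptions made for the example. The normalisation step is not shown because its details are given in the paper rather than here.

```python
def keep_high_quality_targets(low_quality_reads, total_reads, max_fraction=0.05):
    """Return the targets whose fraction of low-quality reads does not exceed max_fraction.

    low_quality_reads: dict target -> number of reads with PHRED score < 10
    total_reads:       dict target -> total number of reads
    """
    kept = []
    for target, total in total_reads.items():
        if total == 0:
            # Assumption: a target without reads cannot be assessed and is dropped.
            continue
        fraction = low_quality_reads.get(target, 0) / total
        if fraction <= max_fraction:
            kept.append(target)
    return kept


# Made-up counts for three hypothetical targets.
total = {"target_A": 1000, "target_B": 1000, "target_C": 0}
low_q = {"target_A": 20, "target_B": 100}
print(keep_high_quality_targets(low_q, total))
```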
03 - Square-root and polynomially transformed read count data
R list resulting from the script "./02_PreProcessing/script.R" in the pipeline. This list contains several sets of data and information, including the matrix of square-root and polynomially transformed read count data (i.e. the data used in the estimation of CNV).
02_PreProcessing.Rdata
04 - Matrix of raw alpha values
Matrix of raw alpha values (i.e. CNV estimates) obtained from the function "findOptimalSegmentations" (package "optimalCaptureSegmentation"). The values are not rounded here.
RawAlphaMatrix_OptimalSegmentation.R
05 - Data for and results of GLMM1
R file containing the data frame and GLMM results of the first GLMM of the paper. Response variable: CNV presence/absence - a gene was considered to show CNV in a race if at least one of its subtargets presented a CN variant (i.e. CN≠1X) in at least one individual of this race.
14_Binom_Polym_Div-Rel.Rdata
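The presence/absence rule used as the GLMM1 response can be expressed compactly. The Python sketch below is only an illustration of that rule; the function name and the nested-dictionary layout (individual, then subtarget, then rounded copy number for one gene within one race) are assumptions, and the deposited R file stores the data differently.

```python
def gene_shows_cnv(copy_numbers, reference=1.0):
    """True if any subtarget of the gene departs from the reference copy number
    (1X) in at least one individual of the race.

    copy_numbers: dict individual -> dict subtarget -> rounded copy number
    """
    return any(
        cn != reference
        for per_subtarget in copy_numbers.values()
        for cn in per_subtarget.values()
    )


# Made-up values for one gene with two subtargets in one race.
race = {
    "ind_1": {"sub_1": 1.0, "sub_2": 1.0},
    "ind_2": {"sub_1": 1.0, "sub_2": 1.5},
}
print(gene_shows_cnv(race))
```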
06 - Data for and results of GLMM2
R file containing the data frame and GLMM results of the second GLMM of the paper. Response variable: complete duplication or deletion (CDD) vs partial duplication/deletion - a gene was considered CDD if all of its subtargets showed CN variants in at least one individual in the race.
15_Binom_Dup-CpDup_Div-Rel.Rdata
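A sketch of the CDD rule in Python follows. It reads the definition above as "at least one individual of the race carries a copy number variant at every subtarget of the gene"; this reading, the function name, and the data layout are assumptions for illustration, so refer to the paper and the pipeline scripts for the authoritative definition.

```python
def is_cdd(copy_numbers, reference=1.0):
    """True if at least one individual shows a copy number different from the
    reference (1X) at all subtargets of the gene.

    copy_numbers: dict individual -> dict subtarget -> rounded copy number
    """
    return any(
        bool(per_subtarget)
        and all(cn != reference for cn in per_subtarget.values())
        for per_subtarget in copy_numbers.values()
    )


# Made-up values: ind_2 has the whole gene duplicated, ind_1 only part of it.
race = {
    "ind_1": {"sub_1": 1.0, "sub_2": 1.5},
    "ind_2": {"sub_1": 1.5, "sub_2": 1.5},
}
print(is_cdd(race))
```

Under this reading, a gene whose variants are spread over different individuals, with no single individual variant at every subtarget, counts as a partial duplication or deletion.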
07 - Data for and results of GLMM3
R file containing the data frame and GLMM results of the third GLMM of the paper. Response variable: CNV frequency - the proportion of individuals in a race with CN≠1X.
16_Binom_fqcy_Div-Rel.Rdata
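The frequency response of GLMM3 can be illustrated in the same way. In this Python sketch an individual is counted as a carrier when at least one subtarget of the gene has CN≠1X; that choice, the function name, and the data layout are assumptions made for the example, not a description of the authors' R code.

```python
def cnv_frequency(copy_numbers, reference=1.0):
    """Proportion of individuals in a race carrying a copy number different
    from the reference (1X) at one or more subtargets of the gene.

    copy_numbers: dict individual -> dict subtarget -> rounded copy number
    """
    if not copy_numbers:
        raise ValueError("at least one individual is required")
    carriers = sum(
        1
        for per_subtarget in copy_numbers.values()
        if any(cn != reference for cn in per_subtarget.values())
    )
    return carriers / len(copy_numbers)


# Made-up values: one carrier among four individuals.
race = {
    "ind_1": {"sub_1": 1.0},
    "ind_2": {"sub_1": 2.0},
    "ind_3": {"sub_1": 1.0},
    "ind_4": {"sub_1": 1.0},
}
print(cnv_frequency(race))
```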
09 - NJ tree of CNV per capture pools and sequencing lanes
NJ tree based on CNV data (data used: square-root and polynomially transformed read counts, alpha rounded to the closest half unit). Capture pools and sequencing lanes are shown. The distance matrix was computed using the usual Euclidean distance (as opposed to the distance matrix obtained from the Random Forest analysis).
Res_CNV-NJ_PoolsAndLanes.pdf
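The two numerical steps mentioned above, rounding alpha to the closest half unit and computing Euclidean distances between individuals, can be sketched in Python as follows. The function names and the toy profiles are hypothetical; in particular, the handling of exact ties (rounded upward here) is an assumption, and the neighbour-joining step itself is not shown.

```python
import math


def round_to_half(alpha):
    """Round a raw alpha value to the closest half unit (ties go upward; assumption)."""
    return math.floor(alpha * 2 + 0.5) / 2


def euclidean(u, v):
    """Euclidean distance between two equally long copy number profiles."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(profiles):
    """Pairwise Euclidean distances, keyed by (individual, individual)."""
    names = sorted(profiles)
    return {(a, b): euclidean(profiles[a], profiles[b]) for a in names for b in names}


# Made-up raw alpha values for two individuals over two subtargets.
raw = {"ind_1": [1.04, 1.3], "ind_2": [0.97, 0.6]}
rounded = {name: [round_to_half(x) for x in values] for name, values in raw.items()}
dist = distance_matrix(rounded)
print(rounded)
print(dist[("ind_1", "ind_2")])
```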
08 - RandomForest results and gene family importance
R file containing all the information concerning the Random Forest (RF) analysis and the estimation of gene family importance in discriminating host races. Although this file has been released for transparency, be aware that it contains many objects and is quite difficult to explore. Please refer to the script "./08_GeneFamImptceTest/script.R" or contact the main authors for more information.
08_GeneFamImptceTest.Rdata
10 - ExtendedFigure1_CNValongCHR_AllTargetsIncluded
Extended Figure 1. CNV along chromosomes for all targeted loci. Each line represents an individual (one colour per race plus Medicago standard in purple). For clarity, values of alpha represented here are those before rounding to the closest half unit (the red box represents the area in which alpha values were rounded to one). Vertical light grey shaded areas represent targets as originally designed whereas bottom dark grey and gold boxes represent subtargets excluded and retained for final analyses, respectively. Retained subtargets are linked by full lines when from the same target and by dotted lines where not. Gene names (and scaffold numbers) are indicated above each plot. Control genes are shown with their alias names. The “P” in gene names stands for pseudogene. The “*” symbol indicates genes partially represented due to the absence of targets upstream or downstream filtered out during cleaning steps.
Link to FigS2_CNV_along_CHR.pdf