# Standing genetic variation in laboratory populations of insecticide-susceptible Phlebotomus papatasi and Lutzomyia longipalpis (Diptera: Psychodidae: Phlebotominae) for the evolution of resistance

## Cite this dataset

Hudson, Spencer; Denlinger, David; Bernhardt, Scott; Gompert, Zachariah (2021). Standing genetic variation in laboratory populations of insecticide-susceptible Phlebotomus papatasi and Lutzomyia longipalpis (Diptera: Psychodidae: Phlebotominae) for the evolution of resistance [Dataset]. Dryad. https://doi.org/10.5061/dryad.9cnp5hqh3

## Abstract

Insecticides can exert strong selection on insect pest species, including those that vector diseases, and have led to rapid evolution of resistance. Despite such rapid evolution, relatively little is known about standing genetic variation for resistance in insecticide-susceptible populations of many species. To help fill this knowledge gap, we generated genotyping-by-sequencing data from insecticide-susceptible Phlebotomus papatasi and Lutzomyia longipalpis sand flies that survived or died from a sub-diagnostic exposure to either permethrin or malathion using a modified version of the Centers for Disease Control and Prevention bottle bioassay. Multi-locus genome-wide association mapping methods were used to quantify standing genetic variation for insecticide resistance in these populations and to identify specific alleles associated with insecticide survival. For each insecticide treatment, we estimated the proportion of the variation in survival explained by the genetic data (i.e. 'chip' heritability) and the number and contribution of individual loci with measurable effects. For all treatments, survival to an insecticide exposure was heritable with a polygenic architecture. Both P. papatasi and L. longipalpis had alleles for survival that resided within many genes throughout their genomes. The implications for resistance conferred by many alleles, as well as inferences made about the utility of laboratory insecticide resistance association studies compared to field observations, are discussed in the manuscript that accompanies this data.

## Methods

Custom perl scripts were used to first demultiplex pooled DNA sequences, wherein identifier barcodes served to assign DNA sequences to individual sand flies. Reference genomes were obtained for *L*.* longipalpis* (*Lutzomyia longipalpis jacobina* = LlonJ11; 154.2 Mbp) and *P*.* papatasi* (*Phlebotomus papatasi Israel* = PpapI1; 363.8 Mbp) from the center for invertebrate vectors of human pathogens (VectorBase). We used the “aln” algorithm from “bwa” (version 0.7.5) to align the DNA sequences to the *L*.* longipalpis* or *P*.* papatasi* reference genome. We allowed for a maximum of four nucleotide differences, no more than two mismatches in a 20 bp seed, and a quality threshold for read trimming set to 10. Along with these parameters, only reads with a single best match were aligned. A small number of sand flies with few aligned sequences were removed before subsequent analyses.

Using “samtools” and “bcftools” (version 0.1.19), sequence alignments were sorted and indexed for variant calling. Treatment groups of each sand fly species were combined for this process. As recommended for Illumina HiSeq data, coefficients to cap mapping quality and the number of reads per position were set to 50. Bases with a quality score below 15 and reads with a mapping quality below 10 were ignored. The prior for *θ* was set to 0.001, and only single nucleotide variants (SNVs) where the posterior probability of an invariant nucleotide was below 0.01 were retained. Each SNV set was filtered based on a 128‐read minimum for overall coverage, a four‐read minimum for the non‐reference allele, a minimum mapping quality of 30, and a maximum of 20% of individuals with missing data.

We estimated allele frequencies for each species and insecticide treatment. Maximum likelihood allele frequency estimates were obtained using an expectation‐maximization algorithm that accounts for uncertainty in genotypes. Genotype estimates are required for association mapping. Thus, we next used a Bayesian approach to estimate genotypes for each SNP and individual. Our empirical Bayesian approach uses the allele frequency estimates to define prior probabilities for genotypes. Posterior probabilities were then obtained according to Bayes rule. We then obtained point estimates (posterior means) of genotypes that take on values between 0 and 2 (copies of the non‐reference allele) but that are not constrained to be integer valued). Pairwise linkage disequilibrium (LD) was calculated in each species from our genotype estimates using the “geno‐r2” function “vcftools” (version 0.1.15). Specifically, we measured LD as the squared correlation between genotypes at pairs of SNPs and computed LD for all pairs of SNPs in 100 kb windows.

Binary Bayesian sparse linear mixed models (BSLMMs) were fit with “gemma” (version 0.98) to estimate genetic contributions to variation in insecticide survival and to identify SNVs with this phenotypic variation. Here, survival outcomes were modeled as a function of a polygenic term (denoted *u*) and a vector of the potential measurable SNV effects (denoted β). A Markov chain Monte Carlo (MCMC) algorithm with variable selection was used to infer posterior inclusion probabilities (PIPs) for SNVs with a non‐zero measurable effect on insecticide susceptibility, and model~~ ~~-average point estimates (MAPEs) were derived from those PIPs. The polygenic term in each BSLMM represents expected deviations from a phenotypic mean based on all SNVs while accounting for phenotypic covariance that arise between sand flies due to relatedness or genetic similarity (i.e., observed kinship). Relatedness was also considered when estimating individual SNV effects (β) and their PIPs with kinship matrices. Parameters for estimating genetic architecture were derived from the hierarchical structure of the BSLMM. Altogether, the parameters indicate the proportion of the phenotypic variance explained (PVE) by additive genetic effects (based on β and the polygenic term), the proportion of PVE explained by measurable‐effect SNVs (PGE) or those implicated by LD (β alone), and the number of SNVs with effects that explain phenotypic variance (*n*‐γ).

Thirty independent MCMC chains were run for binary BSLMMs, wherein a probit link function was used to connect the binary response (survival outcome) to a latent quantitative risk variable. MCMC chains included 100,000 burn‐in steps, 1 million sampling steps, and a thinning interval of 10. We assessed convergence to the posterior distribution by calculating the Gelman–Rubin potential scale reduction diagnostic for PVE, PGE and *n*‐γ in R with the “CODA” package (version 0.19.3); values of this statistic for were generally less than 1.1 consistent with convergence. To reduce bias in estimation, inferences were carried out using the combined values from all iterations across chains.

We used five‐fold cross‐validation to evaluate the predictive power of the genome‐wide association mapping models. To do this, we refit the BSLMM model five times for each data set (species and insecticide treatment). In each case, we used a random 80% of the observations as a training set to fit the model and the other 20% to evaluate the model. We fit the BSLMM models via MCMC with 100,000 steps as a burn‐in, followed by 1 million sampling steps with a thinning interval of 10. The fit model was used to predict the survival phenotype of the test individuals, that is to obtain genomic‐estimated breeding values for each of the test individuals based on the additive effects of genes were captured by both β and *u* in the BSLMMs. We used the full set of predictions across the five‐fold cross‐validation sets to assess predictive performance. This was done using the R package “ROCR” (version 1.0.7); receiver operator characteristic (ROC) curves were constructed to interpret the area under the curve (AUC) and determine the predictive power in correctly classifying survival outcomes.