Genetic assignment of individuals to source populations using network estimation tools
Kuismin, Markku et al. (2019), Genetic assignment of individuals to source populations using network estimation tools, Dryad, Dataset, https://doi.org/10.5061/dryad.gqnk98sh8
Dispersal, the movement of individuals between populations, is crucial in many ecological and genetic processes. However, direct identification of dispersing individuals is difficult or impossible in natural populations. By using genetic assignment methods, individuals with unknown genetic origin can be assigned to source populations. This knowledge is necessary in studying many key questions in ecology, evolution and conservation.
We introduce a network-based tool, BONE (Baseline Oriented Network Estimation), for genetic population assignment, which borrows concepts from undirected graph inference. In particular, we use sparse multinomial Least Absolute Shrinkage and Selection Operator (LASSO) regression to estimate the probability of origin of each mixture individual, together with the mixture proportions, without tedious selection of the LASSO tuning parameter. We compare BONE with three genetic assignment methods implemented in the R packages radmixture, assignPOP and RUBIAS.
The probability-of-origin and mixture-proportion estimates given by BONE for both simulated and real data (an insular house sparrow metapopulation and Chinook salmon populations) are competitive with or superior to those of the other assignment methods. Our examples illustrate how the network estimation method adapts to population assignment, combining the efficiency and attractive properties of sparse network representation with the model selection properties of L1 regularization. As far as we know, this is the first approach showing how one can use network tools for genetic identification of individuals' source populations.
BONE is aimed at any researcher performing genetic assignment and trying to infer the genetic population structure. Compared to other methods, our approach also identifies outlying mixture individuals that could originate outside of the baseline populations. BONE is a freely available R package under the GPL license and can be downloaded at GitHub. In addition to the R package, a tutorial for BONE is available at https://github.com/markkukuismin/BONE/.
The house sparrow study metapopulation consists of 18 islands located in an archipelago off the coast of Helgeland in northern Norway. Monitoring of the metapopulation, which covers more than 1600 km², started in 1993 and is still ongoing. Population-level data on population sizes, individual data on year and island of birth, information on individual survival, and blood samples for DNA were collected annually in the breeding season (May-August) and post-breeding season (September-November). Most nestlings, fledged juveniles and adults in the study metapopulation have been ringed with a numbered metal ring and a unique combination of plastic colour rings. This, in combination with intensive recapture and re-sighting efforts, has provided high recapture rates (mean: 0.75) and good ecological dispersal data, which can be used, for example, to test the accuracy of genetic assignment methods.
SNP genotyping was carried out on house sparrows present on 8 islands that differ in microenvironment and habitat type, depending on the occurrence of livestock and dairy farms (five farm islands and three non-farm islands) and on distance from the mainland. On the five farm islands, all ringed adult individuals present from 1998 to 2013 were genotyped, whereas on the three non-farm islands all ringed adult individuals present from 2003 or 2004 to 2013 were genotyped. A total of 3269 adults were genotyped on a custom 200K Affymetrix Axiom SNP array. These SNPs are evenly distributed throughout the genome. After rigorous quality control, 183,145 SNPs with minor allele frequency above 0.01 and 3116 individuals with genotyping rate above 0.90 were retained. The data provided here cover the adults present in the year 2012. The data are anonymized.
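The two quality-control thresholds mentioned above (minor allele frequency above 0.01 per SNP, genotyping rate above 0.90 per individual) can be illustrated with a minimal Python sketch. This is not the pipeline used to produce the dataset. The genotype coding (allele counts 0, 1 or 2, with None for a missing call), the function names, the order of the two filters and the toy matrix are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed coding: copies of one allele (0, 1 or 2); None marks a missing call.
Genotype = Optional[int]


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """Minor allele frequency of one SNP, ignoring missing calls."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))  # frequency of the counted allele
    return min(freq, 1.0 - freq)


def genotyping_rate(calls: List[Genotype]) -> float:
    """Fraction of non-missing calls for one individual."""
    if not calls:
        return 0.0
    return sum(g is not None for g in calls) / len(calls)


def quality_filter(
    matrix: List[List[Genotype]],
    min_maf: float = 0.01,
    min_rate: float = 0.90,
) -> Tuple[List[int], List[int]]:
    """Return (kept individual indices, kept SNP indices).

    matrix[i][j] is the genotype of individual i at SNP j.
    Individuals are filtered first, then SNPs (an assumed order).
    """
    keep_ind = [i for i, row in enumerate(matrix) if genotyping_rate(row) > min_rate]
    n_snps = len(matrix[0]) if matrix else 0
    keep_snp = [
        j
        for j in range(n_snps)
        if minor_allele_frequency([matrix[i][j] for i in keep_ind]) > min_maf
    ]
    return keep_ind, keep_snp


# Hypothetical toy data: 4 individuals (rows) by 4 SNPs (columns).
toy = [
    [0, 1, 2, 0],
    [1, 1, 2, 0],
    [None, None, 2, 0],  # half of the calls are missing
    [2, 0, 2, 0],
]
kept_individuals, kept_snps = quality_filter(toy)
```

In the toy matrix, the third individual fails the genotyping-rate threshold and the last two SNPs are monomorphic among the remaining individuals, so they fail the minor allele frequency threshold.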
The PED file is a white-space (space or tab) delimited file. The first six columns are:
- Family ID
- Individual ID (anonymized)
- Paternal ID (0=missing)
- Maternal ID (0=missing)
- Sex (0=missing)
- Phenotype
The remaining columns are genotypes.
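As a rough illustration of this layout, the following Python sketch splits one PED line into its leading columns and its genotypes. It assumes the standard PLINK PED convention, in which the sixth leading column holds the phenotype and each variant occupies two allele columns. The function name and the example line are hypothetical and are not taken from the dataset.

```python
from typing import Dict, List, Tuple


def parse_ped_line(line: str) -> Dict[str, object]:
    """Split one PED line into its six leading columns and genotype pairs."""
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns followed by allele pairs")
    alleles = fields[6:]
    # Assumption: two allele columns per variant, as in standard PLINK PED files.
    genotypes: List[Tuple[str, str]] = [
        (alleles[k], alleles[k + 1]) for k in range(0, len(alleles), 2)
    ]
    return {
        "family_id": fields[0],
        "individual_id": fields[1],
        "paternal_id": fields[2],   # "0" means missing
        "maternal_id": fields[3],   # "0" means missing
        "sex": fields[4],           # "0" means missing
        "phenotype": fields[5],     # sixth column in the standard PLINK layout
        "genotypes": genotypes,
    }


# Hypothetical example line with two variants (not from the dataset).
record = parse_ped_line("FAM1\tIND1 0 0 1 -9 A A G T")
```

For a file with as many variants as this one, reading line by line and storing genotypes in a compact array would be preferable to keeping a list of tuples per individual.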
The REF file contains the island of each individual.
The MAP file is a text file with one line per variant and the following fields:
- Chromosome code
- Variant identifier
- Position in morgans or centimorgans
- Base-pair coordinate
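The four MAP fields listed above can be read with a short Python sketch such as the one below. It assumes that every line carries exactly four whitespace-separated fields. The record type, the function name and the example line are hypothetical.

```python
from typing import NamedTuple


class MapRecord(NamedTuple):
    chromosome: str          # chromosome code, kept as text
    variant_id: str          # variant identifier
    genetic_position: float  # position in morgans or centimorgans
    base_pair: int           # base-pair coordinate


def parse_map_line(line: str) -> MapRecord:
    """Parse one MAP line into a typed record."""
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) != 4:
        raise ValueError("expected exactly 4 fields per MAP line")
    return MapRecord(fields[0], fields[1], float(fields[2]), int(fields[3]))


# Hypothetical example line (not from the dataset).
variant = parse_map_line("1\tsnp_0001\t0\t12345")
```

The lines of the MAP file follow the same order as the genotypes in each PED line, so the k-th parsed record describes the k-th genotype of every individual.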
Norges Forskningsråd, Award: 223257
Norges Forskningsråd, Award: 221956
Norges Forskningsråd, Award: 274930