Data from: Simulated data for genomic selection and genome-wide association studies using a combination of coalescent and gene drop methods
Hickey, John M. (1); Gorjanc, Gregor (2)
1. University of New England
2. University of Ljubljana
Published Jul 25, 2014 on Dryad.
https://doi.org/10.5061/dryad.nm290
Data files
Jul 25, 2014 version files 3.41 GB
- FileS1.zip
- README_for_SimulatedData_Part1.txt
- README_for_SimulatedData_Part2.txt
- README_for_SimulatedData_Part3.txt
- README_for_SimulatedData_Part4.txt
- README_for_SimulatedData_Part5.txt
- SimulatedData_Part1.zip
- SimulatedData_Part2.zip
- SimulatedData_Part3.zip
- SimulatedData_Part4.zip
- SimulatedData_Part5.zip
Abstract
An approach is described for simulating sequence, genotype, and phenotype data to study genomic selection and genome-wide association studies (GWAS). The simulation method, implemented in a software package called AlphaDrop, can be used to simulate genomic data and phenotypes with flexibility in terms of the historical population structure, recent pedigree structure, and distribution of quantitative trait loci effects, and it provides phased alleles and genotypes for both sequence and single nucleotide polymorphism data. Ten replicates of a representative scenario used to study genomic selection in livestock were generated and have been made publicly available. The simulated data sets were structured to encompass a spectrum of additive quantitative trait loci effect distributions, relationship structures, and single nucleotide polymorphism chip densities.
Usage notes
File S1
1) AlphaDrop: executable for Linux
2) macs: MaCS executable for Linux
3) msformatter: MaCS output formatter executable for Linux
4) Seed.txt: a file containing a random seed for initialising AlphaDrop
5) RunMacs.sh: a shell script called by AlphaDrop when it runs MaCS
6) AlphaDropSpec.txt: the specification file for AlphaDrop
7) Pedigree.txt: an example externally supplied pedigree file
8) MaCsSimulationParameters.xlsx: an Excel sheet with which MaCS parameters can be calculated
9) Ne100.sh: example of what to put into RunMacs.sh (Ne100 population of Hickey et al., 2011 Genetics Selection Evolution)
10) Ne1000.sh: example of what to put into RunMacs.sh (Ne1000 population of Hickey et al., 2011 Genetics Selection Evolution)
FileS1.zip
Simulated Data - Part 1
Ten replicates of a livestock data structure were simulated. The structure was designed to cover a spectrum of QTL distributions, relationship structures, and SNP densities and to mimic some of the scenarios where genomic selection is applied. In each replicate, sequence data for 4,000 base haplotypes for each of thirty chromosomes were simulated using the Markovian Coalescent Simulator (MaCS) (Chen et al., 2009). The thirty chromosomes were each 100 cM in length, comprising approximately 10^8 base pairs, and were simulated using a per-site mutation rate of 2.5 × 10^-8 and an effective population size (Ne) of 100 in the final generation of the sequence simulation. The reduction of Ne in the preceding generations was modeled with a Ne 1,000 years ago of 1,256, a Ne 10,000 years ago of 4,350, and a Ne 100,000 years ago of 43,500, with linear changes in between. This reflects estimates by Villa-Angulo et al. (2009) for the Holstein population. A pedigree was simulated comprising 10 generations of individuals, with 50 sires per generation, 10 dams per sire, and 2 offspring per dam. Base individuals in the pedigree had their gametes randomly sampled from the 4,000 haplotypes of the sequence simulation, allowing for recombination according to genetic distance with a 1% probability of a recombination event per cM. Subsequent generations in the pedigree had their gametes generated through Mendelian inheritance with recombination. The total number of segregating sites across the resulting genome was approximately 1,670,000. A random sample of 60,000 segregating sites was selected from the sequence to be used as SNPs on a 60,000 SNP array. In addition, 9,000 segregating sites were randomly selected from the sequence as candidate QTL in two different ways: one set was sampled without restriction, and the other was sampled with the restriction that minor allele frequency could not exceed 0.30.
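The gene drop step described above (gametes passed down the pedigree with a 1% chance of recombination per cM) can be illustrated with a minimal Python sketch. This is not AlphaDrop's implementation: the function names, the 0/1 allele coding, and the choice to place each crossover uniformly within its 1 cM interval are assumptions made for illustration. Only the chromosome length, the recombination probability, and the pedigree dimensions come from the description.

```python
import random

CHROM_LENGTH_CM = 100       # from the description: each chromosome is 100 cM
RECOMB_PROB_PER_CM = 0.01   # from the description: 1% chance of recombination per cM
SIRES, DAMS_PER_SIRE, OFFSPRING_PER_DAM = 50, 10, 2  # pedigree dimensions


def individuals_per_generation():
    """Number of offspring produced in each pedigree generation."""
    return SIRES * DAMS_PER_SIRE * OFFSPRING_PER_DAM


def crossover_points(rng, length_cm=CHROM_LENGTH_CM, prob=RECOMB_PROB_PER_CM):
    """Return sorted crossover positions (in cM) along one chromosome.

    Assumption: each 1 cM interval independently carries a crossover with
    probability `prob`, placed uniformly within that interval.
    """
    points = []
    for interval_start in range(length_cm):
        if rng.random() < prob:
            points.append(interval_start + rng.random())
    return points


def make_gamete(hap_a, hap_b, site_positions_cm, rng, prob=RECOMB_PROB_PER_CM):
    """Build one gamete from the two haplotypes of a parent.

    hap_a, hap_b: lists of alleles (hypothetical 0/1 coding) at the same sites.
    site_positions_cm: position of each site in cM, sorted ascending.
    """
    breaks = crossover_points(rng, prob=prob)
    parents = (hap_a, hap_b)
    strand = rng.randrange(2)  # Mendelian sampling of the starting haplotype
    gamete = []
    next_break = 0
    for idx, pos in enumerate(site_positions_cm):
        # Switch the template haplotype for every crossover passed so far.
        while next_break < len(breaks) and breaks[next_break] < pos:
            strand = 1 - strand
            next_break += 1
        gamete.append(parents[strand][idx])
    return gamete
```

With these dimensions each generation contains 50 × 10 × 2 = 1,000 individuals, which is consistent with a training set of 2,000 individuals drawn from two generations.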
Four different traits were simulated assuming an additive genetic model. The first pair of traits was generated using the 9,000 unrestricted candidate QTL. For the first trait (PolyUnres), the allele substitution effect at each QTL was sampled from a normal distribution with a mean of zero and a standard deviation of one unit. For the second trait (GammaUnres), a random subset of 900 of the candidate QTL was selected, and the allele substitution effect at each of these QTL was sampled from a gamma distribution with a shape of 0.4 and a scale of 1.66 (Meuwissen et al., 2001), with a 50% chance of being positive or negative. The second pair of traits (PolyRes and GammaRes) was generated in the same way as the first pair, except that the candidate QTL were the 9,000 sampled with the restriction that their minor allele frequency could not exceed 0.30. Phenotypes with a heritability of 0.25 were generated for each trait. To ensure that the heritability of the four traits remained constant, the residual variance was scaled relative to the variance of the breeding values of individuals in the base generation, which was given by a'a/(n-1), where a is the vector of breeding values of individuals in the base generation and n is the number of individuals in that generation. Ten replicates of each scenario were simulated.
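The trait simulation described above can be sketched in Python using only the standard library. This is an illustrative reading of the description, not the authors' code: the function names and the 0/1/2 genotype coding are hypothetical, and the sketch assumes that a is centred on the base-generation mean before a'a/(n-1) is computed (the description does not state this). For a heritability h2, the residual variance follows from h2 = var_a / (var_a + var_e), so var_e = var_a (1 - h2) / h2, which is 3 var_a when h2 = 0.25.

```python
import random

H2 = 0.25  # heritability stated in the description


def sample_poly_effects(n_qtl, rng):
    """Allele substitution effects drawn from N(0, 1), one per QTL."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_qtl)]


def sample_gamma_effects(n_qtl, rng, shape=0.4, scale=1.66):
    """Gamma(shape, scale) magnitudes, each with a 50% chance of either sign."""
    return [rng.gammavariate(shape, scale) * rng.choice((-1.0, 1.0))
            for _ in range(n_qtl)]


def breeding_values(genotypes, effects):
    """genotypes[i][j] is the allele count (0, 1 or 2) of individual i at QTL j."""
    return [sum(g * e for g, e in zip(row, effects)) for row in genotypes]


def base_variance(a):
    """Compute a'a / (n - 1) for base-generation breeding values.

    Assumption: a is first centred on its mean, so the result is the
    usual sample variance.
    """
    n = len(a)
    mean = sum(a) / n
    return sum((x - mean) ** 2 for x in a) / (n - 1)


def residual_variance(var_a, h2=H2):
    """Residual variance that yields heritability h2 given genetic variance var_a."""
    return var_a * (1.0 - h2) / h2


def simulate_phenotypes(a, var_e, rng):
    """Phenotype = breeding value + normally distributed residual."""
    sd_e = var_e ** 0.5
    return [x + rng.gauss(0.0, sd_e) for x in a]
```

For the gamma traits, `sample_gamma_effects(900, rng)` would supply effects for the 900 selected QTL, and `sample_poly_effects(9000, rng)` would do the same for the polygenic traits.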
Training and validation data sets
Subsets of the data were extracted for training and validation. The training set comprised the 2,000 individuals in generations 4 and 5. Three validation sets were extracted. The first (Gen6) comprised 500 individuals sampled at random from generation 6. The second (Gen8) comprised 500 individuals sampled at random from generation 8. The third (Gen10) comprised 500 individuals sampled at random from generation 10.
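The extraction of training and validation sets can be sketched as follows. The individual IDs and the `generation_of` mapping are hypothetical stand-ins for whatever identifiers the data files actually use; only the generation numbers and set sizes are taken from the description.

```python
import random


def split_training_validation(generation_of, rng, n_validation=500):
    """Split individuals into a training set and three validation sets.

    generation_of: dict mapping individual ID -> generation number (1..10).
    Returns (training_ids, {"Gen6": [...], "Gen8": [...], "Gen10": [...]}).
    """
    # Training set: every individual in generations 4 and 5.
    training = sorted(i for i, g in generation_of.items() if g in (4, 5))
    validation = {}
    for gen in (6, 8, 10):
        candidates = sorted(i for i, g in generation_of.items() if g == gen)
        # Random sample without replacement from the generation.
        validation[f"Gen{gen}"] = sorted(rng.sample(candidates, n_validation))
    return training, validation


# Toy pedigree with hypothetical IDs: 10 generations of 1,000 individuals.
generation_of = {gen * 10000 + k: gen for gen in range(1, 11) for k in range(1000)}
training, validation = split_training_validation(generation_of, random.Random(42))
```

Because the validation generations (6, 8 and 10) do not overlap with the training generations (4 and 5), the three validation sets are increasingly distant from the training individuals in the pedigree.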
SimulatedData_Part1.zip
Simulated Data - Part 2
The simulation design, traits, and training and validation data sets are as described under Simulated Data - Part 1.
SimulatedData_Part2.zip
Simulated Data - Part 3
The simulation design, traits, and training and validation data sets are as described under Simulated Data - Part 1.
SimulatedData_Part3.zip
Simulated Data - Part 4
The simulation design, traits, and training and validation data sets are as described under Simulated Data - Part 1.
SimulatedData_Part4.zip
Simulated Data - Part 5
The simulation design, traits, and training and validation data sets are as described under Simulated Data - Part 1.
SimulatedData_Part5.zip