Data from: Simulated data for genomic selection and genome-wide association studies using a combination of coalescent and gene drop methods

Hickey JM, Gorjanc G

Date Published: July 25, 2014

DOI: http://dx.doi.org/10.5061/dryad.nm290

 

Files in this package

Content in the Dryad Digital Repository is offered "as is." By downloading files, you agree to the Dryad Terms of Service. To the extent possible under law, the authors have waived all copyright and related or neighboring rights to this data (CC0, Open Data).

Title File S1
Downloaded 83 times
Description
1) AlphaDrop: executable for Linux
2) macs: MaCS executable for Linux
3) msformatter: MaCS executable for Linux
4) Seed.txt: a file containing a random seed for initialising AlphaDrop
5) RunMacs.sh: a shell script called by AlphaDrop when it runs MaCS
6) AlphaDropSpec.txt: the specification file for AlphaDrop
7) Pedigree.txt: an example externally supplied pedigree file
8) MaCsSimulationParameters.xlsx: an Excel sheet with which MaCS parameters can be calculated
9) Ne100.sh: example of what to put into RunMacs.sh (Ne100 population of Hickey et al., 2011, Genetics Selection Evolution)
10) Ne1000.sh: example of what to put into RunMacs.sh (Ne1000 population of Hickey et al., 2011, Genetics Selection Evolution)
Download FileS1.zip (2.089 Mb)
Title Simulated Data - Part 1
Downloaded 160 times
Description Ten replicates of a livestock data structure were simulated. The structure was designed to cover a spectrum of QTL distributions, relationship structures, and SNP densities, and to mimic some of the scenarios where genomic selection is applied. In each replicate, sequence data for 4,000 base haplotypes for each of thirty chromosomes were simulated using the Markovian Coalescent Simulator (MaCS) (Chen et al., 2009). The thirty chromosomes were each 100 cM in length, comprising approximately 10^8 base pairs, and were simulated using a per-site mutation rate of 2.5 x 10^-8 and an effective population size (Ne) of 100 in the final generation of the sequence simulation. The reduction of Ne in the preceding generations was modeled with a Ne of 1,256 1,000 years ago, a Ne of 4,350 10,000 years ago, and a Ne of 43,500 100,000 years ago, with linear changes in between. This reflects estimates by Villa-Angulo et al. (2009) for the Holstein population. A pedigree was simulated comprising 10 generations of individuals, with 50 sires per generation, 10 dams per sire, and 2 offspring per dam. Base individuals in the pedigree had their gametes randomly sampled from the 4,000 haplotypes of the sequence simulation, allowing for recombination according to genetic distance using a 1% probability of a recombination event per cM. Subsequent generations in the pedigree had their gametes generated through Mendelian inheritance with recombination. The total number of segregating sites across the resulting genome was approximately 1,670,000. A random sample of 60,000 segregating sites was selected from the sequence to be used as SNPs on a 60,000 SNP array. In addition, two sets of 9,000 segregating sites were selected from the sequence to be used as candidate QTL loci: one sampled at random without restriction, and the other sampled at random with the restriction that the minor allele frequency could not exceed 0.30.
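The demographic history and pedigree dimensions described above can be sketched in a few lines of Python. This is only an illustration and not the configuration passed to MaCS or AlphaDrop: the function name `ne_at`, the constant names, and the choice to treat the final-generation Ne of 100 as "0 years ago" are assumptions made here.

```python
from bisect import bisect_right

# (years ago, Ne) breakpoints quoted in the description. Treating the
# final-generation Ne of 100 as "0 years ago" is an assumption.
NE_BREAKPOINTS = [(0, 100), (1_000, 1_256), (10_000, 4_350), (100_000, 43_500)]


def ne_at(years_ago):
    """Linearly interpolate Ne between the quoted breakpoints."""
    times = [t for t, _ in NE_BREAKPOINTS]
    if years_ago <= times[0]:
        return float(NE_BREAKPOINTS[0][1])
    if years_ago >= times[-1]:
        return float(NE_BREAKPOINTS[-1][1])
    # Index of the first breakpoint strictly older than years_ago.
    i = bisect_right(times, years_ago)
    (t0, n0), (t1, n1) = NE_BREAKPOINTS[i - 1], NE_BREAKPOINTS[i]
    return n0 + (n1 - n0) * (years_ago - t0) / (t1 - t0)


# Pedigree dimensions quoted in the description.
SIRES, DAMS_PER_SIRE, OFFSPRING_PER_DAM = 50, 10, 2
individuals_per_generation = SIRES * DAMS_PER_SIRE * OFFSPRING_PER_DAM
```

Under these assumptions, `ne_at(5_500)` lies halfway between the 1,000-year and 10,000-year breakpoints, and the pedigree dimensions give 1,000 individuals per generation, which agrees with a training set of 2,000 individuals drawn from two generations.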
Four different traits were simulated assuming an additive genetic model. The first pair of traits was generated using the 9,000 unrestricted candidate QTL loci. For the first trait (PolyUnres), the allele substitution effect at each QTL locus was sampled from a normal distribution with a mean of zero and a standard deviation of one unit. For the second trait (GammaUnres), a random subset of 900 of the candidate QTL loci was selected, and the allele substitution effect at each of these loci was sampled from a gamma distribution with a shape of 0.4 and a scale of 1.66 (Meuwissen et al., 2001), with a 50% chance of being positive or negative. The second pair of traits (PolyRes and GammaRes) was generated in the same way as the first pair, except that the candidate QTL loci were the 9,000 loci sampled with the restriction that their minor allele frequency could not exceed 0.30. Phenotypes with a heritability of 0.25 were generated for each trait. To ensure that the heritability of the four traits remained constant, the residual variance was scaled relative to the variance of the breeding values of individuals in the base generation, which was given by a'a/(n-1), where a is a vector of breeding values of individuals in the base generation and n is the number of individuals in that generation. Ten replicates of each scenario were simulated.
Training and validation data sets
Subsets of the data were extracted for training and validation. The training set comprised the 2,000 individuals in generations 4 and 5. Three validation sets were extracted. The first (Gen6) comprised 500 individuals sampled at random from generation 6. The second (Gen8) comprised 500 individuals sampled at random from generation 8. The third (Gen10) comprised 500 individuals sampled at random from generation 10.
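The trait model described above can be illustrated with a minimal Python sketch. It is not the AlphaDrop implementation. The function names, the genotype coding (allele counts 0, 1, or 2 per locus), and centring the breeding values before computing a'a/(n-1) are assumptions made here, and the residual variance uses the standard relationship var_e = var_a (1 - h2) / h2 rather than anything stated explicitly in the description.

```python
import random


def sample_effects(n_loci, kind, rng):
    """Allele substitution effects for the candidate QTL loci."""
    if kind == "poly":
        # Normal with mean 0 and standard deviation 1 at every locus.
        return [rng.gauss(0.0, 1.0) for _ in range(n_loci)]
    # "gamma": a random tenth of the loci (900 of 9,000 in the description)
    # receive an effect drawn from gamma(shape=0.4, scale=1.66), with a
    # 50% chance of a negative sign. All other loci have no effect.
    effects = [0.0] * n_loci
    for j in rng.sample(range(n_loci), n_loci // 10):
        effects[j] = rng.choice((-1.0, 1.0)) * rng.gammavariate(0.4, 1.66)
    return effects


def breeding_values(genotypes, effects):
    """Additive model: sum of allele count times effect over loci."""
    return [sum(g * e for g, e in zip(ind, effects)) for ind in genotypes]


def residual_variance(base_bv, h2=0.25):
    """Scale residual variance to the base-generation genetic variance."""
    n = len(base_bv)
    mean = sum(base_bv) / n
    # a'a/(n-1) with a centred on its mean (centring is an assumption).
    var_a = sum((a - mean) ** 2 for a in base_bv) / (n - 1)
    return var_a * (1.0 - h2) / h2


def phenotypes(bv, var_e, rng):
    """Breeding value plus a normally distributed residual."""
    sd = var_e ** 0.5
    return [a + rng.gauss(0.0, sd) for a in bv]
```

With a heritability of 0.25, the factor (1 - h2) / h2 equals 3, so the residual variance in this sketch is three times the base-generation variance of the breeding values.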
Download SimulatedData_Part1.zip (929.0 Mb)
Download README.txt (3.729 Kb)
Title Simulated Data - Part 2
Downloaded 100 times
Description Identical to the description given for Simulated Data - Part 1.
Download SimulatedData_Part2.zip (619.4 Mb)
Download README.txt (3.729 Kb)
Title Simulated Data - Part 3
Downloaded 100 times
Description Identical to the description given for Simulated Data - Part 1.
Download SimulatedData_Part3.zip (618.8 Mb)
Download README.txt (3.729 Kb)
Title Simulated Data - Part 4
Downloaded 77 times
Description Identical to the description given for Simulated Data - Part 1.
Download SimulatedData_Part4.zip (619.5 Mb)
Download README.txt (3.729 Kb)
Title Simulated Data - Part 5
Downloaded 94 times
Description Identical to the description given for Simulated Data - Part 1.
Download SimulatedData_Part5.zip (619.1 Mb)
Download README.txt (3.729 Kb)

When using this data, please cite the original publication:

Hickey JM, Gorjanc G (2012) Simulated data for genomic selection and genome-wide association studies using a combination of coalescent and gene drop methods. G3: Genes - Genomes - Genetics 2(4): 425-427. http://dx.doi.org/10.1534/g3.111.001297

Additionally, please cite the Dryad data package:

Hickey JM, Gorjanc G (2012) Data from: Simulated data for genomic selection and genome-wide association studies using a combination of coalescent and gene drop methods. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.nm290