Skip to main content
Dryad

Comparing mixed models and Random Forest association tests using naturalGWAS and a Striped Bass SNP dataset

Cite this dataset

LeBlanc, Nathalie; Pavey, Scott (2022). Comparing mixed models and Random Forest association tests using naturalGWAS and a Striped Bass SNP dataset [Dataset]. Dryad. https://doi.org/10.5061/dryad.zw3r22872

Abstract

In this study, we used the phenotype simulation package naturalGWAS to test the performance of Zhao’s Random Forest method in comparison to an uncorrected Random Forest test, latent factor mixed models (LFMM), genome-wide efficient mixed models (GEMMA), and confounder adjusted linear regression (CATE). We created 400 sets of phenotypes, corresponding to five effect sizes and 2, 5, 15, or 30 causal loci, simulated from two empirical datasets containing SNPs from Striped Bass representing three and 13 populations. All association methods were evaluated for their ability to detect genotype-phenotype associations based on power, false discovery rates, and number of false positives. Genomic inflation was highest for uncorrected Random Forest and LFMM tests and lowest for Gemma and Zhao’s Random Forest. All association tests had similar power to detect causal loci, and Zhao’s Random Forest had the lowest false discovery rate in all scenarios. To measure the performance of association tests in small datasets with few loci surrounding a causal gene we also ran analyses again after removing causal loci from each dataset. All association tests were only able to find true positives, defined as loci located within 30k bp of a causal locus, in 3%–18% of simulations. In contrast, at least one false positive was found in 17%–44% of simulations. Zhao’s Random Forest again identified the fewest false positives of all association tests studied. The ability to test the power of association tests for individual empirical datasets can be an extremely useful first step when designing a GWAS study.

Methods

We created phenotypes based on empirical genotypes and location using the r package naturalGWAS. We created phenotypes drawn from 2, 5, 15, or 30 causal variants (SNPs that contribute to the phenotype), at effect sizes of 1, 10, 100, 1000, and 10000. We also generated phenotypes with GxE interactions set to 0.2 and 0.8, where 0.2 represents low amounts of GxE interactions and 0.8 represents high amounts. Gene-by-environment (GxE) interactions refer to interactions between the environment and an individual’s genotype where certain genotypes show greater environmental effects and lower heritability for a trait than others, in contrast to genetic influences or environmental influences each working in isolation (Hunter, 2005; Rutherford, 2000). All simulations were run with 40 confounders taken from the first 40 principal components of the dataset’s genotype matrix.

Simulations fell into 6 different 'datasets' and were analyzed using 6 different association tests as outlined in the associated manuscript. Genomic inflation was measured and corrected for using an R script. Power, false discovery rates, and number of false positives were also calculated using an R script. R scripts and shell scripts are included in this Dryad repository and organized as follows:

CFAll_m7miss70maf0-01_noRelMiss_LD0-8_HetDepFilt.recode.vcf - VCF containing the genotypes of samples included in 3-population datasets

RangeLong_m7miss70maf0-01_AllFilters24Chroms_LD0-8.vcf - VCF containing the genotypes of samples included in 13-population datasets

The zip files simulation_scripts.zip contains R scripts and shell scripts used in the above analyses are deposited for each of six 'datasets', as follows:

Dataset Folder
3-population dataset, all samples and loci CF
3-population dataset, all samples and causal loci removed CF_noCausal
3-population dataset, admixed samples removed CF_noAdmixed
3-population datset, admixed samples and causal loci removed CF_noAdmixednoCausal
13-population dataset, all loci Range
13-population dataset, causal loci removed Range_noCausal

Each folder contains the following files named as follows (with variations when dataset is included in the file name):

  • Simulations.R - Calculates simulated phenotypes based on the provided genotype matrix. Code adapted and expanded from the naturalGWAS vignette found at https://github.com/bcm-uga/NaturalGWAS
  • CATE.R - Implements confounder adjusted testing and estimation (CATE) using the R package cate v1.1. This script also creates the qvals files that subsequent scripts read in and append their results.
  • MLM.R - Implements latent factor mixed model analysis (LFMM) using the R package lfmm v1.0
  • Oracle.R - Implements the 'ideal' mixed model analysis Oracle for comparison implemented in the R package naturalGWAS v0.1.0
  • URF.R - Implements Random orest using the R package randomforest v4.6-14
  • ZhaoRF.R - Inplements Random Forest using a method of correction outlined by Zhao et al. (2012). Linear regressions were implemented using code modified from Brieuc et al. 2018 (dryad doi: https://datadryad.org/stash/dataset/doi:10.5061/dryad.k55hh8f)
  • gemma_README.txt - describes the pipeline for running command-line Gemma on a Linix platform. All associated scripts begin with the word 'gemma'.
  • PowerCalc.R - Accepts files with the following columns: chromosome number, position on chromosome, p-values, and q-values. Given a list of causal loci, calculates power, false discovery rates, and number of false positives using sequential Bonferroni correction and FDR thresholds of 0.05 and 0.2. Reads in files according to rows in the file 'qvaluePowers_Template.csv' (provided) and records calculated values in each row. Also outputs graphs and heatplots of power, false discovery rates, and false positives.
  • fst_calc_workshop.R - Calculates locus-specific FST of causal loci and kruskal-wallace tests comparing correlation of phenotype and location,
  • extraLoci.R - creates a csv file that lists, for every causal locus, every locus within 30k basepairs of that locus. Loci are assumed to be named using the format 'chromosome#_position#'. The distance threshold can be easily modified.
  • col_Names.txt - Text file listing all locus names, used to add column names to genotype matrices
  • sample_names_X.csv - File containing sample names (SampleID) and their population group (Location).
  • coords.tsv - File containing a longtitude and latitude for each individual sample.
  • gen_map_maf0-01_HetDepth.tsv - File containing the chromosome position (#CHROM) and position within that chromosome (POS) of loci in the genotype matrix.
  • The nullScripts folders contain shell scripts that implement the permutation of simulated phenotypes, destroying the association between phenotype and causal loci, training of new Random Forest models. This is repeated 1000 times per phenotype to create a null distribution of importance values, which are then exported for use in p-value and q-value calculation in the above Random Forest naturalGWAS scripts.
  • In addition, the file simulationNames.txt contains a list of all unique simulation names, used in some R Scripts.

Also included is the folder Combined_Graphs which contains the following files:

  • CF_genomicControl.R - Calculates a lambda slope for all p-values, correct p-values, and then recalculates q-values based on the new values
  • CFnoAdmix_genomicControl.R - Calculates a lambda slope for all p-values, correct p-values, and then recalculates q-values based on the new values
  • CFnoAdmixCausal_genomicControl.R - Calculates a lambda slope for all p-values, correct p-values, and then recalculates q-values based on the new values
  • CFnoCausal_genomicControl.R - Calculates a lambda slope for all p-values, correct p-values, and then recalculates q-values based on the new values
  • Range_genomicControl.R - Calculates a lambda slope for all p-values, correct p-values, and then recalculates q-values based on the new values
  • RangenoCausal_genomicControl.R - Calculates a lambda slope for all p-values, correct p-values, and then recalculates q-values based on the new values
  • PowerCalc.R - Accepts files with the following columns: chromosome number, position, p-values, and q-values. Given a list of causal loci, calculates power, false discovery rates, and number of false positives using sequential Bonferroni correction and FDR thresholds of 0.05 and 0.2. Reads in files according to rows in the file 'qvaluePowers_Template.csv' (provided) and records calculated values in each row. Also outputs graphs and heatplots of power, false discovery rates, and false positives.
  • figures.R uses a csv file containing power, FDR, and false positive data of simulations to get summary statistics and graphs.
  • qvaluePowers_Template.csv - Copy of the template file used with PowerCalc.R. Column names are as follows: rowname = unique name for each row, Sim = unique name for each set of simulated phenotypes, Test = association test used to produce this row's results, Causal = number of causal loci for the simulated phenotypes, GxE = genotype x environment interaction value where 2 = 0.2 and 8 = 0.8, Effect = effect size used for simulated phenotypes, FP = number of false positives, Power = proportion of causal loci found, FDR = false discovery rate, Corr = method of multiple test correction used when calculating power and false positives (SB = sequential bonferroni, FDR0.05 = false discovery rate 0.5, FDR0.2 = false discovery rate 0.2), Replicate = letter used to identify replicates of causal-GxE-effect parameter value combinations (a-e).
  • qvals_c15ge2e10000b.csv - Example 'qval' file used in several scripts. Column names are as follows: x = row names exported from R, chrom = chromosome on which the locus is located, pos = position on the chromosome in basepairs, cate_pval  = p-values calculated by the association test CATE while measuring association of that row's locus to the simulated phenotype, cate_qval  = q-values calculated by the association test CATE while measuring association of that row's locus to the simulated phenotype, mlm_pval  = p-values calculated by the association test LFMM while measuring association of that row's locus to the simulated phenotype, mlm_qval  = q-values calculated by the association test LFMM while measuring association of that row's locus to the simulated phenotype.

Usage notes

R Scripts must be adapted for use on your own sample data and file paths. naturalGWAS scripts were run in the following order in the attached study: Oracle, MLM, CATE, Classic RF, Zhao's RF, however file paths and data input can be modified to change the order as needed.

R Scripts were written with the intention of being run in a programming environment such as R Studio and may not work if run from the command line. Any packages called by the scripts must be installed before use. Gemma shell scrips use the free software GEMMA found at https://github.com/genetics-statistics/GEMMA, run on a Linux command line.

Funding

Science and Engineering Research Council

Canada Research Chairs

Canada Foundation for Innovation

New Brunswick Innovation Foundation