Data from: A practical introduction to random forest for genetic association studies in ecology and evolution

Keywords: Association studies

Brieuc, Marine S. O.; Waters, Charles D.; Drinan, Daniel P.; Naish, Kerry Ann

Published Mar 01, 2018 on Dryad. https://doi.org/10.5061/dryad.k55hh8f

Data files

Mar 01, 2018 version files 23.82 MB

classification_RF_tutorial.R (31.27 KB)
data_classification_RF_tutorial.csv (818.44 KB)
data_regression_RF_tutorial.csv (819.24 KB)
Datasets and results of simulations.xlsx (20.84 MB)
Input data files and R code to examine overfitting by Random Forest.zip (1.23 MB)
R scripts for simulations to correct for pop structure.zip (49.32 KB)
regression_RF_tutorial.R (25.79 KB)

Abstract

Large genomic studies are becoming increasingly common with advances in sequencing technology, and our ability to understand how genomic variation influences phenotypic variation between individuals has never been greater. The exploration of such relationships first requires the identification of associations between molecular markers and phenotypes. Here we explore the use of Random Forest (RF), a powerful machine learning algorithm, in genomic studies to discern loci underlying both discrete and quantitative traits, particularly when studying wild or non-model organisms. RF is becoming increasingly used in ecological and population genetics because, unlike traditional methods, it can efficiently analyze thousands of loci simultaneously and account for non-additive interactions. However, understanding both the power and limitations of Random Forest is important for its proper implementation and the interpretation of results. We therefore provide a practical introduction to the algorithm and its use for identifying associations between molecular markers and phenotypes, discussing such topics as data limitations, algorithm initiation and optimization, as well as interpretation. We also provide short R tutorials as examples, with the aim of providing a guide to the implementation of the algorithm. Topics discussed here are intended to serve as an entry point for molecular ecologists interested in employing Random Forest to identify trait associations in genomic data sets.

Usage notes

Input data file for classification Random Forest tutorial in R

R script for classification Random Forest tutorial

Input data file for regression Random Forest tutorial in R

R script for regression Random Forest tutorial

Input data files and results of simulations to quantify the effectiveness of a method to correct for population stratification

R scripts for simulations to quantify the effectiveness of a method to correct for population stratification

Input data files and R code to examine overfitting by Random Forest for 1,000 loci and 10,000 loci data sets
