Data for: Estimating causal effects with machine learning: A guide for ecologists
Data files
Oct 17, 2025 version, files 20.04 KB
- case_study_data.csv (5.37 KB)
- MEE_SI.R (11.96 KB)
- README.md (2.72 KB)
Abstract
This repository contains the R code and data (simulated and empirical) used for the manuscript “Estimating Causal Effects with Machine Learning: A Guide for Ecologists.” It provides reproducible examples demonstrating the application of four causal machine learning methods.
The dataset includes:
- Simulated data generated in R to estimate the causal effect of honeybee abundance on wild bee populations. Variables include environmental covariates (e.g., soil, climate, topography), confounders (e.g., pollinated agriculture), an instrumental variable (beekeeping policy), and outcome measures (wild bee abundance). The simulated relationships combine linear, nonlinear, and interaction effects.
- Empirical data and example scripts illustrating the use of Causal Forests to assess heterogeneous effects of depth on Laminaria digitata abundance across Atlantic Canada, incorporating geographic (latitude, longitude) and biotic (invasive bryozoan) covariates.
- Annotated R scripts implementing DML, TMLE, nonlinear IV (Deep IV–inspired), and Causal Forest workflows.
The dataset is designed for reuse by researchers interested in learning or applying causal machine learning in ecology or related disciplines. All data are either simulated or derived from publicly available sources and contain no sensitive, confidential, or personally identifiable information. The materials are released for open reuse and adaptation, facilitating transparent and replicable applications of causal inference in ecology.
Dataset DOI: 10.5061/dryad.mw6m90694
Description of the data and file structure
Simulated and empirical data and R code associated with: Estimating causal effects with machine learning: A guide for ecologists
Files and variables
File: case_study_data.csv
Description: Variables included in the Laminaria digitata dataset used for the causal forest case study.
Variables
- laminaria_digitata: abundance (percent cover) of Laminaria digitata at each sampling site
- depth: depth (in meters) at each sampling site
- lat: latitude (decimal degrees) of each sampling site
- lon: longitude (decimal degrees) of each sampling site
- Membranipora membranacea: abundance (percent cover) of Membranipora membranacea at each sampling site
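As a quick check on the variables listed above, the following minimal sketch reads the file and reports any listed column that is absent. It assumes the CSV header matches the names in this README exactly (including the space in `Membranipora membranacea`) and that the file sits in the working directory. The names `missing_columns()` and `kelp` are hypothetical and are not taken from MEE_SI.R.

```r
# Column names as listed in this README (assumed to match the CSV header).
expected_cols <- c("laminaria_digitata", "depth", "lat", "lon",
                   "Membranipora membranacea")

# Hypothetical helper: return the expected columns missing from a data frame.
missing_columns <- function(df, expected = expected_cols) {
  setdiff(expected, names(df))
}

csv_path <- "case_study_data.csv"  # assumed location: working directory
if (file.exists(csv_path)) {
  # check.names = FALSE keeps the space in "Membranipora membranacea"
  # instead of converting it to "Membranipora.membranacea".
  kelp <- read.csv(csv_path, check.names = FALSE)
  absent <- missing_columns(kelp)
  if (length(absent) > 0) {
    warning("Columns not found: ", paste(absent, collapse = ", "))
  }
  str(kelp)
}
```

If the header turns out to use a different spelling for the bryozoan column, adjusting `expected_cols` is the only change needed.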
File: MEE_SI.R
Description: R script containing all code associated with the manuscript “Estimating Causal Effects with Machine Learning: A Guide for Ecologists.” The script includes reproducible workflows for data simulation, Double Machine Learning (DML), Targeted Maximum Likelihood Estimation (TMLE), nonlinear instrumental variable analysis (Deep IV–inspired), and Causal Forest models. All explanatory text has been converted to R comments for direct execution and readability.
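For orientation, a Causal Forest fit on the case study data might look like the sketch below. It is not the code in MEE_SI.R. It assumes depth is the continuous treatment, Laminaria digitata percent cover is the outcome, and latitude, longitude, and Membranipora membranacea cover are the covariates, as described in this README. The function name `fit_depth_forest()` is hypothetical, and all grf tuning parameters are left at their defaults.

```r
# Hypothetical wrapper around grf::causal_forest(); not the authors' workflow.
fit_depth_forest <- function(df, seed = 1) {
  if (!requireNamespace("grf", quietly = TRUE)) {
    stop("Package 'grf' is required for this sketch.")
  }
  df <- df[complete.cases(df), ]  # causal_forest() does not accept missing Y or W
  X <- as.matrix(df[, c("lat", "lon", "Membranipora membranacea")])  # covariates
  Y <- df[["laminaria_digitata"]]                                    # outcome
  W <- df[["depth"]]                                                 # treatment
  grf::causal_forest(X, Y, W, seed = seed)
}

# Run only when both the data file and grf are available.
if (file.exists("case_study_data.csv") &&
    requireNamespace("grf", quietly = TRUE)) {
  kelp <- read.csv("case_study_data.csv", check.names = FALSE)
  cf <- fit_depth_forest(kelp)
  tau_hat <- predict(cf)$predictions  # out-of-bag effect estimate per site
  print(summary(tau_hat))
}
```

The spread of `tau_hat` gives a first impression of how much the estimated depth effect varies across sites. The script itself should be consulted for the settings actually used in the manuscript.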
Code/software
All analyses were performed in R, an open-source statistical environment available from https://cran.r-project.org. Required packages include DoubleML, mlr3, mlr3learners, data.table, mgcv, tmle, SuperLearner, caret, nnet, AER, grf, and ggplot2. The provided script (MEE_SI.R) contains fully commented code for data simulation, Double Machine Learning, TMLE, nonlinear instrumental variable (Deep IV–inspired), and Causal Forest analyses. All software and packages are freely available through CRAN.
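Before running MEE_SI.R, it can help to confirm that the packages named above are installed. The sketch below only reports what is missing and does not install anything. The variable names are illustrative.

```r
# Packages listed in this README as required by MEE_SI.R.
required_pkgs <- c("DoubleML", "mlr3", "mlr3learners", "data.table", "mgcv",
                   "tmle", "SuperLearner", "caret", "nnet", "AER", "grf",
                   "ggplot2")

# TRUE for each package that can be loaded from the current library paths.
installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  message("Install them with install.packages(missing_pkgs)")
} else {
  message("All required packages are installed.")
}
```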
Access information
Data was derived from the following sources:
- Krumhansl K., Brooks C., Lowen B., DiBacco C. (2025). Camera Surveys of the Subtidal Flora of Nova Scotia and Southwest New Brunswick 2022–2023. Version 1.7. Fisheries and Oceans Canada. Sampling event dataset. Used for Laminaria digitata and Membranipora membranacea abundance data. Available under the Open Government Licence – Canada.
- General Bathymetric Chart of the Oceans (GEBCO). 500-m resolution global bathymetry dataset, accessed through R as a publicly available raster and used to derive depth values at sampling locations.
All other variables (simulated data) were generated within R for this study.
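To illustrate the role of the instrumental variable in the simulated example, the sketch below generates a deliberately simplified dataset and compares a naive regression with a hand-computed two-stage least squares estimate. It is not the data-generating process used in MEE_SI.R. The coefficients, sample size, and purely linear structure are assumptions chosen for illustration, whereas the manuscript's simulation also includes nonlinear and interaction effects.

```r
set.seed(42)
n <- 5000
true_effect <- -0.5  # assumed effect of honeybees on wild bees (illustrative)

agriculture <- rnorm(n)           # confounder, treated here as unobserved
policy      <- rbinom(n, 1, 0.5)  # instrument: beekeeping policy (0/1)
honeybee    <- 1.0 * policy + 1.0 * agriculture + rnorm(n)               # treatment
wild_bee    <- true_effect * honeybee + 1.0 * agriculture + rnorm(n)     # outcome

# Naive regression: omits the confounder, so the estimate is biased.
ols_est <- unname(coef(lm(wild_bee ~ honeybee))["honeybee"])

# Two-stage least squares by hand:
# stage 1 predicts the treatment from the instrument,
# stage 2 regresses the outcome on the predicted treatment.
honeybee_hat <- fitted(lm(honeybee ~ policy))
iv_est <- unname(coef(lm(wild_bee ~ honeybee_hat))["honeybee_hat"])

print(round(c(truth = true_effect, ols = ols_est, iv = iv_est), 2))
```

The manual second stage recovers the point estimate but not valid standard errors. For inference, `AER::ivreg(wild_bee ~ honeybee | policy)` from the AER package listed above is the more appropriate tool.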
