Dryad

Data for: Estimating causal effects with machine learning: A guide for ecologists

Data files

Oct 17, 2025 version files (20.04 KB)

Abstract

This repository contains the R code and data (simulated and empirical) used for the manuscript “Estimating Causal Effects with Machine Learning: A Guide for Ecologists.” It provides reproducible examples that apply four causal machine learning methods.

The dataset includes:

  1. Simulated data generated in R to estimate the causal effect of honeybee abundance on wild bee populations. Variables include environmental covariates (e.g., soil, climate, topography), confounders (e.g., pollinated agriculture), an instrumental variable (beekeeping policy), and outcome measures (wild bee abundance). The simulated relationships combine linear, nonlinear, and interaction effects.
  2. Empirical data and example scripts illustrating the use of Causal Forests to assess heterogeneous effects of depth on Laminaria digitata abundance across Atlantic Canada, incorporating geographic (latitude, longitude) and biotic (invasive bryozoan) covariates.
  3. Annotated R scripts implementing Double Machine Learning (DML), Targeted Maximum Likelihood Estimation (TMLE), nonlinear instrumental variable (IV; Deep IV–inspired), and Causal Forest workflows.
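
To make the variable roles in item 1 concrete, here is a minimal sketch of how a confounder and an instrument enter a simulation of this kind. It is not the repository's R code or its data-generating process: the coefficients, sample size, purely linear structure, and variable names below are all assumptions chosen for illustration, and the example is written in Python using only the standard library. It compares a naive regression slope with two adjusted estimates: a linear partialling-out step (a simplified stand-in for the residualization idea behind DML, without cross-fitting or flexible learners) and a simple instrumental-variable (Wald) ratio.

```python
import random


def slope(x, y):
    """OLS slope of y on a single predictor x (intercept included)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after a linear fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def covariance(x, y):
    """Population covariance of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n


def simulate(n=20000, true_effect=-0.5, seed=1):
    """Hypothetical linear simulation; every coefficient is an assumption."""
    rng = random.Random(seed)
    # Instrument: a binary beekeeping policy that shifts honeybee abundance
    # but has no direct path to wild bees.
    policy = [rng.choice([0.0, 1.0]) for _ in range(n)]
    # Confounder: pollinated agriculture, which raises both bee groups.
    agriculture = [rng.gauss(0, 1) for _ in range(n)]
    # Treatment: honeybee abundance.
    honeybee = [1.0 * z + 0.8 * c + rng.gauss(0, 1)
                for z, c in zip(policy, agriculture)]
    # Outcome: wild bee abundance, with the causal effect we want to recover.
    wild_bee = [true_effect * t + 1.0 * c + rng.gauss(0, 1)
                for t, c in zip(honeybee, agriculture)]
    return policy, agriculture, honeybee, wild_bee


policy, agriculture, honeybee, wild_bee = simulate()

# Naive estimate: ignores the confounder, so it is biased.
naive = slope(honeybee, wild_bee)

# Partialling-out: remove the confounder's linear influence from both
# treatment and outcome, then regress residual on residual.
adjusted = slope(residuals(agriculture, honeybee),
                 residuals(agriculture, wild_bee))

# Wald / IV ratio: uses only the variation in treatment driven by the policy.
iv = covariance(policy, wild_bee) / covariance(policy, honeybee)

print(f"naive={naive:.3f} adjusted={adjusted:.3f} iv={iv:.3f} (true=-0.5)")
```

In this toy setup the naive slope is pulled toward zero because agriculture raises both honeybee and wild bee abundance, while the partialled-out and IV estimates should land near the assumed true effect. The partialling-out estimate requires the confounder to be observed; the IV ratio does not, but it relies on the policy affecting wild bees only through honeybees.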

The dataset is designed for reuse by researchers interested in learning or applying causal machine learning in ecology or related disciplines. All data are either simulated or derived from publicly available sources and contain no sensitive, confidential, or personally identifiable information. The materials are released for open reuse and adaptation, facilitating transparent and replicable applications of causal inference in ecology.