Farmers need accurate estimates of winter cover crop biomass to make informed decisions on termination timing or to estimate potential release of nitrogen from cover crop residues to subsequent cash crops. Utilizing data from an extensive experiment across 11 states from 2016 to 2020, this study explores the most reliable predictors for determining cereal rye cover crop biomass at the time of termination. Our findings demonstrate a strong relationship between early-season and late-season cover crop biomass. Employing a random forest model, we predicted late-season cereal rye biomass with a margin of error of approximately 1,000 kg ha-1 based on early-season biomass, growing degree days, cereal rye planting and termination dates, photosynthetically active radiation, precipitation, and site coordinates as predictors. Our results suggest that similar modeling approaches could be combined with remotely sensed early-season biomass estimations to improve the accuracy of predicting winter cover crop biomass at termination for decision support tools.
https://doi.org/10.5061/dryad.ngf1vhj1r
Utilizing data from an extensive experiment across 11 states from 2016 to 2020, this study explores the most reliable predictors for determining cereal rye cover crop biomass at the time of termination. The dataset includes experimental data such as early-season biomass, cereal rye planting and termination dates, and site coordinates, and we extracted weather data such as growing degree days, photosynthetically active radiation, and precipitation to predict cereal rye biomass at the time of termination.
Description of the data and file structure
Two data files are included. The first, "experimental_data.csv", contains all of the data describing and collected from the field experiments. The second, "experimental_and_weather_data.csv", joins the weather and experimental data. All columns are defined in detail in the "data_dictionary.csv" file. "NA" entries correspond to missing data.
We include an R Project file and use the "here" package for data and directory organization. We intend for the data files to be contained in a "data" subdirectory in the root folder where the "CC_biomas_model.Rproj" file is located.
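As a rough illustration of the file conventions described above, the following Python sketch (it is not one of the included R scripts) reads a CSV and converts "NA" entries to missing values. The inline sample and its column names (`site`, `early_biomass`) are hypothetical stand-ins; the real columns are defined in "data_dictionary.csv".

```python
import csv
import io


def read_rows(handle, missing="NA"):
    """Read CSV rows as dicts, mapping the missing-data marker to None."""
    reader = csv.DictReader(handle)
    return [
        {key: (None if value == missing else value) for key, value in row.items()}
        for row in reader
    ]


# Hypothetical two-row sample standing in for a file in the "data" subdirectory.
sample = io.StringIO("site,early_biomass\nA,850\nB,NA\n")
rows = read_rows(sample)
```

For a real file, the same function could be called on an open file handle, for example `open("data/experimental_data.csv", newline="")`, assuming the directory layout described above.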
Sharing/Access information
Links to other publicly accessible locations of the data that we used to derive weather variables:
Code/Software
We include three R scripts developed using R v4.3.1 and the following packages and versions:
httr_1.4.7, lubridate_1.9.2, forcats_1.0.0, stringr_1.5.0, dplyr_1.1.3, purrr_1.0.2, readr_2.1.4, tidyr_1.3.0, tibble_3.2.1, ggplot2_3.4.4, tidyverse_2.0.0, here_1.0.1, patchwork_1.1.3, GGally_2.2.0, merTools_0.6.1, arm_1.13-1, MASS_7.3-60, sjPlot_2.8.15, effects_4.2-2, lmerTest_3.1-3, MuMIn_1.47.5, car_3.1-2, carData_3.0-5, lme4_1.1-34, Matrix_1.6-1.1, randomForest_4.7-1.1.
The scripts are numbered in the order in which they are designed to be run. "1_weather_data_for_CC_model.R" demonstrates how the weather data were downloaded and derived from the public sources mentioned above. "2_CC_biomass_modeling.R" explores covariation in the dataset and contains code to produce the GLMM model and some figures. "3_RF_model.R" contains the code to produce the random forest model and some figures.
2.1 Field sites and operations
Cereal rye cover crop biomass data used in the modeling approach were obtained from a field experiment conducted on research farms in 11 states between 2016 and 2020 (as outlined in Supplementary Table 1). Cereal rye was planted in 9.1 by 12.2 m plots in the late fall of each year, with four or five replicates per site-year. Management practices (i.e., cereal rye variety, seeding rates and methods) specific to each site were based on local norms. Biomass samples were collected from two 0.5-m2 quadrats in each plot at six weeks (hereafter referred to as “early-season biomass”) and two weeks (“late-season biomass”) prior to target dates for soybean planting. Cereal rye variety, as well as cereal rye planting and early and late-season biomass dates are summarized across sites and years in Supplementary Table 2.
2.2 Data assembly and preparation
In addition to early-season biomass, which was hypothesized to predict winter cover crop growth, weather variables related to temperature and radiation were used to model late-season biomass. Minimum and maximum air temperatures (℃) and shortwave incoming solar radiation (W m-2) were extracted for each site-year on a daily basis at a spatial resolution of 0.125° by 0.125° from the North American Land Data Assimilation System Phase 2 dataset (Xia et al., 2012). Cumulative growing degree days (CGDD) (-4.5° C base) were calculated over two time periods and negative values were omitted (Pessotto et al., 2023). “Early CGDD” and “early precipitation” were summed between cereal rye planting date and early termination date (six weeks prior to soybean planting), and “late CGDD” and “late precipitation” were summed between early termination and late termination date. Precipitation data were extracted from the multi-radar/multi-sensor system (NOAA Multi-Radar/Multi-Sensor System (MRMS), 2023). Daily photosynthetically active radiation (PAR) was calculated from shortwave radiation using the ‘sw.to.par’ function in the LakeMetabolizer v.1.5.0 R package (Winslow et al., 2016). The mean of daily PAR was calculated for the period between early and late cover crop termination dates.
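The CGDD calculation can be sketched in Python as follows. This is a minimal illustration, not the code in the included R scripts. It assumes daily growing degree days are computed as the mean of minimum and maximum temperature minus the base temperature, which the text does not state explicitly; the -4.5 °C base and the omission of negative values come from the description above. The function names, the half-open date window, and the temperature records are hypothetical.

```python
from datetime import date


def daily_gdd(tmin_c, tmax_c, base_c=-4.5):
    """Growing degree days for one day; negative values are omitted (set to 0)."""
    gdd = (tmin_c + tmax_c) / 2.0 - base_c  # assumed averaging method
    return max(gdd, 0.0)


def window_cgdd(records, start, end, base_c=-4.5):
    """Sum daily GDD for records with start <= day < end (assumed window bounds)."""
    return sum(
        daily_gdd(tmin, tmax, base_c)
        for day, tmin, tmax in records
        if start <= day < end
    )


# Hypothetical daily (date, Tmin, Tmax) records in degrees Celsius.
records = [
    (date(2019, 4, 1), -10.0, -2.0),
    (date(2019, 4, 2), 0.0, 10.0),
    (date(2019, 4, 3), 5.0, 15.0),
    (date(2019, 4, 4), 10.0, 20.0),
]
early_cgdd = window_cgdd(records, date(2019, 4, 1), date(2019, 4, 3))
late_cgdd = window_cgdd(records, date(2019, 4, 3), date(2019, 4, 5))
```

In practice the first window would run from the planting date to the early termination date, and the second from the early to the late termination date.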
2.3 Statistical Analyses
2.3.1 Model selection to evaluate support for each covariate
All predictor variables (early-season cereal rye biomass, cereal rye planting date (Julian days), late termination date (Julian days), mean late PAR, and both early and late CGDD) were standardized by subtracting the mean and dividing by the standard deviation of each variable (Gelman, 2008). We examined all candidate predictor variables for collinearity using the vif function in the car package (v3.0-10) (Fox et al., 2018) and removed the precipitation variables because their variance inflation factor scores exceeded 3 (Zuur et al., 2010). Site location (which varied occasionally from year to year within states) was input as a unique categorical variable for each set of field location coordinates.
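The two preprocessing steps above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the standardization uses the sample standard deviation, and the variance inflation factor is shown only for the two-predictor case, where it reduces to 1 / (1 - r^2). The vif function in car handles any number of predictors by regressing each predictor on all of the others. The function names and numbers are hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def vif_two_predictors(x, z):
    """VIF when a model has exactly two predictors: 1 / (1 - r^2)."""
    mx, mz = mean(x), mean(z)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    r_squared = sxz ** 2 / (sxx * szz)
    return 1.0 / (1.0 - r_squared)


# Hypothetical planting dates (Julian days) and two strongly correlated predictors.
z_scores = standardize([100.0, 110.0, 120.0])
vif = vif_two_predictors([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

With these made-up values the VIF is well above the threshold of 3, so one of the two predictors would be a candidate for removal.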
We fit a generalized linear mixed effects model (GLMM) using the glmer function in the lme4 package (Bates et al., 2015) with a Gaussian error distribution and log link function due to overdispersion in the response variable (late-season cereal rye cover crop biomass in kg ha-1). We specified a hierarchical model with random intercepts for each location and for blocks (nested under each location) to address the non-independence of repeated measurements within the same locations and blocks through time (Pinheiro & Bates, 2000). We fit a “global” model with all covariates that we hypothesized to be important including early-season cereal rye biomass, cereal rye planting date (in Julian days), late termination date, mean late PAR, and both early and late CGDD. We visually assessed model assumptions of homogeneity of variance across groups and normality of fitted residuals.
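In equation form, the global model described above can be written roughly as follows. The notation is introduced here for illustration only and does not appear in the original analysis: y is late-season biomass for observation k in block j at location i, x_1 to x_6 are the six standardized covariates, and u and v are the random intercepts for location and for block nested within location.

```latex
\begin{aligned}
y_{ijk} &\sim \mathcal{N}\left(\mu_{ijk}, \sigma^2\right), \\
\log\left(\mu_{ijk}\right) &= \beta_0 + \sum_{p=1}^{6} \beta_p \, x_{p,ijk} + u_i + v_{j(i)}, \\
u_i &\sim \mathcal{N}\left(0, \sigma_u^2\right), \qquad
v_{j(i)} \sim \mathcal{N}\left(0, \sigma_v^2\right).
\end{aligned}
```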
2.3.2 Random forest model and validation
To improve the accuracy of predictions, we also fit a random forest machine learning model on the dataset using the randomForest package v. 4.7-1.1 in R (Breiman, 2001; Liaw & Wiener, 2002). We specified a random forest model with the training parameters ntree set to 1,000 and mtry set to 2 and included the same covariates as the GLMM, except that we included site latitude and longitude coordinates separately rather than as categorical locations. Variable importance was calculated with the randomForest package; variables were ranked using %IncMSE, the mean decrease in prediction accuracy on the out-of-bag samples as each variable is randomly permuted.
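The idea behind permutation-based importance can be sketched in Python as follows. This is a simplified illustration, not the randomForest implementation: the package computes the measure tree by tree on out-of-bag samples and averages the result, whereas this sketch permutes each column once on a single dataset and works with any prediction function. The stand-in predictor, data, and function names are hypothetical.

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of a prediction function over a dataset."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Percent increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, targets)  # must be nonzero
    increases = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        permuted = mse(predict, shuffled, targets)
        increases.append(100.0 * (permuted - baseline) / baseline)
    return increases


# Hypothetical stand-in for a fitted model: it uses feature 0 and ignores feature 1.
def predict(row):
    return 2.0 * row[0]


rows = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 7.0], [6.0, 2.0]]
targets = [2.5, 3.5, 6.5, 7.5, 10.5, 11.5]
importance = permutation_importance(predict, rows, targets, n_features=2)
```

Shuffling a feature that the model ignores leaves the error unchanged, so its importance is zero, while shuffling an informative feature can only keep or increase the error in this example.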
The dataset was randomly partitioned so that the random forest model was trained on 70% of the total data, and 30% was withheld and used for model validation. We also used the same data partition to validate a version of the “global model” GLMM that was refitted to include only the training data. To assess how model performance varied across “low” and “high” cover crop biomass values, we evaluated it separately for “low” biomass observations of 4,000 kg ha-1 or less and “high” cereal rye biomass values greater than 4,000 kg ha-1.
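The partitioning and the separate evaluation of low and high biomass values can be sketched in Python as follows. This is a minimal illustration under stated assumptions: root mean squared error is used as the error metric, which the text above does not specify, and the seed, function names, and biomass values are hypothetical. The 70/30 split and the 4,000 kg ha-1 threshold come from the description above.

```python
import math
import random


def train_test_split(records, train_fraction=0.7, seed=42):
    """Randomly partition records into training and validation sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def rmse(observed, predicted):
    """Root mean squared error (assumed metric)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_by_biomass_class(observed, predicted, threshold=4000.0):
    """RMSE for observations at or below the threshold and above it."""
    pairs = list(zip(observed, predicted))
    groups = {
        "low": [(o, p) for o, p in pairs if o <= threshold],
        "high": [(o, p) for o, p in pairs if o > threshold],
    }
    return {name: rmse(*zip(*grp)) for name, grp in groups.items() if grp}


# Hypothetical record IDs and biomass values (kg ha-1).
train, test = train_test_split(list(range(10)))
errors = rmse_by_biomass_class(
    observed=[1000.0, 3000.0, 5000.0, 7000.0],
    predicted=[1500.0, 2500.0, 6000.0, 6000.0],
)
```

Fixing the seed keeps the partition reproducible, so the same training and validation sets can be reused for both the random forest and the refitted GLMM.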