Data for: A hybrid biophysical-machine learning framework for diurnal surface energy flux estimation using proximal sensing
Data files
Abstract
Thermal-based remote sensing of surface energy fluxes has traditionally relied on high spatial resolution satellite data with revisit frequencies on the order of weeks. In this study, we evaluate a biophysics-based analytical surface energy balance model for predicting latent energy (LE) and sensible heat (H) fluxes using proximal sensing observations. The Surface Temperature Initiated Closure (STIC1.2) model has been extensively validated across a wide range of spatial and temporal scales using various satellite-derived thermal datasets. Here we extend this validation by applying STIC at sub-hourly temporal resolution over multiple growing seasons for four distinct agricultural systems. We further develop and evaluate novel STIC variants that incorporate machine learning (ML) techniques to eliminate the need for specific surface energy balance observations, specifically net radiation and soil heat flux, thereby enhancing model applicability in data-sparse settings. The integration of an ML component to estimate surface available energy is shown to have strong predictive performance for both LE (R2 = 0.81-0.94) and H (R2 = 0.46-0.72) across all agricultural systems examined here, demonstrating the potential of hybrid biophysical – machine learning approaches for surface energy balance modeling with minimal data requirements. This study concludes with a novel application of explainable machine learning (exML) to diagnose sources of model error. This exML framework attributes residual prediction errors to both model input variables and environmental drivers not explicitly included in the simulation experiments. This approach provides a new pathway for improving model design and integrating previously overlooked yet influential variables into future model iterations.
https://doi.org/10.5061/dryad.r4xgxd2tm
Authors: James F. Cross; Kanishka Mallick; Guler Aslan-Sungur; Andy VanLoocke; Darren T. Drewry
Corresponding author: Darren Drewry (drewry.19@osu.edu)
Repository Structure
The archive consists of the following main components:
│ CODE.zip
├─ DATA_OUTPUT
│ └── PARAMETERS
│ └── SHAP_DATA
│ └── STIC_OUT_DATA
├─ HELPER_FUNCTIONS
├─ FIGURES_CODE
│ └── Figure [2-9]
│ └── Figure [S1-S5]
│ A.DEFINE_STIC_parameters.m
│ A.DEFINE_TOWER_parameters.m
│ B.MAIN_STIC.m
│ C.OUTPUT_SHAP_INPUTS_csv.m
│ D.RUN_SHAP_COHORT.pynb
│ E.GATHER_SHAP_importance.m
│ DATA.zip
│ PAPER_FIGURES.zip
│ └── Figure [1-9]
│ └── Figure [S1-S5]
│ SOFTWARE.zip
├─ COHORT_SHAP
│ └── cohort_shapley-main
Code Execution
- To reproduce the results from the paper, extract the four zip files to match the repository structure above, i.e., so that the parent directory contains the folders "CODE", "DATA", "PAPER_FIGURES", and "SOFTWARE".
- The primary findings of the paper are reproduced by running the six main code files in order (A-E), as listed under "Main Code Files" in the "Code" section.
- Figure files can be reproduced after running the main files by executing the respective script (*.m) in "CODE/FIGURES_CODE/Figure X/".
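The extraction step can be scripted. The following is a minimal Python sketch, not part of the archive: the function name `extract_archives` and the two directory arguments are hypothetical, and it assumes the four zip files sit together in one download folder. Because this README does not state whether each zip already contains a top-level folder of the same name, the sketch handles both cases.

```python
import zipfile
from pathlib import Path

# Archive names taken from this README.
ARCHIVES = ["CODE", "DATA", "PAPER_FIGURES", "SOFTWARE"]


def extract_archives(download_dir, parent_dir):
    """Extract NAME.zip so that parent_dir/NAME exists; return any missing folders."""
    download_dir, parent_dir = Path(download_dir), Path(parent_dir)
    for name in ARCHIVES:
        target = parent_dir / name
        with zipfile.ZipFile(download_dir / f"{name}.zip") as zf:
            # Assumption: if every member already lives under NAME/, extract
            # into the parent directory; otherwise extract into parent_dir/NAME.
            top_levels = {member.split("/")[0] for member in zf.namelist()}
            dest = parent_dir if top_levels == {name} else target
            dest.mkdir(parents=True, exist_ok=True)
            zf.extractall(dest)
    return [name for name in ARCHIVES if not (parent_dir / name).is_dir()]
```

An empty return value indicates that all four expected folders are present in the parent directory.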
Code
CODE/DATA_OUTPUT/PARAMETERS/
- NET_energy_models/
- Contains pretrained models used by STIC_{ML,Rn} to estimate the net energy (net radiation - soil heat flux).
- SHF_models/
- Contains pretrained models used by STIC_{ML,G} to estimate the soil heat flux.
- STIC_PARAMS.mat
- Output from "A__DEFINE_STIC_parameters.m"
- TOWER_PARAMS.mat
- Output from "A__DEFINE_TOWER_parameters.m"
CODE/DATA_OUTPUT/SHAP_DATA/
- Stores intermediary data products for SHAP analysis.
- .csv outputs are stored from "C__OUTPUT_SHAP_INPUTS.csv" and "D__SHAP_Processing.pynb".
- Additionally stores the compiled output from "E__GATHER_SHAP_importance.m" as "SHAP_output.m"
CODE/DATA_OUTPUT/STIC_OUT_DATA/
- Contains the per-crop data files output by "B__MAIN_STIC.m". Each .mat file contains a struct that includes the pre- and post-processed results for each of the STIC formulations.
CODE/HELPER_FUNCTIONS/
- A repository of ancillary functions developed during the project.
CODE/FIGURES_CODE/
- Figure [2-9, S1-S5]
- Folders for reproducing each of the individual figures. Each folder contains one ".m" file, which will produce the figure files "STIC_FigX.jpg" and "STIC_FigX.fig" in the local folder.
- PRINT_SHAP_STATS.m
- Tabulates the SHAP explained variance values, which are depicted in Figure 7. Results are output to the console.
- PRODUCE_STIC_TABLES.m
- Tabulates the RMSE and R2 performance of the different STIC formulations, as reported in Table 1.
CODE/
- Main Code Files
- A__DEFINE_STIC_parameters.m
- Defines unit conventions and naming conventions used in preprocessing of "B__MAIN_STIC.m".
- A__DEFINE_TOWER_parameters.m
- Defines tower date scheme and planting rotation used in preprocessing of "B__MAIN_STIC.m".
- B__MAIN_STIC.m
- The main code file for defining and running the different STIC formulations in use. Per-model (STIC Variations) parameters are assigned to the structures ["gen_params","soil_model_params","net_model_params","obs_model_params"] for STIC_{BP}, STIC_{ML,G}, STIC_{ML,Rn}, and STIC_{Ref} respectively.
- Additional pre- and post-processing parameters can be set here in ["processing_parameters", "post_parameters"]
- data_folder = './../DATA/';
- output_folder = './DATA_OUTPUT/STIC_OUT_DATA/';
- Data Pre-Processing Options (processing_parameters.xxxx) %below metrics are all less restrictive than their post-processing counterparts
- units_file = input_vars_file; % file which specifies mapping of input variable fieldnames to STIC expected fieldnames
- stic_params = stic_params_file; % file which specifies STIC expected units and ranges
- valid_year_range = [2021,2026];
- valid_hour_range = [-1,25]; %model estimates become unreliable when the sun is down
- valid_doy_range = [130,290]; %data outside of the growth season can skew metrics
- valid_ndvi_range = [0.3,0.3]; %start of season NDVI cutoff, end of season NDVI cutoff
- ndvi_variable = 'daytime_ndvi'; %target fieldname for ndvi reference value
- Data Post-Processing Options (post_parameters.xxxx)
- valid_year_range = [2021,2022]; %data range to consider
- valid_hour_range = [-1,25]; %[min,max] hour of day data to include (0-24)
- valid_doy_range = [160,260]; %[min,max] day of year data to include (0-365)
- valid_ndvi_range = [0.7,0.7]; %[start of season NDVI cutoff, end of season NDVI cutoff] ndvi minimum value threshold
- ndvi_variable = 'daytime_ndvi'; %target fieldname for ndvi reference value
- Data is read in from "DATA/" and exported to "DATA_OUTPUT/STIC_OUT_DATA".
- C__OUTPUT_SHAP_INPUTS.csv
- Features for SHAP analysis are specified here.
- Data is read from "DATA_OUTPUT/STIC_OUT_DATA", selected, and output to "DATA_OUTPUT/SHAP_DATA" as .csv files.
- D__SHAP_Processing.pynb
- Jupyter Notebook for calculating Cohort SHAP attributions from the prior .csv files. Imports the Python module from "SOFTWARE/COHORT_SHAP".
- Additional parameters for the number of bins used for the Variance SHAP calculation, as well as the similarity cutoff and ratio used for Cohort SHAP, can be set here.
- infolder = "./DATA_OUTPUT/SHAP_DATA/"
- infiles = ["Corn.csv","Miscanthus.csv","Sorghum.csv","Soybean.csv"] # Input files for
- variance_bin_count = 10 # Number of bins to use for global variance shap importance calculation
- cohort_similarity_prctile_cutoff = 2.5 # Normalizes to the prctile range [2.5,97.5] to reduce outlier sensitivity
- cohort_similarity_ratio_threshold = 0.2 # Maximum allowed ratio of "sample distance"/"population distance" to be considered "similar"
- cpu_parallel_cores = 64 # Number of cpu cores to use when computing Cohort SHAP
- Generates additional files in the "DATA_OUTPUT/SHAP_DATA" directory.
- E__GATHER_SHAP_importance.m
- Compiles the files in "DATA_OUTPUT/SHAP_DATA" for ease of access and use in figure generation.
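To illustrate how the post-processing windows listed for "B__MAIN_STIC.m" restrict the data, here is a minimal Python sketch. It is not the authors' MATLAB implementation: the function name `passes_post_filter` is hypothetical, the bounds are assumed to be inclusive, and the NDVI cutoff is left out because its exact semantics are not specified in this README.

```python
from datetime import datetime

# Values copied from the post_parameters listing above.
# Assumption: both ends of each [min, max] range are inclusive.
POST = {
    "valid_year_range": (2021, 2022),
    "valid_hour_range": (-1, 25),
    "valid_doy_range": (160, 260),
}


def passes_post_filter(timestamp, params=POST):
    """Return True if a timestamp lies inside the year, hour, and day-of-year windows."""
    hour = timestamp.hour + timestamp.minute / 60.0
    doy = timestamp.timetuple().tm_yday
    checks = (
        (timestamp.year, params["valid_year_range"]),
        (hour, params["valid_hour_range"]),
        (doy, params["valid_doy_range"]),
    )
    return all(lo <= value <= hi for value, (lo, hi) in checks)


# Illustrative timestamps only, not taken from the dataset.
stamps = [
    datetime(2021, 7, 15, 12, 30),  # inside all windows
    datetime(2021, 5, 1, 12, 0),    # day of year 121, outside [160, 260]
    datetime(2023, 7, 15, 12, 0),   # year outside [2021, 2022]
]
kept = [t for t in stamps if passes_post_filter(t)]
```

With the default hour range of [-1, 25], no time of day is excluded; narrowing that range would restrict the analysis to daytime records.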
Data
├─ DATA.zip
│ └── Corn_Tower.mat
│ └── Miscanthus_Tower.mat
│ └── Sorghum_Tower.mat
│ └── Soybean_Tower.mat
│ └── PrecipData_2019_2024.mat
│ └── README.txt
DATA/
- Stores eddy covariance, meteorological, and remote sensing data for each of the four towers: corn, miscanthus, sorghum, and soybean.
- README.txt: Includes additional details on the data contained in each file.
- Variables (in *_Tower.mat)
- DateTime — Timestamp of each measurement; datetime
- H_mdsgf — Sensible heat flux; W m⁻²
- LE_mdsgf — Latent heat flux; W m⁻²
- NETRAD_1_1_1 — Net radiation; W m⁻²
- SHF_average — Soil heat flux (averaged); W m⁻²
- PPT — Precipitation; mm
- SW_IN_1_1_1 — Incoming shortwave radiation; W m⁻²
- SW_OUT_1_1_1 — Outgoing shortwave radiation; W m⁻²
- LW_IN_1_1_1 — Incoming longwave radiation; W m⁻²
- LW_OUT_1_1_1 — Outgoing longwave radiation; W m⁻²
- TA_average_filled — Air temperature (gap‑filled); °C
- RH_average_filled — Relative humidity (gap‑filled); %
- PA_merge_filled — Air pressure (merged & filled); Pa
- WS_filled — Wind speed (gap‑filled); m s⁻¹
- IRT_SH1 — Infrared surface temperature measurement; °C
- IRT_431 — Infrared surface temperature measurement; °C
- NDVI — Normalized difference vegetation index, midday averages interpolated across season; unitless
- PRI_filled — Photochemical reflectance index, midday averages interpolated across season; unitless
- VPD_merge_filled — Near-surface vapor pressure deficit; Pa
- Variables (in PrecipData_2019_2024.mat)
- DateTime — Timestamp of each measurement; datetime
- decdoy — Decimal day of year
- decyear — Decimal year
- PPT — Half-hourly precipitation from Iowa Mesonet system; mm
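The radiation and flux variables listed above are related through standard energy balance identities. The Python sketch below shows those relations using the field names from this README; the helper function names are hypothetical, the numbers are illustrative and not taken from the dataset, and the sign conventions are the usual ones (assumed here, so check README.txt in DATA/ before relying on them).

```python
def net_radiation(sw_in, sw_out, lw_in, lw_out):
    """Net radiation (W m^-2) from the four radiation components."""
    return (sw_in - sw_out) + (lw_in - lw_out)


def available_energy(netrad, shf):
    """Net energy (W m^-2): net radiation minus soil heat flux."""
    return netrad - shf


def closure_residual(netrad, shf, h, le):
    """Energy balance residual (W m^-2): available energy minus (H + LE)."""
    return available_energy(netrad, shf) - (h + le)


# Illustrative half-hourly record keyed by the variable names in *_Tower.mat.
record = {
    "SW_IN_1_1_1": 800.0, "SW_OUT_1_1_1": 160.0,
    "LW_IN_1_1_1": 400.0, "LW_OUT_1_1_1": 460.0,
    "SHF_average": 60.0, "H_mdsgf": 120.0, "LE_mdsgf": 350.0,
}

rn = net_radiation(record["SW_IN_1_1_1"], record["SW_OUT_1_1_1"],
                   record["LW_IN_1_1_1"], record["LW_OUT_1_1_1"])
ae = available_energy(rn, record["SHF_average"])
residual = closure_residual(rn, record["SHF_average"],
                            record["H_mdsgf"], record["LE_mdsgf"])
```

Comparing a computed net radiation against the measured NETRAD_1_1_1 field is one simple consistency check on the radiation data.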
PAPER_FIGURES
│ PAPER_FIGURES.zip
│ └── Figure [1-9]
│ └── Figure [S1-S5]
PAPER_FIGURES/
- Figure [1-9, S1-S5]
- Each folder contains the figure files "STIC_FigX.jpg" and "STIC_FigX.fig" as referenced in the publication.
SOFTWARE
│ SOFTWARE.zip
├─ COHORT_SHAP
│ └── cohort_shapley-main
SOFTWARE/COHORT_SHAP/
- Contains v0.1.0 of the Cohort SHAP module, released on March 10th, 2021. The code was developed by Masayoshi Mase and is included here under the MIT License for reproducibility.
- https://github.com/cohortshapley/cohortshapley?tab=readme-ov-file
- Mase, M., Owen, A. B., & Seiler, B. (2019). Explaining black box decisions by Shapley cohort refinement. arXiv preprint arXiv:1911.00467.
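The two Cohort SHAP similarity settings in "D__SHAP_Processing.pynb" (cohort_similarity_prctile_cutoff and cohort_similarity_ratio_threshold) can be read as a per-feature similarity test. The Python sketch below is one interpretation of the parameter comments, not the code of the Cohort SHAP module: the function names `percentile` and `is_similar` are hypothetical, and the linear-interpolation percentile is an assumption.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in 0-100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def is_similar(x_i, x_j, population, cutoff=2.5, ratio_threshold=0.2):
    """Treat two values of one feature as similar if their distance is a
    small fraction of the population's [cutoff, 100 - cutoff] percentile range."""
    low = percentile(population, cutoff)
    high = percentile(population, 100.0 - cutoff)
    span = high - low  # "population distance", robust to outliers
    if span == 0:
        return x_i == x_j
    return abs(x_i - x_j) / span <= ratio_threshold


# Illustrative population only: the integers 0..100.
population = list(range(101))
close_pair = is_similar(10, 20, population)  # 10 / 95 is below 0.2
far_pair = is_similar(10, 40, population)    # 30 / 95 is above 0.2
```

Under this reading, raising the ratio threshold enlarges each cohort, while raising the percentile cutoff shrinks the reference range and so makes the test stricter.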
