Data from: High-resolution soil total phosphorus mapping for the conterminous USA using machine learning
Data files
Jan 16, 2026 version files 786.61 MB
- ds01_modeling_data.xlsx (2.38 MB)
- final_soil_tp_scripts.Rmd (38.13 KB)
- predictor_grid.rds (784.19 MB)
- README.md (4.14 KB)
Abstract
Accurate estimates of soil total phosphorus (TP) concentrations are essential for sustainable nutrient management, food security, and water quality protection. This study predicts and maps the spatial distribution of TP in the top 5 cm and C horizon of soils across the conterminous USA (CONUS) using data from the Geochemical and Mineralogical Data for Soils of the Conterminous United States. We compare the performances of random forest (RF) and inverse distance weighting (IDW) to model and generate soil TP predictions. The RF incorporates 19 predictor variables, including spatial coordinates, climate, soil properties, and topography, while IDW relies solely on coordinates and interpolates between soil TP observations. Models are evaluated using five-fold cross-validation. The RF models outperform the IDW models and explain 52% (RMSE = 0.22 log10 mg kg⁻¹) and 56% (RMSE = 0.26 log10 mg kg⁻¹) of the variance in soil TP for the top 5 cm and C horizon, respectively. As expected, both model types identify higher TP concentrations in the top 5 cm than in the C horizon, particularly in agricultural regions, reflecting anthropogenic influences. Furthermore, the RF-generated maps show more realistic spatial patterns that capture the heterogeneity of the CONUS and avoid the bullseye patterns often characteristic of IDW-generated maps. Additional insights from the RF models show that coordinates, soil texture, pH, and climate are top predictors of soil TP. Increased availability of variables, such as iron and aluminum, that can bind with phosphorus in soils, could improve RF model performance.
Overview
This dataset accompanies the manuscript titled: High-resolution Modeling of Soil Total Phosphorus Across the Conterminous United States
It includes processed datasets, scripts, and results related to the spatial prediction of soil total phosphorus (TP) across the CONUS at two depths (top 5 cm and C horizon) using Random Forest (RF) and Inverse Distance Weighting (IDW) models.
Dataset Description
File: ds01_modeling_data.xlsx
This Excel file contains two sheets:
- Sheet 1: Soil TP and covariates for the top 5 cm depth
- Sheet 2: Soil TP and covariates for the C horizon
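A minimal sketch of how the two sheets might be loaded, assuming the workbook sits in the working directory and that the `readxl` package (listed among the script's packages) is installed. The helper name `read_tp_sheets` and the list element names are hypothetical, not taken from the original script.

```r
# Hypothetical helper: read both sheets of the modeling workbook.
# Returns NULL if the file or the readxl package is unavailable.
read_tp_sheets <- function(path = "ds01_modeling_data.xlsx") {
  if (!file.exists(path) || !requireNamespace("readxl", quietly = TRUE)) {
    return(NULL)
  }
  list(
    top5      = readxl::read_excel(path, sheet = 1),  # top 5 cm depth
    c_horizon = readxl::read_excel(path, sheet = 2)   # C horizon
  )
}

tp_data <- read_tp_sheets()
if (!is.null(tp_data)) {
  # Inspect the dimensions of each sheet.
  print(lapply(tp_data, dim))
}
```

Sheets are addressed by position here because the sheet names are not stated in this description.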
Columns (both sheets)
| Column Name | Description |
|---|---|
| Top5_P | Total phosphorus (top 5 cm) in mg kg⁻¹ |
| C_P | Total phosphorus (C horizon) in mg kg⁻¹ |
| easting / northing | Coordinates (EPSG:5070) |
| temp / precip | Climate variables |
| elevation / slope | Topographical attributes |
| sand, silt, clay | Soil texture (%) |
| pH, ec, caco3, som max, ksat, bulk den, soil depth, water cap | Soil properties |
| lu ag, lu nat, lu other, lu urban | Land use categories |
| so int, so slt, so str | Soil order categories |
| log TP | Log10-transformed soil TP |
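Because the log TP column is the base-10 logarithm of the TP concentration, errors reported in log10 units act as multiplicative factors on the original mg kg⁻¹ scale. The base-R sketch below illustrates the transform and its inverse. The concentrations are made-up illustrative values, not records from the dataset, and all variable names are hypothetical.

```r
# Hypothetical TP concentrations in mg/kg (illustrative, not from the dataset).
tp_mg_kg <- c(150, 480, 1200)

# Forward transform, as described for the log TP column.
log_tp <- log10(tp_mg_kg)

# Back-transform a value on the log10 scale to mg/kg.
back_transform <- function(log_value) 10^log_value
tp_recovered <- back_transform(log_tp)

# An error of `rmse_log` log10 units corresponds to a multiplicative
# factor of 10^rmse_log on the original scale.
rmse_log <- c(top5 = 0.22, c_horizon = 0.26)  # RMSE values quoted in the abstract
error_factor <- 10^rmse_log
print(round(error_factor, 2))
```

Read this way, an RMSE of 0.22 log10 units corresponds to a typical error factor of roughly 1.66, and 0.26 to roughly 1.82.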
File: predictor_grid.rds
This R data file (RDS) contains the predictors, on an 800 m grid, used to map estimated soil total phosphorus concentrations across the CONUS.
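The grid can be inspected before any mapping is attempted. The sketch below assumes the file is in the working directory. The helper name `load_predictor_grid` is hypothetical, and since the structure of the stored object is not documented here, the code only reports its class and size.

```r
# Hypothetical helper: load the predictor grid if the file is present.
load_predictor_grid <- function(path = "predictor_grid.rds") {
  if (!file.exists(path)) {
    return(NULL)
  }
  readRDS(path)
}

predictor_grid <- load_predictor_grid()
if (!is.null(predictor_grid)) {
  # The file is large (about 784 MB on disk), so check the object
  # before passing it to any model.
  print(class(predictor_grid))
  print(object.size(predictor_grid), units = "Mb")
  str(predictor_grid, max.level = 1)
}
```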
R Script
File: final_soil_tp_scripts.Rmd
This is the primary R script (R version 4.3.2; RStudio version 2023.12.0.369) used for analysis and visualization. It includes:
- Reading and pre-processing of input data
- Five-fold cross-validation using Random Forest and Inverse Distance Weighting
- Hyperparameter tuning (`mtry` and `idp`)
- Creation of Partial Dependence Plots and Variable Importance Plots
- Residual analysis and spatial mapping
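The cross-validation and `idp` tuning steps listed above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: IDW is written out in base R instead of using a spatial package, the data are synthetic, and the names `soil`, `idw_predict`, `cv_rmse`, and the candidate `idp` values are all assumptions made for illustration.

```r
set.seed(42)

# Synthetic stand-in for the modeling data (NOT the real dataset).
n <- 200
soil <- data.frame(
  easting  = runif(n, 0, 100),
  northing = runif(n, 0, 100)
)
soil$log_tp <- 2.5 + 0.005 * soil$easting + rnorm(n, sd = 0.2)

# Inverse distance weighting: predict each test point as a weighted mean
# of training values, with weights 1 / distance^idp.
idw_predict <- function(train, test, idp) {
  vapply(seq_len(nrow(test)), function(i) {
    d <- sqrt((train$easting - test$easting[i])^2 +
              (train$northing - test$northing[i])^2)
    # A test point that coincides with training points takes their mean value.
    if (any(d == 0)) return(mean(train$log_tp[d == 0]))
    w <- 1 / d^idp
    sum(w * train$log_tp) / sum(w)
  }, numeric(1))
}

rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Randomly assign each row to one of five folds.
fold_id <- sample(rep(1:5, length.out = n))

# Mean RMSE across the five held-out folds for a given idp.
cv_rmse <- function(idp) {
  fold_err <- vapply(1:5, function(k) {
    train <- soil[fold_id != k, ]
    test  <- soil[fold_id == k, ]
    rmse(test$log_tp, idw_predict(train, test, idp))
  }, numeric(1))
  mean(fold_err)
}

# Candidate exponents (an arbitrary illustrative grid).
idp_grid <- c(0.5, 1, 2, 3, 4)
cv_results <- data.frame(
  idp  = idp_grid,
  rmse = vapply(idp_grid, cv_rmse, numeric(1))
)
best_idp <- cv_results$idp[which.min(cv_results$rmse)]
print(cv_results)
```

The same fold assignment could be reused when evaluating a Random Forest over candidate `mtry` values, so that both model types are scored on identical held-out subsets.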
Required R Packages
The R script uses the following packages:
- tidyverse, plyr, dplyr, broom, data.table, classInt, tidyr, ggplot2, ggpubr, patchwork, colorspace, gridExtra
- caret, caretEnsemble, themis, Metrics, pdp
- gstat, dismo, RStoolbox, data.table, scales
- sf, raster, terra, rnaturalearth, spData, sp, rnaturalearthdata, openxlsx, readxl, car, dismo, forcats, reporter, rnaturalearthhires
- doParallel (for parallel cross-validation)
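Before running the script, it may help to check which of these packages are missing from the local library. The sketch below uses only base R. The variable names are hypothetical, and the installation step is left as a comment so that nothing is installed without review.

```r
# Packages named in the list above (duplicates removed).
pkgs <- unique(c(
  "tidyverse", "plyr", "dplyr", "broom", "data.table", "classInt", "tidyr",
  "ggplot2", "ggpubr", "patchwork", "colorspace", "gridExtra",
  "caret", "caretEnsemble", "themis", "Metrics", "pdp",
  "gstat", "dismo", "RStoolbox", "scales",
  "sf", "raster", "terra", "rnaturalearth", "spData", "sp",
  "rnaturalearthdata", "openxlsx", "readxl", "car", "forcats",
  "reporter", "rnaturalearthhires", "doParallel"
))

# Compare against the packages installed in the local library.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)
  # Note: some packages (rnaturalearthhires, for example) may not be
  # available from CRAN and may need a different repository.
} else {
  message("All listed packages are installed.")
}
```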
Access and sharing information:
All files deposited in this repository are researcher-generated, processed, and derived data products. Public-domain U.S. Geological Survey soil phosphorus data (https://mrdata.usgs.gov/ds-801/) were quality-controlled and combined with extracted environmental predictors for analysis and modeling.
Related Data Sources:
- PRISM Climate Group. Oregon State University. https://prism.oregonstate.edu
- National Land Cover Database (NLCD) 2008 Land Cover Conterminous United States. https://doi.org/10.5066/P9KZCM54
- Soil properties. California Soil Resource Lab. https://casoilresource.lawr.ucdavis.edu/soil-properties/download.php
