Data from: Global soil pollution by toxic metals threatens agriculture and human health

Data files

Apr 17, 2025 version files (122.70 MB total)

Dryad_attachments-Mapping_global_soil_pollution.zip (122.69 MB)
README.md (6.02 KB)

Abstract

This dataset includes the attachments related to the research conducted by Hou et al., titled "Global soil pollution by toxic metals threatens agriculture and human health". Our study compiled a comprehensive global dataset on soil contamination by arsenic, cadmium, cobalt, chromium, copper, nickel and lead, sourced from investigations with sampling locations across various climate zones, geological formations, and land use types. Machine learning methods were then used to identify regions where agricultural and human health thresholds are exceeded. This research underscores the critical impact of soil pollution on global food security and the need to align soil management with the Sustainable Development Goals.

https://doi.org/10.5061/dryad.83bk3jb2z

Description of the data and file structure

File: Attachment_1_Soil_pollution_data_sources

Description: This table compiles literature selected for the global soil pollution dataset. It presents key details, including the first author's last name, publication year, region, study area coordinates, and the title of the literature in separate columns. Detailed reference information for these publications is located at the end of the table.

File: Attachment_2_Source_of_co-variate_and_variable_retention_for_modeling

Description: This table provides basic information on the variables used in modeling. The first column lists all the variables used in the model. Columns 2-8 show the variables remaining after feature selection under Human Health and Ecological Thresholds for each toxic metal. Columns 9-15 show the variables remaining after feature selection under Agricultural Thresholds for each toxic metal. A selected variable is marked with a Y; otherwise the cell is blank. Columns 16-22 present detailed information about the covariate datasets, including the spatial resolution, temporal coverage, unit of measurement, data format, database version, data source and download link for each variable. In these columns, empty cells and the symbol “\” both indicate that the corresponding dataset does not contain the specified information, i.e. missing data.
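The Y/blank marking described above can be read programmatically. The following is a minimal sketch, assuming the table has been exported to CSV; the column names (`variable`, `As_HHET`, `Cd_HHET`) and the sample rows are hypothetical and are not taken from the actual attachment.

```python
import csv
import io

# Hypothetical excerpt: the real attachment's headers, rows and file
# format may differ.
SAMPLE = """variable,As_HHET,Cd_HHET
Elevation,Y,
Precipitation,,Y
Clay content,Y,Y
"""


def retained_variables(text, column):
    """Return the variables marked 'Y' in the given column."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        row["variable"]
        for row in reader
        if (row[column] or "").strip().upper() == "Y"
    ]


as_vars = retained_variables(SAMPLE, "As_HHET")
cd_vars = retained_variables(SAMPLE, "Cd_HHET")
print(as_vars)
print(cd_vars)
```

Blank cells are treated as "not selected", which matches the convention stated for columns 2-15.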

File: Attachment_3_Feature_importance_and_correlation_with_toxic_metals

Description: This table provides the importance of variables calculated using Shapley Additive Explanations (SHAP) and Mean Decrease in Node Impurity (MDI), along with the variables' correlation with toxic metals. It covers only the variables remaining in the models after feature selection; an empty cell indicates that the corresponding variable was not retained by feature selection. Columns 1-2 list the variable names and the abbreviations used in modeling. Columns 3-9 and Columns 10-16 show the mean absolute SHAP value in the models for each toxic metal under Human Health and Ecological Thresholds (HHET) and Agricultural Thresholds (AT), respectively. Higher values indicate greater variable importance in the model. Columns 17-23 and Columns 24-30 give the Pearson correlation coefficients between the selected variables and the concentrations of toxic metals across all land use types and in agricultural lands, respectively. Columns 31-37 and Columns 38-44 give the variable importance calculated by MDI under HHET and AT, respectively. Larger values indicate greater variable importance.
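For readers unfamiliar with the two summary statistics in this table, the sketch below shows how a mean absolute SHAP value and a Pearson correlation coefficient are computed from raw numbers. It is an illustration only: the input values are hypothetical, the helper names are ours, and producing the SHAP values themselves requires a fitted model, which is not shown here.

```python
from math import sqrt


def mean_abs_by_column(matrix):
    """Mean of absolute values down each column (one column per variable)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [sum(abs(row[j]) for row in matrix) / n_rows for j in range(n_cols)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical SHAP values: rows are samples, columns are two variables.
shap_values = [[0.2, -0.1], [-0.4, 0.1], [0.3, -0.1]]
importance = mean_abs_by_column(shap_values)

# Hypothetical covariate and metal concentration values.
covariate = [1.0, 2.0, 3.0, 4.0]
concentration = [2.0, 4.0, 6.0, 8.0]
r = pearson(covariate, concentration)

print(importance, r)
```

Because the absolute value is taken before averaging, a variable that pushes predictions strongly in both directions still receives a high importance.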

File: Attachment_4_Global_dataset_of_predicted_toxic_metals_exceedance_under_HHET

Description: This table presents the probability of exceedance for each toxic metal under Human Health and Ecological Thresholds for each grid cell. Columns 1-2 provide the coordinates of the grid cell, and the probabilities of exceedance for the toxic metals are listed in the following columns.

File: Attachment_5_Global_dataset_of_predicted_toxic_metals_exceedance_under_AT

Description: This table presents the probability of exceedance for each toxic metal under Agricultural Thresholds for each grid cell. Columns 1-2 provide the coordinates of the grid cell, and the probabilities of exceedance for the toxic metals are listed in the following columns.
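Attachments 4 and 5 share the same layout (two coordinate columns followed by one probability column per metal), so they can be filtered in the same way. The sketch below is a hedged example that assumes a CSV export: the column names (`lon`, `lat`, `As`, `Cd`), the sample rows, and the 0.5 cutoff are all hypothetical choices for illustration, not values taken from the attachments.

```python
import csv
import io

# Hypothetical excerpt: real headers, coordinates and values may differ.
SAMPLE = """lon,lat,As,Cd
10.05,45.05,0.82,0.10
10.15,45.05,0.35,0.61
10.25,45.05,0.05,0.02
"""


def cells_above(text, metal, cutoff=0.5):
    """Return (lon, lat) pairs whose exceedance probability is above cutoff."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (float(row["lon"]), float(row["lat"]))
        for row in reader
        if float(row[metal]) > cutoff
    ]


as_cells = cells_above(SAMPLE, "As")
cd_cells = cells_above(SAMPLE, "Cd")
print(as_cells)
print(cd_cells)
```

For the full tables, which hold on the order of a million rows, streaming the file row by row as done here keeps memory use low.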

File: Attachment_6_Distribution_of_sample_location_for_As

Description: This figure presents the distribution of sample locations for As. The color of each point represents the number of samples at that location. Point data (e.g. data from LUCAS) were aggregated by administrative division.

File: Attachment_7_Distribution_of_sample_location_for_Cd

Description: This figure presents the distribution of sample locations for Cd. The color of each point represents the number of samples at that location. Point data (e.g. data from LUCAS) were aggregated by administrative division.

File: Attachment_8_Distribution_of_sample_location_for_Co

Description: This figure presents the distribution of sample locations for Co. The color of each point represents the number of samples at that location. Point data (e.g. data from LUCAS) were aggregated by administrative division.

File: Attachment_9_Distribution_of_sample_location_for_Cr

Description: This figure presents the distribution of sample locations for Cr. The color of each point represents the number of samples at that location. Point data (e.g. data from LUCAS) were aggregated by administrative division.

File: Attachment_10_Distribution_of_sample_location_for_Cu

Description: This figure presents the distribution of sample locations for Cu. The color of each point represents the number of samples at that location. Point data (e.g. data from LUCAS) were aggregated by administrative division.

File: Attachment_11_Distribution_of_sample_location_for_Ni

Description: This figure presents the distribution of sample locations for Ni. The color of each point represents the number of samples at that location. Point data (e.g. data from LUCAS) were aggregated by administrative division.

File: Attachment_12_Distribution_of_sample_location_for_Pb

Description: This figure presents the distribution of sample locations for Pb. The color of each point represents the number of samples at that location. Point data (e.g. data from LUCAS) were aggregated by administrative division.

File: Attachment_13_Code_for_models_development_and_data_analysis

Description: This file contains the code used in this study for model development and data analysis. The code is written primarily in Python, and the packages used are listed.

In this study, we synthesized a global soil concentration dataset for seven toxic metals: arsenic (As), cadmium (Cd), cobalt (Co), chromium (Cr), copper (Cu), nickel (Ni), and lead (Pb). The compilation draws on data from 1,493 regional studies, encompassing 796,084 sampling points across a wide range of climate zones, geological settings, and land use types. A series of covariates associated with soil contamination, including geological variables, climate variables, soil texture and basic physico-chemical properties, topography, and socioeconomic variables, were then used to construct predictive models for the distribution of toxic metal exceedance. All variables were resampled and reprojected to match the 10 km resolution grid of the toxic metal distribution.

The dataset was randomly divided into training and test sets in an 8:2 ratio. To reduce overfitting and multicollinearity and to improve model performance and interpretability, feature selection was conducted by combining recursive feature elimination with Pearson correlation analysis. After a preliminary comparison of ten machine learning algorithms, extremely randomized trees (ERT) emerged as the top-performing model and was selected for further refinement. Grid search was used to optimize the key ERT hyperparameters: the number of trees, the maximum tree depth, the minimum number of samples required at a leaf node, the minimum number of samples required to split an internal node, and the function used to measure the quality of a split. The optimized model was then evaluated with a range of metrics, including balanced accuracy (BA), sensitivity, specificity, F1 score, average precision (AP), the area under the receiver operating characteristic curve (AUC), and Cohen's kappa coefficient (KIA). Finally, the optimized models were used to predict pollution probabilities for 2,000,000 pixels, enabling the generation of global pollution probability maps for each toxic metal. Areas covered by desert and permafrost were excluded, leaving a final dataset of 1,290,000 pixels.
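Several of the evaluation metrics named above follow directly from the confusion matrix of a binary exceedance classifier. The sketch below computes sensitivity, specificity, F1 score, balanced accuracy and Cohen's kappa in plain Python. It is not the code in Attachment 13: the labels are hypothetical, the function names are ours, and AP and AUC are omitted because they require predicted probabilities rather than hard labels. The sketch also assumes both classes occur in the true and predicted labels, so no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Return a dict of metrics derived from the confusion matrix."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)  # recall on the exceedance class
    specificity = tn / (tn + fp)  # recall on the non-exceedance class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    observed = (tp + tn) / n  # observed agreement
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (observed - expected) / (1 - expected),
    }


# Hypothetical test-set labels: 1 = threshold exceeded, 0 = not exceeded.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy and kappa are informative here because exceedance is typically the minority class, so plain accuracy can look high even for a model that rarely predicts exceedance.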