FILE GUIDE FOR KASS ET AL. "The global distribution of known and undiscovered ant biodiversity"

-------
SUMMARY
-------
The directory "main_analysis_data", archived in the "main_analysis_data.zip" file, contains all the core data used in the analyses described in the paper, except those data described below that were too big to include in one compressed file. This includes the Global Ant Biodiversity Informatics (GABI) data, both raw and after cleaning and geocoding, all diversity estimate raster data shown in the figures of the paper, and other related data. Please consult "main_analysis_data/README_main_analysis_data.txt" for more details.

Additional results for species and genera, including individual range estimates from polygons and species distribution models and individual datasets used for modeling, are found in "results_species_add" and "results_genus_add", respectively, archived in "results_sp_gen_data_add.zip".

Intermediate occurrence data subsets for the geocoding analysis are found in "processing_data_add", also archived in "results_sp_gen_data_add.zip".

The fitted Random Forest (RF) models used to make predictions of unknown diversity centers under a global high-sampling scenario, with variable importance calculated, are archived in "random_forest_add.zip".

Please see more details below.

----------------------------------------
WHERE TO FIND GLOBAL DIVERSITY ESTIMATES
----------------------------------------
All global diversity estimates can be found in the main_analysis_data folder. There are two versions of each global raster: Geographic (degrees, in latitude/longitude) and Projected (meters, in the Eckert IV equal-area projection), both at 10 arcmin resolution (~20 km at the equator). These diversity estimates can be used for further analyses.

Estimated ant global species richness:
Geographic: /results_species/richness_predictions_irs60K/raw/ant_sp_richness_msk.tif
Projected: /results_species/richness_predictions_irs60K/proj/ant_sp_richness_msk.tif

Estimated ant global species range rarity:
Geographic: /results_species/richness_predictions_irs60K/raw/ant_sp_rarity_msk.tif
Projected: /results_species/richness_predictions_irs60K/proj/ant_sp_rarity_msk.tif

Random Forest extrapolated ant global species richness under high-sampling scenario:
Geographic: /random_forest/rf_predictions_richness/rf_rich_mt4_ext.tif
Projected: /random_forest/rf_predictions_richness/rf_rich_mt4_ext_proj.tif

Random Forest extrapolated ant global species range rarity under high-sampling scenario:
Geographic: /random_forest/rf_predictions_irs60K/rf_irslog_mt2_ext.tif
Projected: /random_forest/rf_predictions_irs60K/rf_irslog_mt2_ext_proj.tif

Additionally, diversity center shapefiles for all taxa (ants and terrestrial vertebrates) and an ArcGIS geodatabase with intersections with protected areas can be found in "results_species_add/centers_projected".

-----------------------------
WHERE TO FIND OCCURRENCE DATA
-----------------------------
The Global Ant Biodiversity Informatics (GABI) occurrence point data used in the analysis can be found at the paths below. These data were cleaned and geocoded according to the details specified in the paper, and can be used for further analyses.

Species: main_analysis_data/processing_data/3A_forAnalysis_processed_database_SPECIES.csv
Genus: main_analysis_data/processing_data/3A_forAnalysis_processed_database_GENUS.csv

The raw GABI data before cleaning and geocoding can be found at "main_analysis_data/processing_data/1_GABI_raw_all_records.csv". Please see "main_analysis_data/README_main_analysis_data.txt" for more details.
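
As a quick check after unzipping, the occurrence tables can be inspected without assuming anything about their columns. The sketch below is only an illustration and is not part of the archived code: the function name summarize_csv is hypothetical, and the UTF-8 encoding and comma delimiter are assumptions that may need adjusting for these files. It streams the file row by row, so it should also cope with large tables.

```python
import csv


def summarize_csv(path, encoding="utf-8"):
    """Return (column_names, n_records) for a comma-delimited file with a header row.

    The encoding default is an assumption; change it if the file fails to decode.
    """
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        header = next(reader, [])                    # first row holds the column names
        n_records = sum(1 for row in reader if row)  # skip completely blank lines
    return header, n_records


# Example call (path taken from the list above, relative to the unzipped archive):
# columns, n = summarize_csv(
#     "main_analysis_data/processing_data/3A_forAnalysis_processed_database_SPECIES.csv"
# )
# print(len(columns), "columns;", n, "records")
```

Printing the returned column names is a reasonable first step before loading the table into an analysis environment.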

-----------------------------
WHERE TO FIND RANGE ESTIMATES
-----------------------------
All ant range estimates used to make diversity estimates (richness and rarity) can be found in the "results_species_add" folder for species and "results_genus_add" folder for genera. These folders are archived in the "results_sp_gen_data_add.zip" file. These range estimates can be used for further analyses.

Individual species distribution model predictions for species with sufficient data (>=5 records):
results_species_add/sdm_species

Individual rasterized polygonal (alpha hull or buffer) range estimates for low-data species (<5 records):
results_species_add/poly_species

Individual input data for species distribution models (predictor variable rasters masked to background extents [i.e., polygonal range estimates] and background points):
results_species_add/model_inputs

Individual SDM predictions for genera (all had sufficient data of >=5 records):
results_genus_add/sdm_genus

Individual input data for genus distribution models (predictor variable rasters masked to background extents [i.e., polygonal range estimates] and background points):
results_genus_add/model_inputs

-------------------------------------------------------
Detailed descriptions of addendum folders and files
-------------------------------------------------------
The addendum folders with names ending in "_add" are meant to be added into the folder corresponding to the name before "_add". Example: The contents of "results_genus_add" are meant to be added to "results_genus", found inside "main_analysis_data". See NOTES at the bottom of this README for more details. The data are organized this way to avoid archiving folders with very large combined file sizes.

Archived in "results_sp_gen_data_add.zip":

processing_data_add/for_geocoding:
This directory contains intermediate data subsets necessary to reproduce the geocoding workflow. Details about the files are in "README_for_geocoding.txt".

results_species_add/centers_projected:
This directory contains the projected (Eckert IV) diversity center shapefiles for ants and vertebrates, as well as a geodatabase containing each diversity center intersected with the WDPA protected areas shapefile (www.protectedplanet.net). The intersection operation was done in ArcGIS Pro with the Pairwise Intersect tool and saved to an ESRI geodatabase because it is a heavy data operation. This .gdb file is read by R in the Kass_et_al_2022_SciAdv_prog_code/analysis/figures.R script.

results_species_add/model_inputs:
This directory contains the background point (.csv) and study extent raster (.tif) data for all species modeled in the analysis; each .tif file includes all predictor variable rasters, which is why it is much bigger in file size than the "sdm_species" directory described below.

results_species_add/poly_species:
This directory contains the rasterized polygon range estimates for low-data species (<5 occurrences). These were combined with the species distribution model predictions to estimate global species richness and rarity. In our analysis, these polygon rasters were generated and incorporated into diversity estimates on-the-fly without saving to file--this directory simply serves as a repository for these files for others to use in separate studies.

results_species_add/sdm_species:
This directory contains the species distribution model predictions (.tif) for all species modeled in the analysis.

results_genus_add/model_inputs:
This directory contains the background point (.csv) and study extent raster (.tif) data for all genera modeled in the analysis; each .tif file includes all predictor variable rasters, which is why it is much bigger in file size than the sdm_genus prediction directory described below.

results_genus_add/sdm_genus:
This directory contains the species distribution model predictions (.tif) for all genera modeled in the analysis. No genera had <5 occurrence points, so all were modeled.

Archived in "random_forest_add.zip":

random_forest_add/rf_mods:
This directory contains the fitted Random Forest models with variable importance values calculated for the complexity settings selected via spatial cross-validation. The prefix "mtX" refers to a value X of mtry used when running RF. These values ranged from 1 to 10, and for each model the value of mtry with the lowest mean squared error (MSE) was selected as optimal. These models are saved as very large .rds files and are thus included separately.

---------
ON ZENODO
---------

--Software

Archived in Kass_et_al_2022_SciAdv_prog_code.zip:
A simple package that contains all the R and Python scripts used to conduct the analysis and generate the figures in the paper. This folder has its own separate README with more details. The easiest way to run code is to open the .Rproj file in RStudio and press the "Install and Restart" button under the "Build" tab in the Environment frame -- this installs and loads the package and thus makes all functions available in the programming environment. The main analysis script is located in analysis/main_analysis.R.

--Supplemental Information

Archived in out_irs60K.zip:
Example figure plots made by Kass_et_al_2022_SciAdv_prog_code/analysis/figures.R. The figures displayed in the paper were made with ArcGIS using the same underlying data.

ODMAP_model_metadata.csv:
Metadata for Maxent species distribution models and Random Forest models structured according to the ODMAP (Overview, Data, Model, Assessment and Prediction) framework formalized by Zurell et al. (2020) [https://doi.org/10.1111/ecog.04960]. This metadata was created using the Shiny app located at https://odmap.wsl.ch/ and was edited lightly by hand to include some extra detail.

-----
NOTES
-----
The "addendum" folders (ending in "_add") need to be put into the following directories for some analyses (*) and plotting functions (^) found in Kass_et_al_2022_SciAdv_prog_code/analysis/figures.R to run properly.
In the code, polygon range estimates are rasterized for low-data species on-the-fly, so the poly_species folder is not necessary for analysis, but was included for archival purposes.

* results_species_add/model_inputs -> main_analysis_data/results_species/model_inputs
* results_species_add/sdm_species -> main_analysis_data/results_species/sdm_species
^ results_species_add/centers_projected -> main_analysis_data/results_species/centers_projected
* results_genus_add/model_inputs -> main_analysis_data/results_genus/model_inputs
* results_genus_add/sdm_genus -> main_analysis_data/results_genus/sdm_genus
* processing_data_add/for_geocoding -> main_analysis_data/processing_data/for_geocoding
^ random_forest_add/rf_mods -> main_analysis_data/random_forest/rf_mods
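
The folder placement listed above can also be scripted. The following is a minimal sketch, not part of the archived code: the names ADDENDUM_MAP and place_addenda are hypothetical, and it assumes all zip files were extracted side by side into one root directory. It copies rather than moves, so the large folders temporarily need twice the disk space; the "_add" sources can be deleted afterwards. It requires Python 3.8 or later for the dirs_exist_ok argument.

```python
import shutil
from pathlib import Path

# Source -> destination pairs copied from the NOTES list above
# (paths are relative to the directory holding the unzipped archives).
ADDENDUM_MAP = {
    "results_species_add/model_inputs": "main_analysis_data/results_species/model_inputs",
    "results_species_add/sdm_species": "main_analysis_data/results_species/sdm_species",
    "results_species_add/centers_projected": "main_analysis_data/results_species/centers_projected",
    "results_genus_add/model_inputs": "main_analysis_data/results_genus/model_inputs",
    "results_genus_add/sdm_genus": "main_analysis_data/results_genus/sdm_genus",
    "processing_data_add/for_geocoding": "main_analysis_data/processing_data/for_geocoding",
    "random_forest_add/rf_mods": "main_analysis_data/random_forest/rf_mods",
}


def place_addenda(root, mapping=ADDENDUM_MAP, dry_run=True):
    """Copy each addendum folder into its destination under `root`.

    Returns a list of (status, source, destination) tuples. With dry_run=True
    (the default) nothing is written; the list only reports what would happen.
    """
    root = Path(root)
    actions = []
    for src_rel, dst_rel in mapping.items():
        src, dst = root / src_rel, root / dst_rel
        if not src.is_dir():
            # The archive containing this folder may not have been unzipped yet.
            actions.append(("missing", src_rel, dst_rel))
            continue
        if dry_run:
            actions.append(("would copy", src_rel, dst_rel))
            continue
        # dirs_exist_ok=True merges into a destination folder that already exists.
        shutil.copytree(src, dst, dirs_exist_ok=True)
        actions.append(("copied", src_rel, dst_rel))
    return actions


# Example: report first, then copy for real.
# for action in place_addenda("."):
#     print(action)
# place_addenda(".", dry_run=False)
```

Running the dry run first makes it easy to spot an archive that has not been extracted yet, since its folders are reported as "missing" instead of raising an error.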