EcoCleanR: Enhancing data quality of biogeographic ranges with application for marine invertebrates
Data files
Jan 22, 2026 version: 6 files, 726.47 KB
- ecodata_cleaned.csv (205.52 KB)
- ecodata_corrected.csv (16.22 KB)
- ecodata_with_outliers.csv (208.66 KB)
- ecodata_x.csv (73.08 KB)
- ecodata.csv (211.58 KB)
- README.md (11.42 KB)
Abstract
Published distribution data, while invaluable for understanding species' biogeography, often suffer from limitations such as dated and static representations of ranges, a bias toward latitudinal information, and a lack of resolution in sampling frequency and variation in abundance throughout a species’ distribution. Extensive open-source biodiversity data now allow us to construct biogeographic ranges with more modern observations, which can be useful in conservation, evolution, and ecological studies. However, data quality remains a persistent challenge, hampering data reliability and usability. We introduce EcoCleanR, an R package that integrates existing tools with new functionalities to address data integration and quality assessment through a systematic, step-by-step approach for marine occurrence data. This package enhances the process of identifying and resolving common issues in biodiversity data, including taxonomy and georeferencing errors. It provides: (a) example scripts to guide users, (b) functionalities to flag problematic occurrence records from multiple databases, and (c) outputs that include species-specific distribution ranges and their corresponding environmental conditions, to facilitate accurate biogeographic and ecological analyses.
Dataset DOI: 10.5061/dryad.fj6q57488
Description of the data and file structure
We have provided sample data files extracted in April 2025 to demonstrate a case study on the species Mexacanthina lugubris, a marine mollusc that inhabits rocky intertidal environments along the Pacific coast. The data sources used for this case study include GBIF, OBIS, iDigBio, and a local database (InvertEbase). The number of occurrence records may vary depending on when the data are extracted.
We created a merged dataset and removed all duplicate records using the EcoCleanR functions ec_db_merge() and ec_rm_duplicate(), resulting in an output file named ecodata.csv. All subsequent analyses can be performed using this file, with reproducible results as demonstrated in the package README.
Note: Cells containing NA indicate that no value was available for that field in the original source repository at the time of data retrieval. NA values therefore represent data not available, rather than data not applicable or removed during processing. No values were inferred or imputed to replace missing information.
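The merge and de-duplication step described above can be pictured with a minimal base R sketch. It does not call `ec_db_merge()` or `ec_rm_duplicate()`, whose arguments are not documented here; the toy tables, the `source` column, and the rule "keep the first record per `cleaned_catalog`" are assumptions made for illustration only.

```r
# Toy stand-ins for two per-source downloads (values are invented).
# `source` is a hypothetical helper column, not part of ecodata.csv.
gbif <- data.frame(
  cleaned_catalog  = c("1001", "1002"),
  decimalLatitude  = c(32.67, 31.86),
  decimalLongitude = c(-117.24, -116.63),
  source           = "GBIF",
  stringsAsFactors = FALSE
)
obis <- data.frame(
  cleaned_catalog  = c("1002", "1003"),
  decimalLatitude  = c(31.86, NA),   # NA = value not available at the source
  decimalLongitude = c(-116.63, NA),
  source           = "OBIS",
  stringsAsFactors = FALSE
)

# Stack the sources; rbind() matches data frame columns by name.
merged <- rbind(gbif, obis)

# Assumed rule: keep the first record seen for each catalog number.
ecodata_toy <- merged[!duplicated(merged$cleaned_catalog), ]

nrow(merged)       # records before de-duplication
nrow(ecodata_toy)  # unique records that remain
```

When the real file is loaded with `read.csv("ecodata.csv")`, cells containing the string NA are read as missing values by default, which matches the convention in the note above.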
1) ecodata.csv
This dataset contains ~1100 occurrence records for the species Mexacanthina lugubris. It was compiled by merging data from multiple online repositories (GBIF, OBIS, and iDigBio) and a local file (InvertEbase); this is step 1 of the workflow (vignette: data_merging). Duplicate records were then removed with the function ec_rm_duplicate() so that only unique occurrences are retained (2.1).
2) ecodata_corrected.csv
This data file was created with the GEOLocate tool, which assigns georeferences to records that have locality information. Only four columns of the GEOLocate output were kept. The georeference information in this file is merged back into the main data file, ecodata.
3) ecodata_x.csv
This file was created by extracting the unique combinations of coordinate values, which are used to retrieve environmental variables from Bio-ORACLE (in this example: temperature, pH, salinity, and dissolved oxygen). The extracted values are then merged back into the main data table, ecodata.
4) ecodata_with_outliers.csv
This data file was created by running the ec_flag_outlier function. Each record carries an outlier probability based on spatial and non-spatial attributes, with a threshold of 0.99 applied to the distance matrix for both attribute types.
5) ecodata_cleaned.csv
This file contains the final cleaned occurrence records after all cleaning steps, including removal of the 5% of data points with the highest outlier probability. These records are considered the final values for describing the species' distribution range in terms of spatial and environmental variables.
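The last step, dropping the records with the highest outlier probability and summarising the remaining range, can be sketched in base R. This is a plain percentile cut on invented values, not the method implemented in the package; it also assumes that the `outliers` column holds a probability between 0 and 1.

```r
# Toy table with 20 records (values are invented for illustration).
# Assumption: `outliers` stores an outlier probability in [0, 1].
toy <- data.frame(
  decimalLatitude = c(seq(28, 33.4, length.out = 19), 45),
  outliers        = seq(0.05, 1, length.out = 20)
)

# 95th percentile of the outlier probabilities.
cutoff <- quantile(toy$outliers, probs = 0.95, names = FALSE)

# Keep records at or below the cutoff, i.e. drop the top 5%.
kept <- toy[toy$outliers <= cutoff, ]

# Latitudinal extent of the retained records.
lat_range <- range(kept$decimalLatitude)

nrow(toy) - nrow(kept)  # number of records removed
lat_range               # southern and northern limits after trimming
```

In this toy case the single record at latitude 45 carries the highest probability and is the only one removed, so the summarised range is no longer stretched by it.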
Files and variables
File: ecodata.csv
Description:
Variables
- basisOfRecord:Type of record (e.g., preserved specimen, fossil)
- occurrenceStatus:Presence or absence of the organism
- institutionCode:Code of the institution that holds the record
- verbatimEventDate:Original recorded date of the event
- scientificName:Full scientific name of the organism
- individualCount:Number of individuals observed
- organismQuantity:Reported quantity of the organism
- abundance:Calculated or standardized abundance value
- decimalLatitude:Latitude in decimal degrees
- decimalLongitude:Longitude in decimal degrees
- coordinateUncertaintyInMeters:Uncertainty in coordinates (meters)
- locality:Named place where the occurrence was recorded
- verbatimLocality:Original text for locality description
- municipality:Municipality or town of the occurrence
- county:County where the record was observed
- stateProvince:State or province name
- country:Country name
- cleaned_catalog:Standardized catalog number for de-duplication
File: ecodata_corrected.csv
Description:
Variables
- cleaned_catalog:catalog number
- corrected_latitude:latitude values assigned by GEOLocate
- corrected_longitude:longitude values assigned by GEOLocate
- corrected_uncertainty:uncertainty values assigned by GEOLocate
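A minimal base R sketch of how these four columns could be merged back into the main table by `cleaned_catalog`. The toy values are invented, and the rule of filling only coordinates that are missing in the main table is an assumption; the package vignette describes the actual procedure.

```r
# Toy main table: two records lack coordinates (values are invented).
ecodata_toy <- data.frame(
  cleaned_catalog               = c("A1", "A2", "A3"),
  decimalLatitude               = c(32.67, NA, NA),
  decimalLongitude              = c(-117.24, NA, NA),
  coordinateUncertaintyInMeters = c(50, NA, NA),
  stringsAsFactors = FALSE
)

# Toy GEOLocate result for one of the records.
corrected_toy <- data.frame(
  cleaned_catalog       = "A2",
  corrected_latitude    = 31.86,
  corrected_longitude   = -116.63,
  corrected_uncertainty = 300,
  stringsAsFactors = FALSE
)

# Left join: every main record is kept, matched or not.
joined <- merge(ecodata_toy, corrected_toy,
                by = "cleaned_catalog", all.x = TRUE)

# Assumed rule: fill only where the original coordinates are missing.
fill <- is.na(joined$decimalLatitude) & !is.na(joined$corrected_latitude)
joined$decimalLatitude[fill]  <- joined$corrected_latitude[fill]
joined$decimalLongitude[fill] <- joined$corrected_longitude[fill]
joined$coordinateUncertaintyInMeters[fill] <- joined$corrected_uncertainty[fill]

joined[, c("cleaned_catalog", "decimalLatitude", "decimalLongitude")]
```

Record A3 has no GEOLocate match in this toy example, so its coordinates stay NA rather than being inferred.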
File: ecodata_x.csv
Description:
Variables
- species:species name
- decimalLatitude:Latitude in decimal degrees
- decimalLongitude:Longitude in decimal degrees
- temperature_mean_BO:Mean sea surface temperature from Bio-ORACLE
- temperature_max_BO:Maximum sea surface temperature from Bio-ORACLE
- temperature_min_BO:Minimum sea surface temperature from Bio-ORACLE
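The idea behind this file, extracting each coordinate pair once and joining the environmental values back to every occurrence, can be sketched in base R. The temperature values below are placeholders typed in by hand, not Bio-ORACLE output, and the retrieval step itself is not shown.

```r
# Toy occurrence table; two records share the same coordinates.
occ <- data.frame(
  cleaned_catalog  = c("A1", "A2", "A3", "A4"),
  decimalLatitude  = c(32.67, 32.67, 31.86, NA),
  decimalLongitude = c(-117.24, -117.24, -116.63, NA),
  stringsAsFactors = FALSE
)

# Unique coordinate pairs, skipping records without coordinates.
has_xy <- !is.na(occ$decimalLatitude) & !is.na(occ$decimalLongitude)
coords <- unique(occ[has_xy, c("decimalLatitude", "decimalLongitude")])

# Placeholder values standing in for a Bio-ORACLE lookup (invented).
coords$temperature_mean_BO <- c(17.5, 17.9)

# Join the environmental values back to every occurrence record.
occ_env <- merge(occ, coords,
                 by = c("decimalLatitude", "decimalLongitude"),
                 all.x = TRUE)

nrow(coords)   # coordinate pairs that need a lookup
nrow(occ_env)  # all occurrence records are retained
```

Querying each coordinate pair once keeps the lookup small when many specimens come from the same locality.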
File: ecodata_with_outliers.csv
Description:
Variables
- basisOfRecord:Type of occurrence record (e.g., preserved specimen, fossil)
- occurrenceStatus:Indicates presence or absence of the species
- institutionCode:Code of the institution that provided the record
- verbatimEventDate:Original text for the event or collection date
- scientificName:Scientific name of the organism
- individualCount:Number of individuals recorded
- organismQuantity:Reported quantity (unit may vary)
- abundance:Standardized or calculated abundance value
- decimalLatitude:Latitude in decimal degrees
- decimalLongitude:Longitude in decimal degrees
- coordinateUncertaintyInMeters:Spatial uncertainty of coordinates in meters
- locality:Named location where the record was collected
- verbatimLocality:Original locality text as provided by the source
- municipality:Municipality or town of occurrence
- county:County of occurrence
- stateProvince:State or province of occurrence
- country:Country of occurrence
- cleaned_catalog:Standardized catalog number used for de-duplication
- lat_precision:Number of decimal places in the latitude coordinate
- lon_precision:Number of decimal places in the longitude coordinate
- flag_cordinate_precision:Flag for low coordinate precision
- flag_cc_val:Flag for invalid or impossible coordinates
- flag_cc_equal:Flag for identical latitude and longitude (likely erroneous)
- flag_cc_zero:Flag for coordinates at (0,0)
- flag_cc_cent:Flag for coordinates placed at a country or region centroid
- flag_cc_gbif:Flag for coordinates matching GBIF headquarters (artifact)
- flag_cc_inst:Flag for coordinates matching institution location
- flag_non_region:Flag for coordinates outside the study region
- outliers:Flag for outliers based on clustering of spatial and environmental variables
- BO_sstmean:Mean sea surface temperature from Bio-ORACLE
- BO_sstmax:Maximum sea surface temperature from Bio-ORACLE
- BO_sstmin:Minimum sea surface temperature from Bio-ORACLE
- BO_chloro:Chlorophyll concentration from Bio-ORACLE
- BO_dissox:Dissolved oxygen level from Bio-ORACLE
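One way to work with the flag columns listed above is to count how many checks each record fails and keep the records that fail none. The sketch below uses base R on an invented table and assumes that flags are stored as logical values with TRUE meaning "flagged"; the encoding in the actual file should be checked before applying it.

```r
# Toy table with three of the flag columns (values are invented).
# Assumption: TRUE means the record was flagged by that check.
flagged_toy <- data.frame(
  cleaned_catalog = c("A1", "A2", "A3", "A4"),
  flag_cc_zero    = c(FALSE, TRUE, FALSE, FALSE),
  flag_cc_equal   = c(FALSE, TRUE, FALSE, FALSE),
  flag_non_region = c(FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Select every column whose name starts with "flag_".
flag_cols <- grep("^flag_", names(flagged_toy), value = TRUE)

# Number of failed checks per record.
n_flags <- rowSums(flagged_toy[, flag_cols, drop = FALSE], na.rm = TRUE)

# Records that passed every check.
passed <- flagged_toy[n_flags == 0, ]

# Number of records caught by each check.
per_check <- colSums(flagged_toy[, flag_cols, drop = FALSE], na.rm = TRUE)

passed$cleaned_catalog
per_check
```

Counting per check before filtering shows which test removes the most records, which helps when deciding whether a flag is too strict for a given species.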
File: ecodata_cleaned.csv
Description:
Variables
- basisOfRecord:Type of occurrence record (e.g., preserved specimen, fossil)
- occurrenceStatus:Indicates presence or absence of the species
- institutionCode:Code of the institution that provided the record
- verbatimEventDate:Original text for the event or collection date
- scientificName:Scientific name of the organism
- individualCount:Number of individuals recorded
- organismQuantity:Reported quantity (unit may vary)
- abundance:Standardized or calculated abundance value
- decimalLatitude:Latitude in decimal degrees
- decimalLongitude:Longitude in decimal degrees
- coordinateUncertaintyInMeters:Spatial uncertainty of coordinates in meters
- locality:Named location where the record was collected
- verbatimLocality:Original locality text as provided by the source
- municipality:Municipality or town of occurrence
- county:County of occurrence
- stateProvince:State or province of occurrence
- country:Country of occurrence
- cleaned_catalog:Standardized catalog number used for de-duplication
- lat_precision:Number of decimal places in the latitude coordinate
- lon_precision:Number of decimal places in the longitude coordinate
- flag_cordinate_precision:Flag for low coordinate precision
- flag_cc_val:Flag for invalid or impossible coordinates
- flag_cc_equal:Flag for identical latitude and longitude (likely erroneous)
- flag_cc_zero:Flag for coordinates at (0,0)
- flag_cc_cent:Flag for coordinates placed at a country or region centroid
- flag_cc_gbif:Flag for coordinates matching GBIF headquarters (artifact)
- flag_cc_inst:Flag for coordinates matching institution location
- flag_non_region:Flag for coordinates outside the study region
- outliers:Flag for outliers based on clustering of spatial and environmental variables
- BO_sstmean:Mean sea surface temperature from Bio-ORACLE
- BO_sstmax:Maximum sea surface temperature from Bio-ORACLE
- BO_sstmin:Minimum sea surface temperature from Bio-ORACLE
- BO_chloro:Chlorophyll concentration from Bio-ORACLE
- BO_dissox:Dissolved oxygen level from Bio-ORACLE
Code/software
The data files provided in this repository can be viewed and processed using R (version ≥ 4.2.0) and the open-source R package EcoCleanR, which is available from CRAN and GitHub. EcoCleanR is designed to merge, clean, and visualize biodiversity occurrence data.
To reproduce the merged dataset (ecodata.csv), users should follow the EcoCleanR vignette data_merging, which demonstrates how the provided example datasets are combined and de-duplicated using the functions ec_db_merge() and ec_rm_duplicate(). Subsequent analyses and visualizations can be performed using either the instructions provided in README.Rmd or the detailed workflow described in the data_cleaning vignette, which explains the purpose and outputs of each EcoCleanR function.
Package dependencies are described in detail in the associated manuscript. When installing EcoCleanR from CRAN or GitHub, all required dependencies are automatically installed. No proprietary software is required.
EcoCleanR is available on CRAN at:
https://CRAN.R-project.org/package=EcoCleanR
An actively developed version of the package is available on GitHub:
https://github.com/sonipri/EcoCleanR
Users are encouraged to consult the GitHub repository for the most up-to-date version of the package and documentation.
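Before working through the vignettes, it may help to confirm that the local setup meets the requirements stated above. The following base R sketch only checks the R version and whether EcoCleanR is already installed; it does not install anything.

```r
# Minimum R version stated for these data files.
r_ok <- getRversion() >= "4.2.0"

# TRUE if EcoCleanR is found in the current library paths.
has_pkg <- requireNamespace("EcoCleanR", quietly = TRUE)

if (!r_ok) {
  message("R ", getRversion(), " is older than the stated minimum (4.2.0).")
}
if (!has_pkg) {
  # CRAN release; run this line manually to install.
  message('EcoCleanR not found. Install with: install.packages("EcoCleanR")')
}
```

For the development version, `remotes::install_github("sonipri/EcoCleanR")` is one common route, assuming the remotes package is installed.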
Access information
Other publicly accessible locations of the data:
- iNaturalist
Data was derived from the following sources:
- GBIF, iDigBio, OBIS, InvertEbase
References:
Derived dataset GBIF.org (14 August 2025) Filtered export of GBIF occurrence data https://doi.org/10.15468/dd.8wxre4
Derived dataset GBIF.org (14 August 2025) Filtered export of GBIF occurrence data https://doi.org/10.15468/dd.6np6eh
Derived dataset GBIF.org (14 August 2025) Filtered export of GBIF occurrence data https://doi.org/10.15468/dd.crhkcq
Biodiversity occurrence data published by: InvertEBase (accessed through the InvertEBase Portal, https://invertebase.org/portal, 2025-08-14)
OBIS (2025) [Distribution records of Mexacanthina lugubris (G. B. Sowerby I, 1822)] [Dataset] (Available: Ocean Biodiversity Information System. Intergovernmental Oceanographic Commission of UNESCO. www.obis.org. Accessed: 2025-08-14)
iDigBio (2025) https://www.idigbio.org/portal, Query: {"filtered": {"filter": {"and": [{"query": {"match": {"_all": {"operator": "and", "query": "mexacanthina lugubris"}}}}]}}}, 649 records, accessed on 2025-08-14T19:40:05.580627.
