Skip to main content

Sampling biases shape our view of the natural world

Cite this dataset

Hughes, Alice et al. (2021). Sampling biases shape our view of the natural world [Dataset]. Dryad.


Spatial patterns of biodiversity are inextricably linked to their collection methods, yet no synthesis of bias patterns or their consequences exists. As such, views of organismal distribution and the ecosystems they make up may be incorrect, undermining countless ecological and evolutionary studies. Using 742 million records of 374,900 species, we explore the global patterns and impacts of biases related to taxonomy, accessibility, ecotype, and data type across terrestrial and marine systems. Pervasive sampling and observation biases exist across animals, with only 6.74% of the globe sampled, and disproportionately poor tropical sampling. High -elevations and deep -seas are particularly unknown. Over 50% of records in most groups account for under 2% of species, and citizen-science only exacerbates biases. Additional data will be needed to overcome many of these biases, but we must increasingly value data publication to bridge this gap and better represent species’ distributions from more distant and inaccessible areas, and provide the necessary basis for conservation and management.


Full methods are given in the paper

Supplemental methods

A representative sample of vertebrates and invertebrates was selected for terrestrial and ocean systems. All four major terrestrial vertebrate groups were selected (Aves, Amphibia, Mammalia and Reptilia) along with a selection of groups found in both realms: Actinopterygii, Annelida, Arachnida, Cnidaria, Elasmobranchii, Gastropoda and Malacostraca (insects were not selected because they are virtually absent in marine systems and studies show that even the best-studied groups, such as bees, are exceedingly biased in coverage, 46). Whilst some of these groups are more diverse in either oceanic or terrestrial systems, most have representatives in both and show an array of different life-forms (between and within groups, and were selected at certain taxonomic levels to try to control for diversity of life-form) to enable extrapolation to other groups showing similar habits. In total, 742,161,633 records were analyzed, including 38,313,609 of 57,252,510 potential OBIS records for marine systems (80% of animal records, 67% of all records) and 703,848,024 of 1.23 billion potential GBIF records for terrestrial systems (71% of animal records, 57% of all records; details in Data S1, DOIs are listed in Data S2). Number of species to reach 50% of the data was also quantified from GBIF directly.


All GBIF records for amphibians, birds, mammals and reptiles within GBIF were downloaded on the 19th of April 2019 and other taxa were downloaded on July 3rd. The relationship between species distributions and various environmental parameters (coasts, rivers, cities etc-see below) was examined. Analysis was done for global and national scales and results at a national level are available in supplements, as relationships varied across taxa. For each country, we aggregated the number of records collected via different methods, with specimens and human observations dominant.

GBIF data were cleaned and filtered using taxon-specific lists for terrestrial vertebrates (;;;, numbers of species per genus were calculated based on this data, and the proportion of extinct species ( added as this may influence the relationship to roads.

Roads were downloaded from the GRIP dataset (, and distance from roads calculated in ArcGIS in a Cylindrical Equal-areas projection. The same projection was also used to measure distance from the coast (, distance from rivers ( and distance to cities (based on reclassifying the lights at night Blackmarble 2016 data ( to show only concentrated aggregations of light, based on testing what threshold of light brightness could be reliably used to define urban masses compared to Nature Vue imagery, based on global comparison of cities and relative brightness). The relationships between each road and coastline and sampling localities were then examined by looking at the number and percentage of records at 0-1km, 1-2.5km, 2.5-5km and over 5km from each.

The change in relationships within protected areas ( was examined by calculating the number of records within 2.5km of a road inside and outside protected areas, the same analysis was conducted for KBAs (

The latest versions of ecoregions were used ( to examine the level of sampling relative to the size and richness of each ecoregion, and ocean realms were downloaded from; percentage coverage and degree of sampling was calculated in each system. Elevation was downloaded from ( and average altitude at the four buffer distances was calculated using the zonal statistics function in Arc.

To examine representativeness of sampling for high-altitude biomes, we used two definitions. The first used a simple, standardized metric, using the top quartile of all elevations globally and classing these as high-altitude. The second approach was more representative of high biomes from an ecophysiological perspective, but required a more complex method. This required the integration of several different approaches which functioned better in different parts of the world with the aim of differentiating ecosystems above and below the treeline, and extracting developed and crop ecosystems. Tree density was reclassified to show just forest cover ( and non-tree areas above 1000m extracted. Cities and croplands ( were also removed from analysis. This was found to capture the zone above the treeline exceedingly well (based on testing with high-resolution imagery data) across the Northern Hemisphere above 30 degrees latitude, but over-predicted nival (above the treeline) habitats below this. For the Southern Hemisphere, using a mean temperature of 5.5 oC as a boundary demarcated nival areas well, whereas in the US between 15-50o North a mean of 2.5 oC performed better. These three zones were combined to give a global map of zones above treelines and tested using high-resolution imagery. The elevation of the nival zone was then extracted using a polygon of nival areas, then the lowest elevation of each nival zone calculated using zonal statistics. The region 500m below the treeline was created by subtracting 500 from the lowest treeline altitude for each area. A 10km buffer was marked around each treeline, and used to mask the elevation. The mask was then given the value of the 500m below treeline using zonal statistics and the value of the buffer elevation was mosaicked with the minus 500m elevation mask to retain the highest value. Areas above the 500m limit were then extracted by subtracting the value of the buffer, and the non-zero areas to create a mask. This area was then checked globally to examine this top tier of tree ecosystems.

For the elevation zones and ecoregions, the number of records and degree of coverage was then calculated. For coverage statistics, we gridded the world into 5 and 10km tiles, then assessed what percentage of all tiles had records. This was conducted for each taxa for all analysis where different zones or countries were examined and to understand how well-sampled terrestrial systems were for different taxa.


For Ocean records, OBIS ( was used to assess sampling and coverage (downloaded on 3rd July, See Supplemental Data 2). Overall sampling was calculated on the 5km and 10km grids overall, for each country’s Exclusive Economic Zone (from the Convention for the Conservation of Antarctic Marine Living Resources:, and for each marine realm (Costello et al. 2017:

The level of GDP per capita for each territory was calculated and compared to the percentage coverage of the Exclusive Economic Zone to the GDP. Data on exclusive economic zones was downloaded from Worldbank (, and other sources were used when no published data were available on a country or territory. GDP was compared to coverage in each country in addition to oceanic exclusive economic zone, and to the percentage of results from different sources. We also compared the percentage of records via human observations or specimens direct to GDP using the same approach. To collate tourist numbers, we used ; for the majority of countries, then filled in missing countries using national tourism statistics numbers or other reliable sources to the nearest available year (2013-2017) (Supplemental Data S2).

To understand the level of coverage for the deep-sea, the same SRTM was used as for terrestrial ecosystems and classified into depths of 1000m to 1800m (Webb 2010) and below -1800m, more indepth analysis of how marine diversity varies with depth is available from Costello & Chaudhary, 2017. For the groups under study, the records where depths were recorded were then filtered to provide a dataset with only species records which included depths. Records corresponding to the two depth cutoffs were maintained and the number of records and coverage levels calculated for the two zones.

For Marine Protected Areas (MPA) we calculated the percentage coverage for each group within the areas using the shapefiles from Protected Planet and the tabulate area tool in ArcMap 10.3.

Usage notes

Uploaded files include species and samples per cell for each group for the land and ocean. See readme for specifics. Five excel files of supplemental data are also included.