Data from: Coverage-based rarefaction fails to quantify relative species richness

Published May 19, 2026 on Dryad. https://doi.org/10.5061/dryad.pg4f4qs50

Data files

May 19, 2026 version files 280.92 KB

analyses_and_figures.R (16.55 KB)
data_files.tar.gz (255.66 KB)
functions.R (5.35 KB)
README.md (3.37 KB)

Abstract

Coverage-based rarefaction (CBR) is a high-profile tool for assessing biodiversity that provides relative species richness estimates by leveraging the Good-Turing index u to interpolate. In contrast to alternatives such as the Shannon and Simpson indices, CBR's main appeal is providing values in units of species. CBR is tested against a series of other biodiversity measures. Data are both simulated and empirical, in the latter case drawn from an eclectic global database of terrestrial organisms. Various challenges are presented. First, species counts are simulated under the geometric Weibull (GW), Poisson log normal, and discretised Weibull abundance distributions. CBR and six other diversity estimators are then computed. Second, diversity estimates are computed for species inventories and then recomputed after excluding the single most common species in each one, which stands in for random inclusion of high counts in replicate samples. Third, randomly selected pairs of inventories are either (1) analysed separately with richness estimates summed, or (2) combined and only then analysed. Fourth, estimates are compared for randomly paired inventories: each pair must include an identical or highly similar number of singletons (= species sampled once). On average, fitting GW in simulation consistently returns an accurate and precise estimate of richness. GW returns much the same values regardless of how the empirical data are treated. CBR often overestimates by a large margin when dominants are excluded, underestimates by a large margin when data are combined, and nearly randomises values for singleton-matched inventory pairs. In other words, CBR does not respond predictably to variation in species richness and cannot reconstruct it when the data have strong internal structure. Because better options are available, it is neither wise nor necessary to standardise by coverage in order to estimate richness.

Dataset DOI: 10.5061/dryad.pg4f4qs50

Description of the data and file structure

This supplemental data set includes all of the R code used to carry out the analyses and prepare the text figures, in addition to all of the output from the four analyses of empirical and simulated data that are illustrated in the text.

Files and variables

File: functions.R

Description: This R file includes all of the stand-alone analytical functions used in the analyses. The key function is gweibull, which fits the geometric Weibull distribution to count data. rgweibull randomly generates count data sets drawn from this distribution. rweib draws randomly from the discretised Weibull distribution. chao1, fisher, shannon, and simpson implement standard ecological diversity metrics. sqs implements the method known variously as shareholder quorum subsampling or coverage-based rarefaction, using standard combinatorial equations. pln computes statistics for the Poisson log normal distribution and depends on the R package poilog. sadrad is a helper function that gweibull and pln require.
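For orientation, here is a minimal base R sketch of what some of these standard measures compute. It is not the code in functions.R: the function names (all suffixed _sketch), the toy counts, and the choice of the bias-corrected Chao1 form are assumptions made for illustration. The coverage-based rarefaction part uses a widely used sample-coverage interpolation formula and a brute-force search, which may differ in detail from the deposited sqs function.

```r
# Illustrative only: these are NOT the functions shipped in functions.R.
counts <- c(10, 5, 2, 1, 1)  # hypothetical toy inventory (one count per species)

# Good's coverage estimate: u = 1 - singletons / individuals
goods_u_sketch <- function(n) 1 - sum(n == 1) / sum(n)

# Shannon entropy H = -sum(p * log(p)), natural logarithm
shannon_sketch <- function(n) {
  p <- n[n > 0] / sum(n)
  -sum(p * log(p))
}

# Simpson concentration D = sum(p^2); many authors report 1 - D or 1 / D instead
simpson_sketch <- function(n) {
  p <- n / sum(n)
  sum(p^2)
}

# Chao1, bias-corrected form: S_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
chao1_sketch <- function(n) {
  f1 <- sum(n == 1)
  f2 <- sum(n == 2)
  sum(n > 0) + f1 * (f1 - 1) / (2 * (f2 + 1))
}

# Expected richness in a random subsample of m individuals (classical
# rarefaction): sum_i [1 - C(N - n_i, m) / C(N, m)]
rarefied_richness_sketch <- function(n, m) {
  N <- sum(n)
  sum(1 - exp(lchoose(N - n, m) - lchoose(N, m)))
}

# Expected coverage of a subsample of m < N individuals:
# 1 - sum_i (n_i / N) * C(N - n_i, m) / C(N - 1, m)
expected_coverage_sketch <- function(n, m) {
  N <- sum(n)
  1 - sum((n / N) * exp(lchoose(N - n, m) - lchoose(N - 1, m)))
}

# Coverage-based rarefaction by brute force: find the smallest subsample size
# whose expected coverage reaches the target, then report rarefied richness.
cbr_sketch <- function(n, target) {
  N <- sum(n)
  cov <- vapply(seq_len(N - 1),
                function(m) expected_coverage_sketch(n, m), numeric(1))
  hit <- which(cov >= target)
  if (length(hit) == 0) return(NA_real_)  # target not reachable by interpolation
  rarefied_richness_sketch(n, hit[1])
}

goods_u_sketch(counts)
chao1_sketch(counts)
cbr_sketch(counts, target = 0.5)
```

Working on the log scale with lchoose avoids overflow in the binomial coefficients when inventories contain many individuals.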

File: analyses_and_figures.R

Description: This R script runs all of the analyses and generates the text figures described in the paper, in order. The R packages it depends upon are listed at the top, and the functions in functions.R are loaded with a source command. The code generates the stand-alone Fig. 1, which illustrates the blended maximum likelihood and Bayesian inference method outlined in the text by showing a parameter space (panel A) and the corresponding prior distribution (panel B). The first statistical analysis involves simulated ecological inventory data. The three following empirical analyses depend on inventory data presented in Alroy (2025: https://doi.org/10.5061/dryad.brv15dvdc). Data sets generated by the analyses are written to four plain text files, which are bundled in data_files.tar.gz.
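As a rough illustration of the data manipulations behind the three empirical analyses (excluding the dominant species, pooling a pair of inventories, and matching pairs on singleton counts), the following base R sketch uses two made-up inventories. The helper names and the counts are hypothetical and do not come from analyses_and_figures.R.

```r
# Two toy inventories (hypothetical counts, not taken from the Dryad data).
inv_a <- c(sp1 = 40, sp2 = 12, sp3 = 5, sp4 = 1, sp5 = 1)
inv_b <- c(sp1 = 3, sp3 = 20, sp6 = 7, sp7 = 1, sp8 = 1)

# Exclude the single most common species (the dominant) from an inventory.
drop_dominant <- function(counts) counts[-which.max(counts)]

# Pool two inventories, summing the counts of species found in both.
pool_inventories <- function(a, b) {
  all_counts <- c(a, b)
  sapply(split(all_counts, names(all_counts)), sum)
}

# Number of species sampled exactly once (singletons).
n_singletons <- function(counts) sum(counts == 1)

# Observed richness when inventories are treated separately and summed
# (shared species are counted twice) versus pooled first.
richness_summed <- length(inv_a) + length(inv_b)
richness_pooled <- length(pool_inventories(inv_a, inv_b))

# A pair qualifies as singleton-matched when these two numbers agree.
c(n_singletons(inv_a), n_singletons(inv_b))
```

Any richness estimator can then be applied to the original, reduced, or pooled count vectors and the results compared.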

File: data_files.tar.gz

Description: This archive contains the four plain text data files generated by the preceding R script. simulations.txt includes the simulation output used in Fig. 2. minus_dominant.txt summarises diversity statistics generated by ignoring the most common species (dominant) in each empirical ecological inventory (Fig. 3). combinations.txt summarises diversity estimates generated either by analysing randomly matched pairs of inventories separately and summing their estimates or by combining these pairs and analysing them together (Fig. 4). singleton_matches.txt summarises estimates for pairs of inventories matched on the number of species sampled only once in each inventory (= singletons) (Fig. 5). "NA" values indicate "Not Applicable" or missing data.
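One possible way to unpack the archive and load its tables in R is sketched below. The helper read_archive_tables is hypothetical, and the layout it assumes (whitespace-delimited text with a header row) is a guess that has not been checked against the files, so the read.table arguments may need adjusting.

```r
# Hypothetical helper (not part of the deposited code): unpack the archive
# into a scratch directory and read every .txt file found inside it.
read_archive_tables <- function(archive, exdir = tempfile("dryad_")) {
  dir.create(exdir, recursive = TRUE, showWarnings = FALSE)
  untar(archive, exdir = exdir)
  paths <- list.files(exdir, pattern = "\\.txt$",
                      recursive = TRUE, full.names = TRUE)
  # ASSUMPTION: whitespace-delimited tables with a header row.
  # "NA" strings are read as missing values.
  tables <- lapply(paths, read.table, header = TRUE,
                   na.strings = "NA", stringsAsFactors = FALSE)
  names(tables) <- basename(paths)
  tables
}

# Only runs if the archive has been downloaded to the working directory.
if (file.exists("data_files.tar.gz")) {
  print(untar("data_files.tar.gz", list = TRUE))  # inspect contents first
  tables <- read_archive_tables("data_files.tar.gz")
  str(tables, max.level = 1)
}
```

Listing the archive contents before reading is a cheap way to confirm the file names and any subdirectory structure.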

Code/software

The analyses were carried out in the R programming environment (version 4.5.1). They depend on the R packages grDevices, poilog, sads, and vioplot. All custom R code is included in the above-mentioned files.
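Before running the script, it may help to confirm that the listed packages are available. The check below is a small sketch, not part of the deposited code; the variable names are illustrative, and the version comparison simply reports when the local R is older than the one stated above.

```r
# Packages named in this README; grDevices ships with base R.
pkgs <- c("grDevices", "poilog", "sads", "vioplot")

# TRUE for each package whose namespace can be loaded.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "),
          ". They can be installed with install.packages().")
}

# The analyses were run under R 4.5.1; older versions are untested here.
if (getRversion() < "4.5.1") {
  message("Running R ", as.character(getRversion()),
          ", which is older than the version used for the analyses.")
}
```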

Access information

Empirical data were derived from the following source on the Dryad website and are public domain: https://doi.org/10.5061/dryad.brv15dvdc