Global contemporary effective population sizes across taxonomic groups

Clarke, Shannon H.1; Lawrence, Elizabeth R.1; Matte, Jean-Michel1; Salisbury, Sarah J.2; Michaelides, Sozos N.1; Koumrouyan, Ramela1; Ruzzante, Daniel E.2; Grant, James W. A.1; Fraser, Dylan J.1

Published May 03, 2024 on Dryad. https://doi.org/10.5061/dryad.p2ngf1vzm

Data files

May 03, 2024 version files 6.53 MB

Abstract

Effective population size (N_e) is a particularly useful metric for conservation as it affects genetic drift, inbreeding and adaptive potential within populations. Current guidelines recommend a minimum N_e of 50 and 500 to avoid short-term inbreeding and to preserve long-term adaptive potential, respectively. However, the extent to which wild populations reach these thresholds globally has not been investigated, nor has the relationship between N_e and human activities. Through a quantitative review, we generated a dataset with 4610 georeferenced N_e estimates from 3829 unique populations, extracted from 723 articles. These data show that certain taxonomic groups are less likely to meet 50/500 thresholds and are disproportionately impacted by human activities; plant, mammal, and amphibian populations had a <54% probability of reaching N̂_e = 50 and a <9% probability of reaching N̂_e = 500. Populations listed as being of conservation concern according to the IUCN Red List had a smaller median N̂_e than unlisted populations, and this was consistent across all taxonomic groups. N̂_e was reduced in areas with a greater Global Human Footprint, especially for amphibians, birds, and mammals, although relationships varied between taxa. We also highlight several considerations for future works, including the role that gene flow and subpopulation structure play in the estimation of N̂_e in wild populations, and the need for finer-scale taxonomic analyses. Our findings provide guidance for more specific thresholds based on N_e and help prioritize assessment of populations from taxa most at risk of failing to meet conservation thresholds.

Through a quantitative review, we generated a dataset with 5498 georeferenced Ne and Nb estimates from 3829 unique populations, extracted from 723 articles.

Description of the data and file structure

There are two data files (.csv) associated with this dataset. The first file, 'MEC-23-0962_FullData' includes the full dataset of 5498 estimates and associated data. The second file, 'MEC-23-0962_NonReplicated' includes a subset of the data, with spatial and temporal replication removed. In both files, each row signifies a unique estimate of Ne or Nb. Details of the columns and what they contain are in the table below:

Column	Explanation
PaperID	A unique identifier given to each paper during the screening process. Numbers are not sequential, as papers were excluded and removed from the database during screening.
Title	The title of the paper, extracted from Web of Science
Authors	The authors of the paper, extracted from Web of Science
Year Published	The year the paper was published, extracted from Web of Science
DOI	The DOI identifier for the paper, extracted from Web of Science
Journal	The journal the paper was published in, extracted from Web of Science
Common Name	Common name of species, used by authors in article, all lower case. Some cells are blank, as authors did not always report a common name.
Genus	genus (capitalized)
Species	species (uncapitalized)
Genus/Species	The full scientific name of the species
IUCN_detailed	IUCN status of a species (Not evaluated, data deficient, least concern, near threatened, vulnerable, endangered, critically endangered). Only included in the full data (not in the nonreplicated data)
IUCN	Broader groupings of IUCN status: NA (not evaluated or data deficient), nonthreatened (least concern or near threatened), threatened (vulnerable, endangered, critically endangered). Only included in the full data (not in the nonreplicated data)
Class	freshwater fish, marine fish, diadromous fish, reptile, amphibian, mammal, bird, invertebrate, plant. If something is not listed here, or you are unsure, you can enter a comment in the column next to this. For fish that can be either resident or diadromous, use best judgement according to the authors’ descriptions of the population
was the population reintroduced or translocated?	Did the authors mention in the paper that the population was reintroduced or translocated from a wild source? (YES/NO)
Is the pop non-native?	Did the authors mention that this is a non-native or invasive species
When was the pop reintroduced/translocated?	the year when pop was reintroduced or translocated
is the population commercially harvested?	Did the authors mention that this population is commercially harvested?
notes	notes on the reintroduction/translocation/non-native/harvesting of the population. Some cells are blank if notes were not required.
Type of sequencing	RAD-seq, GBS, capillary electrophoresis, Sanger. Can enter a method not listed here by choosing "other" and then entering in comment column.
Comment for type of sequencing	Response for “other” on type of sequencing. Some cells are blank.
marker	Microsatellite, SNP, or Other
Marker type comment	If “other” for marker above. Some cells are blank.
Number of loci	number of loci (for microsatellite markers) used in estimate.
Number of SNP	Number of SNP used in the estimate
LociNumber	Combined the two columns of number of loci and number of SNP into a single column
GW correction	genome-wide bias correction for LD method; YES/NO. Based on study by Waples, Larson, and Waples (2016, Heredity)
Method (general)	LD (linkage disequilibrium), SF (sibship frequency), HE (heterozygote excess), MC (molecular coancestry), Bayesian methods. If article used a different single-sample genetic estimator not listed, put "Other" and enter it into the notes column
Method (specific)	LDNe, NeEstimator (v1 or V2), COLONY, ONeSAMP. If article used a different software not listed, put "Other" and enter it into the notes column
Notes about method	If “other” was chosen, it is specified here. Some cells are blank if notes were not required.
Allele freq cutoff	for LD method; common values are 0.01, 0.05, 0.1. If the article reports multiple allele cutoff values, follow this rule: For sample sizes >25 use 0.02, and <25 use 0.05. Please report whether the study included multiple allele cutoffs.
Comment on allele cutoffs	Any comments relevant to the allele cutoff (e.g. if they reported multiple cutoffs, etc.). Some cells are blank if notes were not required.
Sample size	the number of individuals sampled from the population
Temporal replication?	Does this population have temporal replication YES/NO (i.e., were multiple estimates reported for the same population through time?)
Spatial replication	Does the population have spatial replication YES/NO (i.e., were there multiple estimates reported for the same population? E.g., Ne was reported by sampling location, but multiple sampling locations make up a single population based on Fst or STRUCTURE)
Population	name given to population; for species in bodies of water, could use the name of the lake/river, or another name used by authors in article
Population ID	Alphanumeric identifier for each population (since there can be multiple estimates for a single population), built from the article number. E.g., if article #100 has two populations, they will be 100A and 100B. If an article only has one estimate, still include A at the end.
Method of defining population	based off of the authors in the article and how they defined the population. E.g. using Fst values, STRUCTURE (determining # of groups), BAYESASS (measuring migration rates), IBA (individual-based-assignment; using genetic data from populations to assign individuals), etc. If there is any additional information, include it in the comment column. E.g. what their threshold Fst value was, or level of migration, etc.
Region where population is located	can be a city/province/ etc. or multiple of these things.
Country	The country where the population is located
Continent/Ocean	Continent or ocean (for marine species) where the population is located
Latitude*	Latitude of the population
Longitude*	Longitude of the population
How coordinate was generated*	Comment on how the coordinate was obtained (i.e., whether it was reported by the authors, estimated from a map, converted from UTM, estimated as the midpoint of several sampling locations, etc.).
He	average He (expected heterozygosity) across loci for that population. Some cells are blank if He was not reported in the paper.
Ho	Average Ho (observed heterozygosity) across loci for that population. Some cells are blank if Ho was not reported in the paper.
Ar	allelic richness; average # of alleles per locus, weighted by sample size. If the authors refer to "Ar" with no mention of method, assume they are correct. If they refer to Ar as the non-weighted version, then it was categorized as MNA instead. Some cells are blank if Ar was not reported.
MNA	mean number of alleles per locus. NOT weighted. If the authors refer to MNA but mention weighting, categorize as Ar instead. Some cells are blank if MNA was not reported.
Inbreeding coefficient (Fis)	usually calculated from heterozygosity measures. Some cells are blank if Fis was not reported.
Nucleotide diversity	The nucleotide diversity reported for estimates using SNPs. Some cells are blank if nucleotide diversity was not reported.
Ne	point estimate of Ne. Cells are blank for populations where Nb was reported instead.
Nb	point estimate of Nb. Cells are blank for populations where Ne was reported instead.
Year	year the samples were taken from the population. Some cells are blank if year was not reported.
LCI	lower confidence interval for estimate
UCI	upper confidence interval for estimate (can be "infinity" as well).
CI method	method used to calculate CIs. E.g. jackknife vs parametric methods in LDNe program. If the method is not provided, enter as text in comments column. Some cells are blank if method was not reported.
CI comment	Comment on the CIs. Some cells are blank if notes were not required.
was the original estimate negative or infinite?	For populations with no point estimate (i.e., the estimate was negative or infinite), we used the LCI as a proxy for the point estimate. This column indicates whether the original estimate was negative or infinite. Cells are blank for estimates that were not negative or infinite.
Fifty	A binary response indicating whether the population meets the threshold of fifty Ne. 0 = below fifty, 1 = equal or greater than fifty.
fivehundred	A binary response indicating whether the population meets the threshold of five hundred Ne. 0 = below five hundred, 1 = equal or greater than five hundred.
Did they sample individuals across different cohorts?	yes/no/unsure. Did the authors sample from multiple cohorts within a population?
Did they report Ne for sampling sites?	If the authors defined a population as a group of sampling sites, but only reported Ne for the sites (rather than for the population as a whole), this column was marked "YES", and each sampling site was entered on a unique row, with the SAME POPULATION ID. If Ne is reported for both the sampling sites, and overall population, the population-level Ne was used, and this column was marked as “NO”.
Did they pool samples from multiple years?	yes/no/unsure. Did the authors pool samples from the same population over multiple years?
Notes	any other notes on the validity of the Ne estimate (e.g. all samples came from a single breeding site and may not be representative of the population). Some cells are blank if notes were not required.
Nc estimate	point estimate of Nc. Some cells are blank if Nc was not reported.
Nc LCI	Lower confidence interval of Nc (if reported). Some cells are blank if Nc was not reported.
Nc UCI	Upper confidence interval of Nc (if reported). Some cells are blank if Nc was not reported.
Nc method	mark-recapture, complete count, incomplete count (e.g. quadrat study with extrapolation), or "other". Some cells are blank if Nc was not reported.
Nc method comment	Any other relevant information about the method of estimating Nc. Some cells are blank if Nc was not reported.
Year Nc taken	The year that the Nc estimate was taken. Some cells are blank if Nc was not reported.
comment	Any relevant information about the year that Nc was taken. Some cells are blank if Nc was not reported.
ratio	The ratio of Ne/Nc or Nb/Nc. Some cells are blank if Nc was not reported.
Ratio type	“Ne” or “Nb”, i.e. is the ratio between Ne and Nc, or between Nb and Nc. Some cells are blank if Nc was not reported.
Notes	Overall notes on the paper/methods/estimates, etc. Some cells are blank where notes were not required.
HFI	The human footprint index, extracted using the lat/long coordinates in QGIS. Only included in the full data (not in the nonreplicated data). Some cells are blank where no HFI value was associated with the GPS location.

*Information on Latitude and Longitude are not included in the publicly accessible document, to protect locations of at-risk populations. Please contact the authors to receive the coordinate information for the dataset.
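
As an illustration of how the two threshold columns ('Fifty' and 'fivehundred') relate to the point estimates, the sketch below reads rows in the layout described above with Python's standard csv module and recomputes the flags. It assumes the CSV header uses the column names exactly as listed in the table ('Ne', 'Nb', 'Population ID') and that the file is comma-separated; the function names are hypothetical and the sample rows are invented for illustration, not taken from the dataset. Rows that report Nb only are skipped here, since the table defines the thresholds in terms of Ne.

```python
import csv
import io


def threshold_flags(ne):
    """Return (fifty, fivehundred) as 0/1 flags; 1 means the estimate is >= the threshold."""
    return int(ne >= 50), int(ne >= 500)


def flag_rows(handle):
    """Yield (population_id, ne, fifty, fivehundred) for rows that have an Ne point estimate."""
    for row in csv.DictReader(handle):
        raw = (row.get("Ne") or "").strip()
        if not raw:
            # Blank Ne cell: Nb was reported instead, so no flag is computed here.
            continue
        ne = float(raw)
        yield (row["Population ID"], ne, *threshold_flags(ne))


# Invented example rows in the assumed layout (not real data).
sample = io.StringIO(
    "Population ID,Ne,Nb\n"
    "100A,42.5,\n"
    "100B,,310\n"
    "101A,500,\n"
)
flags = list(flag_rows(sample))
```

To run the same check on the real file, the io.StringIO object could be replaced by an open file handle, for example open('MEC-23-0962_FullData.csv', newline=''), assuming that is the file name on disk.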

Sharing/Access information

Data were derived from peer reviewed articles, the details of which can be found in the data file under the columns titled 'Title', 'Authors', 'Year Published', 'DOI', and 'Journal'.

Code/Software

The R script associated with this dataset, titled 'MEC-23-0962_Code.R', was run in R Statistical Software (Version 4.2.2) and RStudio (Version 2022.12.0+353). The packages included in the script are: glmmTMB (v. 1.1.5), ggplot2 (v. 3.4.1), dplyr (v. 1.1.0), emmeans (v. 1.8.4-1), lme4 (v. 1.1-31), and nlme (v. 3.1-160).

Literature search, screening, and data extraction

A primary literature search was conducted using ISI Web of Science Core Collection for any articles that referenced two popular single-sample N_e estimation software packages: LDNe (Waples & Do, 2008), and NeEstimator v2 (Do et al., 2014). The initial search included 4513 articles published up to the search date of May 26, 2020. Articles were screened for relevance in two steps, first based on title and abstract, and then based on the full text. For each step, a consistency check was performed using 100 articles to ensure they were screened consistently between reviewers (n = 6). We required a kappa score (Collaboration for Environmental Evidence, 2020) of ≥ 0.6 to proceed with screening of the remaining articles. Articles were screened based on three criteria: (1) Is an estimate of N_e or N_b reported; (2) for a wild animal or plant population; (3) using a single-sample genetic estimation method. Further details on the literature search and article screening are found in the Supplementary Material (Fig. S1).
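
To make the consistency check concrete, the following sketch computes Cohen's kappa for two reviewers' include/exclude decisions and compares it with the 0.6 cut-off. The source does not state which kappa variant was used for the six reviewers, so the two-rater form is an assumption, as are the function name and the invented decisions below.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's marginal label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented screening decisions for ten articles (not real data).
reviewer_a = ["in", "in", "in", "in", "out", "out", "out", "out", "out", "out"]
reviewer_b = ["in", "in", "in", "out", "out", "out", "out", "out", "out", "in"]
kappa = cohens_kappa(reviewer_a, reviewer_b)
passes_check = kappa >= 0.6
```

In this invented example the reviewers agree on 8 of 10 articles, yet chance agreement is 0.52, so kappa is about 0.58 and the check would not pass.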

We extracted data from all studies retained after both screening steps (title and abstract; full text). Each line of data entered in the database represents a single estimate from a population. Some populations had multiple estimates over several years, or from different estimation methods (see Table S1), and each of these was entered on a unique row in the database. Data on N̂_e, N̂_b, or N̂_c were extracted from tables and figures using WebPlotDigitizer software version 4.3 (Rohatgi, 2020). A full list of data extracted is found in Table S2.

Data Filtering

After the initial data collation, correction, and organization, there was a total of 8971 N_e estimates (Fig. S1). We used regression analyses to compare N_e estimates for the same populations obtained with different estimation methods (LD, sibship, and Bayesian), and found that the R² values were very low (<0.1; Fig. S2 and Fig. S3). Given this inconsistency, and the fact that LD is the most frequently used method in the literature (74% of our database), we proceeded with only using the LD estimates for our analyses. We further filtered the data to remove estimates where no sample size was reported or no bias correction (Waples, 2006) was applied (see Fig. S6 for more details).
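
As a rough illustration of the method comparison, the sketch below computes R² for a simple linear regression of one method's estimates on another's, which equals the squared Pearson correlation. The source does not specify the regression form or any transformation of the estimates, so this is an assumption; the function name and the paired values are invented.

```python
def r_squared(x, y):
    """R² of a simple linear regression of y on x (squared Pearson correlation)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R² is undefined when either variable is constant")
    return (sxy * sxy) / (sxx * syy)


# Invented paired estimates for four populations (not real data).
ld_estimates = [10.0, 20.0, 30.0, 40.0]
sibship_estimates = [25.0, 15.0, 35.0, 20.0]
r2 = r_squared(ld_estimates, sibship_estimates)
```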

N_e is sometimes estimated to be infinity or negative within a population, which may reflect that a population is very large (i.e., where the drift signal-to-noise ratio is very low), and/or that there is low precision with the data due to small sample size or limited genetic marker resolution (Gilbert & Whitlock, 2015; Waples & Do, 2008; Waples & Do, 2010). We retained infinite and negative estimates only if they reported a positive lower confidence interval (LCI), and we used the LCI in place of a point estimate of N_e or N_b. We chose to use the LCI as a conservative proxy for N̂_e in cases where a point estimate could not be generated, given its relevance for conservation (Fraser et al., 2007; Hare et al., 2011; Waples & Do, 2008; Waples, 2023). We also compared results using the LCI to a dataset where infinite or negative values were all assumed to reflect very large populations and were replaced with an arbitrary large value of 9999 (for reference, in the LCI dataset only 51 estimates, or 0.9%, had an N̂_e or N̂_b > 9999). Using this 9999 dataset, we found that the main conclusions from the analyses remained the same as when using the LCI dataset, with the exception of the HFI analysis (see discussion in supplementary material; Table S3, Table S4, Fig. S4, S5). We also note that point estimates with an upper confidence interval of infinity (n = 1358) were larger on average (mean = 1380.82, compared to 689.44 and 571.64, for estimates with no CIs or with an upper boundary, respectively). Nevertheless, we chose to retain point estimates with an upper confidence interval of infinity because accounting for them in the analyses did not alter the main conclusions of our study and would have significantly decreased our sample size (Fig. S7, Table S5).
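
The retention rule for infinite and negative estimates can be sketched as a small function. This is a minimal illustration under stated assumptions: the function name and return convention are hypothetical, missing values are represented as None, and a point estimate of exactly zero is treated like a negative one, which the source does not address.

```python
import math


def resolve_estimate(point, lci):
    """Return (value, used_lci), or None if the estimate would be dropped.

    point: reported point estimate (may be negative, infinite, or None)
    lci:   reported lower confidence interval (may be None)
    """
    # A finite, positive point estimate is used as reported.
    if point is not None and math.isfinite(point) and point > 0:
        return point, False
    # Otherwise fall back to the LCI, but only when it is finite and positive.
    if lci is not None and math.isfinite(lci) and lci > 0:
        return lci, True
    # No usable point estimate and no positive LCI: the estimate is not retained.
    return None


# Invented examples (not real data).
kept_as_reported = resolve_estimate(120.0, 80.0)
infinite_with_lci = resolve_estimate(math.inf, 95.0)
negative_with_lci = resolve_estimate(-35.2, 60.0)
dropped = resolve_estimate(math.inf, -10.0)
```

The second element of the returned pair plays the same role as the column 'was the original estimate negative or infinite?' in the data files, in that it records when the LCI stands in for the point estimate.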

We also retained estimates from populations that were reintroduced or translocated from a wild source (n = 309), whereas those from captive sources were excluded during article screening (see above). In exploratory analyses, the removal of these data did not influence our results, and many of these populations are relevant to real-world conservation efforts, as reintroductions and translocations are used to re-establish or support small, at-risk populations.

We removed estimates based on duplication of markers (keeping estimates generated from SNPs when studies used both SNPs and microsatellites), and duplication of software (keeping estimates from NeEstimator v2 when studies used it alongside LDNe). Spatial and temporal replication were addressed with two separate datasets (see Table S6 for more information): the full dataset included spatially and temporally replicated samples, while these two types of replication were removed from the non-replicated dataset. Finally, for all populations included in our final datasets, we manually extracted their protection status according to the IUCN Red List of Threatened Species. Taxa were categorized as “Threatened” (Vulnerable, Endangered, Critically Endangered), “Nonthreatened” (Least Concern, Near Threatened), or “N/A” (Data Deficient, Not Evaluated).
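
The IUCN grouping described above amounts to a fixed lookup from the detailed status to the broader category. The sketch below uses the labels as written in this section; the spelling in the data file's 'IUCN' column may differ (the column table lists NA, nonthreatened, and threatened), and the function name is hypothetical.

```python
IUCN_GROUPS = {
    "not evaluated": "N/A",
    "data deficient": "N/A",
    "least concern": "Nonthreatened",
    "near threatened": "Nonthreatened",
    "vulnerable": "Threatened",
    "endangered": "Threatened",
    "critically endangered": "Threatened",
}


def iucn_group(detailed_status):
    """Map a detailed IUCN status to its broader group, ignoring case and extra spaces."""
    key = " ".join(detailed_status.lower().split())
    try:
        return IUCN_GROUPS[key]
    except KeyError:
        raise ValueError(f"Unrecognised IUCN status: {detailed_status!r}") from None
```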

Mapping and Human Footprint Index (HFI)

All populations were mapped in QGIS using the coordinates extracted from articles. The maps were created using a World Behrmann equal area projection. For the summary maps, estimates were grouped into grid cells with an area of 250,000 km² (roughly 500 km x 500 km, but the dimensions of each cell vary due to distortions from the projection). Within each cell, we generated the count and median of N_e. We used the Global Human Footprint dataset (WCS & CIESIN, 2005) to generate a value of human influence (HFI) for each population at its geographic coordinates. The footprint ranges from zero (no human influence) to 100 (maximum human influence). Values were available in 1 km x 1 km grid cell size and were projected over the point estimates to assign a value of human footprint to each population. The human footprint values were extracted from the map into a spreadsheet to be used for statistical analyses. Not all geographic coordinates had a human footprint value associated with them (i.e., in the oceans and other large bodies of water); therefore, marine fishes were not included in our HFI analysis. Overall, 3610 N_e estimates in our final dataset had an associated footprint value.
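
The gridding step can be approximated outside QGIS. The sketch below projects coordinates with the spherical form of the Behrmann projection (a cylindrical equal-area projection with a 30° standard parallel), bins them into 500 km x 500 km cells in projected space, and reports the count and median per cell. It is an approximation under stated assumptions: a spherical Earth of radius 6371 km rather than the ellipsoid QGIS would use, a grid origin at (0, 0), hypothetical function names, and invented records. Because the projection is equal-area, each square cell in projected space covers 250,000 km² on the sphere.

```python
import math
from collections import defaultdict
from statistics import median

EARTH_RADIUS_KM = 6371.0                  # assumed mean spherical radius
STANDARD_PARALLEL = math.radians(30.0)    # Behrmann standard parallel
CELL_KM = 500.0                           # 500 km x 500 km = 250,000 km² per cell


def behrmann_xy(lat_deg, lon_deg):
    """Project latitude/longitude (degrees) to Behrmann x, y in km on a sphere."""
    x = EARTH_RADIUS_KM * math.radians(lon_deg) * math.cos(STANDARD_PARALLEL)
    y = EARTH_RADIUS_KM * math.sin(math.radians(lat_deg)) / math.cos(STANDARD_PARALLEL)
    return x, y


def grid_summary(records):
    """Group (lat, lon, ne) records into grid cells; return {(col, row): (count, median)}."""
    cells = defaultdict(list)
    for lat, lon, ne in records:
        x, y = behrmann_xy(lat, lon)
        cell = (math.floor(x / CELL_KM), math.floor(y / CELL_KM))
        cells[cell].append(ne)
    return {cell: (len(values), median(values)) for cell, values in cells.items()}


# Invented records: (latitude, longitude, Ne). Not real data.
records = [(1.0, 1.0, 100.0), (1.5, 1.2, 300.0), (45.0, -75.0, 50.0)]
summary = grid_summary(records)
```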
