Data from: A latitudinal gradient of reference genomes
Data files
Aug 21, 2025 version (files: 331.83 KB total)
- amphibia.tsv (7.50 KB)
- birds.tsv (78.12 KB)
- congen_lit_review.csv (170.95 KB)
- crocodylia.tsv (596 B)
- mammals.tsv (52.78 KB)
- README.md (9.54 KB)
- squamates.tsv (9.23 KB)
- testudines.tsv (2.97 KB)
- tuatara.tsv (137 B)
Abstract
Global inequality rooted in legacies of colonialism and uneven development can lead to systematic biases in scientific knowledge. In ecology and evolutionary biology, findings, funding, and research effort are disproportionately concentrated at high latitudes, while biological diversity is concentrated at low latitudes. This discrepancy may have a particular influence in fields like phylogeography, molecular ecology, and conservation genetics, where the rise of genomics has increased the cost and technical expertise required to apply state-of-the-art methods. Here, we ask whether a fundamental biogeographic pattern – the latitudinal gradient of species richness in tetrapods – is reflected in available reference genomes, an important data resource for various applications of molecular tools for biodiversity research and conservation. We also ask whether sequencing approaches differ between the Global South and Global North, reviewing the last five years of conservation genetics research in four leading journals. We find that extant reference genomes are scarce relative to species richness at low latitudes, and that reduced-representation and whole-genome sequencing are disproportionately applied to taxa in the Global North. We conclude with recommendations to close this gap and improve international collaborations in biodiversity genomics.
Dataset DOI: 10.5061/dryad.2v6wwpzxh
Description of the data and file structure
Linck and Cadena 2024 Mol. Ecol. obtained two types of data: 1) metadata on published tetrapod reference genomes from NCBI's Genome Browser; and 2) georeferenced occurrence data from the Global Biodiversity Information Facility (GBIF). Data were aggregated using the NCBI Datasets command-line tools v.16.19.0 and rgbif v.3.8.0. GBIF data are used in accordance with the organization's Data User Agreement (https://www.gbif.org/terms/data-user).
GBIF Data
The georeferenced occurrence data used in the study, which are required to knit 01_analysis.Rmd, are available directly from GBIF via dedicated DOIs and landing pages. These downloads are: https://doi.org/10.15468/dl.59eyey (for 0013380-240626123714530.zip); https://doi.org/10.15468/dl.vybgce (for 0013381-240626123714530.zip); https://doi.org/10.15468/dl.jd85b2 (for 0013376-240626123714530.zip); https://doi.org/10.15468/dl.jbg6xy (for 0085284-240506114902167.zip); https://doi.org/10.15468/dl.zw99uq (for 0083881-240506114902167.zip); https://doi.org/10.15468/dl.5pkrsd (for 0085162-240506114902167.zip); https://doi.org/10.15468/dl.p4r6hf (for 0071470-240506114902167.zip); and https://doi.org/10.15468/dl.njrdnn (for 0083708-240506114902167.zip).
Local Files and Variables
mammals.tsv: Results of NCBI Datasets command-line tools query for metadata for all available Class Mammalia genome assemblies.
Variables
- Assembly Accession: NCBI accession number—a unique identifier used to locate and access genome assemblies.
- Assembly Name: Author-input name for assembly version.
- Annotation Name: NCBI-generated identifier for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Annotation Release Date: Release date for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Organism Name: Latin binomial or trinomial for the species the assembly was generated from.
squamates.tsv: Results of NCBI Datasets command-line tools query for metadata for all available Order Squamata genome assemblies.
Variables
- Assembly Accession: NCBI accession number—a unique identifier used to locate and access genome assemblies.
- Assembly Name: Author-input name for assembly version.
- Annotation Name: NCBI-generated identifier for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Annotation Release Date: Release date for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Organism Name: Latin binomial or trinomial for the species the assembly was generated from.
birds.tsv: Results of NCBI Datasets command-line tools query for metadata for all available Class Aves genome assemblies.
Variables
- Assembly Accession: NCBI accession number—a unique identifier used to locate and access genome assemblies.
- Assembly Name: Author-input name for assembly version.
- Annotation Name: NCBI-generated identifier for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Annotation Release Date: Release date for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Organism Name: Latin binomial or trinomial for the species the assembly was generated from.
amphibia.tsv: Results of NCBI Datasets command-line tools query for metadata for all available Class Amphibia genome assemblies.
Variables
- Assembly Accession: NCBI accession number—a unique identifier used to locate and access genome assemblies.
- Assembly Name: Author-input name for assembly version.
- Annotation Name: NCBI-generated identifier for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Annotation Release Date: Release date for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Organism Name: Latin binomial or trinomial for the species the assembly was generated from.
testudines.tsv: Results of NCBI Datasets command-line tools query for metadata for all available Order Testudines genome assemblies.
Variables
- Assembly Accession: NCBI accession number—a unique identifier used to locate and access genome assemblies.
- Assembly Name: Author-input name for assembly version.
- Annotation Name: NCBI-generated identifier for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Annotation Release Date: Release date for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Organism Name: Latin binomial or trinomial for the species the assembly was generated from.
tuatara.tsv: Results of NCBI Datasets command-line tools query for metadata for all available Order Rhynchocephalia genome assemblies.
Variables
- Assembly Accession: NCBI accession number—a unique identifier used to locate and access genome assemblies.
- Assembly Name: Author-input name for assembly version.
- Annotation Name: NCBI-generated identifier for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Annotation Release Date: Release date for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Organism Name: Latin binomial or trinomial for the species the assembly was generated from.
crocodylia.tsv: Results of NCBI Datasets command-line tools query for metadata for all available Order Crocodilia genome assemblies.
Variables
- Assembly Accession: NCBI accession number—a unique identifier used to locate and access genome assemblies.
- Assembly Name: Author-input name for assembly version.
- Annotation Name: NCBI-generated identifier for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Annotation Release Date: Release date for a genome assembly's annotation. Blank cells indicate N/A or not available (i.e., there is no associated annotation for this assembly).
- Organism Name: Latin binomial or trinomial for the species the assembly was generated from.
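The seven .tsv files share the five columns described above, so a single reader can serve all of them. The following Python sketch is illustrative only (the authors' own analysis is in 01_analysis.Rmd). It assumes each file has a header row spelled exactly like the variable names above, and it maps blank cells to `None` to reflect the "N/A or not available" convention. The function names and the sample rows are hypothetical placeholders, not records from the dataset.

```python
import csv
import io


def read_assembly_metadata(handle):
    """Parse tab-separated assembly metadata, mapping blank cells to None."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {key: (value if value else None) for key, value in row.items()}
        for row in reader
    ]


def species_with_genomes(rows):
    """Return the sorted, de-duplicated organism names found in the rows."""
    return sorted({row["Organism Name"] for row in rows if row["Organism Name"]})


# Placeholder rows for illustration only; accessions and names are made up.
SAMPLE = (
    "Assembly Accession\tAssembly Name\tAnnotation Name\t"
    "Annotation Release Date\tOrganism Name\n"
    "GCA_000000000.1\tplaceholder_v1\t\t\tGenus species\n"
    "GCF_000000000.2\tplaceholder_v2\tPlaceholder Annotation\t2024-01-01\tGenus species\n"
)

rows = read_assembly_metadata(io.StringIO(SAMPLE))
unannotated = [r["Assembly Accession"] for r in rows if r["Annotation Name"] is None]
```

To read one of the real files, the same function could be called on `open("mammals.tsv", newline="", encoding="utf-8")`, assuming the file sits in the working directory.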
congen_lit_review.csv: Data from the manual literature review.
Variables
- Authors: Authors of a given publication included in our review.
- Article Title: The title of the publication.
- Source Title: The journal the publication appeared in.
- DOI: The Digital Object Identifier for the publication.
- Publication Year: The year the article was published.
- First Author Origin: Geographic origin of the first author's affiliated institution, categorized as Global North, Global South, or Both.
- Senior Author Origin: Geographic origin of the senior (last) author's affiliated institution, categorized as Global North, Global South, or Both.
- Global South Middle Author?: Whether or not the study includes a middle author from the Global South, coded as Yes, No, or N/A as in not applicable (if no middle author).
- Sampled Focal Taxa Range: Geographic distribution of the study's focal taxa / taxon, coded as Global North, Global South, Both (Migratory), or Both (Shared Resident).
- Sequencing Method: DNA sequencing approach used by the study, assigned to the categories Sanger Sequencing, Microsatellites, Reduced Representation (e.g., UCEs or RADseq), WGS, or Other.
- Study Focus: Description of the focus of the study, categorized as Selection / GWAS / Runs of Homozygosity / Adaptive Potential / Gene Expression, Population Structure / Gene Flow / Diversity / Demographic History, or Identification / Sexing / Taxonomy / Systematics.
- Comments: Miscellaneous comments. Blank cells should be interpreted as N/A / Not Applicable.
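As a rough illustration of how the review table can be summarized, the Python sketch below counts studies by focal taxa range and sequencing method. It assumes the column headers match the variable names above and that cell values use the category labels as written (for example `WGS` or `Microsatellites`). The function name and the sample rows are hypothetical, and the real file contains more columns than the two shown here.

```python
import csv
import io
from collections import Counter


def tabulate_methods(handle):
    """Count studies per (focal taxa range, sequencing method) pair."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[(row["Sampled Focal Taxa Range"], row["Sequencing Method"])] += 1
    return counts


# Made-up rows for illustration; not taken from congen_lit_review.csv.
SAMPLE = (
    "Sampled Focal Taxa Range,Sequencing Method\n"
    "Global North,WGS\n"
    "Global North,WGS\n"
    "Global South,Microsatellites\n"
    "Global North,Reduced Representation\n"
    "Global South,WGS\n"
)

counts = tabulate_methods(io.StringIO(SAMPLE))
for (taxa_range, method), n in sorted(counts.items()):
    print(f"{taxa_range}\t{method}\t{n}")
```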
Code/software
ncbi.md: Markdown file with NCBI CLI commands used to generate .tsv genome assembly metadata files.
01_analysis.Rmd: Rmarkdown file that performs all analyses and generates all figures for the manuscript.
Access information
Other publicly accessible locations of the data:
- Personal GitHub: https://github.com/elinck/lat_grad_genome/
Data were derived from the following sources:
- GBIF. All GBIF occurrence records were associated with a CC BY-NC 4.0 license.
- NCBI Datasets. NCBI metadata used in the study are part of the public domain / available via an Open Database License.
We used the National Center for Biotechnology Information (NCBI) Datasets command-line tools v.16.19.0 (O’Leary et al. 2024) to download taxonomy metadata for the subset of species with an assembled reference genome in the following taxa: birds (Class: Aves), mammals (Class: Mammalia), squamates (Order: Squamata), amphibians (Class: Amphibia), turtles (Order: Testudines), crocodilians (Order: Crocodilia) and tuataras (Order: Rhynchocephalia). We selected these groups—together comprising extant tetrapods—to provide a snapshot of animal diversity in relatively well-studied clades with different ecologies and evolutionary histories, while restricting the total dataset to a computationally manageable size. From this initial list we retained species with an exact match to the Global Biodiversity Information Facility’s (GBIF) Backbone Taxonomy using rgbif v.3.8.0 (Chamberlain et al. 2024) and downloaded all observations of each backed by georeferenced voucher specimens in natural history museum collections (NHCs), excluding those without coordinates and those flagged for geospatial issues (n= 3,006,946). We repeated this process for all species in each higher-level taxon represented in our list of reference genomes (i.e., downloaded metadata for all georeferenced tetrapod specimens on GBIF; n= 9,303,258). DOIs for each download are available in the References section below (GBIF.org 2024a; GBIF.org 2024b; GBIF.org 2024c; GBIF.org 2024d; GBIF.org 2024e; GBIF.org 2024f; GBIF.org 2024g; GBIF.org 2024h).
Filtering these aggregated datasets to contain only species with 10 or more specimen records, we generated convex hull polygons for each as a coarse approximation of their geographic distribution using the R package sf v.1.0-16 (Pebesma 2018; Pebesma & Bivand 2023). Overlaying these on a shapefile of Earth’s landmasses from rnaturalearth v.1.0.1 (Massicotte & South 2024), we calculated species richness as the number of overlapping convex hulls in 2-degree x 2-degree grid cells, statistically standardizing this value by subtracting observed mean global species richness and dividing by its standard deviation. We subtracted the number of species with reference genomes from total species richness to determine the regions with the largest representation gap in genomic resources, again standardizing the difference. To assess the significance and slope of a correlation between species richness and the absolute value (or modulus) of latitude in decimal degrees, we performed simple linear regressions in R v.4.4.0 (R Core Team 2024), analyzing species with reference genomes and our full dataset separately.
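The standardization and regression steps described above can be sketched in a few lines. This Python example is a simplified stand-in for the authors' R workflow, not a reproduction of it: it skips the convex hull and grid overlay steps, uses made-up grid-cell values, and assumes the sample standard deviation (n - 1 denominator) for standardizing. It estimates only the slope and intercept and does not compute significance.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: n - 1 denominator
    return [(v - mean) / sd for v in values]


def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical grid cells: latitude of the cell centre, total species richness,
# and richness of species with reference genomes.
latitudes = [-45.0, -15.0, 1.0, 15.0, 45.0]
total_richness = [10, 40, 55, 40, 10]
genome_richness = [4, 8, 9, 9, 5]

abs_lat = [abs(lat) for lat in latitudes]
z_richness = standardize(total_richness)

# Representation gap: total richness minus richness with genomes, standardized.
z_gap = standardize([t - g for t, g in zip(total_richness, genome_richness)])

slope, intercept = simple_linear_regression(abs_lat, z_richness)
```

With these placeholder values the slope is negative, which is the direction expected under a latitudinal richness gradient; the numbers themselves carry no empirical meaning.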
To evaluate how the geography of authorship might impact the sequencing strategy of studies in conservation biology, we performed a restricted Web of Science literature search on 29 June 2024 for English-language conservation genetics papers published in the last five years in the journals Conservation Genetics, Molecular Ecology, Journal of Heredity, and Conservation Biology, selected for frequently publishing empirical work on non-model organisms. We used the queries ‘SO="Conservation Genetics"’ and ‘SO=("Molecular Ecology" OR "Journal of Heredity" OR "Conservation Biology") AND (TS="Conservation Genet*" OR KP="Conservation Genet*" OR TI="Conservation Genet*")’, excluding reviews, genome announcements, meta-analyses, preprints, and studies that were purely simulations. Our criteria aimed to achieve a tractable sample size for careful study (<1000 papers) while covering the period in which WGS became commonly used for the conservation genetics of non-model organisms (Fuentes-Pardo & Ruzzante 2017; Hohenlohe et al. 2021).
We then manually reviewed each study, first assigning the home institution of its first and last author to the Global North, Global South or both (i.e., joint affiliations) using the 2024 UN Trade and Development Classifications. Because the number of middle authors varied widely across our sample, we assessed their affiliations on a binary basis, indicating only whether a contributor from an institution from the Global South was present outside of the lead and senior positions. Synthesizing these data, we assigned papers to mutually exclusive groups based on whether they included one or more Global South authors or only Global North authors. Next, we categorized each study’s sequencing approach as reduced representation, WGS, Sanger sequencing, microsatellites, or other, and described its overall focus using tiered categories based on discussion in Bertola et al. 2024. These tiers were: 1) Taxonomy / systematics, identification, or sexing; 2) Phylogeography / population genetic structure, estimating genetic diversity, and inferring demographic history; and 3) Detecting outlier loci, quantifying runs of homozygosity, and evaluating adaptive potential. When studies employed more than one sequencing approach or addressed goals belonging to multiple tiers, we assigned them to a single category based on their most data-intensive method or question. To explore geographic patterns in sequencing effort, we assessed whether each study’s taxonomic sampling included 1) at least one species distributed in the Global South and 2) at least one species distributed in the Global North. Because some studies included multiple taxa and some species are broadly distributed or migrate between regions, these categories were not mutually exclusive. 
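Two of the coding rules above lend themselves to a short illustration: grouping papers by authorship and collapsing multi-goal studies to their highest tier. The Python sketch below uses the category labels from the variable descriptions. It rests on one assumption that the text does not state outright, namely that a joint affiliation coded as `Both` counts as Global South authorship. The function names are hypothetical.

```python
# Tier numbers follow the three study-focus tiers listed in the text.
FOCUS_TIERS = {
    "Identification / Sexing / Taxonomy / Systematics": 1,
    "Population Structure / Gene Flow / Diversity / Demographic History": 2,
    "Selection / GWAS / Runs of Homozygosity / Adaptive Potential / Gene Expression": 3,
}


def has_global_south_author(first_origin, senior_origin, middle_flag):
    """True if any lead, senior, or middle author is coded as Global South.

    Assumption: a joint affiliation ("Both") counts as Global South authorship.
    """
    lead_or_senior = any(
        origin in ("Global South", "Both")
        for origin in (first_origin, senior_origin)
    )
    return lead_or_senior or middle_flag == "Yes"


def author_group(first_origin, senior_origin, middle_flag):
    """Assign a paper to one of two mutually exclusive authorship groups."""
    if has_global_south_author(first_origin, senior_origin, middle_flag):
        return "One or more Global South authors"
    return "Global North authors only"


def most_intensive_focus(foci):
    """Collapse several study goals to the one in the highest tier."""
    return max(foci, key=lambda focus: FOCUS_TIERS[focus])


example_group = author_group("Global North", "Global North", "Yes")
example_focus = most_intensive_focus([
    "Identification / Sexing / Taxonomy / Systematics",
    "Population Structure / Gene Flow / Diversity / Demographic History",
])
```

The same "take the maximum" pattern would apply to sequencing approaches, but the text does not give an explicit ranking of methods by data intensity, so none is assumed here.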
To evaluate whether geographic representation in conservation genetics changed over the period covered by our review, we performed logistic regression using the stats package in R v.4.4.0, treating the presence or absence of an author from the Global South as a binary outcome variable and year as the sole independent variable.
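For readers who want to see the shape of this model, the Python sketch below fits a one-predictor logistic regression by Newton-Raphson. It is a minimal illustration under stated assumptions, not the authors' code: the source says only that the stats package in R was used, the year and outcome values are invented, and years are centred on the earliest year to keep the arithmetic stable. Standard errors and p-values are not computed.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2 x 2 system (information matrix) * delta = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Invented data: publication year and whether a Global South author is present.
years = [2020] * 4 + [2021] * 4 + [2022] * 4 + [2023] * 4 + [2024] * 4
has_gs_author = [0, 0, 0, 1,  0, 0, 1, 0,  0, 1, 1, 0,  1, 1, 0, 1,  1, 1, 1, 0]

centred = [year - min(years) for year in years]
intercept, slope = fit_logistic(centred, has_gs_author)
fitted = [1.0 / (1.0 + math.exp(-(intercept + slope * c))) for c in centred]
```

A positive slope in this toy example means the log-odds of including a Global South author rise with year; the real direction and magnitude depend on the review data.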
