Global patterns of nuclear and mitochondrial genetic diversity in marine fishes

Published Apr 29, 2024 on Dryad. https://doi.org/10.5061/dryad.8gtht76wn

Abstract

Genetic diversity is a fundamental component of biodiversity. Examination of global patterns of genetic diversity can help highlight mechanisms underlying species diversity, though a recurring challenge has been that patterns may vary by molecular marker. Here, we compiled 6862 observations of genetic diversity from 492 species of marine fish and tested among hypotheses for diversity gradients: the founder effect hypothesis, the kinetic energy hypothesis, and the productivity-diversity hypothesis. We fit generalized linear mixed effect models (GLMMs) and explored the extent to which various macroecological drivers (latitude, longitude, temperature (SST), and chlorophyll-a concentration) explained variation in genetic diversity. We found that mitochondrial genetic diversity followed geographic gradients similar to those of species diversity, being highest near the Equator, particularly in the Coral Triangle, while nuclear genetic diversity did not follow clear geographic patterns. Despite these differences, all genetic diversity metrics were correlated with chlorophyll-a concentration, while mitochondrial diversity was also positively associated with SST. Our results provide support for the kinetic energy hypothesis, which predicts that elevated mutation rates at higher temperatures increase mitochondrial but not necessarily nuclear diversity, and the productivity-diversity hypothesis, which posits that resource-rich regions support larger populations with greater genetic diversity. Overall, these findings reveal how environmental variables can influence mutation rates and genetic drift in the ocean, caution against using mitochondrial macro-genetic patterns as proxies for whole-genome diversity, and aid in defining global gradients of genetic diversity.

This repository provides the data and scripts for all analyses in the associated paper. Data were gathered from a literature search in the Web of Science.

A complete list of all necessary software and packages (with version numbers) can be found in the Software section of this README.

Data

A list of the files read into R scripts for analyses. These include the CSV files where data from the literature search was originally recorded. Cells with missing information are infilled with NA.

To respect the licenses and conditions of data reuse outlined by the data creators and managers of the primary data publications, and to limit data duplication, we have excluded the genetic diversity estimates (He, Hd, and pi) from each of the following datasets. All genetic diversity estimates are provided and freely accessible in the original publications (cited within each dataset), or are available upon request. Associated metadata are reported.
To reduce the potential to introduce unintended risk, we have masked the location information for vulnerable, endangered, or otherwise threatened species in our dataset. We have followed Dryad's guidelines for species data. A summary of each species' IUCN status can be found in spp_combined_info.csv.

Files generated by scripts are not included, with one exception: sharedstudies.csv, which is created by ID_shared_species.R.

Fishery lat msats.csv, msat_2011-2020.csv

These files are where microsatellite data were originally recorded. In these spreadsheets, each row holds data from a unique study/species/location/marker combination, so a study that recorded data from more than one species, location, and/or microsatellite marker occupies more than one row. These files are all read into assemble_data_msat.R. Information in the columns is as follows:

Column 1: Species scientific name
Column 2: Species common name
Column 3: Study (paper) the data were recorded from
Column 4: Indicates whether the study is a primer note or not (1 for yes, 0 for no)
Column 5: Country where the sample was taken
Column 6: Site where the sample was taken, as named by the paper author
Column 7: Degrees latitude (may be decimal degrees)
Column 8: Minutes latitude
Column 9: Seconds latitude
Column 10: Degrees longitude (may be decimal degrees)
Column 11: Minutes longitude
Column 12: Seconds longitude
Column 13: Year in which the samples were taken
Column 14: Number of microsatellite markers
Column 15: Name of the microsatellite marker, as listed in the paper
Column 16: Indicates whether the microsatellite was originally developed in a different species (1 if yes, 0 if no)
Column 17: Number of individuals sampled
Column 18: Length of the microsatellite repeat, in bases
Column 19 (for the msat_2011-2020_data.csv file only): A "Notes" column that contains any notes made by the recorder during the data collection process.
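
Because coordinates are stored as separate degree, minute, and second columns (columns 7-12), they have to be combined into decimal degrees before they can be mapped or modeled. The following is a minimal sketch of that conversion, not the code used in assemble_data_msat.R. The function name dms_to_decimal, the treatment of missing minutes or seconds as zero, and the assumption that the hemisphere is encoded in the sign of the degrees value are all illustrative.

```r
# Hypothetical helper: combine degree/minute/second columns into decimal degrees.
# Assumes the sign (N/S or E/W) is carried by the degrees value and that
# missing minutes or seconds (NA) can be treated as zero.
dms_to_decimal <- function(deg, min = 0, sec = 0) {
  min <- ifelse(is.na(min), 0, min)
  sec <- ifelse(is.na(sec), 0, sec)
  # Minutes and seconds extend the magnitude away from zero, so apply the sign last.
  sgn <- ifelse(!is.na(deg) & deg < 0, -1, 1)
  sgn * (abs(deg) + min / 60 + sec / 3600)
}

# Example with made-up coordinates
lat <- dms_to_decimal(c(12, -33, 45.5), c(30, 15, NA), c(0, 36, NA))
print(lat)
```

One limitation of this sketch: a coordinate such as 0 degrees 30 minutes south cannot carry its sign in the degrees value, so such rows would need separate handling.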

ppdat_2016-03-04wLL.csv

This file contains the raw data used in Pinsky & Palumbi (2014). It is also read into assemble_data_msat.R. Information in the columns is as follows:

Columns 1-8: Match columns 1-8 in Fishery lat msats.csv and msat_2011-2020_data.csv
Column 9: Degrees longitude (may be decimal degrees)
Column 10: Minutes longitude
Columns 11-16: Match columns 13-18 in Fishery lat msats.csv and msat_2011-2020_data.csv
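
Since ppdat_2016-03-04wLL.csv has no seconds columns for latitude or longitude, its layout has to be reconciled with the other two microsatellite files before the three can be stacked. One way to do this, sketched below with toy data, is to add the missing columns as NA and reorder. All column names and values here are hypothetical, and the approach is an assumption rather than a description of what assemble_data_msat.R does.

```r
# Toy stand-ins for the real files; all column names here are hypothetical.
msat <- data.frame(spp = "Gadus morhua", lat_deg = 60, lat_min = 30, lat_sec = 0,
                   lon_deg = 5, lon_min = 15, lon_sec = 30, n = 48)
ppdat <- data.frame(spp = "Clupea harengus", lat_deg = 55, lat_min = 10,
                    lon_deg = 10, lon_min = 45, n = 96)

# Add the columns ppdat lacks, filled with NA, then reorder to match msat.
missing_cols <- setdiff(names(msat), names(ppdat))
ppdat[missing_cols] <- NA_real_
ppdat <- ppdat[names(msat)]

# With identical column names and order, the tables can be stacked.
combined <- rbind(msat, ppdat)
print(combined)
```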

Fishery lat mtDNA Complete Database.csv, mtdna_2013-2020_data.csv, mtdna_2013-2020_data_1site.csv

These files are where mitochondrial DNA (mtDNA) data were originally recorded. In these spreadsheets, each row holds data from a unique study/species/location/locus combination, so a study that recorded data from more than one species, location, and/or mtDNA locus occupies more than one row. These files are all read into assemble_data_mtDNA.R. Information in the columns is as follows:

Column 1: Species scientific name
Column 2: Species common name
Column 3: Study (paper) the data were recorded from
Column 4: Country where the sample was taken
Column 5: Site where the sample was taken, as named by the paper author
Column 6: Degrees latitude (may be decimal degrees)
Column 7: Minutes latitude
Column 8: Seconds latitude
Column 9: Degrees longitude (may be decimal degrees)
Column 10: Minutes longitude
Column 11: Seconds longitude
Column 12: Year in which the sample was taken
Column 13: Name of the mtDNA locus, as listed in the paper
Column 14: Number of individuals sampled
Column 15: Length of the mtDNA locus, in bases
Column 16: Any notes made by the recorder during the data collection process

msat_to_match.csv

This file contains information on matching stock IDs (RAM Legacy fishery stocks) for some of the microsatellite data. This file is read into assemble_data_msat.R. Information in the columns is as follows:

Column 1: Species scientific name (of matching stock from the RAM database)
Column 2: Country of the matching stock from RAM database
Column 3: Name of the location of the matching stock from the RAM database
Column 4: Name of the stock from the RAM database
Column 5: Species scientific name
Column 6: Species common name
Column 7: Study (paper) the data were recorded from
Column 8: Indicates whether the study is a primer note or not (1 for yes, 0 for no)
Column 9: Year in which the samples were taken
Column 10: Number of microsatellite markers
Column 11: Name of the microsatellite marker, as listed in the paper
Column 12: Indicates whether the microsatellite was originally developed in a different species (1 if yes, 0 if no)
Column 13: Number of individuals sampled
Column 14: Length of the microsatellite repeat, in bases
Column 15: Name of the data file that contains data from the matching population
Column 16: Latitude of the sample (in decimal degrees)
Column 17: Longitude of the sample (in decimal degrees)

mtDNA_to_match.csv

This file contains information on matching stock IDs (RAM Legacy fishery stocks) for some of the mtDNA data. This file is read into assemble_data_mtDNA.R. Information in the columns is as follows:

Column 1: Species scientific name (of matching stock from the RAM database)
Column 2: Country of the matching stock from RAM database
Column 3: Name of the location of the matching stock from the RAM database
Column 4: Name of the stock from the RAM database
Column 5: Species scientific name
Column 6: Species common name
Column 7: Study (paper) the data were recorded from
Column 8: Year in which the samples were taken
Column 9: Name of the mtDNA locus, as listed in the paper
Column 10: Number of individuals sampled
Column 11: Length of the mtDNA locus, in bases
Column 12: Latitude of the sample (in decimal degrees)
Column 13: Longitude of the sample (in decimal degrees)

spp_combined_info.csv

This file contains the taxonomic information, IUCN status, and range extent information for all species in either (or both) the microsatellite and mitochondrial datasets. This file is read into bootstrap_hd.R, bootstrap_he.R, bootstrap_pi.R, msat_he_family_trends.R, msat_he_models.R, msat_he_predict.R, mtdna_hd_family_trends.R, mtdna_hd_models.R, mtdna_hd_predict.R, mtdna_pi_family_trends.R, mtdna_pi_models.R, and mtdna_pi_predict.R. Information in the columns is as follows:

Column 1: Species scientific name
Column 2: Species common name
Column 3: The genus the species belongs to
Column 4: The family the species belongs to
Column 5: The order the species belongs to
Column 6: The IUCN red list status of the species
Column 7: Whether the species is pelagic or coastal
Column 8: The latitude of the northernmost extent of the species range
Column 9: The latitude of the southernmost extent of the species range
Column 10: The longitude of the westernmost extent of the species range
Column 11: The longitude of the easternmost extent of the species range
Column 12: The total latitudinal breadth of the species range
Column 13: Half of the total latitudinal breadth of the species range
Column 14: The latitude of the central point (centroid) of the species range

sharedstudies.csv

This file contains the microsatellite and mtDNA genetic diversity observations for populations where both nuclear and mitochondrial genetic diversity were recorded. This file is both created by and read into ID_shared_species.R. Information in the columns is as follows:

Column 1: Species scientific name
Column 2: Species common name
Column 3: Study (paper) the data were recorded from
Column 4: Country where the samples were taken
Column 5: Site where the samples were taken, as named by the paper author
Column 6: Latitude of the samples (in decimal degrees)
Column 7: Longitude of the samples (in decimal degrees)
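
Identifying populations with both marker types amounts to an inner join of the two datasets on their shared identifying columns. The following is a minimal sketch with toy data, assuming that species, study, and site together identify a population. The column names and diversity values are made up, and ID_shared_species.R may match records differently.

```r
# Toy data; column names and values are hypothetical.
msat <- data.frame(spp = c("A", "A", "B"), source = c("s1", "s1", "s2"),
                   site = c("x", "y", "z"), he = c(0.7, 0.8, 0.6))
mtdna <- data.frame(spp = c("A", "B", "C"), source = c("s1", "s2", "s3"),
                    site = c("x", "z", "w"), hd = c(0.9, 0.5, 0.4))

# Inner join: keep only species/study/site combinations present in both tables
shared <- merge(msat, mtdna, by = c("spp", "source", "site"))
print(shared)
```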

Code

A list of the R scripts used for analyses.

assemble_data_msat.R
Cleans and assembles Fishery lat msats.csv, msat_2011-2020.csv, and ppdat_2016-03-04wLL.csv into one cohesive data frame.

assemble_data_mtDNA.R
Cleans and assembles Fishery lat mtDNA Complete Database.csv, mtdna_2013-2020_data.csv, and mtdna_2013-2020_data_1site.csv into one cohesive data frame.

bootstrap_hd.R
Bootstraps mtDNA Hd models to calculate 95% confidence intervals for model coefficients.

bootstrap_he.R
Bootstraps nuclear microsatellite He models to calculate 95% confidence intervals for model coefficients.

bootstrap_pi.R
Bootstraps mtDNA pi models to calculate 95% confidence intervals for model coefficients.

coefficient_bootstrap_cis.R
Reads in output from the bootstrap scripts (bootstrap_hd.R, bootstrap_he.R, and bootstrap_pi.R) to create supplemental figures summarizing results.

ID_shared_species.R
Identifies populations/species where both mtDNA and nuclear (microsatellite) genetic diversity were measured.

maps.R
Creates maps of the global distribution of all genetic diversity estimates, as well as mean chlorophyll-a concentration and SST (and a correlation plot between the two).

msat_he_family_trends.R
Plots marginal effects (relationship between microsatellite He and a given predictor variable) for a subset of 10 families for all nuclear (microsatellite) He models.

msat_he_models.R
Code for all nuclear (microsatellite) He models.

msat_he_predict.R
Plots marginal effects (relationship between microsatellite He and a given predictor variable) for all nuclear (microsatellite) He models.

mtdna_hd_family_trends.R
Plots marginal effects (relationship between mtDNA Hd and a given predictor variable) for a subset of 10 families for all mtDNA Hd models.

mtdna_hd_models.R
Code for all mtDNA Hd models.

mtdna_hd_predict.R
Plots marginal effects (relationship between mtDNA Hd and a given predictor variable) for all mtDNA Hd models.

mtdna_pi_family_trends.R
Plots marginal effects (relationship between mtDNA pi and a given predictor variable) for a subset of 10 families for all mtDNA pi models.

mtdna_pi_models.R
Code for all mtDNA pi models.

mtdna_pi_predict.R
Plots marginal effects (relationship between mtDNA pi and a given predictor variable) for all mtDNA pi models.

pull_env_data.R
Pulls corresponding environmental data from Bio-ORACLE for populations in either (or both) the nuclear (microsatellite) and mtDNA datasets.

pull_range_data.R
Pulls range extent data from FishBase (i.e., AquaMaps) for all species in either (or both) the nuclear (microsatellite) and mtDNA datasets.
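
For readers unfamiliar with bootstrapped confidence intervals, the idea behind the bootstrap_*.R scripts can be illustrated in a much simpler setting. The sketch below uses a case-resampling bootstrap of an ordinary linear model fitted to simulated data and reports a percentile 95% interval for the slope. It is not the procedure used in this repository, whose scripts work with mixed models and may resample differently. The variable names and simulated values are assumptions.

```r
set.seed(1)

# Simulated stand-in data: a diversity-like response and one predictor
n <- 200
dat <- data.frame(sst = runif(n, 0, 30))
dat$div <- 0.5 + 0.01 * dat$sst + rnorm(n, sd = 0.05)

# Refit the model on rows resampled with replacement and keep the slope
boot_slope <- replicate(500, {
  idx <- sample(n, replace = TRUE)
  coef(lm(div ~ sst, data = dat[idx, ]))[["sst"]]
})

# Percentile 95% confidence interval for the slope
ci <- quantile(boot_slope, c(0.025, 0.975))
print(ci)
```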

Software

Necessary Software

R (v.4.2.2 or above)
RStudio (v.2022.12.0 or above)

Necessary R Packages

data.table (v.1.14.8)
DHARMa (v.0.4.6)
glmmTMB (v.1.1.7)
here (v.1.0.1)
leaflet (v.2.1.2)
lme4 (v.1.1-31)
MuMIn (v.1.47.5)
performance (v.0.10.4)
raster (v.3.6-20)
rfishbase (v.4.1.1)
scales (v.1.2.1)
sdmpredictors (v.0.2.14)
sf (v.1.0.13)
sjPlot (v.2.8.12)
splines (v.4.2.2)
tidyverse (v.2.0.0)
tmap (v.3.3.4)
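
Before running the scripts, it can help to confirm that the listed packages are installed. The sketch below is one way to do that, using a hypothetical helper named check_packages and only three entries from the list above for brevity. Treating the listed versions as minimums is an assumption; this README states "or above" only for R and RStudio.

```r
# Hypothetical helper: report whether each package is installed and at least
# the listed version. Only a few entries are shown; extend with the full list.
required <- c(data.table = "1.14.8", glmmTMB = "1.1.7", lme4 = "1.1-31")

check_packages <- function(pkgs) {
  # requireNamespace() returns FALSE (rather than erroring) if a package is absent
  installed <- vapply(names(pkgs), requireNamespace, logical(1), quietly = TRUE)
  ok <- installed
  for (p in names(pkgs)[installed]) {
    ok[[p]] <- utils::packageVersion(p) >= pkgs[[p]]
  }
  data.frame(package = names(pkgs), required = unname(pkgs),
             installed = unname(installed), version_ok = unname(ok))
}

print(check_packages(required))
```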

Data Filtering Steps

These documents catalog the filtering steps and decisions applied to the studies returned by the literature search.

General outline of the steps:

Microsatellite Web of Science search (conducted on 2011.12.11 & 2020.01.05) with the following keywords: fish, microsatellite, marine OR ocean OR sea. mtDNA Web of Science search (conducted on 2013.01.29 & 2020.01.05) with the following keywords: fish, mtDNA, marine OR ocean OR sea.
Filtered titles and abstracts by hand and recorded data from each study following the guidelines listed in msat_data_entry_instructions.md or mtDNA_data_entry_instructions.md. Notes on filtering can be found in (1) undergrad_data_entry_notes_2012-2015.md, (2) rene_data_entry_notes_2019-2020.md, or (3) marial_data_entry_notes_2020.md.
A complete list of references for all studies included in our datasets can be found in Data_Sources.pdf.