Global patterns of nuclear and mitochondrial genetic diversity in marine fishes
Data files
Apr 29, 2024 version files (6.15 MB total)
- Fishery_lat_msats_000_2015-05-20_SSP.csv (143.31 KB)
- Fishery_lat_msats_001_2015-07-04_MLP.csv (319.53 KB)
- Fishery_lat_msats_002_2015-08-08_SSP.csv (12.99 KB)
- Fishery_lat_msats_100_2015-08-20_MLP.csv (139.07 KB)
- Fishery_lat_msats_101_2015-07-17_SSP.csv (101.89 KB)
- Fishery_lat_msats_200_2015-02-10_MLP.csv (17.96 KB)
- Fishery_lat_msats_201_2015-10-13_MLP.csv (49.81 KB)
- Fishery_lat_mtDNA_Complete_Database.csv (217.62 KB)
- msat_2011-2020_data_2_302.csv (458.33 KB)
- msat_2011-2020_data_3_303.csv (393.37 KB)
- msat_2011-2020_data_301.csv (528.83 KB)
- msat_2011-2020_data_4_304.csv (327.01 KB)
- msat_2011-2020_data_marial.csv (570.26 KB)
- msat_to_match.csv (1.42 MB)
- mtdna_2013-2020_data_1site.csv (8.82 KB)
- mtdna_2013-2020_data.csv (132.98 KB)
- mtdna_to_match.csv (58.92 KB)
- ppdat_2016-03-04wLL.csv (1.14 MB)
- README.md (13.40 KB)
- sharedstudies.csv (31.38 KB)
- spp_combined_info.csv (61.36 KB)
Abstract
Genetic diversity is a fundamental component of biodiversity. Examination of global patterns of genetic diversity can help highlight mechanisms underlying species diversity, though a recurring challenge has been that patterns may vary by molecular marker. Here, we compiled 6862 observations of genetic diversity from 492 species of marine fish and tested among hypotheses for diversity gradients: the founder effect hypothesis, the kinetic energy hypothesis, and the productivity-diversity hypothesis. We fit generalized linear mixed effect models (GLMMs) and explored the extent to which various macroecological drivers (latitude, longitude, temperature (SST), and chlorophyll-a concentration) explained variation in genetic diversity. We found that mitochondrial genetic diversity followed geographic gradients similar to those of species diversity, being highest near the Equator, particularly in the Coral Triangle, while nuclear genetic diversity did not follow clear geographic patterns. Despite these differences, all genetic diversity metrics were correlated with chlorophyll-a concentration, while mitochondrial diversity was also positively associated with SST. Our results provide support for the kinetic energy hypothesis, which predicts that elevated mutation rates at higher temperatures increase mitochondrial but not necessarily nuclear diversity, and the productivity-diversity hypothesis, which posits that resource-rich regions support larger populations with greater genetic diversity. Overall, these findings reveal how environmental variables can influence mutation rates and genetic drift in the ocean, caution against using mitochondrial macro-genetic patterns as proxies for whole-genome diversity, and aid in defining global gradients of genetic diversity.
This repository provides the data and scripts for all analyses in the associated paper. Data were gathered from a literature search in the Web of Science.
A complete list of all necessary software and packages (with version numbers) can be found in the Software section of this README.
Data
A list of the files read into R scripts for analyses. These include the CSV files where data from the literature search were originally recorded. Cells with missing information are filled with NA.
- To respect the licenses and conditions of data reuse outlined by the data creators and managers of the primary data publications, and to limit data duplication, we have excluded the genetic diversity estimates (He, Hd, and pi) from each of the following datasets. All genetic diversity estimates are provided and freely accessible in the original publications (cited within each dataset), or upon request. Associated metadata are reported.
- To reduce the potential to introduce unintended risk, we have masked the location information for vulnerable, endangered, or otherwise threatened species in our dataset. We have followed Dryad’s guidelines for species data. A summary of each species’ IUCN status can be found in spp_combined_info.csv.
Files generated by scripts are not included, with one exception (sharedstudies.csv).
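Because missing cells are written as NA, any reader of these CSV files has to map that marker to a missing value. The repository's own scripts are written in R; the block below is only an illustrative Python sketch of the idea. The function name `read_rows`, the header names, and the toy rows are hypothetical and do not come from the data files.

```python
import csv
import io


def read_rows(handle, missing="NA"):
    """Read a CSV into a list of dicts, mapping the missing-value marker to None."""
    reader = csv.DictReader(handle)
    return [
        {key: (None if value == missing else value) for key, value in row.items()}
        for row in reader
    ]


# Toy stand-in for one of the data files; header names and rows are hypothetical.
sample = io.StringIO(
    "species,country,lat_deg\n"
    "Species A,Country X,60.5\n"
    "Species B,NA,NA\n"
)
rows = read_rows(sample)
```

Values are left as strings here, so numeric columns would still need converting after the NA cells are handled.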
Fishery_lat_msats_*.csv, msat_2011-2020_data*.csv
These files are where microsatellite data were originally recorded. In these spreadsheets, studies are assigned to rows. If a study recorded data from more than one species, location, and/or microsatellite marker, it was given more than one row (i.e., each row represents data from a unique study/species/location/marker combination). These files are all read into assemble_data_msat.R. Information in the columns is as follows:
- Column 1: Species scientific name
- Column 2: Species common name
- Column 3: Study (paper) the data were recorded from
- Column 4: Indicates whether the study is a primer note or not (1 for yes, 0 for no)
- Column 5: Country where the sample was taken
- Column 6: Site where the sample was taken, as named by the paper author
- Column 7: Degrees latitude (may be decimal degrees)
- Column 8: Minutes latitude
- Column 9: Seconds latitude
- Column 10: Degrees longitude (may be decimal degrees)
- Column 11: Minutes longitude
- Column 12: Seconds longitude
- Column 13: Year in which the samples were taken
- Column 14: Number of microsatellite markers
- Column 15: Name of the microsatellite marker, as listed in the paper
- Column 16: Indicates whether the microsatellite was originally developed in a different species (1 if yes, 0 if no)
- Column 17: Number of individuals sampled
- Column 18: Length of the microsatellite repeat, in bases
- Column 19 (for the msat_2011-2020_data*.csv files only): A “Notes” column containing any notes made by the recorder during the data collection process.
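Columns 7-12 store coordinates as separate degrees, minutes, and seconds, so they must be combined before any spatial analysis. The following Python sketch shows one way to do that. It assumes the hemisphere is encoded in the sign of the degrees value, which is an assumption rather than something stated in this README, and the function name `to_decimal_degrees` is hypothetical.

```python
def to_decimal_degrees(degrees, minutes=None, seconds=None):
    """Combine degrees, minutes, and seconds into signed decimal degrees.

    Missing minutes or seconds (None) count as zero. The sign is taken from
    the degrees value (an assumption about how hemispheres are recorded).
    """
    sign = -1.0 if degrees < 0 else 1.0
    total = abs(degrees) + (minutes or 0.0) / 60.0 + (seconds or 0.0) / 3600.0
    return sign * total
```

A value already in decimal degrees passes through unchanged when minutes and seconds are missing. A coordinate between 0 and -1 degrees cannot carry its sign in an integer degrees field, so such rows would need separate handling.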
ppdat_2016-03-04wLL.csv
This file contains the raw data used in Pinsky & Palumbi (2014). It is also read into assemble_data_msat.R. Information in the columns is as follows:
- Columns 1-8: Match columns 1-8 in Fishery_lat_msats_*.csv and msat_2011-2020_data*.csv
- Column 9: Degrees longitude (may be decimal degrees)
- Column 10: Minutes longitude
- Columns 11-16: Match columns 13-18 in Fishery_lat_msats_*.csv and msat_2011-2020_data*.csv
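The column correspondence listed above can be written as a small lookup, which may help when aligning this file with the other microsatellite files. This is a Python sketch of the mapping only, not the logic used in assemble_data_msat.R, and the function name is hypothetical. Column numbers are 1-based, as in the lists above.

```python
def ppdat_to_msat_column(ppdat_col):
    """Map a 1-based ppdat column number to the matching msat column number."""
    if 1 <= ppdat_col <= 8:
        return ppdat_col        # columns 1-8 are shared
    if ppdat_col in (9, 10):
        return ppdat_col + 1    # longitude degrees and minutes are msat columns 10-11
    if 11 <= ppdat_col <= 16:
        return ppdat_col + 2    # ppdat columns 11-16 match msat columns 13-18
    raise ValueError(f"ppdat has no column {ppdat_col}")


mapping = {col: ppdat_to_msat_column(col) for col in range(1, 17)}
```

The two msat columns with no counterpart here are 9 and 12 (seconds latitude and seconds longitude).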
Fishery_lat_mtDNA_Complete_Database.csv, mtdna_2013-2020_data.csv, mtdna_2013-2020_data_1site.csv
These files are where mitochondrial DNA (mtDNA) data were originally recorded. In these spreadsheets, studies are assigned to rows. If a study recorded data from more than one species, location, and/or mtDNA locus, it was given more than one row (i.e., each row represents data from a unique study/species/location/locus combination). These files are all read into assemble_data_mtDNA.R. Information in the columns is as follows:
- Column 1: Species scientific name
- Column 2: Species common name
- Column 3: Study (paper) the data were recorded from
- Column 4: Country where the sample was taken
- Column 5: Site where the sample was taken, as named by the paper author
- Column 6: Degrees latitude (may be decimal degrees)
- Column 7: Minutes latitude
- Column 8: Seconds latitude
- Column 9: Degrees longitude (may be decimal degrees)
- Column 10: Minutes longitude
- Column 11: Seconds longitude
- Column 12: Year in which the sample was taken
- Column 13: Name of the mtDNA locus, as listed in the paper
- Column 14: Number of individuals sampled
- Column 15: Length of the mtDNA locus, in bases
- Column 16: Any notes made by the recorder during the data collection process
msat_to_match.csv
This file contains information on matching stock IDs (RAM Legacy fishery stocks) for some of the microsatellite data. This file is read into assemble_data_msat.R. Information in the columns is as follows:
- Column 1: Species scientific name (of matching stock from the RAM database)
- Column 2: Country of the matching stock from RAM database
- Column 3: Name of the location of the matching stock from the RAM database
- Column 4: Name of the stock from the RAM database
- Column 5: Species scientific name
- Column 6: Species common name
- Column 7: Study (paper) the data were recorded from
- Column 8: Indicates whether the study is a primer note or not (1 for yes, 0 for no)
- Column 9: Year in which the samples were taken
- Column 10: Number of microsatellite markers
- Column 11: Name of the microsatellite marker, as listed in the paper
- Column 12: Indicates whether the microsatellite was originally developed in a different species (1 if yes, 0 if no)
- Column 13: Number of individuals sampled
- Column 14: Length of the microsatellite repeat, in bases
- Column 15: Name of the data file that contains data from the matching population
- Column 16: Latitude of the sample (in decimal degrees)
- Column 17: Longitude of the sample (in decimal degrees)
mtdna_to_match.csv
This file contains information on matching stock IDs (RAM Legacy fishery stocks) for some of the mtDNA data. This file is read into assemble_data_mtDNA.R. Information in the columns is as follows:
- Column 1: Species scientific name (of matching stock from the RAM database)
- Column 2: Country of the matching stock from RAM database
- Column 3: Name of the location of the matching stock from the RAM database
- Column 4: Name of the stock from the RAM database
- Column 5: Species scientific name
- Column 6: Species common name
- Column 7: Study (paper) the data were recorded from
- Column 8: Year in which the samples were taken
- Column 9: Name of the mtDNA locus, as listed in the paper
- Column 10: Number of individuals sampled
- Column 11: Length of the mtDNA locus, in bases
- Column 12: Latitude of the sample (in decimal degrees)
- Column 13: Longitude of the sample (in decimal degrees)
spp_combined_info.csv
This file contains the taxonomic information, IUCN status, and range extent information for all species in either (or both) the microsatellite and mitochondrial datasets. This file is read into bootstrap_hd.R, bootstrap_he.R, bootstrap_pi.R, msat_he_family_trends.R, msat_he_models.R, msat_he_predict.R, mtdna_hd_family_trends.R, mtdna_hd_models.R, mtdna_hd_predict.R, mtdna_pi_family_trends.R, mtdna_pi_models.R, and mtdna_pi_predict.R. Information in the columns is as follows:
- Column 1: Species scientific name
- Column 2: Species common name
- Column 3: The genus the species belongs to
- Column 4: The family the species belongs to
- Column 5: The order the species belongs to
- Column 6: The IUCN red list status of the species
- Column 7: Whether the species is pelagic or coastal
- Column 8: The latitude of the northernmost extent of the species range
- Column 9: The latitude of the southernmost extent of the species range
- Column 10: The longitude of the westernmost extent of the species range
- Column 11: The longitude of the easternmost extent of the species range
- Column 12: The total latitudinal breadth of the species range
- Column 13: Half of the total latitudinal breadth of the species
- Column 14: The latitude of the central point (centroid) of the species range
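Columns 12 and 13 appear to follow from columns 8 and 9. The Python sketch below assumes that the total latitudinal breadth is simply the northern limit minus the southern limit; the README does not state how the values were calculated, so treat this as an illustration. The function name is hypothetical. Column 14 is not computed here, because a range centroid need not coincide with the midpoint of the two limits.

```python
def latitudinal_breadth(north_lat, south_lat):
    """Return (total breadth, half breadth) in degrees of latitude.

    Assumes breadth = northernmost latitude - southernmost latitude.
    """
    if north_lat < south_lat:
        raise ValueError("northern limit must not lie south of the southern limit")
    breadth = north_lat - south_lat
    return breadth, breadth / 2.0
```

For example, a range running from 10 degrees south (-10) to 40 degrees north (40) gives a breadth of 50 degrees and a half breadth of 25 degrees.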
sharedstudies.csv
This file contains the microsatellite and mtDNA genetic diversity observations for populations where both nuclear and mitochondrial genetic diversity were recorded. This file is both created by and read into ID_shared_species.R. Information in the columns is as follows:
- Column 1: Species scientific name
- Column 2: Species common name
- Column 3: Study (paper) the data were recorded from
- Column 4: Country where the samples were taken
- Column 5: Site where the samples were taken, as named by the paper author
- Column 6: Latitude of the samples (in decimal degrees)
- Column 7: Longitude of the samples (in decimal degrees)
Code
A list of the R scripts used for analyses.
assemble_data_msat.R
Cleans and assembles Fishery_lat_msats_*.csv, msat_2011-2020_data*.csv, and ppdat_2016-03-04wLL.csv into one cohesive data frame.
assemble_data_mtDNA.R
Cleans and assembles Fishery_lat_mtDNA_Complete_Database.csv, mtdna_2013-2020_data.csv, and mtdna_2013-2020_data_1site.csv into one cohesive data frame.
bootstrap_hd.R
Bootstraps mtDNA Hd models to calculate 95% confidence intervals for model coefficients.
bootstrap_he.R
Bootstraps nuclear microsatellite He models to calculate 95% confidence intervals for model coefficients.
bootstrap_pi.R
Bootstraps mtDNA pi models to calculate 95% confidence intervals for model coefficients.
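For readers unfamiliar with bootstrapped confidence intervals, the idea behind the three bootstrap scripts can be illustrated with a much simpler model. The Python sketch below computes a percentile bootstrap interval for an ordinary least-squares slope. It is a swapped-in toy: the scripts themselves bootstrap the mixed models in R, and the resampling scheme, function names, seed, and toy data here are all assumptions made for illustration.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_slope_ci(xs, ys, n_boot=1000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for the OLS slope."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        # Resample observations (rows) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # slope is undefined when every resampled x is identical
        slopes.append(ols_slope(bx, by))
    slopes.sort()
    lower = slopes[int((alpha / 2) * len(slopes))]
    upper = slopes[int((1 - alpha / 2) * len(slopes)) - 1]
    return lower, upper


# Toy data: a slope of 2 plus small deterministic noise (hypothetical values).
xs = list(range(30))
ys = [2.0 * x + 0.05 * ((7 * x) % 5 - 2) for x in xs]
lower, upper = bootstrap_slope_ci(xs, ys)
```

With grouped data such as these (many observations per species or study), resampling individual rows ignores the grouping structure, so a real analysis would need a scheme that respects the random effects.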
coefficient_bootstrap_cis.R
Reads in output from the bootstrap_*.R files (bootstrap_hd.R, bootstrap_he.R, and bootstrap_pi.R) to create supplemental figures summarizing results.
ID_shared_species.R
Identifies populations/species where both mtDNA and nuclear (microsatellite) genetic diversity were measured.
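Conceptually, this step is an intersection of the two datasets on whatever fields identify a population. The Python sketch below shows that idea with sets. The key fields (species, study, site), the function name, and the toy rows are hypothetical; the R script may match populations on different columns.

```python
def shared_populations(msat_rows, mtdna_rows, key_fields=("species", "study", "site")):
    """Return the sorted population keys present in both datasets."""
    def keys(rows):
        return {tuple(row[field] for field in key_fields) for row in rows}
    return sorted(keys(msat_rows) & keys(mtdna_rows))


# Toy rows; the values are placeholders, not records from the data files.
msat = [
    {"species": "Species A", "study": "Study 1", "site": "Site X"},
    {"species": "Species B", "study": "Study 2", "site": "Site Y"},
]
mtdna = [
    {"species": "Species A", "study": "Study 1", "site": "Site X"},
    {"species": "Species C", "study": "Study 3", "site": "Site Z"},
]
shared = shared_populations(msat, mtdna)
```

Exact string matching is fragile when site names are spelled differently across files, so some cleaning of the key fields would usually come first.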
maps.R
Creates maps of the global distribution of all genetic diversity estimates, as well as mean chlorophyll-a concentration and SST (and a correlation plot between the two).
msat_he_family_trends.R
Plots marginal effects (relationship between microsatellite He and a given predictor variable) for a subset of 10 families for all nuclear (microsatellite) He models.
msat_he_models.R
Code for all nuclear (microsatellite) He models.
msat_he_predict.R
Plots marginal effects (relationship between microsatellite He and a given predictor variable) for all nuclear (microsatellite) He models.
mtdna_hd_family_trends.R
Plots marginal effects (relationship between mtDNA Hd and a given predictor variable) for a subset of 10 families for all mtDNA Hd models.
mtdna_hd_models.R
Code for all mtDNA Hd models.
mtdna_hd_predict.R
Plots marginal effects (relationship between mtDNA Hd and a given predictor variable) for all mtDNA Hd models.
mtdna_pi_family_trends.R
Plots marginal effects (relationship between mtDNA pi and a given predictor variable) for a subset of 10 families for all mtDNA pi models.
mtdna_pi_models.R
Code for all mtDNA pi models.
mtdna_pi_predict.R
Plots marginal effects (relationship between mtDNA pi and a given predictor variable) for all mtDNA pi models.
pull_env_data.R
Pulls corresponding environmental data from Bio-ORACLE for populations in either (or both) the nuclear (microsatellite) and mtDNA datasets.
pull_range_data.R
Pulls range extent data from FishBase (i.e., AquaMaps) for all species in either (or both) the nuclear (microsatellite) and mtDNA datasets.
Software
Necessary Software
- R (v.4.2.2 or above)
- RStudio (v.2022.12.0 or above)
Necessary R Packages
- data.table (v.1.14.8)
- DHARMa (v.0.4.6)
- glmmTMB (v.1.1.7)
- here (v.1.0.1)
- leaflet (v.2.1.2)
- lme4 (v.1.1-31)
- MuMIn (v.1.47.5)
- performance (v.0.10.4)
- raster (v.3.6-20)
- rfishbase (v.4.1.1)
- scales (v.1.2.1)
- sdmpredictors (v.0.2.14)
- sf (v.1.0.13)
- sjPlot (v.2.8.12)
- splines (v.4.2.2)
- tidyverse (v.2.0.0)
- tmap (v.3.3.4)
Data Filtering Steps
Documents cataloging the filtering steps and decisions applied to the studies returned by the literature search.
General outline of the steps:
- Microsatellite Web of Science search (conducted on 2011.12.11 & 2020.01.05) with the following keywords: fish, microsatellite, marine OR ocean OR sea. mtDNA Web of Science search (conducted on 2013.01.29 & 2020.01.05) with the following keywords: fish, mtDNA, marine OR ocean OR sea.
- Filtered title/abstract by hand and recorded data from each study following the guidelines listed in msat_data_entry_instructions.md or mtDNA_data_entry_instructions.md. Notes on filtering can be found in (1) undergrad_data_entry_notes_2012-2015.md, (2) rene_data_entry_notes_2019-2020.md, or (3) marial_data_entry_notes_2020.md.
- A complete list of references for all studies included in our datasets can be found in Data_Sources.pdf.
This dataset was collected via a literature search in Web of Science with the following keywords: fish microsatellite (marine OR ocean OR sea) and fish mtDNA (marine OR ocean OR sea). Only studies published before January 5, 2020, were included.
We retained studies that recorded genetic diversity (mitochondrial or nuclear) from wild marine fish populations. To do this, we first conducted a title/abstract filter by hand and then recorded data for each paper, following the guidelines listed in either msat_data_entry_instructions.md or mtDNA_data_entry_instructions.md.