Complex patterns of genetic population structure in the mouthbrooding marine catfish, Bagre marinus, in the Gulf of Mexico and U.S. Atlantic
Data files
May 05, 2025 version files (28.82 MB)
BMA_by_pop_all-outl_genepop.gen: 351.69 KB
BMA_by_pop_genepop.gen: 14.40 MB
BMA_by_pop_only-neutral_genepop.gen: 14.05 MB
README.md: 2.13 KB
SampleInfo.txt: 17.30 KB
Abstract
Patterns of genetic variation reflect interactions among microevolutionary forces that vary in strength with changing demography. For marine species, these patterns are often interpreted under the expectation that larval movement drives connectivity, because most marine species exhibit broadcast spawning dispersal strategies. Here, patterns of variation within and among samples of the mouthbrooding gafftopsail catfish (Bagre marinus, Family Ariidae) captured in the U.S. Atlantic and throughout the Gulf of Mexico were analyzed using genomics to generate neutral and non-neutral SNP data sets. Because genomic resources are lacking for ariids, linkage disequilibrium network analysis was used to examine patterns of putatively adaptive variation. Finally, historical demographic parameters were estimated from site frequency spectra. The results show four differentiated groups, corresponding to the (1) U.S. Atlantic and the (2) northeastern, (3) northwestern, and (4) southern Gulf of Mexico. Patterns of genetic variation for the neutral data resemble those of other fishes that use the same estuarine habitats as nurseries, regardless of the presence or absence of a dispersive larval phase, supporting the idea that adult/juvenile behavior and habitat are important predictors of contemporary patterns of genetic structure. The non-neutral data presented two contrasting signals of structure, one due to increases in diversity moving west to east and north to south, and another due to increased heterozygosity in the Atlantic. Demographic analysis suggested that a recently reduced long-term effective population size in the Atlantic is likely an important driver of patterns of genetic variation and is consistent with a known reduction in population size potentially due to an epizootic.
https://doi.org/10.5061/dryad.nvx0k6f0n
The three genepop files consist of a complete dataset as well as the dataset split into neutral and outlier loci. A metadata text file is also included to describe the catch locations of the samples.
Description of the data and file structure
The complete genepop dataset (BMA_by_pop_genepop.gen) consists of 367 gafftopsail catfish (Bagre marinus, Family Ariidae) samples genotyped at 5,554 microhaplotype loci (14,682 SNPs), collected from the western North Atlantic Ocean in an area ranging from Indian River Lagoon, Florida, to the Bay of Campeche, Mexico. The neutral genepop file (BMA_by_pop_only-neutral_genepop.gen) is a subset of the complete file that contains 5,421 loci, while the outlier genepop file (BMA_by_pop_all-outl_genepop.gen) contains the remaining 133 loci. The sample names in these files consist of a species indicator, library designation, ligation well, and sample name. The sample names are included in the tab-delimited metadata file (SampleInfo.txt) along with the catch coordinates, population designation, and estuary in which each sample was collected. Missing data in the genepop files are designated with “000000” because alleles are coded with three digits.
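As a rough illustration of the three-digit allele coding described above, the following Python sketch splits one individual line of a genepop file into a sample name and a list of genotypes, treating “000000” as missing. The function names and the example line (including its sample name) are hypothetical and are not taken from the deposited files.

```python
def parse_genepop_line(line):
    """Split one genepop individual line into (sample_name, genotypes).

    Assumes the usual genepop layout: sample name, a comma, then one
    six-digit genotype per locus (two three-digit alleles).
    """
    name, _, genotype_field = line.partition(",")
    genotypes = []
    for token in genotype_field.split():
        if len(token) != 6 or not token.isdigit():
            raise ValueError(f"expected a six-digit genotype, got {token!r}")
        if token == "000000":
            genotypes.append(None)  # missing genotype
        else:
            genotypes.append((token[:3], token[3:]))
    return name.strip(), genotypes


def missing_fraction(genotypes):
    """Fraction of loci with a missing genotype for one individual."""
    return sum(g is None for g in genotypes) / len(genotypes)


# Hypothetical example line; real sample names follow the
# species/library/well/sample pattern described above.
example = "BMA_Lib1_A01_Sample001 ,  001002 000000 003003 001001"
name, gts = parse_genepop_line(example)
```

Keeping alleles as strings preserves the original codes exactly; converting them to integers would also work if numeric codes are preferred.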
Metadata Key
SAMPLE_ID: The name assigned to the sample
LAT: The latitude of the vessel when the sample was brought on board
LONG: The longitude of the vessel when the sample was brought on board
POP: The geographic population assignment of the sample based upon catch location
ESTUARY: The estuary in which the sample was collected
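A minimal sketch of reading the metadata file in Python, assuming SampleInfo.txt starts with a header row that uses the column names in the key above and that coordinates are stored as decimal degrees. The two rows in the demonstration are invented placeholders, not records from the dataset.

```python
import csv
import io
from collections import Counter


def read_sample_info(handle):
    """Read tab-delimited sample metadata into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        # Assumption: LAT and LONG are decimal degrees.
        row["LAT"] = float(row["LAT"])
        row["LONG"] = float(row["LONG"])
        rows.append(row)
    return rows


# Invented placeholder rows, used only to show the expected layout.
demo = io.StringIO(
    "SAMPLE_ID\tLAT\tLONG\tPOP\tESTUARY\n"
    "demo_001\t27.60\t-80.35\tATL\tIndian River Lagoon\n"
    "demo_002\t27.70\t-82.60\tFLGS\tTampa Bay\n"
)
rows = read_sample_info(demo)
samples_per_pop = Counter(row["POP"] for row in rows)
```

For the real file, the same function could be called as `read_sample_info(open("SampleInfo.txt", newline=""))`.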
Sharing/Access information
Further metadata for these samples can be found at:
Code/Software
Further information about the code used to produce and analyze this data can be found at:
Sampling and library prep
Fin clips were obtained from 382 mixed-age samples of gafftopsail catfish collected from nine geographic sampling locations (hereafter locations; Figure 1) from 2015 to 2018: one in the Atlantic in Indian River Lagoon, Florida, and adjacent coastal waters (ATL) and eight in the Gulf. Locations in the Gulf were near Tampa Bay, Florida (FLGS), north of Tampa Bay, Florida (FLGN), near Mobile Bay, Alabama (MB), in Mississippi Sound, Mississippi (MISS), in Chandeleur Sound, Louisiana (CS), off Louisiana west of the Mississippi River (LA), in Corpus Christi Bay, Texas (CC), and in the Bay of Campeche, Mexico (CAMP). All locations were selected because they represent inshore habitats used by mouthbrooding males for parturition and by juveniles as nursery habitat, except CAMP, which was opportunistically sampled farther offshore. Sampling took place as part of surveys routinely conducted by state or academic entities, the latter following approved animal care protocols. All fin clips were preserved in 20% DMSO-0.25M EDTA-saturated NaCl buffer (Seutin et al., 1991) and stored at room temperature until extraction.
DNA was extracted using Mag-Bind Tissue DNA kits (Omega Bio-Tek, Norcross, GA), and 500-1000 ng of high-quality genomic DNA was used in a modified version of the ddRAD genomic library preparation method (Peterson et al., 2012). Briefly, genomic DNA was digested with two restriction endonucleases (EcoRI, MspI), and a barcoded adapter was ligated to EcoRI sites while a common adapter was ligated to MspI sites. Following adapter ligation, individuals were pooled by index and size-selected using a Pippin Prep size-selection system (Sage Science, Beverly, MA) to a standard size range (338 – 412 base pairs). Polymerase chain reaction (PCR) amplification of fragments was performed to incorporate index-specific identifiers and the adapters necessary for annealing to an Illumina flow cell. Index pools were then combined into libraries of approximately 150 individuals, each including individuals from across the geographic range of sampling as well as duplicate individuals (technical replicates). Three libraries were sequenced (paired-end), each on one lane of an Illumina HiSeq 4000 DNA sequencer at GeneWiz®, New Jersey, USA.
Genotyping
RAD sequences retrieved from each run were demultiplexed using process_radtags (Catchen et al., 2011), and quality trimming, reduced-representation reference assembly, read mapping, and SNP calling were performed using the dDocent pipeline (Puritz et al., 2014). The ten individuals with the highest number of reads were selected from each lane for de novo reduced-representation reference assembly, using the overlapping read (OL) assembly option in dDocent. The similarity threshold for clustering (c = 0.8), the minimum within-individual coverage (K1 = 5), and the minimum number of individuals a read must occur in to be included (K2 = 2) were chosen after comparing mapping statistics for ten individuals randomly chosen from each library and mapped, using BWA (Li & Durbin, 2009), to references generated for c = 0.8, K1 = 2 – 10, and K2 = 1 – 10. The chosen values maximized the number of reads mapped as a proper pair and minimized the number of read pairs whose forward and reverse reads mapped to different contigs. The constructed reduced-representation reference encompassed a total of 10,874,990 base pairs across 37,872 fragments (mean 287 bp; mode 307 bp).
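The parameter comparison described above can be sketched as a small grid search. The grid itself (c = 0.8, K1 = 2 – 10, K2 = 1 – 10) comes from the text; how the two mapping criteria were weighed against each other is not stated, so ranking first by proper-pair fraction and then by split-pair fraction is an assumption, and the mapping statistics below are invented toy values.

```python
from itertools import product

C = 0.8                    # similarity threshold for clustering
K1_VALUES = range(2, 11)   # minimum within-individual coverage, 2..10
K2_VALUES = range(1, 11)   # minimum number of individuals, 1..10

# Every (c, K1, K2) combination for which a reference would be built.
grid = [(C, k1, k2) for k1, k2 in product(K1_VALUES, K2_VALUES)]


def pick_best(stats):
    """Pick the parameter set with the best mapping statistics.

    stats maps (c, K1, K2) -> (proper_pair_fraction, split_pair_fraction).
    Assumed ranking: highest proper-pair fraction, ties broken by the
    lowest fraction of pairs mapping to different contigs.
    """
    return max(stats, key=lambda p: (stats[p][0], -stats[p][1]))


# Invented toy statistics for three of the grid points.
toy_stats = {
    (0.8, 2, 1): (0.80, 0.06),
    (0.8, 5, 2): (0.90, 0.02),
    (0.8, 10, 10): (0.90, 0.05),
}
best = pick_best(toy_stats)
```

With nine K1 values and ten K2 values, the grid holds 90 candidate references at the single similarity threshold.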
Reads were mapped to the reduced-representation reference using BWA (Match=1, mismatch penalty=3 and gap penalty=5; Li, 2013) and SNPs were called using freebayes (Garrison & Marth, 2012). The resulting data set was filtered to remove low-quality and artefactual SNPs, paralogs, and low-quality individuals using vcftools (Danecek et al., 2011) and custom scripts following O’Leary et al. (2018), allowing for the retention of SNPs with more than 2 alleles. Genotypes with quality < 20 and < 5 reads were coded as missing, retaining loci with quality > 20, genotype call rate > 90%, and mean depth 15 – 300. Loci were also filtered based on allelic balance (remove SNPs < 0.25 and > 0.75), mapping quality ratios (remove SNPs < 0.25 and > 1.75), strand balance (remove SNPs with > 100x more forward alternate reads than reverse alternate reads and > 100x more forward reverse reads than reverse alternate reads), paired status, depth/quality ratio (< 0.2), and excess heterozygosity (remove SNPs > 0.5 and that deviate significantly from the expectations of Hardy-Weinberg Equilibrium). Individuals with > 25% missing data were removed. Finally, rad_haplotyper (Willis et al., 2017) was used to merge SNPs on the same fragments into SNP-containing loci (hereafter microhaplotypes) by using a random sample of 20 reads per locus, recording all possible haplotypes, and then discarding haplotypes that are not possible given the SNPs present in the final dataset. Loci are flagged as paralogs if too many haplotypes are called given the SNP genotypes. Genotyping error is flagged if an individual has too few haplotypes given the SNP genotypes. The resulting haplotyped data set was further filtered to remove loci haplotyped in < 90% of individuals, flagged as potential paralogs in > 4 individuals, or flagged as affected by genotyping error in > 10 individuals.
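A few of the thresholds listed above can be written as simple predicates. This is a simplified sketch rather than the vcftools commands or custom scripts actually used: the `SiteStats` class and its field names are hypothetical, only a subset of the filters is covered, and treating the boundary values as retained is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SiteStats:
    """Per-SNP summary values (hypothetical field names)."""
    qual: float            # site quality score
    call_rate: float       # fraction of individuals genotyped
    mean_depth: float      # mean read depth across individuals
    allele_balance: float  # allelic balance
    mapq_ratio: float      # mapping quality ratio


def keep_site(s):
    """Apply the quality, call rate, depth, and balance thresholds."""
    return (
        s.qual > 20
        and s.call_rate > 0.90
        and 15 <= s.mean_depth <= 300
        and 0.25 <= s.allele_balance <= 0.75
        and 0.25 <= s.mapq_ratio <= 1.75
    )


def keep_individual(missing_fraction):
    """Individuals with more than 25% missing data are removed."""
    return missing_fraction <= 0.25


good = SiteStats(qual=35, call_rate=0.95, mean_depth=40,
                 allele_balance=0.5, mapq_ratio=1.0)
skewed = SiteStats(qual=35, call_rate=0.95, mean_depth=40,
                   allele_balance=0.1, mapq_ratio=1.0)
```

Here `keep_site(good)` passes every check, while `keep_site(skewed)` fails on allelic balance alone.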
Technical replicates were compared to assess genotyping error, and loci systematically affected by genotyping error or flagged as deviating significantly from the expectations of Hardy-Weinberg Equilibrium (HWE) in > 5 sites were removed.
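The comparison of technical replicates is not described in detail here, so the following is only one plausible way to do it: for a pair of replicates, list the loci at which both were genotyped but the genotypes disagree. The function name and the example genotypes are hypothetical; genotypes are represented as pairs of three-digit allele codes with `None` for missing data.

```python
def mismatched_loci(rep_a, rep_b):
    """Compare two replicate genotype lists locus by locus.

    Returns (indices_of_mismatching_loci, number_of_loci_compared).
    Loci missing in either replicate are skipped, and allele order
    within a genotype is ignored.
    """
    mismatches = []
    compared = 0
    for index, (a, b) in enumerate(zip(rep_a, rep_b)):
        if a is None or b is None:
            continue
        compared += 1
        if sorted(a) != sorted(b):
            mismatches.append(index)
    return mismatches, compared


# Hypothetical genotypes for one sample sequenced twice.
rep_a = [("001", "002"), None, ("003", "003"), ("001", "001")]
rep_b = [("002", "001"), ("001", "001"), ("003", "004"), ("001", "001")]
mismatches, compared = mismatched_loci(rep_a, rep_b)
error_rate = len(mismatches) / compared
```

Tallying mismatch indices across all replicate pairs would show which loci are repeatedly affected, which is the kind of systematic error the filtering step targets.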