Skip to main content

Genomic structure of the Sicklefin Barb, Puntioplites falcifer (Cyprinidae), in the lower Mekong River basin reveals patterns of both migration and population partitioning

Cite this dataset

Bird, Christopher E. (2020). Genomic structure of the Sicklefin Barb, Puntioplites falcifer (Cyprinidae), in the lower Mekong River basin reveals patterns of both migration and population partitioning [Dataset]. Dryad.


Effective management of the Sicklefin Barb, Puntioplites falcifer, with the planned construction of several dams in the Mekong River Basin depends upon disentangling conflicting reports of isolated populations and highly migratory behavior. We investigated patterns of population genomic structure, relatedness, and inferred connectivity among three locations on the Mekong and a fourth at Attapeu on the Sekong tributary. The results supported both isolation of populations and migratory behavior. STRUCTURE, AMOVA, and PCA revealed three distinct nDNA clusters. The most divergent nDNA cluster (pairwise FST ≥ 0.43, p < 0.0001) exhibited negligible inferred relative migration rates with the other samples (m ≤ 0.02), including those from common sampling locations, and was likely a different species - putatively P. proctozysron (Ppr). However, mtDNA barcoding suggested differentiation between Ppr and a published mtDNA genome for this species.  Most of the fish from the Sekong tributary belonged to a second distinct nDNA cluster and the sample from that location was differentiated from the Mekong sites (pairwise FST = 0.02 - 0.03, p < 0.0001).  Supporting migration, a third nDNA cluster exhibited high rates of migration among the Mekong locations (m = 0.6 - 1) and was found in small numbers at the Sekong location which was characterized by intermediate migration rates with the Mekong (m = 0.3 - 0.4).  Mitochondrial DNA barcoding indicated that the fish comprising the Mekong and Sekong nDNA clusters were differentiated from a P. falcifer mtDNA genome sampled well upstream of the Mekong locations in this study.  Estimates of Ne by both location and nDNA cluster were near or below the minimal sustainable size (173-1651), suggesting susceptibility to over-exploitation or population fragmentation. Together, these results suggest that proposed hydropower dams could subdivide connected Mekong populations, isolate and split the Sekong population, and further drive down Ne if accommodations are not made to facilitate connectivity.  Additionally, the combined pattern of nDNA and mtDNA diversity is consistent with substantial cryptic diversity and a P. falciferP. proctozysron species complex that could be further described with rigorous population genomic surveys and expanded geographic sampling.


Ethical Statement

No local permits were required for the work performed in this study.  The care and use of fish complied with the United States Animal Welfare Act and relevant Public Health Service Policies as approved by the Institutional Animal Care and Use Committee of Old Dominion University (923066-2). 

Sample collection

Specimens were collected from fish markets and fish landings along the Mekong mainstem from sites that would be impacted from planned dam development, including Pakse in Lao PDR, and Stung Treng and Kratié in Cambodia, and from the Sekong tributary in Attapeu, Lao PDR (Fig. 1, Table 1). The distribution and marketing of fishes within the MRB is highly complex. Fish being sold can originate from distant locations, especially in large markets (Bush, 2004; Baran et al., 2007). Therefore, great care was taken in choice of vendors and fresh condition of fishes to ensure fishes sampled were from the target region. Vendors and fishermen were asked by a local scientist familiar with the fishery (Cambodia: S. Uy; Lao PDR: L. Phounvisouk) where the fish were caught (without them knowing we wanted locally caught fish). Specimens were only purchased if we were confident from the response of the vendor that they were caught nearby.

Genomic library preparation and sequencing

Tissue samples were stored in 95% molecular-grade ethanol and transported to Nha Trang University, Vietnam for DNA extraction using a Qiagen DNeasy Blood and Tissue kit (Hilden, Germany). Successful DNA extraction was confirmed for four separate and consecutive elutions of each sample with gel electrophoresis, and the elution with the best quality high-molecular weight band was used for double digest RAD (ddRAD, Peterson et al., 2012) library preparation at Texas A&M University – Corpus Christi prior to sequencing. SPRI bead size selection (0.4x, Beckman-Coulter, Brea, CA, USA) was carried out on extractions still containing low molecular weight DNA, as determined by gel electrophoresis, and all other extractions were cleaned with AmpureXP paramagnetic beads (Beckman-Coulter). Both the SPRI and AmpureXP cleanup protocols were modified to leave the paramagnetic beads in the samples and the beads were subsequently reactivated down-stream using 3M NaCl 20% PEG (Fisher et al., 2011).

The concentration of DNA was quantified for each sample using the Biotium AccuBlue HighSensitivity kit (Hayward, CA, USA) on a Spectramax M3 fluorescent plate reader (San Jose, CA, USA). Samples were normalized and 50ng of DNA from each sample was digested with the restriction enzymes Hin1II and EcoRI. One of forty-eight barcodes was applied to each fish during ligation. Each set of 48 barcodes were subsequently dual-indexed via polymerase chain reaction (PCR) using 11 bp index sequences developed by Quail et al. (2014). Each index was used only once to minimize the chances of mis-assigned sequence reads due to index swapping (Kircher et al., 2011; Hussmann, 2015; Wright & Vetsigian, 2016). Unlike the ddRAD protocol of Peterson et al. (2012), we did not combine the libraries for each fish until we renormalized them following PCR, as in ezRAD (Toonen et al., 2013), to reduce the variation in sequence read depth among individuals.

Following PCR, fluorescent quantification of DNA concentration and normalization, the libraries of individual fish were combined in groups of 48. A BluePippin (Sage Science, Beverly, MA, USA) automated electrophoresis rig was used to size select fragments of 400-500 bp. Libraries were normalized (using the Biotium AccuBlue HighSensitivity kit on a Spectramax M3 fluorescent plate reader) and pooled into a super library that was at least 2 nM. The molarity of the super library was determined using the KAPA Library Quantification Kit for Illumina Platforms (Wilmington, MA, USA) and an ABI StepOne-Plus real time thermocycler (Foster City, CA, USA). The super library was sequenced on 1.5 lanes of an Illumina HiSeq 4000 (paired-end 150 bp) at New York University’s Genome Technology Center.

Genomic data processing

Sequence reads were demultiplexed by Illumina dual-indices and barcodes. Sequences mapping to the phi X bacteriophage genome were removed using a custom script. Quality trimming, de novo reference assembly, mapping, and SNP calling were carried out using a modified version of the dDocent pipeline (Puritz et al., 2014), dDocentHPC ( dDocentHPC accepts an extensive, detailed settings file that contains all the information required for methodological reproduction (Supporting Information S1: Pfal.dDocentHPC.config). Briefly, sequence reads were trimmed to 146 bp for de novo reference genome assembly, and any read that required adapter or quality trimming (phred < 20) was removed from consideration. The data set was filtered down to unique reads with more than 3x coverage and present in more than 7 individuals before de novo assembly with rainbow (Chong et al. 2012). Reads for mapping were trimmed for adapters and low quality (phred < 20), with reads <75 bp after trimming being removed from consideration. Reads were mapped to the reference genome with bwa-mem (Li 2013) and then filtered for secondary alignments, low mapping quality (<30), orphaned read pairs, and improperly paired reads using samtools view (Li et al. 2009). Single nucleotide polymorphisms were called using freebayes (Garrison & Marth 2012) with settings to prevent the calling of multi-nucleotide polymorphisms (MNPs) because the filtering software used works with SNPs and information is lost when converting MNPs to SNPs with existing tools, such as vcflib (

The resulting SNPs underwent stringent filtering that closely followed the filtering protocol of O’Leary et al. (2018) with the filtering script fltrVCF ( fltrVCF accepts an extensive configuration file which can be used for methodological reproduction, and we included it as a supplement (Supporting Information S2: Pfal.fltrVCF.config). Briefly, we removed (1) all individuals missing greater than 40% of loci; (2) SNP loci that had a mean depth of coverage below 15x and above 400x; (3) SNP genotypes with 10x coverage or less; and (4) SNP loci with greater than 20% missing genotypes after applying the previous filters. rad_haplotyper was used to eliminate putative paralogues (Willis et al. 2017). For the remaining SNPs, one was randomly selected from each contig in the reference genome for subsequent analysis to avoid redundancy resulting from linkage. A final variant call format (VCF) file of putative SNPs was converted to various data formats with PGDSpider (Lischer & Excoffier, 2012) for statistical analysis. SNPs putatively under selection were identified from this dataset with BayeScan v. 2.1 (Foll & Gaggiotti 2008). All loci identified as FST outliers compared to what is expected under neutrality (with a false discovery rate correction of 0.05) were filtered to construct a panel of putatively neutral loci.

Usage notes

The FASTQ files associated with this project are raw and have not been altered.  Each sequence read pair contains a DNA barcode that uniquely identifies individual fish in each pair of FASTQ files.  The FASTQ files need to be demultiplexed prior to further processing.  We provide a "demultiplex decode" file for each pair of FASTQ files that is compatible with the STACKS script `process_radtags` (see

To facilitate replication, we have also included the configuration files used to to parameterize dDocentHPC ( and fltrVCF ( and process the FASTQ data files following demultiplexing. Refer to the methods listed above for their names and usage.