Data and code from: Extensive multi-species hybridization between Leuciscidae minnow species
Data files
May 21, 2026 version files 1.88 GB
- alignment_plots.R (12.76 KB)
- AMP22_BNDxCC_woP13_entropy_list_k2_Geno_ID_q.txt (286 B)
- AMP22_BNDxCC_woP13_entropy_list_k2_indivs_q_Q.txt (23.30 KB)
- AMP22_BNDxCS_woP13_entropy_list_k2_Geno_ID_q.txt (295 B)
- AMP22_BNDxCS_woP13_entropy_list_k2_indivs_q_Q.txt (15.55 KB)
- AMP22_CSxCC_woP13_entropy_list_k2_Geno_ID_q.txt (259 B)
- AMP22_CSxCC_woP13_entropy_list_k2_indivs_q_Q.txt (24.41 KB)
- AMP22_FieldworkSummaryAM.Rmd (10.73 KB)
- AMP22_target_03may23_miss0.6_mac3_Q30_DP3_ind95_maf001_k12_150k_rep1_qk12inds.hdf5 (625.41 MB)
- AMP22_target_03may23_miss0.6_mac3_Q30_DP3_ind95_maf001_k12_150k_rep2_qk12inds.hdf5 (625.41 MB)
- AMP22_target_03may23_miss0.6_mac3_Q30_DP3_ind95_maf001_k12_150k_rep3_qk12inds.hdf5 (625.41 MB)
- AMP22_thesis_24feb25_entropy_list_k10_indivs_q.txt (182.49 KB)
- AMP22_thesis_24feb25_entropy_list_k10_species_q.txt (1.92 KB)
- AMP22_thesis_24feb25_entropy_list_k11_indivs_q.txt (199.77 KB)
- AMP22_thesis_24feb25_entropy_list_k11_species_q.txt (2.11 KB)
- AMP22_thesis_24feb25_entropy_list_k12_indivs_q.txt (216.47 KB)
- AMP22_thesis_24feb25_entropy_list_k12_species_q.txt (2.30 KB)
- AMP22_thesis_24feb25_entropy_list_k13_indivs_q.txt (232.73 KB)
- AMP22_thesis_24feb25_entropy_list_k13_species_q.txt (2.47 KB)
- AMP22_thesis_24feb25_entropy_list_k14_indivs_q.txt (249.35 KB)
- AMP22_thesis_24feb25_entropy_list_k14_species_q.txt (2.66 KB)
- AMP22_thesis_24feb25_entropy_list_k15_indivs_q.txt (265.92 KB)
- AMP22_thesis_24feb25_entropy_list_k15_species_q.txt (2.86 KB)
- AMP22_thesis_24feb25_entropy_list_k2_indivs_q.txt (50.76 KB)
- AMP22_thesis_24feb25_entropy_list_k2_species_q.txt (488 B)
- AMP22_thesis_24feb25_entropy_list_k3_indivs_q.txt (67.38 KB)
- AMP22_thesis_24feb25_entropy_list_k3_species_q.txt (661 B)
- AMP22_thesis_24feb25_entropy_list_k4_indivs_q.txt (83.45 KB)
- AMP22_thesis_24feb25_entropy_list_k4_species_q.txt (839 B)
- AMP22_thesis_24feb25_entropy_list_k5_indivs_q.txt (100.13 KB)
- AMP22_thesis_24feb25_entropy_list_k5_species_q.txt (1.02 KB)
- AMP22_thesis_24feb25_entropy_list_k6_indivs_q.txt (116.76 KB)
- AMP22_thesis_24feb25_entropy_list_k6_species_q.txt (1.20 KB)
- AMP22_thesis_24feb25_entropy_list_k7_indivs_q.txt (133.09 KB)
- AMP22_thesis_24feb25_entropy_list_k7_species_q.txt (1.37 KB)
- AMP22_thesis_24feb25_entropy_list_k8_indivs_q.txt (149.85 KB)
- AMP22_thesis_24feb25_entropy_list_k8_species_q.txt (1.56 KB)
- AMP22_thesis_24feb25_entropy_list_k9_indivs_q.txt (166.23 KB)
- AMP22_thesis_24feb25_entropy_list_k9_species_q.txt (1.74 KB)
- append_paths.py (793 B)
- assign_geno_ID.R (43.32 KB)
- breeding_behaviours.R (7.80 KB)
- check_k.sh (1.68 KB)
- colour_palette.R (2.26 KB)
- count_assembled_7jul22.sh (1.11 KB)
- demultiplexing_txtfiles.R (892 B)
- entropy_queuesub_27jul22.sh (3.38 KB)
- filter_indivs.R (6.52 KB)
- fishtree_AMP22.R (1.34 KB)
- inds_AMP22_target_03may23_miss0.6_mac3_Q30_DP3_ind95_maf001.recode.txt (8.80 KB)
- inputdataformat.R (7.67 KB)
- large_CIs.R (416 B)
- Leuciscid_Metadata_May2023.csv (182.07 KB)
- Leuciscid_OFAT_May2023.csv (11.92 KB)
- logistic_regression.R (12.08 KB)
- loop_plotting.sh (1.10 KB)
- Metadata_OWIT_merge.R (2.23 KB)
- pca_gcov_summer2019_AM.R (43.33 KB)
- plotting_DIC.R (1.27 KB)
- plotting_mean_q.R (23.93 KB)
- q_barplot_multisite_rstudio.R (11.57 KB)
- q_barplot.R (19.30 KB)
- Q_triangleplot.R (11.98 KB)
- RandomForest.R (27.55 KB)
- rawdata_summary_AM.R (1.72 KB)
- README.md (17.46 KB)
- reads_and_DNA_concentration.R (4.02 KB)
- run_bwa_queuesub_22jun22.sh (1.82 KB)
- run_sabre_queuesub.sh (711 B)
- run_variantcall_bcf_queuesub_22jun22.sh (1.69 KB)
- startingvals_loop_queuesub.sh (1.32 KB)
- triangle_plot_boundaries.R (1.96 KB)
Abstract
Anthropogenic disturbances can disrupt ecosystems and alter species population dynamics. Interspecific hybridization is common between genetically related organisms, especially once reproductive barriers such as spatial isolation have been removed. We used genotyping-by-sequencing data to assess outcomes of hybridization between several Leuciscidae minnow species and to identify to what extent land use type and environmental variables influence the frequency of hybridization. We found that both two-species and multi-species hybridization were widespread; hybrids were sampled at all 25 sampling sites and made up almost 30% of all individuals sampled. While most species hybridized with at least one other sampled species, the amount of hybridization was variable. We used logistic regression to estimate the influence of anthropogenic disturbance on hybridization, and found weak relationships between hybridization and environmental factors. This research improves our understanding of hybridization dynamics in species-rich clades like the Leuciscidae with low reproductive isolation, and points to the need for additional work to better understand predictors of hybridization in multi-species hybrid zones.
Dataset DOI: 10.5061/dryad.5mkkwh7kn
Description of the data and file structure
This dataset contains data and code from Meuser et al. (2026). We generated genotyping-by-sequencing data (NCBI Sequence Read Archive under BioProject number PRJNA1291971) from 1213 Leuciscid minnows captured across southern Ontario, Canada. After filtering, we analysed data from 731 individuals at 1228 loci using the program ENTROPY to examine the hybridization between the 9 sampled species. We also used upstream watershed land coverage data from the Ontario Watershed Information Tool to examine the extent of the influence of anthropogenic disturbance on hybridization between the Leuciscid minnow species.
Files and variables
File: AMP22_target_03may23_miss0.6_mac3_Q30_DP3_ind95_maf001_k12_150k_rep1_qk12inds.hdf5
Description: Output file from the ENTROPY run with all individuals, K=12, repetition 1
File: AMP22_target_03may23_miss0.6_mac3_Q30_DP3_ind95_maf001_k12_150k_rep2_qk12inds.hdf5
Description: Output file from the ENTROPY run with all individuals, K=12, repetition 2
File: AMP22_target_03may23_miss0.6_mac3_Q30_DP3_ind95_maf001_k12_150k_rep3_qk12inds.hdf5
Description: Output file from the ENTROPY run with all individuals, K=12, repetition 3
alignment_plots.R
Description: R script for generating plots to assess alignment of raw reads to the reference genome
AMP22_BNDxCC_woP13_entropy_list_k2_Geno_ID_q.txt
AMP22_BNDxCC_woP13_entropy_list_k2_indivs_q_Q.txt
AMP22_BNDxCS_woP13_entropy_list_k2_Geno_ID_q.txt
AMP22_BNDxCS_woP13_entropy_list_k2_indivs_q_Q.txt
AMP22_CSxCC_woP13_entropy_list_k2_Geno_ID_q.txt
AMP22_CSxCC_woP13_entropy_list_k2_indivs_q_Q.txt
Description: The above 6 files are q and Q data pulled from the ENTROPY output (HDF5 files) on the species cross runs, used in subsequent plotting.
AMP22_FieldworkSummaryAM.Rmd
Description: R markdown file to generate a summary of our fieldwork sampling efforts
AMP22_thesis_24feb25_entropy_list_k10_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k10_species_q.txt
AMP22_thesis_24feb25_entropy_list_k11_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k11_species_q.txt
AMP22_thesis_24feb25_entropy_list_k12_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k12_species_q.txt
AMP22_thesis_24feb25_entropy_list_k13_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k13_species_q.txt
AMP22_thesis_24feb25_entropy_list_k14_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k14_species_q.txt
AMP22_thesis_24feb25_entropy_list_k15_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k15_species_q.txt
AMP22_thesis_24feb25_entropy_list_k2_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k2_species_q.txt
AMP22_thesis_24feb25_entropy_list_k3_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k3_species_q.txt
AMP22_thesis_24feb25_entropy_list_k4_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k4_species_q.txt
AMP22_thesis_24feb25_entropy_list_k5_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k5_species_q.txt
AMP22_thesis_24feb25_entropy_list_k6_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k6_species_q.txt
AMP22_thesis_24feb25_entropy_list_k7_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k7_species_q.txt
AMP22_thesis_24feb25_entropy_list_k8_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k8_species_q.txt
AMP22_thesis_24feb25_entropy_list_k9_indivs_q.txt
AMP22_thesis_24feb25_entropy_list_k9_species_q.txt
Description: The above 28 files (two for each value of K from 2 to 15) are q and Q data pulled from the ENTROPY output (HDF5 files) on the all species runs, used in subsequent plotting.
append_paths.py
Description: Python script for appending paths onto file names, used to get a list of paths for data manipulation.
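As a rough illustration of the kind of task described above (this is not the contents of append_paths.py), prefixing a directory onto a list of file names might look like the following Python sketch. The function name `append_paths`, the directory `/scratch/AMP22`, and the file names are hypothetical.

```python
from pathlib import PurePosixPath


def append_paths(directory, file_names):
    """Join each non-empty file name onto `directory` (hypothetical helper)."""
    base = PurePosixPath(directory)
    # Strip whitespace/newlines so names read from a text file join cleanly.
    return [str(base / name.strip()) for name in file_names if name.strip()]


if __name__ == "__main__":
    # Hypothetical directory and file names, for illustration only.
    for path in append_paths("/scratch/AMP22", ["ind_001.fastq", "ind_002.fastq"]):
        print(path)
```

Blank entries are skipped, so a file list with a trailing empty line can be passed in directly.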
assign_geno_ID.R
Description: R script used to assign genomic IDs to each individual based on q values from ENTROPY, as well as code for generating many plots based on these IDs.
breeding_behaviours.R
Description: R script to assign the breeding behaviour associated with the species, then perform a statistical test and visualization of the assignments.
check_k.sh
Description: Bash script for running DIC to assess the fit of each value of K from all ENTROPY runs.
colour_palette.R
Description: R script containing colours used in other data visualization with R.
count_assembled_7jul22.sh
Description: Bash script for counting the number of raw and mapped reads for each individual, to use for data visualization.
demultiplexing_txtfiles.R
Description: R script for creating the 2 barcode key files needed for demultiplexing 2 libraries with sabre.
entropy_queuesub_27jul22.sh
Description: Bash script for submitting ENTROPY jobs on the high performance computing cluster.
filter_indivs.R
Description: R script to filter out non-target species individuals from various datasets.
fishtree_AMP22.R
Description: R script for generating phylogenetic tree for species involved in the project.
inds_AMP22_target_03may23_miss0.6_mac3_Q30_DP3_ind95_maf001.recode.txt
Description: Text file containing individuals in the VCF file submitted to ENTROPY
inputdataformat.R
Description: Modified version of script from ENTROPY used to generate starting values files for running the program.
large_CIs.R
Description: R script for visualizing credible intervals from ENTROPY output.
Leuciscid_Metadata_May2023.csv
Description: CSV file with metadata for all individuals in the project. Variables:
- Date - Date that the individual was sampled on
- Waterbody - Name of waterbody that the individual was sampled at
- Waterbody_Code - Code for the waterbody that the individual was sampled at
- Mandeville_ID - Mandeville Lab group-specific ID code
- McCann_ID - McCann Lab group-specific ID code. NA = not applicable. Samples that have NA were sampled by the Mandeville lab directly, while samples with a McCann ID were sampled by the McCann lab, with a subsample taken by the Mandeville lab for this study
- Common_Name - Common name of the species of the sampled individual
- Species_Name - Scientific name of the species of the sampled individual
- Plate - Which library preparation plate the individual was in during lab work
- Well - Which well of the library preparation plate the individual was in during lab work
- Barcode - Which barcode the individual was indexed with during lab work
- Barcode_plate - Which barcode plate the barcode for this individual came from
- Sample - What type of tissue was taken from this individual
- Total_length_cm - Total length of the individual from tip to end of tail in centimeters
- Fork_length_cm - Length of the individual from tip to fork of tail in centimeters. NA = not available. Most individuals were only measured to the end of the tail.
- Weight_g - Weight in grams. NA = not available. Most individuals were not weighed.
- Sex - Sex of the individual. NA = not available. Sex can only be discerned for adults sampled during the breeding season, so it could not be determined for many individuals.
- Stomach_contents - Stomach contents of the individual, if dissected. NA = not available. Most individuals were not dissected.
- Mandeville_notes - Notes by the Mandeville lab
- McCann_notes - Notes by the McCann lab
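Because several of the columns above use NA for missing values, it can help to convert those entries when loading the file. The following is a minimal Python sketch using only the standard library. It assumes the CSV header matches the variable names listed above. The helper name `read_metadata` and the two example rows are made up for illustration.

```python
import csv
import io

# Columns described above as measurements; assumed to hold numbers or NA.
NUMERIC_COLUMNS = ("Total_length_cm", "Fork_length_cm", "Weight_g")


def read_metadata(handle):
    """Read metadata rows, mapping NA/empty cells to None and measurements to float."""
    rows = []
    for row in csv.DictReader(handle):
        clean = {key: (None if value in ("NA", "", None) else value)
                 for key, value in row.items()}
        for column in NUMERIC_COLUMNS:
            if clean.get(column) is not None:
                clean[column] = float(clean[column])
        rows.append(clean)
    return rows


# Made-up excerpt with a subset of the columns, for illustration only.
example = io.StringIO(
    "Mandeville_ID,Common_Name,Total_length_cm,Fork_length_cm,Weight_g\n"
    "EXAMPLE_1,Creek Chub,8.5,NA,NA\n"
    "EXAMPLE_2,Creek Chub,7.2,6.8,NA\n"
)
records = read_metadata(example)
print(records[0]["Fork_length_cm"])  # None, because the cell was NA
```

To read the real file, the same function could be called on `open("Leuciscid_Metadata_May2023.csv", newline="")`. If a measurement column contains free text, `float` will raise a `ValueError`, which may be a useful signal that the cell needs manual inspection.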
Leuciscid_OFAT_May2023.csv
Description: CSV file with watershed data for all sites in the project. Variables:
- Site - Full site name
- Code - Abbreviated site name
- Lab - Name of lab that sampled at the site
- Lat - Latitude of the site
- Long - Longitude of the site
- Drainage_area_km2 - Total area of the upstream watershed in square kilometers
- Shape_factor - Shape factor of the upstream watershed
- Length_main_channel_km - Length of the main channel in the upstream watershed in kilometers
- Max_channel_elevation_m - Maximum elevation of the main channel in the upstream watershed in meters
- Min_channel_elevation_m - Minimum elevation of the main channel in the upstream watershed in meters
- Slope_main_channel_m_km - Slope of the main channel in the upstream watershed in meters per kilometer
- Slope_main_channel_percent - Slope of the main channel in the upstream watershed in percent
- Area_lakes_divide_wetlands_km2 - Total area of lakes in the upstream watershed divided by the total area of wetlands in square kilometers
- Area_lakes_km2 - Total area of lakes in the upstream watershed in square kilometers
- Area_wetlands_km2 - Total area of wetlands in the upstream watershed in square kilometers
- Mean_elevation_m - Mean elevation of the upstream watershed in meters
- Max_elevation_m - Maximum elevation of the upstream watershed in meters
- Mean_slope_percent - Mean slope of the upstream watershed in percent
- Annual_mean_temp_C - Annual mean temperature of the upstream watershed in degrees Celsius
- Annual_precipitation_mm - Annual mean precipitation of the upstream watershed in millimeters
- Clear_Open_Water_km2 - Area of upstream watershed covered by clear open water in square kilometers
- Marsh_km2 - Area of upstream watershed covered by marsh in square kilometers
- Swamp_km2 - Area of upstream watershed covered by swamp in square kilometers
- Fen_km2 - Area of upstream watershed covered by fen in square kilometers
- Bog_km2 - Area of upstream watershed covered by bog in square kilometers
- Sparse_Treed_km2 - Area of upstream watershed covered by sparse trees in square kilometers
- Treed_Upland_km2 - Area of upstream watershed covered by trees in upland areas in square kilometers
- Deciduous_Treed_km2 - Area of upstream watershed covered by deciduous trees in square kilometers
- Mixed_Treed_km2 - Area of upstream watershed covered by mixed types of trees in square kilometers
- Coniferous_Treed_km2 - Area of upstream watershed covered by coniferous trees in square kilometers
- Plantations_Treed_Cultivated_km2 - Area of upstream watershed covered by tree plantations in square kilometers
- Hedge_Rows_km2 - Area of upstream watershed covered by hedge rows in square kilometers
- Tallgrass_Woodland_km2 - Area of upstream watershed covered by tallgrass or woodland in square kilometers
- Sand_Gravel_Mine_Tailings_Extraction_km2 - Area of upstream watershed covered by sand, gravel, mining, tailings from mining, or other types of natural resource extraction in square kilometers
- Community_Infrastructure_km2 - Area of upstream watershed covered by community infrastructure in square kilometers
- Agriculture_and_Undifferentiated_Rural_Land_Use_km2 - Area of upstream watershed covered by agricultural and other types of rural land use in square kilometers
- Clear_Open_Water_percent - Area of upstream watershed covered by clear open water as percent of total area
- Marsh_percent - Area of upstream watershed covered by marsh as percent of total area
- Swamp_percent - Area of upstream watershed covered by swamp as percent of total area
- Fen_percent - Area of upstream watershed covered by fen as percent of total area
- Bog_percent - Area of upstream watershed covered by bog as percent of total area
- Sparse_Treed_percent - Area of upstream watershed covered by sparse trees as percent of total area
- Treed_Upland_percent - Area of upstream watershed covered by trees in upland areas as percent of total area
- Deciduous_Treed_percent - Area of upstream watershed covered by deciduous trees as percent of total area
- Mixed_Treed_percent - Area of upstream watershed covered by mixed types of trees as percent of total area
- Coniferous_Treed_percent - Area of upstream watershed covered by coniferous trees as percent of total area
- Plantations_Treed_Cultivated_percent - Area of upstream watershed covered by tree plantations as percent of total area
- Hedge_Rows_percent - Area of upstream watershed covered by hedge rows as percent of total area
- Tallgrass_Woodland_percent - Area of upstream watershed covered by tallgrass or woodland as percent of total area
- Sand_Gravel_Mine_Tailings_Extraction_percent - Area of upstream watershed covered by sand, gravel, mining, tailings from mining, or other types of natural resource extraction as percent of total area
- Community_Infrastructure_percent - Area of upstream watershed covered by community infrastructure as percent of total area
- Agriculture_and_Undifferentiated_Rural_Land_Use_percent - Area of upstream watershed covered by agricultural and other types of rural land use as percent of total area
- Total_agriculture_percent - Area of upstream watershed covered by agricultural and other types of rural land use as percent of total area
- Total_urban_and_development_percent - Area of upstream watershed covered by sand, gravel, mining, tailings from mining, or other types of natural resource extraction and by community infrastructure as percent of total area
- Total_treed_percent - Area of upstream watershed covered by any type of treed cover as percent of total area
- Total_wetland_percent - Area of upstream watershed covered by any type of wetland cover as percent of total area
- Open_water_percent - Area of upstream watershed covered by any type of open water cover as percent of total area
- Notes - Notes on the site
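The aggregate columns above are described as combinations of the individual land-cover columns. As a hedged illustration, the Python sketch below recomputes Total_urban_and_development_percent from its two described components and compares it with the stored value. It assumes the aggregate is a simple sum, which the descriptions suggest but do not state outright. The function names, the tolerance, and the example site are hypothetical.

```python
import math

# Component columns named in the description of Total_urban_and_development_percent.
URBAN_COLUMNS = (
    "Sand_Gravel_Mine_Tailings_Extraction_percent",
    "Community_Infrastructure_percent",
)


def total_urban_percent(site):
    """Sum the two component percentages (assumes a simple sum)."""
    return sum(float(site[column]) for column in URBAN_COLUMNS)


def matches_reported_total(site, tolerance=0.01):
    """Compare the recomputed sum with the stored aggregate column."""
    reported = float(site["Total_urban_and_development_percent"])
    return math.isclose(total_urban_percent(site), reported, abs_tol=tolerance)


# Made-up values for one hypothetical site, as csv.DictReader would return them.
site = {
    "Code": "EX1",
    "Sand_Gravel_Mine_Tailings_Extraction_percent": "1.5",
    "Community_Infrastructure_percent": "10.25",
    "Total_urban_and_development_percent": "11.75",
}
print(total_urban_percent(site), matches_reported_total(site))
```

A mismatch would not necessarily indicate an error in the data. It could simply mean the aggregate was rounded or calculated differently from the assumption made here.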
logistic_regression.R
Description: R script for performing a logistic regression on the data.
loop_plotting.sh
Description: Helper bash script used to loop q data plotting script (q_barplot.R).
Metadata_OWIT_merge.R
Description: R script for merging OWIT and metadata.
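For readers who want to reproduce a merge of this kind outside R, the idea can be sketched in Python as a left join of individuals onto sites. This is not a translation of Metadata_OWIT_merge.R. It assumes that Waterbody_Code in the metadata file corresponds to Code in the watershed file, which should be checked against the actual data. The function name and example rows are hypothetical.

```python
def merge_by_site(individuals, sites):
    """Attach site-level watershed columns to each individual (left join on site code)."""
    # Assumed key: Waterbody_Code (metadata) matches Code (watershed file).
    sites_by_code = {site["Code"]: site for site in sites}
    merged = []
    for individual in individuals:
        site = sites_by_code.get(individual["Waterbody_Code"], {})
        # Individual-level columns take precedence if a column name clashes.
        merged.append({**site, **individual})
    return merged


# Made-up rows for illustration only.
individuals = [
    {"Mandeville_ID": "EXAMPLE_1", "Waterbody_Code": "EX1"},
    {"Mandeville_ID": "EXAMPLE_2", "Waterbody_Code": "EX9"},  # no matching site
]
sites = [{"Code": "EX1", "Drainage_area_km2": "12.5"}]

for row in merge_by_site(individuals, sites):
    print(row.get("Mandeville_ID"), row.get("Drainage_area_km2"))
```

Individuals without a matching site are kept with no watershed columns, so unmatched codes can be spotted afterwards rather than silently dropped.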
pca_gcov_summer2019_AM.R
Description: R script to create principal components analysis (PCA) plots.
plotting_DIC.R
Description: R script to plot the DIC values showing fit of each K model from ENTROPY.
plotting_mean_q.R
Description: R script to plot the mean value of q for each phenotypically identified species, across each value of K. Used to help correlate the phenotypic and genomic IDs.
q_barplot_multisite_rstudio.R
Description: R script used to create barplots with q-value data split per site, in RStudio on a local computer rather than in R on the computing cluster.
q_barplot.R
Description: R script used to create barplots with q data in R on the computing cluster, as well as generate summary output text files with q and Q data per individual and species.
Q_triangleplot.R
Description: R script used to create triangle plots with Q data in R on the computing cluster.
RandomForest.R
Description: R script used to run random forest statistical models on the genomic and environmental data.
rawdata_summary_AM.R
Description: R script used to assess raw sequencing data, including coverage and read length.
reads_and_DNA_concentration.R
Description: R script used to compare concentration of DNA and sequenced reads for each individual.
run_bwa_queuesub_22jun22.sh
Description: Bash script to run mapping with BWA.
run_sabre_queuesub.sh
Description: Bash script to run demultiplexing with sabre.
run_variantcall_bcf_queuesub_22jun22.sh
Description: Bash script to run variant calling with BCFtools.
startingvals_loop_queuesub.sh
Description: Bash script to loop over submission of ENTROPY jobs (with entropy_queuesub_27jul22.sh), when using starting values files (generated by inputdataformat.R).
triangle_plot_boundaries.R
Description: R script used to generate the plot depicting our cut-offs for hybrid genomic IDs.
Code/software
We demultiplexed raw FASTQ files using sabre (github.com/najoshi/sabre). Then we used the Burrows-Wheeler Alignment tool (version 0.7.17, Li 2009), with the BWA-MEM algorithm (Li 2013), to align the FASTQ files to the creek chub reference genome (Meuser et al. 2023). We used bcftools mpileup (version 1.11, Li 2009) to identify variable loci; these were output into a BCF file which we converted to a VCF file using SAMtools (version 1.16, Li 2009b). We used the software ENTROPY to assess admixture between each of the 9 sampled minnow species (Shastry et al. 2021). We performed all of the data visualization for this project in R (version 4.2.2), either on a high-performance computing cluster or in RStudio on a local computer (R Core Team 2021).
Access information
Other publicly accessible locations of the data:
- Raw sequencing data can be found on NCBI Sequence Read Archive under BioProject number PRJNA1291971 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1291971)
- Scripts, metadata, and data uploaded here can also be found on GitHub (https://github.com/amanda-meuser/Extensive_hybridization_paper)
Data was derived from the following sources:
- Upstream watershed land coverage data was gathered from the Ontario Watershed Information Tool (https://www.lioapplications.lrc.gov.on.ca/OWIT/index.html?viewer=OWIT.OWIT&locale=en-CA)
