Data from: Increased rates of hybridization in swordtails are associated with water pollution
Data files
Jun 02, 2026 version files 8.63 GB
- AIMS_to_tracts_PLAZ_HI_comparison.tar.gz (11.95 MB)
- Assortative_mating_ABC_CALM_CALL_betadist.tar.gz (272.82 MB)
- Assortative_mating_ABC_CALM_CALL.tar.gz (496.94 MB)
- Assortative_mating_test.tar.gz (75.34 KB)
- calnali_tributaries_hybrid_index_lowcountsremoved.csv (6.22 KB)
- chemtrial_QC.tar.gz (8.43 KB)
- histology_data.tar.gz (52.35 KB)
- histology_images.tar.gz (5.45 GB)
- huazalingo_pochula_calnali_conzintla_allhybridindex.csv (245.21 KB)
- hybrid_index_downsampling_PE.tsv (10.28 KB)
- hybrid_index_downsampling_SE.tsv (7.96 KB)
- land_use.tar.gz (2.40 GB)
- metals_carcasses.csv (19.06 KB)
- metals_fish_metadata.csv (5.82 KB)
- README.md (21.77 KB)
- site_riverkm_distances_noduplicates.csv (2.23 KB)
- site_riverkm_distances.csv (2.81 KB)
- waterchem_with_averagedsonde_labmetals_data.csv (109.86 KB)
Abstract
The nature of reproductive barriers that separate species is a fundamental question in evolutionary biology. Such barriers may be sensitive to environmental conditions, and recent research has documented an increasing number of cases where anthropogenic environmental disturbance is associated with new hybrid populations. However, few studies have been able to quantitatively compare potential environmental drivers and test possible mechanisms connecting interspecific hybridization to anthropogenic disturbance. Here, we combine genomic and chemical surveys to explore the loss of reproductive isolation between the sister species Xiphophorus malinche and X. birchmanni, fishes whose riverine habitat in the Sierra Madre Oriental of Mexico is increasingly impacted by human-mediated disturbance. By inferring genome-wide ancestry in thousands of fish, we characterize the landscape of hybridization between these species in four distinct streams. Ancestry structure varied dramatically across streams, ranging from stable coexistence to clinal hybrid zones, hinting that the dynamics of hybridization in this system may be environmentally dependent. In one stream, a sudden shift in hybridization patterns coincides with the stream’s passage through an urbanized area, with upstream sites showing distinct ancestry clusters and downstream sites showing a swarm of hybrids with variable ancestry. By sequencing mothers and embryos, we show that assortative mating by ancestry is weakened downstream of this urbanized area. We hypothesize that the hybrid swarm downstream of the town is driven by chemical disruption of olfaction that impacts mating preferences. Water chemistry measurements show that water quality changes significantly across this area, including in parameters known to disrupt fish olfaction and mating. 
We identify alterations of the olfactory epithelium between sites upstream and downstream of the urbanized area that are consistent with differential effects of water quality. Taken together, our work illuminates potential mechanisms linking anthropogenic disturbance to the breakdown of reproductive isolation in these hybridizing species.
Dataset DOI: 10.5061/dryad.nvx0k6f5h
Description of the data and file structure
This dataset combines several kinds of data to connect anthropogenic impacts on water quality to increased gene flow between the swordtail fishes Xiphophorus birchmanni and X. malinche. To do so, we genetically quantified population structure across sites and connected it to environmental variables at each site, then conducted targeted analyses of olfactory histology and mate choice at sites with contrasting anthropogenic impacts. This required estimates of ancestry fractions from each parental species for over 2500 Xiphophorus hybrids, geographic data on sampling sites and land use, multiparameter water quality measurements collected across 4 years in the Rio Atlapexco drainage of Hidalgo, Mexico, and data from wild and experimental measurements of swordtail olfactory histology.
Files and variables
File: calnali_tributaries_hybrid_index_lowcountsremoved.csv
Description: a supplemental hybrid index file for individuals from four tributaries whose confluence with the Rio Calnali is downstream of the town of Calnali. See "Demographic influences on ancestry structure in X. birchmanni × X. malinche hybrid zones".
Variables
- indiv: individual ID
- malcount: count of alleles hard-called as X. malinche
- bircount: count of alleles hard-called as X. birchmanni
- hybrid_index: proportion of malinche ancestry = malcount / (malcount + bircount)
- heterozygosity: proportion of sites called as heterozygous
- Site.Code: abbreviation for sampling site used across datasets
- Site_order: ordinal position of the site, from upstream to downstream
- Collection_Year: year in which the individual sample was collected
- read1_counts: count of reads in original fastq file (used for quality control)
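The hybrid_index definition above can be checked directly from the two allele-count columns. The following is a minimal sketch, not the authors' code: the function names and the tolerance are assumptions made for illustration, and only the formula malcount / (malcount + bircount) comes from the variable description.

```python
def hybrid_index(malcount: float, bircount: float) -> float:
    """Proportion of malinche ancestry, as defined in the variable list."""
    total = malcount + bircount
    if total <= 0:
        # With no hard-called alleles the proportion is undefined.
        raise ValueError("no hard-called alleles; hybrid index is undefined")
    return malcount / total


def matches_reported(malcount: float, bircount: float,
                     reported: float, tol: float = 1e-6) -> bool:
    """Check a reported hybrid_index against the counts (tol is an assumption)."""
    return abs(hybrid_index(malcount, bircount) - reported) <= tol
```

For example, an individual with 30 malinche-called alleles and 10 birchmanni-called alleles has a hybrid index of 30 / 40 = 0.75.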
File: site_riverkm_distances_noduplicates.csv
Description: file of metadata on collection sites for genetic and chemical sampling - this file does not include duplicate entries for sites found at the confluence of two rivers, and is used in genetic analyses and graphics.
Variables
- Site.Code: abbreviation for sampling site used across datasets
- Drainage: major drainage in which the site is located
- dist_from_first_site_km: distance along the river from the highest upstream site sampled (km)
- dist_from_prev_site_km: distance along the river from the closest upstream site sampled (km)
- elevation: sampling site elevation (m)
- graphing_elevation: convenience variable used to avoid overlap in spectrum-based graphs of sitewise ancestry distributions (not used in any statistics)
- Site.Upstream: Site.Code for the upstream site referred to in dist_from_prev_site_km
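Because Site.Code is shared across datasets, the hybrid index tables can be joined to this site metadata. The sketch below shows one way to do that with the Python standard library. The rows embedded in the example are invented for illustration (the distances, the Drainage values, and the individual IDs are not taken from the real files), and the helper name attach_river_km is hypothetical.

```python
import csv
import io

# Invented example rows; the real values live in the CSV files of this dataset.
HYBRID_CSV = """indiv,hybrid_index,Site.Code
fish_001,0.25,CALM
fish_002,0.75,CALL
"""
SITES_CSV = """Site.Code,Drainage,dist_from_first_site_km
CALM,Calnali,1.0
CALL,Calnali,2.5
"""


def attach_river_km(hybrid_text: str, sites_text: str) -> list:
    """Join hybrid index rows to site metadata on the shared Site.Code column."""
    sites = {row["Site.Code"]: row
             for row in csv.DictReader(io.StringIO(sites_text))}
    merged = []
    for row in csv.DictReader(io.StringIO(hybrid_text)):
        site = sites.get(row["Site.Code"])
        if site is None:
            continue  # skip individuals whose site has no metadata entry
        merged.append({
            "indiv": row["indiv"],
            "Site.Code": row["Site.Code"],
            "hybrid_index": float(row["hybrid_index"]),
            "river_km": float(site["dist_from_first_site_km"]),
        })
    return merged
```

Using the no-duplicates file for this kind of join keeps each Site.Code unique, so the dictionary lookup never silently overwrites a confluence site.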
File: site_riverkm_distances.csv
Description: file of metadata on collection sites for genetic and chemical sampling - this file includes duplicate entries for sites found at the confluence of two rivers, and is used in land use and chemistry analyses.
Variables
- Site.Code: abbreviation for sampling site used across datasets
- Drainage: major drainage in which the site is located
- dist_from_first_site_km: distance along the river from the highest upstream site sampled (km)
- dist_from_prev_site_km: distance along the river from the closest upstream site sampled (km)
- elevation: sampling site elevation (m)
- graphing_elevation: convenience variable used to avoid overlap in spectrum-based graphs of sitewise ancestry distributions (not used in any statistics)
- Site.Upstream: Site.Code for the upstream site referred to in dist_from_prev_site_km
- bedrock: bedrock underlying site, inferred from geological map in Fig. S7 (see corresponding figure for abbreviation codes)
File: waterchem_with_averagedsonde_labmetals_data.csv
Description: master datasheet of water chemistry data collected between May 2021 and September 2024 at sites across Hidalgo state, Mexico.
Variables
- Date: collection date
- Time: collection time
- Site Name: full name of sampling site
- Drainage: major drainage of the sampling site
- Site Code: abbreviation for sampling site used across datasets
- Lon: longitude (degrees)
- Lat: latitude (degrees)
- DO_sat: dissolved oxygen from YSI exo2 sonde (% saturation)
- DO_conc: dissolved oxygen from YSI exo2 sonde (mg/L of oxygen)
- Temp: temperature from YSI exo2 sonde (degrees Celsius)
- fDOM_RFU: fluorescent dissolved organic matter from YSI exo2 sonde (relative fluorescence units)
- fDOM_QSU: fluorescent dissolved organic matter from YSI exo2 sonde (quinine sulfate units)
- DOC_lab: dissolved organic carbon from acidified samples measured on Shimadzu TOC-L analyzer
- Conductivity: specific conductivity (uS/cm)
- Total_Hardness: total hardness (CaCO3 equivalents)
- Calcium_Hardness: calcium hardness (CaCO3 equivalents)
- Alkalinity: total alkalinity (CaCO3 equivalents)
- pH: pH
- NO2: nitrite from YSI colorimeter (mg nitrogen /L)
- NO3: nitrate from YSI colorimeter (mg nitrogen /L)
- NH3: ammonia from YSI colorimeter (mg nitrogen /L)
- N_lab: total nitrogen from acidified samples measured on Shimadzu TOC-L analyzer
- PO4: phosphate from YSI colorimeter (mg PO4 /L)
- SO4: sulfate from YSI colorimeter (mg SO4 /L)
- Sulfite: sulfite from YSI colorimeter (mg SO3 /L)
- Sulfide: sulfide from YSI colorimeter (mg H2S /L)
- Color_Apparent: apparent water color measurements from Hanna Instruments portable photometer
- Color_True: true (post-filtering) water color measurements from Hanna Instruments portable photometer
- Turbidity: turbidity measurement (formazin nephelometric units for YSI sonde, nephelometric turbidity units for Orion portable turbidimeter)
- Turbidity_instrument: whether turbidity measurements were taken with a YSI sonde or an Orion turbidimeter
- Fe_colorimeter: iron concentration measured by YSI colorimeter (mg/L)
- Cu_free_colorimeter: free (non-chelated) copper concentration measured by YSI colorimeter (mg/L)
- Cu_total_colorimeter: total copper concentration measured by YSI colorimeter (mg/L)
- Ni_colorimeter: nickel concentration measured by YSI colorimeter (mg/L)
- Mn_colorimeter: manganese concentration measured by YSI colorimeter (mg/L)
- ##_t: total concentration of ion (represented by periodic table code) from ICP-MS
- ##_d: dissolved (post-filtering) concentration of ion (represented by periodic table code) from ICP-MS
- Weather: notes about weather at time of sample collection
- Observations: other notes relevant to data collection
File: metals_carcasses.csv
Description: datasheet of ICP-MS measurements of metal concentrations in whole fish carcasses collected in November 2021 and February 2022.
Variables
- Client Sample ID: sample name
- Site.Code: abbreviation for sampling site used across datasets
- ALS Sample ID: internal identification of the sample used by the ICP-MS facility
- Month: month of fish sampling and preservation
- Parameter: type of sample submitted to ICP-MS
- Moisture (%): proportion of body weight attributable to water content (1 - dry weight/wet weight)
- ##_wwt: total concentration of ion (represented by periodic table code) from ICP-MS, measured on a wet weight basis (mg/kg)
- ##_dwt: total concentration of ion (represented by periodic table code) from ICP-MS, measured on a dry weight basis (mg/kg, calculated from wet weight measurement and moisture, where available)
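The relationship between the wet-weight and dry-weight columns follows from the moisture definition (1 - dry weight/wet weight). The sketch below is one way to reproduce the conversion. It assumes that Moisture (%) is stored as a percentage between 0 and 100; the function name is hypothetical and this is not necessarily how the facility or the authors computed the reported values.

```python
def wet_to_dry_basis(conc_wwt_mg_per_kg: float, moisture_percent: float) -> float:
    """Convert a wet-weight concentration to a dry-weight concentration.

    Assumes moisture_percent = 100 * (1 - dry weight / wet weight), so the dry
    fraction of the sample is 1 - moisture_percent / 100.
    """
    if not 0.0 <= moisture_percent < 100.0:
        raise ValueError("moisture must be in the range [0, 100)")
    dry_fraction = 1.0 - moisture_percent / 100.0
    # The same mass of metal is divided by a smaller (dry) tissue mass.
    return conc_wwt_mg_per_kg / dry_fraction
```

For example, 2.0 mg/kg on a wet-weight basis in a carcass with 75% moisture corresponds to 2.0 / 0.25 = 8.0 mg/kg on a dry-weight basis.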
File: metals_fish_metadata.csv
Description: metadata of whole fish collected in November 2021 and February 2022 for ICP-MS measurements of metal concentrations.
Variables
- Date: date of sample collection
- Month: month of fish sampling and preservation
- Site.Code: abbreviation for sampling site used across datasets
- Site_num: unique number applied to sample site (not used in analysis)
- Bag_label: unique name for a sampling instance (site + date)
- Foil_label: unique name for each fish collected (site + date + individual)
- Fish_num: ID for each fish within its sampling trip and site
- Fish_species: species (or hybrid) ID of the fish
- Fish_length_mm: standard length of fish, in millimeters
- Fish_weight_g: weight of fish from field measurement on portable scale, in grams
- Condition_factor: Fish_weight_g divided by Fish_length_mm
- Sex: biological sex of fish (male, female, or juvenile male)
File: huazalingo_pochula_calnali_conzintla_allhybridindex.csv
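Description: main hybrid index file for individuals from the four streams named in the file name (Huazalingo, Pochula, Calnali, and Conzintla).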
Variables
- indiv: individual ID
- malcount: count of alleles hard-called as X. malinche
- bircount: count of alleles hard-called as X. birchmanni
- hybrid_index: proportion of malinche ancestry = malcount / (malcount + bircount)
- heterozygosity: proportion of sites called as heterozygous
- Site.Code: abbreviation for sampling site used across datasets
- Site_order: ordinal position of the site, from upstream to downstream
- Collection_Year: year in which the individual sample was collected
File: AIMS_to_tracts_PLAZ_HI_comparison.tar.gz
Description: genetic data from the Plaza (PLAZ) population used in confirming that two alternative methods of calculating the hybrid index generated functionally equivalent results (Manuscript Supp. Info. 5). Includes:
- PLAZ_hybrid_index_readcounts.csv: hybrid index estimates based on counts of ancestry informative markers, as in huazalingo_pochula_calnali_conzintla_allhybridindex.csv
- genotypes_PLAZ.tsv_plot.txt: local ancestry hard calls for PLAZ individuals; rows are ancestry calls at specific ancestry informative markers, columns are site (bp), chromosome, and individual genotype at that locus (0 = birchmanni, 1 = heterozygous, 2 = malinche)
File: hybrid_index_downsampling_PE.tsv
Description: inferred ancestry of five individuals with varying amounts of downsampled paired-end reads used to confirm that our read count threshold ensured accurate ancestry calls.
Variables
- readfile: the name of the file passed to ancestryinfer for local ancestry calling (includes individual name and downsampled read count)
- malcount: count of alleles hard-called as X. malinche
- bircount: count of alleles hard-called as X. birchmanni
- hybrid_index: proportion of malinche ancestry = malcount / (malcount + bircount)
- heterozygosity: proportion of sites called as heterozygous
File: hybrid_index_downsampling_SE.tsv
Description: inferred ancestry of five individuals with varying amounts of downsampled single-end reads used to confirm that our read count threshold ensured accurate ancestry calls.
Variables
- readfile: the name of the file downsampled and passed to ancestryinfer for local ancestry calling (parsed to extract individual name)
- malcount: count of alleles hard-called as X. malinche
- bircount: count of alleles hard-called as X. birchmanni
- hybrid_index: proportion of malinche ancestry = malcount / (malcount + bircount)
- heterozygosity: proportion of sites called as heterozygous
- reads: count of reads included in readfile in downsampling analysis
File: Assortative_mating_test.tar.gz
Description: files used in the test of assortative mating patterns at the sites Calnali Mid (CALM) and Calnali Low (CALL) against expectations of random mating. Files include:
- hybrid_index_CALL_adults.csv: hybrid index for all adults from Calnali Low; a subset of huazalingo_pochula_calnali_conzintla_allhybridindex.csv
- hybrid_index_CALL_mother_embryo.csv: paired hybrid indices for pregnant females from Calnali Low and their embryos; format is analogous to huazalingo_pochula_calnali_conzintla_allhybridindex.csv, but with two sets of columns, for the mother on the left, and the embryo on the right. Note that mother entries are duplicated where multiple embryos were sequenced.
- hybrid_index_CALM_adults.csv: hybrid index for all adults from Calnali Mid; a subset of huazalingo_pochula_calnali_conzintla_allhybridindex.csv
- hybrid_index_CALM_mother_embryo.csv: paired hybrid indices for pregnant females from Calnali Mid and their embryos; format is analogous to huazalingo_pochula_calnali_conzintla_allhybridindex.csv, but with two sets of columns, for the mother on the left, and the embryo on the right. Note that mother entries are duplicated where multiple embryos were sequenced.
File: chemtrial_QC.tar.gz
Description: data files from measurements used to confirm the desired effect of chemical manipulations during laboratory trials of chemical exposure on olfactory histology. Treatment groups included ammonium chloride (AC in data files), humic acid (HA), copper(II) chloride (CU), all of the above (Ev or All), and none of the above (Co or Control). Individual files include:
- 05082024_schumer_chemtrial_NH4.csv: concentrations of ammonia (measured as ammonium) from Westco SmartChem 200 Discrete Analyzer.
- 20240417_olfact_chemtrials_DOCnewcalibration_cleaned: dissolved organic carbon (measured as non-purgeable organic carbon) from Shimadzu TOC/TN (TOC-L).
- EMF_ICPMS_Results_Ben_Michael_Moran20240507.csv: copper concentrations from Thermo Scientific iCAP RQ ICP-MS.
File: histology_data.tar.gz
Description: tables of olfactory morphology data measured from histological sections of swordtail crania. Files include:
- AGZC_CALL_cilia_bed_lengths.csv: length in μm of sensory cilia beds in wild-caught fish from Aguazarca (AGZC) and Calnali Low (CALL), measured in ImageJ. Beds are numbered from posterior to anterior.
- AGZC_CALL_olfactory_rosette_lengths.csv: length in μm of entire olfactory rosettes (including sensory beds and surrounding non-sensory tissue) in wild-caught fish from Aguazarca (AGZC) and Calnali Low (CALL), measured in ImageJ.
- Histology_blinding_key_CALL_AGZC_labCilia.csv: key connecting true individual IDs from the experiment to the blinded ID available to the slide observer. This sheet includes blinding pairs for all wild-caught fish, as well as the hematoxylin & eosin (H&E) stained slides of the laboratory chemical exposure trial.
- Histology_blinding_key_lab_ABPAS.csv: key connecting true individual IDs from the experiment to the blinded ID available to the slide observer. This sheet includes blinding pairs for the ABPAS-stained slides of the laboratory chemical exposure trial.
- histology_data_ABPAS_AGZC_CALL.csv: histology data collected by a blinded observer of ABPAS slides from wild-caught fish. Columns include blinded sample ID, Section (two sections were sometimes included per slide), side, the count of ABPAS-positive goblet cells on the side's olfactory rosette, and notes.
- histology_data_ABPAS_lab.csv: histology data collected by a blinded observer of ABPAS slides from lab-treated fish. Columns include blinded sample ID, Section (two sections were sometimes included per slide), side, the count of ABPAS-positive goblet cells on the side's olfactory rosette, and notes.
- histology_data_cilia_CALL_AGZC.csv: histology data collected by a blinded observer of H&E slides from wild-caught fish. Columns include blinded sample ID, side of head, sensory bed number (posterior to anterior), the proportion of that bed's surface area with visible and/or detached cilia, and notes.
- histology_data_cilia_lab.csv: histology data collected by a blinded observer of H&E slides from lab-treated fish. Columns include blinded sample ID, side of head, sensory bed number (posterior to anterior), the proportion of that bed's surface area with visible and/or detached cilia, and notes.
- lab_cilia_bed_lengths.csv: length in μm of sensory cilia beds in lab-treated fish, measured in ImageJ. Beds are numbered from posterior to anterior.
- lab_olfactory_rosette_lengths.csv: length in μm of entire olfactory rosettes (including sensory beds and surrounding non-sensory tissue) in lab-treated fish, measured in ImageJ.
File: histology_images.tar.gz
Description: slide scan images of H&E- and ABPAS-stained horizontal sections from swordtail crania. Folders include both wild-caught (Aguazarca, AGZC, and Calnali Low, CALL) and lab-treated fish. Lab-treated files are named by blinded sample ID, which can be decoded from the keys in histology_data.tar.gz.
File: Assortative_mating_ABC_CALM_CALL.tar.gz
Description: summary statistics from simulations of assortative mating used in Approximate Bayesian Computation of the strength of assortative mating at Calnali Low and Calnali Mid with a step-wise function for assortative mating preference. Six files are included, each with 3,000,000 simulations: one for each combination of population (two) and absolute difference in ancestry allowed as assortative (0.1, 0.05, or 0.025). Columns in each dataset include:
- sim_diff: the observed mean difference in ancestry proportion between mother-offspring pairs in the simulation
- sim_var: the observed variance of the difference in ancestry proportion between mother-offspring pairs in the simulation
- mating_prop: the relative strength of assortative mating by ancestry (0 = random, 1 = perfect) used to parametrize the simulation
File: Assortative_mating_ABC_CALM_CALL_betadist.tar.gz
Description: summary statistics from simulations of assortative mating used in Approximate Bayesian Computation of the strength of assortative mating at Calnali Low and Calnali Mid, with a continuous beta-distributed function for assortative mating preference. Three files are included, each with 3,000,000 simulations: one for each population, plus an additional set of simulations for Calnali Mid that omits an outlier individual from summary statistic calculations. Columns in each dataset include:
- sim_diff: the observed mean difference in ancestry proportion between mother-offspring pairs in the simulation
- sim_var: the observed variance of the difference in ancestry proportion between mother-offspring pairs in the simulation
- alpha: the relative strength of assortative mating by ancestry (1 = random, 20 = strong) used to parametrize the simulation
File: land_use.tar.gz
Description: raw and processed geographic data quantifying land use within the Rio Atlapexco drainage.
- all_sites_coords_PAPA_HZNP_Conz.csv: GPS coordinates and other geographic data for each genetic and chemical sampling site in the dataset.
- AtlapexcoDrainage_29Mar2022_compositeStrips_harmonized_psscene_analytic_sr_udm2.zip: 3 m2-resolution Planetscope Super Dove imagery of the study area from March 2022. TIF images 2022-03-29_strip_5529184_composite.tif and 2022-03-29_strip_5530607_composite.tif represent two strips of composite images with red, blue, and green bands.
- CEM_V3_20170619_R15_E13_TIF.zip: 15 m2-resolution digital elevation model of the state of Hidalgo, Mexico, in TIF format, downloaded from INEGI.
- Sentinel_May_2020-2023_multispectral_satellite_imagery.zip: Sentinel-2 multispectral imagery from May in the years 2020–2023. Images were accessed and merged through Google Earth Engine.
- Subcatchment_LandUse_full.csv: land use classification results for subcatchments of each genetic and chemical sampling site in the study. Sentinel pixels were classified as water, forest, non-forest vegetation (column names "herbaceous" or "herb"), and developed land (column names "developed"). For each of these classes, the dataset includes:
  - a count of raw pixels classified as that type in the area draining to the river between the previous site and the site in question (column named by the class itself)
  - the proportion of the land classified as that type in the same area (column prefix prop_)
  - a count of raw pixels classified as that type in the entire area draining to the river upstream of the site in question (column prefix cum_)
  - the proportion of the land classified as that type in that cumulative area (column prefix cumprop_)
  - Finally, classifications were manually inspected for pixels misclassified as developed, which were removed from analysis, after which statistics were recalculated (column prefix corrected_)
- Subcatchment_LandUse_50m.csv: land use classifications as in Subcatchment_LandUse_full.csv, but including only the area within 50m of a waterway (waterways defined based on digital elevation model, see manuscript methods).
- Subcatchment_LandUse_100m.csv: land use classifications as in Subcatchment_LandUse_full.csv, but including only the area within 100m of a waterway (waterways defined based on digital elevation model, see manuscript methods).
- Subcatchment_LandUse_500m.csv: land use classifications as in Subcatchment_LandUse_full.csv, but including only the area within 500m of a waterway (waterways defined based on digital elevation model, see manuscript methods).
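The proportion columns in the land use tables relate to the raw pixel counts in a simple way. The sketch below assumes that each proportion is the class count divided by the total count over the four classes (water, forest, herbaceous, developed). The function name and the example counts are hypothetical, and the real files may handle unclassified or corrected pixels differently.

```python
def class_proportions(counts: dict) -> dict:
    """Convert per-class pixel counts into proportions of the classified area."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("no classified pixels in this subcatchment")
    return {land_class: n / total for land_class, n in counts.items()}


# Invented pixel counts for one subcatchment, for illustration only.
example_counts = {"water": 10, "forest": 60, "herbaceous": 20, "developed": 10}
example_props = class_proportions(example_counts)
```

Applying the same function to the cumulative counts would give the cumulative proportions for the whole area upstream of a site, under the same assumption.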
Code/software
Code specifically associated with the manuscript can be found at https://github.com/benmoran11/anthro_xipho_hybrids, and scripts more generally used by the authors for processing ancestryinfer files can be found at https://github.com/Schumerlab/Lab_shared_scripts.
Access information
Raw FastQ files used for ancestry inference are available in the SRA archive under BioProjects PRJNA1346846, PRJNA930165, PRJNA610049, and PRJNA744894.
Please contact the corresponding authors of the associated manuscript for any clarification or help with data sharing!
This project was based on genomic DNA samples collected from fish in Hidalgo, Mexico, followed by low-coverage sequencing (raw data in NCBI SRA) and local ancestry inference using an HMM-based approach. The resulting ancestry information was summarized as the genome-wide ancestry proportion from each parent species, which was used for all genetic analyses in the paper. We also collected water quality information from the same sampling sites using a YSI multiparameter sonde, Orion turbidimeter, YSI colorimeter, Shimadzu TOC-L analyzer, and ICP-MS. ICP-MS was also used to quantify metal concentrations in whole fish carcasses. To quantify human impacts on the watersheds of each site, we combined Sentinel-2 multispectral imagery downloaded through Google Earth Engine, high-resolution satellite imagery from Planetscope, and a digital elevation model from INEGI to create land use classifications of the subcatchment upstream of each collection site. We also used simulations of assortative mating to estimate variation in mating preferences between sites. Finally, we collected formalin-fixed fish from a subset of sites and used histology to document differences in gross olfactory morphology. Further details are available in the online-only methods and supplemental information of the associated manuscript.
