Dryad

Swordtail fish hybrids reveal that genome evolution is surprisingly predictable after initial hybridization

Cite this dataset

Schumer, Molly (2024). Swordtail fish hybrids reveal that genome evolution is surprisingly predictable after initial hybridization [Dataset]. Dryad. https://doi.org/10.5061/dryad.qnk98sfq1

Abstract

Over the past two decades, biologists have come to appreciate that hybridization, or genetic exchange between distinct lineages, is remarkably common – not just in particular lineages but in taxonomic groups across the tree of life. As a result, the genomes of many modern species harbor regions inherited from related species. This observation has raised fundamental questions about the degree to which the genomic outcomes of hybridization are repeatable and the degree to which natural selection drives such repeatability. However, a lack of appropriate systems to answer these questions has limited empirical progress in this area. Here, we leverage independently formed hybrid populations between the swordtail fish Xiphophorus birchmanni and X. cortezi to address this fundamental question. We find that local ancestry in one hybrid population is remarkably predictive of local ancestry in another, demographically independent hybrid population. Applying newly developed methods, we can attribute much of this repeatability to strong selection in the earliest generations after initial hybridization. We complement these analyses with time-series data that demonstrates that ancestry at regions under selection has remained stable over the past ~40 generations of evolution. Finally, we compare our results to the well-studied X. birchmanni×X. malinche hybrid populations and conclude that deeper evolutionary divergence has resulted in stronger selection and higher repeatability in patterns of local ancestry in hybrids between X. birchmanni and X. cortezi.

README: Genome evolution is surprisingly predictable after initial hybridization

https://doi.org/10.5061/dryad.qnk98sfq1

This dataset contains ancestry posterior probability files, processed ancestry files, LD map files, data used in wavelet analysis, genome annotation files, and input data used to generate figures in the manuscript.

Description of the data and file structure

Ancestry file format:

ancestry-probs-par1: posterior probability of homozygous X. birchmanni ancestry for different samples and populations output by ancestryhmm

ancestry-probs-par2: posterior probability of homozygous X. cortezi ancestry for different samples and populations output by ancestryhmm

Posterior probabilities range from 0-1 and are given for each ancestry informative marker in each individual. The probability that a site is heterozygous for X. birchmanni and X. cortezi ancestry is equal to 1-(ancestry-probs-par1+ancestry-probs-par2).
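The relationship between the three genotype probabilities can be illustrated with a minimal Python sketch. The function names and example values are hypothetical and are not part of the dataset or of the ancestryhmm output; the expected parent 1 allele fraction is a common summary, shown here as an assumption rather than as the study's method.

```python
def heterozygous_prob(p_par1, p_par2, tol=1e-6):
    """P(heterozygous) = 1 - (P(homozygous par1) + P(homozygous par2)) at one marker."""
    total = p_par1 + p_par2
    if total > 1.0 + tol:
        raise ValueError("homozygous posteriors sum to more than 1")
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - total)


def expected_par1_fraction(p_par1, p_par2):
    """Expected fraction of alleles from parent 1 at one marker (assumed summary)."""
    return p_par1 + 0.5 * heterozygous_prob(p_par1, p_par2)


# Hypothetical posteriors for three markers in one individual.
par1 = [0.98, 0.01, 0.00]  # values as in an ancestry-probs-par1 file
par2 = [0.01, 0.02, 0.99]  # values as in an ancestry-probs-par2 file

het = [heterozygous_prob(a, b) for a, b in zip(par1, par2)]
par1_fraction = [expected_par1_fraction(a, b) for a, b in zip(par1, par2)]
```

In this toy example the second marker is most likely heterozygous, because both homozygous posteriors are close to 0.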

The file name contains the chromosomes analyzed (allchrs), the population and individuals analyzed (CHPL - Chapulhuacanito, STAC - Santa Cruz, HUEX - Huextetitla), and the collection year (2003, 2006, 2017, 2019, 2020, or 2021). Additionally, if the data were analyzed using a thinned set of ancestry informative markers, this is noted in the file name as "thinned AIMs."

Recombination map files:

LD_map post.txt files: output of LDhelmet estimating 4Ner between SNP intervals in X. birchmanni and X. cortezi population samples

Recombination maps are provided separately for each chromosome (noted in the file name, e.g. chr-01 or scaf_01). The species to which the map corresponds is also noted in the file name: xbir - X. birchmanni and xcor - X. cortezi. Details on the LDhelmet output file format can be found in the LDhelmet documentation: https://github.com/popgenmethods/LDhelmet

Genome annotation files:

Gene annotation: .gff

.gff format files are provided for annotation of the X. birchmanni (xbir-COAC-16-VIII-22-M_v2023.1.gff) and X. cortezi (xcor-PTHC-08-XII-21-M_v2023.2.gff) genome assemblies. Annotation was conducted using GeneWise, Exonerate, and AUGUSTUS, as described in the main text.

RepeatMasker annotation: .fa.out

Predicted coordinates of repetitive elements identified in the X. birchmanni (xbir-COAC-16-VIII-22-M_v2023.1.fa.out) and X. cortezi (xcor-PTHC-08-XII-21-M_v2023.2.fa.out) genome assemblies using RepeatModeler and RepeatMasker with the Repbase and FishTEDB repeat libraries, as described in the main text. The RepeatMasker output format is described in detail on the program's website: https://www.repeatmasker.org/

Processed files:

Files used in the analyses, summarizing average ancestry, coding and conserved basepair density, and recombination rate.

allContempPops_crossover_codes_0.5cM_windows - input data file for ancestry crossover principal component analysis. For each window and each individual, a 1 indicates that a crossover was observed in that window in that individual and a 0 indicates that it was not.
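As a rough illustration of this 0/1 coding, the sketch below tallies crossovers per individual and the fraction of individuals with a crossover in each window. The in-memory layout (a dictionary mapping individual IDs to per-window codes), the individual IDs, and the values are all hypothetical; the on-disk layout of the file is not described here, so any parsing step would need to be checked against the file itself.

```python
def crossovers_per_individual(codes):
    """Total number of windows with an observed crossover, per individual."""
    return {ind: sum(row) for ind, row in codes.items()}


def crossover_frequency_per_window(codes):
    """Fraction of individuals with a crossover observed in each window."""
    rows = list(codes.values())
    n_individuals = len(rows)
    # zip(*rows) walks the windows (columns) across all individuals.
    return [sum(window) / n_individuals for window in zip(*rows)]


# Hypothetical codes for three individuals across four 0.5 cM windows.
codes = {
    "ind_A": [0, 1, 0, 0],
    "ind_B": [0, 1, 1, 0],
    "ind_C": [0, 0, 0, 0],
}

per_individual = crossovers_per_individual(codes)
per_window = crossover_frequency_per_window(codes)
```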

allSNPs_allsamples_xcorMapped_allChrs.map/ped - files in plink format for PCA analysis of SNPs for all high coverage individuals.

hybrids_only_sharedXcorSNPs_allsamples_xcorMapped_allChrs.map/ped - files in plink format for PCA analysis of SNPs for all hybrid individuals. Only SNPs that fall in X. cortezi ancestry tracts are represented in this file (see Methods).

sharedXcorSNPs_allsamples_xcorMapped_allChrs.map/ped - files in plink format for PCA analysis of SNPs for all individuals. Only SNPs that fall in X. cortezi ancestry tracts are represented in this file (see Methods).

sharedXcorSNPs_justXcorInds_xcorMapped_allChrs.map/ped - files in plink format for PCA analysis of SNPs for pure X. cortezi individuals. Only SNPs that fall in X. cortezi ancestry tracts are represented in this file (see Methods).

xbir_pacbio2023_allChrs_*cM_windows_recRate_codingBP_conservedBP_everything_minPar.txt - text files summarizing average minor parent ancestry, number of coding and conserved basepairs, number of SNPs and ancestry informative markers, for each population in windows of 0.05, 0.1, 0.25, 0.5 and 1 cM.

xbir_2023pacbio_genome_0.05cm_windows_recRate_codingBP_xcorxbir_minPar_deserts_island.txt - text file following the above format but with an additional column specifying whether a region was a desert or island, and in which population it was determined to be a desert or island.

xbir_pacbio2023_allChrs_*kb_windows_recRate_codingBP_conservedBP_everything_minPar.txt - text files summarizing average minor parent ancestry, number of coding and conserved basepairs, number of SNPs and ancestry informative markers, for each population in windows of 5, 10, 50, 100, 250, 500, and 1000 kb.

xbirXcor_hybridIndex.txt - text file containing hybrid index of individuals analyzed in this study

Wavelet analysis files

Wavelet analyses were run on interpolated estimates of ancestry (see above under "Ancestry file format" for raw ancestry calls).

Interpolation windows were intersected with the LD-based recombination map and gene features to give interpolated estimates of recombination and gene density. This was done at two resolutions: 32 kb (main text) and 1 kb (supplement).

The following files contain interpolated ancestry and recombination estimates:

interpolated_ancestry_recomb_32kb.txt

interpolated_ancestry_recomb_1kb.txt

Results of wavelet analyses (wavelet variances and wavelet correlations) are provided in data tables associated with the relevant figures under "Data used to produce manuscript figures".

Data used to produce manuscript figures

Each file is named for the figure and panel it was used to produce. For example: "Fig1C_pcaDF_stacChplHuex_noOut_dataframe.txt" is the data file used to produce Figure 1 panel C in the manuscript.
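The naming convention can be used to look up the figure and panel for a given file. The sketch below is a minimal, hypothetical helper based only on the single example name given above; the pattern (including the optional "S" prefix for supplementary figures) is an assumption and may not match every file in the dataset.

```python
import re

# Assumed pattern: "Fig", a figure number (optionally prefixed with "S"),
# an optional single-letter panel, an underscore, then a description.
FIG_PATTERN = re.compile(
    r"^Fig(?P<figure>S?\d+)(?P<panel>[A-Z]?)_(?P<description>.+)\.txt$"
)


def parse_figure_file(name):
    """Return (figure, panel, description), or None if the name does not match."""
    match = FIG_PATTERN.match(name)
    if match is None:
        return None
    panel = match.group("panel") or None  # empty string means no panel letter
    return match.group("figure"), panel, match.group("description")


parsed = parse_figure_file("Fig1C_pcaDF_stacChplHuex_noOut_dataframe.txt")
```

Files that do not follow the assumed pattern return None, so they can be listed and checked by hand rather than silently misassigned.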

Sharing/Access information

Raw sequence data are available from NCBI under BioProject PRJNA1106506.

Code/Software

Code used in this project is available at: https://github.com/Schumerlab/

Funding

National Institute of General Medical Sciences