Data from: Contact zones reveal restricted introgression despite frequent hybridization across a recent lizard radiation
Data files
Dec 13, 2024 version files 122.20 MB
- mtDNA_data.zip (71.79 KB)
- phylogenies.zip (28.10 KB)
- R_data.zip (122.08 MB)
- README.md (10.20 KB)
Abstract
Introgression — the exchange of genetic material through hybridization — is now recognized as common among animal species. The extent of introgression, however, can vary considerably even when it occurs: for example, introgression can be geographically restricted or so pervasive that populations merge. Such variation highlights the importance of understanding the factors mediating introgression. Here we used genome-wide SNP data to assess hybridization and introgression at 32 contact zones, comprising 21 phylogenetic independent contrasts across a recent lizard radiation (Heteronotia). We then tested the relationship between the extent of introgression (average admixture at contact zones) and genomic divergence across independent contrasts. Early-generation hybrids were detected at contact zones spanning the range of genomic divergence included here. Despite this, we found that introgression is remarkably rare and, when observed, geographically restricted. Only the two most genomically similar population pairs showed introgression beyond 5 km of the contact zone. Introgression dropped precipitously at only modest levels of genomic divergence, beyond which it was absent or extremely low. Our results contrast with the growing number of studies indicating that introgression is prevalent among animals, suggesting that animal groups will vary considerably in their propensity for introgression.
README: Data from: Contact zones reveal restricted introgression despite frequent hybridization across a recent lizard radiation
https://doi.org/10.5061/dryad.prr4xgxx6
Description of the data and file structure
Three folders are included here; the contents of each are outlined below. The accompanying R scripts are available via Zenodo (see the link on the Dryad page for this dataset).
Files and variables
File: mtDNA_data.zip
Description: This folder simply contains the ND2 mitochondrial DNA sequence alignment in fasta format (hets_nd2.fasta) along with the respective partition file (hets_nd2_partitions.txt) for phylogenetic analysis.
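As a rough illustration of how the alignment could be inspected outside the authors' R scripts, the following Python sketch parses FASTA text into a dictionary and checks that all sequences share one length. The function names `read_fasta` and `is_aligned` and the toy sequences are hypothetical; only the file name hets_nd2.fasta comes from this dataset.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA records from an open text handle into {name: sequence}."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep the first whitespace-delimited token as the ID.
            header = line[1:].strip()
            name = header.split()[0] if header else f"unnamed_{len(records)}"
            records[name] = []
        elif name is not None:
            # Sequence line: may be wrapped over several lines per record.
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def is_aligned(records):
    """Return True when every sequence has the same length."""
    return len({len(seq) for seq in records.values()}) <= 1


# Toy input standing in for hets_nd2.fasta (sequences are made up).
toy = StringIO(">sampleA\nACGT\nAC-T\n>sampleB some description\nACGTACGT\n")
alignment = read_fasta(toy)
print(is_aligned(alignment))
```

To apply this to the real file, the toy handle could be replaced with `open("hets_nd2.fasta")` after unzipping mtDNA_data.zip.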
File: phylogenies.zip
Description: This folder contains two files in Newick format: 1) the IQ-TREE analysis of the mtDNA ND2 data (nd2-iqtree.tree); 2) the neighbor-joining tree created from the concatenated alignment of DArTseq data (concat_nj.tree).
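For a quick check of which samples appear in a Newick tree, a minimal Python sketch such as the one below could list the tip labels. It assumes unquoted labels (no commas, colons, or parentheses inside names) and is not a full Newick parser; the function name `newick_tip_labels` and the example tree string are hypothetical.

```python
import re


def newick_tip_labels(newick):
    """Return tip labels from a Newick string with unquoted labels.

    A tip label directly follows "(" or "," and runs until the next
    ":", ",", ")", "(" or ";". Internal node labels (for example support
    values) follow ")" and are therefore not captured.
    """
    matches = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [label.strip() for label in matches if label.strip()]


# Made-up tree with a support value (90) on one internal node.
example_tree = "((A:0.1,B:0.2)90:0.05,C:0.3);"
print(newick_tip_labels(example_tree))
```

The same function could be applied to the text read from nd2-iqtree.tree or concat_nj.tree, provided the labels in those files are unquoted.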
File: R_data.zip
Description: This folder contains all the data and spreadsheets needed to run the R scripts available via Zenodo. Explanations for these are also given in the respective R scripts:
1. hets_dxy_2024.fasta is the 319,865 bp alignment of 5,195 concatenated DArTseq loci. These data were used to calculate pairwise genomic divergence among samples to create the neighbor-joining phylogeny and to estimate DXY among candidate species pairs.
2. hets-dart-2024.csv is the full DArT SNP data that excludes samples from Fenker et al. 2024. This is in the typical format provided by DArT, where columns represent individuals and rows represent variant sites. For variant sites, a 0 value indicates the sample is homozygous for the reference allele; under DArT's standard one-row encoding, a 1 value represents a homozygote for the alternate (SNP) allele and a 2 value represents a heterozygote for that site. Data in this format can be imported and analysed in R using the package dartR.
3. hets_meta-2024.csv is the metadata associated with hets-dart-2024.csv. Columns show the sample ID, the candidate species to which the sample belongs, and the latitude and longitude of the respective sample.
4. hets-dart-old.csv is the full DArT SNP data that includes samples from Fenker et al. 2024. This is in the typical format provided by DArT, where columns represent individuals and rows represent variant sites. For variant sites, a 0 value indicates the sample is homozygous for the reference allele; under DArT's standard one-row encoding, a 1 value represents a homozygote for the alternate (SNP) allele and a 2 value represents a heterozygote for that site. Data in this format can be imported and analysed in R using the package dartR.
5. hets_meta-old.csv is the metadata associated with hets-dart-old.csv. Columns show the sample ID, the candidate species to which the sample belongs, and the latitude and longitude of the respective sample.
6. hets-Dxy-2024.csv is a spreadsheet with the DXY values for each pairwise lineage combination. The values were calculated with the dist.dna function in the R package ape v.5.6-2, using the hets_dxy_2024.fasta alignment described above.
7. het_cz_final.csv lists information for each contact zone. Each row represents a different contact zone, and columns give information on the following: "area", a shorthand name given to each contact zone but of no importance to the analyses done; "pop1", the first of the two candidate species at the contact zone; "pop2", the second of the two candidate species at the contact zone; "lat", the latitude for the centrepoint of the contact zone; "lon", the longitude for the centrepoint of the contact zone; "radius", the radius (in km) around the centrepoint within which samples were included for the respective analysis; "minLat", used for plotting, this defines the minimum latitude for the respective map showing the results of sNMF analysis; "maxLat", used for plotting, this defines the maximum latitude for the respective map showing the results of sNMF analysis; "minLon", used for plotting, this defines the minimum longitude for the respective map showing the results of sNMF analysis; "maxLon", used for plotting, this defines the maximum longitude for the respective map showing the results of sNMF analysis; "num1", the lineage number for the lineage listed under "pop1", as defined in Figure 1; "num2", the lineage number for the lineage listed under "pop2", as defined in Figure 1; "combo", the combination of candidate species at the respective contact zone as shown by the number reference.
8. hybrid_hets.csv lists the samples that were identified as early-generation hybrids (i.e., F1, F2, and BC1 hybrids) via NewHybrids analysis. The first column lists the sample name and the second column states whether it is an early-generation hybrid. This file is used in the code for contact zone analyses. The same information is shown in more detail in Table S1.
9. phylo.ind.contrats.csv gives the mean admixture and DXY for phylogenetic independent contrasts. "df_name" shows the combination of candidate species for the respective contact zone, separated by a period (full-stop); "mean_admixture" gives the mean admixture proportion for the respective contact zone; "Dxy" gives the average between-population genomic divergence; "hybrids" simply lists whether early-generation hybrids were detected at the respective contact zone.
10. all.admix.combined.csv gives the admixture proportions for all samples across all contact zones. Each row reflects an individual (some individuals appear in multiple contact zones) and columns give the following: "P1", the admixture proportion with respect to one candidate species at the contact zone; "P2", the admixture proportion with respect to the other candidate species at the contact zone; "ind", the ID number for the respective individual; "pop", the candidate species to which the sample belongs, determined by the larger of the two ancestry proportions; "lat", the latitude for the respective sample; "lon", the longitude for the respective sample; "latr", used for plotting purposes, this is the latitude of the respective sample with a very small amount of random noise added so that pie charts do not perfectly sit on top of each other when plotted; "lonr", used for plotting purposes, this is the longitude of the respective sample with a very small amount of random noise added so that pie charts do not perfectly sit on top of each other when plotted; "df_name", the combination of candidate species for the respective contact zone, separated by a period (full-stop), and with .cz appended at the end; "hybrid", whether the individual is an early-generation hybrid between the two lineages at that contact zone; "admixture", the lower admixture proportion for the respective individual; "rank", used for plotting, gives the rank order of the respective contact zone in terms of DXY, with rank increasing as DXY increases; "rankCol", used for plotting, is simply used to alternate colours in the plot so that adjacent contact zones have different colours.
11. mean.cz.admix.csv gives the average admixture for each contact zone. Each row is a different contact zone, and columns show: "pop1", the first of the two candidate species at the contact zone; "pop2", the second of the two candidate species at the contact zone; "combo", the names of the two respective candidate species combined, separated by a period (full-stop); "mean_admixture", the average admixture proportion for the respective contact zone; "bootstrap", the average admixture proportion obtained from the bootstrapping sensitivity analysis; "Dxy", the average genomic divergence among samples between different lineages at the respective contact zone; "hybrids", whether or not early-generation hybrids were detected at the contact zone.
12. sensi.admix.comp.csv gives average admixture values for each contact zone using the 2024 SNP dataset, the dataset that includes samples from Fenker et al., and the bootstrap averages from the sensitivity analysis. Each row represents a contact zone, and columns show: "df_name", the combination of lineages for the respective contact zone, separated by a period; "newAdmix", the average admixture proportion using the SNP dataset that excludes samples from Fenker et al.; "sensiAdmix", the average admixture proportion calculated from the bootstrapping sensitivity analysis; "oldAdmix", the average admixture proportion using the SNP dataset that includes samples from Fenker et al.; "Dxy", average between-population genomic sequence divergence; "combo", the combination of candidate species at the respective contact zone as shown by the number references defined in Figure 1; "rank", the rank order of contact zones with respect to Dxy, with the rank number increasing with increasing Dxy.
13. sept.cz.combos.csv simply gives the sample sizes and lineage number combinations for each contact zone. This is used for plotting and is required to produce the figures in het_contact-zone_analyses.R. Rows represent contact zones and columns show: "pop1", the first of the two candidate species at the contact zone; "pop2", the second of the two candidate species at the contact zone; "num1", the number used to reference "pop1" as defined in Figure 1; "num2", the number used to reference "pop2" as defined in Figure 1; "combo", the combination of "num1" and "num2"; "N", the number of samples included in the analysis of the respective contact zone; "full", used for plotting Figure 4, is the combination of "num1", "num2", and "N".
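To make the DXY values described above more concrete, here is a minimal Python sketch of a between-lineage divergence calculation: the mean pairwise sequence distance over all sample pairs drawn from two different lineages. It is not the authors' procedure. The published values were computed with dist.dna in ape, and the substitution model is not stated in this README, so the sketch assumes an uncorrected p-distance and simply skips any site where either sample has a gap, N, or ambiguity code. All function names, sample IDs, and sequences are hypothetical.

```python
from itertools import product

VALID_BASES = set("ACGT")


def p_distance(seq_a, seq_b):
    """Proportion of differing sites among sites where both samples are A/C/G/T.

    Assumption: uncorrected p-distance; gaps, N and ambiguity codes are skipped.
    """
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in VALID_BASES and b in VALID_BASES:
            compared += 1
            differing += a != b
    return differing / compared if compared else float("nan")


def dxy(alignment, pop_of, pop1, pop2):
    """Mean pairwise distance between samples of pop1 and samples of pop2."""
    group1 = [s for s in alignment if pop_of.get(s) == pop1]
    group2 = [s for s in alignment if pop_of.get(s) == pop2]
    distances = [
        p_distance(alignment[a], alignment[b]) for a, b in product(group1, group2)
    ]
    if not distances:
        raise ValueError("Both lineages need at least one sample in the alignment")
    return sum(distances) / len(distances)


# Made-up alignment and lineage assignments (not taken from the dataset).
toy_alignment = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "ACGAACGA",
    "s4": "ACNAACGA",
}
toy_pops = {"s1": "lineageA", "s2": "lineageA", "s3": "lineageB", "s4": "lineageB"}
print(dxy(toy_alignment, toy_pops, "lineageA", "lineageB"))
```

In practice the alignment would come from hets_dxy_2024.fasta and the lineage assignments from the candidate species column of the metadata file; a model-corrected distance would give somewhat different values from this uncorrected version.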
Code/software
Three scripts are included with the Zenodo files:
het_contact-zone_analyses.R — This imports the SNP data and performs a loop that does SNP filtering, IBD analysis, and sNMF analysis across all the contact zones. It then summarises and plots the results from these analyses. See annotations within.
candidate-species-delimitation.R — This does SNP filtering, Principal Coordinates Analysis (PCoA), neighbor-joining phylogenetic analysis, and fixed differences analysis. See annotations within.
admix-sensitivity.R — This performs bootstrapping sensitivity using the results from sNMF across contact zones. See annotations within.
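The bootstrapping step can be pictured with a short Python sketch. It takes per-individual ancestry proportions for one contact zone, computes each individual's admixture as the lower of its two proportions (the definition given for the "admixture" column of all.admix.combined.csv), and then resamples individuals with replacement to obtain a bootstrap average. The resampling scheme, the number of replicates, and all names and values below are assumptions made for illustration; admix-sensitivity.R documents the procedure actually used.

```python
import random
from statistics import mean


def individual_admixture(p1, p2):
    """Lower of the two ancestry proportions for one individual."""
    return min(p1, p2)


def bootstrap_mean_admixture(values, n_boot=1000, seed=1):
    """Average of bootstrap-replicate means, resampling individuals with replacement.

    Assumption: individuals within a contact zone are the resampling unit.
    """
    if not values:
        raise ValueError("At least one admixture value is required")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    replicate_means = [mean(rng.choices(values, k=n)) for _ in range(n_boot)]
    return mean(replicate_means)


# Made-up (P1, P2) pairs for four individuals at one contact zone.
toy_rows = [(0.99, 0.01), (0.60, 0.40), (0.02, 0.98), (1.0, 0.0)]
values = [individual_admixture(p1, p2) for p1, p2 in toy_rows]
observed = mean(values)
boot = bootstrap_mean_admixture(values, n_boot=200, seed=1)
print(observed, boot)
```

Comparing `observed` with `boot` across contact zones mirrors the kind of comparison stored in sensi.admix.comp.csv, where the "newAdmix" and "sensiAdmix" columns sit side by side.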