Skip to main content

Data from: Comparative Performance of Popular Methods for Hybrid Detection using Genomic Data

Cite this dataset

Kong, Sungsik; Kubatko, Laura (2021). Data from: Comparative Performance of Popular Methods for Hybrid Detection using Genomic Data [Dataset]. Dryad.


Interspecific hybridization is an important evolutionary phenomenon that generates genetic variability in a population and fosters species diversity in nature. The availability of large genome scale datasets has revolutionized hybridization studies to shift from the examination of the presence or absence of hybrids in nature to the investigation of the genomic constitution of hybrids and their genome-specific evolutionary dynamics. Although a handful of methods have been proposed in an attempt to identify hybrids, accurate detection of hybridization from genomic data remains a challenging task. The available methods can be classified broadly as site pattern frequency based and population genetic clustering approaches, though the performance of the two classes of methods under different hybridization scenarios has not been extensively examined. Here, we use simulated data to comparatively evaluate the performance of four tools that are commonly used to infer hybridization events: the site pattern frequency based methods HyDe and the D-statistic (i.e., the ABBA-BABA test), and the population clustering approaches structure and ADMIXTURE. We consider single hybridization scenarios that vary in the time of hybridization and the amount of incomplete lineage sorting (ILS) for different proportions of parental contributions (γ); introgressive hybridization; multiple hybridization scenarios; and a mixture of ancestral and recent hybridization scenarios. We focus on the statistical power to detect hybridization, the false discovery rate (FDR) for the D-statistic and HyDe, and the accuracy of the estimates of γ as measured by the mean squared error for HyDe, structure, and ADMIXTURE. Both HyDe and the D-statistic demonstrate a high level of detection power in all scenarios except those with high ILS, although the D-statistic often has an unacceptably high FDR. The estimates of γ in HyDe are impressively robust and accurate whereas structure and ADMIXTURE sometimes fail to identify hybrids, particularly when the proportional parental contributions are asymmetric (i.e., when γ is close to 0). Moreover, the posterior distribution estimated using structure exhibits multimodality in many scenarios, making interpretation difficult. Our results provide guidance in selecting appropriate methods for identifying hybrid populations from genomic data.

Usage notes

Jupyter notebook for analyses

This repository has the Jupyter notebook that describes codes for simulating, processing, and analyzing all data sets conducted in this study.

R scripts for visualizations

This repository has all output files from the analyses and R scripts to reproduce the figures and table in this study.