Data from: Chromosome-scale inference of hybrid speciation and admixture with convolutional neural networks
Data files
Aug 06, 2020 version files 12.61 GB
Abstract
Inferring the frequency and mode of hybridization among closely related organisms is an important step for understanding the process of speciation and can help to uncover reticulated patterns of phylogeny more generally. Phylogenomic methods to test for the presence of hybridization come in many varieties and typically operate by leveraging expected patterns of genealogical discordance in the absence of hybridization. An important assumption made by these tests is that the data (genes or SNPs) are independent given the species tree. However, when the data are closely linked, it is especially important to consider their non-independence. Recently, deep learning techniques such as convolutional neural networks (CNNs) have been used to perform population genetic inferences with linked SNPs coded as binary images. Here we use CNNs for selecting among candidate hybridization scenarios using the tree topology (((P1,P2),P3),Out) and a matrix of pairwise nucleotide divergence (dXY) calculated in windows across the genome. Coalescent simulations used to train and independently test a neural network showed that our method, HyDe-CNN, accurately performed model selection for hybridization scenarios across a wide breadth of parameter space. We then used HyDe-CNN to test models of admixture in Heliconius butterflies and compared it to a random forest classifier trained on introgression-based statistics. Given the flexibility of our approach, the dropping cost of long-read sequencing, and the continued improvement of CNN architectures, we anticipate that inferences of hybridization using deep learning methods like ours will help researchers to better understand patterns of admixture in their study organisms.
Usage notes
CSV Files for HyDe-CNN Tests
[hyde-cnn_tests.tar.gz] -- This archive contains the CSV files (12 in total, three for each model) with the results of testing the trained HyDe-CNN architecture using 10,000 additional simulated data sets for each of the four models at each of the branch scaling factors. Each CSV file has the parameters used to simulate each image with msprime, the predicted best model, the best model weight, and the summary statistics calculated for training the random forest classifier.
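The CSV files can be inspected without unpacking the whole archive. The following is a minimal sketch, assuming only that the archive is a gzip-compressed tar file whose CSV members end in .csv and start with a header row. The function name csv_headers_in_archive is hypothetical, and because the column names are not documented here, the code simply reports whatever header each file has.

```python
import csv
import io
import tarfile


def csv_headers_in_archive(archive_path):
    """Map each CSV member of a .tar.gz archive to its header row.

    Hypothetical helper: it assumes each CSV file begins with a header
    line, but makes no assumption about the column names themselves.
    """
    headers = {}
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            # Skip directories and anything that is not a CSV file.
            if not (member.isfile() and member.name.endswith(".csv")):
                continue
            raw = archive.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Read only the first row; an empty file yields an empty list.
            headers[member.name] = next(csv.reader(text), [])
    return headers
```

Calling csv_headers_in_archive("hyde-cnn_tests.tar.gz") on the downloaded archive should return one entry per CSV file, which is a quick way to find the columns holding the simulation parameters, the predicted best model, and the model weight.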
Random Forest Classifier Results
[RF_classifier_results.txt] -- Raw output of the random forest classifier trained on introgression-specific summary statistics.
Trained Models for the HyDe-CNN Architecture
[trained_models_hyde-cnn.tar.gz] -- This archive contains the trained models for all of the neural networks for the HyDe-CNN architecture.
Trained Models for the Flagel et al. Architecture
[trained_models_flagel.tar.gz] -- This archive contains the trained models for all of the neural networks for the Flagel et al. architecture.
hyde_cnn_*_data_*.npz
Nine compressed NumPy archives (.npz) containing the input images split into training, validation, and testing sets. There is one file for each combination of input type (min, mean, min+mean) and branch scaling in coalescent units (0.5, 1.0, 2.0).
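Because the names of the arrays stored inside each .npz file are not listed here, it helps to inspect a file before loading it in full. The sketch below assumes only that the files are standard NumPy .npz archives; the helper names summarize_npz and summarize_all are hypothetical, and the default glob pattern is the file name pattern given above.

```python
import glob

import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz file."""
    summary = {}
    # np.load on an .npz file returns a lazy, dict-like NpzFile: an array is
    # only read from disk when it is indexed, so arrays load one at a time.
    with np.load(path) as archive:
        for name in archive.files:
            array = archive[name]
            summary[name] = (array.shape, str(array.dtype))
    return summary


def summarize_all(pattern="hyde_cnn_*_data_*.npz"):
    """Summarize every file in the working directory matching the pattern."""
    return {path: summarize_npz(path) for path in sorted(glob.glob(pattern))}
```

Running summarize_all() in the download directory should list the stored array names with their shapes, which shows how the training, validation, and testing sets are keyed in each file. The files are large, so reading a single array from one file at a time is likely to be gentler on memory than loading everything at once.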
HyDe-CNN Code Archive
[hyde-cnn_code_archive.tar.gz] -- Archived versions of all Python and R scripts used to generate, process, and analyze data in the paper. All of these scripts are also available on GitHub.
Heliconius Chromosome Five VCF and Recombination Map
[heliconius_data.tar.gz] -- VCF file containing variants on chromosome five for Heliconius samples as well as the recombination map for simulating chromosome five.
Trained Models for Heliconius
[trained_models_heliconius.tar.gz] -- This archive contains the trained models for all of the neural networks for the HyDe-CNN architecture.
Heliconius Resampling Results
[heliconius_res.tar.gz] -- CSV files with the predicted model weights for all 100 bootstrap replicates for the three different input types (min, mean, min+mean).
heliconius_*_data.npz
Compressed NumPy archives (.npz) containing the input images split into training, validation, and testing sets for the Heliconius example. There is one file for each input type (min, mean, min+mean).
Heliconius Code Archive
[heliconius_code_archive.tar.gz] -- Contains the code for simulating data to train, validate, and test a CNN, as well as a Jupyter Notebook that was used to process the observed data from chromosome five. This code is also on GitHub.