Data from: geo-genomic predictors of genetree heterogeneity explain phylogeographic and introgression history: a case study in an Amazonian bird (Thamnophilus aethiops)
Abstract
Can knowledge about genome architecture inform biogeographic and phylogenetic inference? Selection, drift, recombination, and gene flow interact to produce a genomic landscape of divergence wherein patterns of differentiation and genealogy vary nonrandomly across the genomes of diverging populations. For instance, genealogical patterns that arise due to gene flow should be more likely to occur on smaller chromosomes, which experience high recombination, whereas those tracking histories of geographic isolation (reduced gene flow caused by a barrier) and divergence should be more likely to occur on larger and sex chromosomes. In Amazonia, populations of many bird species diverge and introgress across rivers, resulting in reticulated genomic signals. Herein, we used reduced representation genomic data to disentangle the biogeographic history of four populations of an Amazonian antbird, Thamnophilus aethiops, whose biogeographic history was associated with the dynamic evolution of the Madeira River Basin. Specifically, we evaluate whether a large river capture event ca. 200 kya, gave rise to reticulated genealogies in the genome by making spatially explicit predictions about isolation and gene flow based on knowledge about genomic processes. We first estimated chromosome-level phylogenies and recovered two primary topologies across the genome. The first topology (T1) was most consistent with predictions about population divergence, and was recovered for the Z chromosome. The second (T2), was consistent with predictions about gene flow upon secondary contact. To evaluate support for these topologies, we trained a convolutional neural network to classify our data into alternative diversification models and estimate demographic parameters. The best-fit model was concordant with T1 and included gene flow between non-sister taxa. Finally, we modeled levels of divergence and introgression as functions of chromosome length, and found that smaller chromosomes experienced higher gene flow. Given that (1) gene-trees supporting T2 were more likely to occur on smaller chromosomes and (2) we found lower levels of introgression on larger chromosomes (and especially the Z-chromosome), we argue that T1 represents the history of population divergence across rivers and T2 the history of secondary contact due to barrier loss. Our results suggest that a significant portion of genomic heterogeneity arises due to extrinsic biogeographic processes such as river capture interacting with intrinsic processes associated with genome architecture. Future biogeographic studies would benefit from accounting for genomic processes, as different parts of the genome reveal contrasting, albeit complementary histories, all of which are relevant for disentangling the intricate biogeographic mechanisms of biotic diversification.
This dataset contains R scripts for replicating analyses with chromosome length, along with output from TWISST
Key to directories:
- treeslider: directory containing chromosome trees from astral (.tre files), output from treeslider analyses containing sliding window topologies (Thamno_aethio_50000bp_window_min5snps.tree_table.csv), and the key to scaffold names in the .csv results file (Scaffold_names.csv). "NA" values in "Scaffold_names.csv" are given for non-chromosome assembly scaffolds as these were not used in any analyses. Columns for these files can be defined as follows:
- X - The scaffold "number" used in the analysis, corresponds to "scaffold" column in "Thamno_aethio_50000bp_window_min5snps.tree_table.csv"
- scaffold_name - The name of the scaffold according to the reference genome used in the ipyrad assembly
- scaffold_length - The total length of each scaffold (used downstream for chromosome lengths)
- tree.file - The name of the chromosome phylogeny .tre file in the same directory
- number - the chromosome number ("NA" values given for unmapped scaffolds)
- start - The start position on a given chromosome of each window
- end - The end position on a given chromosome of each window
- sites - The total number of sites with genotype data in a given window
- snps - The total number of polymorphic sites in a given window
- samples - The total number of samples with sequence data for a given window
- missing - THe total number of missing genotypes in a given window
- tree - The inferred genetree at a given window. These are not given if the window did not fit our filters (see paper methods for details).
- pixy_results: spreadsheets with genome wide summary statistics from pixy. Here results from pixy are divided into genome wide estimates of nucleotide diversity (files with "pi" in the filename), relative divergence (files with "fst" in the filename), and absolute divergence (files with "dxy" in the filename). Each pixy run was replicated for for windows 50 kilobases in length (files with "50kb" in the filename) and 250 kilobases (files with "250kb" in the filenames). They were also replicated at different sequence depths of 6 (files with "dp6" in the filename) and 10 (files with "dp10" in the filename. Columns in each file are defined as follows (from documentation at this link: https://pixy.readthedocs.io/en/latest/output.html):
Four R-scripts can be used to replicate several figures and statistical analyses from the manuscript:
- chrom_architecture_new.R: This script will generate heatmaps in Figures 5, S6, and S7.
- chrom_architecture.R: This script will generate Figures 4, S5, and S8 and replicate statistical analyses for correlations between genome summary stats and chromosome length. It requires the files from "treeslider" and "pixy_results" directories.
- FDM_by_distance_from_center_by.R: This script will replicate Figure S9.
- fdm_correlations.R: This script will replicate Figures S10 and S11 and explores many introgression~genome architecture models. There is some overlap with chrom_architecture.R with the goal of exploration of different models.
