Data from: geo-genomic predictors of genetree heterogeneity explain phylogeographic and introgression history: a case study in an Amazonian bird (Thamnophilus aethiops)
Data files
Nov 02, 2023 version files 17.55 MB
Abstract
Can knowledge about genome architecture inform biogeographic and phylogenetic inference? Selection, drift, recombination, and gene flow interact to produce a genomic landscape of divergence wherein patterns of differentiation and genealogy vary nonrandomly across the genomes of diverging populations. For instance, genealogical patterns that arise due to gene flow should be more likely to occur on smaller chromosomes, which experience high recombination, whereas those tracking histories of geographic isolation (reduced gene flow caused by a barrier) and divergence should be more likely to occur on larger and sex chromosomes. In Amazonia, populations of many bird species diverge and introgress across rivers, resulting in reticulated genomic signals. Herein, we used reduced representation genomic data to disentangle the biogeographic history of four populations of an Amazonian antbird, Thamnophilus aethiops, whose biogeographic history was associated with the dynamic evolution of the Madeira River Basin. Specifically, we evaluate whether a large river capture event ca. 200 kya, gave rise to reticulated genealogies in the genome by making spatially explicit predictions about isolation and gene flow based on knowledge about genomic processes. We first estimated chromosome-level phylogenies and recovered two primary topologies across the genome. The first topology (T1) was most consistent with predictions about population divergence, and was recovered for the Z chromosome. The second (T2), was consistent with predictions about gene flow upon secondary contact. To evaluate support for these topologies, we trained a convolutional neural network to classify our data into alternative diversification models and estimate demographic parameters. The best-fit model was concordant with T1 and included gene flow between non-sister taxa. Finally, we modeled levels of divergence and introgression as functions of chromosome length, and found that smaller chromosomes experienced higher gene flow. Given that (1) gene-trees supporting T2 were more likely to occur on smaller chromosomes and (2) we found lower levels of introgression on larger chromosomes (and especially the Z-chromosome), we argue that T1 represents the history of population divergence across rivers and T2 the history of secondary contact due to barrier loss. Our results suggest that a significant portion of genomic heterogeneity arises due to extrinsic biogeographic processes such as river capture interacting with intrinsic processes associated with genome architecture. Future biogeographic studies would benefit from accounting for genomic processes, as different parts of the genome reveal contrasting, albeit complementary histories, all of which are relevant for disentangling the intricate biogeographic mechanisms of biotic diversification.
README: Geo-genomic predictors of genetree heterogeneity explain phylogeographic and introgression history: a case study in an Amazonian bird (Thamnophilus aethiops)
This dataset contains R scripts for replicating analyses with chromosome length, along with output from TWISST
Key to directories:
- treeslider: directory containing chromosome trees from astral (.tre files), output from treeslider analyses containing sliding window topologies (Thamno_aethio_50000bp_window_min5snps.tree_table.csv), and the key to scaffold names in the .csv results file (Scaffold_names.csv). "NA" values in "Scaffold_names.csv" are given for non-chromosome assembly scaffolds as these were not used in any analyses. Columns for these files can be defined as follows:
- X - The scaffold "number" used in the analysis, corresponds to "scaffold" column in "Thamno_aethio_50000bp_window_min5snps.tree_table.csv"
- scaffold_name - The name of the scaffold according to the reference genome used in the ipyrad assembly
- scaffold_length - The total length of each scaffold (used downstream for chromosome lengths)
- tree.file - The name of the chromosome phylogeny .tre file in the same directory
- number - the chromosome number ("NA" values given for unmapped scaffolds)
- start - The start position on a given chromosome of each window
- end - The end position on a given chromosome of each window
- sites - The total number of sites with genotype data in a given window
- snps - The total number of polymorphic sites in a given window
- samples - The total number of samples with sequence data for a given window
- missing - THe total number of missing genotypes in a given window
- tree - The inferred genetree at a given window. These are not given if the window did not fit our filters (see paper methods for details).
- pixy_results: spreadsheets with genome wide summary statistics from pixy. Here results from pixy are divided into genome wide estimates of nucleotide diversity (files with "pi" in the filename), relative divergence (files with "fst" in the filename), and absolute divergence (files with "dxy" in the filename). Each pixy run was replicated for for windows 50 kilobases in length (files with "50kb" in the filename) and 250 kilobases (files with "250kb" in the filenames). They were also replicated at different sequence depths of 6 (files with "dp6" in the filename) and 10 (files with "dp10" in the filename. Columns in each file are defined as follows (from documentation at this link: https://pixy.readthedocs.io/en/latest/output.html):
<br>
pop
- The ID of the population from the population file (pi file only)pop1
- The ID of the first population from the population file (dxy and fst files).pop2
- The ID of the second population from the population file.chromosome
- The chromosome/contigwindow_pos_1
- The first position of the genomic windowwindow_pos_2
- The last position of the genomic windowavg_pi
- Average per site nucleotide diversity for the window. More specifically, pixy computes the weighted average nucleotide diversity per site for all sites in the window, where the weights are determined by the number of genotyped samples at each site.avg_dxy
- Average per site nucleotide divergence for the window.avg_wc_fst
- Average Weir and Cockerham’s FST for the window (per SNP, not per site).no_snps
- Total number of variable sites (SNPs) in the window.no_sites
- The total number of sites in the window that have at least one valid genotype. This statistic is included for the user, and not directly used in any calculations.count_diffs
- The raw number of pairwise differences between all genotypes in the window. This is the numerator of avg_pi.count_comparisons
- The raw number of non-missing pairwise comparisons between all genotypes in the window (i.e. cases where two genotypes were compared and both were valid). This is the denominator of avg_pi.count_missing
- The raw number of missing pairwise comparisons between all genotypes in the window (i.e. cases where two genotypes were compared and at least one was missing). "NA" values represent windows where statistics could not be estimated - abba-baba: results from introgression analyses.
<br>
abbababa_100SNPs_ref-nro-sro_ein.csv: results excluding west inambari population. "nan" values are given when statistics could not be calculated due to insufficient data
<br>
abbababa_100SNPs_ref-nro-sro_win.csv: results excluding east inambari population. "nan" values are given when statistics could not be calculated due to insufficient data
<br>
Column names are defined as follows (but we suggest looking at documentation here https://github.com/gibert-Fab/ABBA-BABA):
- windowID - A unique numeric identifier for each window in order by chromosome and across each chromosome
- scaffold - The scaffold the window is on (e.g., 1 = chrom_1, 10=chrom_10)
- start - The window start position (inclusive)
- end - The window end position
- mid - The window midpoint position
- sites - The number of genotype sites in the input file in each window
- sitesUsed - The number of sites used to compute statistics (biallelic SNPs)
- ABBA - Pseudo count of ABBA sites (including polymorphic sites)
- BABA - Pseudo count of BABA sites (including polymorphic sites)
- D - D statistic
- fd - fd admixture estimation
- fdM - Malinsky's modified statistic, fdM to accommodate admixture between either P1 and P3 or P2 and P3
- chromosome - Does the window lie on an autosome or sex chromosome?
- TWISST: results from TWISST analyses: <br> 100SNPs_ein.weights...: Topology weight for three unrooted topologies for the relationships between an outgroup and three populations of T. aethiops performed in Twisst. Here, we excluded the western Inambari population to test for the relationship between Rondonia populations. Values represent the absolute weight estimated with twists. nan: windows dropped off the analyses by filtering requirements (see Materials and Methods). <br> 100SNPs_ingroup.weights...: Topology weight for three unrooted topologies for the relationships between the four populations of T. aethiops performed in Twisst. Values represent the absolute weight estimated with twists. nan: Windows dropped off the analyses by filtering requirements (see Materials and Methods). <br> 100SNPs.data.tsv: Window coordinates used in Twisst analyses.
Four R-scripts can be used to replicate several figures and statistical analyses from the manuscript:
- chrom_architecture_new.R: This script will generate heatmaps in Figures 5, S6, and S7.
- chrom_architecture.R: This script will generate Figures 4, S5, and S8 and replicate statistical analyses for correlations between genome summary stats and chromosome length. It requires the files from "treeslider" and "pixy_results" directories.
- FDM_by_distance_from_center_by.R: This script will replicate Figure S9.
- fdm_correlations.R: This script will replicate Figures S10 and S11 and explores many introgression~genome architecture models. There is some overlap with chrom_architecture.R with the goal of exploration of different models.