Data from: Genomic architecture drives population structuring in Amazonian birds
Data files
Sep 28, 2023 version files 6.65 GB
Abstract
Geographic barriers are frequently invoked to explain genetic structuring across the landscape. However, inferences on the spatial and temporal origins of population variation have been largely limited to evolutionary neutral models, ignoring the potential role of natural selection and intrinsic genomic processes known as genomic architecture in producing heterogeneity in differentiation across the genome. To test how genomic architecture impacts our ability to reconstruct general patterns of diversification in species that co-occur across geographic barriers, we sequenced the whole genomes of multiple bird populations that are distributed across rivers in southeastern Amazonia. We found that phylogenetic relationships within species and demographic parameters varied across the genome in predictable ways. Genetic diversity was positively associated with recombination rate and negatively associated with species tree support. Gene flow was less pervasive in regions of low recombination, making these windows more likely to retain patterns of population structuring that matched the species tree. We further found that approximately a third of the genome showed evidence of selective sweeps and linked selection, skewing genome-wide estimates of effective population sizes and gene flow between populations towards lower values. In sum, we showed that the effects of intrinsic genomic characteristics and selection can be disentangled from neutral processes to elucidate how inferring spatial patterns of diversification are sensitive to genomic architecture.
README: Genomic architecture drives population structuring in Amazonian birds
https://doi.org/10.5061/dryad.76hdr7t2s
Description of the data sets available in this directory:
Concatenated_matrix
Whole-genome concatenated SNPs alignment in nexus format for Phlegopsis nigromaculata (P_nigromaculata.nexus.gz), Xiphorhynchus spixii (X_spixii.nexus.gz), and Lipaugus vociferans (L_vociferans.nexus.gz). “To estimate phylogenetic relationships between individuals, we estimated supermatrix trees concatenating all SNPs using IQTree2 (Minh et al. 2020). We converted vcf files to phylip format using vcf2phylip.py (Ortiz 2019), randomly resolving heterozygous genotypes, and keeping SNPs present in at least 80% of the individuals.”
diploSHIC
Observed data, simulations, population assignment, and genomic mask used in the diploS/HIC (Kern and Schrider 2018) analyses for Phlegopsis nigromaculata, Xiphorhynchus spixii, and Lipaugus vociferans.
Simulations
Simulations_10kb
Genetic simulations for the three alternative topologies. We simulated data under three alternative topologies, matching the unrooted trees tested in our phylogenetic approach: topology 1) (out,(Belem,(Xingu,Tapajos))); topology 2) (out,(Tapajos,(Xingu,Belem))); and topology 3) (out,(Xingu,(Tapajos,Belem))). We simulated 5,000 loci of 10kb, using uniform and wide priors for all parameters (Table S19), and performed one million simulations per model.
Simulations_100kb
Genetic simulations for the two alternative topologies. We simulated data under two alternative topologies, matching the unrooted trees tested in our phylogenetic approach: topology 1) (out,(Belem,(Xingu,Tapajos))); topology 2) (out,(Tapajos,(Xingu,Belem))).We performed 100,000 simulations per model, and used the same uniform priors for all parameters as implemented above.
P_nigromaculata
Observed data (fasta files) for Phlegopsis nigromaculata for the 10kb and 100kb approaches.
X_spixii
Observed data (fasta files) for Xiphorhynchus spixii for the 10kb and 100kb approaches.
L_vociferans
Observed data (fasta files) for Lipaugus vociferans for the 10kb and 100kb approaches.
Species_tree_Astral_inputs
Gene trees for chromosomes and summary statistics partitions for the three studies species used in our species tree approach. “To estimate the posterior probability of unrooted species trees, we used Astral-III v5.1.1 (Zhang et al. 2018; Rabiee et al. 2019), using the gene trees produced with phyml as inputs.”
Twisst
Inputs and outputs from topology weighting analyses for the three studies species. To estimate unrooted topology weight for each window across the genome, we used Twisst (Martin and Van Belleghem 2017). “.vcf.gz” are phased VCF files, and “.trees.gz” are input gene trees.
Code/Software
https://github.com/GregoryThom/Genomic-architecture-Amazonian-birds
Methods
Studied species, sampling design and whole-genome sequencing
We selected three species that occur in southeastern Amazonia and occur in distinct forest strata of upland forest habitats: 1) Phlegopsis nigromaculata, an obligatory army-ant follower restricted to the understory, with three distinct subspecies that are genetically structured and isolated by the Xingu and Tocantins rivers (Aleixo et al. 2009); 2) Xiphorhynchus spixii, which occupies the midstory of eastern Amazonian forests, and has two structured populations divided by the Xingu River (Aleixo 2004); and 3) Lipaugus vociferans, a widespread canopy species that is expected to be less structured across rivers.
To optimize the spatial representation of our samples, we selected a single individual per locality, targeting approximately 10 individuals per interfluve per species (Tapajos, Xingu, and Belem), yielding a total of 33, 33, and 30 samples for P. nigromaculata, L. vociferans, and X. spixii, respectively (Table S1; Figure 1). We isolated genomic DNA from muscle tissue preserved in alcohol (n = 65) and skin from the toe pads of museum specimens (n = 31). The proportion of tissue and toe pad samples varied between species (Table S1). All samples were loaned from the Museu Paraense Emilio Goeldi (MPEG). From tissues, we extracted DNA with a Qiagen high molecular weight DNA kit (MagAttract HMW DNA Kit - Qiagen). For the toe pads, we performed a protocol specific for degraded DNA consisting of additional steps for washing the samples with H2O and EtOH prior to extracting and extra time for digestion. We modified the DNeasy extraction protocol (DNeasy Blood & Tissue Kits - Qiagen) by replacing the standard spin columns with the QIAquick PCR filter columns (QIAquick PCR Purification Kit - Qiagen), selecting for smaller fragments of DNA, typically found in degraded samples. Toe pad extractions were conducted in a dedicated lab for working with historical samples at the American Museum of Natural History (AMNH) to reduce contamination risk. We quantified DNA extracts using a Qubit 2.0 Fluorometer (Thermo Fisher Scientific). Illumina libraries with variable insert sizes were generated, and samples were sequenced by Rapid Genomics (Gainesville, Florida) to ~10x coverage using 3.5 lanes of paired-end (2x150 bp) Illumina S4 NovaSeq 6000. Given the short insert size obtained for historical samples due to DNA degradation, we used two full lanes for sequencing 31 toe pad samples (Table S1). The larger sequencing output for historical samples allowed us to obtain similar coverage between distinct tissue types after bioinformatic filtering. Raw reads were trimmed and filtered using trimmomatic v0.36 (Bolger et al. 2014).
Genomic references, gene annotation and outgroups
We obtained reference genomes from closely related species. For P. nigromaculata, we used as a reference the genome of Rhegmatorhina melanosticta (Coelho et al. 2019) with TMRCA = 9.60Ma (Harvey et al. 2020). For X. spixii, we used the genome of X. elegans (GCA_013401175.1 ASM1340117v1; NCBI genome ID: 92877; scaffold N50: 172,271bp; number of scaffolds: 89,575; Feng et al. 2020) with TMRCA = 2.36Ma (Harvey et al. 2020), and for L. vociferans we used the genome of Cephalopterus ornatus (GCA_013396775.1 ASM1339677v1; NCBI genome ID: 92752; scaffold N50: 292,884bp; number of scaffolds: 26,262; Feng et al. 2020) with TMRCA =15.10Ma (Harvey et al. 2020). Given that passerine bird chromosomes are known to have high synteny and evolutionary stasis between distantly related species (Ellegren 2010; Coelho et al. 2019; Peñalba et al. 2020), we produced a pseudo-chromosome reference genome for X. elegans and C. ornatus by ordering and orienting their scaffolds to the chromosomes of the Zebra Finch (Taeniopygia guttata; version taeGut3.2.4) with chromosemble in satsuma v3.1.0 (Grabherr et al. 2010). For R. melanosticta, we used the chromosome assignment conducted in a previous study (Coelho et al. 2019). To check the completeness of our pseudo-chromosome references, we used Busco v2.0.1 (Waterhouse et al. 2018) to search for a set of single-copy avian ortholog loci. To transfer the original genome annotations from the scaffold assemblies to the pseudo-chromosome reference genomes, we mapped the genomic coordinates of each annotated feature using gmap (Wu & Watanabe 2005). For R. melanosticta we used the annotation performed by (Mikkelsen & Weir 2020) and for X. elegans and L. vociferans, we used distinct genome annotations performed by (Feng et al. 2020). A total of 98.90% (15,195), 97.46% (14,834 genes), and 98.92% (15,599 genes) of all annotated genes in R. melanosticta, X. elegans, and C. ornatus were successfully mapped to the pseudo-chromosome reference, respectively.
We downloaded raw reads from additional closely related species that were used as outgroups in phylogenetic analyses. For P. nigromaculata, we included R. melanosticta, Sakesphorus luctuosus (GCA_013396695.1 ASM1339669v1; NCBI genome ID: 92896; Feng et al. 2020) and X. elegans as outgroups. For X. spixii, we included X. elegans, S. luctuosus, Campylorhamphus procurvoides (GCA_013396655.1 ASM1339665v1; NCBI genome ID: 92894; Feng et al. 2020), and Furnarius figulus (GCA_013397465.1 ASM1339746v1; NCBI genome ID: 92763; Feng et al. 2020). For L. vociferans, we included C. ornatus, Pachyramphus minor (GCA_013397135.1 ASM1339713v1; NCBI genome ID: 92755; Feng et al. 2020), and Tyrannus savana (GCA_013399735.1 ASM1339973v1; NCBI genome ID: 92814; Feng et al. 2020).
Read alignment, variant calling and filtering
Trimmed and filtered reads were aligned to the references in BWA v0.7.17 (Li & Durbin 2009) using default parameters. We used Picard v.2.0.1 (Broad Institute, Cambridge, MA; http://broadinstitute.github.io/picard/) to 1) sort sam files with SortSam; 2) reassign reads to groups with AddOrReplaceReadGroups; 3) identify duplicated reads with Markduplicates; 4) calculate summary statistics with CollectAlignmentSummaryMetrics, CollectInsertSizeMetrics, and CollectRawWgsMetrics; and 5) create indexes with BuildBamIndex. All Picard functions were run with default parameters. We used the standard GATK v3.8 (McKenna et al. 2010) pipeline to 1) call SNPs and Indels for each individual separately with HaplotypeCaller; 2) perform genotyping with GenotypeGVCFs, assuming a value of 0.05 for the --heterozygosity flag; 3) flag and filter variants with VariantFiltration. Given the lack of a high confidence SNP panel, we implemented hard filtering options recommended by the Broad Institute's Best Practices (https://gatk.broadinstitute.org/). We filtered SNPs with quality by depth below 2 (QD < 2.0), SNPs where reads containing the alternative allele were considerably shorter than reads with the reference allele (ReadPosRankSum < -8), SNPs with root mean square of the mapping quality lower than 40 (MQ < 40.0), SNPs with evidence of strand bias (FS > 60.0 and SOR > 3.0), and SNPs where the read with the alternative allele had a lower mapping quality than the reference allele (MQRankSumTest < − 12.5). Lastly, we filtered raw VCF files by keeping only bi-allelic sites, with no more than 50% of missing information, with a minimum read depth of 4 and maximum of 30, and read quality score > Q20 using VCFTOOLS v0.1.15 (Danecek et al. 2011). We phased the genotypes in our genomic VCF files using BEAGLE v5.1 (Browning & Browning 2007) in sliding windows of 10kb and overlap between windows of 1kb.
Recombination, window-based summary statistics, and genetic structure
To estimate recombination rate (r = recombination rate per base pair per generation) from population level data for each of the species complexes, we used ReLERNN (Adrion et al. 2020). This approach approximates the genomic landscape of recombination by leveraging recurrent neural networks using the raw genotype matrix as a feature vector, avoiding the need to convert the data into summary statistics. We used ReLERNN according to Adrion et al. (2020) by performing 100,000 coalescent simulations based on population genetic properties of the empirical data. Simulations are then used to train, test, and validate a recurrent neural network model designed to predict the per base recombination rate across sliding windows of the genome. The fully trained recurrent neural network was then used to predict the recombination landscape across the genomes of the targeted populations. Given that genetic structure could potentially influence ReLERNN results (Mezmouk et al. 2011; Mangin et al. 2012), we restricted our analyses to the Tapajos population, which was composed exclusively of tissue samples, and we did not find any sign of spatial genetic substructure in the area for all three species (see below). Although we did not calculate r for all populations, the landscape of recombination across bird lineages is considered conserved, and variation between recently diverged populations should be minimal (Singhal et al. 2015). To account for the historical demography of the populations, we provided to ReLERNN the output of our SMC++ analyses (see below) with the --demographicHistory option. To account for uncertainty in the prediction of the recombination rate, we performed five replicates with the ReLERNN_BSCORRECT function. We assessed the accuracy of our estimates by evaluating the R2 and mean absolute error of general linear models between predicted and simulated rates for 1000 test examples. We considered a mutation rate of 2.42 x 10-9 mutations per generation and a one-year generation time (Jarvis et al. 2014; Zhang et al. 2014).
We calculated population genomics summary statistics for sliding windows using scripts available at https://github.com/simonhmartin/genomics_general (Martin et al. 2015). We initially converted VCF files per species into geno format, using parseVCF.py. FST, DXY, and nucleotide diversity (π) were calculated for the different populations in each of the three interfluves using popgenWindows.py. We estimated the D statistics in sliding windows using the ABBABABAwindows.py. We used the species tree topology with the highest probability from our species tree analyses (see Phylogenomic analyses, and topology weighting), treating the Tapajos, Xingu, and Belem populations as the terminals. For all summary statistics, we used phased VCF files, setting the window size to 10kb (-w option) without overlap between windows and the minimum number of sites without missing information per window to 500 (-m option). To obtain GC content proportion across 100kb windows for our reference genomes, we used sequir v4.2 (Charif & Lobry 2007) in R. We fit general linear regressions and Pearson’s correlation index between population genetics summary statistics, phylogenetic weights, and genomic architecture in R. To account for the potential non-linearity of these relationships, we also fit a LOESS model using the R package caret (Kuhn 2008). Models were trained using leave-one-out cross-validation from 80% of the total data.
Chromosome rearrangements, fission, and fusion might affect the overall recombination rate of a genomic window and also the chromosome assignment of the reference genome’s scaffolds in our pseudochromosome approach. To investigate if differences in chromosomal structure among species could be driving biased associations between genomic architecture and demographic and phylogenetic parameters, we aligned the pseudochromosomes of Rhegmatorhina melanosticta (Coelho et al. 2019) to the chromosomes of Chiroxiphia lanceolata (bChiLan1.pri; SAMN12620979) and reran correlations using only syntenic genomic regions located on homologous chromosomes. The genome of C. lanceolata is one of the best genomes available for a suboscine passerine, which is the suborder of our targeted species, and conserved regions between these taxa will likely be stable in all target species. We performed whole-genome alignment using ProgressiveCactus with default parameters (Armstrong et al. 2020). Synteny analysis was conducted using halSynteny (Krasheninnikova et al. 2020), using --maxAnchorDistance 100000 and --minBlockSize 100000.
To explore the genome-wide pattern of genetic structure, we performed Principal Component Analysis (PCA) and individual relatedness analyses based on identity-by-descent using SNPRelate v1.20.1 (Zheng et al. 2012) in R. In order to minimize the effect of missing genotypes in the PCA, we filtered our VCF files to keep SNPs present in at least 70% of the individuals. We also used SNPRelate to perform an identity-by-state (IBS) analysis among individuals for each species. To avoid the influence of SNP clusters in our PCA and IBS analyses, we pruned SNPs in approximate linkage equilibrium (LD>0.2) with each other.
Specific regions of the genome might be differently affected by selection and gene flow, exhibiting different levels of genetic diversity and differentiation between populations (Langley et al. 2012; Ellegren et al. 2012; Li et al. 2019). To explore the genomic variation in genetic structure, we used Lostruct (Li & Ralph 2019). This approach 1) summarizes the relatedness between individuals across genomic windows using PCA, 2) calculates the pairwise dissimilarity in relatedness among windows, 3) uses multidimensional scaling (MDS) to produce a visualization of how variable patterns of relatedness are across the genome, and 4) allows the user to combine regions by similarity to inspect contrasting patterns of genetic structure across the genome. We ran Lostruct for windows with 1000 SNPs, allowing for 30% of missing genotypes. To visualize the results, we selected the 10% of the windows closer to the three further points on the two first MDS coordinates and performed individual PCA analysis on clustered windows.
Historical demography, selective sweeps, and linked selection
Although an association between recombination rate and genetic diversity would support the effect of linked selection, it does not indicate which portions of the genome are directly impacted by this process. To further explore the extent of linked selection across the genome, we used a machine learning approach implemented on diploS/HIC (Kern and Schrider 2018). We modeled population sizes through time using unphased genomes in SMC++ v1.15.3 (Terhorst et al. 2017). Our goal with this approach was to track past fluctuations in Ne to be included in ReLERNN (Adrion et al. 2020) and DiploS/HIC (Kern & Schrider 2018) models to account for historical demography. We ran SMC++ exclusively for the Tapajos population of each species, assuming a mutation rate of 2.42 x 10-9 mutations per generation and one year generation time (Jarvis et al. 2014; Zhang et al. 2014). We explored the historical demography of populations within a time window between the present and 300,000ya.
On diploS/HIC (Kern & Schrider 2018) we used coalescent simulations of genomic windows to train and test a convolutional neural network (CNN) designed to predict hard and soft selective sweeps and genetic variation linked to selective sweeps across sliding windows of the genome. Genomic windows were simulated using discoal (Kern & Schrider 2016) according to five distinct models: 1) hard selective sweep; 2) soft selective sweep; 3) neutral variation linked to soft selective sweep; 4) neutral variation linked to hard selective sweep; and 5) neutral genetic variation. We performed 5,000 simulations per model using 220kb genomic windows divided into 11 sub-windows. To account for the neutral demography of the populations, which is essential to obtain robust model classification between windows (Harris et al. 2018), we added demographic parameters obtained with SMC++ into discoal simulations. To account for uncertainty in simulated parameters, we followed the approach of (Manthey et al. 2021) by allowing current Ne to vary between ⅓ to 3x the value obtained with SMC++ within a uniform distribution. Priors for the population-scaled recombination rate (rho=4Ner; where r is the recombination rate estimated with ReLERNN) were set based on the minimum and maximum values obtained across windows with ReLERNN. We set a uniform prior for selection coefficients ranging from 0.00025 to 0.025, and we conditioned sweep completion between the present and 10,000 generations ago. We used a uniform prior between 0.01 and 0.2 for the initial frequency of adaptive variants in soft-sweep models. Simulations were converted into feature vectors consisting of population genetics summary statistics, taking into account the observed amount of missing data by using a genomic mask. We calculated the probability of alternative models for the observed windows of 20kb. We ran CNNs for 1000 epochs, stopping the run if validation accuracy did not improve for 50 consecutive epochs. We ran five independent runs and predicted the observed data with the run that provided the highest accuracy on testing data. To assess the classification power of the CNNs, we inspected the overall accuracy, the false positive rate (FPR), recall (the number of correct positive predictions made out of all positive predictions that could have been made), and area under the curve (AUC). To acknowledge the uncertainty in model selection, we only assigned a model with a probability higher than 0.7 to a genomic window. Selective sweeps, both soft and hard, are expected to produce long tracks of homozygosity around the site under selection. To further validate the results obtained with diploS/HIC, we estimated H statistics using H-scan (Schlamp et al. 2016). This approach allowed us to test if regions classified as selective sweeps by diploS/HIC had longer homozygosity tracts than neutral regions. We measured haplotype homozygosity tract lengths based on physical distance (in number of base pairs, -d 0) and a maximum gap length between neighboring SNPs of 100kb (-g 100000).
Phylogenomic analyses, and topology weighting
We explored how evolutionary relationships were distributed across the genomes of the co-occurring species to test which aspects of the genomic architecture best predicted phylogenetic signals. First, we estimated topologies for each species in IQTREE-2 v2.1.5 (Nguyen et al. 2015; Minh et al. 2020) by concatenating SNPs and controlling for ascertainment bias (Figure 1). We converted VCF files to PHYLIP format using vcf2phylip.py (Ortiz 2019), randomly resolving heterozygous genotypes, and keeping SNPs present in at least 80% of the individuals. In IQTree2, we ran a total of 1000 rapid bootstrap replicates and controlled for ascertainment bias assuming a GTR+ASC substitution model. To obtain phylogenetic trees based on sliding windows of phased VCF files, we used PHYML v3.0 (Guindon et al. 2010) following (Martin & Van Belleghem 2017). We tested windows with different amounts of information content, selecting regions with 50, 100, 500, and 1000 SNPs. We conducted 100 bootstrap replicates per window. To calculate unrooted topology weight for each window across the genome, we used Twisst (Martin & Van Belleghem 2017). This approach allowed us to quantify the relationships among taxa, providing an assessment of the most likely topology for a given genomic region. Given that windows with different information content yielded similar results for the topology weights across the genome, we only present the results for 100 SNPs windows (with an average window size of 14,503 bp, 15,637 bp, and 5,821 bp for P. nigromaculata, X. spixii, and L. vociferans, respectively) in subsequent analyses.
To estimate the posterior probability of unrooted species trees, we used Astral-III v5.1.1 (Zhang et al. 2018; Rabiee et al. 2019), using the gene trees produced with PHYML as inputs. We used Astral to score unrooted trees (-q option) by calculating their quartet score, branch lengths, and branch support. We set our main topology (outgroup,Belem(Xingu,Tapajos) and used the -t 2 option to calculate the same metrics for the first and second alternative topologies. Given we only have four terminals per lineage (3 populations + outgroup), there are three possible unrooted trees. Therefore, this approach allowed us to calculate the posterior probability of all possible topologies. We conducted this approach for the whole set of gene trees and also for subsets of the data, based on specific characteristics of each window. To assess how support for a specific topology varies based on thresholds for specific summary statistics, we selected windows across the genome with the upper and lower 10% tile for recombination rate, FST, π, DXY and D statistics.
Model-based approach to account for recombination and selection.
In order to explicitly account for gene flow while testing for alternative topologies and estimating demographic parameters of genomic windows, we used a combination of coalescent simulations and supervised machine learning. We simulated data under three alternative topologies, matching the unrooted trees tested in our phylogenetic approach: topology 1) (out,(Belem,(Xingu,Tapajos))); topology 2) (out,(Tapajos,(Xingu,Belem))); and topology 3) (out,(Xingu,(Tapajos,Belem))). We allowed for constant gene flow after the divergence between Xingu and Belem, and Xingu and Tapajos populations. We did not allow gene flow between Belem and Tapajos due to the geographic disjunction between these populations. We simulated 5,000 loci of 10kb, using uniform and wide priors for all parameters (Table S19), and performed one million simulations per model. We assumed a fixed mutation rate of 2.42 x 10-9 mutations per generation and a one year generation time (Jarvis et al. 2014; Zhang et al. 2014). Genetic data for each model was simulated in PipeMaster (Gehara et al. 2017), which allows for a user-friendly implementation of msABC (Pavlidis et al. 2010). We summarized the genetic variation of observed and simulated data in a feature vector composed of population genetic summary statistics, including mean and variance across loci: number of segregating sites per population and summed across populations; nucleotide diversity per population and for all populations combined; Watterson’s theta (Watterson 1975) per population and for all populations combined; pairwise FST between populations; number of shared alleles between pairs of populations; number of private alleles per population and between pairs of populations; and number of fixed alleles per population and between pairs of populations. To align loci across individuals, phased VCF files per population were split every 10kb windows and converted into a fasta format including monomorphic sites using bcftools (Li 2011). Fasta alignments were converted into feature vectors with PipeMaster which uses PopGenome (Pfeifer et al. 2014) in R. To obtain a genome-wide estimate of demographic parameters, we selected one 10kb genomic window every 100kb to reduce the effect of linkage between windows. This procedure yielded a total of 7,213, 9,140, and 9,693 windows for P. nigromaculata, X. spixii, and L. vociferans, respectively. We randomly subsampled 5,000 windows from these data sets. We explored how simulated models fit the observed data using PCAs by plotting the first four PCs of simulated statistics vs. observed. We also generated goodness-of-fit plots using the gfit function of abc v2.1 (Csilléry et al. 2012) in R.
To classify observed datasets into our three models, we used a neural network (nnet) implemented in Keras v2.3 (Arnold 2017; https://github.com/rstudio/keras) in R. After an initial exploration for the best architecture for our nnet, we conducted our final analyses using three hidden layers with 32 internal nodes and a “relu” activation function. The output layer was composed of three nodes and a “softmax” activation function. A quarter (25%) of the simulations were used as testing data. We ran the training step for 1,000 epochs using “adam” optimizer and a batch size of 20,000. A small percentage (5%) of the training data set was used for validation. We used the overall accuracy and the sparse_categorical_crossentropy to track improvements in model classification. For the most probable model considering genome-wide windows per species, we estimated demographic parameters with a nnet with similar architecture but designed to predict continuous variables. For this step, we used an output layer with a single node and a “relu” activation. In the training step, we used the mean absolute percentage error (MAE) as an optimizer, training the nnet for 3,000 epochs with a batch size of 10,000 and a validation split of 0.1. We ran this procedure 10 times for each demographic parameter and summarized the results by calculating the mean across replicates. To additionally assess the accuracy of parameter estimation, we calculated the coefficient of correlation between estimated and true simulated values of the testing data set. To explore how genome-wide parameters differed between regions with distinct signatures of selection and those under neutrality, we created subsets of 10kb windows that were assigned with high probability (> 0.70) to one of the five distinct models implemented in diploS/HIC. For each species, we estimated parameters based on the best model (topology) considering genome-wide windows. We selected up to 1,000 windows for each of the five selection classes and performed the same approach as described above.
To explore the associations between demographic parameters and genomic architecture, we calculated the probability of alternative topologies and estimated demographic parameters for 100kb genomic windows, taking intralocus recombination into account. We used a similar approach as described above but simulated a 100kb window size and used a modified version of PipeMaster (Gehara et al. 2017) that allowed us to simulate intra-locus recombination. By selecting a larger window size, we increased the information content and resolution of summary statistics for single genomic windows. We performed 100,000 simulations per model and used the same uniform priors for all parameters as implemented above. For intralocus recombination, we set a uniform prior ranging from 0 to the maximum value obtained with ReLERNN per species (P. nigromaculata = 3.021 x 10-9; X. spixii = 2.475 x 10-9; L. vociferans = 2.171 x 10-9).
Lastly, to explore how recombination rate and gene flow impact topology weight, we performed coalescent simulations based on demographic parameters obtained for P. nigromaculata, and calculated topology weights using Twisst (Martin & Van Belleghem 2017). We simulated 1,000 windows of 10kb for four models, varying the presence of intra-locus recombination and gene flow between Xingu and Belem, assuming topology 1 (three ingroups plus one outgroup). Simulated parameters are available on Table S20. Simulations were performed with PipeMaster, and we converted the ms output to PHYLIP format with PopGenome. We ran trees for each 10kb window with IQTREE-2 using default parameters and ran Twisst on this estimated set of trees.