Dispersal, isolation and local adaptation promote speciation in South American savannas as indicated by a phylogenomic analysis of a passerine
Data files
May 14, 2025 version files (201.93 MB total):
- Phylogenomics_Stilpnia_cayana_Dryad_repo.zip (201.92 MB)
- README.md (12.92 KB)
Abstract
South American savannas are a disjunct biome with an unclear evolutionary history. We tested hypotheses about their Quaternary history and the evolution of savanna cores through fragmentation or dispersal from the Cerrado. We used genomic data (genotyping-by-sequencing) and ecological niche models of the Burnished-buff Tanager (Stilpnia cayana Linnaeus 1766) to evaluate intraspecific differentiation, gene flow, past range shifts, and landscape-genomics associations. We found clear genomic differences between populations on each side of the Amazon basin and high admixture on Marajó Island and in Bolivia. Landscape genomics analysis indicated that the Amazon River, isolation by distance, and temperature predict genomic differentiation in this bird. Taken together, the results suggest that a combination of dispersal from the Cerrado, isolation due to geographic distance and the Amazon River basin, and local adaptation shaped the species' diversification. We propose that the populations on each side of the Amazon River be considered distinct species (S. cayana in the north and S. flava in the south), with subspecies huberi representing part of a hybrid zone between them, located on Marajó Island at the mouth of the river.
Overview
This dataset includes all the data used in the phylogenomic study of Stilpnia cayana described in the article referenced above. The repository is organized into folders corresponding to each major analysis conducted in the study.
Within each folder, you will find a data/ subdirectory containing the input files used in that specific analysis.
Below, we describe the contents of each folder and the associated data files.
Folder: 00_IPYRAD/
Purpose:
De novo assembly of loci and SNP calling using ipyrad ver. 0.9.84 (Eaton & Overcast, 2020).
Subfolders:
- Ingroup_83/: Contains results from an assembly using 83 ingroup samples of Stilpnia cayana.
- Ingroup-outgroup_84/: Contains results from an assembly using 83 ingroup samples and 1 outgroup sample (Tangara chilensis).
Contents:
Each subfolder includes:
- params-*.txt: ipyrad configuration file used for the run.
- .vcf files: SNP datasets in VCF format.
- .phy files: Full-length GBS loci alignments in PHYLIP format.
- stats.txt: Summary statistics and clustering logs from ipyrad outputs.
These datasets were used for variant filtering, population genomic, and phylogenetic analyses described in the main article.
Folder: 00_Sampling_sites/
Purpose:
Contains metadata about the geographic origin and population assignment of the samples used in the study, along with spatial data representing the distribution range of Stilpnia cayana subspecies.
Contents:
- sampling_metadata.csv: A table with the following information for each sample:
  - Sample ID
  - Specimen ID (ID used in the final publication)
  - Latitude and longitude (in decimal degrees)
  - Assigned population
- Scay_subsp_range.gpkg: A GeoPackage file containing vector polygons representing the distribution ranges of the main Stilpnia cayana subspecies.
These files were used to define sampling structure, visualize geographic context, and support spatial analyses throughout the study.
Folder: 01_Fst/
Purpose:
Contains input and output files used to calculate pairwise FST values among populations using the R package hierfstat.
Contents:
- Scay_83_maf_mr80_sinMarajo.recode.vcf: A filtered and reduced VCF file containing SNP data from the 83-sample ingroup dataset, excluding samples from Marajó Island. SNPs were filtered by applying a minor allele frequency threshold of 0.05 and allowing a maximum of 20% missing data per site.
These data were used to assess population differentiation across the species’ range.
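The two site filters mentioned above (minor allele frequency threshold of 0.05, at most 20% missing data per site) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `site_passes`, the genotype representation, and the inclusive treatment of the MAF threshold are all assumptions made for illustration.

```python
def site_passes(genotypes, min_maf=0.05, max_missing=0.20):
    """Return True if one SNP site passes both filters.

    genotypes: one entry per sample, either a tuple of two allele codes
    (0 = reference, 1 = alternate) or None when the call is missing.
    This representation is hypothetical, chosen to keep the sketch small.
    """
    called = [g for g in genotypes if g is not None]
    n_missing = len(genotypes) - len(called)
    # Reject sites with no calls at all or with too many missing calls.
    if not called or n_missing / len(genotypes) > max_missing:
        return False
    # Alternate-allele frequency among the called diploid genotypes.
    alt_freq = sum(a + b for a, b in called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    # Assumption: the threshold is inclusive (maf >= min_maf).
    return maf >= min_maf
```

Counting missing calls per site before computing allele frequencies keeps the two criteria independent, so a site can fail for either reason.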
Folder: 01_PCA_and_PCoA/
Purpose:
Contains the SNP dataset used to perform principal component analysis (PCA) and principal coordinates analysis (PCoA) for visualizing patterns of genetic structure among individuals.
Contents:
- Stilpnia_cayana_80_bial_maf_mr80_noIndel.recode.vcf: A filtered VCF file including 80 ingroup samples. SNPs were filtered using a minor allele frequency threshold of 0.05 and a maximum of 20% missing data per site. Three samples (UFG4028, MPEG70537, USNM622381) showed high levels of missing data (i.e., fewer than 3,500 loci) and were removed prior to PCA and PCoA analyses.
This dataset served as the input for multivariate analyses of genetic variation across individuals and populations of Stilpnia cayana.
Folder: 02_Admixture/
Purpose:
Contains the SNP dataset used to perform individual ancestry estimation using model-based clustering approaches (e.g., ADMIXTURE).
Contents:
- Stilpnia_cayana_80_bial_maf_mr80_noIndel_un.vcf: A filtered VCF file derived from the 83-sample ingroup dataset, with the same filtering criteria as used in PCA/PCoA analyses (minor allele frequency > 0.05 and maximum 20% missing data per site). Additionally, to reduce linkage among SNPs, only one SNP per RAD locus was retained. The three samples with high levels of missing data (UFG4028, MPEG70537, USNM622381) were excluded from this dataset.
This dataset was used to infer population structure and individual ancestry proportions across the range of Stilpnia cayana.
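Retaining one SNP per locus, as described for this dataset, can be sketched as follows. The sketch assumes each SNP record carries a locus identifier and a position, and it keeps the first SNP encountered at each locus; the README does not say which SNP was kept, so that choice and the name `thin_one_snp_per_locus` are illustrative assumptions.

```python
def thin_one_snp_per_locus(records):
    """Keep only the first SNP seen for each locus.

    records: iterable of (locus_id, position) pairs in file order.
    Assumption: "first encountered" is the selection rule; a random
    choice per locus would be an equally valid alternative.
    """
    seen_loci = set()
    kept = []
    for locus_id, position in records:
        if locus_id not in seen_loci:
            seen_loci.add(locus_id)
            kept.append((locus_id, position))
    return kept
```

For example, `thin_one_snp_per_locus([("locus_1", 12), ("locus_1", 40), ("locus_2", 7)])` keeps one record for each of the two hypothetical loci.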
Folder: 03_RAxML/
Purpose:
Contains the alignment and associated metadata used to perform phylogenetic inference using RAxML.
Contents:
- Stilpnia_cayana_Ingroup-outgroup_84.phy: PHYLIP-formatted alignment of full-length GBS loci sequences for 84 samples (83 ingroup samples plus one outgroup: Tangara chilensis). This dataset was used to infer the phylogenetic relationships among populations of Stilpnia cayana.
- Features_to_plot.csv: A metadata file including sample IDs, population assignments, subspecies designations, and plumage phenotypes. This information was used to map biological traits and geographic origin onto the phylogenetic tree.
Folder: 04_EEMS/
Purpose:
Contains input files used to run the Estimated Effective Migration Surfaces (EEMS) analysis to explore patterns of gene flow across the landscape.
Contents:
- Scay_83_SNPRelateDisIBS.txt: Pairwise genetic dissimilarity matrix generated using the SNPRelate package in R, based on the VCF file containing 83 ingroup samples and including only biallelic SNPs.
- Scay_83.coords: Text file with geographic coordinates (longitude and latitude) for each sample included in the EEMS analysis.
- Scay_83.outer: File containing coordinates that define the outer polygon surrounding the study area. This polygon delineates the spatial boundaries within which migration surfaces were estimated.
Folder: 05_GPhoCS/
Purpose:
Contains input files used for demographic inference under a coalescent framework using GPhoCS.
Contents:
- Stilpnia_cayana_Ingroup_83.gphocs: SNP dataset in GPhoCS format, generated from the ingroup assembly of 83 samples using ipyrad.
- Stilpnia_cayana_Ingroup_83.ctl: Control file specifying the evolutionary model used in the GPhoCS analysis, including population assignments, migration bands, and prior distributions for demographic parameters.
Folder: 06_fastsimcoal2/
Purpose:
Contains input data and configuration files used to run demographic inference with fastsimcoal2.
Contents:
- Scay83_full_6pop_5samp.recode.vcf: VCF file including both variant and invariant sites, subsampled to include five samples per population (except Marajó, which includes only three). This full VCF was generated from the complete loci file exported by ipyrad and converted using a Python script provided by Isaac Overcast (see: ipyrad issue #479).
- easySFS_6pops_5samp.txt: File specifying population assignments for each sample, used as input in easySFS to generate the multidimensional Site Frequency Spectrum (MSFS).
- 6pop_Model.tpl and 6pop_Model.est: Template and estimation files defining the demographic model and prior distributions used in fastsimcoal2.
- 6pop_Model_MSFS.obs: Observed multidimensional Site Frequency Spectrum used as input in fastsimcoal2 simulations.
Folder: 07_ENM/
Purpose:
Contains data used to generate ecological niche models (ENMs) for Stilpnia cayana.
Contents:
- occ_stilpnia.txt: Full set of occurrence records used to model the ecological niche of the species.
- M_Stilpnia.gpkg: Polygon vector file (in GeoPackage format) delimiting the accessible area (M) used for model calibration.
- .tif files: Raster files representing habitat suitability predictions obtained for each of the studied periods: Present, Mid-Holocene (MH), Last Glacial Maximum (LGM), and Last Interglacial (LIG).
Folder: 08_LRDM_CA/
Purpose:
Contains the matrices used in the LRDM (Landscape Resistance Distance Matrix) and causal modeling analyses.
Contents:
- Scay_83_SNPRelateDisIBS.txt: Genetic dissimilarity matrix between individuals, the same used in the EEMS analysis. This matrix was generated using the SNPRelate package in R from a VCF file including 83 samples and only biallelic SNPs.
- Matrix_distgeo.txt: Euclidean geographic distance matrix between individuals, calculated using the geosphere R package.
- Matrix_ResistPresent_CS.txt: Landscape resistance matrix between individuals, generated using the present-day suitability raster (from ENM analyses) as a conductance surface in CircuitScape ver. 4.0 (Anantharaman et al., 2020).
- Matrix_ResistHistoric_CS.txt: Landscape resistance matrix between individuals, generated using historical suitability rasters (from ENM analyses) as a conductance surface in CircuitScape ver. 4.0.
- Matrix_River.txt: Binary matrix indicating whether individuals are located on the same (0) or opposite (1) riverbanks. Individuals from Marajó are considered to be on a separate riverbank from both NAS and SAS.
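The coding rule described for Matrix_River.txt can be illustrated with a short sketch. It assumes each individual has a single bank label and that Marajó is treated as its own label, so that it differs from both NAS and SAS; the label strings, the function name `river_matrix`, and the 0 assigned to pairs of Marajó individuals are assumptions, not details taken from the file.

```python
def river_matrix(bank_labels):
    """Build a symmetric 0/1 matrix from per-individual bank labels.

    0 = both individuals share a label, 1 = labels differ.
    Giving Marajó its own label makes it "opposite" to NAS and SAS.
    """
    return [
        [0 if label_a == label_b else 1 for label_b in bank_labels]
        for label_a in bank_labels
    ]


# Hypothetical labels for four individuals.
example = river_matrix(["NAS", "SAS", "Marajo", "NAS"])
```

The diagonal is always 0, and the matrix is symmetric by construction, which matches the layout expected of a pairwise distance-like matrix.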
Folder: 09_gradientForest/
Purpose:
Contains the genetic and environmental input data used to perform gradient forest analyses.
Contents:
- Stilpnia_79_unlink_bial_mr0.recode.vcf: VCF file containing unlinked, biallelic SNPs without missing data. Sample MZUSP83507 was removed due to high missing data (> 50%). This dataset was used to generate the site-by-SNP matrix.
- Stilpnia_79_unlink_bial_mr0.gradforest.snp.forR: Site-by-SNP matrix coded as 0, 1, or 2 (genotype dosage) generated using the SNPRelate R package. This matrix was used as the genetic input for the gradient forest model.
- Stilpnia_preds.csv: CSV file with environmental predictor variables used in the model. Each row corresponds to a sampling locality, and each column represents a predictor. The variables included are:
Bioclimatic variables (from WorldClim 2):
- bio1 – Annual Mean Temperature (°C × 10)
- bio2 – Mean Diurnal Range (Mean of monthly (max temp - min temp)) (°C × 10)
- bio3 – Isothermality (bio2 / bio7) × 100 (unitless)
- bio4 – Temperature Seasonality (standard deviation × 100) (unitless)
- bio13 – Precipitation of Wettest Month (mm)
- bio14 – Precipitation of Driest Month (mm)
- bio15 – Precipitation Seasonality (coefficient of variation) (unitless)
Other environmental variables:
- MinGreen – Minimum Green Vegetation Fraction (unitless, range 0–1)
- TreeCover – Percent Tree Cover (%)
- Elevation – Elevation above sea level (m)
Spatial variables:
- PCNM1, PCNM2, ..., PCNMn – Principal Coordinates of Neighbor Matrices (unitless spatial eigenvectors capturing spatial structure at multiple scales)
These variables were selected to represent key climatic, ecological, and spatial gradients relevant to local adaptation and genetic differentiation across the species’ range.
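The 0/1/2 genotype dosage coding used for the site-by-SNP matrix can be sketched as below. The authors generated their matrix with SNPRelate in R; this Python sketch only illustrates the idea for biallelic VCF genotype strings. Counting copies of the alternate allele is an assumption made here, and a given tool may count the other allele instead.

```python
def gt_to_dosage(gt):
    """Convert a biallelic VCF GT string to an allele count (0, 1 or 2).

    Returns None for a missing call such as "./.".
    Assumption: the dosage counts copies of the alternate allele (code 1).
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(int(allele) for allele in alleles)


# Hypothetical genotypes for one SNP across four samples.
dosages = [gt_to_dosage(gt) for gt in ["0/0", "0/1", "1|1", "./."]]
```

Because the VCF described above contains no missing data, the `None` branch would not be triggered for that file; it is included only so the sketch handles general input.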
Folder: 10_mitochondrial_DNA/
Purpose:
Contains the mitochondrial DNA alignment used in the phylogenetic and comparative analyses.
Contents:
Alignment_ND2_final4.fas: FASTA file containing the DNA alignment of the mitochondrial ND2 gene for the 79 samples for which sequence data was obtained.
Folder: 11_phylogenetic_network/
Purpose:
Contains the input files and workflow used to generate the phylogenetic network of Stilpnia cayana.
Contents:
- Stilpnia_cayana_83_sinN.phy: Alignment used to construct the phylogenetic network. "N" characters were converted to gaps ("-") to avoid being counted in the distance matrix, while still preserving ambiguity introduced by heterozygous genotypes.
- Stilpnia_cayana_83_sinN.wflow6: Workflow file exported from SplitsTree software, specifying the network construction steps.
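The conversion of "N" characters to gaps described above can be sketched as follows. The sketch works on sequences that have already been separated from their sample names, so that an "N" inside a name such as a museum code is never altered; the function names and the choice to leave lowercase "n" untouched are assumptions for illustration.

```python
def n_to_gap(sequence):
    """Replace fully ambiguous 'N' with '-' and leave other IUPAC codes
    (for example R or Y from heterozygous genotypes) unchanged."""
    return sequence.replace("N", "-")


def convert_alignment(records):
    """Apply n_to_gap to every sequence in a {sample_name: sequence} dict.

    Sample names are dictionary keys and are returned unmodified.
    """
    return {name: n_to_gap(seq) for name, seq in records.items()}
```

For instance, `n_to_gap("ACNNRYT")` yields `"AC--RYT"`, keeping the heterozygous codes R and Y in place.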
Folder: 12_introgression_tests/
Purpose:
Contains the data used to perform ABBA-BABA (D-statistics) tests for introgression between selected populations or individuals of Stilpnia cayana.
Subdirectories and contents:
- 01_Marajo_introgression/: Tests involving the Marajó population.
- 02_WestCerrado_introgression/: Tests assessing introgression in the western Cerrado populations.
- 03_UFG4051_introgression/: Tests evaluating potential introgression involving the individual sample UFG4051.
Each subdirectory contains:
- pops_*.txt: Text file with sample IDs and their assigned populations used in the respective test.
- tree.nwk: Newick file specifying the tree topology relating the four groups used in the ABBA-BABA test.
- *.vcf: Reduced VCF file with unlinked SNPs used in the analysis.
- *_tree.txt: Output file containing ABBA-BABA (D-statistic) estimates.
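For readers unfamiliar with the D-statistic reported in the output files, the standard definition from ABBA and BABA site-pattern counts can be sketched as below. This is a generic illustration rather than the software used in the study; the function name `d_statistic` and the use of whole-number counts are assumptions.

```python
def d_statistic(n_abba, n_baba):
    """Patterson's D from site-pattern counts.

    D = (ABBA - BABA) / (ABBA + BABA), for a four-taxon tree
    (((P1, P2), P3), Outgroup). Values near 0 are expected without
    introgression; the sign indicates whether P3 shares excess derived
    alleles with P2 (positive) or with P1 (negative).
    """
    total = n_abba + n_baba
    if total == 0:
        raise ValueError("D is undefined when there are no ABBA or BABA sites")
    return (n_abba - n_baba) / total
```

With hypothetical counts of 30 ABBA and 10 BABA sites, `d_statistic(30, 10)` gives 0.5. Assessing significance requires a resampling scheme such as a block jackknife, which is outside the scope of this sketch.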
Code
The code used in these analyses is available in the GitHub repository.
