Scripts from: Distinct intraspecific diversification dynamics in Neotropical montane versus lowland canopy birds (Thraupidae: Tangara) revealed by whole-genome comparative phylogeography
Data files
Dec 06, 2025 version files 120.90 KB
-
README.md
8.21 KB
-
Tangara_phylogeography_code.zip
112.69 KB
Abstract
Comparing phylogeographic patterns across different biogeographic regions can illuminate how different types of landscapes promote the formation and persistence of incipient species, providing insights into the evolutionary mechanisms underlying broadscale biodiversity gradients. The Neotropics are a global biodiversity hotspot, and the megadiverse Andes-Amazonia system exhibits elevational gradients in both species richness and speciation rates. Using whole genomes from a congeneric set of eight habitat-generalist canopy frugivore birds (Tangara tanagers), we compare the phylogeography of tropical Andean versus lowland Amazonian species to characterize intraspecific diversification dynamics associated with these distinct but adjacent biomes. We found that all species exhibited low genetic structure consistent with their dispersive ecology, but Andean species have relatively greater population genetic structuring across their ranges than Amazonian species. Additionally, populations separated by prominent geoclimatic barriers showed lower gene flow and higher differentiation in montane versus lowland Tangara. Lastly, all Andean species have lower genetic diversity, a proxy of effective population size. Together, these results support greater incipient speciation in the mountains owing to stronger allopatry and smaller populations, while the larger, more diverse, and well-connected populations of the lowlands may foster enhanced persistence. We discuss potential implications for the origin and maintenance of regional biodiversity gradients.
Dataset DOI: 10.5061/dryad.pnvx0k724
Description of the data and file structure:
This dataset contains all code used in the manuscript "Distinct intraspecific diversification dynamics in Neotropical montane versus lowland canopy birds (Thraupidae: Tangara) revealed by whole-genome comparative phylogeography." All shotgun-sequencing data upon which this code relies have been uploaded to GenBank. The included code can be used to clean and map the whole-genome sequencing data and conduct all analyses from the manuscript. Briefly, this includes assembling and analyzing the mitochondrial genome; analyzing SNPs in a genotype-likelihood framework; conducting demographic modeling based on the SFS; and evaluating genetic diversity at the individual and species levels.
Files for Download:
File: Tangara_phylogeography_code.zip
Description: This .zip file is comprehensive of all data and analyses in the paper, containing 5 subfolders:
- 1_Sequence_cleaning_aligning: 14 scripts from the initial acquisition of the raw sequencing data through the mapping and polishing of aligned reads.
- 2_Nuclear_SNPS: 9 scripts used to call autosomal SNPs in a genotype-likelihood framework, generating pairwise covariance matrices that form the basis of many subsequent analyses including PCAs, IBD, and Mantel tests (see 5_RCode).
- 3_Nuclear_All_Sites: 10 scripts to generate SFSs at the individual, population, and whole-species level and then conduct demographic modeling, FST, and genetic diversity analyses. Also includes a subfolder 3a for running FST analyses on alternate geographic groupings/delimitations of populations.
- 4_Mitogenomes: 4 scripts for assembling the mitochondrial genome from WGS libraries, and subsequent annotation and concatenation of the protein-coding genes.
- 5_RCode: 7 scripts for analyzing, interpreting, and plotting the results of the above HPC-based bioinformatic and evolutionary modeling scripts.
Code/software within .zip:
FOLDER 1: 1_Sequence_cleaning_aligning:
- 1.01_Data_transfer_check: Acquire raw sequencing data from the UMich Advanced Genomics Core, ensure it is not corrupted.
- 1.02_Prep_adapter_trim: Verify adapter sequences, generate a few fastqc files to assess sequencing quality.
- 1.03_Prep_refgenome_folders: Prepare the file structure for the project; index the reference genome, create a dictionary for the reference for downstream analyses with GATK.
- 1.04_Adapter_trim_array.sh: Trim adapters, trimns and trimqualities from fq files with AdapterRemoval.** **
- 1.05_PolyG_removal.sh: Trim polyG tails from the ends of reads with fastp and the --cut_right option
- 1.06 and 1.06a_Parse_fastp_stats: Parse the .html files produced by fastp to assess sample quality
- 1.07_BWA_align.sh: Align cleaned reads to the reference with bwa mem. Clip overlapping reads, mark and remove duplicates. Index bam files
- 1.08 and 1.08a_Parse_bwa_flagstat: Parse the samtools flagstat output of the bam files
- 1.09_Filter_samples: Remove 5 samples from further analysis that had poor mapping quality and low coverage.
- 1.10_GATK_indelalign_intervals.sh: Identifies regions in the bam files that require indel realignment
- 1.11_GATK_indelalign.sh: Realign around indels relative to the reference genome.
- 1.12 - 1.14: Create lists of final bam files and calculate final alignment depths.
FOLDER 2: 2_Nuclear_SNPs:
- 2.01_ANGSD_SNP_calling.sh: Call SNPs for each species using a genotype-likelihood framework with ANGSP
- 2.02-2.04: Identify putative paralogues or mismapped SNPs (those with exceptionally high coverage or heterozygosity) using ngsParalog; concatenate the list and remove them.
- 2.05-2.06: Create regions of the genome to break beagle files into smaller sizes (500kb); break beagle files for analysis of SNPs by chromosome.
- 2.07-2.08: Create a PCA for each chromosome (in the reference) with PCAngsd in order to scan for potential inversions that could confound population structure.
- 2.09_PCA_snps.sh: Use PCAngsd to generate a pairwise genetic covariance matrix for each species, and run admix analyses for each species.
FOLDER 3: 3_Nuclear_All_Sites:
- 3.01_Subsample_Genome_10pct (and other subfiles 3.01a-c): create random unlinked subsamples of the genome (2kb in length, at least 10kb apart) for downstream analyses requiring all sites (variant and invariant)
- 3.02_SFS_theta.sh: Create the species-level folded 1D SFS with angsd and winsfs. Calculate genetic diversity statistics (theta) with realSFS
- 3.03_Individual_SFS_array.sh: Create individual-level 1D SFS with angsd and winsfs.
- 3.04-3.05_Parse_IndHets: Calculate individual-level heterozygosity with the 1D SFS
- 3.06_Prep_FST: Create lists of population pairs and individual bam files for calculating FST between geographically defined populations
- 3.07_Pop_1DSAF: Create 1D SAF files for each geographic population with angsd
- 3.08_Pairwise_FST: Create the 2D SFS for each population pair delimited by a prominent geoclimatic barrier. Calculate per-site FST with realSFS, and then calculate FST across all sites.
- 3.09_Prep_GADMA_2DSFS: Convert winsfs 2DSFS to the file structure required by GADMA.
- 3.10_GADMA_2pop.sh: Run demographic modeling with moments in the GADMA software for all pairs of geographic populations.
- Subfolder 3a_Alternate_Populations: Contains scripts analogous to 3.06-3.08 to conduct FST population differentiation analyses on alternative geographic delimitations of the Andean and Amazonian populations.
FOLDER 4: 4_Mitogenomes:
- 4.01_Novoplasty.sh: Create config files and assemble individual mitogenomes from WGS with NOVOplasty
- 4.02_Parse_Novoplasty: Parse .out files from NOVOplasty runs to gather statistics on mitogenome contiguity and coverage
- 4.03_Mitochondrial_gene_alignments: Describes the use of mitochondrial contigs to verify species identity, and the annotation, extraction, and cleaning of the mitochondrial gene set.
- 4.04_Mt_genes_decipher.R: R script for analyzing the 13 protein-coding genes for each species and identifying any potential contaminant/hybrids within the dataset, using the package DECIPHER.
FOLDER 5: 5_RCode
- Mantel_tests.R: R script for performing Mantel tests and partial Mantel tests on the genetic covariance matrix, using vegan and bioclim data from geodata
- Plot_FST.R: Compare FST between populations in the Andes and Amazonia, for all of the possible combinations of geographic population definitions across geoclimatic barriers. Creates main manuscript Figure 4, Supplementary Figure S16, and Supplementary Table S6.
- Plot_GADMA.R: Parse demographic model parameter estimates from GADMA output, and compare rates of migration across barriers in the mountains and lowlands. Creates main manuscript Figure 5, Supplementary Figure S17, and data for Supplementary Table S5.
- Plot_GenDiv.R: Compare all three genetic diversity estimates -- individual-level heterozygosity, Watterson's theta, and theta-pi -- for species in the mountains and the lowlands. Creates main manuscript Figure 6, Supplementary Figure S18, and Supplementary Table S7.
- Plot_mtDNA_trees.R: Visualize dated mitochondrial gene trees from BEAST with ggtree. From these, calculate discrete clusters with GMYC and assess posterior distribution of crown ages. Creates Supplementary Figures S4-S6.
- Plot_PCAngsd.R: Conduct analyses and generate figures based on PCAngsd genetic covariance matrix and admix results. Creates main manuscript Figures 2-3, and Supplementary Figures S7-S13.
- Plot_System_figs.R: Create the system map in main manuscript Figure 1. Visualize phylogenetic tree of the genus Tangara (Supplementary Figure S1), as well as species-level sampling maps (Supplementary Figures S2-S3).
