Implications of headwater contact zones for the riverine barrier hypothesis: a case study of the Blue-capped Manakin (Lepidothrix coronata)
Data files
Oct 18, 2023 version files 1.53 GB
Abstract
Rivers frequently delimit the geographic ranges of species in the Amazon Basin. These rivers also define the boundaries between genetic clusters within many species, yet river boundaries have been documented to break down in headwater regions where rivers are narrower. To explore the evolutionary implications of headwater contact zones in Amazonia, we examined genetic variation in the Blue-capped Manakin (Lepidothrix coronata), a species previously shown to contain several genetically and phenotypically distinct populations across the western Amazon Basin. We collected restriction site-associated DNA sequence data (RADcap) for 706 individuals and found that spatial patterns of genetic structure indicate rivers, particularly the Amazon and Ucayali, are major dispersal barriers for L. coronata along a distance of more than 3000 km. We also found evidence that genetic connectivity is elevated across several headwater regions, highlighting the importance of headwater gene flow for models of Amazonian diversification. The headwaters of the Ucayali River provide a notable exception to findings of headwater gene flow by harboring non-admixed populations of L. coronata on opposite sides of a <1 km-wide river channel with a known dynamic history, potentially suggesting that additional prezygotic barriers are limiting gene flow in this region.
README: Implications of headwater contact zones for the riverine barrier hypothesis: a case study of the Blue-capped Manakin (Lepidothrix coronata)
https://doi.org/10.5061/dryad.5tb2rbp9s
Description of the data and file organization
This Dryad submission contains VCF datasets, custom scripts, and input/output for sNMF, Clumpak, EEMS, Fst, and momi analyses.
1) "1_VCF_Datasets" zipped folder
The "1_VCF_Datasets" folder contains the seven Variant Call Format (VCF) files used in the study. The beginning of each filename (Dataset 1–7) corresponds to the datasets outlined in Table 1 of the manuscript.
2) "2_sNMF" zipped folder
Within the "2_sNMF" folder are two .txt files (which we discuss further below) and 16 subfolders containing the 16 sNMF (sparse non-negative matrix factorization) projects in the study. There is one sNMF project for each of Datasets 1–4 across four different alpha regularization parameter values. The sNMF projects are stored as .zip files, which can be conveniently imported as an object with the R package 'LEA' using the function "import.snmfProject("name_of_project.zip")".
Within each zipped sNMF project there is a .snmfProject file that serves as a map for the sNMF project and a .snmf folder. The .snmf folder contains 11 subfolders: a subfolder for each K value ("K1" through "K10") and a subfolder ("masked") that contains a single .geno file with the masked genotypes used in calculating the cross-entropy criterion. The subfolder for each K value ("K1" through "K10") contains 100 subfolders (one for each replicate sNMF run), each of which contains three files: a .G file with the ancestral genotype frequency matrix, a .Q file with the ancestry coefficient matrix, and a .snmfClass file that stores information associated with the specific replicate run. We recommend accessing these data files through the R package 'LEA' as mentioned above.
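For users who want to inspect a single replicate outside of R, the following is a minimal Python sketch for reading one .Q file. It assumes the .Q file is plain text with one row per individual and K whitespace-separated ancestry coefficients per row; the function names are illustrative and are not part of this submission or of 'LEA'.

```python
from pathlib import Path


def read_q_matrix(path):
    """Read a whitespace-delimited .Q file into a list of rows of floats.

    Assumption: one row per individual, one column per ancestral population.
    """
    rows = []
    for line in Path(path).read_text().splitlines():
        if line.strip():  # skip blank lines
            rows.append([float(value) for value in line.split()])
    return rows


def check_q_matrix(rows, k, tol=1e-3):
    """Return True if every row has k coefficients that sum to roughly 1."""
    return all(len(row) == k and abs(sum(row) - 1.0) <= tol for row in rows)
```

Calling `read_q_matrix` on the path to a .Q file from one of the "K6" replicate subfolders and then `check_q_matrix(rows, k=6)` gives a quick sanity check that the matrix has the expected shape before any further processing.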
Also within the "2_sNMF" folder are two .txt files ("sNMF_K6_387ind_83Loc_95maxm_alpha100_NOT_AVERAGED.txt" and "sNMF_K6_387ind_83Loc_95maxm_alpha100_AVERAGED_Fig1.txt") with ancestry coefficients for the optimal K=6 sNMF run, which served as the basis for the locality-averaged ancestry coefficients shown in Figure 1.
Header columns for "sNMF_K6_387ind_83Loc_95maxm_alpha100_NOT_AVERAGED.txt":
- sNMF_original_order: the original order of individuals as found in sNMF output files and the associated VCF file (Dataset 4)
- sample_name: the name of each sample as found in the associated VCF file (Dataset 4)
- tube_number: the unique lab identifier number written on the DNA extract tube for each individual
- locality: locality number that corresponds to localities in Figure 1 and Table S1
- pop1–pop6 (collectively): sNMF ancestry coefficients for individuals in each of the six populations
Header columns for "sNMF_K6_387ind_83Loc_95maxm_alpha100_AVERAGED_Fig1.txt":
- locality: locality number that corresponds to localities in Figure 1 and Table S1
- lat: latitude for locality
- long: longitude for locality
- pop1–pop6 (collectively): sNMF ancestry coefficients averaged across individuals at a given locality in each of the six populations (values used for Fig. 1)
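The relationship between the two files can be sketched in Python as a per-locality mean of the individual ancestry coefficients. This is an illustration only, not the script used to produce the averaged file: it assumes the non-averaged file is tab-delimited and that the six ancestry columns are literally named pop1 through pop6.

```python
import csv
from collections import defaultdict

# Assumed column names for the six ancestry coefficients.
POP_COLUMNS = [f"pop{i}" for i in range(1, 7)]


def average_by_locality(handle, delimiter="\t"):
    """Average pop1..pop6 across individuals sharing a locality.

    `handle` is an open text file (or any iterable of lines) with a header row.
    Returns a dict mapping locality -> list of six mean coefficients.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    sums = defaultdict(lambda: [0.0] * len(POP_COLUMNS))
    counts = defaultdict(int)
    for row in reader:
        locality = row["locality"]
        counts[locality] += 1
        for j, column in enumerate(POP_COLUMNS):
            sums[locality][j] += float(row[column])
    return {
        locality: [total / counts[locality] for total in totals]
        for locality, totals in sums.items()
    }
```

To apply it, open "sNMF_K6_387ind_83Loc_95maxm_alpha100_NOT_AVERAGED.txt" in text mode and pass the file handle to `average_by_locality`; adjust `delimiter` if the file turns out to use a different separator.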
3) "3_CLUMPAK" zipped folder
Within the "3_CLUMPAK" folder there are four subfolders. Each of these subfolders contains a CLUMPAK input .zip file with 100 replicate sNMF runs for each value of K from K=2 through K=10 populations and a job pipeline summary .pdf file. Additionally, the "3_CLUMPAK" folder contains two .txt files ("CLUMPAK_706ind_map.txt" and "CLUMPAK_387ind_map.txt") that show individual samples and their sequence order in the sNMF runs and CLUMPAK plots, allowing anyone who reviews Figures S1–S4 or the job pipeline summaries to find the exact individual associated with any column in the CLUMPAK plots.
4) "4_EEMS" zipped folder
Within the "4_EEMS" folder there are four subfolders. Three of these subfolders ("EEMS_data_chain1", "EEMS_data_chain2", "EEMS_data_chain3") contain the data output (31 .txt files each) from the three MCMC chains of our EEMS analysis. Here we list the 31 .txt files that are found in each of these three subfolders and provide brief descriptions of the files when possible (most of these files have no developer documentation, are not typically accessed by users, and are only needed for background code operations to reproduce the EEMS plots in Figure 3):
- demes.txt: decimal coordinates for each deme
- edges.txt: integers for plotting grid edges
- eemsrun.txt: input parameter values, acceptance proportions, log prior, and log likelihood
- ipmap.txt
- lastdfpars.txt
- lastmeffct.txt
- lastmhyper.txt
- lastmseeds.txt
- lastmtiles.txt
- lastpilogl.txt
- lastqeffct.txt
- lastqhyper.txt
- lastqseeds.txt
- lastqtiles.txt
- lastthetas.txt
- mcmcmhyper.txt
- mcmcmrates.txt
- mcmcmtiles.txt
- mcmcpilogl.txt
- mcmcqhyper.txt
- mcmcqrates.txt
- mcmcqtiles.txt
- mcmcthetas.txt
- mcmcwcoord.txt
- mcmcxcoord.txt
- mcmcycoord.txt
- mcmczcoord.txt
- outer.txt: decimal coordinates (listed counterclockwise) that form a closed polygon defining the area over which to estimate the effective migration surface
- rdistJtDhatJ.txt
- rdistJtDobsJ.txt
- rdistoDemes.txt
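Because outer.txt must describe a closed polygon with vertices listed counterclockwise, a quick check can be useful before rerunning or replotting. The Python sketch below uses the shoelace formula to test both properties. It assumes each vertex is given as an (x, y) pair with x as longitude and y as latitude; if the columns are in the opposite order, the sign of the area flips. The function names are illustrative.

```python
def signed_area(points):
    """Shoelace formula: positive when vertices run counterclockwise."""
    total = 0.0
    # Pair each vertex with the next one, wrapping around to the first.
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_closed_counterclockwise(points):
    """True if the first and last vertices match and the order is counterclockwise."""
    return (
        len(points) >= 4
        and points[0] == points[-1]
        and signed_area(points) > 0
    )
```

For example, the unit square `[(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]` passes the check, while the same vertices in reverse order do not.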
The fourth subfolder, "EEMS_plots", contains the R script ("EEMS_plot.R") used to produce the EEMS plots using the three previously mentioned subfolders (with the MCMC chain data) as input. The "EEMS_plots" subfolder also contains the eight .pdf files produced as output from the R script "EEMS_plot.R"; two of these pdf files ("-shapefile-projected-mrates01" and "-shapefile-projected-mrates02") formed the basis of Figure 3. Here we list the eight .pdf files produced by "EEMS_plot.R" and provide brief descriptions of each file:
- -shapefile-projected-mrates01.pdf: spatial plot of estimated effective migration rates
- -shapefile-projected-mrates02.pdf: spatial plot of areas where estimated effective migration rates are likely (posterior probability > 0.9) to differ from zero
- -shapefile-projected-pilogl01.pdf: trace plot of posterior probabilities for the three chains analyzed
- -shapefile-projected-qrates01.pdf: spatial plot of estimated effective diversity rates
- -shapefile-projected-qrates02.pdf: spatial plot of areas where estimated effective diversity rates are likely (posterior probability > 0.9) to differ from zero
- -shapefile-projected-rdist01.pdf: scatter plot of observed versus fitted genetic dissimilarities between pairs of sampled demes
- -shapefile-projected-rdist02.pdf: scatter plot of observed versus fitted genetic dissimilarities within sampled demes
- -shapefile-projected-rdist03.pdf: scatter plot of observed genetic dissimilarity between demes versus geographic distances between demes
5) "5_Fst" zipped folder
Within the "5_Fst" folder there are three subfolders. The "Pop_files" subfolder contains the 15 .txt files that designate the individuals in each of the 15 populations used in our Fst analyses. The "Fst_output" subfolder includes eight .txt files with the per-site Fst values for each of the eight comparisons of opposite-bank populations. Note that Population 44 is involved in two comparisons. The "Log_files" subfolder contains eight .txt files that provide the log information from VCFtools for each Fst comparison. Additionally, the "5_Fst" folder contains a markdown file "Fst_calcs.md" that details the commands used to run the Fst analyses in VCFtools.
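As a rough illustration of how the per-site output might be summarized, the Python sketch below computes an unweighted mean of per-site Fst values while skipping undefined sites. It assumes the files keep the usual VCFtools per-site layout (tab-delimited, with a header row containing a WEIR_AND_COCKERHAM_FST column, and undefined sites written as "nan" or "-nan"); this should be checked against the actual files. The function name is illustrative.

```python
import csv
import math


def mean_per_site_fst(handle, column="WEIR_AND_COCKERHAM_FST"):
    """Return (unweighted mean Fst, number of sites used) from per-site output.

    Sites whose Fst is undefined (nan) are skipped.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    values = []
    for row in reader:
        value = float(row[column])  # float() parses "nan" and "-nan"
        if not math.isnan(value):
            values.append(value)
    if not values:
        return float("nan"), 0
    return sum(values) / len(values), len(values)
```

Note that this simple average of per-site values is not the same quantity as the weighted Fst estimate that VCFtools reports in its log, so the two numbers are not expected to match exactly.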
6) "6_Momi" zipped folder
Within the "6_Momi" folder there are seven subfolders. The "Jupyter_notebooks_for_testing" subfolder contains three .ipynb files that we used to visualize momi demographic models and to troubleshoot analyses. The "momi_analyses_with_AIC_calculations" subfolder contains three .xlsx files that summarize results and show AIC calculations for the momi analyses based on three different initial values of ancestral effective population size (Ne = 100,000; Ne = 500,000; and Ne = 2,000,000). The "momi_runs_Ne1e5", "momi_runs_Ne5e5", and "momi_runs_Ne2e6" subfolders each contain the .py files used to run each momi analysis (12 per subfolder) and the resulting raw output .txt files (12 per subfolder). The "Population_files" subfolder contains three .txt files that designate the specific individuals in each population considered in the momi analyses. The "SFS_files" subfolder contains the three site-frequency spectrum files used in the momi analyses.
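For readers who want to cross-check the spreadsheets, the standard AIC calculation can be sketched in Python as follows. This uses the textbook definition (AIC = 2k - 2 ln L, with Akaike weights derived from AIC differences) and is not necessarily identical to the formulas in the .xlsx files; the function names are illustrative, and the log-likelihood and parameter count for each model must be taken from the momi output.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def akaike_weights(aic_values):
    """Relative support for each model, given a list of AIC values."""
    best = min(aic_values)
    # exp(-delta/2) for each model, where delta is the AIC difference from the best model
    relative = [math.exp(-0.5 * (value - best)) for value in aic_values]
    total = sum(relative)
    return [r / total for r in relative]
```

The model with the lowest AIC receives the largest weight, and the weights across all compared models sum to 1.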
Sharing/access information
Raw reads for data used in this study can be found in the NCBI Sequence Read Archive under BioProjects PRJNA782327 and PRJNA787238:
- https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA787238
- https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA782327