De novo assembly of SNPs in VCF format for 112 individualss of Campylorhynchus in western Ecuador
Data files
Jun 20, 2024 version files 16.45 MB
Abstract
Climate variability has a significant impact on the evolution of biodiversity, which results in genetic and phenotypic diversity within species. A balance between gene flow and selection maintains changes in the frequency of genetic and phenotypic variants that occur along an environmental gradient. Here, we investigate a hybrid zone in western Ecuador involving C. zonatus, C. fasciatus, and admixed populations. We hypothesized that different ecological preferences and geographical distances result in limited dispersal between populations along the precipitation gradient in western Ecuador.
In the context of testing IBE and IBD shaping distributions of C. zonatus, C. fasciatus, and potential hybrids, we asked (1) Is there evidence of genetic admixture and introgression between these taxa in Western Ecuador? And (2) What is the relative contribution of IBE and IBD on patterns of genetic differentiation and admixture patterns? We analyzed 4409 SNPs from the blood of 112 individuals sequenced using ddRadSeq. The most likely clusters ranged from K=2-4, corresponding to categories defined by geographic origins, known phylogenetics, and physical or ecological constraints. Evidence for IBE was weak but stronger for IBD. We observed gradual changes in genetic admixture between C. f. pallescens and C. zonatus along the environmental gradient. Genetic differentiation of the two populations of C. f. pallescens could be driven by a previously undescribed potential physical barrier near the center of western Ecuador. Lowland habitats in this region may be limited due to the proximity of the Andes to the coastline, limiting dispersal and gene flow, particularly among dry-habitat specialists.
README: De novo assembly of SNPs in VCF format for 112 individuals of Campylorhynchus in Western Ecuador.
We have submitted the resulting populations.snps.vcf (Variant Call Format) file from the de novo assembly performed using the STACK software. This file is the primary dataset utilized for the various analyses described in our article, "Unraveling the Genomic Landscape of Campylorhynchus Wrens along Western Ecuador's Precipitation Gradient: Insights into Hybridization, Isolation by Distance, and Isolation by Environment." The article investigates the population structure and hybridization patterns between Campylorhynchus zonatus and C. fasciatus across western Ecuador. Additionally, we have included the file DeNovoAssemblyStack.bash, which contains the Unix code used to perform the de novo assembly with STACK, this code uses as input the raw FASTA files deposited at NCBI Sequence Read Archive under Bioproject accession number PRJNA925654.
Description of the data and file structure
Dataset Overview
This Variant Call Format (VCF) file contains Single Nucleotide Polymorphism (SNP) data from 112 individuals across 16 sampling locations. It comprises 4409 SNPs with an average coverage of 35.4x per individual (min=12.8x, max=77.9x) and an average genotype call rate of 93.5%.
populations.snps.vcf
File Details
- File Name: populations.snps.txt
- File Format: VCF (Variant Call Format) Version 4.2
- Date: September 6, 2020
- Source Software: Stacks v2.41
Data Description
The VCF file includes the following columns standard to the format:
#CHROM
: Chromosome numberPOS
: Position of the SNP on the chromosomeID
: Identifier of the SNPREF
: Reference baseALT
: Alternate base(s)QUAL
: Quality score of the SNPFILTER
: Filter statusINFO
: Additional information (e.g., allele frequency, number of samples)FORMAT
: Data format- Sample columns: One per individual, containing genotype information
Specific Fields in INFO
NS
: Number of samples with dataAF
: Allele frequencyDP
: Total depth of reads
Specific Fields in FORMAT
GT
: GenotypeAD
: Allele depthDP
: Read depthGQ
: Genotype qualityGL
: Genotype likelihood
How to Open and Analyze the Data
The VCF file can be opened and analyzed using various bioinformatics tools and software, some of the most common include:
- IGV (Integrative Genomics Viewer): Useful for visualizing genomic data.
- BCFtools: Command-line tool for working with VCF/BCF files, including filtering, viewing, and converting.
- GATK (Genome Analysis Toolkit): Provides tools to analyze high-throughput sequencing data with a focus on variant discovery.
- vcftools: Command-line program designed to work specifically with VCF files to perform various types of analyses.
Sharing/Access information
All sequencing data from this study has been deposited at NCBI Sequence Read Archive under Bioproject accession number PRJNA925654.
All scripts for the analyses derived below can be accessed at https://github.com/ldmontalvo/Landscape-Genomic-Wrens
DeNovoAssemblyStack.bash
Content Description
- SLURM Directives: The script begins with SLURM directives for job scheduling on a high-performance computing (HPC) cluster:
--ntasks=1
: Requests one task.--cpus-per-task=8
: Allocates eight CPUs per task.--mem-per-cpu=1gb
: Allocates 1 GB of memory per CPU.--time=05:00:00
: Sets the maximum job run time as five hours.--job-name=check_test
: Names the job "check_test".--mail-type=END,FAIL
: Configures SLURM to send an email when the job ends or fails.--mail-user=ldmontalvo@ufl.edu
: Specifies the email address to receive job notifications.--output=check_test_%j.log
: Directs the output to a log file named with the job ID.
- Environment Setup: The script prints the current directory, hostname, and date to the output file, providing a log of where and when the job was executed.
- Module Loading: It loads the STACKS software module, preparing the environment for executing STACKS-related commands.
- Directory and File Preparation:
- Creates a new directory
./denovo_n5nb
to store the output of the de novo assembly. - Runs
denovo_map.pl
, a script from STACKS, with various parameters like minimum stack depth (-m 3
), mismatches between loci within individuals (-M 5
), mismatches between loci in the catalog (-n 5
), using 15 threads (-T 15
), and specifies input and output directories.
- Creates a new directory
- Population Analysis:
- Executes the
populations
program from STACKS with parameters to specify the input directory, population map, minimum percentage of individuals in a population required to process a locus (-r 0.8
), minimum number of populations a locus must be present in to process it (-p 1
), and output format flags for STRUCTURE, VCF, and Genepop files.
- Executes the
- Background Execution:
- Runs the script
Wren_denovo_n5nb.sh
in the background usingnohup
to prevent it from stopping if the session disconnects and appends the output tonohup.out
.
- Runs the script
How to Run the Script
To run this script on a Unix-like system (assuming you have access to a cluster using SLURM workload manager):
- Transfer the Script: Upload the
DeNovoAssemblyStack.bash
script to your cluster's user directory. - Permission: Change the permission of the file to make it executable:
bash
Copy code
chmod +x DeNovoAssemblyStack.txt
3. Submission: Submit the script to the SLURM scheduler:
Copy code
sbatch DeNovoAssemblyStack.bash
Methods
Data Collection
We collected blood samples from the brachial vein of 112 individuals of Campylorhynchus along Western Ecuador. Most samples were collected between July and December 2018. We also obtained 27 tissue samples of C. fasciatus from Peru at the Florida Museum of Natural History (FLMNH). DNA for all samples was extracted using Qiagen DNeasy blood & tissue kits (Qiagen, Maryland USA). We followed the ddRAD-seq protocol by Peterson et al. (2012) and modified by Thrasher et al. (2018). We trimmed all sequences to 147bp using fastX_trimmer (FASTX-Toolkit) to exclude low-quality calls near the 3 of the reads. We removed reads containing at least a single base with a Phred quality score of less than 10 (using fastq_quality_filter). We removed sequences if more than 5% of the bases had a Phred quality score of less than 20. Using the process_radtags module from the STACKS version 2.3 (Catchen et al., 2013), we demultiplexed the reads to obtain files with specific sequences for each individual.
We assembled the sequences de novo using the STACKS pipeline (Catchen et al., 2013). First, we selected 12 samples with the highest number of reads and ran denovo_map.pl testing values from one to nine for -M (number of mismatches allowed between stacks within individuals) and n (number of mismatches permitted between stacks between individuals) following the n=M rule (Paris et al., 2017) while keeping m=3 (stack depth/number of identical reads required to initiate a new allele). We kept r=0.8 (the minimum percentage of individuals in a population required to process a locus for that population). It has been shown that at least 80% of the population should present a specific locus to be included, known as the 80% rule or r80 loci (Paris et al., 2017). We set all samples to the same population (p=1) for the parameter testing assembly. We plotted the number of SNPs called against the M parameters to find the optimum M, after which no additional SNP calling was observed. We found an optimum value for n=m=5 for the final de novo assemblies. After the parameters testing assembly, we performed de novo assembly with all the samples and the parameters described above and set p=1. When a RAD locus had more than one SNP, the data were restricted to the first (--write_single_snp) to avoid including SNPs in high linkage disequilibrium (LD). We required a minor allele frequency of at least 0.05 to process a nucleotide site (--min_maf).
Analyses Performed using the Data Set
In our study, we analyzed the population structure and admixture patterns of Campylorhynchus wrens using a combination of spatial principal component analysis (sPCA) and Bayesian clustering. The sPCA, conducted through the Adegenet package in R (Jombart, 2008), helped visualize genetic structures by building a distance-based connection network based on the maximum dispersal distances observed in Cactus Wrens. This analysis differentiated global structures (clines and broad genetic structures) from local structures (sharp genetic divergences between neighbors), and its significance was tested via a Monte Carlo procedure with extensive permutations. Complementarily, STRUCTURE software version 2.3.4; (Pritchard et al., 2000) was used to estimate individual membership coefficients across potentially mixed populations, incorporating prior geographical information and allowing for admixture. The optimal number of genetic clusters was determined through an analysis in STRUCTURE HARVESTER (Earl & Vonholdt, 2012) and further refined by discriminant analysis of principal components (DAPC), enhancing our understanding of the genetic grouping and population structure.
Further, we delved into the genetic diversity from the R package dartR (Gruber et al., 2018) within these populations by employing DAPC to ascertain the most probable number of genetic clusters, which was corroborated by STRUCTURE when considering K=4. This analysis not only allowed a detailed classification within the population but also aligned the genetic clusters with previously recognized biogeographical regions, indicating potential barriers within Western Ecuador. By leveraging allele frequencies and inbreeding coefficients, we explored the genetic health and diversity of these clusters, ensuring robustness in our estimates by adjusting for potential biases from related individuals. This comprehensive genetic profiling was bolstered by an analysis of molecular variance (AMOVA) to quantify genetic differentiation, providing a nuanced understanding of the genetic landscape that influences these bird populations across varied ecological zones.
References
Earl, D. A., & Vonholdt, B. M. (2012). STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the EVANNO method. Conservation Genetics Resources, 4(2), 359–361. https://doi.org/10.1007/s12686-011-9548-7
Catchen, J., Hohenlohe, P. A., Bassham, S., Amores, A., & Cresko, W. A. (2013). Stacks: An analysis tool set for population genomics. Molecular Ecology, 22(11), 3124–3140. https://doi.org/10.1111/mec.12354
Gruber, B., Unmack, P. J., Berry, O. F., & Georges, A. (2018). dartR: An r package to facilitate analysis of SNP data generated from reduced representation genome sequencing. Molecular Ecology Resources, 18(3), 691–699. https://doi.org/10.1111/1755-0998.12745
Jombart, T. (2008). adegenet: A R package for the multivariate analysis of genetic markers. Bioinformatics, 24(11), 1403–1405. https://doi.org/10.1093/bioinformatics/btn129
Peterson, B. K., Weber, J. N., Kay, E. H., Fisher, H. S., & Hoekstra, H. E. (2012). Double Digest RADseq: An Inexpensive method for De Novo SNP discovery and genotyping in model and non-model species. PLoS ONE, 7(5), e37135.
Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945–959.
Methods
We collected blood samples from the brachial vein of 112 individuals of Campylorhynchus along Western Ecuador. Most samples were collected between July and December 2018. We also obtained 27 tissue samples of C. fasciatus from Peru at the Florida Museum of Natural History (FLMNH). DNA for all samples was extracted using Qiagen DNeasy blood & tissue kits (Qiagen, Maryland USA). We followed the ddRAD-seq protocol by Peterson et al. (2012) and modified by Thrasher et al. (2018). We trimmed all sequences to 147bp using fastX_trimmer (FASTX-Toolkit) to exclude low-quality calls near the 3’ of the reads. We removed reads containing at least a single base with a Phred quality score of less than 10 (using fastq_quality_filter). We removed sequences if more than 5% of the bases had a Phred quality score of less than 20. Using the process_radtags module from the STACKS version 2.3 (Catchen et al., 2013), we demultiplexed the reads to obtain files with specific sequences for each individual.