De novo assembly of SNPs in VCF format for 112 individualss of Campylorhynchus in western Ecuador

Montalvo, Luis 1 ; Kimball, Rebecca1 ; Austin, James1 ; Robinson, Scott1

Published Jun 20, 2024 on Dryad. https://doi.org/10.5061/dryad.18931zd25

Data files

Jun 20, 2024 version files 16.45 MB

populations.snps.vcf
16.43 MB
README.md
11.67 KB

Abstract

Climate variability has a significant impact on the evolution of biodiversity, which results in genetic and phenotypic diversity within species. A balance between gene flow and selection maintains changes in the frequency of genetic and phenotypic variants that occur along an environmental gradient. Here, we investigate a hybrid zone in western Ecuador involving C. zonatus, C. fasciatus, and admixed populations. We hypothesized that different ecological preferences and geographical distances result in limited dispersal between populations along the precipitation gradient in western Ecuador.

In the context of testing IBE and IBD shaping distributions of C. zonatus, C. fasciatus, and potential hybrids, we asked (1) Is there evidence of genetic admixture and introgression between these taxa in Western Ecuador? And (2) What is the relative contribution of IBE and IBD on patterns of genetic differentiation and admixture patterns? We analyzed 4409 SNPs from the blood of 112 individuals sequenced using ddRadSeq. The most likely clusters ranged from K=2-4, corresponding to categories defined by geographic origins, known phylogenetics, and physical or ecological constraints. Evidence for IBE was weak but stronger for IBD. We observed gradual changes in genetic admixture between C. f. pallescens and C. zonatus along the environmental gradient. Genetic differentiation of the two populations of C. f. pallescens could be driven by a previously undescribed potential physical barrier near the center of western Ecuador. Lowland habitats in this region may be limited due to the proximity of the Andes to the coastline, limiting dispersal and gene flow, particularly among dry-habitat specialists.

We have submitted the resulting populations.snps.vcf (Variant Call Format) file from the de novo assembly performed using the STACK software. This file is the primary dataset utilized for the various analyses described in our article, “Unraveling the Genomic Landscape of Campylorhynchus Wrens along Western Ecuador’s Precipitation Gradient: Insights into Hybridization, Isolation by Distance, and Isolation by Environment.” The article investigates the population structure and hybridization patterns between Campylorhynchus zonatus and C. fasciatus across western Ecuador. Additionally, we have included the file DeNovoAssemblyStack.bash, which contains the Unix code used to perform the de novo assembly with STACK, this code uses as input the raw FASTA files deposited at NCBI Sequence Read Archive under Bioproject accession number PRJNA925654.

Description of the data and file structure

Dataset Overview

This Variant Call Format (VCF) file contains Single Nucleotide Polymorphism (SNP) data from 112 individuals across 16 sampling locations. It comprises 4409 SNPs with an average coverage of 35.4x per individual (min=12.8x, max=77.9x) and an average genotype call rate of 93.5%.

populations.snps.vcf

File Details

File Name: populations.snps.txt
File Format: VCF (Variant Call Format) Version 4.2
Date: September 6, 2020
Source Software: Stacks v2.41

Data Description

The VCF file includes the following columns standard to the format:

#CHROM: Chromosome number
POS: Position of the SNP on the chromosome
ID: Identifier of the SNP
REF: Reference base
ALT: Alternate base(s)
QUAL: Quality score of the SNP
FILTER: Filter status
INFO: Additional information (e.g., allele frequency, number of samples)
FORMAT: Data format
Sample columns: One per individual, containing genotype information

Specific Fields in `INFO`

NS: Number of samples with data
AF: Allele frequency
DP: Total depth of reads

Specific Fields in `FORMAT`

GT: Genotype
AD: Allele depth
DP: Read depth
GQ: Genotype quality
GL: Genotype likelihood

How to Open and Analyze the Data

The VCF file can be opened and analyzed using various bioinformatics tools and software, some of the most common include:

IGV (Integrative Genomics Viewer): Useful for visualizing genomic data.
BCFtools: Command-line tool for working with VCF/BCF files, including filtering, viewing, and converting.
GATK (Genome Analysis Toolkit): Provides tools to analyze high-throughput sequencing data with a focus on variant discovery.
vcftools: Command-line program designed to work specifically with VCF files to perform various types of analyses.

Sharing/Access information

All sequencing data from this study has been deposited at NCBI Sequence Read Archive under Bioproject accession number PRJNA925654.

All scripts for the analyses derived below can be accessed at https://github.com/ldmontalvo/Landscape-Genomic-Wrens

DeNovoAssemblyStack.bash

Content Description

SLURM Directives: The script begins with SLURM directives for job scheduling on a high-performance computing (HPC) cluster:
- --ntasks=1: Requests one task.
- --cpus-per-task=8: Allocates eight CPUs per task.
- --mem-per-cpu=1gb: Allocates 1 GB of memory per CPU.
- --time=05:00:00: Sets the maximum job run time as five hours.
- --job-name=check_test: Names the job “check_test”.
- --mail-type=END,FAIL: Configures SLURM to send an email when the job ends or fails.
- --mail-user=ldmontalvo@ufl.edu: Specifies the email address to receive job notifications.
- --output=check_test_%j.log: Directs the output to a log file named with the job ID.
Environment Setup: The script prints the current directory, hostname, and date to the output file, providing a log of where and when the job was executed.
Module Loading: It loads the STACKS software module, preparing the environment for executing STACKS-related commands.
Directory and File Preparation:
- Creates a new directory ./denovo_n5nb to store the output of the de novo assembly.
- Runs denovo_map.pl, a script from STACKS, with various parameters like minimum stack depth (-m 3), mismatches between loci within individuals (-M 5), mismatches between loci in the catalog (-n 5), using 15 threads (-T 15), and specifies input and output directories.
Population Analysis:
- Executes the populations program from STACKS with parameters to specify the input directory, population map, minimum percentage of individuals in a population required to process a locus (-r 0.8), minimum number of populations a locus must be present in to process it (-p 1), and output format flags for STRUCTURE, VCF, and Genepop files.
Background Execution:
- Runs the script Wren_denovo_n5nb.sh in the background using nohup to prevent it from stopping if the session disconnects and appends the output to nohup.out.

How to Run the Script

To run this script on a Unix-like system (assuming you have access to a cluster using SLURM workload manager):

Transfer the Script: Upload the DeNovoAssemblyStack.bash script to your cluster’s user directory.
Permission: Change the permission of the file to make it executable:
```
bash
```
Copy code

chmod +x DeNovoAssemblyStack.txt
Submission: Submit the script to the SLURM scheduler:
```
Copy code
```
sbatch DeNovoAssemblyStack.bash

Methods

Data Collection

We collected blood samples from the brachial vein of 112 individuals of Campylorhynchus along Western Ecuador. Most samples were collected between July and December 2018. We also obtained 27 tissue samples of C. fasciatus from Peru at the Florida Museum of Natural History (FLMNH). DNA for all samples was extracted using Qiagen DNeasy blood & tissue kits (Qiagen, Maryland USA). We followed the ddRAD-seq protocol by Peterson et al. (2012) and modified by Thrasher et al. (2018). We trimmed all sequences to 147bp using fastX_trimmer (FASTX-Toolkit) to exclude low-quality calls near the 3 of the reads. We removed reads containing at least a single base with a Phred quality score of less than 10 (using fastq_quality_filter). We removed sequences if more than 5% of the bases had a Phred quality score of less than 20. Using the process_radtags module from the STACKS version 2.3 (Catchen et al., 2013), we demultiplexed the reads to obtain files with specific sequences for each individual.

We assembled the sequences de novo using the STACKS pipeline (Catchen et al., 2013). First, we selected 12 samples with the highest number of reads and ran denovo_map.pl testing values from one to nine for -M (number of mismatches allowed between stacks within individuals) and n (number of mismatches permitted between stacks between individuals) following the n=M rule (Paris et al., 2017) while keeping m=3 (stack depth/number of identical reads required to initiate a new allele). We kept r=0.8 (the minimum percentage of individuals in a population required to process a locus for that population). It has been shown that at least 80% of the population should present a specific locus to be included, known as the 80% rule or r80 loci (Paris et al., 2017). We set all samples to the same population (p=1) for the parameter testing assembly. We plotted the number of SNPs called against the M parameters to find the optimum M, after which no additional SNP calling was observed. We found an optimum value for n=m=5 for the final de novo assemblies. After the parameters testing assembly, we performed de novo assembly with all the samples and the parameters described above and set p=1. When a RAD locus had more than one SNP, the data were restricted to the first (–write_single_snp) to avoid including SNPs in high linkage disequilibrium (LD). We required a minor allele frequency of at least 0.05 to process a nucleotide site (–min_maf).

Analyses Performed using the Data Set

In our study, we analyzed the population structure and admixture patterns of Campylorhynchus wrens using a combination of spatial principal component analysis (sPCA) and Bayesian clustering. The sPCA, conducted through the Adegenet package in R (Jombart, 2008), helped visualize genetic structures by building a distance-based connection network based on the maximum dispersal distances observed in Cactus Wrens. This analysis differentiated global structures (clines and broad genetic structures) from local structures (sharp genetic divergences between neighbors), and its significance was tested via a Monte Carlo procedure with extensive permutations. Complementarily, STRUCTURE software version 2.3.4; (Pritchard et al., 2000) was used to estimate individual membership coefficients across potentially mixed populations, incorporating prior geographical information and allowing for admixture. The optimal number of genetic clusters was determined through an analysis in STRUCTURE HARVESTER (Earl & Vonholdt, 2012) and further refined by discriminant analysis of principal components (DAPC), enhancing our understanding of the genetic grouping and population structure.

Further, we delved into the genetic diversity from the R package dartR (Gruber et al., 2018) within these populations by employing DAPC to ascertain the most probable number of genetic clusters, which was corroborated by STRUCTURE when considering K=4. This analysis not only allowed a detailed classification within the population but also aligned the genetic clusters with previously recognized biogeographical regions, indicating potential barriers within Western Ecuador. By leveraging allele frequencies and inbreeding coefficients, we explored the genetic health and diversity of these clusters, ensuring robustness in our estimates by adjusting for potential biases from related individuals. This comprehensive genetic profiling was bolstered by an analysis of molecular variance (AMOVA) to quantify genetic differentiation, providing a nuanced understanding of the genetic landscape that influences these bird populations across varied ecological zones.

References

Earl, D. A., & Vonholdt, B. M. (2012). STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the EVANNO method. Conservation Genetics Resources, 4(2), 359–361. https://doi.org/10.1007/s12686-011-9548-7

Catchen, J., Hohenlohe, P. A., Bassham, S., Amores, A., & Cresko, W. A. (2013). Stacks: An analysis tool set for population genomics. Molecular Ecology, 22(11), 3124–3140. https://doi.org/10.1111/mec.12354

Gruber, B., Unmack, P. J., Berry, O. F., & Georges, A. (2018). dartR: An r package to facilitate analysis of SNP data generated from reduced representation genome sequencing. Molecular Ecology Resources, 18(3), 691–699. https://doi.org/10.1111/1755-0998.12745

Jombart, T. (2008). adegenet: A R package for the multivariate analysis of genetic markers. Bioinformatics, 24(11), 1403–1405. https://doi.org/10.1093/bioinformatics/btn129

Peterson, B. K., Weber, J. N., Kay, E. H., Fisher, H. S., & Hoekstra, H. E. (2012). Double Digest RADseq: An Inexpensive method for De Novo SNP discovery and genotyping in model and non-model species. PLoS ONE, 7(5), e37135.

Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945–959.

De novo assembly of SNPs in VCF format for 112 individualss of Campylorhynchus in western Ecuador

Data files

Abstract

README: De novo assembly of SNPs in VCF format for 112 individuals of Campylorhynchus in Western Ecuador.

Description of the data and file structure

Dataset Overview

populations.snps.vcf

File Details

Data Description

Specific Fields in INFO

Specific Fields in FORMAT

How to Open and Analyze the Data

Sharing/Access information

DeNovoAssemblyStack.bash

Content Description

How to Run the Script

Methods

Data Collection

Analyses Performed using the Data Set

References

Methods

Works referencing this dataset

Specific Fields in `INFO`

Specific Fields in `FORMAT`