Selection pressure analysis of dengue virus complete genome and E gene nucleotide sequences from Pakistan
Data files
May 31, 2024 version files 571.85 KB
Abstract
This dataset comprises 43 E gene and 44 complete genome nucleotide sequences of the dengue virus from serotypes DENV-1 to DENV-4, representing all documented sequences in Pakistan to date, sourced from the Virus Pathogen Resource (ViPR) database and NCBI. The E gene is critical as it is involved in serotype changes of the dengue virus, making it a pivotal target for understanding shifts in viral pathogenicity and immune escape mechanisms. The aim of compiling this dataset is to facilitate comprehensive genetic analysis and enhance understanding of the evolutionary dynamics of the dengue virus within the region. To assess the evolutionary pressures acting on these sequences, we conducted a selection pressure analysis utilizing computational methods. These methods include the Single Likelihood Ancestor Counting (SLAC), Fixed Effects Likelihood (FEL), adaptive Branch Site Random Effects Likelihood (aBSREL), Mixed Effects Model of Evolution (MEME), and the Genetic Algorithm for Recombination Detection (GARD), all implemented in the HyPhy software package. Our analysis focused on identifying genomic sites under both positive and negative selection pressures, providing insights into the adaptive evolutionary processes affecting the E gene of the dengue virus in Pakistan. Understanding the molecular evolution of this gene is crucial for predicting serotype evolution, potentially aiding in the development of effective vaccines and therapeutic strategies.
README: Selection pressure analysis of dengue virus complete genome and E gene nucleotide sequences from Pakistan
https://doi.org/10.5061/dryad.cjsxksnff
Dataset Summary:
This dataset contains 43 E gene and 44 complete genome nucleotide sequences of the dengue virus, encompassing all four serotypes (DENV-1, DENV-2, DENV-3, and DENV-4) identified in Pakistan to date. Additionally, the dataset includes four reference sequences of the dengue virus and six sequences from regions outside Pakistan to provide a broader comparative perspective. All sequences were retrieved from the Virus Pathogen Resource (ViPR) database.
Experimental Procedures:
1. Data Collection and Sequence Alignment: Sequences were aligned using MUSCLE for initial processing and MEGA X for detailed phylogenetic analyses. This dual approach ensures robust sequence alignment critical for accurate downstream analysis.
2. Phylogenetic Analysis: After alignment, a phylogenetic tree was constructed using the best nucleotide substitution model, selected based on the highest likelihood ratio test (LRT) value. This step is crucial for understanding the evolutionary relationships and mutational patterns within the sequences.
3. Selection Pressure Analysis: Advanced methods implemented in the HyPhy software package were employed to analyze evolutionary pressures. Techniques such as SLAC, FEL, aBSREL, MEME, and GARD were used to identify sites under positive and negative selection, providing insights into the adaptive evolutionary processes of the virus.
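The likelihood ratio test mentioned in step 2 compares two nested models by their maximised log-likelihoods. The following is a minimal sketch of that calculation, assuming the models differ by one free parameter. The function names and the log-likelihood values are placeholders chosen for illustration and are not results from this dataset.

```python
import math


def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Return the LRT statistic 2 * (lnL_alt - lnL_null) for two nested models."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


# Placeholder log-likelihoods (hypothetical, NOT taken from the dataset).
lnl_simple = -10250.0  # simpler (null) model
lnl_rich = -10245.0    # model with one extra free parameter

stat = lrt_statistic(lnl_simple, lnl_rich)
p_value = chi2_sf_df1(stat)
print(f"LRT statistic = {stat:.2f}, p = {p_value:.4g}")
```

A small p-value would favour the richer model. Models that differ by more than one parameter need a chi-square distribution with the matching degrees of freedom, which this sketch does not cover.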
Results:
The analysis revealed distinct patterns of selection pressures on the E gene across different serotypes of the dengue virus in Pakistan. Sites under positive selection suggest adaptive evolution potentially linked to serotype changes and immune escape mechanisms. Conversely, sites under negative selection highlight evolutionary conservation, crucial for maintaining essential viral functions. These findings contribute to a deeper understanding of the E gene's role in the pathogenicity and epidemiology of the dengue virus, offering potential targets for therapeutic and vaccine strategies.
Description of the data and file structure
Data Files:
- denvE.txt: This FASTA file contains aligned nucleotide sequences of the E gene from 43 Pakistani dengue virus isolates, covering all four serotypes (DENV-1 to DENV-4).
- denvcomp.txt: This FASTA file contains aligned nucleotide sequences of the complete genome of 44 isolates from Pakistan, covering all four serotypes (DENV-1 to DENV-4).
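As a starting point for working with these files, the sketch below reads FASTA-formatted text and checks that every sequence has the same aligned length. It is an illustration only: the helper names are hypothetical, the example records are invented, and the parser assumes unique headers and plain FASTA layout.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into a {header: sequence} dictionary.

    Assumes headers are unique; a repeated header would overwrite the earlier record.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def alignment_length(records: dict) -> int:
    """Return the shared sequence length, or raise if the sequences are not aligned."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"sequences have differing lengths: {sorted(lengths)}")
    return lengths.pop()


# Tiny in-memory example (invented records, not real isolates).
example = ">iso1\nATG-CC\nGGA\n>iso2\nATGACC\nGGA\n"
records = parse_fasta(example)
print(len(records), "sequences of aligned length", alignment_length(records))
```

To apply this to the dataset, the text of denvE.txt or denvcomp.txt could be passed in, for example parse_fasta(open("denvE.txt").read()).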
File Content:
· Sequences are aligned, and any gaps introduced during alignment should be removed before comparative analyses.
Using the Data:
1. Sequence Analysis: Users can employ the provided sequences to perform phylogenetic analysis, molecular dating, or further selection pressure analysis. Tools like MEGA X, HyPhy, or similar bioinformatics software can be used to analyze these sequences.
2. Comparative Studies: The inclusion of reference sequences and isolates from other regions allows for comparative studies to understand regional variations and evolutionary trends in the E gene among dengue virus serotypes.
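For users who want to rerun the selection analyses from a script, the sketch below assembles HyPhy command lines without executing them. It rests on several assumptions: that a `hyphy` executable is on the PATH, that the installed version accepts the `--alignment` and `--tree` options (check the HyPhy documentation for your version), and that a Newick tree file exists. The tree file name `denvE.nwk` is hypothetical and is not part of this dataset.

```python
import shlex
from typing import List, Optional


def hyphy_command(method: str, alignment: str, tree: Optional[str] = None) -> List[str]:
    """Build an argument list for one HyPhy analysis (assumed command-line options)."""
    argv = ["hyphy", method, "--alignment", alignment]
    if tree is not None:
        argv += ["--tree", tree]
    return argv


# GARD screens for recombination first and is given only the alignment here.
commands = [hyphy_command("gard", "denvE.txt")]

# The remaining methods are given the alignment plus a tree (hypothetical file name).
for method in ["slac", "fel", "meme", "absrel"]:
    commands.append(hyphy_command(method, "denvE.txt", "denvE.nwk"))

for argv in commands:
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or the argument list can be passed to `subprocess.run` once the paths and options have been verified locally.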
Missing Data:
· In this dataset, missing data refers primarily to gaps introduced into the nucleotide sequences during alignment. These gaps should be removed before further analysis, such as selection pressure analysis.
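One possible way to remove gaps while keeping the reading frame intact is to drop whole codon columns that contain a gap in any sequence. The sketch below illustrates this under the assumption that the alignment starts in frame and has a length divisible by three. The function name, the gap characters, and the example sequences are illustrative and do not describe the authors' procedure.

```python
def drop_gapped_codons(aligned: dict, gap_chars: str = "-?") -> dict:
    """Remove every codon column that contains a gap character in any sequence."""
    if not aligned:
        return {}
    seqs = list(aligned.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    if length % 3 != 0:
        raise ValueError("alignment length must be a multiple of 3")
    # Start positions of codon columns that are gap-free in all sequences.
    keep = [
        start
        for start in range(0, length, 3)
        if not any(ch in gap_chars for s in seqs for ch in s[start:start + 3])
    ]
    return {
        name: "".join(seq[i:i + 3] for i in keep)
        for name, seq in aligned.items()
    }


# Invented example: codon columns 2 and 4 contain gaps and are dropped.
aligned = {"iso1": "ATG---GGATTT", "iso2": "ATGACCGGA-TT"}
cleaned = drop_gapped_codons(aligned)
print(cleaned)
```

Dropping columns rather than stripping gap characters from each sequence keeps the sequences aligned to one another, which codon-based selection methods require.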
Potential Uses:
· This dataset is particularly useful for researchers studying the genetic diversity and evolutionary dynamics of the dengue virus. It can aid in identifying mutation patterns linked to virus transmission and pathogenicity.
· Public health researchers might use the data to track changes in virus strains over time or across geographic locations, which is crucial for developing targeted vaccines and therapies.
Access information
Data was derived from the following sources:
Software
MEGA-X was used for analysis.
Methods
Data Collection: The dataset consists of all documented Pakistani dengue virus E gene and complete genome nucleotide sequences available as of the latest update. These sequences were acquired from the Virus Pathogen Resource (ViPR) database, which is a comprehensive and freely accessible resource providing sequence data and related information on viral pathogens. The selected sequences span all four serotypes of the dengue virus, specifically DENV-1, DENV-2, DENV-3, and DENV-4, encompassing a total of 43 E gene sequences.
Data Processing and Analysis: Upon collection, the sequences were subjected to several preprocessing steps to ensure data integrity and uniformity. Initially, sequence alignment for the 43 E gene sequences was performed using MEGA-X, a robust tool for aligning nucleotide sequences and conducting phylogenetic analyses. This was followed by further alignment using MUSCLE (Multiple Sequence Comparison by Log-Expectation), which creates high-quality alignments of nucleotide sequences. The aligned sequences were then analyzed to detect any potential recombination events using the Genetic Algorithm for Recombination Detection (GARD) to ensure that subsequent analyses on evolutionary pressures were not confounded by these genetic events.
Following preprocessing, the dataset underwent a comprehensive selection pressure analysis to identify the evolutionary forces acting on the E gene. This analysis employed a suite of methods implemented in the HyPhy software package:
- Single Likelihood Ancestor Counting (SLAC): Used to infer site-by-site selection pressures, estimating synonymous and non-synonymous substitution rates to identify negatively selected sites.
- Fixed Effects Likelihood (FEL): Applied to detect both pervasive positive and negative selection across the phylogeny.
- Adaptive Branch Site Random Effects Likelihood (aBSREL): Utilized to test for episodic diversifying selection on different branches of the phylogenetic tree.
- Mixed Effects Model of Evolution (MEME): Employed to detect sites undergoing episodic positive selection that might not be pervasive throughout the tree.
- Genetic Algorithm for Recombination Detection (GARD): Used in the initial processing stage to analyze and adjust for the presence of recombination within the dataset, which can significantly affect the accuracy of phylogenetic inference and subsequent selection pressure analyses.
These analyses provided insights into both positive and negative selection pressures on the E gene, highlighting the evolutionary trends that could influence serotype changes and viral adaptability.
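The methods listed above all rest on the distinction between synonymous substitutions (the encoded amino acid is unchanged) and non-synonymous substitutions (the amino acid changes). The sketch below makes that distinction concrete by counting differing codons between two aligned, in-frame sequences using the standard genetic code. It is a simplified illustration with invented sequences and a hypothetical function name. It produces raw counts only and does not reproduce the likelihood-based rate estimation that SLAC, FEL, or MEME perform in HyPhy.

```python
from itertools import product

# Standard genetic code, with codons ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_codon_differences(seq_a: str, seq_b: str) -> dict:
    """Count differing codons between two aligned sequences by substitution type."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be the same length and a multiple of 3")
    counts = {"synonymous": 0, "nonsynonymous": 0}
    for i in range(0, len(seq_a), 3):
        a = seq_a[i:i + 3].upper()
        b = seq_b[i:i + 3].upper()
        # Skip identical codons and codons with gaps or ambiguous bases.
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue
        if CODON_TABLE[a] == CODON_TABLE[b]:
            counts["synonymous"] += 1
        else:
            counts["nonsynonymous"] += 1
    return counts


# Invented sequences: CTG -> CTA keeps leucine, AAA -> AGA changes lysine to arginine.
counts = classify_codon_differences("ATGCTGAAA", "ATGCTAAGA")
print(counts)
```

An excess of non-synonymous over synonymous change at a site, after normalising for the number of possible changes of each type, is the kind of signal the site-level methods interpret as positive selection. The reverse pattern points to negative selection.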