In silico comparative genomic analysis of arsenic resistance operons in bacterial representatives associated with Urmia and Van lakes
Data files
Mar 04, 2026 version (files: 243.77 KB)
- ars_operons.fasta (76.37 KB)
- ars_operons.gff (104.59 KB)
- README.md (4.91 KB)
- Supplementary_Table_1.csv (1.23 KB)
- Supplementary_Table_2.csv (53.70 KB)
- Supplementary_Table_3.csv (2.98 KB)
Abstract
Arsenic contamination poses major ecological and health risks, and microorganisms play key roles in arsenic cycling and detoxification. This study presents a comparative in silico analysis of arsenic resistance operons in representative bacterial genomes associated with two extreme ecosystems, Urmia Hypersaline Lake (Iran) and Van Soda Lake (Türkiye). Six representative genomes were selected based on 16S rRNA gene sequence homology, and their operons were characterised. Both canonical and noncanonical arrangements were identified, reflecting lineage-specific adaptations and horizontal gene transfer. The predicted promoters and terminators indicate regulatory diversity among the operons. Phylogenetic analysis of 16S rRNA gene sequences from 17 isolates showed low nucleotide diversity, whereas functional genes displayed high polymorphism. Haplotype diversity was the greatest for arsR (Hd = 0.960), followed by arsB and arsC (Hd = 0.933 each). Population differentiation analysis showed a significant divergence of arsR between Urmia and Van isolates (F_ST = 0.759), highlighting its role in local adaptation to arsenic-rich environments. These findings suggest a dual strategy for microbial adaptation, maintaining conserved core operons while incorporating diverse accessory genes to broaden detoxification potential. This study provides insights into microbial survival strategies in extreme saline and soda lakes and offers a genomic framework for future functional investigations.
Dataset DOI: 10.5061/dryad.1jwstqk83
The dataset contains nucleotide sequences, annotation files, and tabulated genomic and protein-level information used for the identification and comparative genomic analysis of arsenic resistance (ars) operons in representative bacterial genomes associated with Urmia Hypersaline Lake (Iran) and Van Soda Lake (Türkiye).
All analyses were conducted in silico using publicly available genome and protein sequences.
FILE DESCRIPTIONS AND VARIABLE DEFINITIONS
1. ars_operons.fasta
Description:
FASTA file containing nucleotide sequences of predicted arsenic resistance operons identified in representative genomes.
Each FASTA entry corresponds to one predicted ars operon sequence.
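As an illustration of how the one-entry-per-operon layout can be consumed, the following is a minimal sketch of a FASTA reader that uses only the Python standard library. The function name `read_fasta` and the demo headers are hypothetical; the actual header format in ars_operons.fasta may differ.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID (first word of the header line) to its sequence."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the text up to the first whitespace as the ID
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the record currently being read
            records[current_id] += line
    return records


# Hypothetical two-record input, for illustration only
example = [">operon_A demo", "ATGC", "GGTA", ">operon_B", "TTAACC"]
operons = read_fasta(example)
for name, seq in operons.items():
    print(name, len(seq))
```

To read the real file, an open file handle can be passed in place of the list, for example `read_fasta(open("ars_operons.fasta"))`.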
2. ars_operons.gff
Description:
GFF annotation file containing gene coordinates and functional annotations corresponding to the operon sequences listed in ars_operons.fasta.
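The sketch below shows one way to load the gene coordinates, assuming the file follows the usual nine-column, tab-separated GFF layout (seqid, source, type, start, end, score, strand, phase, attributes). The function name `read_gff` and the demo lines are hypothetical and are not taken from ars_operons.gff.

```python
from typing import Dict, Iterable, List

GFF_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def read_gff(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Parse nine-column GFF feature lines into dictionaries."""
    features: List[Dict[str, object]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # skip anything that is not a feature line
        feature: Dict[str, object] = dict(zip(GFF_COLUMNS, fields))
        # GFF coordinates are 1-based and inclusive
        feature["start"] = int(fields[3])
        feature["end"] = int(fields[4])
        features.append(feature)
    return features


# Hypothetical feature lines, for illustration only
example = [
    "##gff-version 3",
    "operon_A\tdemo\tCDS\t1\t330\t.\t+\t0\tName=arsR",
    "operon_A\tdemo\tCDS\t350\t1639\t.\t+\t0\tName=arsB",
]
features = read_gff(example)
lengths = [f["end"] - f["start"] + 1 for f in features]
print(lengths)
```

If the seqid column matches the FASTA record IDs, the coordinates can be used to slice individual genes out of the corresponding operon sequence.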
3. Supplementary_Table_1.csv
Description:
This table lists bacterial isolates reported from Lake Van and Lake Urmia, together with the representative genomes selected for comparative genomic analysis.
Each row corresponds to one isolate.
Variables
Microorganism
Name of the bacterial isolate as originally reported.
Source
Sampling location of the isolate (e.g., Lake Van or Lake Urmia).
Partial_16S_rRNA_accession
NCBI accession number of the partial 16S rRNA gene sequence used for homology comparison.
Selected_representative_genome
Name of the genome selected as representative based on highest sequence similarity.
Homology_percent
Percentage sequence similarity between the isolate 16S rRNA gene and the selected representative genome.
Genome_assembly_accession
NCBI genome assembly accession number corresponding to the selected representative genome.
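A minimal sketch of reading this table with the column names defined above is given here. The rows in `demo_csv` are placeholders invented for illustration, not values from Supplementary_Table_1.csv.

```python
import csv
import io
from collections import Counter

# Placeholder rows using the documented column names (values are hypothetical)
demo_csv = """Microorganism,Source,Partial_16S_rRNA_accession,Selected_representative_genome,Homology_percent,Genome_assembly_accession
Isolate A,Lake Van,ACC_0001,Genome X,99.5,ASM_0001
Isolate B,Lake Urmia,ACC_0002,Genome Y,98.0,ASM_0002
Isolate C,Lake Van,ACC_0003,Genome X,97.25,ASM_0001
"""

rows = list(csv.DictReader(io.StringIO(demo_csv)))

# Number of isolates reported from each lake
per_source = Counter(row["Source"] for row in rows)

# Homology values of the isolates assigned to each representative genome
by_genome = {}
for row in rows:
    genome = row["Selected_representative_genome"]
    by_genome.setdefault(genome, []).append(float(row["Homology_percent"]))

print(per_source)
print(by_genome)
```

For the real file, `io.StringIO(demo_csv)` would be replaced by an open handle on Supplementary_Table_1.csv.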
4. Supplementary_Table_2.csv
Description:
This table summarizes protein-level physicochemical characteristics of arsenic resistance operon-associated proteins analyzed in this study. Each row represents a single protein sequence retrieved from public databases.
Variables
Description
Functional annotation of the protein, including protein name and source organism.
Accession
NCBI or UniProt protein accession number corresponding to the listed protein sequence.
Molecular Weight
Calculated molecular weight of the protein in kilodaltons (kDa).
Sequence Length (bp)
Length of the protein sequence in amino acid residues.
(Note: despite the "bp" in the column name, values are amino acid counts, not base pairs.)
% Acidic Amino Acids
Percentage of acidic amino acids (aspartic acid and glutamic acid) in the protein sequence.
% Basic Amino Acids
Percentage of basic amino acids (lysine, arginine, and histidine) in the protein sequence.
% Charged Amino Acids
Percentage of charged amino acids (acidic + basic residues).
% Hydrophobic Amino Acids
Percentage of hydrophobic amino acids in the protein sequence.
% Polar Uncharged Amino Acids
Percentage of polar but uncharged amino acids in the protein sequence.
Molecule Type
Molecule category (AA = amino acid sequence).
Sequence
Amino acid sequence of the protein.
Topology
Predicted protein topology (e.g., linear).
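The charge-related percentages can be recomputed from the Sequence column using the residue classes defined above (acidic: D and E; basic: K, R, and H). The following is a minimal sketch under that assumption. The function name `charge_composition` is hypothetical, and the software that produced the table may round or handle ambiguous residues differently.

```python
ACIDIC = set("DE")   # aspartic acid, glutamic acid
BASIC = set("KRH")   # lysine, arginine, histidine


def charge_composition(sequence: str) -> dict:
    """Return percent acidic, basic, and charged residues of a protein."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    n = len(seq)
    acidic = sum(aa in ACIDIC for aa in seq)
    basic = sum(aa in BASIC for aa in seq)
    return {
        "acidic": 100 * acidic / n,
        "basic": 100 * basic / n,
        "charged": 100 * (acidic + basic) / n,  # acidic + basic residues
    }


# Hypothetical 10-residue peptide: 2 acidic (D, E) and 3 basic (K, R, H)
demo = charge_composition("MDEKRHAAAA")
print(demo)
```

The hydrophobic and polar uncharged classes are not enumerated in this README, so they are left out of the sketch.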
5. Supplementary_Table_3.csv
Description:
This table lists arsenic resistance operons identified in representative bacterial genomes. Each row corresponds to a single protein encoded within an operon.
Operons are numbered separately for each microorganism.
Variables
Microorganism
Name of the representative bacterial genome in which the operon was identified.
Operon number
Numerical identifier assigned to each operon within a given microorganism. Operon numbering restarts for each genome.
Protein
Name of the protein encoded within the operon (e.g., ArsR, ArsB, ArsC, ArsM, P-type ATPase).
Accession
NCBI or UniProt protein accession number corresponding to the listed protein.
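Because operon numbering restarts for each genome, an operon is uniquely identified only by the pair (Microorganism, Operon number). The sketch below groups proteins by that pair. The rows in `demo_csv` are hypothetical placeholders, not values from Supplementary_Table_3.csv.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows using the documented column names (values are hypothetical)
demo_csv = """Microorganism,Operon number,Protein,Accession
Genome X,1,ArsR,P_0001
Genome X,1,ArsB,P_0002
Genome X,2,ArsC,P_0003
Genome Y,1,ArsR,P_0004
"""

# Key each operon by (genome, operon number), since numbers repeat across genomes
operons = defaultdict(list)
for row in csv.DictReader(io.StringIO(demo_csv)):
    key = (row["Microorganism"], int(row["Operon number"]))
    operons[key].append(row["Protein"])

for (genome, number), proteins in sorted(operons.items()):
    print(genome, number, "-".join(proteins))
```

Grouping on the operon number alone would merge operon 1 of Genome X with operon 1 of Genome Y, which is why the composite key is used.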
Software Used
No custom scripts were generated for this study.
Sequence retrieval, alignment, operon identification, and annotation were performed using:
Geneious Prime v2025.2.2
SnapGene v8.2.1
Unipro UGENE v52.1
Phylogenetic analyses were conducted using MEGA v12.0.11.
Homology searches were performed using NCBI BLAST+ (blastn and tblastn).
Population genetic analyses were conducted using DnaSP v6.12.03.
Promoter prediction was performed using BPROM.
Transcription terminator and RNA secondary structure predictions were conducted using ARNold and RNAfold.
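For readers unfamiliar with the haplotype diversity (Hd) statistic reported in the abstract, the sketch below shows the commonly used definition, Hd = n/(n - 1) × (1 - Σ p_i²), where n is the number of sequences and p_i is the frequency of haplotype i. This is an illustrative sketch only and not the DnaSP implementation: it assumes aligned sequences of equal length and treats every distinct string as a separate haplotype, whereas DnaSP may handle gaps and missing data differently. The function name `haplotype_diversity` and the demo sequences are hypothetical.

```python
from collections import Counter
from typing import Sequence


def haplotype_diversity(sequences: Sequence[str]) -> float:
    """Hd = n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(sequences)
    if n < 2:
        raise ValueError("at least two sequences are required")
    counts = Counter(sequences)  # each distinct sequence is one haplotype
    sum_sq = sum((count / n) ** 2 for count in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Hypothetical alignment: 4 sequences, 3 haplotypes (frequencies 0.5, 0.25, 0.25)
demo = ["AAAA", "AAAA", "AAAT", "AATT"]
hd = haplotype_diversity(demo)
print(round(hd, 3))
```

Hd ranges from 0, when all sequences are identical, to 1, when every sequence is a distinct haplotype.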
Data Access Information
All nucleotide and protein sequences were retrieved from publicly available databases, including the National Center for Biotechnology Information (NCBI) and UniProt.
Genome assemblies corresponding to representative isolates were obtained from NCBI.
No controlled-access, restricted, or proprietary data were used in this study.
