In silico comparative genomic analysis of arsenic resistance operons in bacterial representatives associated with Urmia and Van lakes

Haghi, Morteza 1; Haghi, Tugce 1; Abbaszade, Zaka 2

Published Mar 04, 2026 on Dryad. https://doi.org/10.5061/dryad.1jwstqk83

Data files

Mar 04, 2026 version files 243.77 KB

Abstract

Arsenic contamination poses major ecological and health risks, and microorganisms play key roles in arsenic cycling and detoxification. This study presents a comparative in silico analysis of arsenic resistance operons in representative bacterial genomes associated with two extreme ecosystems, Urmia Hypersaline Lake (Iran) and Van Soda Lake (Türkiye). Six representative genomes were selected based on 16S rRNA gene sequence homology, and their operons were characterised. Both canonical and noncanonical arrangements were identified, reflecting lineage-specific adaptations and horizontal gene transfer. The predicted promoters and terminators indicate regulatory diversity among the operons. Phylogenetic analysis of 16S rRNA gene sequences from 17 isolates showed low nucleotide diversity, whereas functional genes displayed high polymorphism. Haplotype diversity was greatest for arsR (Hd = 0.960), followed by arsB and arsC (Hd = 0.933 each). Population differentiation analysis showed significant divergence of arsR between Urmia and Van isolates (F_ST = 0.759), highlighting its role in local adaptation to arsenic-rich environments. These findings suggest a dual strategy for microbial adaptation: maintaining conserved core operons while incorporating diverse accessory genes to broaden detoxification potential. This study provides insights into microbial survival strategies in extreme saline and soda lakes and offers a genomic framework for future functional investigations.

Dataset DOI: 10.5061/dryad.1jwstqk83

The dataset contains nucleotide sequences, annotation files, and tabulated genomic and protein-level information used for the identification and comparative genomic analysis of arsenic resistance (ars) operons in representative bacterial genomes associated with Urmia Hypersaline Lake (Iran) and Van Soda Lake (Türkiye).

All analyses were conducted in silico using publicly available genome and protein sequences.

FILE DESCRIPTIONS AND VARIABLE DEFINITIONS

1. ars_operons.fasta

Description:

FASTA file containing nucleotide sequences of predicted arsenic resistance operons identified in representative genomes.

Each FASTA entry corresponds to one predicted ars operon sequence.

2. ars_operons.gff

Description:

GFF annotation file containing gene coordinates and functional annotations corresponding to the operon sequences listed in ars_operons.fasta.
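
Example (illustrative only):

The study generated no custom scripts, so the following is only a minimal sketch of how the two files might be read together with the Python standard library. It assumes that the first column (seqid) of ars_operons.gff matches the identifier in the corresponding FASTA header and that the GFF follows the standard nine-column, tab-separated layout. The record name operon_1 and the two features in the sample text are hypothetical placeholders, not entries from the dataset.

```python
from collections import defaultdict


def parse_fasta(text):
    """Return {record_id: sequence} from FASTA-formatted text."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}


def parse_gff(text):
    """Return {seqid: [(type, start, end, strand, attributes), ...]}."""
    features = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # skip lines that are not standard feature rows
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
        features[seqid].append((ftype, int(start), int(end), strand, attrs))
    return dict(features)


# Hypothetical sample text standing in for the real files.
fasta_text = ">operon_1 example\nATGACCGGTTAA\nCCGGATGAAATAG\n"
gff_text = (
    "##gff-version 3\n"
    "operon_1\texample\tCDS\t1\t12\t.\t+\t0\tName=arsR\n"
    "operon_1\texample\tCDS\t16\t25\t.\t+\t0\tName=arsC\n"
)

sequences = parse_fasta(fasta_text)
annotations = parse_gff(gff_text)

for seqid, feats in annotations.items():
    for ftype, start, end, strand, attrs in feats:
        # GFF coordinates are 1-based and inclusive.
        subseq = sequences[seqid][start - 1:end]
        print(seqid, ftype, attrs, len(subseq))
```

To apply the same functions to the dataset, the file contents can be passed in place of the sample text, for example parse_fasta(open("ars_operons.fasta").read()).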

3. Supplementary_Table_1.csv

Description:

This table lists bacterial isolates reported from Lake Van and Lake Urmia, together with the representative genomes selected for comparative genomic analysis.

Each row corresponds to one isolate.

Variables

Microorganism

Name of the bacterial isolate as originally reported.

Source

Sampling location of the isolate (e.g., Lake Van or Lake Urmia).

Partial_16S_rRNA_accession

NCBI accession number of the partial 16S rRNA gene sequence used for homology comparison.

Selected_representative_genome

Name of the genome selected as representative based on highest sequence similarity.

Homology_percent

Percentage sequence similarity between the isolate 16S rRNA gene and the selected representative genome.

Genome_assembly_accession

NCBI genome assembly accession number corresponding to the selected representative genome.

4. Supplementary_Table_2.csv

Description:

This table summarizes protein-level physicochemical characteristics of arsenic resistance operon-associated proteins analyzed in this study. Each row represents a single protein sequence retrieved from public databases.

Variables

Description

Functional annotation of the protein, including protein name and source organism.

Accession

NCBI or UniProt protein accession number corresponding to the listed protein sequence.

Molecular Weight

Calculated molecular weight of the protein in kilodaltons (kDa).

Sequence Length (bp)

Length of the protein sequence in amino acid residues.

(Note: although the column header reads "bp", the values are amino acid counts, not base pairs.)

% Acidic Amino Acids

Percentage of acidic amino acids (Aspartic acid and Glutamic acid) in the protein sequence.

% Basic Amino Acids

Percentage of basic amino acids (Lysine, Arginine, and Histidine).

% Charged Amino Acids

Percentage of charged amino acids (acidic + basic residues).

% Hydrophobic Amino Acids

Percentage of hydrophobic amino acids in the protein sequence.

% Polar Uncharged Amino Acids

Percentage of polar but uncharged amino acids in the protein sequence.

Molecule Type

Molecule category (AA = amino acid sequence).

Sequence

Amino acid sequence of the protein.

Topology

Predicted protein topology (e.g., linear).
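
Example (illustrative only):

The charge-related percentages defined above can be recomputed from the Sequence column. The sketch below is not the procedure used to build the table, and the software that produced the table may round or count residues differently. It covers only the acidic, basic, and charged classes, because this README does not state which residues were counted as hydrophobic or polar uncharged. The peptide MDEKRHAAAA is a hypothetical test input.

```python
ACIDIC = set("DE")   # Aspartic acid, Glutamic acid
BASIC = set("KRH")   # Lysine, Arginine, Histidine


def composition_percent(sequence):
    """Return length and acidic/basic/charged percentages for a protein."""
    seq = sequence.strip().upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    acidic = sum(1 for aa in seq if aa in ACIDIC)
    basic = sum(1 for aa in seq if aa in BASIC)
    return {
        "length_aa": n,
        "pct_acidic": 100.0 * acidic / n,
        "pct_basic": 100.0 * basic / n,
        # Charged = acidic + basic, as defined for the table.
        "pct_charged": 100.0 * (acidic + basic) / n,
    }


# Hypothetical 10-residue peptide: 2 acidic (D, E) and 3 basic (K, R, H).
example = composition_percent("MDEKRHAAAA")
print(example)
```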

5. Supplementary_Table_3.csv

Description:

This table lists arsenic resistance operons identified in representative bacterial genomes. Each row corresponds to a single protein encoded within an operon.

Operons are numbered separately for each microorganism.

Variables

Microorganism

Name of the representative bacterial genome in which the operon was identified.

Operon number

Numerical identifier assigned to each operon within a given microorganism. Operon numbering restarts for each genome.

Protein

Name of the protein encoded within the operon (e.g., ArsR, ArsB, ArsC, ArsM, P-type ATPase, etc.).

Accession

NCBI or UniProt protein accession number corresponding to the listed protein.
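
Example (illustrative only):

Because operon numbering restarts for each genome, an operon is identified only by the combination of Microorganism and Operon number. The sketch below groups rows by that pair. It assumes the CSV header uses exactly the variable names listed above and a comma delimiter; the genome names and accession strings in the sample are hypothetical placeholders.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows standing in for Supplementary_Table_3.csv.
sample_csv = """Microorganism,Operon number,Protein,Accession
Genome A,1,ArsR,HYPOTHETICAL_001
Genome A,1,ArsB,HYPOTHETICAL_002
Genome A,2,ArsC,HYPOTHETICAL_003
Genome B,1,ArsR,HYPOTHETICAL_004
Genome B,1,ArsM,HYPOTHETICAL_005
"""


def group_operons(handle):
    """Map (microorganism, operon number) to the list of encoded proteins."""
    operons = defaultdict(list)
    for row in csv.DictReader(handle):
        # The operon number alone is ambiguous, so key on the pair.
        key = (row["Microorganism"].strip(), int(row["Operon number"]))
        operons[key].append(row["Protein"].strip())
    return dict(operons)


operons = group_operons(io.StringIO(sample_csv))
for (organism, number), proteins in sorted(operons.items()):
    print(organism, number, "-".join(proteins))
```

For the real table, an open file handle can be passed instead of the in-memory sample, for example group_operons(open("Supplementary_Table_3.csv", newline="")).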

Software Used

No custom scripts were generated for this study.

Sequence retrieval, alignment, operon identification, and annotation were performed using:

Geneious Prime v2025.2.2

SnapGene v8.2.1

Unipro UGENE v52.1

Phylogenetic analyses were conducted using MEGA v12.0.11.

Homology searches were performed using NCBI BLAST+ (blastn and tblastn).

Population genetic analyses were conducted using DnaSP v6.12.03.

Promoter prediction was performed using BPROM.

Transcription terminator and RNA secondary structure predictions were conducted using ARNold and RNAfold.
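
Example (illustrative only):

The haplotype diversity (Hd) values quoted in the abstract were obtained with DnaSP. For readers who want to see what the statistic measures, the sketch below implements the standard estimator Hd = n/(n - 1) * (1 - sum of squared haplotype frequencies), where n is the number of sequences. It is an assumption that this matches DnaSP's output in every case, since DnaSP applies its own handling of gaps and missing data. The four toy sequences are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def haplotype_diversity(sequences):
    """Haplotype diversity: n/(n-1) * (1 - sum(p_i^2)).

    Each distinct aligned sequence string is treated as one haplotype.
    """
    n = len(sequences)
    if n < 2:
        raise ValueError("need at least two sequences")
    counts = Counter(sequences)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical aligned sequences: three haplotypes among four samples.
toy = ["ACGT", "ACGT", "ACGA", "ACGC"]
hd = haplotype_diversity(toy)
print(round(hd, 3))  # 5/6, i.e. 0.833
```

Values close to 1 mean that nearly every sampled sequence is a distinct haplotype, and a value of 0 means that all sampled sequences are identical.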

Data Access Information

All nucleotide and protein sequences were retrieved from publicly available databases, including the National Center for Biotechnology Information (NCBI) and UniProt.

Genome assemblies corresponding to representative isolates were obtained from NCBI.

No controlled-access, restricted, or proprietary data were used in this study.
