Data from: Sequencing of seven haloarchaeal genomes reveals patterns of genomic flux

Lynch, Erin A.; Langille, Morgan G. I.; Darling, Aaron; Wilbanks, Elizabeth G.; Haltiner, Caitlin; Shao, Katie S. Y.; Starr, Michael O.; Teiling, Clotilde; Harkins, Timothy T.; Edwards, Robert A.; Eisen, Jonathan A.; Facciotti, Marc T.; Randau, Lennart

Published Aug 08, 2012 on Dryad. https://doi.org/10.5061/dryad.j08jp

Data files

Aug 08, 2012 version files (23.32 MB total)

Dataset_S1.cdt (6.32 MB)
Dataset_S3.zip (16.97 MB)
journal-2.pone.0041389.s013.txt (32.41 KB)

Abstract

We report the sequencing of seven genomes from two haloarchaeal genera, Haloferax and Haloarcula. Ease of cultivation and the existence of well-developed genetic and biochemical tools for several diverse haloarchaeal species make haloarchaea a model group for the study of archaeal biology. The unique physiological properties of these organisms also make them good candidates for novel enzyme discovery for biotechnological applications. Seven genomes were sequenced to ~20× coverage and assembled to an average of 50 contigs (range 5 scaffolds - 168 contigs). Comparisons of protein-coding gene complements revealed large-scale differences in COG functional group enrichment between these genera. Analysis of genes encoding machinery for DNA metabolism reveals genus-specific expansions of the general transcription factor TATA binding protein as well as a history of extensive duplication and horizontal transfer of the proliferating cell nuclear antigen. Insights gained from this study emphasize the importance of haloarchaea for investigation of archaeal biology.

Dataset S3 - Genome assemblies in fasta format

Genome assemblies of six halophilic archaea. Organisms were obtained from culture collections and grown, and each genome was then shotgun sequenced using 454 pyrosequencing. The genomes were then assembled into contigs and scaffolds. Data for each organism are in a separate file in fasta format. Each scaffold is labelled with a header line that begins with the '>' character.
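The fasta layout described above (a '>' header line followed by sequence lines) can be read with a few lines of Python. This is a minimal sketch, not part of the dataset: the function name read_fasta and the two-scaffold example text are hypothetical, and the real files in Dataset_S3.zip may use different header names.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a fasta-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-scaffold text standing in for one assembly file.
example = io.StringIO(">scaffold_1\nACGTAC\nGGTA\n>scaffold_2\nTTGACA\n")
lengths = {name: len(seq) for name, seq in read_fasta(example)}
print(lengths)
```

To use it on an extracted assembly, pass an open file object instead of the in-memory example.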

Dataset_S3.zip

Dataset S1 - Syntenic Halophile Tribes matrix

To determine the phylogenetic distribution of haloarchaeal genes, a gene presence/absence matrix was constructed as follows. Independent multi-genome alignments were made for the Haloferax and Haloarcula genera using the whole-genome alignment method progressiveMauve [64]. The contigs for each alignment were reordered to match the published genomes of Haloferax volcanii [19] and Haloarcula marismortui [18], respectively, using Mauve’s built-in contig reordering program (Figures S3 and S4).

Sets of functionally homologous genes (orthologs), referred to hereafter as Syntenic Halophile Tribes (SHTs), were determined from the alignments and joined as follows. The proteins in each SHT from the Haloferax alignment were searched against all proteins in each SHT from the Haloarcula genomes using BLAST [37], and a bit score for each pair of SHTs was calculated by averaging the bit scores of the BLAST hits. A traditional reciprocal best hit (RBH) BLAST approach was used to produce one-to-one mappings between SHTs in the two genera. Each joined SHT was assigned a function using the most commonly occurring functional annotation of the protein products of the genes in the SHT. This resulted in a set of 398 SHTs present in all nine genomes.

Hidden Markov Models (HMMs) were generated for each SHT using HMMER 3, resulting in 13,276 HMMs. The 1,303 completed archaeal and bacterial genomes available from NCBI as of March 15, 2011 were downloaded and a single genome from each genus was selected at random, resulting in 396 genomes. Each SHT HMM was searched against these 396 genomes and the eight halophile genomes generated for this study using HMMER 3. A gene was counted as belonging to an HMM if the hit had an E-value below 0.0001 and covered more than 80% of the length of both the gene and the HMM. If a gene hit more than one HMM, it was counted only for the HMM with the best E-value.
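The hit-counting rule at the end of this description (E-value below 0.0001, more than 80% coverage of both gene and HMM, best E-value wins) can be sketched in Python. This is an illustration only, not the authors' pipeline: the Hit record, the function assign_genes, and the example gene and SHT names are all hypothetical, and the two coverage fractions are assumed to have been computed beforehand from the alignment coordinates in the search output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    gene: str
    hmm: str
    evalue: float
    gene_cov: float  # fraction of the gene length covered by the hit (assumed precomputed)
    hmm_cov: float   # fraction of the HMM length covered by the hit (assumed precomputed)


EVALUE_MAX = 1e-4    # hits must have an E-value below this threshold
MIN_COVERAGE = 0.80  # hits must cover more than this fraction of gene and HMM


def assign_genes(hits):
    """Map each gene to the single HMM whose passing hit has the best (lowest) E-value."""
    best = {}
    for hit in hits:
        if hit.evalue >= EVALUE_MAX:
            continue
        if hit.gene_cov <= MIN_COVERAGE or hit.hmm_cov <= MIN_COVERAGE:
            continue
        current = best.get(hit.gene)
        if current is None or hit.evalue < current.evalue:
            best[hit.gene] = hit
    return {gene: hit.hmm for gene, hit in best.items()}


# Hypothetical hits: geneA passes for two HMMs, geneB fails the E-value
# threshold, geneC fails the gene-coverage threshold, geneD passes once.
hits = [
    Hit("geneA", "SHT_1", 1e-30, 0.95, 0.90),
    Hit("geneA", "SHT_2", 1e-10, 0.99, 0.99),
    Hit("geneB", "SHT_2", 1e-3, 0.95, 0.95),
    Hit("geneC", "SHT_3", 1e-20, 0.60, 0.95),
    Hit("geneD", "SHT_3", 1e-8, 0.85, 0.81),
]
assigned = assign_genes(hits)
print(assigned)
```

Ties in E-value are resolved here in favour of the first hit seen; the description does not say how ties were handled.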
These hits were then used to generate a 13,276 x 405 presence/absence matrix. The genomes and HMMs were clustered using the ‘ctc’ library in R [65] with Manhattan distance and complete linkage clustering. The clustering was viewed with the Java Treeview program [66]. The cluster file can be accessed at our website [60] and as Dataset S1 and Figure S5.
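The matrix construction and clustering can be illustrated on a toy example. The sketch below is a plain-Python stand-in, not the R workflow used for Dataset S1: it builds a small presence/absence matrix, computes Manhattan distances between rows, and merges rows by naive complete linkage. The function names, genome labels, and SHT labels are hypothetical, and the naive algorithm is only practical for small inputs.

```python
from itertools import combinations


def presence_matrix(assignments, hmms, genomes):
    """Return {hmm: [0/1 per genome]} from an iterable of (genome, hmm) pairs."""
    present = set(assignments)
    return {h: [1 if (g, h) in present else 0 for g in genomes] for h in hmms}


def manhattan(a, b):
    """Manhattan distance: sum of absolute differences between two rows."""
    return sum(abs(x - y) for x, y in zip(a, b))


def complete_linkage(vectors):
    """Naive agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        def dist(pair):
            # Complete linkage: distance between the two farthest members.
            a, b = pair
            return max(manhattan(vectors[x], vectors[y]) for x in a for y in b)

        a, b = min(combinations(clusters, 2), key=dist)
        merges.append((a, b, dist((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical toy data: four genomes and three SHT HMMs.
genomes = ["G1", "G2", "G3", "G4"]
hmms = ["SHT_1", "SHT_2", "SHT_3"]
assignments = [
    ("G1", "SHT_1"), ("G2", "SHT_1"), ("G3", "SHT_1"), ("G4", "SHT_1"),
    ("G1", "SHT_2"), ("G2", "SHT_2"), ("G3", "SHT_2"),
    ("G4", "SHT_3"),
]
matrix = presence_matrix(assignments, hmms, genomes)
merges = complete_linkage(matrix)
for a, b, d in merges:
    print(a, b, d)
```

Clustering the genomes instead of the HMMs would use the same functions on the transposed matrix.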

Dataset_S1.cdt

Dataset S2 - Full alignment of Proliferating Cell Nuclear Antigen (PCNA) homologs

Untrimmed alignment of sixty-one PCNA homologs from fifty-seven archaeal and eukaryotic species constructed with MUSCLE.

journal-2.pone.0041389.s013.txt