The consolidated and reconciled annotations from all of the WGS strains used in this study
Data files
Sep 07, 2024 version files 203.64 MB
- DatasetS1.zip (203.63 MB)
- README.md (4.13 KB)
Abstract
Pantoea agglomerans is one of four Pantoea species reported in the USA to cause bacterial rot of onion bulbs. However, not all P. agglomerans strains are pathogenic to onion. We characterized onion-associated strains of P. agglomerans to elucidate the genetic and genomic signatures of onion-pathogenic P. agglomerans. We collected >300 P. agglomerans strains associated with symptomatic onion plants and bulbs from public culture collections, research laboratories, and a multi-year survey in 11 states in the USA. Combining the 87 genome assemblies with 100 high-quality, public P. agglomerans genome assemblies, we identified two well-supported P. agglomerans phylogroups. Strains causing severe symptoms on onion were only identified in Phylogroup II and encoded the HiVir pantaphos biosynthetic cluster, supporting the role of HiVir as a pathogenicity factor. The P. agglomerans HiVir cluster was encoded in two distinct plasmid contexts: 1) as an accessory gene cluster on a conserved P. agglomerans plasmid (pAggl), or 2) on a mosaic cluster of plasmids common among onion strains (pOnion). Analysis of closed genomes revealed that the pOnion plasmids harbored alt genes conferring tolerance to Allium thiosulfinate defensive chemistry, and many harbored cop genes conferring resistance to copper. We demonstrated that the pOnion plasmid pCB1C can act as a natively mobilizable pathogenicity plasmid that transforms P. agglomerans Phylogroup I strains, including environmental strains, into virulent pathogens of onion. This work indicates a central role for plasmids and plasmid ecology in mediating P. agglomerans interactions with onion plants, with potential implications for onion bacterial disease management.
README
DatasetS1.zip
This dataset contains the consolidated and reconciled annotations from all of the WGS strains of Pantoea agglomerans used in this study. In this study, Prokka was used independently by both the UGA and USDA authors to reannotate genomes for input to Roary (https://dx.doi.org/10.1093/bioinformatics/btv421). For genomes available from NCBI, both GenBank and RefSeq annotations are available. Hence, for each genome, multiple sets of gene annotations are available.
The ZIP file `DatasetS1.zip` contains 187 tab-separated values (`.tsv`) files, one for each of the 187 genomes analyzed by Roary in this study.
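As a minimal sketch of how the archive could be read, the Python below iterates over the `.tsv` members of a ZIP file and parses each one into a list of row dictionaries. The function name `read_annotation_tables` is hypothetical, and the sketch assumes that each `.tsv` file has a header row and is UTF-8 encoded, neither of which is stated explicitly in this README.

```python
import csv
import io
import zipfile


def read_annotation_tables(zip_source):
    """Yield (member_name, rows) for every .tsv member of a ZIP archive.

    zip_source may be a file path or a binary file-like object.
    Assumption (not confirmed by the README): each table has a header
    row and is UTF-8 text.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith(".tsv"):
                continue  # skip directories and any non-table members
            with archive.open(name) as handle:
                text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
                # QUOTE_NONE keeps quote characters in product names literal.
                reader = csv.DictReader(
                    text, delimiter="\t", quoting=csv.QUOTE_NONE
                )
                rows = list(reader)
            yield name, rows
```

Calling `read_annotation_tables("DatasetS1.zip")` and counting the yielded members would be one way to check that a download contains the expected number of tables.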
For each WGS strain, we have provided a table (i.e., `.tsv` file) listing which CDS annotations are equivalent in the available annotation sets. That is, each row of each table lists the equivalent annotations for each CDS. Annotations are considered equivalent if they refer to CDS features whose amino acid sequences are equal.
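The equivalence rule described above can be illustrated with a short Python sketch that groups annotations by identical amino acid sequence. This is not the authors' script: the function name `group_equivalent_cds` and the input tuple layout are assumptions made for illustration, and the sequences in the usage note are shortened placeholders.

```python
from collections import defaultdict


def group_equivalent_cds(records):
    """Group CDS annotations whose amino acid sequences are identical.

    records: iterable of (annotation_set, locus_tag, aa_seq) tuples.
    Returns a dict mapping aa_seq -> {annotation_set: locus_tag}.

    Simplification: if one annotation set contains two CDS with the
    same sequence, the later locus tag overwrites the earlier one.
    """
    groups = defaultdict(dict)
    for annotation_set, locus_tag, aa_seq in records:
        groups[aa_seq][annotation_set] = locus_tag
    return dict(groups)
```

For example, passing `("genbank_ncbi", "H0Z11_03125", seq)` and `("refseq", "H0Z11_RS03125", seq)` with the same `seq` places both locus tags in one group, which corresponds to one row of a table.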
A number of attributes are reported for each set of equivalent CDS. For instance, below is one set (i.e., row) of results reported in `AR1a.tsv`. For ease of reading, this row and the corresponding column headers have been transposed (i.e., each row of this table corresponds to a column of `AR1a.tsv`).
| column | value |
|---|---|
| interesting | |
| genbank_ncbi:locus_tag | H0Z11_03125 |
| genbank_ncbi:gene | |
| genbank_ncbi:product | aspartate aminotransferase family protein |
| prokka_plaster:locus_tag | OFPLNBHF_00389 |
| prokka_plaster:gene | argD |
| prokka_plaster:product | Acetylornithine/succinyldiaminopimelate aminotransferase |
| prokka_roary:locus_tag | DBIDKNDH_00633 |
| prokka_roary:gene | argD |
| prokka_roary:product | Acetylornithine/succinyldiaminopimelate aminotransferase |
| refseq:locus_tag | H0Z11_RS03125 |
| refseq:gene | |
| refseq:product | aspartate aminotransferase family protein |
| aa_seq | MAAEKIAVTRETFDNV[...] |
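A transposed view like the one above could be produced with a few lines of Python. The sketch below is illustrative only: `transpose_row` is a hypothetical helper, and it assumes the table text has a single header row of tab-separated column names.

```python
import csv
import io


def transpose_row(tsv_text, row_index=0):
    """Return one row of a TSV table as a list of (column, value) pairs.

    Assumption: the first line of tsv_text is the header row.
    Empty cells are returned as empty strings.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = list(reader)
    return list(rows[row_index].items())
```

Printing each pair on its own line gives one column name and its value per line, which is easier to scan than a single wide row.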
Four sets of annotations are reported for each CDS:
- `genbank_ncbi`: The annotations found in the GenBank version of the NCBI assembly.
- `refseq`: The annotations found in the RefSeq version of the NCBI assembly.
- `prokka_roary`: The Prokka annotations generated for the whole-genome Roary pangenome analysis.
- `prokka_plaster`: The Prokka annotations generated for the replicon-level, "plaster"-level Roary pangenome analysis.
The whole-genome and replicon-level Roary pangenome analyses were performed separately in two different labs. By the time it was realized that both sets of results were to be reported in this study, it was not practical to rerun both analyses using a single set of Prokka annotations.
For each set of annotations, the following values are reported:
- `locus_tag`: the unique locus tag assigned by the respective annotation pipeline.
- `gene`: the human-friendly gene name assigned by the respective annotation pipeline, if one exists.
- `product`: a description of the protein predicted by the respective annotation pipeline, if one exists.
There are two additional columns:
- `interesting`: a `*` will appear in this column if the CDS is found to be "interesting". This happens when annotations for the CDS are missing from one or more sets of annotations.
- `aa_seq`: the amino acid sequence of the CDS.
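To illustrate the "interesting" flag as described in this README, the Python sketch below reports which annotation sets are absent from a row. It assumes that a missing annotation shows up as an empty `<set>:locus_tag` cell, which is an inference from the column naming in the example row rather than something the README states. The helper names are hypothetical.

```python
# The four annotation sets named in this README.
ANNOTATION_SETS = ("genbank_ncbi", "refseq", "prokka_roary", "prokka_plaster")


def missing_annotation_sets(row):
    """Return the annotation sets that have no locus tag in this row.

    row: dict mapping column names such as "refseq:locus_tag" to values.
    Assumption: an absent annotation is an empty or missing cell.
    """
    return [
        name
        for name in ANNOTATION_SETS
        if not (row.get(f"{name}:locus_tag") or "").strip()
    ]


def is_interesting(row):
    """True when at least one annotation set is missing from the row."""
    return bool(missing_annotation_sets(row))
```

A row for which `is_interesting` returns `True` would be expected to carry a `*` in the `interesting` column under this reading.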
Methods
Eighty-one genomes assembled as part of this study were combined with 100 high-quality genome assemblies from NCBI GenBank (Table S3). Genomes were re-annotated with Prokka [@Seemann2014] for pangenome analysis with Roary [@Page2015] to identify core and accessory genes. Custom scripts were used to reconcile the CDS records of the Prokka, GenBank, and RefSeq annotations for each genome. For this process, CDS records were considered synonymous if the coordinates of their stop codons were equal. If the coordinates of the start codons of synonymous CDS records were not equal, the records were marked as "interesting".
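The stop-codon rule can be sketched in Python as follows. This is not the custom script used in the study. The `CdsRecord` type and `reconcile_by_stop` function are hypothetical, and including the contig and strand in the grouping key is an added assumption, since the text above mentions only the stop codon coordinates.

```python
from collections import defaultdict
from typing import NamedTuple


class CdsRecord(NamedTuple):
    source: str     # annotation set, e.g. "prokka", "genbank", "refseq"
    contig: str     # sequence the CDS lies on (assumed part of the key)
    strand: str     # "+" or "-" (assumed part of the key)
    start: int      # coordinate of the start codon
    stop: int       # coordinate of the stop codon
    locus_tag: str


def reconcile_by_stop(records):
    """Group CDS records that share a stop codon coordinate.

    Returns a list of (members, interesting) tuples, ordered by key.
    A group is "interesting" when its members disagree on the start
    codon coordinate.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec.contig, rec.strand, rec.stop)].append(rec)

    result = []
    for key in sorted(groups):
        members = groups[key]
        interesting = len({m.start for m in members}) > 1
        result.append((members, interesting))
    return result
```

Two records on the same contig and strand that end at the same position but begin at different positions would fall into one group flagged as interesting, while records that agree on both coordinates would form an unflagged group.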