Yersinia plasmid pYV gene phylogenies and alignments

Citation

Cohen, James; Moorman, Veronica (2021), Yersinia plasmid pYV gene phylogenies and alignments, Dryad, Dataset, https://doi.org/10.5061/dryad.905qftthh

Abstract

Pathogenic Yersinia bacteria (including Y. pseudotuberculosis, Y. enterocolitica, and Y. pestis) contain the mosaic plasmid pYV, which encodes, among other things, a number of proteinaceous virulence factors. While the evolutionary histories of many of the biovars and strains of pathogenic Yersinia species are well documented, the origins of many of the individual virulence factors have not been comprehensively examined. Here, the evolutionary origins of the genes coding for a set of Yersinia outer protein (Yop) virulence factors were investigated through phylogenetic reconstruction and subsequence analysis. It was found that many of these genes had only a few sequenced homologs, and none of the resolved phylogenies recovered the same relationships as were resolved from 16S rRNA. Many of the evolutionary relationships differ greatly among genes on the plasmid, and variation is also found across different domains of the same gene, which provides evidence of the mosaic nature of the plasmid as well as of multiple genes on the plasmid. This mosaic aspect also relates to patterns of selection, which vary among the studied domains.

Methods

The identification of homologous sequences was initiated using the KIM5 strain of Yersinia pestis. Specifically, sequences of genes located on the pYV plasmid were obtained from AF053946.1 (yopE, yopH, yopT, sycT, sycE, sycH, yopJ, yopO, and yopM); each gene was annotated except for sycT and yopT, which were found by comparing the full plasmid sequence to that of Y. pestis strain CO92, where they were annotated (NC_003131.1). Separately, the 16S rRNA gene was obtained from NZ_CP009836.1. These sequences, as well as their individual protein domain sequences (when applicable), were used as queries for tblastn from the National Center for Biotechnology Information (NCBI), a Basic Local Alignment Search Tool (BLAST) program that searches translated nucleotide databases using a protein query. Default settings were employed, except that the maximum number of sequences was increased to 20,000. Because most of the matrices included multiple individuals for each species, a separate matrix that only included one arbitrarily selected representative from each species was constructed from the full matrix. These matrices are referred to as full and strict, respectively.
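
The reduction from a full matrix to a strict matrix can be illustrated with a minimal Python sketch. It is not the script used to build this dataset. It assumes, hypothetically, that each FASTA header has the form "ACCESSION Genus species ...", so that the second and third words give the species name, and it treats "arbitrarily selected" as "first record encountered". The function names parse_fasta, species_of, and strict_matrix are introduced here for illustration only.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def species_of(header):
    """Return 'Genus species' from a header.

    Assumption: headers look like 'ACCESSION Genus species strain...'.
    Real NCBI headers vary and may need a different rule.
    """
    fields = header.split()
    return " ".join(fields[1:3])


def strict_matrix(records):
    """Keep one representative per species (here: the first one seen)."""
    chosen = {}
    for header, seq in records:
        chosen.setdefault(species_of(header), (header, seq))
    return list(chosen.values())
```

Applied to an aligned full matrix, strict_matrix(parse_fasta(text)) would return one aligned sequence per species in order of first appearance; any other arbitrary choice of representative would fit the description above equally well.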

The 23 DNA regions, inclusive of both full genes and domain sections, of the full and strict sets of species were aligned using TranslatorX, with MAFFT as the alignment program and guessing the most likely reading frame to ensure a codon-based alignment for the gene regions. For 16S, a non-coding region, MAFFT alone was used for the alignment. Phylogenetic analyses were conducted for all alignments using RAxML on the Kettering University High-Performance Computing cluster (KUHPC) or the CIPRES web server (www.phylo.org). For these analyses, a GTR + G + I model (CAT for 16S due to the large number of species) was employed to resolve the best-scoring Maximum Likelihood (ML) tree and to conduct 1,000 or 10,000 rapid bootstrap analyses (100 for 16S due to its large size), with the number of bootstrap analyses depending on the size of the dataset. The resulting trees were examined and compared. Phylogenetic analyses were not conducted for the strict sets of the YopH linker, YopE N-terminal, and YopM C-terminal, as these datasets included only one, two, and three species, respectively.
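
For readers who wish to rerun a comparable analysis, the following Python sketch assembles a RAxML command line for a combined best-tree search and rapid bootstrap run. It is an assumption about how such a run could be launched, not a record of the exact commands used for this dataset: the binary name raxmlHPC, the random seed, and the file and run names are hypothetical placeholders, and the model strings GTRGAMMAI and GTRCAT are the RAxML names that appear to correspond to the GTR + G + I and CAT models described above.

```python
def raxml_command(alignment, run_name, bootstraps=1000,
                  model="GTRGAMMAI", seed=12345, binary="raxmlHPC"):
    """Build an argument list for a RAxML rapid bootstrap + ML search.

    All default values are illustrative assumptions, not the settings
    recorded for this dataset.
    """
    return [
        binary,
        "-f", "a",              # rapid bootstrap and best-scoring ML tree in one run
        "-m", model,            # substitution model
        "-p", str(seed),        # seed for parsimony starting trees
        "-x", str(seed),        # seed for rapid bootstrapping
        "-#", str(bootstraps),  # number of bootstrap replicates
        "-s", alignment,        # input alignment file
        "-n", run_name,         # suffix for output file names
    ]


# Hypothetical examples: a gene alignment and the larger 16S alignment.
gene_cmd = raxml_command("yopE_full.phy", "yopE_full", bootstraps=1000)
rrna_cmd = raxml_command("16S_full.phy", "16S_full", bootstraps=100, model="GTRCAT")
```

Either list could be passed to subprocess.run on a machine where RAxML is installed; the CIPRES web server exposes equivalent options through its own interface.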

Usage Notes

The files include the input and output from RAxML analyses. The input is aligned DNA sequence data downloaded from NCBI, and the output is the results of the RAxML analyses, including the best ML tree, the bipartition tree, bootstrap trees, and output printed to the screen. The name of each gene and domain is included in the name of the folder. Because most of the matrices included multiple individuals for each species, a separate matrix that only included one arbitrarily selected representative from each species was constructed from the full matrix. These matrices are referred to as full and strict, respectively.

Funding

National Science Foundation