Data for: Plasmids do not consistently stabilize cooperation across bacteria, but may promote broad pathogen host-range
Dewar, Anna (2021), Data for: Plasmids do not consistently stabilize cooperation across bacteria, but may promote broad pathogen host-range, Dryad, Dataset, https://doi.org/10.5061/dryad.gxd2547n4
Horizontal gene transfer via plasmids could favour cooperation in bacteria, because transfer of a cooperative gene turns non-cooperative cheats into cooperators. This hypothesis has received support from theoretical, genomic and experimental analyses. In contrast, we show here, with a comparative analysis across 51 diverse species, that genes for extracellular proteins, which are likely to act as cooperative ‘public goods’, were not more likely to be carried on either: (i) plasmids compared to chromosomes; or (ii) plasmids that transfer at higher rates. Our results were supported by theoretical modelling which showed that while horizontal gene transfer can help cooperative genes initially invade a population, it has less influence on the longer-term maintenance of cooperation. Instead, we found that genes for extracellular proteins were more likely to be on plasmids when they coded for pathogenic virulence traits, in pathogenic bacteria with a broad host-range.
The first dataset 'genome_data' contains information on the 1632 genomes analysed in our study. Each row in the data file corresponds to a replicon, which is labelled as either 'chromosome' or 'plasmid' in the 'replicon_type' column. Each replicon has its own 'accession_number'. Each genome has one chromosome (except for Vibrio parahaemolyticus, which has two) and at least one plasmid. All replicons from a genome have the same values in the columns 'species_name' and 'strain_name', which can be used as grouping variables in R. For each replicon, we used PSORTb v3.0 to predict the subcellular localisation of every protein. The number of proteins in each of the possible localisations (which depend on the Gram stain) is listed in the corresponding columns, e.g. extracellular, cytoplasmic, periplasmic. Other data analysed in the paper are included in this dataset, including the pathogenicity and host-range category of the species, MOBsuite mobility predictions for each plasmid, and pangenome core and accessory percentages from PanX. Also included are data on the number of environments in which each species' 16S rRNA was sequenced, with a column for each of the five habitats and a column for the total number. A '1' denotes that it was sequenced in that environment; a '0' denotes that it was not. The habitat columns occur twice: once for the original data from Garcia-Garcera & Rocha (2020), labelled with 'gg' at the beginning, and once for our categorisation of species based on the literature.
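The per-genome grouping described above can be sketched as follows. This is a minimal illustration in Python (standard library only), not the analysis code used in the study. It assumes the data are available as comma-separated text with a header row; the column names 'species_name', 'strain_name', 'accession_number', 'replicon_type' and 'extracellular' follow the description above, while the sample rows and the function name extracellular_per_genome are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented toy rows for illustration only; these are NOT values from the dataset.
# The comma-separated layout is an assumption about the file format.
SAMPLE = """species_name,strain_name,accession_number,replicon_type,extracellular
Species A,strain1,ACC0001,chromosome,40
Species A,strain1,ACC0002,plasmid,3
Species A,strain1,ACC0003,plasmid,2
Species B,strain2,ACC0004,chromosome,25
Species B,strain2,ACC0005,plasmid,1
"""


def extracellular_per_genome(rows):
    """Sum extracellular protein counts per genome, split by replicon type.

    A genome is identified by its (species_name, strain_name) pair, since all
    replicons from one genome share those two values.
    """
    totals = defaultdict(lambda: {"chromosome": 0, "plasmid": 0})
    for row in rows:
        genome = (row["species_name"], row["strain_name"])
        totals[genome][row["replicon_type"]] += int(row["extracellular"])
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {genome: dict(counts) for genome, counts in totals.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
genome_totals = extracellular_per_genome(rows)
for genome, counts in sorted(genome_totals.items()):
    print(genome, counts)
```

To run this against the real file, replace io.StringIO(SAMPLE) with an opened file handle, after checking the actual delimiter and column names.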
The second dataset 'MP3_protein_predictions' contains MP3 pathogenicity predictions for every extracellular protein in all broad and narrow host-range genomes (one protein per row). The 'accession_number', 'strain_name' and 'species_name' columns indicate which replicon, genome and species each protein is from. The hybrid prediction we used is in the 'Hybrid.Prediction' column. Using these columns as grouping variables in R, the number of 'pathogenic' and 'non-pathogenic' proteins can be tallied for each replicon and then combined with the data in 'genome_data' for analysis.
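The tally-and-combine step described above might look like the following minimal Python sketch (standard library only). It is an illustration under stated assumptions, not the authors' code: the data are assumed to be comma-separated with a header row, the prediction labels are assumed to be spelled exactly 'pathogenic' and 'non-pathogenic', and the sample rows and the function names tally_predictions and join_with_genome_data are invented.

```python
import csv
import io
from collections import Counter

# Invented toy rows for illustration only; these are NOT values from the dataset.
# The label spellings in 'Hybrid.Prediction' are an assumption.
PREDICTIONS = """species_name,strain_name,accession_number,Hybrid.Prediction
Species A,strain1,ACC0001,pathogenic
Species A,strain1,ACC0001,non-pathogenic
Species A,strain1,ACC0002,pathogenic
Species A,strain1,ACC0002,pathogenic
"""

GENOME_DATA = """accession_number,replicon_type
ACC0001,chromosome
ACC0002,plasmid
ACC0003,plasmid
"""


def tally_predictions(protein_rows):
    """Count prediction labels per replicon (one input row per protein)."""
    tallies = {}
    for row in protein_rows:
        counter = tallies.setdefault(row["accession_number"], Counter())
        counter[row["Hybrid.Prediction"]] += 1
    return tallies


def join_with_genome_data(replicon_rows, tallies):
    """Attach the per-replicon tallies to each genome_data row.

    Replicons with no rows in the predictions table get counts of zero.
    """
    merged = []
    for row in replicon_rows:
        counts = tallies.get(row["accession_number"], Counter())
        merged.append({
            **row,
            "pathogenic": counts["pathogenic"],
            "non_pathogenic": counts["non-pathogenic"],
        })
    return merged


proteins = list(csv.DictReader(io.StringIO(PREDICTIONS)))
replicons = list(csv.DictReader(io.StringIO(GENOME_DATA)))
merged = join_with_genome_data(replicons, tally_predictions(proteins))
for row in merged:
    print(row)
```

Joining on 'accession_number' keeps one row per replicon, matching the layout of 'genome_data'. Whether a replicon missing from the predictions table should count as zero or be excluded depends on the analysis; zero is used here only for simplicity.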