Data from: Genes for cooperation are not more likely to be carried by plasmids
Data files
Feb 13, 2024 version files 471.63 KB
- README.md
- SOC_plas_species_tree.nex
- SOC_summary_data_wide_146.csv
- species_146.csv
- species_gtdb.tree
Abstract
Cooperation is prevalent across bacteria, but risks being exploited by non-cooperative cheats. Horizontal gene transfer, particularly via plasmids, has been suggested as a mechanism to stabilize cooperation. A key prediction of this hypothesis is that genes that are more likely to be transferred, such as those on plasmids, should be more likely to code for cooperative traits. Testing this prediction requires identifying all genes for cooperation in bacterial genomes. However, previous studies used a method that likely misses some of these genes for cooperation. To solve this, we used a new genomics tool, SOCfinder, which uses three distinct modules to identify all kinds of genes for cooperation. We compared where these genes were located across 4648 genomes from 146 bacterial species. In contrast to the prediction of the hypothesis, we found no evidence that plasmid genes are more likely to code for cooperative traits. Instead, we found the opposite - that genes for cooperation were more likely to be carried on chromosomes. Overall, the vast majority of genes for cooperation are not located on plasmids, suggesting that the more general mechanism of kin selection is sufficient to explain the prevalence of cooperation across bacteria.
README: Genes for cooperation are not more likely to be carried by plasmids
Authors: Anna Dewar, Laurence Belcher, Thomas Scott, Stuart West
Affiliation: Department of Biology, University of Oxford, United Kingdom
Overview
This repository contains code and data for the research article 'Genes for cooperation are not more likely to be carried by plasmids'. For any queries please contact Anna Dewar at anna.dewar@biology.ox.ac.uk.
Our dataset included a total of 4648 genomes across 146 bacterial species. We ran SOCfinder (version 1.0.1) (https://github.com/lauriebelch/SOCfinder) with default parameters on all our genomes. We then matched the list of genes for cooperation found by SOCfinder to each genome’s chromosome(s) or plasmid(s). We did this for the consensus list which combines results of all three modules, and also for each of the modules separately.
We used two phylogenies to control for any phylogenetic non-independence between species in our analyses. First, we generated a supertree phylogeny of the 146 species in our dataset. Second, we used the GTDB bacterial reference tree as an alternative phylogeny.
This dataset contains the gene counts produced by SOCfinder for every genome in our dataset and the phylogenies. These can be used together with the code in the 'Code_S1.Rmd' or 'Code_S1.R' files to run all our analyses.
Supplementary Material 1
S1 contains full details of all results. The document was compiled from an Rmarkdown file. The original .Rmd file, including code for all models and results in the S1 document, is available in this repository as: 'Code_S1.Rmd'. The code is also available as a regular R file: 'Code_S1.R'.
Data & analyses
The .Rmd file requires the data file 'SOC_summary_data_wide_146.csv', which contains all data used in our analyses, and the phylogenies 'SOC_plas_species_tree.nex' and 'species_gtdb.tree'. For ease of use we recommend saving all files, including the data, trees and .Rmd file, into a single folder alongside an RStudio project.
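As a quick check before compiling, the folder layout described above can be verified with a few lines of code. This is a minimal illustrative sketch in Python (the analysis code itself is written in R); the function name `missing_files` is hypothetical, and the file names are taken from this README.

```python
from pathlib import Path

# File names as listed in this README.
REQUIRED_FILES = (
    "Code_S1.Rmd",
    "SOC_summary_data_wide_146.csv",
    "SOC_plas_species_tree.nex",
    "species_gtdb.tree",
)


def missing_files(folder):
    """Return the required files that are not present in `folder`."""
    root = Path(folder)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Calling `missing_files(".")` from inside the project folder should return an empty list once everything has been downloaded.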
The data file 'SOC_summary_data_wide_146.csv' contains one row for each of the 4648 genomes in our dataset. The first column is that genome's RefSeq accession number and the second column is the species that genome has been assigned to in the RefSeq database. There are then columns corresponding to the number of genes identified by each of the three SOCfinder modules for that genome's plasmid(s) and chromosome(s), and also the consensus total number of genes across the three modules. The name of each column contains three words separated by underscores. The first word corresponds to the SOCfinder module used ('antismash', 'psortb', 'kofam', and 'social' for secondary metabolite, extracellular protein, functional annotation, and consensus, respectively). The second word is either 'plasmid' or 'chromosome', corresponding to the replicon type each column refers to. The third word is either 'TRUE' or 'FALSE', corresponding to genes for cooperative and non-cooperative traits, respectively (as defined by the module or consensus).
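The column naming scheme described above can be decoded programmatically. The following is a minimal illustrative sketch in Python, not part of the published R code; `CountColumn` and `parse_count_column` are hypothetical names, and the example column name 'social_plasmid_TRUE' is constructed from the stated convention rather than copied from the file.

```python
from typing import NamedTuple

# Module prefixes and their meanings, as described in this README.
MODULES = {
    "antismash": "secondary metabolite",
    "psortb": "extracellular protein",
    "kofam": "functional annotation",
    "social": "consensus",
}
REPLICONS = ("plasmid", "chromosome")


class CountColumn(NamedTuple):
    module: str        # SOCfinder module, or 'social' for the consensus
    replicon: str      # 'plasmid' or 'chromosome'
    cooperative: bool  # True for 'TRUE' columns, False for 'FALSE' columns


def parse_count_column(name):
    """Split a gene-count column name into module, replicon and flag."""
    parts = name.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected three underscore-separated words: {name!r}")
    module, replicon, flag = parts
    if module not in MODULES:
        raise ValueError(f"unknown module: {module!r}")
    if replicon not in REPLICONS:
        raise ValueError(f"unknown replicon type: {replicon!r}")
    if flag not in ("TRUE", "FALSE"):
        raise ValueError(f"unknown flag: {flag!r}")
    return CountColumn(module, replicon, flag == "TRUE")
```

For example, `parse_count_column("social_plasmid_TRUE")` yields the consensus count of genes for cooperation on plasmids. The accession and species columns do not follow this pattern and would raise a `ValueError`.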
Compiling 'Code_S1.Rmd' as a pdf
If you wish to locally compile the file 'Code_S1.Rmd' into a PDF identical to the Supplementary Material 1 document published with the manuscript, please also download the file 'figure_order_header.tex' and place it in the same folder as the .Rmd file.
Phylogenies
Supplementary Material 2: supertree phylogeny
To build our phylogeny, we used a recently published maximum likelihood tree generated with 16S ribosomal protein data as the basis (Hug et al. (2016), 'A new view of the tree of life', Nat. Microbiol.). We used the R package ‘ape’ to identify all branches that matched either a species or a genus in our dataset. In cases where we had multiple species within a single genus, we used the R package ‘phytools’ to add these species as additional branches in the tree. We used published phylogenies from the literature to add any within-genus clustering of species’ branches (details and references of these phylogenies are available in Supplementary Material 2 and 'species_tree_refs.xlsx', both published as electronic supplementary information alongside the manuscript).
The code for how we did this is available to download as 'tree_script.R'. It requires the files 'Original_tree.txt' and 'species_146.csv'. The 'Original_tree.txt' file is not our own but can be downloaded from Hug et al. (2016), 'A new view of the tree of life', Nat. Microbiol. (called 'Supplementary Dataset 2' in that paper; please edit the script to reflect whatever your version of that file is called). Each line of the script edits the tree in turn to produce the final tree, so the lines of code should be run in the order in which they appear in the script.
GTDB tree
We also used the GTDB bacterial reference tree as an alternative phylogeny (version 214.1). The code we used to make a subset of the tree with only species clusters corresponding to our genomes is in: 'GTDB_tree_script.R'. It requires: 'SOC_summary_data_wide_146.csv', 'bac120_r214.tree' and 'bac120_metadata_r214.tsv' [note: the .tree and .tsv files are not our own but are published by the GTDB team, and are available to download from GTDB at: https://data.gtdb.ecogenomic.org/releases/release214/214.1/].
Scripts to use SOCfinder on large genomic datasets
Finally, we have included an HTML guide with all of the code we used to conduct our analysis with SOCfinder, including selecting and downloading genomes, and extracting data. We hope this will be a helpful example of how SOCfinder could be used in future studies. This is available as: 'plas_chr_SOCfinder_analysis.html'.
Methods
We included species in our analysis if they had at least 10 complete genomes in the RefSeq database and at least 10 of those genomes had at least one plasmid sequence in their assembly. For all species meeting these criteria, we then downloaded all genomes available in RefSeq up to a maximum of 100. For species with more than 100 genomes available, we randomly selected 100 genomes for further analysis. We then downloaded the RefSeq genomes using the NCBI Datasets conda package (version 15.5.0) (https://www.ncbi.nlm.nih.gov/datasets). Overall, our dataset included a total of 4648 genomes across 146 bacterial species.
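The inclusion criteria and subsampling step can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the code we ran (see 'plas_chr_SOCfinder_analysis.html' for that): the function name, the input format and the seed are hypothetical, and the sketch assumes the random sample is drawn from all complete genomes of a qualifying species.

```python
import random
from collections import defaultdict

MIN_GENOMES = 10        # minimum complete genomes per species
MIN_WITH_PLASMID = 10   # minimum genomes with at least one plasmid sequence
MAX_PER_SPECIES = 100   # cap on genomes kept per species


def select_genomes(records, seed=0):
    """Apply the inclusion criteria to (species, accession, has_plasmid) records.

    `records` is assumed to describe complete genomes only. Returns a dict
    mapping each qualifying species to its selected accessions.
    """
    by_species = defaultdict(list)
    for species, accession, has_plasmid in records:
        by_species[species].append((accession, has_plasmid))

    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    selected = {}
    for species, genomes in by_species.items():
        n_with_plasmid = sum(1 for _, has_plasmid in genomes if has_plasmid)
        if len(genomes) < MIN_GENOMES or n_with_plasmid < MIN_WITH_PLASMID:
            continue
        accessions = [accession for accession, _ in genomes]
        if len(accessions) > MAX_PER_SPECIES:
            accessions = rng.sample(accessions, MAX_PER_SPECIES)
        selected[species] = accessions
    return selected
```

Species failing either threshold are dropped entirely, and only species with more than 100 genomes are subsampled.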
We ran SOCfinder (version 1.0.1) (https://github.com/lauriebelch/SOCfinder) with default parameters on all our genomes. We then matched the list of genes for cooperation found by SOCfinder to each genome’s chromosome(s) or plasmid(s). We did this for the consensus list, which combines the results of all three modules, and also for each of the modules separately. This allowed us to test whether considering different kinds of genes for cooperation influenced our results.
For each genome we calculated the proportion of genes coding for cooperative traits on both their chromosome(s) and their plasmid(s). We then analysed whether these proportions were significantly different.
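The per-genome proportion can be sketched from the count columns in 'SOC_summary_data_wide_146.csv'. This is an illustrative Python sketch, not the R code in 'Code_S1.Rmd'. It assumes the proportion is the 'TRUE' count divided by the sum of the 'TRUE' and 'FALSE' counts for a given module and replicon type, and that the counts are stored as integers. The function name and the example values are hypothetical.

```python
def cooperative_proportion(row, module, replicon):
    """Proportion of genes on a replicon type that code for cooperative traits.

    `row` maps column names to gene counts for one genome. Returns None when
    the replicon type carries no counted genes, so the proportion is undefined.
    """
    cooperative = int(row[f"{module}_{replicon}_TRUE"])
    non_cooperative = int(row[f"{module}_{replicon}_FALSE"])
    total = cooperative + non_cooperative
    if total == 0:
        return None
    return cooperative / total


# Hypothetical counts for a single genome (not taken from the dataset).
example_row = {
    "social_plasmid_TRUE": 3,
    "social_plasmid_FALSE": 97,
    "social_chromosome_TRUE": 200,
    "social_chromosome_FALSE": 3800,
}
plasmid_prop = cooperative_proportion(example_row, "social", "plasmid")
chromosome_prop = cooperative_proportion(example_row, "social", "chromosome")
```

The two proportions for each genome form the paired values that are then compared statistically. The statistical models themselves are documented in Supplementary Material 1.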