Skip to main content

Data from: Comparative genomics reveals high rates of horizontal transfer and strong purifying selection on rhizobial symbiosis genes

Cite this dataset

Epstein, Brendan; Tiffin, Peter (2020). Data from: Comparative genomics reveals high rates of horizontal transfer and strong purifying selection on rhizobial symbiosis genes [Dataset]. Dryad.


Horizontal transfer (HT) alters the repertoire of symbiosis genes in rhizobial genomes and may play an important role in the on-going evolution of the rhizobia-legume symbiosis. To gain insight into the extent of HT of symbiosis genes with different functional roles (nodulation, N-fixation, host benefit, and symbiont fitness), we conducted comparative genomic and selection analyses of the full genome sequences from 27 rhizobial genomes. We find that symbiosis genes experience high rates of HT among rhizobial lineages but also bear signatures of purifying selection (low Ka:Ks). HT and purifying selection appear to be particularly strong in genes involved in initiating the symbiosis (e.g. nodulation) and in genome-wide association candidates for mediating variation in benefits provided to the host. These patterns are consistent with rhizobia adapting to the host environment through the loss and gain of symbiosis genes, but not with host-imposed positive selection driving divergence of symbiosis genes through recurring bouts of positive selection.


Data is from 27 complete bacterial genome assemblies--all symbiotic rhizobia--downloaded from the NCBI repository. Orthologous genes were identified and a variety of comparative, phylogenetic, and evolutionary analyses were conducted.

Usage notes

(All of this information is repeated in the README file.)

Gene data

If you are looking for the statistics for each gene or pair of sequences
they can be found in these two files:
- 6-comparison/pairwise_divergence.tsv
- Table S5 in the ms has key gene summary statistics. This is also repeated
  in 6-comparison/gene_divergence_summary.tsv.
- Both of these files are also included in the top level of the data submission
  for your convenience--no need to unzip any directories.

Workflow, code, and various output files:

1. Genome assemblies downloaded from public databases
   Lists of strains and accession numbers are provided in the 1-genomes
   - Accession numbers for the 72 strains from NCBI are found in
   - USDA4894 and USDA1106 were also downloaded from the MaGe Microscope
   - The 27 alpha-Proteobacteria with complete genome assemblies that
     were chosen for evolutionary analyses (after running OrthoFinder)
     are listed in 27_chosen_strains.txt.
   - The full list of strains downloaded is in all_strains.txt.

2. Ortholog identification by using OrthoFinder
   The code and output for the OrthoFinder run is provided in the
   2-orthologs directory.
   - script is Note that it was set up to run
     on the Univ. of MN Supercomputing Institute servers, so it will
     have to be modified if run in a different environment. Also, file
     paths will have to be adjusted. This is true of all the pipeline
     scripts in this repository.
   - all auxiliary python scripts have also been included.
   - I (Brendan) modified one of the scripts included with OrthoFinder
     so that it would print out the rooted gene trees.
   - Output files:
     - reconciled orthologs: reconciled_genes.tsv, this file is where
       the genes for the rest of the analysis are defined. It was
       created by, which
       used the rooted gene trees to identify phylogenetic groupings of
       gene sequences. "subset" column is the gene name.
     - annotation: annotation.tsv. The "description" column was an
       attempt at combining the gene products from all strains into a
       summary; sometimes this was successful. The "gene" and
       "descriptions" columns have comma-separated lists of all the
       unique gene names and descriptions for the gene (from all
     - pseudogenes: pseudogenes.txt
     - rooted gene trees from OrthoFinder:
       rooted_gene_tress_from_orthofinder.tsv, one newick formatted tree
       per line, first column is the "orthogroup". This output file is
       from the modfied OrthoFinder script.
     - IMPORTANT NOTE: The output files from this step still include
       pseudogenes. These have to be removed during subsequent analysis
   - Symbiosis genes are in symbiosis_genes
     - Note: The rhizobial fitness phenotype is the same one presented in
       Burghardt et al. (2017, PNAS). However, the candidate gene list
       used here is slightly different--for consistency, the analysis was
       re-run using the same methods as in Epstein et al. (2018, mSphere).
       The difference is testing one variant per LD group, instead of all

3. Alignment of sequences by using muscle
   Code for this step is found in 3-alignment.
   - is the shell script that ran the protein
     alignment and produced nucleotide alignments from the protein
   - helper python scripts also included
   - alignments were produced for entire orthogroups (the MCL output
     from OrthoFinder) and for "orthosets" (the reconciled gene
   - Note: codon alignments of just sequences from one "orthoset" were
     based on the protein alignment of all sequences in the orthogroup,
     not on the orthoset-only protein alignment--more for convenience
     than a different analysis.

4. Construction of a single-copy-core phylogenetic tree by using IQ-tree
   Code and output for this step is found in 4-species_tree.
   - code to construct the tree
   - output tree from IQtree: sc.treefile
   - ML estimates of pairwise distances: sc.mldist
   - output tree rooted by using figtree: rooted.nw
   - IQ-tree log file: sc.iqtree
   - NOTE: For some reason, the list of single copy core genes here does
     not include 9 genes that are single copy core according to the data
     I used for the symbiosis gene comparison. This is because the
     IQ-tree list of single-copy core genes was obtained by looking for
     scc genes, then removing any with pseudogenes. By contrast, the
     gene comparison data have the pseudogenes removed first, then
     searched for scc--nine of the scc genes have extra pseudogene
     copies. I don't think this is an important issue.

5. Calculation of statistics for genes

5a. Calculation of Ka/Ks for all pairs of sequences by using gestimator
    Code for this step is found in 5-divergence/5a-kaks.
    - Shell script to run the analysis
    - output, cleaned of pseudogenes, is found in the directory for step

5b. Calculation of pairwise protein distances by using FastME
    Code for this step is found in 5-divergence/5b-protein_dist
    - Shell script to run the analysis.
    - output is found in 6-comparison

5c. Calculation of phylogenetic signal statistics by using R packages
    Code for this step is found in 5-divergence/5c-phylogenetic_signal
    - Shell script to run the analysis.
    - A helper Python script and the R scripts that contain the
      phylogenetic analysis are also included.
    - output is summarized in 6-comparison

5d. Estimates of rates of dup., loss, & transfer by using GeneRax
    - Shell script to run the analysis
    - helper scripts included
    - output is summarized in 6-comparison

6. Comparison of symbiosis genes to rest of the genome
   Code and output for this step is in 6-comparison.
   Files with the statistics for each gene or pair of sequences:
   - pairwise_divergence.tsv: Pairwise sequence statistics, with
     pseudogenes removed.
   - gene_divergence_summary.tsv: Averages, etc. of pairwise statistics
     and other statistics for each gene.  Code:
   - This script compiles the pairwise
     divergence and Ka/Ks data into a single table, while removing
     pseudogenes. It also contains code for calculating medians and
     means, but this was eventually moved to a different script (see
   - make_orthoset_divergence_table.r: The R code that actually does the
     work of tabulating pairwise divergence data. Is not fast.
   - code to calculate summary statistics
     for each gene, create lists of randomly chosen background genes,
     and compare the background to the symbiosis genes
   - helper scripts included Output files:
   - randomly_chosen_gene_sets: genes chosen as random background. Each
     line is one set. "all": randomly chosen from all genes;
     "matched_strain_count": randomly chosen from genes with the same
     number of genomes as the symbiosis genes; "matched_count_and_copy":
     randomly chosen from genes that were similar in the number of
     genomes and copies to the symbiosis genes;
     "phenotype_permutations": GWAS hits from analyses run on permuted
     phenotype values (see Epstein et al. 2018).
   - comparison_results: all the output from comparing symbiosis genes
     to background genes. The file naming convention is
     table.<background set>.<statistic>.tsv. The file format is one row
     per gene set, with the values separated by tabs. The first row is
     the values for the symbiosis genes, while each remaining row is one
     random sample.

7. Tables, figures, and summaries
   Code for making the figures and tables and for doing the major
   statistical tests is in 7-figures_and_tables.


National Science Foundation, Award: IOS-1724993

National Science Foundation, Award: 1856744