Genetic variation in host selectivity and adaptive strain enrichment in legume-rhizobia symbiosis
Data files
Jan 27, 2026 version files 12.52 GB
- A17_R108_SNPsegregation.R (661 B)
- df_summary.csv (924.50 KB)
- epstein_et_al_2022_S9.csv (712.76 KB)
- freq_C68.txt (35.20 KB)
- HapMap_SNPs.csv (10.65 GB)
- host_selectivity_gwas.sh (8.33 KB)
- manhattan_plot.R (1.51 KB)
- medtr_ulmm_host_selectivity.assoc.txt (1.39 GB)
- medtr_ulmm_host_selectivity.csv (480.57 MB)
- medtr_ulmm_host_selectivity.log.txt (1.30 KB)
- NodSize_A17_R108.R (22.30 KB)
- NoduleMorphology_HapMap_ElasticNet.R (28.87 KB)
- NodulePools_SizeData.txt (1.47 KB)
- NoduleSize_HarvestData.txt (1.32 KB)
- NoduleSizeClassInfo.txt (8.13 KB)
- README.md (13.38 KB)
- S_RGWA_1.CSV (10.42 KB)
- sig_LD_geneinfo_pval_beta.tsv (1.72 MB)
- sig_mdtr.csv (31.65 KB)
- sig_mdtr.ld (31.65 KB)
- SingleNoduleColonies.txt (1.54 KB)
- SingleStrain_phenotype_summary.tsv (32.08 KB)
- TableS1_CorrHS.csv (1.94 KB)
- TableS1_CoV.csv (789 B)
Abstract
Mutualism breakdown can be prevented if partner species preferentially select and reward partners that provide greater benefit. We examined these two components using the legume Medicago truncatula and its nitrogen-fixing symbiont Sinorhizobium meliloti. The first dataset reanalyzes strain composition data from the nodules of 202 accessions and shows significant genetic variation in the capacity of Medicago to restrict strain diversity when a simulation is used to control for variation in nodule number. By performing an augmented image analysis of nodule pool images to produce scores of morphological parameters, we found that hosts with a suite of nodule traits, including shorter nodules, tended to be more selective. Using previously published SNP data for these ~200 accessions, we performed a genome-wide association study on host selectivity to identify candidate genes. The second dataset uses two well-studied Medicago genotypes (A17 and R108) with contrasting nodule morphologies to assess the effectiveness of adaptive enrichment mechanisms by sampling the relative frequencies of rhizobial strains in pools of small nodules, which we show have smaller rhizobial population sizes than large nodules. We pair these results with previous single-strain assessments of strain benefits to hosts to show that hosts enriched beneficial strains in large nodules, but the host that formed larger and more variably sized nodules, and thus had greater 'potential' to increase rhizobial populations, was less effective. Together, this package includes the data, code, and figures to reproduce our analysis.
This Dryad Repository supports and enables the reuse of Burghardt et al. 2026 data and associated analysis outputs, including raw phenotypic data, software, analysis code, and GWAS and statistical outputs.
This repository includes three major parts: 1) data and analysis of Medicago host selectivity and nodule morphology for 202 accessions of Medicago inoculated with a mixture of 86 Sinorhizobium meliloti strains, with code to assess how host selectivity is correlated with nodule morphology traits using LASSO regression and Boruta; 2) results from a genome-wide association analysis of the host selectivity trait; and 3) data and analysis of Medicago adaptive enrichment of bacterial symbionts. The adaptive enrichment analysis includes measurements of symbiotic traits (nodule length, size, branching, colony-forming units, nodule number, and strain relative frequencies in large and small nodule pools) on R108 and A17 hosts inoculated with a mixture of 68 rhizobial strains. We analyzed whether each host can enrich for beneficial rhizobial strains by comparing prior single-strain plant biomass measurements with strain frequencies in nodules.
Host Selectivity Datasets and Code
NoduleMorphology_HapMap_ElasticNet.R - Contains R code to conduct the Medicago HapMap panel rhizobial strain diversity analysis with host selectivity simulations, the analysis of nodule traits, and the analysis of relationships between nodule traits and host selectivity. This code generates Figures 2, S1, S3, and S4 as well as the information in Tables S1 (TableS1_CorrHS.csv + TableS1_CoV.csv) and S3 from the manuscript. Takes as inputs epstein_et_al_2022_S9.csv and df_summary.csv.
epstein_et_al_2022_S9.csv - Contains previously published data describing the frequencies of 88 strains within initial inocula and nodule pools of the HapMap panel (HapMap Select & Resequence Experiment (Epstein et al., 2022)). Columns include:
- pot_or_sample - pot reference ID from the greenhouse experiment in Epstein et al. 2022; corresponds to 'pot' in the nodule morphology dataset (df_summary.csv)
- host_genotype - Medicago truncatula HapMap ID; more information is available at MedicagoHapMap.org
- MAG746A… MAG97 - relative frequency of each of the Sinorhizobium strains from Riley et al. 2022 estimated in the nodules from each pot
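The strain frequency columns can be turned into a diversity score per pot and compared with what random sampling from the inoculum would give for the same number of nodules. The sketch below is only an illustration of that idea and is not the simulation implemented in NoduleMorphology_HapMap_ElasticNet.R. The choice of the Shannon index, the function names, and the four-strain frequencies are all assumptions made for the example.

```python
import math
import random

def shannon_diversity(freqs):
    """Shannon index H = -sum(p * ln p) over strains with non-zero abundance."""
    total = sum(freqs)
    return -sum((f / total) * math.log(f / total) for f in freqs if f > 0)

def simulated_null_diversity(inoculum_freqs, n_nodules, n_sims=200, seed=1):
    """Mean diversity if n_nodules were each founded at random from the inoculum."""
    rng = random.Random(seed)
    strains = range(len(inoculum_freqs))
    results = []
    for _ in range(n_sims):
        draws = rng.choices(strains, weights=inoculum_freqs, k=n_nodules)
        counts = [draws.count(s) for s in strains]
        results.append(shannon_diversity(counts))
    return sum(results) / n_sims

# Hypothetical frequencies for four strains (not taken from the dataset).
inoculum = [0.25, 0.25, 0.25, 0.25]
nodule_pool = [0.70, 0.20, 0.10, 0.00]

observed = shannon_diversity(nodule_pool)
expected = simulated_null_diversity(inoculum, n_nodules=50)
print(round(observed, 3), round(expected, 3))
```

An observed diversity well below the nodule-number-matched expectation would indicate a more selective host under these assumptions.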
S_RGWA_1.CSV - Contains experimental design data from the 2020 HapMap Select & Resequence experiment described in Epstein et al. 2023.
- pot - matches numbers in epstein_et_al_2022_S9.csv
- geno - Medicago truncatula HapMap ID; more information is available at MedicagoHapMap.org
- block - Experimental replicate block
df_summary.csv - Summary dataset compiled from the nodule morphology measures output by the ImageJ analysis, a subset of which are used in downstream analysis.
- pot - matches numbers in epstein_et_al_2022_S9.csv
- prop.harvest - proportion of nodules harvested and imaged from the pot
- Columns 3:137 - Summary statistics of nodule morphology traits measured in ImageJ, including mean, 95.quant (95th percentile), 5.quant (5th percentile), median, var, min, max, range, and scaled_range (range / mean). Also included are nodule_raw_count, ttl.nodule.area, nodule_calc_count, and pred.prop.lobed (predicted proportion of lobed nodules).
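As a rough illustration of how the per-pot summary statistics listed above relate to raw measurements, the following sketch computes them for one trait. It is not the code used to build df_summary.csv. The `summarize` function and the example lengths are hypothetical, and the percentile interpolation method is an assumption that may differ from the one used by the authors.

```python
import statistics

def summarize(values):
    """Summary statistics for one nodule trait measured within one pot."""
    ordered = sorted(values)
    # With n=20, quantiles() returns 19 cut points; the first and last are
    # the 5th and 95th percentiles (linear interpolation between data points).
    cuts = statistics.quantiles(ordered, n=20, method="inclusive")
    mean = statistics.fmean(ordered)
    value_range = ordered[-1] - ordered[0]
    return {
        "mean": mean,
        "median": statistics.median(ordered),
        "var": statistics.variance(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "range": value_range,
        "scaled_range": value_range / mean,  # range / mean, as defined above
        "5.quant": cuts[0],
        "95.quant": cuts[-1],
    }

# Hypothetical nodule lengths in millimeters for a single pot.
nodule_lengths_mm = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = summarize(nodule_lengths_mm)
print(summary)
```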
TableS1_CoV.csv
- trait - Nodule morphological traits extracted from images of nodule pools
- coeff.var - coefficient of variation among host genotypes for each trait
TableS1_CorrHS.csv
- trait - nodule morphology trait inferred from nodule images for all host genotypes
- correlation.coef - Pearson correlation coefficient with host selectivity
- p_value - raw p-value
- sig_0.05 - y/n to indicate significance at the 0.05 threshold
- sig_correct - y/n to indicate significance after adjusting for multiple testing
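The columns above pair a Pearson correlation with two significance flags. A minimal sketch of how such values could be derived is given here. The README does not state which multiple-testing adjustment was used, so the Bonferroni threshold below is an assumption, and the function names and example values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def flag_significance(p_values, alpha=0.05):
    """Return (sig_0.05, sig_correct) y/n pairs.

    Bonferroni (alpha / number of tests) is an assumed correction; the
    dataset's sig_correct column may rely on a different method.
    """
    threshold = alpha / len(p_values)
    return [("y" if p < alpha else "n", "y" if p < threshold else "n")
            for p in p_values]

# Hypothetical trait means and host selectivity scores for four genotypes.
trait = [1.0, 2.0, 3.0, 4.0]
selectivity = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(trait, selectivity)

# Hypothetical raw p-values for three traits.
flags = flag_significance([0.001, 0.03, 0.2])
print(r, flags)
```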
GWAS of Host Selectivity Trait
This section contains the files for running the host selectivity GWAS pipeline (host_selectivity_gwas.sh), a record of the settings for the GWAS run itself (medtr_ulmm_host_selectivity.log.txt), and the complete GWAS output files (medtr_ulmm_host_selectivity.assoc.txt, medtr_ulmm_host_selectivity.csv). It also contains manhattan_plot.R, A17_R108_SNPsegregation.R, and summaries of these analyses focused on significant SNPs, their associations, and linkage disequilibrium with other SNPs (sig_LD_geneinfo_pval_beta.tsv, sig_mdtr.csv, and sig_mdtr.ld).
Significant SNP information is mostly summarized within sig_LD_geneinfo_pval_beta.tsv. The column names in the complete GWAS results in medtr_ulmm_host_selectivity.assoc.txt and medtr_ulmm_host_selectivity.csv overlap with those detailed below.
sig_LD_geneinfo_pval_beta.tsv - Includes additional annotation information and ref/alt SNP calls gathered for significant SNPs in tab-delimited format. Columns include:
- snp- SNP location in the Medicago truncatula V5 genome assembly
- chr- Chromosome ID from the Medicago truncatula V5 genome assembly
- pos- SNP location
- pvalue - extracted from the GEMMA likelihood ratio test
- allele1- alternate allele
- allele0- reference accession allele
- af- frequency of the reference allele in the HapMap population
- beta- effect size of association with host selectivity
- se- standard error
- p_score – combination of p-value and effect size
- start- closest annotated gene start
- end- closest annotated gene end
- dist_from_start – distance of SNP from gene start
- dist_from_end size – distance of SNP from gene end
- score- NA
- strand- plus or minus DNA strand
- source- EuGene: integrative gene finder for eukaryotic and prokaryotic genomes from INRAE
- feature - All are CDS (Coding sequences)
- frame- 0,1,2
- gene_id- from Medicago truncatula V5 genome assembly
- gene_name- If historically has one otherwise Medicago truncatula V5 genome assembly geneID
- PRODUCT- based on annotation
- GO- GO terms from EuGene
- EC- KEGG Enzyme from EuGene
- ACTIVITY - prediction from EuGene
- PUBMED - papers on the gene from EuGene
- IPR- InterPro Database Prediction
- ALT- nucleotide found in non-reference accession
- REF- nucleotide found in the reference A17
- HM001… HM316 - binary descriptor of whether each accession from the Medicago HapMap panel has the reference allele (0) or the alternate allele (1). Two alleles are included because Medicago is diploid but highly selfing.
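The accession columns can be tallied to see how many accessions carry the alternate allele at each significant SNP. The sketch below assumes a simplified layout in which each accession cell holds a single "0" or "1"; the real file may encode two alleles per accession, so the parsing would need to be adapted. The example rows and the `alt_allele_counts` function are hypothetical.

```python
import csv
import io

# Hypothetical excerpt; real rows contain many more columns and accessions.
example_tsv = (
    "snp\tpvalue\tbeta\tHM001\tHM002\tHM003\n"
    "snp_a\t1e-8\t0.4\t0\t1\t1\n"
    "snp_b\t3e-7\t-0.2\t0\t0\t1\n"
)

def alt_allele_counts(handle):
    """Map each SNP to (accessions with alternate allele, accessions with a call)."""
    reader = csv.DictReader(handle, delimiter="\t")
    accession_cols = [c for c in reader.fieldnames if c.startswith("HM")]
    counts = {}
    for row in reader:
        # Assumption: cells are "0" or "1"; anything else is treated as missing.
        calls = [row[c] for c in accession_cols if row[c] in ("0", "1")]
        counts[row["snp"]] = (calls.count("1"), len(calls))
    return counts

counts = alt_allele_counts(io.StringIO(example_tsv))
print(counts)
```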
medtr_ulmm_host_selectivity.assoc.txt - complete ulmm output of host selectivity means associated with SNPs containing information for all SNPs tested. Column names overlap with significant association files above.
medtr_ulmm_host_selectivity.csv - Reduced ulmm output of host selectivity means associated with SNPs. Column names overlap with significant association files above.
manhattan_plot.R - R code used to generate the Manhattan plot.
A17_R108_SNPsegregation.R - R code to compile ref/alternate SNP calls across all HapMap accessions for the significant GWAS associations. Takes as inputs sig_LD_geneinfo_pval_beta.tsv and HapMap_SNPs.csv.
HapMap_SNPs.csv - CSV file of the SNP markers included in the M. truncatula GWAS, with genotypes as rows and SNPs as columns; SNPs are named after the Medicago truncatula V5 genome assembly.
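Because HapMap_SNPs.csv is 10.65 GB, it may be more practical to stream it row by row and keep only the SNP columns of interest rather than loading the whole table. The sketch below shows one way to do that with the standard library. It assumes the first column holds the genotype ID and uses placeholder SNP names; neither detail is confirmed by this README.

```python
import csv
import io

def extract_snp_columns(handle, wanted_snps):
    """Stream a wide genotype-by-SNP CSV and keep only the requested SNP columns."""
    reader = csv.reader(handle)
    header = next(reader)
    keep = [i for i, name in enumerate(header) if name in wanted_snps]
    extracted = {}
    for row in reader:
        # Assumption: the first field of each row is the genotype identifier.
        extracted[row[0]] = {header[i]: row[i] for i in keep}
    return extracted

# Hypothetical miniature stand-in for the real file.
example_csv = "genotype,snp_a,snp_b,snp_c\nHM001,0,1,0\nHM002,1,1,0\n"
subset = extract_snp_columns(io.StringIO(example_csv), {"snp_a", "snp_c"})
print(subset)
```

With the real file, the handle would come from `open("HapMap_SNPs.csv", newline="")`, so only one row is held in memory at a time.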
Host Adaptive Enrichment Datasets
NodSize_A17_R108.R - Contains R code for the comparisons between A17 and R108 shown in Figures 3, 4, S5, S6, and S7 and the associated analyses producing Tables S4, S5, S6, and S7. Takes as inputs NodulePools_SizeData.txt, SingleNoduleColonies.txt, NoduleSizeClassInfo.txt, SingleStrain_phenotype_summary.tsv, and freq_C68.txt.
NodulePools_SizeData.txt - Overall characteristics of A17 and R108 nodule size pools (e.g., number of nodules and wet weight)
- Plant_genotype- A17 or R108
- Community- Sinorhizobium community
- Pot_replicate- number from 7-12
- Harvest_date- month/day/year
- Person - Initials of harvester
- Size - category (small, medium, large)
- Nodule_number - number of nodules in the pool
- Nodule_weight_g - nodule wet weight in grams
NoduleSize_HarvestData.txt- Overall summary of harvest data from A17 and R108 adaptive enrichment experiment.
- Plant_genotype- A17 or R108
- Community- Sinorhizobium inoculum used
- Pot_replicate- numbers (7-12)
- Harvest_date- month/day/year
- Person - Initials of harvester
- Plant_num- number of plants in the pot
- Flowering_num- number of flowers at harvest
- Seeding_num- number of seeds at harvest
- Leaf_Color- qualitative measure of leaf yellowing- sign of N deficiency
- Nod_num- number of nodules harvested
- Plants_harvested- number of plants harvested
- Nod_Num_white- number of white nodules in pool
- Nod_weight_g - total nodule wet weight in grams
- Veg_dryweight_g - vegetative dry weight in grams
- Root_dryweight_g - root dry weight in grams
- Small_Nod_Num - small pool nodule number
- Small_Nod_wieght- small nodule pool wet weight in grams
- Med_Nod_Num - medium pool nodule number
- Med_Nod_wieght- medium nodule pool wet weight in grams
- Large_Nod_Num- large pool nodule number
- Large_Nod_wieght- large nodule pool wet weight in grams
SingleNoduleColonies.txt - Colony-forming units estimated from individual crushed A17 and R108 nodules of each size class. Approximately 3 replicate dilution series were conducted.
- Plant_genotype- A17 or R108
- Pot_replicate- numbers (7-12)
- Size- nodule size category (small, medium, large)
- Plate- replicate
- Dilution- dilution level counted
- Count- number of colonies on the dilution plate
- Colonies- estimated number of colony-forming units from the nodule
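The relationship between the Count, Dilution, and Colonies columns can be sketched as a simple back-calculation. This README does not document how Dilution is encoded or what fraction of each nodule crush was plated, so the tenfold exponent and the `plated_fraction` parameter below are assumptions, and the plate values are invented for illustration.

```python
def estimate_cfu(count, dilution_exponent, plated_fraction=1.0):
    """CFU per nodule = colonies counted x 10^dilution / fraction of crush plated.

    Assumes Dilution is the exponent of a tenfold dilution series; check the
    data file and the analysis script before relying on this.
    """
    return count * 10 ** dilution_exponent / plated_fraction

# Hypothetical (Count, Dilution) pairs from three replicate plates of one nodule.
plates = [(42, 3), (38, 3), (5, 4)]
estimates = [estimate_cfu(count, dilution) for count, dilution in plates]
mean_cfu = sum(estimates) / len(estimates)
print(estimates, mean_cfu)
```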
NoduleSizeClassInfo.txt - Measurements of A17 and R108 nodule length and number of lobes from 10 randomly chosen nodules of each size category from each replicate pot.
- Plant_genotype- A17 or R108
- Community- Sinorhizobium inoculum used
- Pot_replicate- numbers (7-12)
- Size- nodule size category (small, medium, large)
- Lobes_num- number of nodule lobes
- Length_mm- length of nodule in millimeters
freq_C68.txt - Frequencies of a community of 68 Sinorhizobium strains within the nodule communities of A17 and R108 small, medium, and large pools and within the initial strain communities. These values are inferred with HARP (see Burghardt et al., AEM 2023 for details and Kessner et al. 2013 for the method).
- pool - includes all sample details separated by underscores: community of 68 strains (C68), host genotype (A17, R108, or initial), treatment (s = small nodules, m = medium nodules, b = big nodules, X = all nodules), and replicate pot (#)
- USDA1157… USDA1719 - inferred relative frequency in the pool of each strain in the community
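Splitting the pool identifier into its parts makes it easier to group samples by host and nodule size class. The sketch below assumes the fields appear in the order community, host, treatment, replicate, and that some identifiers (for example, initial communities) may have fewer fields. The example names are hypothetical and should be checked against freq_C68.txt.

```python
SIZE_CODES = {"s": "small", "m": "medium", "b": "big", "X": "all"}

def parse_pool(name):
    """Split a pool identifier into labeled fields (field order is assumed)."""
    parts = name.split("_")
    if len(parts) != 4:
        # Leave identifiers with an unexpected layout unparsed instead of guessing.
        return {"raw": name, "parsed": False}
    community, host, treatment, replicate = parts
    return {
        "raw": name,
        "parsed": True,
        "community": community,
        "host": host,
        "size": SIZE_CODES.get(treatment, treatment),
        "replicate": replicate,
    }

# Hypothetical identifiers written in the assumed layout.
print(parse_pool("C68_A17_s_7"))
print(parse_pool("C68_initial"))
```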
SingleStrain_phenotype_summary.tsv - Single-strain plant growth benefit data for 68 strains from a previously published experiment (Burghardt et al. 2018)
- plant_genotype- Medicago genotype (A17 or R108)
- strain- Sinorhizobium meliloti strain name on NCBI
- nodule- number of nodules centered and normalized based on dataset overall mean
- nodule_raw- mean number of nodules for each host x strain combination
- nodabove- mean number of nodules in the top half of the root system centered and normalized based on dataset overall mean
- nodabove_raw - mean number of nodules in the top half of the root system
- nodred- mean number of red nodules in the top half of the root system centered and normalized based on dataset overall mean.
- nodred_raw- mean number of red nodules in the top half of the root system
- weight- mean aboveground plant dry weight in N-free conditions centered and normalized based on dataset overall mean
- weight_raw- aboveground plant weight in N-free conditions
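One simple way to relate these single-strain benefit values to the strain frequencies in freq_C68.txt is a frequency-weighted mean benefit per pool, compared between large and small nodule pools. This is an illustrative summary only and is not necessarily the analysis performed in NodSize_A17_R108.R. The strain names, frequencies, and benefit values below are invented.

```python
def weighted_mean_benefit(freqs, benefit):
    """Mean benefit across strains, weighted by each strain's frequency in a pool."""
    shared = [s for s in freqs if s in benefit]
    total = sum(freqs[s] for s in shared)
    return sum(freqs[s] * benefit[s] for s in shared) / total

# Hypothetical normalized plant weights for three strains on one host.
benefit = {"strain_1": 0.8, "strain_2": 0.1, "strain_3": -0.5}

# Hypothetical strain frequencies in small and large nodule pools.
small_pool = {"strain_1": 0.2, "strain_2": 0.3, "strain_3": 0.5}
large_pool = {"strain_1": 0.6, "strain_2": 0.3, "strain_3": 0.1}

enrichment = (weighted_mean_benefit(large_pool, benefit)
              - weighted_mean_benefit(small_pool, benefit))
print(round(enrichment, 2))
```

Under these assumptions, a positive difference means the large nodule pool is shifted toward more beneficial strains than the small nodule pool.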
Code/Software
R is required to run NoduleMorphology_HapMap_ElasticNet.R and NodSize_A17_R108.R; the scripts were created using R version 4.5.1. Annotations are provided throughout each script for 1) library loading, 2) dataset loading and cleaning, 3) analyses, and 4) figure creation.
Usage notes
Python is required for .py scripts, R is required to run .R scripts, the .sh file runs on the command line, and Microsoft Excel can be used to view .csv, .txt, and .tsv files.
Works referencing this dataset
Burghardt, LT; Sydow, P; Sutherland, J; Epstein, B; and Tiffin, P. (2026) Genetic variation in host selectivity and adaptive strain enrichment in legume-rhizobia symbiosis, Proceedings of the Royal Society B, https://doi.org/10.1098/rspb.2025.2851
