Mutualism breakdown can be prevented if partner species preferentially select and reward partners that provide greater benefit. We examined these two components using the legume Medicago truncatula and its nitrogen-fixing symbiont Sinorhizobium meliloti. The first dataset reanalyzes strain composition data from the nodules of 202 accessions to show significant genetic variation in the capacity of Medicago to restrict strain diversity when controlling for nodule number variation using a simulation. By performing an augmented image analysis of nodule pool images to produce scores of morphological parameters, we found that hosts with a suite of nodule traits, including shorter nodules, tended to be more selective. Using previously published SNP data for these ~200 accessions, we performed a genome-wide association study on host selectivity to identify candidate genes. The second dataset uses two well-studied Medicago genotypes (A17 and R108) with contrasting nodule morphologies to assess the effectiveness of adaptive enrichment mechanisms by sampling the relative frequencies of rhizobial strains in pools of small nodules, which we show have smaller rhizobial population sizes than large nodules. We pair these results with previous single-strain assessments of strain benefits to hosts to show that hosts enriched beneficial strains in large nodules; however, the host that formed larger and more variably sized nodules, and thus had greater 'potential' to increase rhizobial populations, was less effective. Together, this package includes the data, code, and figures to reproduce our analysis.

This Dryad Repository supports and enables the reuse of Burghardt et al. 2026 data and associated analysis outputs, including raw phenotypic data, software, analysis code, and GWAS and statistical outputs.

This repository includes three major parts: 1) data and analysis of Medicago host selectivity and nodule morphology for 202 accessions of Medicago inoculated with a mixture of 86 Sinorhizobium meliloti strains, including code to assess how host selectivity is correlated with nodule morphology traits using LASSO regression and Boruta; 2) results from a genome-wide association analysis of the host selectivity trait; and 3) data and analysis of Medicago adaptive enrichment of bacterial symbionts. The adaptive enrichment analysis includes measurements of symbiotic traits (nodule length, size, branching, colony-forming units, nodule number, and strain relative frequencies in large and small nodule pools) on R108 and A17 hosts inoculated with a mixture of 68 rhizobial strains. We analyzed whether each host can enrich for beneficial rhizobial strains by comparing prior single-strain plant biomass measurements with strain frequencies in nodules.

Host Selectivity Datasets and Code

NoduleMorphology_HapMap_ElasticNet.R - Contains R code to conduct the Medicago HapMap panel rhizobial strain diversity analysis with host selectivity simulations, the analysis of nodule traits, and the analysis of relationships between nodule traits and host selectivity. This code generates Figures 2, S1, S3, and S4 as well as the information in Tables S1 (TableS1_CorrHS.csv + TableS1_CoV.csv) and S3 from the manuscript. Takes as inputs epstein_et_al_2022_S9.csv and df_summary.csv.

epstein_et_al_2022_S9.csv - Contains previously published data describing the frequencies of 88 strains within initial inocula and nodule pools of the HapMap panel (HapMap Select & Resequence Experiment (Epstein et al., 2022)). Columns include:

pot_or_sample - pot reference ID from the greenhouse experiment in Epstein et al. 2022; corresponds to ‘pot’ in the nodule morphology dataset (df_summary.csv)
host_genotype - Medicago truncatula HapMap ID; more information is available at MedicagoHapMap.org
MAG746A… MAG97 - relative frequency of each of the Sinorhizobium strains from Riley et al. 2022, estimated in the nodules from each pot

S_RGWA_1.CSV - Contains experimental design data from the 2020 HapMap Select & Resequence experiment described in Epstein et al. 2023.

pot - matches numbers in epstein_et_al_2022_S9.csv
geno - Medicago truncatula HapMap ID; more information is available at MedicagoHapMap.org
block - Experimental replicate block

df_summary.csv - Summary dataset compiled from the output measures of the nodule morphology ImageJ analysis, a subset of which are used in downstream analysis.

pot - matches numbers in epstein_et_al_2022_S9.csv

prop.harvest - proportion of nodules harvested and imaged from the pot

Columns 3:137 - Summary statistics of nodule morphology traits measured in ImageJ, including mean, 95.quant (95th percentile), 5.quant (5th percentile), median, var, min, max, range, and scaled_range (range / mean). Also included are nodule_raw_count, ttl.nodule.area, nodule_calc_count, and pred.prop.lobed (predicted proportion of lobed nodules).
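
The per-pot summary statistics listed above can be illustrated with a short sketch. It is written in Python using only the standard library and is not the code used to build df_summary.csv. The function names (quantile, summarize_trait), the example lengths, the use of the sample variance, and the choice of linear interpolation for percentiles are all assumptions made for illustration.

```python
import statistics


def quantile(values, p):
    """Return the p-th quantile (0 <= p <= 1) using linear interpolation.

    Assumption: linear interpolation between order statistics; the
    published pipeline may use a different percentile definition.
    """
    ordered = sorted(values)
    position = (len(ordered) - 1) * p
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def summarize_trait(values):
    """Summarize one nodule trait measured on the nodules from one pot."""
    value_range = max(values) - min(values)
    mean = statistics.mean(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "var": statistics.variance(values),  # assumption: sample variance
        "min": min(values),
        "max": max(values),
        "range": value_range,
        "scaled_range": value_range / mean,  # range / mean
        "5.quant": quantile(values, 0.05),
        "95.quant": quantile(values, 0.95),
    }


# Hypothetical nodule lengths (arbitrary units) from a single pot.
example_lengths = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = summarize_trait(example_lengths)
```

Because scaled_range divides by the mean, it lets nodule size variability be compared between pots whose nodules differ in average size.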

TableS1_CoV.csv

trait - Nodule morphological traits extracted from images of nodule pools
coeff.var - coefficient of variation among host genotypes for each trait

Table S1_CorHS.csv

trait- nodule morphology trait inferred from nodule images for all host genotypes
correlation.coef- Pearson correlation coefficient with host selectivity
p_value – raw p-value
sig_0.05 – y/n to indicate significant at 0.05 threshold
sig_correct – y/n to indicate significant after adjusting for multiple testing
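
As a rough guide to how the columns in this table relate to one another, the following Python sketch (standard library only) computes a Pearson correlation and converts raw p-values into y/n flags. This README does not state which multiple-testing adjustment was used, so the Bonferroni correction below is an assumption, as are the function names and every example value.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def significance_flags(p_values, alpha=0.05):
    """Return (sig_0.05, sig_correct) y/n flags for each raw p-value.

    Assumption: Bonferroni adjustment (alpha divided by the number of
    tests); the adjustment used for the manuscript may differ.
    """
    corrected_alpha = alpha / len(p_values)
    return [
        ("y" if p < alpha else "n", "y" if p < corrected_alpha else "n")
        for p in p_values
    ]


# Hypothetical values: one nodule trait and host selectivity for four hosts.
trait = [1.0, 2.0, 3.0, 4.0]
selectivity = [8.0, 6.0, 4.0, 2.0]
r = pearson_r(trait, selectivity)

# Hypothetical raw p-values for three traits.
flags = significance_flags([0.001, 0.03, 0.2])
```

A trait can be flagged in sig_0.05 but not in sig_correct when its raw p-value falls between the adjusted and unadjusted thresholds.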

GWAS of Host Selectivity Trait

This section contains the files for running the host selectivity GWAS pipeline (host_selectivity_gwas.sh), a record of the settings for the GWAS run itself (medtr_ulmm_host_selectivity.log.txt), and the GWAS output files (medtr_ulmm_host_selectivity.assoc.txt, medtr_ulmm_host_selectivity.csv). It also contains manhattan_plot.R, A17_R108_SNPsegregation.R, and the summaries from these analyses, which focus on significant SNPs, their associations, and linkage disequilibrium with other SNPs (sig_LD_geneinfo_pval_beta.tsv, sig_mdtr.csv, and sig_mdtr.ld).

Significant SNP information is mostly summarized within sig_LD_geneinfo_pval_beta.tsv. The column names in the complete GWAS results in medtr_ulmm_host_selectivity.assoc.txt and medtr_ulmm_host_selectivity.csv overlap with those detailed below.

sig_LD_geneinfo_pval_beta.tsv - Includes additional annotation information and ref/alt SNP calls gathered for significant SNPs, in tab-delimited format. Columns include:

snp- SNP location in the Medicago truncatula V5 genome assembly
chr- Chromosome ID from the Medicago truncatula V5 genome assembly
pos- SNP location
pvalue - extracted from the GEMMA likelihood ratio test
allele1- alternate allele
allele0- reference accession allele
af- frequency of the reference allele in the HapMap population
beta- effect size of association with host selectivity
se- standard error
p_score – combination of p-value and effect size
start- closest annotated gene start
end- closest annotated gene end
dist_from_start – distance of SNP from gene start
dist_from_end size – distance of SNP from gene end
score- NA
strand- plus or minus DNA strand

source- EuGene: integrative gene finder for eukaryotic and prokaryotic genomes from INRAE
feature - All are CDS (Coding sequences)
frame- 0,1,2
gene_id- from Medicago truncatula V5 genome assembly
gene_name - historical gene name if one exists; otherwise the Medicago truncatula V5 genome assembly gene ID
PRODUCT- based on annotation
GO- GO terms from EuGene
EC- KEGG Enzyme from EuGene
ACTIVITY - prediction from EuGene
PUBMED - papers on the gene from EuGene
IPR- InterPro Database Prediction
ALT- nucleotide found in non-reference accession
REF- nucleotide found in the reference A17
HM001… HM316 – binary descriptor of whether each accession in the Medicago HapMap panel has the reference allele (0) or the alternate allele (1). Two alleles are included because Medicago is diploid but highly selfing.
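
As a minimal sketch of how the per-accession columns might be used, the Python snippet below (standard library only) counts how many accessions carry the alternate allele at each SNP. It assumes each call is stored as a single "0" or "1" and treats anything else as missing; if calls are stored as allele pairs, the parsing would need adjusting. The in-memory example table, the SNP names, and the function name alt_allele_summary are hypothetical.

```python
import csv
import io


def alt_allele_summary(rows):
    """Map each SNP to (accessions with a call, accessions carrying alt)."""
    summary = {}
    for row in rows:
        calls = [
            value
            for column, value in row.items()
            # Assumption: accession columns start with "HM" and hold "0" or "1".
            if column.startswith("HM") and value in ("0", "1")
        ]
        summary[row["snp"]] = (len(calls), calls.count("1"))
    return summary


# Hypothetical three-accession excerpt standing in for the real file.
example_tsv = (
    "snp\tHM001\tHM002\tHM003\n"
    "snp_a\t0\t1\t1\n"
    "snp_b\t0\tNA\t0\n"
)
reader = csv.DictReader(io.StringIO(example_tsv), delimiter="\t")
counts = alt_allele_summary(reader)
```

To run this against the real table, the io.StringIO object could be replaced with open("sig_LD_geneinfo_pval_beta.tsv", newline="").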

medtr_ulmm_host_selectivity.assoc.txt - complete ulmm output of host selectivity means associated with SNPs containing information for all SNPs tested. Column names overlap with significant association files above.

medtr_ulmm_host_selectivity.csv - Reduced ulmm output of host selectivity means associated with SNPs. Column names overlap with significant association files above.

manhattan_plot.R - R code used to generate the Manhattan plot.

A17_R108_SNPsegregation.R - R code to compile ref/alternate SNP calls across all HapMap accessions for the significant GWAS associations. Takes as inputs sig_LD_geneinfo_pval_beta.tsv and HapMap_SNPs.csv.

HapMap_SNPs.csv - SNP markers included in the M. truncatula GWAS, in CSV format, with genotypes as rows and SNPs as columns named according to the Medicago truncatula V5 genome assembly.

Host Adaptive Enrichment Datasets

NodSize_A17_R108.R - Contains R code for the comparisons between A17 and R108 in Figures 3, 4, S5, S6, and S7 and the associated analyses producing Tables S4, S5, S6, and S7. Takes as inputs NodulePools_SizeData.txt, SingleNoduleColonies.txt, NoduleSizeClassInfo.txt, SingleStrain_phenotype_summary.tsv, and freq_C68.txt.

NodulePools_SizeData.txt - Overall characteristics of A17 and R108 nodule size pools (e.g., number of nodules and wet weight)

Plant_genotype- A17 or R108
Community- Sinorhizobium community
Pot_replicate- number from 7-12
Harvest_date- month/day/year
Person- Initials of harvester
Size- category (small, medium, large)
Nodule_number - number of nodules in the pool
Nodule_weight_g - nodule wet weight in grams

NoduleSize_HarvestData.txt- Overall summary of harvest data from A17 and R108 adaptive enrichment experiment.

Plant_genotype- A17 or R108
Community- Sinorhizobium inoculum used
Pot_replicate- numbers (7-12)
Harvest_date- month/day/year
Person- Initials of harvester
Plant_num- number of plants in the pot
Flowering_num- number of flowers at harvest
Seeding_num- number of seeds at harvest
Leaf_Color- qualitative measure of leaf yellowing- sign of N deficiency
Nod_num- number of nodules harvested
Plants_harvested- number of plants harvested
Nod_Num_white- number of white nodules in pool
Nod_weight_g - total nodule wet weight in grams
Veg_dryweight_g - vegetative dry weight in grams
Root_dryweight_g - root dry weight in grams
Small_Nod_Num - small pool nodule number
Small_Nod_wieght- small nodule pool wet weight in grams
Med_Nod_Num - medium pool nodule number
Med_Nod_wieght- medium nodule pool wet weight in grams
Large_Nod_Num- large pool nodule number
Large_Nod_wieght- large nodule pool wet weight in grams

SingleNoduleColonies.txt - Colony-forming units estimated from individual crushed A17 and R108 nodules of each size class. Approximately 3 replicate dilution series were conducted.

Plant_genotype- A17 or R108
Pot_replicate- numbers (7-12)
Size- nodule size category (small, medium, large)
Plate- replicate
Dilution- dilution level counted
Count- number of colonies on the dilution plate
Colonies- estimated number of colony-forming units from the nodule

NoduleSizeClassInfo.txt - Measurements of A17 and R108 nodule length and number of lobes from 10 randomly chosen nodules of each size category from each replicate pot.

Plant_genotype- A17 or R108
Community- Sinorhizobium inoculum used
Pot_replicate- numbers (7-12)
Size- nodule size category (small, medium, large)
Lobes_num- number of nodule lobes
Length_mm- length of nodule in millimeters

freq_C68.txt - Frequencies of a community of 68 Sinorhizobium strains within the nodule communities of A17 and R108 small, medium, and large pools and within the initial strain communities. These values were inferred using HARP (see Burghardt et al., AEM 2023 for details and Kessner et al., 2013 for the method).

pool - all sample details separated by underscores: community of 68 strains (C68), host genotype (A17, R108, or initial), treatment (s = small nodules, m = medium nodules, b = big nodules, X = all nodules), and replicate pot (#)
USDA1157....USDA1719 - inferred relative frequency in the pool of each strain in the community
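
The pool labels can be split back into their components before analysis. The Python sketch below (standard library only) assumes that labels follow the order community, host, treatment, replicate, as in the hypothetical label "C68_A17_s_7". The exact label format in freq_C68.txt may differ, particularly for the initial communities, and the function name parse_pool is illustrative.

```python
def parse_pool(pool):
    """Split a pool label into its underscore-separated parts.

    Assumption: labels follow community_host_treatment_replicate, e.g.
    "C68_A17_s_7"; initial inoculum samples may use a shorter label.
    """
    treatment_names = {"s": "small", "m": "medium", "b": "big", "X": "all"}
    parts = pool.split("_")
    if len(parts) != 4:
        # Leave unexpected labels (e.g. initial communities) unparsed.
        return {"pool": pool}
    community, host, treatment, replicate = parts
    return {
        "pool": pool,
        "community": community,
        "host": host,
        "treatment": treatment_names.get(treatment, treatment),
        "replicate": replicate,  # kept as text in case it is not numeric
    }


parsed = parse_pool("C68_A17_s_7")
unparsed = parse_pool("C68_initial")
```

Returning unexpected labels unparsed, rather than raising an error, keeps rows for the initial communities available for separate handling.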

SingleStrain_phenotype_summary.tsv - Single-strain plant growth benefit data for 68 strains from a previously published experiment (Burghardt et al. 2018)

plant_genotype- Medicago genotype (A17 or R108)
strain- Sinorhizobium meliloti strain name on NCBI
nodule- number of nodules centered and normalized based on dataset overall mean
nodule_raw- mean number of nodules for each host x strain combination
nodabove- mean number of nodules in the top half of the root system centered and normalized based on dataset overall mean
nodabove_raw- mean number of nodules in the top half of the root system
nodred- mean number of red nodules in the top half of the root system centered and normalized based on dataset overall mean.
nodred_raw- mean number of red nodules in the top half of the root system
weight- mean aboveground plant dry weight in N-free conditions centered and normalized based on dataset overall mean
weight_raw- aboveground plant weight in N-free conditions
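
One way to pair these single-strain benefits with the pool frequencies in freq_C68.txt is sketched below in Python (standard library only). This is not the analysis in NodSize_A17_R108.R. The frequency-weighted mean benefit used here as a pool-level score, the function name weighted_benefit, the strain names, and all numbers are assumptions for illustration.

```python
def weighted_benefit(frequencies, benefit):
    """Frequency-weighted mean benefit of the strains present in a pool."""
    shared = [strain for strain in frequencies if strain in benefit]
    total = sum(frequencies[strain] for strain in shared)
    return sum(frequencies[s] * benefit[s] for s in shared) / total


# Hypothetical strains, benefits (e.g. the weight column), and frequencies.
benefit = {"strain_1": 1.0, "strain_2": 0.0, "strain_3": -1.0}
small_pool = {"strain_1": 0.25, "strain_2": 0.5, "strain_3": 0.25}
large_pool = {"strain_1": 0.5, "strain_2": 0.25, "strain_3": 0.25}

small_score = weighted_benefit(small_pool, benefit)
large_score = weighted_benefit(large_pool, benefit)

# Under this sketch, a higher score in the large pool would suggest that
# more beneficial strains are over-represented in large nodules.
enriched_in_large = large_score > small_score
```

Under these assumptions, comparing the two scores within each host genotype gives a simple indication of whether large nodules hold a more beneficial strain mixture than small nodules.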

Code/Software

R is required to run NoduleMorphology_HapMap_ElasticNet.R and NodSize_A17_R108.R; the scripts were created using R version 4.5.1. Annotations are provided throughout each script, covering 1) library loading, 2) dataset loading and cleaning, 3) analyses, and 4) figure creation.

Usage notes

Python is required for .py scripts, and R is required to open .R scripts. The .sh file runs on the command line. Microsoft Excel can be used to view .csv, .txt, and .tsv files.

Works referencing this dataset

Burghardt, LT; Sydow, P; Sutherland, J; Epstein, B; and Tiffin, P. (2026) Genetic variation in host selectivity and adaptive strain enrichment in legume-rhizobia symbiosis, Proceedings of the Royal Society B, https://doi.org/10.1098/rspb.2025.2851
