Gene-specific selective sweeps are pervasive across human gut microbiomes
Data files
Oct 21, 2025 version, 24.46 MB total
- all_QP_scans.zip (14.73 MB)
- cdiff_scan_comparison.zip (2.44 MB)
- fig2.zip (4.25 KB)
- fig3.zip (2.04 MB)
- fig4.zip (4.94 MB)
- README.md (4.36 KB)
- TableS4.txt (294.94 KB)
- tract_length_parameters.txt (1.30 KB)
Abstract
The human gut microbiome is composed of a highly diverse consortium of species that are continually evolving within and across hosts. The ability to identify adaptations common to many human gut microbiomes would not only reveal shared selection pressures across hosts, but also key drivers of functional differentiation of the microbiome that may affect community structure and host traits. However, the extent to which adaptations have spread across human gut microbiomes is relatively unknown. Here, we develop a novel selection scan statistic named the integrated Linkage Disequilibrium Score (iLDS) that can detect sweeps of adaptive alleles spreading across host microbiomes via migration and horizontal gene transfer. Specifically, iLDS leverages signals of hitchhiking of deleterious variants with a beneficial variant. Application of the statistic to ~30 of the most prevalent commensal gut species from 24 human populations around the world revealed more than 300 selective sweeps across species. We find an enrichment for selective sweeps at loci involved in carbohydrate metabolism, indicative of adaptation to host diet, and we find that the targets of selection significantly differ between industrialized and non-industrialized populations. One of these sweeps is at a locus known to be involved in the metabolism of maltodextrin, a synthetic starch that has recently become a widespread component of industrialized diets. In summary, our results indicate that recombination between strains fuels pervasive adaptive evolution among human gut commensal bacteria, and strongly implicate host diet and lifestyle as critical selection pressures.
Dataset DOI: 10.5061/dryad.9ghx3ffx4
Description of the data and file structure
All underlying data used in this paper is publicly available.
Files and variables
Note: For a complete description of iLDS output variables, see the Output section of the associated GitHub repo.
File: TableS4.txt
Description: contains all sweeps detected in all populations analyzed (QP, UHGG, Drosophila)
Variables
- Gene: the gene the variant lies in
- Contig: the contig the variant lies on (in the case of peaks for Drosophila melanogaster, this is the chromosome)
- SNV position (contig): the position of the central intermediate-frequency non-synonymous SNP in the analysis window with the highest iLDS value along the given contig
- Species: Name of the species
- Peak number: the unique peak to which the variant belongs (peaks are numbered from left to right for each species)
- Gene product: the gene protein product
- iLDS: value of iLDS for the variant
- Country: For UHGG and Drosophila, country where scan was run
- Data type: the dataset the scan was performed on (quasi-phased metagenomes, UHGG, or Drosophila)
- COG: Cluster of Orthologous Genes category
- EC: Enzyme Commission number of the gene
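As a rough illustration of how the variables above might be used, the sketch below counts distinct peaks per species and data type. It assumes TableS4.txt is tab-delimited with a header row whose column names match the variable names listed above; neither assumption is confirmed by this README, and the function name `count_peaks` is hypothetical.

```python
import csv
from collections import defaultdict


def count_peaks(handle, delimiter="\t"):
    """Count distinct peaks per (species, data type) in a TableS4-like file.

    Assumes a header row with the columns "Species", "Data type",
    "Peak number" and, optionally, "Country" (an assumption about the
    file layout, not something stated in the README).
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    peaks = defaultdict(set)
    for row in reader:
        key = (row["Species"], row["Data type"])
        # Peaks are numbered per species, so the country is kept in the
        # identifier to avoid merging peaks from different populations.
        peaks[key].add((row.get("Country", ""), row["Peak number"]))
    return {key: len(ids) for key, ids in peaks.items()}
```

To try it on the real table, open the file with `open("TableS4.txt", newline="")` and pass the handle to `count_peaks`. If the file uses a different delimiter or different header spellings, adjust the `delimiter` argument and the column names accordingly.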
File: all_QP_scans.zip
Description: complete iLDS output files for each of the 32 QP species.
File: cdiff_scan_comparison.zip
Description: iLDS output files as well as output files for iHS, Tajima's D, and dN/dS calculations in Extended Data Figure 2.
File: fig2.zip
Description: Genome-wide decay of LD at synonymous and non-synonymous sites for the species R. bromii and P. copri. Also includes AUC(r2N - r2S) for all quasi-phased (QP) species.
- R. bromii: R_bromii_ld_decay.txt
- P. copri: P_copri_ld_decay.txt
- AUC(r2N - r2S): AUC.txt
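For readers who want to see what an area-under-the-curve difference such as AUC(r2N - r2S) looks like numerically, here is a minimal sketch using the trapezoidal rule. It is only an illustration: the exact definition used in the paper (distance range, scaling, normalization) is not given in this README, and the column layout of the LD decay files is not described here, so the hypothetical function `auc_difference` takes plain sequences rather than reading the files.

```python
def auc_difference(distances, r2_nonsyn, r2_syn):
    """Trapezoidal area under the curve r2_nonsyn - r2_syn over distance.

    All three arguments are equal-length sequences of numbers, with
    distances sorted in increasing order. This is an illustrative
    definition, not necessarily the one used to produce AUC.txt.
    """
    if not (len(distances) == len(r2_nonsyn) == len(r2_syn)):
        raise ValueError("all inputs must have the same length")
    diff = [n - s for n, s in zip(r2_nonsyn, r2_syn)]
    area = 0.0
    for i in range(1, len(distances)):
        width = distances[i] - distances[i - 1]
        # Area of one trapezoid between consecutive distance points.
        area += 0.5 * width * (diff[i] + diff[i - 1])
    return area
```

Under this definition, a positive value means LD at non-synonymous sites exceeds LD at synonymous sites on average over the distances supplied.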
File: fig3.zip
Description: iLDS scans for the species C. difficile, E. siraeum, and R. bromii. Number of sweeps for each QP species.
- C. difficile: cdiff_full_scan.txt
- E. siraeum: Eubacterium_siraeum_57634_full_scan.txt
- R. bromii: rbromii_full_scan.txt
- Number sweeps: num_peaks.txt
Note: The iLDS output files (..._full_scan.txt) contain the raw output of iLDS. For a complete description of each column, see the Output section of the associated GitHub repo.
File: fig4.zip
Description: complete iLDS output files for R. bromii in 16 populations across the world (UHGG). Includes both the Main Text scans (Fig 4) and the remainder of the scans, shown in Extended Data Figure 8.
File: tract_length_parameters.txt
Description: detected tract length (lr) and decay distance (lDD) for each QP species
Variables
- tract_length: lr
- decay_points: lDD
Code/software
All code necessary to process underlying short read data, annotate genomes, measure LD, calculate iLDS, perform other analyses, and generate figures can be found in the accompanying GitHub repo for this project as well as in its Zenodo archive.
Access information
- Shotgun metagenomic samples for the QP analyses were derived from the following sources:
  - 250 individuals from Lloyd-Price et al. (2017) (accession numbers PRJNA48479 and PRJNA275349)
  - 250 individuals from Xie et al. (2016) (accession number PRJEB9576)
  - 185 individuals from Qin et al. (2012) (accession number PRJNA422434)
  - 8 individuals from Korpela et al. (2018) (accession number PRJEB24041)
- Alignments of MAGs and isolates and accompanying data files from UHGG (Almeida et al. 2019) were downloaded from MGnify.
- Finally, for Drosophila melanogaster analyses, we used the publicly available Drosophila Genome Nexus data set (Lack et al. 2015). This includes 205 DGRP strains from Raleigh, North Carolina and 197 DPGP3 strains from Zambia. These data can be downloaded at www.johnpool.net.
