MERFISH measurements of the mouse gastrointestinal tract in the presence and absence of the microbiome

Published Aug 05, 2025 on Dryad. https://doi.org/10.5061/dryad.p5hqbzm0z

Data files

Aug 05, 2025 version files (26.41 GB total):

data_upload.zip: 26.41 GB
README.md: 7.94 KB

Abstract

The mammalian gastrointestinal tract comprises a diverse set of cell types, each responsive to a diversity of diet-, host-, and microbe-derived small molecules. These small molecules are sensed by a massive diversity of receptors, and differential regional, spatial, and cellular expression of these receptors is a critical mode by which sensation is mediated. In parallel, the gut microbiome plays critical roles in modulating the homeostatic cellular, molecular, and spatial structure of the mammalian gut, including potential roles in shaping the receptors that are expressed and, thus, the molecular sensing capabilities of the gut. Here we provide a spatially resolved, single-cell atlas of gene expression across four regions of the murine gut in the presence and absence of the microbiome. These measurements provide a rich resource for understanding cell-type-specific sensation capabilities, the way in which cells fine-tune these capabilities across gut regions, and, on a smaller scale, how these capabilities are fine-tuned based on cellular location within the gut mucosa.

Rosalind J. Xu, Jeffrey R. Moffitt
Boston Children's Hospital, 2025

Description

This repository contains a variety of data from a MERFISH study of the molecular and cellular organization of the murine gastrointestinal tract in the presence of a specific-pathogen-free (SPF) microbiota or no microbiota (germ-free [GF]).

The repository includes metadata associated with the identity and location of all imaged RNAs, a model used for cell segmentation, gene expression and metadata associated with identified cells, and other downstream analyses.

These data are organized in the following folders.

transcripts

This folder contains files that describe metadata associated with all RNAs identified with MERFISH.
These metadata were derived either during the decoding of RNA barcodes or in the segmentation process.

transcript_metadata_raw.csv

This file contains the location, identity, and associated metadata of all measured RNAs prior to cell segmentation. Each entry is an RNA with properties provided by the following fields:

dataset_ID: ID of dataset
dataset_name: name of dataset
slice_ID: ID of slice within the dataset
slice_full_name: unique name of all slices in all datasets
gene: gene identity of measured RNA
fov_ID: ID of FOV in MERFISH measurement
x [μm]: x position of RNA in μm
y [μm]: y position of RNA in μm
z [μm]: z position of RNA in μm
area [pixel]: area of RNA in number of pixels
brightness: average pixel brightness of RNA
weighted_distance: brightness-weighted average distance between pixels in RNA and barcode
qc_score: quality control score of RNA
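
Because the transcript table lists every detected RNA, it may be too large to load into memory at once. The sketch below streams a CSV in chunks and tallies RNAs per gene. It assumes the column header is spelled exactly as the field name above (gene); the helper name count_transcripts_per_gene and the demo rows (gene names, the blank barcode name, and scores) are made up for illustration and are not taken from the dataset.

```python
import io
from collections import Counter

import pandas as pd


def count_transcripts_per_gene(csv_source, chunksize=1_000_000):
    """Stream a transcript table in chunks and tally the RNAs per gene."""
    tally = Counter()
    # Reading only the 'gene' column keeps memory use low for a very large file.
    for chunk in pd.read_csv(csv_source, usecols=["gene"], chunksize=chunksize):
        tally.update(chunk["gene"].value_counts().to_dict())
    return pd.Series(tally, dtype="int64").sort_values(ascending=False)


# Synthetic stand-in for transcripts/transcript_metadata_raw.csv (made-up values).
demo_csv = io.StringIO(
    "dataset_ID,slice_ID,gene,fov_ID,qc_score\n"
    "0,0,Epcam,1,0.9\n"
    "0,0,Epcam,1,0.8\n"
    "0,1,Lgr5,2,0.7\n"
    "0,1,Blank-1,2,0.2\n"
    "0,1,Epcam,3,0.95\n"
)
gene_counts = count_transcripts_per_gene(demo_csv, chunksize=2)
print(gene_counts)
```

For the real file, the same function could be pointed at the path of transcript_metadata_raw.csv inside the unzipped archive.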

transcript_metadata_segmented.csv

This file contains the metadata for all measured RNAs as described above but includes metadata added during the segmentation process. Note that all RNAs associated with a blank barcode (false-positive controls) and the Neat1 RNA were removed during segmentation and are, thus, not included. In addition to the above fields, each RNA contains the following additional information:

cell_ID: unique ID for all cells in dataset (format: datasetID_sliceID_cellID) from Baysor segmentation
cellpose_prior_ID: unique ID associated with cells determined via Cellpose
baysor_assignment_confidence: Baysor assignment probability of RNA to cell
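
As an illustration of how these fields can be combined, the sketch below rebuilds a cell-by-gene count table from segmented transcripts after discarding low-confidence assignments. It is not the authors' pipeline: the helper name, the 0.8 threshold, and the demo rows are assumptions, and the README does not state how RNAs that were not assigned to any cell are encoded in cell_ID (here, rows with a missing cell_ID are simply dropped).

```python
import pandas as pd


def cell_by_gene_from_transcripts(transcripts, min_confidence=0.0):
    """Count RNAs per (cell, gene), keeping only confident Baysor assignments."""
    kept = transcripts.dropna(subset=["cell_ID"])
    kept = kept[kept["baysor_assignment_confidence"] >= min_confidence]
    # crosstab yields one row per cell and one column per gene, filled with counts.
    return pd.crosstab(kept["cell_ID"], kept["gene"])


# Made-up rows using the column names listed above.
demo_transcripts = pd.DataFrame(
    {
        "cell_ID": ["0_0_1", "0_0_1", "0_0_2", "0_0_2", "0_0_2"],
        "gene": ["Epcam", "Epcam", "Lgr5", "Epcam", "Lgr5"],
        "baysor_assignment_confidence": [0.99, 0.95, 0.40, 0.90, 0.85],
    }
)
# The 0.8 cutoff is illustrative only.
demo_counts = cell_by_gene_from_transcripts(demo_transcripts, min_confidence=0.8)
print(demo_counts)
```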

cellpose_model

This folder contains the model file used in the initial cell segmentation process with Cellpose.

membrane_IF_model.cellpose

In-house trained Cellpose model for Na+/K+-ATPase membrane immunofluorescence in murine small and large intestine.

cell_by_gene

This folder contains the final single-cell analysis file generated with the Scanpy pipeline.

cell_by_gene_with_metadata.h5ad

Cell-by-gene matrix with associated metadata for all datasets, stored in the Scanpy anndata (adata.h5ad) format. This object contains only the cells and genes that passed all quality thresholds. It contains the following objects:

adata.X

This numpy matrix contains the normalized and log-transformed cell-by-gene matrix.

adata.layers['raw_counts']

This numpy matrix contains the total counts for each RNA in each cell.

adata.obs

This pandas dataframe contains a series of metadata associated with each cell. This includes the following fields:

index: cell_ID (format: datasetID_sliceID_cellID)
cell_ID: the id for the cell within each dataset
cell_name: unique name for all cells in all datasets (format: dataset_slice_cell)
dataset_ID: ID of dataset
dataset_name: name of dataset
slice_ID: ID of slice within the dataset
slice_full_name: unique name associated with the slice in which the cell was imaged
fov_ID: ID of FOV in MERFISH measurement
region: gut region (ile = ileum, ce = cecum, pcol = proximal colon, dcol = distal colon)
microbiome: microbiome condition (WT = specific-pathogen-free, GF = germ-free)
condition: condition of measurement (format: region_microbiome)
x [μm]: x position of cell centroid in μm
y [μm]: y position of cell centroid in μm
z [μm]: z position of cell centroid in μm
area [μm²]: area of cell in μm²
n_counts: number of transcripts in cell
n_genes: number of unique genes in cell
doublet_score [Scrublet]: Scrublet doublet score for cell
cell_class: major cell class identifier based on first tier clustering
cell_type: cell type identifier based on multi-tier clustering
anatomical_layer: Gut anatomical layer based on spatial neighborhood analysis
mucosal_pseudospace: mucosal pseudospace position (0 = crypt base, 1 = top). Cells not in the mucosa have a value of -1.
umap_coords_x [all]: UMAP coordinates for all-cell embedding, x
umap_coords_y [all]: UMAP coordinates for all-cell embedding, y
umap_coords_x [cell_class]: UMAP coordinates for cell class-specific embedding, x
umap_coords_y [cell_class]: UMAP coordinates for cell class-specific embedding, y
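
Since adata.obs is a pandas dataframe, the fields above can be summarized with ordinary pandas operations. The sketch below counts cells per region and microbiome condition and bins mucosal cells along the pseudospace axis, excluding the -1 placeholder used for non-mucosal cells. The helper names, the pseudospace_bin column, the choice of five bins, and the demo rows are assumptions for illustration; the commented lines show one way the real table might be loaded with the anndata package, assuming the archive is unzipped in the working directory.

```python
import numpy as np
import pandas as pd

# With the real file, obs could be obtained roughly as follows (assumed path):
#   import anndata as ad
#   obs = ad.read_h5ad("cell_by_gene/cell_by_gene_with_metadata.h5ad", backed="r").obs


def cells_per_condition(obs):
    """Table of cell counts with gut regions as rows and microbiome states as columns."""
    return obs.groupby(["region", "microbiome"]).size().unstack(fill_value=0)


def bin_mucosal_pseudospace(obs, n_bins=5):
    """Keep mucosal cells (pseudospace >= 0) and assign each to an equal-width bin."""
    mucosal = obs[obs["mucosal_pseudospace"] >= 0].copy()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # labels=False returns integer bin indices: 0 = crypt base end, n_bins - 1 = top.
    mucosal["pseudospace_bin"] = pd.cut(
        mucosal["mucosal_pseudospace"], bins=edges, include_lowest=True, labels=False
    )
    return mucosal


# Made-up rows using the field names listed above.
demo_obs = pd.DataFrame(
    {
        "region": ["ile", "ile", "ce", "dcol"],
        "microbiome": ["WT", "GF", "WT", "GF"],
        "mucosal_pseudospace": [0.0, 0.5, -1.0, 1.0],
    },
    index=["0_0_1", "1_0_1", "2_0_1", "3_0_1"],
)
condition_table = cells_per_condition(demo_obs)
binned = bin_mucosal_pseudospace(demo_obs, n_bins=5)
print(condition_table)
print(binned)
```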

adata.var

This dataframe contains information on each gene.

index: gene name
n_cells: number of cells expressing this gene
description: description of gene
[receptor category]: boolean value indicating whether the gene belongs to the receptor category listed in the column name
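
The boolean receptor-category columns make it straightforward to pull out the genes in a given category. The following is a minimal sketch: the actual category column names are not listed in this README, so GPCR and nuclear_receptor below are hypothetical placeholders, as are the helper names, the cell counts, and the descriptions.

```python
import pandas as pd


def boolean_category_columns(var):
    """Names of all boolean columns, which here stand for receptor categories."""
    return var.select_dtypes(include="bool").columns.tolist()


def genes_in_category(var, category):
    """Gene names (index values) flagged True in the given category column."""
    return var.index[var[category].astype(bool)].tolist()


# Made-up table shaped like adata.var; the category column names are hypothetical.
demo_var = pd.DataFrame(
    {
        "n_cells": [1200, 300, 5000],
        "description": ["example gene A", "example gene B", "example gene C"],
        "GPCR": [True, False, False],
        "nuclear_receptor": [False, True, False],
    },
    index=["Ffar2", "Nr1h4", "Epcam"],
)
categories = boolean_category_columns(demo_var)
gpcr_genes = genes_in_category(demo_var, "GPCR")
print(categories, gpcr_genes)
```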

Additional Information

The adata structure also contains a series of additional information useful in interpreting the measurements. These include:

adata.uns['blank_names']: ordered names of blank barcodes
adata.obsm['blank_counts']: cell-by-gene matrix for blank barcode counts in cells
adata.uns['sequential_names']: ordered names of sequential smFISH genes
adata.obsm['sequential_intensities']: cell-by-gene matrix for average sequential smFISH intensities in cells
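
The ordered names in adata.uns pair with the columns of the corresponding adata.obsm matrices, so the two can be combined into a labeled table. The sketch below does this and also computes one commonly used summary of false-positive signal, the mean count per blank barcode relative to the mean count per gene. This metric is offered as an illustration and is not described in this README as the authors' quality-control procedure; the helper names and all demo values are made up, and the matrices are assumed to be dense numpy arrays.

```python
import numpy as np
import pandas as pd


def named_table(matrix, cell_ids, names):
    """Label an obsm-style matrix with cell IDs (rows) and ordered names (columns)."""
    return pd.DataFrame(np.asarray(matrix), index=list(cell_ids), columns=list(names))


def blank_to_gene_ratio(gene_counts, blank_counts):
    """Mean counts per blank barcode divided by mean counts per gene barcode."""
    gene_counts = np.asarray(gene_counts, dtype=float)
    blank_counts = np.asarray(blank_counts, dtype=float)
    mean_per_gene = gene_counts.sum() / gene_counts.shape[1]
    mean_per_blank = blank_counts.sum() / blank_counts.shape[1]
    return mean_per_blank / mean_per_gene


# Made-up stand-ins for adata.layers['raw_counts'], adata.obsm['blank_counts'],
# adata.obs_names, and adata.uns['blank_names'].
demo_gene_counts = np.array([[10, 0, 5], [5, 10, 0]])
demo_blank_counts = np.array([[1, 0], [0, 1]])
demo_cells = ["0_0_1", "0_0_2"]
demo_blank_names = ["Blank-1", "Blank-2"]

blank_table = named_table(demo_blank_counts, demo_cells, demo_blank_names)
ratio = blank_to_gene_ratio(demo_gene_counts, demo_blank_counts)
print(blank_table)
print(ratio)
```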

downstream_analysis/microbiota_metabolites

This folder and sub-folder contain results of the analysis of microbial metabolites that might activate specific cells based on their receptor expression.

microbiota_metabolite_receptor_interactions_input_dict.pkl

A python pickle file that contains the input dictionary to the drug2cell pipeline encoding microbiota metabolite - receptor interactions.

microbiota_metabolite_cell_interactions.h5ad

The microbiota metabolite - cell interaction scores (from drug2cell pipeline), stored in Scanpy anndata (adata.h5ad) format.

adata.X: microbiota metabolite - cell interaction scores (drug2cell)
adata.obs: same as adata.obs in cell_by_gene_with_metadata.h5ad (see above)
adata.var:
- index: microbiota metabolite names
- genes: target receptor gene names of the given microbiota metabolite (in MERFISH library)
- all_genes: target gene names of the given microbiota metabolite (all target genes)
- superclass, class, subclass, parent: chemical taxonomical information of the microbiota metabolite

downstream_analysis/receptor_ligand

This folder and sub-folder contain the results of a spatially informed receptor-ligand interaction analysis.

spatially_informed_receptor_ligand_interactions.csv

The results of a spatially informed receptor-ligand interaction analysis in csv format. The file contains the following fields:

region: gut region (ile: ileum, ce: cecum, pcol: proximal colon, dcol: distal colon) in which the analysis was performed
microbiome: microbiome state (SPF: specific-pathogen-free, GF: germ-free) in which the analysis was performed
spatial_neighborhood: anatomical region in which the analysis was performed
ligand_cell_type: cell type expressing the ligand
receptor_cell_type: cell type expressing the receptor
ligand: ligand gene (translated to human gene name)
receptor: receptor gene (translated to human gene name)
mean [CellPhoneDB]: CellPhoneDB interaction score
p_value [CellPhoneDB]: p-value of the interaction
FDR: indicates whether the p-value is below 0.05 after FDR correction
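
The fields above can be used to extract the significant interactions for one region and microbiome state. The sketch below is one possible approach and rests on several assumptions: that the CSV headers match the field names exactly, and that the FDR column parses as a boolean (for example True/False or 1/0). The helper name and all demo rows, including cell type and gene labels, are placeholders.

```python
import io

import pandas as pd


def significant_interactions(table, region, microbiome):
    """Rows passing FDR for one region and microbiome state, strongest score first."""
    mask = (
        (table["region"] == region)
        & (table["microbiome"] == microbiome)
        & table["FDR"].astype(bool)  # assumes FDR was parsed as boolean or 0/1
    )
    return table[mask].sort_values("mean [CellPhoneDB]", ascending=False)


# Synthetic stand-in for spatially_informed_receptor_ligand_interactions.csv.
demo_csv = io.StringIO(
    "region,microbiome,spatial_neighborhood,ligand_cell_type,receptor_cell_type,"
    "ligand,receptor,mean [CellPhoneDB],p_value [CellPhoneDB],FDR\n"
    "ile,SPF,mucosa,A,B,L1,R1,1.5,0.001,True\n"
    "ile,SPF,mucosa,A,C,L2,R2,2.5,0.002,True\n"
    "ile,SPF,mucosa,B,C,L3,R3,3.0,0.4,False\n"
    "ile,GF,mucosa,A,B,L1,R1,0.8,0.003,True\n"
)
demo_interactions = pd.read_csv(demo_csv)
hits = significant_interactions(demo_interactions, "ile", "SPF")
print(hits[["ligand", "receptor", "mean [CellPhoneDB]"]])
```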