MERFISH measurements of the mouse gastrointestinal tract in the presence and absence of the microbiome
Data files
Aug 05, 2025 version; files total 26.41 GB
- data_upload.zip (26.41 GB)
- README.md (7.94 KB)
Abstract
The mammalian gastrointestinal tract comprises a diverse set of cell types, each responsive to a diversity of diet-, host-, and microbe-derived small molecules. These small molecules are sensed by a massive diversity of receptors, and differential regional, spatial, and cellular expression of these receptors is a critical mode by which sensation is mediated. In parallel, the gut microbiome plays critical roles in modulating the homeostatic cellular, molecular, and spatial structure of the mammalian gut, including potential roles in shaping the receptors that are expressed and, thus, the molecular sensing capabilities of the gut. Here we provide a spatially resolved, single-cell atlas of gene expression across four regions of the murine gut in the presence and absence of the microbiome. These measurements provide a rich resource for understanding cell-type-specific sensation capabilities, the way in which cells fine-tune these capabilities across gut regions, and, on a smaller scale, how these capabilities are fine-tuned based on cellular location within the gut mucosa.
Rosalind J. Xu, Jeffrey R. Moffitt
Boston Children's Hospital, 2025
Description
This repository contains a variety of data from a MERFISH study of the molecular and cellular organization of the murine gastrointestinal tract in the presence of a specific-pathogen-free (SPF) microbiota or no microbiota (germ-free [GF]).
The repository includes metadata associated with the identity and location of all imaged RNAs, a model used for cell segmentation, gene expression and metadata associated with identified cells, and other downstream analyses.
These data are organized in the following folders.
transcripts
This folder contains files that describe metadata associated with all RNAs identified with MERFISH.
These metadata were derived either during the decoding of RNA barcodes or in the segmentation process.
transcript_metadata_raw.csv
This file contains the location, identity, and associated metadata of all measured RNAs prior to cell segmentation. Each entry is an RNA with properties provided by the following fields:
- dataset_ID: ID of dataset
- dataset_name: name of dataset
- slice_ID: ID of slice within the dataset
- slice_full_name: unique name of all slices in all datasets
- gene: gene identity of measured RNA
- fov_ID: ID of FOV in MERFISH measurement
- x [μm]: x position of RNA in μm
- y [μm]: y position of RNA in μm
- z [μm]: z position of RNA in μm
- area [pixel]: area of RNA in number of pixels
- brightness: average pixel brightness of RNA
- weighted_distance: brightness-weighted average distance between pixels in RNA and barcode
- qc_score: quality control score of RNA
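Because this file holds one row per detected RNA, it may be too large to load into memory at once. The following is a minimal sketch, assuming the file is a standard comma-separated table with a header row that uses the field names listed above. It streams the rows and counts RNAs per gene. The function name and the demo rows (including the gene names GeneA and GeneB) are hypothetical and not part of the dataset.

```python
import csv
import io
from collections import Counter


def count_transcripts_per_gene(handle):
    """Stream a transcript metadata CSV and count rows (RNAs) per gene."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[row["gene"]] += 1
    return counts


# Synthetic stand-in rows (NOT real data) with a subset of the documented columns.
demo_csv = io.StringIO(
    "dataset_ID,slice_ID,gene,fov_ID,x [μm],y [μm],z [μm]\n"
    "0,1,GeneA,12,10.5,20.1,1.5\n"
    "0,1,GeneB,12,11.0,19.8,3.0\n"
    "0,1,GeneA,13,55.2,40.7,1.5\n"
)
gene_counts = count_transcripts_per_gene(demo_csv)
print(gene_counts.most_common())
```

To run this on the real file, the in-memory buffer could be replaced with `open(path, newline="", encoding="utf-8")`, where `path` points to transcript_metadata_raw.csv inside the unpacked archive. The exact folder layout inside data_upload.zip is an assumption here.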
transcript_metadata_segmented.csv
This file contains the metadata for all measured RNAs as described above but includes metadata added during the segmentation process. Note that all RNAs associated with a blank barcode (false-positive controls) and the Neat1 RNA were removed during segmentation and are, thus, not included. In addition to the above fields, each RNA contains the following additional information:
- cell_ID: unique ID for all cells in dataset (format: datasetID_sliceID_cellID) from Baysor segmentation
- cellpose_prior_ID: unique ID associated with cells determined via Cellpose
- baysor_assignment_confidence: Baysor assignment probability of RNA to cell
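The segmentation fields above can be used to split the composite cell_ID and to keep only confidently assigned RNAs. The sketch below assumes that the three parts of cell_ID contain no underscores themselves and that each row arrives as a dictionary of strings (as produced by `csv.DictReader`). The helper names and the 0.9 confidence cutoff are illustrative choices, not values from the study, and how unassigned RNAs are encoded is not covered here.

```python
from typing import NamedTuple


class CellKey(NamedTuple):
    dataset_id: str
    slice_id: str
    cell_id: str


def parse_cell_id(cell_id: str) -> CellKey:
    """Split a datasetID_sliceID_cellID string into its three parts."""
    parts = cell_id.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected cell_ID format: {cell_id!r}")
    return CellKey(*parts)


def is_confident(row: dict, min_confidence: float = 0.9) -> bool:
    """True if the Baysor assignment probability meets the (hypothetical) cutoff."""
    return float(row["baysor_assignment_confidence"]) >= min_confidence


# Synthetic rows (NOT real data) standing in for transcript_metadata_segmented.csv.
demo_rows = [
    {"cell_ID": "0_1_42", "baysor_assignment_confidence": "0.97"},
    {"cell_ID": "0_1_43", "baysor_assignment_confidence": "0.35"},
]
kept = [parse_cell_id(r["cell_ID"]) for r in demo_rows if is_confident(r)]
print(kept)
```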
cellpose_model
This folder contains a file used in the initial cell segmentation process with Cellpose.
membrane_IF_model.cellpose
In-house trained Cellpose model for Na+/K+-ATPase membrane immunofluorescence in murine small and large intestine.
cell_by_gene
This folder contains the final single-cell analysis file generated with the Scanpy pipeline.
cell_by_gene_with_metadata.h5ad
Cell-by-gene matrix with associated metadata for all datasets, stored in the Scanpy anndata (adata.h5ad) format. This object contains only the cells and genes that passed all quality thresholds. It contains the following objects:
adata.X
This numpy matrix contains the normalized and log-transformed cell-by-gene matrix.
adata.layers['raw_counts']
This numpy matrix contains the total counts for each RNA in each cell.
adata.obs
This pandas dataframe contains a series of metadata associated with each cell. This includes the following fields:
- index: cell_ID (format: datasetID_sliceID_cellID)
- cell_ID: the id for the cell within each dataset
- cell_name: unique name for all cells in all datasets (format: dataset_slice_cell)
- dataset_ID: ID of dataset
- dataset_name: name of dataset
- slice_ID: ID of slice within the dataset
- slice_full_name: unique name associated with the slice in which the cell was imaged
- fov_ID: ID of FOV in MERFISH measurement
- region: gut region (ile = ileum, ce = cecum, pcol = proximal colon, dcol = distal colon)
- microbiome: microbiome condition (WT = specific-pathogen-free, GF = germ-free)
- condition: condition of measurement (format: region_microbiome)
- x [μm]: x position of cell centroid in μm
- y [μm]: y position of cell centroid in μm
- z [μm]: z position of cell centroid in μm
- area [μm²]: area of cell in μm²
- n_counts: number of transcripts in cell
- n_genes: number of unique genes in cell
- doublet_score [Scrublet]: Scrublet doublet score for cell
- cell_class: major cell class identifier based on first tier clustering
- cell_type: cell type identifier based on multi-tier clustering
- anatomical_layer: Gut anatomical layer based on spatial neighborhood analysis
- mucosal_pseudospace: mucosal pseudospace position (0 = crypt base, 1 = top). Cells not in the mucosa have a value of -1.
- umap_coords_x [all]: UMAP coordinates for all-cell embedding, x
- umap_coords_y [all]: UMAP coordinates for all-cell embedding, y
- umap_coords_x [cell_class]: UMAP coordinates for cell class-specific embedding, x
- umap_coords_y [cell_class]: UMAP coordinates for cell class-specific embedding, y
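The mucosal_pseudospace field mixes a continuous position (0 to 1) with a sentinel value (-1) for cells outside the mucosa, so the sentinel needs to be handled before any binning or averaging. The following sketch shows one way to do this on plain values. The function name and the choice of five equal-width bins are assumptions made for illustration only.

```python
def pseudospace_bin(value: float, n_bins: int = 5):
    """Map a mucosal_pseudospace value to a bin index in 0..n_bins-1.

    Returns None for the documented sentinel -1 (cell not in the mucosa).
    """
    if value == -1:
        return None
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"pseudospace value out of range: {value}")
    # Clamp so that a value of exactly 1.0 falls into the last bin.
    return min(int(value * n_bins), n_bins - 1)


# Synthetic values (NOT real data): one non-mucosal cell, then crypt base to top.
demo_values = [-1, 0.0, 0.5, 0.99, 1.0]
demo_bins = [pseudospace_bin(v) for v in demo_values]
print(demo_bins)
```

The same function could be applied to the corresponding column of adata.obs once the h5ad file is loaded; rows that return None would then be dropped before any per-bin summary.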
adata.var
This dataframe contains information on each gene.
- index: gene name
- n_cells: number of cells expressing this gene
- description: description of gene
- [receptor category]: boolean value indicating whether the gene belongs to the receptor category listed in the column name
Additional Information
The adata structure also contains a series of additional information useful in interpreting the measurements. These include:
- adata.uns['blank_names']: ordered names of blank barcodes
- adata.obsm['blank_counts']: cell-by-gene matrix for blank barcode counts in cells
- adata.uns['sequential_names']: ordered names of sequential smFISH genes
- adata.obsm['sequential_intensities']: cell-by-gene matrix for average sequential smFISH intensities in cells
downstream_analysis/microbiota_metabolites
This folder and sub-folder contain results of the analysis of microbial metabolites that might activate specific cells based on their receptor expression.
microbiota_metabolite_receptor_interactions_input_dict.pkl
A python pickle file that contains the input dictionary to the drug2cell pipeline encoding microbiota metabolite - receptor interactions.
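A pickle file can be inspected with the Python standard library before it is passed to any pipeline. The sketch below loads a pickled dictionary and reports its size and a few top-level keys. The internal layout of the real dictionary is not described here, so the demo dictionary (metabolite names mapped to lists of receptor genes) is purely a made-up placeholder, as is the function name. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import io
import pickle


def summarize_pickled_dict(handle, n_examples: int = 5) -> dict:
    """Load a pickled dictionary and return its size and a few top-level keys."""
    obj = pickle.load(handle)
    if not isinstance(obj, dict):
        raise TypeError(f"expected a dict, got {type(obj).__name__}")
    return {"n_keys": len(obj), "example_keys": list(obj)[:n_examples]}


# Made-up placeholder content (NOT the real file layout), pickled in memory.
placeholder = {"metabolite_A": ["ReceptorX"], "metabolite_B": ["ReceptorY", "ReceptorZ"]}
buffer = io.BytesIO(pickle.dumps(placeholder))
summary = summarize_pickled_dict(buffer)
print(summary)
```

For the real file, the buffer would be replaced by the file opened in binary mode (`open(path, "rb")`).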
microbiota_metabolite_cell_interactions.h5ad
The microbiota metabolite - cell interaction scores (from drug2cell pipeline), stored in Scanpy anndata (adata.h5ad) format.
- adata.X: microbiota metabolite - cell interaction scores (drug2cell)
- adata.obs: same as adata.obs in cell_by_gene_with_metadata.h5ad (see above)
- adata.var:
  - index: microbiota metabolite names
  - genes: target receptor gene names of the given microbiota metabolite (in MERFISH library)
  - all_genes: target gene names of the given microbiota metabolite (all target genes)
  - superclass, class, subclass, parent: chemical taxonomical information of the microbiota metabolite
downstream_analysis/receptor_ligand
This folder and sub-folder contain the results of a spatially informed receptor-ligand analysis.
spatially_informed_receptor_ligand_interactions.csv
The results of a spatially informed receptor ligand interaction analysis in csv format. The file contains the following fields:
- region: gut region (ile: ileum, ce: cecum, pcol: prox. colon, dcol: dist. colon) in which the analysis was performed
- microbiome: microbiome state (SPF: specific-pathogen-free, GF: germ-free) in which the analysis was performed
- spatial_neighborhood: anatomical region in which the analysis was performed
- ligand_cell_type: cell type expressing the ligand
- receptor_cell_type: cell type expressing the receptor
- ligand: ligand gene (translated to human gene name)
- receptor: receptor gene (translated to human gene name)
- mean [CellPhoneDB]: CellPhoneDB interaction score
- p_value [CellPhoneDB]: p-value of the interaction
- FDR: indicates whether the p-value is below 0.05 after FDR correction
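The fields above lend themselves to simple filtering, for example keeping only the interactions flagged by the FDR column within one region and microbiome state. The sketch below assumes a comma-separated file with a header row using the field names listed above. How the FDR flag is written in the file (for example True/False or 1/0) is not stated here, so the parser accepts several common spellings. The function name and all demo values are hypothetical.

```python
import csv
import io

# Spellings treated as "passes FDR correction"; the real encoding is an assumption.
TRUTHY = {"true", "1", "yes"}


def significant_interactions(handle, region=None, microbiome=None):
    """Return rows flagged in the FDR column, optionally restricted by condition."""
    hits = []
    for row in csv.DictReader(handle):
        if row["FDR"].strip().lower() not in TRUTHY:
            continue
        if region is not None and row["region"] != region:
            continue
        if microbiome is not None and row["microbiome"] != microbiome:
            continue
        hits.append(row)
    return hits


# Synthetic rows (NOT real data) using the documented column names.
demo_csv = io.StringIO(
    "region,microbiome,spatial_neighborhood,ligand_cell_type,receptor_cell_type,"
    "ligand,receptor,mean [CellPhoneDB],p_value [CellPhoneDB],FDR\n"
    "ile,SPF,layer_1,type_A,type_B,LIG1,REC1,1.20,0.001,True\n"
    "ile,GF,layer_1,type_A,type_B,LIG1,REC1,0.40,0.300,False\n"
    "dcol,SPF,layer_2,type_C,type_D,LIG2,REC2,0.90,0.004,True\n"
)
hits = significant_interactions(demo_csv, region="ile", microbiome="SPF")
print([(h["ligand"], h["receptor"]) for h in hits])
```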
