Bacterial-MERFISH measurements of E. coli in various media and B. thetaiotaomicron in the mouse colon

Moffitt, Jeffrey; Sarfatis, Ari; Wang, Yuanyou; Twumasi-Ankrah, Nana

Published Nov 20, 2024 on Dryad. https://doi.org/10.5061/dryad.n5tb2rc4d

Data files

Nov 20, 2024 version files (3.30 GB total)

bacterial_merfish_data.zip (3.30 GB)
README.md (12.12 KB)

Abstract

Single-cell decisions made in complex environments underlie many bacterial phenomena. Image-based transcriptomics approaches offer an avenue to study such behaviors, yet these approaches have been hindered by the massive density of bacterial mRNA. To overcome this challenge, we combine 1000-fold volumetric expansion with multiplexed error-robust fluorescence in situ hybridization (MERFISH) to create bacterial-MERFISH. This method enables high-throughput, spatially resolved profiling of thousands of operons within individual bacteria. Using bacterial-MERFISH, we dissect the response of E. coli to carbon starvation, systematically map subcellular RNA organization, and chart the adaptation of a gut commensal, B. thetaiotaomicron, to micron-scale niches in the mammalian colon.

This deposition contains raw data and final analysis structures associated with the benchmarking of bacterial-MERFISH in E. coli in LB medium, profiling of E. coli through a glucose-xylose diauxic shift, mapping of the internal organization of the E. coli transcriptome in LB, and measurements of the spatial-niche-adaptation of B. theta to the mouse colon.

https://doi.org/10.5061/dryad.n5tb2rc4d
Ari Sarfatis, Yuanyou Wang, Nana Twumasi-Ankrah, Jeffrey R. Moffitt
Boston Children's Hospital, Harvard Medical School, 2024

Description of files and file structure

ecoli_barcode_metadata/*_ecoli_barcode_metadata.csv

The decoded MERFISH read metadata for 97-, 1057-, or 1930-operon measurements of E. coli grown in LB or the glucose-xylose minimal media used for diauxic shift experiments.
This folder contains measurements from 12 different experiments, each stored in its own CSV file.
The name of the file contains a description of the experiment.
For example, '50X-expansion_97-operons_LB_rep1_ecoli_barcode_metadata' corresponds to the first replicate (rep1) of a 97-operon measurement in E. coli grown in LB and expanded with the 50X protocol.
One E. coli experiment did not include DNase treatment and is labeled 'noDNase'.
In addition, one replicate (rep1) of the diauxic shift measurement had two MERFISH measurements collected from the same biological sample.
These are labeled 'technical1' and 'technical2'.
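The naming convention above can be unpacked programmatically. The following is a minimal sketch that assumes every file stem uses the underscore-separated token pattern of the single example given here; `parse_dataset_name` is a hypothetical helper, not part of the deposition, and tokens it does not recognize (growth medium, 'noDNase', 'technical1', and so on) are simply collected under `other`.

```python
import re


def parse_dataset_name(stem: str) -> dict:
    """Split a file stem into labelled parts.

    Assumption: tokens are separated by underscores and follow the pattern of
    '50X-expansion_97-operons_LB_rep1_ecoli_barcode_metadata'. Other files in
    the folder may deviate from this pattern.
    """
    suffix = "_ecoli_barcode_metadata"
    if stem.endswith(suffix):
        stem = stem[: -len(suffix)]

    info = {"expansion": None, "n_operons": None, "replicate": None, "other": []}
    for token in stem.split("_"):
        m_exp = re.fullmatch(r"(\d+)X-expansion", token)
        m_ops = re.fullmatch(r"(\d+)-operons", token)
        m_rep = re.fullmatch(r"rep(\d+)", token)
        if m_exp:
            info["expansion"] = int(m_exp.group(1))
        elif m_ops:
            info["n_operons"] = int(m_ops.group(1))
        elif m_rep:
            info["replicate"] = int(m_rep.group(1))
        else:
            # Unrecognized tokens, e.g. the medium or a 'noDNase' label.
            info["other"].append(token)
    return info


example = parse_dataset_name(
    "50X-expansion_97-operons_LB_rep1_ecoli_barcode_metadata"
)
```

For the example stem, `example` holds the expansion factor, library size, and replicate number as integers, with the medium label left in `other`.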

Description of the columns:

operon_name: (str) The name of the operon targeted
barcode_id: (int) A numeric id for the operon targeted
fov_id: (int) An identifier for the field of view (FOV) containing the decoded read
log10_average_magnitude: (float, in arbitrary units) The log10 of the average pixel intensity of the decoded read
weighted_barcode_distance: (float, in arbitrary units) The Euclidean distance between the intensity profile for the measured barcode and that of the matched barcode
pix_area: (int) The number of contiguous pixels assigned to the decoded read
pix_position_1: (int, in pixels) The first pixel coordinate of the decoded read's centroid within the imaged FOV
pix_position_2: (int, in pixels) The second pixel coordinate of the decoded read's centroid within the imaged FOV
pix_position_3: (int) The index of the z-plane of the decoded read's centroid
abs_position_1: (float, in microns) The first absolute coordinate of the decoded read's centroid
abs_position_2: (float, in microns) The second absolute coordinate of the decoded read's centroid
abs_position_3: (float, in microns) The third absolute coordinate of the decoded read's centroid
abs_dist_to_nn_mask: (float, in microns) The distance to the mask of the nearest cell
cellid: (str) A unique identifier for the segmented cell mask to which the read was assigned.
condition: (str) An identifier of the sampling condition for the cell. A value of 'lb' indicates that the E. coli cells were grown in LB media. For cells grown in glucose-xylose diauxic media, this column indicates the collection point (gluX, shiftX, xylX, or statX, where X is an integer from 1 to 4).
dataset: (str) A dataset identifier, which indicates the expansion factor, MERFISH library, growth media, and replicate
solidity: (float) The solidity of the segmented cell mask (total volume / convex hull volume)
whole_cell_good_fov: (bool) Indicates if the cell was identified as imaged in its entirety and not imaged in a problematic FOV. This column is only provided for LB samples

ecoli_raw_counts_LB/raw_count_matrix_LB_*.h5ad

Raw count matrices in anndata format for 50X- and 1000X-expanded E. coli cells grown in LB.
Three files are provided with one each for measurements of 97-, 1057-, or 1930-operons.
Any column in the obs dataframe that shares a name with a column in the RNA metadata described above is defined as described above.
Note: The cellid identifiers are only unique for each dataset (as defined above) and are included in these structures to allow cells within these structures to be linked (with their dataset id) to the RNA metadata files above.

Description of fields and columns:

X: The raw count matrix. Operons with 0 observations are omitted
var:
- name: (str) The target operon name
obs:
- valid_counts: (int) The total reads in the cell excluding blanks
- blank_counts: (int) The total blank counts in the cell
- valid_rna_species: (int) The number of different transcripts (not including blanks) with at least one count
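Because cellid values are only unique within a dataset, any link between these count matrices and the read-level metadata needs a composite key. The sketch below tallies reads per operon for each (dataset, cellid) pair. `counts_per_cell` is a hypothetical helper; it assumes every row carries the documented dataset, cellid, and operon_name columns, and it makes no attempt to separate blank barcodes from valid ones, since how blanks are named is not specified here.

```python
from collections import Counter, defaultdict


def counts_per_cell(reads):
    """Tally reads per operon for each cell, keyed by (dataset, cellid).

    `reads` is any iterable of dict-like rows carrying the documented
    'dataset', 'cellid', and 'operon_name' columns.
    """
    table = defaultdict(Counter)
    for read in reads:
        # cellid alone is not unique across datasets, so pair it with dataset.
        key = (read["dataset"], read["cellid"])
        table[key][read["operon_name"]] += 1
    return table
```

Keying on the pair rather than on cellid alone keeps two cells that share an identifier in different datasets from being merged.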

ecoli_processed_counts_diauxic/processed_count_matrix_diauxic_full.h5ad

Processed count matrices in anndata format for 50X-expanded E. coli cells grown in glucose-xylose minimal defined medium.
This analysis object was created with scanpy.
Any column in the obs dataframe that shares a name with a column in the RNA metadata described above is defined as described above.

Description of fields and columns:

X: The processed count matrix (counts per cell normalized to 100, with a pseudocount added, log transformed with the natural base, and z-scored)
var:
- name: (str) The target operon name
- short_name: (str) A shorter version of the target operon name
- barcode: (str) The barcode associated with that operon
- n_cells: (int) The number of cells in which the operon is detected
- mean: (float) The average expression of the target operon across all cells
uns: The preprocessing parameters, nearest neighbors, and differential gene expression results logged by scanpy
obsm:
- X_pca: The PCA projection of X
- X_pca_harmony: The batch-corrected (harmonized) PCA projection of X
- X_umap_harmony: The UMAP representation of the harmonized data
obsp: The pairwise connectivities and distances computed by scanpy
layers:
- raw: The raw counts per cell
- dge: The counts per cell normalized to 100, with a pseudocount added, and log transformed with base 2; used for differential gene expression analysis
obs:
- biological_replicate: (int) The number of the biological replicate from which the cell was taken
- OD600: (float) The optical density (OD) at 600nm at which the cell was collected
- batch: (int) A unique numeric identifier for each MERFISH measurement (i.e., biological or technical replicate)
- unique_cellid: (str) A unique identifier for each cell
- leiden_clusters_low_res: (str) The label associated with the major growth (G) or non-growth (NG) cluster to which the cell was assigned
- leiden_clusters: (str) The label associated with the final cluster to which each cell was assigned. A label that starts with 'excluded' indicates a cluster excluded in subsequent analysis
- n_reads_raw: (float) The number of transcript counts within that cell
- PC2_3_ratio: (float) The ratio between the eigenvalues of PC2 and PC3 associated with the shape of the cell
- PC1_volume_ratio: (float) The ratio between the eigenvalue of PC1 and the segmentation mask volume
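The transformations described for X and for the dge layer can be written out step by step. The following plain-Python sketch is illustrative only: the pseudocount value (1), z-scoring each operon across cells, and the use of the population standard deviation are all assumptions, since none of these details are specified here. The function names are hypothetical, and the deposited matrices were produced with scanpy rather than with this code.

```python
import math


def normalize_counts(raw, target=100.0, pseudocount=1.0, log=math.log):
    """Scale each cell (row) to `target` total counts, add a pseudocount, take the log.

    Assumptions: pseudocount of 1; cells with zero total counts are left at zero
    before the pseudocount is added.
    """
    out = []
    for cell in raw:
        total = sum(cell)
        scale = target / total if total else 0.0
        out.append([log(count * scale + pseudocount) for count in cell])
    return out


def zscore_columns(matrix):
    """Z-score each operon (column) across cells, using the population SD (assumed)."""
    n_cells = len(matrix)
    columns = list(zip(*matrix))
    means = [sum(col) / n_cells for col in columns]
    sds = [
        math.sqrt(sum((value - mean) ** 2 for value in col) / n_cells)
        for col, mean in zip(columns, means)
    ]
    return [
        [(value - mean) / sd if sd > 0 else 0.0
         for value, mean, sd in zip(row, means, sds)]
        for row in matrix
    ]


raw = [[1, 3], [2, 2]]                            # two cells x two operons (made up)
x_like = zscore_columns(normalize_counts(raw))    # natural log, then z-score
dge_like = normalize_counts(raw, log=math.log2)   # base-2 log, no z-score
```

Here `x_like` mirrors the description of X and `dge_like` mirrors the description of the dge layer, under the assumptions stated above.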

ecoli_processed_counts_diauxic/processed_count_matrix_diauxic_subset-glu-shift.h5ad

Processed count matrices in anndata format for 50X-expanded E. coli cells grown in glucose-xylose diauxic media sampled between glu1 and shift4.
This anndata structure was used for pseudotime analysis and was created by scanpy.

This structure is a subset of the processed_count_matrix_diauxic_full.h5ad structure.
All shared fields and columns are defined as above.
The novel fields and columns produced in this analysis are described below.

Description of fields and columns:

uns:
- iroot: The index of the root cell for the diffusion pseudotime calculation
obsm:
- X_diffmap: The diffusion maps for this cell subset
obs:
- DC1: (float) The value of the first diffusion component for each cell
- dpt_pseudotime: (float) The diffusion pseudotime computed with scanpy with a root cell set in cluster N10
- adj_pseudotime: (float) The adjusted diffusion pseudotime to more equally weight the pseudotime values across this shift

ecoli_intracellular_analyses/intracellular_distributions_and_reads_*.h5

Intracellular RNA localization analysis in HDF5 format for 50X- and 1000X-expanded E. coli.
Minimal Python code to load these HDF5 files is provided in ecoli_intracellular_analyses/HDF5_to_KDEdict.py.
Any column not defined below is defined as above in the ecoli barcode metadata section.

Description of fields and columns:

reads: The filtered reads used for the subcellular localization analyses as a pandas DataFrame
- axial_intracell_coord: (float) The axial intracellular coordinate of the read
- radial_intracell_coord: (float) The radial intracellular coordinate of the read
- radial_intracell_coord_squared: (float) The squared radial intracellular coordinate of the read
counts: A dictionary indicating the total number of observations per mRNA target
eval_kdes: {str, numpy array} A dictionary providing the evaluated 2D Kernel Density Estimation (KDE) of each mRNA target
xx: (float) The x coordinates where the KDE was evaluated in the normalized coordinate frame
yy: (float) The y coordinates where the KDE was evaluated in the normalized coordinate frame

ecoli_intracellular_analyses/intracellular_clusters.h5ad

An anndata structure containing the E. coli intracellular RNA localization KDEs, clustered with Leiden clustering.
Any column or field not defined below is defined in the description of ecoli_processed_counts_diauxic/processed_count_matrix_diauxic_full.h5ad.

Description of fields and columns:

X: The values of the flattened KDEs
obs:
- operon_name: (str) The name of the operon
- leiden_clusters: (str) The intracellular localization cluster labels
- "PSORTdb: [Annotation]": (bool) True if any of the genes in the operon are annotated with the listed [Annotation] in PSORTdb
- genomic_position_angle: (float) The chromosomal position of the operon in degrees (as defined by Ecocyc)

btheta_barcode_metadata/*_barcode_metadata.csv

The decoded MERFISH read metadata for Bacteroides thetaiotaomicron (B. theta).
The name of each file describes the fixation method (PLP or Methacarn) and a unique replicate number for each sample.
Any column not defined below is defined in the ecoli_barcode_metadata section.

Description of the columns:
- dist: (float, in microns) The minimum distance of the read to the host epithelium
- n_id: (str) A unique identifier for the expression patch to which each mRNA was assigned. 'dropped' indicates that the RNA was not included in a patch
- total_magnitude: (float, in arbitrary units) The sum of the Euclidean norm of all pixel intensities across imaging rounds associated with this read

btheta_patches/btheta_data.h5ad

Processed count matrices in anndata format for B. theta patches.
This anndata structure was created by scanpy.

Description of fields and columns:

X: The processed count matrix (a pseudocount was added and expression was log-transformed with the natural base)
var: Operon names
uns: The preprocessing parameters, nearest neighbors, diffusion map, and UMAP results logged by scanpy
obsp: The pairwise connectivities and distances computed by scanpy
obsm:
- X_pca: The PCA projection of X
- X_raw: The raw untransformed counts
obs:
- abs_position_1: (float, in microns) The first absolute coordinate of a given spatial patch's centroid
- abs_position_2: (float, in microns) The second absolute coordinate of a given spatial patch's centroid
- dist: (float, in microns) The minimum Euclidean distance from the host epithelium to the patch
- sqrtdist: (float, in sqrt(microns)) The square root transformation of 'dist'
- datasets: (str) A unique identifier for the dataset in which the patch was measured
- DC1: (float) The values assigned to patches for the first diffusion component
- DC2: (float) The values assigned to patches for the second diffusion component
- dcbins: (str) DC1 values binned into three categories (i.e. low, mid, and high).
- UMAP_1: (float) The first coordinate of a UMAP of patches
- UMAP_2: (float) The second coordinate of a UMAP of patches
- n_id: (str) A unique id for the patch, matching the n_id field in the barcode_metadata csv file
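Since n_id links each patch to the reads in the barcode metadata CSV files, per-patch read tallies can be rebuilt from the read-level tables. The sketch below groups reads by n_id and skips those marked 'dropped'. `reads_per_patch` is a hypothetical helper, and it assumes that n_id values are unique across all B. theta samples; if they are only unique within a sample, the key would need to include a sample identifier as well.

```python
from collections import Counter, defaultdict


def reads_per_patch(reads):
    """Tally reads per operon for each expression patch.

    `reads` is any iterable of dict-like rows carrying the documented
    'n_id' and 'operon_name' columns. Reads whose n_id is 'dropped'
    were not assigned to a patch and are skipped.
    """
    patches = defaultdict(Counter)
    for read in reads:
        if read["n_id"] == "dropped":
            continue
        patches[read["n_id"]][read["operon_name"]] += 1
    return patches
```

The resulting per-patch tallies can then be compared with the raw counts stored in the anndata structure for the same n_id.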

ecoli_intracellular_analyses/HDF5_to_KDEdict.py

Contains the HDF5_to_KDE_datadict function, which can be used to load the h5 files provided for the intracellular RNA distribution analysis.

Methods