Bacterial-MERFISH measurements of E. coli in various media and B. thetaiotaomicron in the mouse colon
Data files
Nov 20, 2024 version files (3.30 GB total)
- bacterial_merfish_data.zip (3.30 GB)
- README.md (12.12 KB)
Abstract
Single-cell decisions made in complex environments underlie many bacterial phenomena. Image-based transcriptomics approaches offer an avenue to study such behaviors, yet these approaches have been hindered by the massive density of bacterial mRNA. To overcome this challenge, we combine 1000-fold volumetric expansion with multiplexed error-robust fluorescence in situ hybridization (MERFISH) to create bacterial-MERFISH. This method enables high-throughput, spatially resolved profiling of thousands of operons within individual bacteria. Using bacterial-MERFISH, we dissect the response of E. coli to carbon starvation, systematically map subcellular RNA organization, and chart the adaptation of a gut commensal, B. thetaiotaomicron, to micron-scale niches in the mammalian colon.
This deposition contains raw data and final analysis structures associated with the benchmarking of bacterial-MERFISH in E. coli in LB medium, profiling of E. coli through a glucose-xylose diauxic shift, mapping of the internal organization of the E. coli transcriptome in LB, and measurements of the spatial niche adaptation of B. theta to the mouse colon.
https://doi.org/10.5061/dryad.n5tb2rc4d
Ari Sarfatis, Yuanyou Wang, Nana Twumasi-Ankrah, Jeffrey R. Moffitt
Boston Children's Hospital, Harvard Medical School, 2024
Description of files and file structure
ecoli_barcode_metadata/*_ecoli_barcode_metadata.csv
The decoded MERFISH read metadata for 97-, 1057-, or 1930-operon measurements of E. coli grown in LB or the glucose-xylose minimal media used for diauxic shift experiments.
This folder contains measurements from 12 different experiments, each in its own CSV file.
The name of the file contains a description of the experiment.
For example, '50X-expansion_97-operons_LB_rep1_ecoli_barcode_metadata' corresponds to the first replicate (rep1) of a 97-operon measurement in E. coli grown in LB and expanded with the 50X protocol.
One E. coli experiment did not include DNase treatment and is labeled 'noDNase'.
In addition, one replicate (rep1) of the diauxic shift measurement had two MERFISH measurements collected from the same biological sample.
These are labeled 'technical1' and 'technical2'.
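The naming convention above can be unpacked programmatically. The following is a minimal sketch, not part of the deposition: `parse_dataset_name` is a hypothetical helper, and the token patterns are inferred from the single example name given above. Other file names (for instance those carrying 'noDNase', 'technical1', or 'technical2') may order or spell their fields differently, so unrecognized tokens are simply collected rather than interpreted.

```python
import re

# Suffix shared by the E. coli read-metadata files, per the example name above.
SUFFIX = "_ecoli_barcode_metadata"


def parse_dataset_name(stem: str) -> dict:
    """Split a file stem into descriptive fields (hypothetical helper)."""
    if stem.endswith(SUFFIX):
        stem = stem[: -len(SUFFIX)]
    fields = {"expansion": None, "n_operons": None, "replicate": None, "other": []}
    for token in stem.split("_"):
        if m := re.fullmatch(r"(\d+X)-expansion", token):
            fields["expansion"] = m.group(1)
        elif m := re.fullmatch(r"(\d+)-operons", token):
            fields["n_operons"] = int(m.group(1))
        elif m := re.fullmatch(r"rep(\d+)", token):
            fields["replicate"] = int(m.group(1))
        else:
            # Medium, 'noDNase', 'technical1', etc. are kept verbatim.
            fields["other"].append(token)
    return fields


example = parse_dataset_name("50X-expansion_97-operons_LB_rep1_ecoli_barcode_metadata")
```

Collecting unknown tokens in `other` keeps the helper from silently misreading names that do not follow the example's layout.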
Description of the columns:
- operon_name: (str) The name of the operon targeted
- barcode_id: (int) A numeric id for the operon targeted
- fov_id: (int) An identifier for the field of view (FOV) containing the decoded read
- log10_average_magnitude: (float, in arbitrary units) The log10 of the average pixel intensity of the decoded read
- weighted_barcode_distance: (float, in arbitrary units) The Euclidean distance between the intensity profile for the measured barcode and that of the matched barcode
- pix_area: (int) The number of contiguous pixels assigned to the decoded read
- pix_position_1: (int, in pixels) The first pixel coordinate of the decoded read's centroid within the imaged FOV
- pix_position_2: (int, in pixels) The second pixel coordinate of the decoded read's centroid within the imaged FOV
- pix_position_3: (int) The index of the z-plane of the decoded read's centroid
- abs_position_1: (float, in microns) The first absolute coordinate of the decoded read's centroid
- abs_position_2: (float, in microns) The second absolute coordinate of the decoded read's centroid
- abs_position_3: (float, in microns) The third absolute coordinate of the decoded read's centroid
- abs_dist_to_nn_mask: (float, in microns) The distance to the mask of the nearest cell
- cellid: (str) A unique identifier for the segmented cell mask to which the read was assigned.
- condition: (str) An identifier of the sampling condition for the cell. A value of 'lb' indicates E. coli cells were grown in LB media. For cells grown in glucose-xylose diauxic media, this column indicates the collection point (gluX, shiftX, xylX, or statX, with X an integer from 1 to 4)
- dataset: (str) A dataset identifier, which indicates the expansion factor, MERFISH library, growth media, and replicate
- solidity: (float) The solidity of the segmented cell mask (total volume / convex hull volume)
- whole_cell_good_fov: (bool) Indicates if the cell was identified as imaged in its entirety and not imaged in a problematic FOV. This column is only provided for LB samples
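Because each row of these CSV files is one decoded read, a per-cell count table can be rebuilt by cross-tabulating `cellid` against `operon_name`. The sketch below uses a tiny made-up table (the cell ids and operon names are placeholders, not values from the deposition) and assumes that a single file holds a single dataset, so that `cellid` alone identifies a cell.

```python
import pandas as pd

# For a real file one would start from something like:
#   reads = pd.read_csv("ecoli_barcode_metadata/<experiment>_ecoli_barcode_metadata.csv")
# Here a toy table with the same two columns stands in for it.
reads = pd.DataFrame(
    {
        "cellid": ["c1", "c1", "c1", "c2"],
        "operon_name": ["opA", "opA", "opB", "opA"],
    }
)

# Rows are cells, columns are operons, entries are read counts.
counts = pd.crosstab(reads["cellid"], reads["operon_name"])
```

Quality columns such as `weighted_barcode_distance` or `pix_area` could be used to filter `reads` before cross-tabulating; the appropriate thresholds are described in the associated publication, not here.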
ecoli_raw_counts_LB/raw_count_matrix_LB_*.h5ad
Raw count matrices in anndata format for 50X- and 1000X-expanded E. coli cells grown in LB.
Three files are provided, one each for the 97-, 1057-, and 1930-operon measurements.
Any column in the obs dataframe that shares a name with a column in the RNA metadata described above is defined as described above.
Note: The cellid identifiers are only unique for each dataset (as defined above) and are included in these structures to allow cells within these structures to be linked (with their dataset id) to the RNA metadata files above.
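Since `cellid` is only unique within a dataset, any join between the `obs` table and the read metadata should use the pair (`dataset`, `cellid`) as its key. A minimal sketch with made-up values follows; the dataset labels `d1` and `d2` are placeholders, and `obs` here is a plain DataFrame standing in for the `obs` table of a loaded anndata object.

```python
import pandas as pd

# Two cells that share cellid "7" but come from different datasets.
obs = pd.DataFrame(
    {"dataset": ["d1", "d2"], "cellid": ["7", "7"], "valid_counts": [10, 4]}
)
reads = pd.DataFrame(
    {
        "dataset": ["d1", "d1", "d2"],
        "cellid": ["7", "7", "7"],
        "operon_name": ["opA", "opB", "opA"],
    }
)

# Join on the composite key; validate= raises if obs has duplicate keys.
linked = reads.merge(
    obs, on=["dataset", "cellid"], how="left", validate="many_to_one"
)
```

Joining on `cellid` alone would wrongly attach both cells' annotations to every read in this toy case.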
Description of fields and columns:
- X: The raw count matrix. Operons with 0 observations are omitted
- var:
  - name: (str) The target operon name
- obs:
  - valid_counts: (int) The total reads in the cell excluding blanks
  - blank_counts: (int) The total blank counts in the cell
  - valid_rna_species: (int) The number of different transcripts (not including blanks) with at least one count
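The three `obs` summaries follow directly from a count matrix once the blank barcodes are known. The sketch below illustrates the definitions on a toy matrix; how blanks are named or stored in these files is not stated here, so the boolean mask `is_blank` is an assumption supplied by the caller, and `summarize_cells` is a hypothetical helper.

```python
import numpy as np


def summarize_cells(counts, is_blank):
    """Per-cell totals from a cells x barcodes count matrix (hypothetical helper)."""
    counts = np.asarray(counts)
    is_blank = np.asarray(is_blank, dtype=bool)
    valid = counts[:, ~is_blank]  # columns for real operons only
    return {
        "valid_counts": valid.sum(axis=1),
        "blank_counts": counts[:, is_blank].sum(axis=1),
        "valid_rna_species": (valid > 0).sum(axis=1),
    }


# Toy data: 2 cells x 4 barcodes, with the third barcode treated as a blank.
toy_counts = np.array([[3, 0, 1, 2], [0, 0, 0, 5]])
toy_is_blank = [False, False, True, False]
summary = summarize_cells(toy_counts, toy_is_blank)
```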
ecoli_processed_counts_diauxic/processed_count_matrix_diauxic_full.h5ad
Processed count matrices in anndata format for 50X-expanded E. coli cells grown in glucose-xylose minimal defined medium.
This analysis object was created with scanpy.
Any column in the obs dataframe that shares a name with a column in the RNA metadata described above is defined as described above.
Description of fields and columns:
- X: The processed count matrix (counts per cell normalized to 100, with a pseudocount added, log transformed with the natural base, and z-scored)
- var:
  - name: (str) The target operon name
  - short_name: (str) A shorter version of the target operon name
  - barcode: (str) The barcode associated with that operon
  - n_cells: (int) The number of cells in which the operon is detected
  - mean: (float) The average expression of the target operon across all cells
- uns: The preprocessing parameters, nearest neighbors, and differential gene expression results logged by scanpy
- obsm:
  - X_pca: The PCA projection of X
  - X_pca_harmony: The batch-corrected (harmonized) PCA projection of X
  - X_umap_harmony: The UMAP representation of the harmonized data
- obsp: The pairwise connectivities and distances computed by scanpy
- layers:
  - raw: The raw counts per cell
  - dge: The counts per cell normalized to 100, with a pseudocount added, and log transformed with base 2; used for differential gene expression analysis
- obs:
  - biological_replicate: (int) The number of the biological replicate from which the cell was taken
  - OD600: (float) The optical density (OD) at 600 nm at which the cell was collected
  - batch: (int) A unique numeric identifier for each MERFISH measurement (i.e., biological or technical replicate)
  - unique_cellid: (str) A unique identifier for each cell
  - leiden_clusters_low_res: (str) The label associated with the major growth (G) or non-growth (NG) cluster to which the cell was assigned
  - leiden_clusters: (str) The label associated with the final cluster to which each cell was assigned. A label that starts with 'excluded' indicates a cluster excluded in subsequent analysis
  - n_reads_raw: (float) The number of transcript counts within that cell
  - PC2_3_ratio: (float) The ratio between the eigenvalues of PC2 and PC3 associated with the shape of the cell
  - PC1_volume_ratio: (float) The ratio between the eigenvalue of PC1 and the segmentation mask volume
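The transformations behind `X` and the `dge` layer can be written out in a few lines of NumPy. This is a sketch of the steps as described above, not the authors' code: the pseudocount value of 1, the choice to z-score each operon across cells, and the use of the population standard deviation are all assumptions. scanpy provides `normalize_total`, `log1p`, and `scale` for steps of this kind, but the exact calls and settings used for the deposited object are not stated here.

```python
import numpy as np


def normalize_counts(raw, target_sum=100.0, pseudocount=1.0):
    """Scale each cell (row) to `target_sum` total counts, then add a pseudocount.

    The pseudocount value is an assumption; cells with zero total counts
    would need to be removed first to avoid division by zero.
    """
    raw = np.asarray(raw, dtype=float)
    totals = raw.sum(axis=1, keepdims=True)
    return raw / totals * target_sum + pseudocount


def processed_x(raw):
    """Natural-log transform, then z-score each operon (column) across cells."""
    logged = np.log(normalize_counts(raw))
    return (logged - logged.mean(axis=0)) / logged.std(axis=0)


def dge_layer(raw):
    """Same normalization, but log base 2 and no z-scoring."""
    return np.log2(normalize_counts(raw))


# Toy data: 3 cells x 2 operons.
toy_raw = np.array([[1, 3], [3, 1], [2, 2]])
x = processed_x(toy_raw)
dge = dge_layer(toy_raw)
```

Keeping the un-scaled log2 values in a separate layer means fold changes between clusters can be read directly as differences in `dge`, which is not possible after z-scoring.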
ecoli_processed_counts_diauxic/processed_count_matrix_diauxic_subset-glu-shift.h5ad
Processed count matrices in anndata format for 50X-expanded E. coli cells grown in glucose-xylose diauxic media sampled between glu1 and shift4.
This anndata structure was used for pseudotime analysis and was created by scanpy.
This structure is a subset of the processed_count_matrix_diauxic_full.h5ad structure.
All shared fields and columns are defined as above.
The novel fields and columns produced in this analysis are described below.
Description of fields and columns:
- uns:
  - iroot: The index of the root cell for the diffusion pseudotime calculation
- obsm:
  - X_diffmap: The diffusion maps for this cell subset
- obs:
  - DC1: (float) The value of the first diffusion component for each cell
  - dpt_pseudotime: (float) The diffusion pseudotime computed with scanpy with a root cell set in cluster N10
  - adj_pseudotime: (float) The adjusted diffusion pseudotime to more equally weight the pseudotime values across this shift
ecoli_intracellular_analyses/intracellular_distributions_and_reads_*.h5
Intracellular RNA localization analysis in HDF5 format for 50X- and 1000X-expanded E. coli.
Minimal Python code to read these HDF5 files is provided in ecoli_intracellular_analyses/HDF5_to_KDEdict.py.
Any column not defined below is defined as above in the ecoli barcode metadata section.
Description of fields and columns:
- reads: The filtered reads used for the subcellular localization analyses as a pandas DataFrame
  - axial_intracell_coord: (float) The axial intracellular coordinate of the read
  - radial_intracell_coord: (float) The radial intracellular coordinate of the read
  - radial_intracell_coord_squared: (float) The squared radial intracellular coordinate of the read
- counts: A dictionary indicating the total number of observations per mRNA target
- eval_kdes: {str, numpy array} A dictionary providing the evaluated 2D Kernel Density Estimation (KDE) of each mRNA target
- xx: (float) The x coordinates where the KDE was evaluated in the normalized coordinate frame
- yy: (float) The y coordinates where the KDE was evaluated in the normalized coordinate frame
ecoli_intracellular_analyses/intracellular_clusters.h5ad
Anndata of E. coli intracellular RNA localization KDEs clustered with Leiden clustering.
Any column or field not defined below is defined in the description of ecoli_processed_counts_diauxic/processed_count_matrix_diauxic_full.h5ad.
Description of fields and columns:
- X: The values of the flattened KDEs
- obs:
  - operon_name: (str) The name of the operon
  - leiden_clusters: (str) The intracellular localization cluster labels
  - "PSORTdb: [Annotation]": (bool) True if any of the genes in the operon are annotated with the listed [Annotation] in PSORTdb
  - genomic_position_angle: (float) The chromosomal position of the operon in degrees (as defined by EcoCyc)
btheta_barcode_metadata/*_barcode_metadata.csv
The decoded MERFISH read metadata for Bacteroides thetaiotaomicron (B. theta).
The name of each file describes the fixation method (PLP or Methacarn) and a unique replicate number for each sample.
Any column not defined below is defined in the ecoli_barcode_metadata section.
Description of the columns:
- dist: (float, in microns) The minimum distance of the read to the host epithelium
- n_id: (str) A unique identifier for the expression patch to which each mRNA was assigned. 'dropped' indicates that the RNA was not included in a patch
- total_magnitude: (float, in arbitrary units) The sum of the Euclidean norm of all pixel intensities across imaging rounds associated with this read
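Reads can be grouped into per-patch counts through `n_id`, after discarding the reads labeled 'dropped'. The sketch below runs on a made-up table; the patch ids and operon names are placeholders, and it assumes one file is processed at a time so that `n_id` values do not collide across samples.

```python
import pandas as pd

# Toy stand-in for one B. theta read-metadata table.
reads = pd.DataFrame(
    {
        "n_id": ["p1", "p1", "dropped", "p2"],
        "operon_name": ["opA", "opB", "opA", "opA"],
    }
)

# Keep only reads that were assigned to a patch, then count per patch and operon.
kept = reads[reads["n_id"] != "dropped"]
patch_counts = pd.crosstab(kept["n_id"], kept["operon_name"])
```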
btheta_patches/btheta_data.h5ad
Processed count matrices in anndata format for B. theta patches.
This anndata structure was created by scanpy.
Description of fields and columns:
- X: The processed count matrix (a pseudocount was added and expression was log-transformed with the natural base)
- var: Operon names
- uns: The preprocessing parameters, nearest neighbors, diffusion map, and UMAP results logged by scanpy
- obsp: The pairwise connectivities and distances computed by scanpy
- obsm:
  - X_pca: The PCA projection of X
  - X_raw: The raw untransformed counts
- obs:
  - abs_position_1: (float, in microns) The first absolute coordinate of a given spatial patch's centroid
  - abs_position_2: (float, in microns) The second absolute coordinate of a given spatial patch's centroid
  - dist: (float, in microns) The minimum Euclidean distance from the host epithelium to the patch
  - sqrtdist: (float, in sqrt(microns)) The square root transformation of 'dist'
  - datasets: (str) A unique identifier for the dataset in which the patch was measured
  - DC1: (float) The values assigned to patches for the first diffusion component
  - DC2: (float) The values assigned to patches for the second diffusion component
  - dcbins: (str) DC1 values binned into three categories (i.e., low, mid, and high)
  - UMAP_1: (float) The first coordinate of a UMAP of patches
  - UMAP_2: (float) The second coordinate of a UMAP of patches
  - n_id: (str) A unique id for the patch, matching the n_id field in the barcode_metadata csv file
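Two of the `obs` columns are simple derivations of others, which the following sketch reproduces on made-up values. The square root is exactly as described; the binning rule for `dcbins` is not specified above, so equal-width bins via `pandas.cut` are an assumption here (quantile bins via `pandas.qcut` would be an equally plausible reading).

```python
import numpy as np
import pandas as pd

# Toy patch table with only the two source columns.
obs = pd.DataFrame(
    {"dist": [0.0, 4.0, 25.0, 100.0], "DC1": [-0.9, -0.1, 0.2, 0.8]}
)

# Square-root transform of the distance to the host epithelium.
obs["sqrtdist"] = np.sqrt(obs["dist"])

# Three equal-width DC1 bins (assumed rule; the deposited bins may differ).
obs["dcbins"] = pd.cut(obs["DC1"], bins=3, labels=["low", "mid", "high"])
```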
ecoli_intracellular_analyses/HDF5_to_KDEdict.py
Contains the HDF5_to_KDE_datadict function, which can be used to load the h5 files provided for the intracellular RNA distribution analysis.
Raw data provided here were collected using a version of multiplexed error-robust fluorescence in situ hybridization (MERFISH) adapted for the measurement of the bacterial transcriptome. The detailed protocols used for sample preparation and MERFISH measurements are provided in the methods of the associated publication.
Some MERFISH data were analyzed using the single-cell analysis package, scanpy, and the intermediate and final data analysis objects created by this program are also provided.
Analysis of the intracellular organization of the E. coli transcriptome was performed with custom scripts described in the methods of the associated publication. These results are provided here with an example script illustrating how to access the data within these custom files.
