MERFISH and snRNAseq analysis of healthy and disease human liver

Moffitt, Jeffrey¹; Watson, Brianna¹; Paul, Biplab²; Mullen, Alan²

Published Feb 08, 2024; Updated Nov 01, 2024 on Dryad. https://doi.org/10.5061/dryad.37pvmcvsg

Data files

Feb 08, 2024 version files (32.38 GB)

adata_healthy_diseased_merfish.h5ad (2.23 GB)
adata_healthy_merfish_nucseq.h5ad (269.99 MB)
adata_healthy_merfish.h5ad (350.39 MB)
adata_healthy_nucseq.h5ad (1.26 GB)
AM031_20211229.csv (448.51 MB)
AM042_20211104.csv (1.95 GB)
AM048_20210830.csv (1.52 GB)
AM061_20211015.csv (1.14 GB)
AM062_20220301.csv (455.33 MB)
AM066_20220215.csv (1.29 GB)
AM066_20220216.csv (1.15 GB)
AM072_20220303.csv (650.39 MB)
cell_properties_healthy_diseased_merfish.csv (125.11 MB)
cell_properties_healthy_merfish_nucseq.csv (7.79 MB)
cell_properties_healthy_merfish.csv (21.05 MB)
cell_properties_healthy_nucseq.csv (1.14 MB)
gene_names_healthy_diseased_merfish.csv (2.34 KB)
gene_names_healthy_merfish_nucseq.csv (2.29 KB)
gene_names_healthy_merfish.csv (2.34 KB)
gene_names_healthy_nucseq.csv (175.69 KB)
README.md (7.88 KB)
X_healthy_diseased_merfish.csv (4.56 GB)
X_healthy_merfish_nucseq.csv (843 MB)
X_healthy_merfish.csv (716.57 MB)
X_healthy_nucseq.csv (8.11 GB)
X_raw_healthy_diseased_merfish.csv (4.56 GB)
X_raw_healthy_merfish.csv (716.57 MB)

Nov 01, 2024 version files (4.22 GB)

combined_data.zip (4.22 GB)
README.md (15.71 KB)

Abstract

Single-cell RNA sequencing (scRNA-seq) has advanced our understanding of cell types and their heterogeneity within the human liver, but the spatial organization at single-cell resolution has not yet been described. Here we apply multiplexed error-robust fluorescence in situ hybridization (MERFISH) to map the zonal distribution of hepatocytes, spatially resolve subsets of macrophage and mesenchymal populations, and investigate the relationship between hepatocyte ploidy and gene expression within the healthy human liver. We next integrated spatial information from MERFISH with the more complete transcriptome produced by single-nucleus RNA sequencing (snRNA-seq), revealing zonally enriched receptor-ligand interactions. Finally, MERFISH and snRNA-seq analysis of fibrotic liver samples identified two hepatocyte populations that expanded with injury and do not have clear zonal distributions. Together, these spatial maps of the healthy and fibrotic liver provide a deeper understanding of the cellular and spatial remodeling that drives disease, which in turn could provide new avenues for intervention and further study.

Brianna Watson, Biplab Paul, Jeffrey Moffitt, and Alan Mullen
2024

Anndata structures

Multiple anndata structures are provided as h5ad files. These anndata structures were generated with the scanpy pipeline and can be loaded in Python with the associated tools integrated into the pipeline.
The contents of these anndata structures are also provided as csv files for convenience.
These include:
(1) adata_healthy_merfish.h5ad
(2) adata_healthy_diseased_merfish.h5ad
(3) adata_merfish_nucseq_healthy_hep.h5ad
(4) adata_healthy_nucseq.h5ad
(5) adata_healthy_diseased_nucseq.h5ad
(6) adata_merfish_nucseq_diseased_hep.h5ad

Each anndata structure contains fields specific to its data set. A description of each is given below.

Convenience csv files

For convenience, we have also exported important elements of each anndata structure as a csv file.
These files share the same name as the anndata structures.
They are stored in the anndata_fields folder.
Files starting with 'cell_properties_' contain all metadata associated with the fields defined below.
Files starting with 'X_' (but not 'X_raw') contain the normalized count matrix.
Files starting with 'X_raw' contain the raw count matrix.
Files starting with 'gene_names' contain the names of genes in the order they are quantified within the count matrices.
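
The prefix convention above can be sketched as a small lookup. This is only an illustration: the function names `classify_export` and `dataset_name` and the returned labels are hypothetical, not part of the deposited data. The one detail worth noting is that 'X_raw' has to be tested before the shorter 'X_' prefix.

```python
# Hypothetical helpers illustrating the csv prefix convention.
# Order matters: "X_raw_" must be tested before the shorter "X_" prefix.
PREFIXES = [
    ("cell_properties_", "cell metadata (obs)"),
    ("gene_names_", "gene names"),
    ("X_raw_", "raw count matrix"),
    ("X_", "normalized count matrix"),
]


def classify_export(filename):
    """Return the kind of content a convenience csv file holds."""
    for prefix, kind in PREFIXES:
        if filename.startswith(prefix):
            return kind
    return "unknown"


def dataset_name(filename):
    """Strip the known prefix and the .csv suffix to recover the data set name."""
    name = filename[:-4] if filename.endswith(".csv") else filename
    for prefix, _ in PREFIXES:
        if name.startswith(prefix):
            return name[len(prefix):]
    return name
```

For example, `classify_export("X_raw_healthy_merfish.csv")` gives the raw count matrix label, and `dataset_name` maps it to `healthy_merfish`, which matches the name of the corresponding anndata structure.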

Anndata descriptions

(1) adata_healthy_merfish.h5ad

This structure contains only data from healthy patient samples which were imaged with the MERFISH technique.
Raw data is stored in adata.raw.X, while adata.X is normalized by the total counts per cell, scaled to a uniform value, and then converted to logarithmic space by adding a pseudocount and applying a natural log transform.
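
The normalization described here can be sketched in a few lines of dependency-free Python. The scaling value (`target_sum=1e4`) and the pseudocount of 1 are assumptions chosen for illustration; the values actually used for these data are not stated in this README. In scanpy, this kind of processing is commonly done with `sc.pp.normalize_total` followed by `sc.pp.log1p`, but whether those exact calls were used here is likewise not stated.

```python
import math


def normalize_log(counts, target_sum=1.0e4, pseudocount=1.0):
    """Normalize raw counts per cell, scale, and log-transform.

    counts: list of per-cell lists of raw counts (cells x genes).
    target_sum and pseudocount are illustrative assumptions.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with zero total counts stay at zero (avoids division by zero).
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log(c * scale + pseudocount) for c in cell])
    return normalized
```

With these assumed parameters, a cell with raw counts `[1, 3]` is scaled to `[2500, 7500]` before the pseudocount and natural log are applied.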

The obs field contains the properties of each observed cell (stored as 'cell_properties' in the csv format). For this structure, the fields are defined as follows:
-sample_id: identification of patient
-condition: defines condition of patient, all samples in this set were healthy
-cell: cell id, values restart for each experiment
-density: number of RNA per area
-area: area of cell in microns^2
-x: x-position of cell centroid in microns
-y: y-position of cell centroid in microns
-dapi_intensity: intensity of the DAPI stain for nuclei found to have 55% of the nucleus within the cellpose-baysor boundary from the central z-plane
-nuclei: number of nuclei attributed to cell; nuclei boundary was required to have 55% within cellpose-baysor boundary 
-batch: product of experiment concatenation; is unique identifier of experiment
-n_counts: sum of counts within cell
-log_count: natural log of n_counts
-n_genes: number of genes for the cell
-Cell_Type: final cell type annotations (as they appear in submitted text)
-Additional ontology terms to describe patient data, cell types, etc. as defined by CellxGene
	-organism_ontology_term_id: NCBITaxon:9606 for human, NCBITaxon:10090 for mouse
	-donor_id: free-text identifier that distinguishes the unique individual that data were derived from (same as sample_id)
	-development_stage_ontology_term_id: HsapDv if human, MmusDv if mouse, unknown if information unavailable
	-sex_ontology_term_id: PATO:0000384 for male, PATO:0000383 for female, or unknown if unavailable
	-self_reported_ethnicity_ontology_term_id: HANCESTRO multiple comma-separated terms may be used if more than one ethnicity is reported. If human and information unavailable, use unknown. Use na if non-human.
	-disease_ontology_term_id: MONDO or PATO:0000461 for 'normal'
	-tissue_type: tissue, organoid, or cell culture
	-tissue_ontology_term_id: UBERON
	-cell_type_ontology_term_id: CL
	-assay_ontology_term_id: EFO
	-suspension_type: cell, nucleus, or na, as corresponding to assay. Base designation on cellxgene table. ("na" for this data set)

(2) adata_healthy_diseased_merfish.h5ad

This structure contains data from both healthy and diseased patient samples which were imaged with the MERFISH technique.
Raw data is stored in adata.raw.X, while adata.X is normalized by the total counts per cell, scaled to a uniform value, and then converted to logarithmic space by adding a pseudocount and applying a natural log transform.

The obs field contains the properties of each observed cell (stored as 'cell_properties' in the csv format). For this structure, the fields are defined as follows:
-sample_id: identification of patient
-condition: defines condition of patient, either healthy or diseased
-cell: cell id, values restart for each experiment
-density: number of RNA per area
-area: area of cell in microns^2
-x: x-position of cell centroid in microns
-y: y-position of cell centroid in microns
-batch: product of experiment concatenation; is unique identifier of experiment
-n_counts: sum of counts within cell
-log_count: natural log of n_counts
-n_genes: number of genes for the cell
-Cell_Type_final: final cell type annotations (as they appear in submitted text)
-Additional ontology terms to describe patient data, cell types, etc. as defined by CellxGene
	-organism_ontology_term_id: NCBITaxon:9606 for human, NCBITaxon:10090 for mouse
	-donor_id: free-text identifier that distinguishes the unique individual that data were derived from (same as sample_id)
	-development_stage_ontology_term_id: HsapDv if human, MmusDv if mouse, unknown if information unavailable
	-sex_ontology_term_id: PATO:0000384 for male, PATO:0000383 for female, or unknown if unavailable
	-self_reported_ethnicity_ontology_term_id: HANCESTRO multiple comma-separated terms may be used if more than one ethnicity is reported. If human and information unavailable, use unknown. Use na if non-human.
	-disease_ontology_term_id: MONDO or PATO:0000461 for 'normal'
	-tissue_type: tissue, organoid, or cell culture
	-tissue_ontology_term_id: UBERON
	-cell_type_ontology_term_id: CL
	-assay_ontology_term_id: EFO
	-suspension_type: cell, nucleus, or na, as corresponding to assay. Base designation on cellxgene table. ("na" for this data set)

(3) adata_merfish_nucseq_healthy_hep.h5ad

This structure contains data from only healthy patient samples which were imaged with MERFISH or processed with snRNA-Seq.
adata.X is normalized by the total counts per cell, scaled to a uniform value, then converted to logarithmic space by adding a pseudocount and applying a natural log transform, and finally z-scored. This processing was done for the MERFISH and snRNA-Seq data separately before concatenating the two data sets together.
The raw data was not included in the structure since the units would not be equivalent between the two techniques.
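
The order of operations for the joint structure (z-score each technique on its own, then concatenate) can be sketched as follows. This is a minimal, dependency-free illustration under several assumptions that the README does not confirm: z-scoring is done per gene across cells, the population standard deviation is used, both matrices list the same genes in the same order, and the batch labels "merfish" and "nucseq" are placeholders.

```python
from statistics import mean, pstdev


def zscore_columns(matrix):
    """Z-score each gene (column) across the cells (rows) of one data set."""
    scaled_columns = []
    for col in zip(*matrix):
        mu, sigma = mean(col), pstdev(col)
        # A gene with no variance is mapped to zeros instead of dividing by zero.
        scaled_columns.append(
            [(v - mu) / sigma if sigma > 0 else 0.0 for v in col]
        )
    return [list(row) for row in zip(*scaled_columns)]


def concatenate_modalities(merfish, nucseq):
    """Z-score each technique separately, then stack the cells.

    Assumes both inputs are log-normalized cells x genes matrices
    that share the same gene order.
    """
    rows = zscore_columns(merfish) + zscore_columns(nucseq)
    # Placeholder batch labels identifying the technique of each cell.
    batch = ["merfish"] * len(merfish) + ["nucseq"] * len(nucseq)
    return rows, batch
```

Scaling each technique before concatenation puts every gene on a comparable scale in both data sets, which is also why raw counts are not meaningful in the combined matrix.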

The obs field contains the properties of each observed cell (stored as 'cell_properties' in the csv format). For this structure, the fields are defined as follows:
-CellType: annotation of cell from snRNAseq analysis alone
-n_genes: number of genes for the cell
-sample_id: identification of patient (MERFISH measurements)
-batch: product of anndata concatenation; is unique identifier of technique
-cell_name: Hepatocyte annotation from analysis of MERFISH data alone
-leiden_r1.5_joint: leiden resolution used for joint annotations
-Hep_Type_Joint: final joint annotation of hepatocytes
-sample_id_sn: identification of patient (snRNAseq measurements)

(4) adata_healthy_nucseq.h5ad

This structure contains data from only healthy patient samples which were processed with snRNA-Seq.
Raw data is stored in adata.raw.X, while adata.X is normalized by the total counts per cell, scaled to a uniform value, and then converted to logarithmic space by adding a pseudocount and applying a natural log transform.

The obs field contains the properties of each observed cell (stored as 'cell_properties' in the csv format). For this structure, the fields are defined as follows:
-Sex: biological sex of patient (F or M)
-Age: age range of patient (e.g., 51-60 years of age)
-log10GenesPerUMI: number of genes detected per UMI
-seurat_clusters: clustering of cell based on seurat parameters
-CellType: cell annotations as determined by ScType
-Condition: classification of tissue as healthy (normal) or fibrotic (disease)
-n_counts: number of counts after mt filtering
-log_counts: natural log of n_counts
-n_genes: number of genes for the cell
-cell_type_final: cell type annotations as they appear in final text for snRNA-Seq
-sample_id: identification of patient
-Additional ontology terms to describe patient data, cell types, etc. as defined by CellxGene
	-organism_ontology_term_id: NCBITaxon:9606 for human, NCBITaxon:10090 for mouse
	-donor_id: free-text identifier that distinguishes the unique individual that data were derived from (same as sample_id)
	-development_stage_ontology_term_id: HsapDv if human, MmusDv if mouse, unknown if information unavailable
	-sex_ontology_term_id: PATO:0000384 for male, PATO:0000383 for female, or unknown if unavailable
	-self_reported_ethnicity_ontology_term_id: HANCESTRO multiple comma-separated terms may be used if more than one ethnicity is reported. If human and information unavailable, use unknown. Use na if non-human.
	-disease_ontology_term_id: MONDO or PATO:0000461 for 'normal'
	-tissue_type: tissue, organoid, or cell culture
	-tissue_ontology_term_id: UBERON
	-cell_type_ontology_term_id: CL
	-assay_ontology_term_id: EFO
	-suspension_type: cell, nucleus, or na, as corresponding to assay. Base designation on cellxgene table. ("na" for this data set)

(5) adata_healthy_diseased_nucseq.h5ad

This structure contains data from healthy and diseased patient samples which were processed with snRNA-Seq.
Raw data is stored in adata.raw.X, while adata.X is normalized by the total counts per cell, scaled to a uniform value, and then converted to logarithmic space by adding a pseudocount and applying a natural log transform.

The obs field contains the properties of each observed cell (stored as 'cell_properties' in the csv format). For this structure, the fields are defined as follows:
-Condition: classification of tissue as healthy (normal) or fibrotic (disease)
-Sex: biological sex of patient (F or M)
-Age: age range of patient (e.g., 51-60 years of age)
-log10GenesPerUMI_injured: number of genes detected per UMI (from seurat analysis for injured cells)
-seurat_clusters_injured: clustering of cell based on seurat parameters (from seurat analysis for injured cells)
-CellType_injured: cell annotations as determined by ScType (from seurat analysis for injured cells)
-cell_type_final_injured: cell type annotations as they appear in final text for injured snRNA-Seq
-log10GenesPerUMI_healthy: number of genes detected per UMI (from seurat analysis for healthy cells)
-seurat_clusters_healthy: clustering of cell based on seurat parameters (from seurat analysis for healthy cells)
-CellType_healthy: cell annotations as determined by ScType (from seurat analysis for healthy cells)
-cell_type_final_healthy: cell type annotations as they appear in final text for healthy snRNA-Seq
-batch: product of anndata concatenation; is unique identifier of technique
-n_counts: number of counts after mt filtering
-log_counts: natural log of n_counts
-n_genes: number of genes for the cell
-sample_id: identification of patient
-Additional ontology terms to describe patient data, cell types, etc. as defined by CellxGene
	-organism_ontology_term_id: NCBITaxon:9606 for human, NCBITaxon:10090 for mouse
	-donor_id: free-text identifier that distinguishes the unique individual that data were derived from (same as sample_id)
	-development_stage_ontology_term_id: HsapDv if human, MmusDv if mouse, unknown if information unavailable
	-sex_ontology_term_id: PATO:0000384 for male, PATO:0000383 for female, or unknown if unavailable
	-self_reported_ethnicity_ontology_term_id: HANCESTRO multiple comma-separated terms may be used if more than one ethnicity is reported. If human and information unavailable, use unknown. Use na if non-human.
	-disease_ontology_term_id: MONDO or PATO:0000461 for 'normal'
	-tissue_type: tissue, organoid, or cell culture
	-tissue_ontology_term_id: UBERON
	-cell_type_ontology_term_id: CL
	-assay_ontology_term_id: EFO
	-suspension_type: cell, nucleus, or na, as corresponding to assay. Base designation on cellxgene table. ("na" for this data set)

(6) adata_merfish_nucseq_diseased_hep.h5ad

This structure contains data from healthy and diseased patient samples which were imaged with MERFISH and diseased samples processed with snRNA-Seq.
adata.X is normalized by the total counts per cell, scaled to a uniform value, then converted to logarithmic space by adding a pseudocount and applying a natural log transform, and finally z-scored. This processing was done for the MERFISH and snRNA-Seq data separately before concatenating the two data sets together.
The raw data was not included in the structure since the units would not be equivalent between the two techniques.

The obs field contains the properties of each observed cell (stored as 'cell_properties' in the csv format). For this structure, the fields are defined as follows:
-CellType: annotation of cell from snRNAseq analysis alone
-n_counts: sum of counts within cell
-n_genes: number of genes for the cell
-sample_id: identification of patient (MERFISH measurements)
-batch: product of anndata concatenation; is unique identifier of technique
-healthy_annotations: Annotations from analysis of healthy MERFISH data alone
-leiden_r1.2_joint: leiden resolution used for joint annotations
-Hep_Type_Joint_1.2: final joint annotation of hepatocytes (based on leiden resolution 1.2)
-sample_id_sn: identification of patient (snRNAseq measurements)

MERFISH Localizations

We also provide the output of the MERFISH analysis pipeline after segmentation with Baysor.
These files provide the metadata associated with every identified RNA molecule.
These files are stored within the rna_metadata folder.

Each experiment has a csv file associated with the measurement. This file describes the properties of each RNA detected through the decoding pipeline (described below).
The files follow the convention of '[sample id]_[date of collection]'.
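
As an illustration of this naming convention, the sketch below splits a file name such as AM031_20211229.csv into its sample id and collection date. Reading the date as YYYYMMDD is an assumption inferred from the file names; the function name `parse_rna_filename` is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def parse_rna_filename(path):
    """Split '[sample id]_[date of collection].csv' into its two parts.

    Assumes the date is written as YYYYMMDD.
    """
    stem = Path(path).stem  # e.g. "AM031_20211229"
    sample_id, _, date_text = stem.rpartition("_")
    if not sample_id:
        raise ValueError(f"unexpected file name: {path!r}")
    collected = datetime.strptime(date_text, "%Y%m%d").date()
    return sample_id, collected
```

Two files can share a sample id and differ only in date (for example AM066_20220215.csv and AM066_20220216.csv), so the pair of values, not the sample id alone, identifies an experiment.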

The following fields are included in each csv file:
-barcode_id: index of gene in codebook
-gene_name: name of identified gene
-fov_id: specific field-of-view (FOV) associated with image in which RNA is found. FOV follows sequential order of acquisition 
-total_magnitude: sum of the normalized intensity associated with each pixel of the given RNA
-abs_position_1: x position of the centroid of the RNA in microns
-abs_position_2: y position of the centroid of the RNA in microns
-abs_position_3: z position (depth within the tissue slice) of the centroid of the RNA in microns
-area: number of pixels assigned to the given RNA
-error_bit: bit position at which error correction occurred (if applicable)
-error_dir: direction of error (1 = 1->0 error (loss of signal); 0 = 0->1 error (gain of signal))
-av_distance: average distance to matched barcode
-feature_id: cellpose cell id
-in_feature: True or False, indicating whether the RNA is within a cellpose cell boundary
-in_feature_id: cellpose cell id
-molecule_id: RNA ID used by baysor
-prior_segmentation: cell ID used by baysor
-confidence: Baysor output showing the probability that a molecule is real
-cluster: Baysor output identifying the molecule cluster
-cell: ID of the cell to which the RNA is assigned; "" corresponds to noise
-assignment_confidence: Baysor output indicating confidence that the RNA is assigned to the correct cell
-is_noise: Baysor output which indicates whether the molecule should (False) or should not (True) be assigned to a cell
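
To show how these fields fit together, the sketch below tallies molecules per cell and gene from one of these csv files while skipping molecules flagged as noise. It is a minimal illustration, not the pipeline used to build the count matrices. It assumes that column names match the field names above, that booleans are written as text such as "True" or "False", and that an empty `cell` value marks noise. The function name `count_rna_per_cell` is hypothetical.

```python
import csv
from collections import Counter


def count_rna_per_cell(handle):
    """Count molecules per (cell, gene_name) pair, skipping noise.

    handle: an open text file (or any iterable of csv lines) whose
    header includes 'gene_name', 'cell', and 'is_noise'.
    """
    counts = Counter()
    for row in csv.DictReader(handle):
        # Assumption: booleans are serialized as text ("True"/"False").
        is_noise = row["is_noise"].strip().lower() == "true"
        if is_noise or row["cell"] == "":
            continue
        counts[(row["cell"], row["gene_name"])] += 1
    return counts
```

A typical call would be `with open(path, newline="") as f: counts = count_rna_per_cell(f)`. Because the files are large, streaming rows this way avoids loading a whole file into memory.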