This DATSETNAMEreadme.txt file was generated on 2021-04-19 by XIAN-JIE YANG

GENERAL INFORMATION

1. Title of Dataset: Single cell RNA-sequencing dataset of H9 human ESC-derived retinal organoids: transduced by lentiviruses (LV-TetO) expressing EGFP, ATOH7, Neurog2

2. Author Information:
   Principal Investigator Contact Information
   Name: XIAN-JIE YANG
   Institution: UCLA Stein Eye Institute
   Address: 100 Stein Plaza, Los Angeles, CA 90095
   Email: yang@jsei.ucla.edu

3. Date of data collection: approximately 2018-04-18 to 2018-05-18

4. Geographic location of data collection: City of Los Angeles, California, USA

5. Information about funding sources that supported the collection of the data: NIH grant R01EY026319 to XJY, NIH core grant P30EY000331, and an unrestricted grant from Research to Prevent Blindness to the Department of Ophthalmology at the University of California, Los Angeles.

SHARING/ACCESS INFORMATION

1. Licenses/restrictions placed on the data: None

2. Links to publications that cite or use the data: https://doi.org/10.3389/fcell.2021.653305

3. Links to other publicly accessible locations of the data: None

4. Links/relationships to ancillary data sets: None

5. Was data derived from another source? No

6. Recommended citation for this dataset: Zhang X., Mandric I., Nguyen K.H., Nguyen T.T.T., Pellegrini M., Grove J.C.R., Barnes S., Yang X.-J. (2021) Single cell transcriptomic analyses reveal the impact of bHLH factors on human retinal organoid development. Front. Cell Dev. Biol. 9:653305.

DATA & FILE OVERVIEW

1. File List:
   GFP_filtered_gene_bc_matrices_h5.h5
   AEP_filtered_gene_bc_matrices_h5.h5
   NEP_filtered_gene_bc_matrices_h5.h5

   Description of dataset: Day 45-48 single cell RNA-sequencing data of H9 ESC-derived retinal organoid cells infected with the virus LV-GFP, LV-AEP, or LV-NEP. Data represent FACS-sorted, GFP-positive, virus-infected cells.

2. Relationship between files, if important: LV-GFP data serve as a control for LV-AEP and LV-NEP transduced human retinal organoid cells.

3. Additional related data collected that was not included in the current data package: None.

4. Are there multiple versions of the dataset? No.

METHODOLOGICAL INFORMATION

1. Description of methods used for collection/generation of data: (See reference cited above.)

Single cell cDNA library preparation and sequencing

Distinct pools of H9 ES cell-derived retinal organoids (12-20 retinal organoids/pool) co-infected by LV-rtTA and LV-EGFP, LV-AEP, or LV-NEP were induced by Dox and dissociated between Day 45 and Day 48 using trypsin and manual trituration. Dissociated cell suspensions were subjected to fluorescence-activated cell sorting using a FACSAria II (BD Biosciences). Non-infected retinal organoid cells were used to set thresholds for selecting EGFP-positive cells. Sorted EGFP-positive cells were collected in HBSS without Ca2+ and Mg2+ (ThermoFisher, 14170-112) containing 1% FBS and 0.4% BSA. The cells were washed with PBS containing 0.04% BSA, then counted with a Countess II Cell Counter (ThermoFisher). Automated single-cell capture, barcoding, and cDNA library preparation were carried out using the 10X Genomics Chromium Controller with Chromium Single Cell 3’ Library & Gel Bead Kit v2 reagents, with 12 cycles of cDNA amplification and 12 cycles of library amplification, following the manufacturer’s instructions. The Qubit dsDNA Assay kit (Life Technologies) and TapeStation 4200 (Agilent) were used to assess the quality and concentration of the libraries. Illumina NovaSeq 6000 S2 in paired-end 2x50 bp mode was used to sequence the libraries.

2. Methods for processing the data: (See reference cited above.)

Single cell RNA-sequencing data processing and quality control

10X Genomics Cell Ranger version 2.1.1 was used to demultiplex the raw base calls into FASTQ files (cellranger mkfastq).
Spliced Transcripts Alignment to a Reference (STAR) version 2.5.1b (cellranger count) was used to perform sequence alignments to the reference human genome (GRCh38) and to generate barcode counts and UMI counts, yielding summary reports and t-distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction. For downstream analyses, cells with > 2500 unique molecular identifiers (UMIs) per cell and < 0.1% mitochondrial gene expression were used. For the LV-GFP, LV-AEP, and LV-NEP samples, the mean reads per cell ranged from 139,000 to 195,000, with mean genes per cell ranging from 2935 to 3079. The resulting total single cell counts used for analysis were 3004 for LV-EGFP, 2063 for LV-AEP, and 3909 for LV-NEP infected samples.

3. Instrument- or software-specific information needed to interpret the data: (See reference cited above.)

Single cell RNA-sequencing data analysis and visualization

The analysis of scRNA-seq data was performed using the Seurat R package (https://satijalab.org/seurat/v2.2) (Butler et al., 2018; Stuart et al., 2019). Clustering of cells was performed using the Seurat FindClusters function (top 20 principal components, resolution 0.8), which implements the shared nearest neighbor modularity optimization algorithm. Nonlinear dimensionality reduction using UMAP (Uniform Manifold Approximation and Projection) was applied to visualize cells in two-dimensional space. Feature plots of known genes were used to assign clusters observed in the UMAP space to six major cell categories/states. Cell counts of each category were obtained using custom R code. Pseudotime developmental progression of cell states was obtained by using the Monocle R package (version 2) to process the datasets with cell labels corresponding to the six cell categories, and was visualized as UMAPs.
Pseudotime cell cycle progression and cell fate adoption analysis was performed using the Slingshot R package (version 1.6.1) by combining the LV-GFP, LV-AEP, and LV-NEP scRNA-seq datasets and assigning the start point as neural stem cells and the end points as differentiated neuronal cell types. Differentially expressed genes (DEGs) were identified using edgeR, with significantly enriched genes in each cell category defined as those with adjusted p-values < 0.05 for the LV-GFP, LV-AEP, and LV-NEP datasets (Supplementary Tables 1-3). The top 10 enriched DEGs in each cell category were defined as those with adjusted p-value < 0.05 and log fold change > 1.5 (Supplementary Tables 4-6). Heatmaps of the top 10 DEGs were generated using the Seurat package. Volcano plots were generated using the EnhancedVolcano R package to show p-values and fold changes of DEGs between two datasets. Gene Ontology (GO) enrichment analysis was performed using ShinyGO v0.61 (http://bioinformatics.sdstate.edu/go) (Ge et al., 2020) with the Homo sapiens background and a p-value (FDR) cutoff of 0.05. The top 25 DEGs from each cell category within the LV-GFP dataset were used as inputs, and the redundancy of the output biological processes was manually reduced to the most predominant GO terms. Feature plots of individual gene expression patterns in different cell clusters were presented as UMAPs. Violin plots for individual genes in all cell clusters were constructed to show expression levels and cell distributions. The Kruskal-Wallis one-way ANOVA rank sum test and the Tukey-Kramer-Nemenyi all-pairs test were used for statistical analysis, taking into consideration both gene expression levels and cell numbers between different samples, with p-value < 0.05 considered significant. The statistical tests were performed in RStudio using the ‘PMCMRplus’ (Kessler et al., 2020) and ‘FSA’ (Derek H. Ogle, 2020) packages. Data were plotted in RStudio using the ‘ggplot2’ (Hadley, 2016) and ‘ggsignif’ (Ahlmann-Eltze, 2019) packages.
Cells with gene expression level < 0.2 were excluded from the violin plots and statistical analyses. STRING analysis exploring the protein-protein association network was performed using the human protein database (version 11.0; https://string-db.org/) by inputting relevant genes involved in retinal development. The schematic network model shows known molecular interactions reported previously and new regulatory relationships described in this study.

4. Standards and calibration information, if appropriate:

5. Environmental/experimental conditions:

6. Describe any quality-assurance procedures performed on the data: See 2. "Methods for processing the data".

7. People involved with sample collection, processing, analysis and/or submission: Xiangmei Zhang, Kevin H. Nguyen in sample collection and processing; Igor Mandric, Matteo Pellegrini, Thao T.T. Nguyen in data analyses; Xian-Jie Yang in data submission.

DATA-SPECIFIC INFORMATION FOR: [GFP_filtered_gene_bc_matrices_h5.h5][AEP_filtered_gene_bc_matrices_h5.h5][NEP_filtered_gene_bc_matrices_h5.h5]

1. Number of variables: 3

2. Number of cases/rows:

3. Variable List:
   GFP-filtered genes, Dox-inducible lentivirus LV-GFP-infected cells
   AEP-filtered genes, Dox-inducible lentivirus LV-AEP-infected cells
   NEP-filtered genes, Dox-inducible lentivirus LV-NEP-infected cells

4. Missing data codes:

5. Specialized formats or other abbreviations used: None
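
ILLUSTRATIVE EXAMPLE (not part of the original analysis)

The cell-level quality-control filter described under "Methods for processing the data" (more than 2500 UMIs per cell, less than 0.1% mitochondrial gene expression) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' code, which relied on Cell Ranger and R packages. The function names, the toy counts, and the use of the "MT-" gene-symbol prefix to identify mitochondrial genes are all assumptions made for illustration. The mitochondrial threshold is read literally as a percentage from the text above; check it against the cited publication before reusing it.

```python
from typing import Dict, List


def passes_qc(gene_counts: Dict[str, int], min_umi: int,
              max_mito_percent: float, mito_prefix: str = "MT-") -> bool:
    """Hypothetical helper: True if the cell has more than `min_umi` UMIs
    and less than `max_mito_percent` percent mitochondrial UMIs."""
    total_umi = sum(gene_counts.values())
    if total_umi == 0:
        return False  # an empty cell cannot pass either threshold
    # Assumption: mitochondrial genes are those whose symbol starts with "MT-".
    mito_umi = sum(n for gene, n in gene_counts.items()
                   if gene.startswith(mito_prefix))
    mito_percent = 100.0 * mito_umi / total_umi
    return total_umi > min_umi and mito_percent < max_mito_percent


def filter_cells(cells: Dict[str, Dict[str, int]], min_umi: int,
                 max_mito_percent: float) -> List[str]:
    """Return the barcodes of cells that pass both thresholds."""
    return [barcode for barcode, counts in cells.items()
            if passes_qc(counts, min_umi, max_mito_percent)]


# Toy counts invented for illustration; these are not values from the dataset.
toy_cells = {
    "CELL_A": {"ATOH7": 1500, "NEUROG2": 1498, "MT-CO1": 2},   # 3000 UMIs, about 0.07% mito
    "CELL_B": {"ATOH7": 1000, "NEUROG2": 900, "MT-CO1": 0},    # 1900 UMIs, too few
    "CELL_C": {"ATOH7": 2000, "NEUROG2": 900, "MT-CO1": 100},  # 3000 UMIs, about 3.3% mito
}

kept = filter_cells(toy_cells, min_umi=2500, max_mito_percent=0.1)
print(kept)
```

Both thresholds are explicit arguments rather than defaults, so a reader applying the sketch to real count matrices has to choose the values deliberately.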