Data from: Biomarker detection and validation for corneal involvement in patients with acute infectious conjunctivitis: A multi-country study
Data files
Oct 18, 2025 version files (8.98 MB total)
- biomarker_detection_count_data_md5-9b4b21f9.csv (8.58 MB)
- biomarker_detection_sample_table_md5-7bd7fef2.csv (2.58 KB)
- ENSG_ID2Name.txt.zip (395.48 KB)
- README.md (2.89 KB)
Abstract
This dataset accompanies the study “Biomarker Detection and Validation for Corneal Involvement in Patients With Acute Infectious Conjunctivitis,” published in JAMA Ophthalmology (doi:10.1001/jamaophthalmol.2024.2891). The study utilized transcriptomic data and machine learning approaches to identify biomarkers associated with corneal involvement in conjunctivitis patients, with apolipoprotein E (APOE) emerging as a key biomarker. The dataset includes raw transcriptomic counts, sample metadata, and gene mapping files, enabling replication and further exploration of the findings.
Ethical considerations have been addressed, with all patient data anonymized and deidentified to protect privacy.
https://doi.org/10.5061/dryad.4j0zpc8mm
This dataset contains the gene count data used to find biomarkers that predict corneal involvement in patients with acute infectious conjunctivitis, as described in the paper:
Seitzman GD, Prajna L, Prajna NV, et al. Biomarker Detection and Validation for Corneal Involvement in Patients With Acute Infectious Conjunctivitis. JAMA Ophthalmol. 2024;142(9):865–871. doi:10.1001/jamaophthalmol.2024.2891
List of files
biomarker_detection_count_data_md5-9b4b21f9.csv
biomarker_detection_sample_table_md5-7bd7fef2.csv
ENSG_ID2Name.txt.zip
Description of the data and file structure
Below is a brief description of each data file.
biomarker_detection_count_data_md5-9b4b21f9.csv
The CSV file biomarker_detection_count_data_md5-9b4b21f9.csv contains counts for human genes found in the 58 conjunctival samples used in the study. RNA-Seq data were generated on an Illumina NovaSeq 6000 sequencing machine at the UCSF sequencing center. Sequencing reads were quality filtered using PriceSeqFilter and aligned to the GRCh38 human genome assembly using HISAT2 (version 2.1.0). Gene abundance was calculated using the default parameters in stringtie2 (version 1.3.4d). Transcript annotation was based on ENSEMBL GRCh38.87. The attached gene count matrix was then generated using the "prepDE.py" script, following the protocol in the stringtie2 documentation.
- Rows: counts for 58,302 genes (identified by ENSG gene_id)
- Columns: one for each of 58 samples
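The layout described above (one row per gene, one column per sample) can be read with the Python standard library. This is a minimal sketch, assuming the first CSV column holds the ENSG gene_id and every other column is a sample. The function name `load_count_matrix` and the two-gene demo data are illustrative and not part of the dataset.

```python
import csv
import io


def load_count_matrix(handle):
    """Read a gene-by-sample count CSV.

    Returns (sample_ids, counts), where counts maps gene_id -> list of
    per-sample values in the same order as sample_ids.
    Assumption: the first column is the gene identifier.
    """
    reader = csv.reader(handle)
    header = next(reader)
    sample_ids = header[1:]
    counts = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        counts[row[0]] = [float(value) for value in row[1:]]
    return sample_ids, counts


# Made-up stand-in for the real file; replace with
# open("biomarker_detection_count_data_md5-9b4b21f9.csv", newline="")
demo = io.StringIO(
    "gene_id,S1,S2\n"
    "ENSG00000000001,10,0\n"
    "ENSG00000000002,3,7\n"
)
samples, counts = load_count_matrix(demo)
print(len(counts), "genes x", len(samples), "samples")
```

Run against the real file, the printed dimensions should match the row and column counts listed above if the layout assumption holds.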
biomarker_detection_sample_table_md5-7bd7fef2.csv
The CSV file biomarker_detection_sample_table_md5-7bd7fef2.csv contains metadata about the samples used, including the number of input reads for normalizing the counts to reads per million. It contains the following columns:
- Sample: unique, anonymized ID for each sample
- N_reads: number of sequencing read pairs
- Country: country of origin for each sample
- DESEq: part of training set used for DESeq2 dimensionality reduction
- Machine Learning: included in ML (all true, redundant column)
- RT-qPCR Validation: Real-time quantitative PCR was performed on this sample (Yes / No)
- Corneal Involvement: corneal involvement clinically detected (1 yes, 0 no)
- Sex: sex of patient
- Age: age of patient in years
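The N_reads column is provided for normalizing counts to reads per million. The sketch below shows one way to do that, assuming the sample table has a header row with the `Sample` and `N_reads` column names listed above and that N_reads is stored as a plain integer. The helper names and the demo rows are hypothetical.

```python
import csv
import io


def read_n_reads(handle):
    """Map each Sample ID to its N_reads value from the sample table."""
    return {row["Sample"]: int(row["N_reads"]) for row in csv.DictReader(handle)}


def to_reads_per_million(count, n_reads):
    """Scale a raw count by sequencing depth: count * 1e6 / N_reads."""
    return count * 1_000_000 / n_reads


# Made-up rows standing in for the real sample table
demo_table = io.StringIO(
    "Sample,N_reads,Country\n"
    "S1,2000000,X\n"
    "S2,500000,Y\n"
)
n_reads = read_n_reads(demo_table)

# A gene with 10 raw counts in sample S1 (2,000,000 read pairs)
rpm_s1 = to_reads_per_million(10, n_reads["S1"])
print(rpm_s1)
```

This is a plain depth normalization and is not the DESeq2 normalization used in the paper's analysis.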
ENSG_ID2Name.txt.zip
ENSG_ID2Name.txt.zip is a zip-compressed text file containing the mapping of ENSG IDs to gene names as they were used in the study.
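The mapping can be read directly from the archive without unpacking it. The internal layout of the text file is not documented here, so this sketch assumes the archive holds a single text file with one tab-separated "ENSG ID, gene name" pair per line. Check the file and adjust the delimiter if it differs. `load_id_to_name` is an illustrative name, and the demo archive is built in memory.

```python
import io
import zipfile


def load_id_to_name(zip_source, delimiter="\t"):
    """Read an ENSG ID -> gene name mapping from a zip archive.

    Assumptions: the archive contains one text file, and each line has
    the ID in the first field and the gene name in the second.
    """
    mapping = {}
    with zipfile.ZipFile(zip_source) as archive:
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                parts = line.rstrip("\r\n").split(delimiter)
                # Keep only lines that look like ID/name pairs
                if len(parts) >= 2 and parts[0].startswith("ENSG"):
                    mapping[parts[0]] = parts[1]
    return mapping


# In-memory demo archive; for the real data pass "ENSG_ID2Name.txt.zip"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("ENSG_ID2Name.txt", "ENSG00000130203\tAPOE\n")
buffer.seek(0)

id_to_name = load_id_to_name(buffer)
print(id_to_name["ENSG00000130203"])
```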
Code/Software
Software used for determining gene counts:
- PRICE Sequence Filter (version 1.2)
- HISAT2 (version 2.1.0)
- stringtie2 (version 1.3.4d), and stringtie2's prepDE.py Python script
