This dataset contains all next-generation sequencing (NGS) data generated to evaluate the OMEGA (Oligo-based Multiplexed Efficient Gene Assembly) platform for multiplexed gene library construction. It includes PacBio HiFi long-read sequencing (BAM with index and metadata), Illumina paired-end sequencing (FASTQ), and Oxford Nanopore long-read sequencing (FASTQ) across assembly validation (Rubisco and Cas9 libraries), amplicon-based quality control (PCR and JSCAN), and functional screening of GFP variant libraries. Data are organized by sequencing platform and experiment type, including replicate-level presort libraries and fluorescence-activated cell sorting (FACS) populations spanning multiple fluorescence bins and negative controls. These files enable reconstruction of full-length variants, quantification of assembly accuracy and uniformity, and analysis of sequence–function relationships. All data are in standard DNA sequencing formats compatible with common open-source tools and are suitable for benchmarking gene assembly methods, developing analysis pipelines, and training machine learning models. All sequences are synthetic or non-pathogenic research constructs; no human or clinical data are included, and there are no ethical or legal restrictions on reuse.

Dataset Overview

This dataset contains all next-generation sequencing (NGS) data generated to evaluate the OMEGA gene assembly method. It includes PacBio HiFi long-read sequencing (BAM), Illumina paired-end sequencing (FASTQ), and Oxford Nanopore long-read sequencing (FASTQ). These data measure gene assembly accuracy, library composition, and functional screening of GFP variants.

All files contain DNA sequence reads. No processed tables or derived variables are included.

Data Organization

The dataset is organized by sequencing platform and experiment type:

PacBio HiFi (BAM): full-length assembled genes
Illumina (FASTQ): short-read validation
Oxford Nanopore (FASTQ): long-read variant analysis and sorted populations

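Where applicable, each filename encodes:

library type (e.g., syngfp, uniprot)
replicate number (rep1, rep2)
sample type (presort, sorted bin, control)

The remaining files (lib1, lib2, pcrs, jscan, fscan) are named by library or experiment only; R1 and R2 denote the two files of a paired-end run.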
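The naming scheme above can be split mechanically. The function below is a minimal sketch that assumes the fields are underscore-separated as <library>_rep<N>_<sample>.<extension>, which matches the GFP files in the inventory; the function name parse_sample_name is illustrative and not part of the dataset.

```shell
# Split a filename such as syngfp_rep1_GFP3.fq.gz into library, replicate, sample.
# Assumption: <library>_rep<N>_<sample>.<ext>; anything else is reported as unparsed.
parse_sample_name() (
  base=${1##*/}      # drop any directory prefix
  stem=${base%%.*}   # drop everything from the first dot (.fq.gz, .bam.pbi, ...)
  case $stem in
    *_rep[0-9]*_*)
      library=${stem%%_rep[0-9]*}        # text before _rep<N>
      rest=${stem#"${library}_rep"}      # <N>_<sample>
      replicate=${rest%%_*}
      sample=${rest#*_}
      printf '%s\t%s\t%s\n' "$library" "$replicate" "$sample"
      ;;
    *)
      printf 'unparsed: %s\n' "$base" >&2
      exit 1
      ;;
  esac
)
```

For example, parse_sample_name syngfp_rep2_GFP4.fq.gz prints syngfp, 2 and GFP4 separated by tabs, while lib1-hifi_reads.bam is rejected with a non-zero exit status.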

Key Definitions

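Presort: unsorted gene library before selection (written presort in PacBio filenames)
FACS: fluorescence-activated cell sorting, used here to sort GFP variants by fluorescence
GFP1–GFP5: increasing fluorescence bins (GFP1 = lowest, GFP5 = highest)
N1, N2: negative gates
NEG: combined negative population
PreSort: unsorted library sequenced on Nanopore (written PreSort in Nanopore filenames)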
Replicate: independent experiment

File Types

FASTQ (.fastq.gz, .fq.gz)

Raw sequencing reads:

nucleotide sequence (A, T, C, G)
quality scores (Phred scale)

Units:

sequence length: base pairs (bp)
quality: Phred score
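A Phred score Q relates to the per-base error probability P by Q = -10 log10(P), so Q20 corresponds to a 1% error rate and Q40 to 0.01%. The sketch below summarizes one quality string; it assumes the Phred+33 ASCII encoding used by current Illumina and Nanopore FASTQ output, which this dataset does not state explicitly. The function name qual_summary is illustrative.

```shell
# Print base count, mean error probability, and the Phred score of that mean
# for a single FASTQ quality string.
# Assumption: Phred+33 encoding (ASCII code minus 33 gives Q).
qual_summary() {
  # od -v prints every byte as a decimal code without collapsing repeated lines
  printf '%s' "$1" | od -An -v -tu1 | awk '
    { for (i = 1; i <= NF; i++) { q = $i - 33; sum_p += 10 ^ (-q / 10); n++ } }
    END {
      if (n == 0) exit 1
      mean_p = sum_p / n
      printf "%d\t%.4f\t%.1f\n", n, mean_p, -10 * log(mean_p) / log(10)
    }'
}
```

Error probabilities are averaged rather than the scores themselves, because a mean of Q values understates the effect of a few low-quality bases.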

BAM (.bam)

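PacBio HiFi (CCS) long reads with per-read metadata. The accompanying .consensusreadset.xml files describe read sets rather than alignments, and HiFi read BAMs are normally unaligned, so no reference alignment should be assumed.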

Associated files:

.pbi: index
.xml: metadata
.md5sum: integrity check

Data File Inventory

PacBio HiFi — Rubisco and Cas9

lib1-hifi_reads.bam
lib1-hifi_reads.bam.pbi
lib2-hifi_reads.bam
lib2-hifi_reads.bam.pbi

Illumina — PCR

pcrs_R1.fastq.gz
pcrs_R2.fastq.gz

Illumina — JSCAN

jscan_R1.fastq.gz
jscan_R2.fastq.gz

Oxford Nanopore — fscan

fscan_ont_data.fq.gz

PacBio HiFi — Uniprot GFP

uniprot_rep1_presort.bam
uniprot_rep1_presort.bam.pbi
uniprot_rep1_presort.bam.md5sum
uniprot_rep1_presort.consensusreadset.xml
uniprot_rep2_presort.bam
uniprot_rep2_presort.bam.pbi
uniprot_rep2_presort.bam.md5sum
uniprot_rep2_presort.consensusreadset.xml

PacBio HiFi — SynGFP (Original)

syngfp_rep1_presort.bam
syngfp_rep1_presort.bam.pbi
syngfp_rep1_presort.bam.md5sum
syngfp_rep1_presort.consensusreadset.xml
syngfp_rep2_presort.bam
syngfp_rep2_presort.bam.pbi
syngfp_rep2_presort.bam.md5sum
syngfp_rep2_presort.consensusreadset.xml

PacBio HiFi — SynGFP (Repeat Assembly)

syngfp_assembly2_rep1_presort.bam
syngfp_assembly2_rep1_presort.bam.pbi
syngfp_assembly2_rep1_presort.md5sum
syngfp_assembly2_rep1_presort.consensusreadset.xml
syngfp_assembly2_rep2_presort.bam
syngfp_assembly2_rep2_presort.bam.pbi
syngfp_assembly2_rep2_presort.md5sum
syngfp_assembly2_rep2_presort.consensusreadset.xml

Oxford Nanopore — FACS GFP (Replicate 1)

syngfp_rep1_GFP1.fq.gz
syngfp_rep1_GFP2.fq.gz
syngfp_rep1_GFP3.fq.gz
syngfp_rep1_GFP4.fq.gz
syngfp_rep1_GFP5.fq.gz
syngfp_rep1_N1.fq.gz
syngfp_rep1_N2.fq.gz
syngfp_rep1_NEG.fq.gz
syngfp_rep1_PreSort.fq.gz

Oxford Nanopore — FACS GFP (Replicate 2)

syngfp_rep2_GFP1.fq.gz
syngfp_rep2_GFP2.fq.gz
syngfp_rep2_GFP3.fq.gz
syngfp_rep2_GFP4.fq.gz
syngfp_rep2_GFP5.fq.gz
syngfp_rep2_N1.fq.gz
syngfp_rep2_N2.fq.gz
syngfp_rep2_NEG.fq.gz
syngfp_rep2_PreSort.fq.gz
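Because both replicates share one naming pattern, the expected Nanopore FACS files can be generated rather than typed out. The helpers below are a sketch for checking a download; the function names and the ordering of samples are illustrative choices, not something the dataset prescribes.

```shell
# Print the nine Nanopore FACS filenames for one replicate (argument: 1 or 2),
# ordered from the unsorted library through the negative gates to the brightest bin.
facs_files() (
  for sample in PreSort NEG N1 N2 GFP1 GFP2 GFP3 GFP4 GFP5; do
    printf 'syngfp_rep%s_%s.fq.gz\n' "$1" "$sample"
  done
)

# List the expected files that are absent from a directory (default: current directory).
missing_facs_files() {
  facs_files "$1" | while IFS= read -r name; do
    [ -e "${2:-.}/$name" ] || printf '%s\n' "$name"
  done
}
```

Running missing_facs_files 1 /path/to/download prints nothing when all replicate 1 files are present.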

Usage Notes

Viewing BAM files (PacBio HiFi)

Use samtools to inspect BAM files:

samtools view file.bam | head
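Two further samtools subcommands are often useful at this stage: samtools view -c counts records, and samtools fastq writes reads as FASTQ. The wrappers below are a sketch that assumes samtools is installed and on the PATH; the function names are illustrative.

```shell
# Count the records in a BAM file (one record per read in an unaligned BAM).
bam_read_count() {
  samtools view -c "$1"
}

# Convert a BAM file to gzip-compressed FASTQ so it can be handled like the
# Illumina and Nanopore files. Arguments: input BAM, output path.
bam_to_fastq() {
  samtools fastq "$1" | gzip -c > "$2"
}
```

For example, bam_to_fastq uniprot_rep1_presort.bam uniprot_rep1_presort.fastq.gz writes a FASTQ copy of the first Uniprot GFP replicate.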

Viewing FASTQ files (Illumina and ONT)

View reads directly (on macOS, use gzip -dc in place of zcat; ONT files end in .fq.gz):

zcat file.fastq.gz | head
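Beyond looking at the first few reads, a quick summary of read count and length helps confirm that a file downloaded completely. The sketch below assumes standard four-line FASTQ records with unwrapped sequence lines; the function name fastq_stats is illustrative.

```shell
# Print read count, total bases, and mean read length for a gzipped FASTQ file.
# Assumption: four lines per record, sequence on the second line of each record.
fastq_stats() {
  gzip -dc "$1" | awk '
    NR % 4 == 2 { reads++; bases += length($0) }
    END { printf "%d\t%d\t%.1f\n", reads, bases, (reads ? bases / reads : 0) }'
}
```

For example, fastq_stats syngfp_rep1_PreSort.fq.gz prints the three values separated by tabs.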

Companion files

.bam.pbi — PacBio index (not required for viewing)
.consensusreadset.xml — metadata (not required for viewing)
.md5sum — file integrity check (run: md5sum -c filename.md5sum)

Only the .bam and FASTQ (.fastq.gz, .fq.gz) files are needed for basic inspection.
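If md5sum -c cannot locate the data file, for instance because the checksum file records a different path, the stored digest can be compared by hand. The sketch below assumes that the first whitespace-separated field of the .md5sum file is the hex digest and that GNU md5sum is available; neither is confirmed by the dataset, and the function name verify_md5 is illustrative.

```shell
# Compare the digest stored in a checksum file with a freshly computed one.
# Arguments: data file, then (optionally) the checksum file; default <data>.md5sum.
# Assumption: the first field on the first line of the checksum file is the digest.
verify_md5() (
  data=$1
  sumfile=${2:-$1.md5sum}
  expected=$(awk 'NR == 1 { print tolower($1) }' "$sumfile")
  actual=$(md5sum < "$data" | awk '{ print $1 }')
  if [ -n "$expected" ] && [ "$expected" = "$actual" ]; then
    printf 'OK\t%s\n' "$data"
  else
    printf 'FAILED\t%s\n' "$data"
    exit 1
  fi
)
```

The optional second argument covers the repeat-assembly files, whose checksum files are listed without the .bam part: verify_md5 syngfp_assembly2_rep1_presort.bam syngfp_assembly2_rep1_presort.md5sum.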

Reuse

Applications include:

gene assembly benchmarking
variant reconstruction
sequence–function modeling
machine learning training datasets

Ethical and Legal

No human data
Non-pathogenic sequences
No restrictions on reuse (CC0)

Scalable and cost-efficient custom gene library assembly from oligopools
