Scalable and cost-efficient custom gene library assembly from oligopools
Data files
Apr 09, 2026 version files (269.96 GB total)
- fscan_ont_data.fq.gz (143.46 GB)
- jscan_R1.fastq.gz (3.42 GB)
- jscan_R2.fastq.gz (3.75 GB)
- lib1-hifi_reads.bam (40.55 GB)
- lib1-hifi_reads.bam.pbi (55 MB)
- lib2-hifi_reads.bam (43.08 GB)
- lib2-hifi_reads.bam.pbi (52.05 MB)
- pcrs_R1.fastq.gz (3.47 GB)
- pcrs_R2.fastq.gz (3.62 GB)
- README.md (4.30 KB)
- syngfp_assembly2_rep1_presort.bam (3.90 GB)
- syngfp_assembly2_rep1_presort.bam.pbi (12.48 MB)
- syngfp_assembly2_rep1_presort.consensusreadset.xml (1.98 KB)
- syngfp_assembly2_rep1_presort.md5sum (86 B)
- syngfp_assembly2_rep2_presort.bam (2.88 GB)
- syngfp_assembly2_rep2_presort.bam.pbi (9.22 MB)
- syngfp_assembly2_rep2_presort.consensusreadset.xml (1.97 KB)
- syngfp_assembly2_rep2_presort.md5sum (86 B)
- syngfp_rep1_GFP1.fq.gz (640.76 MB)
- syngfp_rep1_GFP2.fq.gz (773.50 MB)
- syngfp_rep1_GFP3.fq.gz (799.81 MB)
- syngfp_rep1_GFP4.fq.gz (1.02 GB)
- syngfp_rep1_GFP5.fq.gz (1.06 GB)
- syngfp_rep1_N1.fq.gz (879.86 MB)
- syngfp_rep1_N2.fq.gz (402.77 MB)
- syngfp_rep1_NEG.fq.gz (1.30 GB)
- syngfp_rep1_presort.bam (1.65 GB)
- syngfp_rep1_presort.bam.md5sum (86 B)
- syngfp_rep1_presort.bam.pbi (6.69 MB)
- syngfp_rep1_presort.consensusreadset.xml (1.98 KB)
- syngfp_rep1_PreSort.fq.gz (572.42 MB)
- syngfp_rep2_GFP1.fq.gz (791.21 MB)
- syngfp_rep2_GFP2.fq.gz (1.04 GB)
- syngfp_rep2_GFP3.fq.gz (700.88 MB)
- syngfp_rep2_GFP4.fq.gz (832.47 MB)
- syngfp_rep2_GFP5.fq.gz (809.77 MB)
- syngfp_rep2_N1.fq.gz (1.11 GB)
- syngfp_rep2_N2.fq.gz (879.54 MB)
- syngfp_rep2_NEG.fq.gz (771.65 MB)
- syngfp_rep2_presort.bam (1.53 GB)
- syngfp_rep2_presort.bam.md5sum (86 B)
- syngfp_rep2_presort.bam.pbi (6.18 MB)
- syngfp_rep2_presort.consensusreadset.xml (1.98 KB)
- syngfp_rep2_PreSort.fq.gz (752.71 MB)
- uniprot_rep1_presort.bam (1.83 GB)
- uniprot_rep1_presort.bam.md5sum (86 B)
- uniprot_rep1_presort.bam.pbi (8.27 MB)
- uniprot_rep1_presort.consensusreadset.xml (1.98 KB)
- uniprot_rep2_presort.bam (1.53 GB)
- uniprot_rep2_presort.bam.md5sum (86 B)
- uniprot_rep2_presort.bam.pbi (7.66 MB)
- uniprot_rep2_presort.consensusreadset.xml (1.98 KB)
Abstract
This dataset contains all next-generation sequencing (NGS) data generated to evaluate the OMEGA (Oligo-based Multiplexed Efficient Gene Assembly) platform for multiplexed gene library construction. It includes PacBio HiFi long-read sequencing (BAM with index and metadata), Illumina paired-end sequencing (FASTQ), and Oxford Nanopore long-read sequencing (FASTQ) across assembly validation (Rubisco and Cas9 libraries), amplicon-based quality control (PCR and JSCAN), and functional screening of GFP variant libraries. Data are organized by sequencing platform and experiment type, including replicate-level presort libraries and fluorescence-activated cell sorting (FACS) populations spanning multiple fluorescence bins and negative controls. These files enable reconstruction of full-length variants, quantification of assembly accuracy and uniformity, and analysis of sequence–function relationships. All data are in standard DNA sequencing formats compatible with common open-source tools and are suitable for benchmarking gene assembly methods, developing analysis pipelines, and training machine learning models. All sequences are synthetic or non-pathogenic research constructs; no human or clinical data are included, and there are no ethical or legal restrictions on reuse.
Dataset Overview
This dataset contains all next-generation sequencing (NGS) data generated to evaluate the OMEGA gene assembly method. It includes PacBio HiFi long-read sequencing (BAM), Illumina paired-end sequencing (FASTQ), and Oxford Nanopore long-read sequencing (FASTQ). These data measure gene assembly accuracy, library composition, and functional screening of GFP variants.
All files contain DNA sequence reads. No processed tables or derived variables are included.
Data Organization
The dataset is organized by sequencing platform and experiment type:
- PacBio HiFi (BAM): full-length assembled genes
- Illumina (FASTQ): short-read validation
- Oxford Nanopore (FASTQ): long-read variant analysis and sorted populations
Each filename encodes:
- library type (e.g., GFP, Rubisco)
- replicate number (rep1, rep2)
- sample type (presort, sorted bin, control)
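The naming scheme above can be sketched as a small parser. This is a minimal illustration, assuming the pattern `<library>_<replicate>_<sample>.<extension>` inferred from the replicate-level files in the inventory; the function name `parse_filename` and the field names are hypothetical, and files that do not follow the pattern (for example `lib1-hifi_reads.bam` or `pcrs_R1.fastq.gz`) simply return `None`.

```python
import re
from typing import Optional

# Pattern inferred from the replicate-level filenames in this dataset
# (an assumption, not an official naming specification).
_NAME = re.compile(
    r"(?P<library>[a-z0-9]+(?:_assembly\d+)?)"  # e.g. syngfp, uniprot, syngfp_assembly2
    r"_(?P<replicate>rep\d+)"                   # rep1, rep2
    r"_(?P<sample>[A-Za-z0-9]+)"                # presort, PreSort, GFP1..GFP5, N1, N2, NEG
    r"\.(?P<extension>.+)"                      # bam, bam.pbi, fq.gz, ...
)


def parse_filename(name: str) -> Optional[dict]:
    """Split a replicate-level filename into its fields, or return None."""
    match = _NAME.fullmatch(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for example in ("syngfp_rep1_GFP3.fq.gz", "lib1-hifi_reads.bam"):
        print(example, "->", parse_filename(example))
```

Returning `None` for non-matching names keeps the sketch from guessing fields for files that were named under a different convention.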
Key Definitions
- Presort: unsorted gene library before selection
- FACS: fluorescence-activated cell sorting, used here to sort GFP variants by fluorescence
- GFP1–GFP5: increasing fluorescence bins (GFP1 = lowest, GFP5 = highest)
- N1, N2: negative gates
- NEG: combined negative population
- PreSort: unsorted Nanopore sample
- Replicate: independent experiment
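For downstream analysis it can help to turn these sample labels into a category plus an ordinal index. The sketch below is one possible mapping built only from the definitions above; the function name `classify_sample` and the category strings are hypothetical choices made for illustration.

```python
import re
from typing import Optional, Tuple


def classify_sample(label: str) -> Tuple[str, Optional[int]]:
    """Map a sample label to (category, index); index is None where it does not apply."""
    if label.lower() == "presort":          # covers both "presort" and "PreSort"
        return ("presort", None)
    if label == "NEG":                      # combined negative population
        return ("negative_combined", None)
    bin_match = re.fullmatch(r"GFP([1-5])", label)
    if bin_match:                           # GFP1 = lowest, GFP5 = highest fluorescence
        return ("fluorescence_bin", int(bin_match.group(1)))
    gate_match = re.fullmatch(r"N([12])", label)
    if gate_match:                          # negative gates N1 and N2
        return ("negative_gate", int(gate_match.group(1)))
    raise ValueError(f"unrecognized sample label: {label!r}")


if __name__ == "__main__":
    for label in ("PreSort", "GFP1", "GFP5", "N2", "NEG"):
        print(label, "->", classify_sample(label))
```

Keeping the bin number as an integer preserves the ordering of GFP1 through GFP5, which is convenient when comparing read counts across fluorescence bins.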
File Types
FASTQ (.fastq.gz, .fq.gz)
Raw sequencing reads:
- nucleotide sequence (A, T, C, G)
- quality scores (Phred scale)
Units:
- sequence length: base pairs (bp)
- quality: Phred score
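As a brief illustration of the Phred scale, a score Q corresponds to an estimated base-call error probability of 10^(-Q/10). The sketch below decodes a FASTQ quality string into scores, assuming the common Phred+33 ASCII encoding; that offset is an assumption about these files and is not stated in the dataset description. The names `phred_scores` and `error_probability` are illustrative.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger-style) ASCII encoding


def phred_scores(quality: str) -> List[int]:
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in quality]


def error_probability(score: int) -> float:
    """Estimated probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


if __name__ == "__main__":
    scores = phred_scores("!5I")
    print(scores)                                   # Phred scores per base
    print([error_probability(q) for q in scores])   # matching error probabilities
```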
BAM (.bam)
Aligned long reads with metadata.
Associated files:
- .pbi: index
- .xml: metadata
- .md5sum: integrity check
Data File Inventory
PacBio HiFi — Rubisco and Cas9
- lib1-hifi_reads.bam
- lib1-hifi_reads.bam.pbi
- lib2-hifi_reads.bam
- lib2-hifi_reads.bam.pbi
Illumina — PCR
- pcrs_R1.fastq.gz
- pcrs_R2.fastq.gz
Illumina — JSCAN
- jscan_R1.fastq.gz
- jscan_R2.fastq.gz
Oxford Nanopore — fscan
- fscan_ont_data.fq.gz
PacBio HiFi — Uniprot GFP
- uniprot_rep1_presort.bam
- uniprot_rep1_presort.bam.pbi
- uniprot_rep1_presort.bam.md5sum
- uniprot_rep1_presort.consensusreadset.xml
- uniprot_rep2_presort.bam
- uniprot_rep2_presort.bam.pbi
- uniprot_rep2_presort.bam.md5sum
- uniprot_rep2_presort.consensusreadset.xml
PacBio HiFi — SynGFP (Original)
- syngfp_rep1_presort.bam
- syngfp_rep1_presort.bam.pbi
- syngfp_rep1_presort.bam.md5sum
- syngfp_rep1_presort.consensusreadset.xml
- syngfp_rep2_presort.bam
- syngfp_rep2_presort.bam.pbi
- syngfp_rep2_presort.bam.md5sum
- syngfp_rep2_presort.consensusreadset.xml
PacBio HiFi — SynGFP (Repeat Assembly)
- syngfp_assembly2_rep1_presort.bam
- syngfp_assembly2_rep1_presort.bam.pbi
- syngfp_assembly2_rep1_presort.md5sum
- syngfp_assembly2_rep1_presort.consensusreadset.xml
- syngfp_assembly2_rep2_presort.bam
- syngfp_assembly2_rep2_presort.bam.pbi
- syngfp_assembly2_rep2_presort.md5sum
- syngfp_assembly2_rep2_presort.consensusreadset.xml
Oxford Nanopore — FACS GFP (Replicate 1)
- syngfp_rep1_GFP1.fq.gz
- syngfp_rep1_GFP2.fq.gz
- syngfp_rep1_GFP3.fq.gz
- syngfp_rep1_GFP4.fq.gz
- syngfp_rep1_GFP5.fq.gz
- syngfp_rep1_N1.fq.gz
- syngfp_rep1_N2.fq.gz
- syngfp_rep1_NEG.fq.gz
- syngfp_rep1_PreSort.fq.gz
Oxford Nanopore — FACS GFP (Replicate 2)
- syngfp_rep2_GFP1.fq.gz
- syngfp_rep2_GFP2.fq.gz
- syngfp_rep2_GFP3.fq.gz
- syngfp_rep2_GFP4.fq.gz
- syngfp_rep2_GFP5.fq.gz
- syngfp_rep2_N1.fq.gz
- syngfp_rep2_N2.fq.gz
- syngfp_rep2_NEG.fq.gz
- syngfp_rep2_PreSort.fq.gz
Usage Notes
Viewing BAM files (PacBio HiFi)
Use samtools to inspect BAM files:
samtools view file.bam | head
Viewing FASTQ files (Illumina and ONT)
View reads directly:
zcat file.fastq.gz | head
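The same kind of quick inspection can be done programmatically. The following is a minimal sketch using only the Python standard library; it assumes each record occupies exactly four lines (no wrapped sequences), and the function names `read_fastq` and `head` are illustrative rather than part of any existing tool.

```python
import gzip
from itertools import islice
from typing import Iterator, List, Tuple

Record = Tuple[str, str, str]  # (read_id, sequence, quality)


def read_fastq(path: str) -> Iterator[Record]:
    """Yield records from a gzip-compressed FASTQ file with four lines per record."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:                      # end of file
                break
            sequence = handle.readline().rstrip("\n")
            handle.readline()                   # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header[1:], sequence, quality  # drop the leading '@'


def head(path: str, n: int = 3) -> List[Record]:
    """Return the first n records without reading the whole file."""
    return list(islice(read_fastq(path), n))
```

Because the reader is a generator, `head("file.fastq.gz", 3)` stops after three records, which matters for the larger files in this dataset.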
Companion files
- .bam.pbi: PacBio index (not required for viewing)
- .consensusreadset.xml: metadata (not required for viewing)
- .md5sum: file integrity check (run: md5sum -c filename.md5sum)
Only the .bam and FASTQ (.fastq.gz or .fq.gz) files are needed for basic inspection.
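Where the `md5sum` command is unavailable, the integrity check can be approximated with the Python standard library. This sketch assumes each .md5sum file follows the usual `md5sum` layout, with the hexadecimal digest as the first whitespace-separated token; the function names are hypothetical. Unlike `md5sum -c`, it takes the data file path explicitly instead of relying on the path recorded in the checksum file.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(checksum_path: str) -> str:
    """Read the digest from a checksum file (assumed layout: '<digest>  <filename>')."""
    with open(checksum_path) as handle:
        return handle.read().split()[0].lower()


def verify(data_path: str, checksum_path: str) -> bool:
    """Return True when the file's digest matches the recorded one."""
    return md5_of_file(data_path) == expected_md5(checksum_path)
```

A call such as `verify("syngfp_rep1_presort.bam", "syngfp_rep1_presort.bam.md5sum")` would then return `True` for an intact download.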
Reuse
Applications include:
- gene assembly benchmarking
- variant reconstruction
- sequence–function modeling
- machine learning training datasets
Ethical and Legal
- No human data
- Non-pathogenic sequences
- No restrictions on reuse (CC0)
