Liquid crystal-guided DNA information storage: Non-destructive recovery and long-term preservation
Data files
Sep 05, 2025 version files 19.82 MB
- README.md (12.47 KB)
- Sequencing_results_of_27_DNA_strands_encoding_Eine_Kleine_Nachtmusik.zip (8.67 MB)
- The_algorithm_for_storing_audio_in_multiple_DNA_(Eine_Kleine_Nachtmusik).zip (131.67 KB)
- The_algorithm_for_storing_text_in_DNA_fragment_(The_Sea).zip (70.39 KB)
- The_algorithm_for_storing_text_in_genome_DNA_(The_Complete_Works_of_William_Shakespeare).zip (10.93 MB)
Abstract
Dataset DOI: 10.5061/dryad.c59zw3rmp
Description of the data and file structure
The encoding algorithms are provided as three archives: The_algorithm_for_storing_audio_in_multiple_DNA_(Eine_Kleine_Nachtmusik), The_algorithm_for_storing_text_in_DNA_fragment_(The_Sea), and The_algorithm_for_storing_text_in_genome_DNA_(The_Complete_Works_of_William_Shakespeare).
The_algorithm_for_storing_text_in_DNA_fragment_(The_Sea): To ensure high density and accuracy when storing information in DNA molecules, we developed a cascade algorithm that densely converts digital data into DNA sequences. It combined an arithmetic compression algorithm, a Lempel-Ziv-Welch (LZW) constraint algorithm, and a Reed-Solomon (RS) error-correcting code. The arithmetic compression algorithm ensured high storage efficiency throughout the encoding process. The LZW constraint algorithm regulated the GC content and eliminated homopolymers. The RS error-correcting code corrected nucleotide substitution and loss errors introduced during data writing, storage, and reading.
The_algorithm_for_storing_text_in_genome_DNA_(The_Complete_Works_of_William_Shakespeare): To evaluate the capacity of LDIPP for large-scale digital storage, a 4.8 Mbp DH5α Escherichia coli (E. coli) bacterial genome, theoretically corresponding to about 1.9 MB of digital data based on our encoding algorithm, was used to construct the LDIPP.
The_algorithm_for_storing_audio_in_multiple_DNA_(Eine_Kleine_Nachtmusik): An audio segment of “Eine Kleine Nachtmusik” by Mozart (approximately 18 KB) was encoded into a total of 76 kbp of DNA sequence, divided into 27 strands of around 3000 bp each. To distinguish and recover every strand from a mixed sample, a unique index, a forward primer, and a reverse primer were designed and added at the 5’/3’ ends of each strand.
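The two sequence constraints named above (GC content and homopolymer runs) can be checked with a few lines of Python. This is a minimal sketch, not the code shipped in the archives: the function names and the thresholds `gc_min`, `gc_max`, and `max_run` are illustrative assumptions, and the actual limits used in the study are set in the dataset's own scripts.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Return the length of the longest run of one repeated base."""
    longest = 0
    run = 0
    previous = ""
    for base in seq.upper():
        # Extend the current run, or start a new run of length 1.
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def passes_constraints(seq: str, gc_min: float = 0.4, gc_max: float = 0.6,
                       max_run: int = 3) -> bool:
    """Check a sequence against illustrative (assumed) GC and run limits."""
    within_gc = gc_min <= gc_content(seq) <= gc_max
    return within_gc and longest_homopolymer(seq) <= max_run
```

A sequence that fails either check would have to be re-encoded or rejected; how the authors' LZW constraint step handles this is not described here.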
File: The_algorithm_for_storing_text_in_DNA_fragment_(The_Sea).zip
Description: This dataset contains scripts and supporting files for storing text information (The Sea) in a single plasmid sequence. The folder includes implementations of arithmetic coding, LZW compression, and Reed–Solomon error correction, along with the main encoding and decoding pipelines.
File and Directory Description
.idea/
Project configuration files automatically generated by the Python IDE. Not required for reproduction.
__pycache__/
Python cache files automatically generated when running scripts.
source_file/
Folder containing example text input files that are encoded into plasmid DNA sequences.
Scripts
main.py
Entry point script that integrates encoding and decoding functions for plasmid-based text storage.
main_encode.py
Script for encoding short text files into DNA sequences suitable for storage in a single plasmid.
main_decode.py
Script for decoding plasmid DNA sequences back into the original text files.
Arithmetic.py
Implements arithmetic coding for compressing and encoding text into DNA sequences.
Lzw.py
Implements Lempel–Ziv–Welch (LZW) compression for reducing redundancy in text before DNA encoding.
reedsolomon.py
Implements Reed–Solomon error correction to ensure robustness of the encoded DNA sequences.
encoding_frame.py
Defines the DNA sequence frame structure, including data payload, indexing, and error-correction markers.
FileTools.py
Utility functions for file input/output handling, including reading text and writing encoded DNA.
parameters.py
Configuration file specifying key parameters (fragment length, GC-content range, error-correction level, etc.).
sequence_find.py
Helper script for scanning DNA sequences and locating encoded payloads within a plasmid.
Usage
Place the text file to be encoded inside the source_file/ directory.
Run main_encode.py to convert the input text into a DNA sequence formatted for single plasmid storage.
Internally, the process applies LZW compression, arithmetic coding, and Reed–Solomon error correction, and wraps the sequence into a predefined encoding frame.
The output DNA sequence is written to the results directory (or displayed in the console).
To decode, run main_decode.py with the encoded plasmid DNA sequence as input.
This script performs sequence identification, error correction, decompression, and recovers the original text file.
Advanced users can modify parameters.py to adjust fragment length, primer binding sites, or error-correction depth.
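For readers unfamiliar with the LZW step mentioned above, the following is a textbook LZW compressor and decompressor over bytes. It is a sketch for orientation only and is not taken from Lzw.py; the function names are hypothetical, and the dataset's version may differ (for example, in how it enforces DNA sequence constraints).

```python
from typing import List


def lzw_compress(data: bytes) -> List[int]:
    """Textbook LZW: emit one integer code per longest known prefix."""
    table = {bytes([i]): i for i in range(256)}  # single-byte seed entries
    current = b""
    codes: List[int] = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = len(table)  # register the new phrase
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: List[int]) -> bytes:
    """Invert lzw_compress by rebuilding the same phrase table."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code refers to the phrase that is being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        out += entry
        table[len(table)] = previous + entry[:1]
        previous = entry
    return bytes(out)
```

Repetitive input compresses to fewer codes than input bytes, and `lzw_decompress(lzw_compress(data))` returns the original bytes.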
File: The_algorithm_for_storing_text_in_genome_DNA_(The_Complete_Works_of_William_Shakespeare).zip
Description: This dataset contains source code, genomic data, and processed files used in the study Liquid Crystal-Guided DNA Information Storage: Non-Destructive Recovery and Long-Term Preservation. It includes scripts for encoding/decoding, genome-based data storage, and raw/reference text files used for demonstration.
File and Directory Description
Directories
__pycache__/
Auto-generated Python cache files. Not essential for analysis.
data/
Contains intermediate or supporting data files for running the scripts.
Data Files
JRYM01.1.fsa_nt
Reference genome sequence in FASTA format.
mutated_genome.fasta
Genome sequence with introduced mutations, used for testing error tolerance.
shakespeare.txt
Plain text of Shakespeare's works used as input data.
The Complete Works of William Shake.bin
Full binary code of Shakespeare's works for encoding experiments.
decoded_output/
Folder containing decoded results after applying genome-based retrieval.
encoded_output/
Folder containing encoded DNA sequences or intermediate genome fragments.
Scripts
encoder_v2.py
Main script for converting digital data into DNA-encoded sequences.
decoder_v2.py
Main script for decoding DNA sequences back into digital files.
genome_blast.py
Script for sequence alignment and similarity search using BLAST.
genome_loader.py
Script to load genome files into memory for processing.
genome_read.py
Script to read and parse genomic sequence data.
index_builder.py
Script for building index structures for fast lookup during encoding/decoding.
k_selector.py
Script for selecting optimal k-mer size for encoding.
kmer_gpu_counter.py
GPU-accelerated script for counting k-mers in genome sequences.
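As a point of reference for the k-mer scripts listed above, k-mer counting can be sketched on the CPU with the standard library. This is an illustrative assumption about what "counting k-mers" means here (overlapping windows of length k); it is not the GPU implementation in kmer_gpu_counter.py, and `count_kmers` is a hypothetical name.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = seq.upper()
    # range() is empty when the sequence is shorter than k.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

For example, `count_kmers("ATATAT", 2)` counts `AT` three times and `TA` twice.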
Output Files
decoded_output/
Final results after decoding (text or binary files).
encoded_output/
DNA-encoded sequence files generated by encoder scripts.
Usage
Install Python 3.8+ and required libraries (numpy, biopython, etc.).
Run encoder_v2.py to encode input text files into DNA sequences.
Run decoder_v2.py to recover text files from encoded DNA sequences.
Optional scripts (genome_blast.py, kmer_gpu_counter.py, etc.) are used for additional analysis or acceleration.
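Before running the encoder, it can be useful to confirm that a genome FASTA file (such as JRYM01.1.fsa_nt) parses as expected. The sketch below is a small standard-library FASTA reader written for illustration; it is not genome_loader.py or genome_read.py, and `parse_fasta` is a hypothetical helper. Biopython's `SeqIO` module offers the same functionality if it is installed.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record ID (first word after '>') to its uppercase sequence."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)  # close the previous record
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records
```

An open file handle can be passed directly, for example `with open(path) as handle: records = parse_fasta(handle)`.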
File: The_algorithm_for_storing_audio_in_multiple_DNA_(Eine_Kleine_Nachtmusik).zip
Description: This dataset contains scripts, metadata, and DNA sequence files used for encoding an MP3 file into multiple DNA sequences. The folder includes encoding and decoding pipelines, primer design files, and final encoded DNA sequences for data storage experiments.
File and Directory Description
Directories
.idea/
Auto-generated project settings from the Python IDE. Not required for reproduction.
__pycache__/
Python cache files automatically generated during script execution.
test/
Contains test files and intermediate results for validation of encoding/decoding scripts.
Data Files
final_dna_sequences.fasta
FASTA-formatted sequences representing fragments of plasmids storing the encoded MP3 file.
fragment_metadata.xlsx
Metadata table describing each DNA fragment.
- fragment_index: Numerical index for each DNA fragment.
- block_id: Identifier for the data block that the fragment belongs to.
- index_seq: Index sequence (barcode) assigned to the fragment.
- order_in_block: The position/order of the fragment within its block.
- payload_length: The length (in nucleotides) of the fragment payload region that encodes data.
candidate_primer_sequences_ref.xlsx
Excel file containing candidate primer sequences for PCR amplification.
- sequence: Nucleotide sequence of the primer candidate.
decoded_output.bin
Binary file reconstructed after decoding the DNA sequences back into the original MP3 data.
Scripts
encode_main.py
Main script for encoding digital files (e.g., MP3) into DNA sequences distributed across multiple plasmids.
decode_main.py
Main script for decoding DNA sequences back into the original digital file.
arith_encode.py
Script implementing arithmetic coding to compress and encode digital data into DNA-friendly formats.
arith_decode.py
Script for decoding DNA sequences that were encoded with arithmetic coding.
raptorg_encode.py
Script implementing RaptorQ forward error correction coding during DNA encoding.
raptorg_decode.py
Script for recovering digital data from DNA sequences using RaptorQ decoding.
rs_codec.py
Script implementing Reed–Solomon error correction to ensure robustness of stored DNA sequences.
fragmenter.py
Splits the full DNA sequence into smaller plasmid fragments with metadata annotation.
primer_generator.py
Script to design primers for plasmid fragments.
primer_selector.py
Script to select optimized primers from candidate lists.
dna_utils.py
Utility functions for DNA sequence manipulation, including GC content calculation and sequence validation.
utils_analysis.py
Supplementary analysis tools for evaluating fragment quality and error correction performance.
Usage
Prepare the input MP3 file.
Run encode_main.py to encode the MP3 into DNA sequences.
arith_encode.py, raptorg_encode.py, and rs_codec.py are called internally for compression and error correction.
Run fragmenter.py to divide encoded DNA into plasmid-sized fragments, generating final_dna_sequences.fasta and fragment_metadata.xlsx.
Use primer_generator.py and primer_selector.py to design primers for amplification and sequencing validation.
To reconstruct the file, run decode_main.py. This will call raptorg_decode.py, arith_decode.py, and rs_codec.py to decode and recover the original MP3 as decoded_output.bin.
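The fragmentation step above can be pictured as splitting one long encoded sequence into fixed-size payloads and then flanking each payload with its primers and index. The sketch below is an assumption-laden illustration, not fragmenter.py: the `Fragment` type, the function names, and the strand layout (forward primer, index, payload, reverse primer) are hypothetical, and only the field names `fragment_index` and `payload_length` are borrowed from fragment_metadata.xlsx.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    fragment_index: int   # position of the fragment in the original sequence
    payload: str          # data-carrying nucleotides
    payload_length: int   # number of nucleotides in payload


def split_payload(seq: str, payload_len: int) -> List[Fragment]:
    """Cut seq into consecutive payloads of at most payload_len nucleotides."""
    if payload_len <= 0:
        raise ValueError("payload_len must be a positive integer")
    fragments = []
    for index, start in enumerate(range(0, len(seq), payload_len)):
        payload = seq[start:start + payload_len]
        fragments.append(Fragment(index, payload, len(payload)))
    return fragments


def assemble_strand(fragment: Fragment, forward_primer: str,
                    index_seq: str, reverse_primer: str) -> str:
    """Join primers, index, and payload in an assumed 5'-to-3' layout."""
    return forward_primer + index_seq + fragment.payload + reverse_primer
```

Concatenating the payloads in `fragment_index` order restores the input sequence, which is the property the decoder relies on when it reorders strands by index.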
File: Sequencing_results_of_27_DNA_strands_encoding_Eine_Kleine_Nachtmusik.zip
Description: This folder contains sequencing results of 27 DNA strands used to store and retrieve segments of the MP3 music file Eine Kleine Nachtmusik. Each file corresponds to one DNA fragment obtained from plasmid-based storage platforms, sequenced to validate the integrity of stored information.
File Description
The files are named sequentially as seqXX.dna, where XX is an integer index (00–26).
Each file contains the nucleotide sequence of a single DNA fragment in plain text format (.dna).
Collectively, the 27 fragments represent the full DNA-encoded form of the MP3 file.
Examples
seq00.dna – First DNA fragment encoding a portion of the MP3 file.
seq01.dna – Second DNA fragment encoding a subsequent portion.
…
seq26.dna – Final DNA fragment completing the set of 27.
Data Format
File extension: .dna
Content: Plain text string of nucleotides (A, T, C, G) representing encoded digital data.
Each DNA sequence has been validated by sequencing, and corresponds to one designed fragment used for storage.
Usage
Collect all 27 .dna files (seq00–seq26).
Input the files into the decoding pipeline (e.g., decode_main.py from The_algorithm_for_storing_audio_in_multiple_DNA_(Eine_Kleine_Nachtmusik).zip).
The decoding process will:
Align fragments according to their index,
Apply error correction (Reed–Solomon, RaptorQ),
Reconstruct the binary file.
The final output is the original MP3 file (Eine Kleine Nachtmusik).
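Before handing the sequencing results to the decoder, the files can be ordered by index and checked for unexpected characters. The sketch below assumes what this section states: files are named seqXX.dna and contain plain-text nucleotides. The helper names `order_fragments` and `load_folder` are hypothetical and are not part of decode_main.py.

```python
import re
from pathlib import Path
from typing import Dict, List

# Naming convention described above: seq00.dna ... seq26.dna
NAME_PATTERN = re.compile(r"^seq(\d{2})\.dna$")


def order_fragments(files: Dict[str, str]) -> List[str]:
    """Return cleaned sequences sorted by the index in each file name.

    files maps a file name to its text content; other names are ignored.
    """
    indexed = {}
    for name, text in files.items():
        match = NAME_PATTERN.match(name)
        if match is None:
            continue
        seq = "".join(text.split()).upper()  # drop whitespace and newlines
        unexpected = set(seq) - set("ACGT")
        if unexpected:
            raise ValueError(f"{name}: unexpected characters {sorted(unexpected)}")
        indexed[int(match.group(1))] = seq
    return [indexed[i] for i in sorted(indexed)]


def load_folder(folder: str) -> List[str]:
    """Read every file in folder as text and order the matching fragments."""
    files = {p.name: p.read_text() for p in Path(folder).iterdir() if p.is_file()}
    return order_fragments(files)
```

If a file fails to read as text or raises the `ValueError`, it is worth checking its format before decoding, since this sketch only handles plain A/T/C/G content.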
License
All these files are released under CC0 public domain dedication in accordance with Dryad’s requirements. Users may reuse, modify, and redistribute without restriction.
Code/software
All programs are developed and executed in Python.
Access information
Other publicly accessible locations of the data:
Data was derived from the following sources:
- n/a
