Liquid crystal-guided DNA information storage: Non-destructive recovery and long-term preservation

Data files

Sep 05, 2025 version files 19.82 MB

README.md (12.47 KB)

Sequencing_results_of_27_DNA_strands_encoding_Eine_Kleine_Nachtmusik.zip (8.67 MB)

The_algorithm_for_storing_audio_in_multiple_DNA_(Eine_Kleine_Nachtmusik).zip (131.67 KB)

The_algorithm_for_storing_text_in_DNA_fragment_(The_Sea).zip (70.39 KB)

The_algorithm_for_storing_text_in_genome_DNA_(The_Complete_Works_of_William_Shakespeare).zip (10.93 MB)

Abstract

DNA digital storage features high storage density, low power consumption, and extended digital recovery time. However, conventional DNA preservation methods suffer from limitations like low DNA loading and complex recovery processes. In this study, we develop a liquid crystal-guided DNA information preservation platform (LDIPP) assembled from 688 bp to 4.8 Mbp DNA information molecules and cationic surfactants. The thermotropic LDIPP provides encoded DNA information with high density loading, thermoplasticity, antimicrobial, and anti-enzymatic properties. Notably, DNA information was non-destructively recovered by manipulating the assembly structure using specific salt solutions and bio-amplified through microbial fermentation. By mineralizing inorganic crystals on LDIPP, the preservation lifetime is expected to be increased by nearly an order of magnitude at -20°C. The LDIPP platform offers enhanced DNA data loading, non-destructive recovery, and customizable macroscopic features, making it a promising option for long-term information preservation in the rapidly advancing field of DNA storage technology.

Dataset DOI: 10.5061/dryad.c59zw3rmp

Description of the data and file structure

The encoding algorithms are provided in three archives:
The_algorithm_for_storing_audio_in_multiple_DNA_(Eine_Kleine_Nachtmusik), The_algorithm_for_storing_text_in_DNA_fragment_(The_Sea), and The_algorithm_for_storing_text_in_genome_DNA_(The_Complete_Works_of_William_Shakespeare).

(The_algorithm_for_storing_text_in_DNA_fragment_(The_Sea)) To ensure high density and accuracy when storing information in DNA molecules, we developed a cascade algorithm that densely converts digital data into DNA sequences. It combines an arithmetic compression algorithm, a Lempel-Ziv-Welch (LZW) constraint algorithm, and a Reed-Solomon (RS) error-correcting code. The arithmetic compression algorithm ensured high storage efficiency throughout the encoding process. The LZW constraint algorithm regulated the GC content and eliminated homopolymers. The RS error-correcting code corrected nucleotide substitution and loss errors introduced during data writing, storage, and reading.
(The_algorithm_for_storing_text_in_genome_DNA_(The_Complete_Works_of_William_Shakespeare)) To evaluate the capacity of LDIPP for large-scale digital storage, a 4.8 Mbp DH5α Escherichia coli (E. coli) bacterial genome, theoretically corresponding to about 1.9 MB of digital data based on our encoding algorithm, was employed to construct the LDIPP.
(The_algorithm_for_storing_audio_in_multiple_DNA_(Eine_Kleine_Nachtmusik)) An audio segment of “Eine Kleine Nachtmusik” by Mozart (approximately 18 KB) was encoded into a total of 76 kbp of DNA sequence, divided into 27 strands of around 3000 bp each. To distinguish and recover every strand from a mixed sample, a unique index, a forward primer, and a reverse primer were designed and added at the 5’/3’ ends of each strand.
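
The sizes quoted above imply an effective information density for each demonstration. The following back-of-the-envelope sketch computes it, assuming decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes), which this README does not specify. The function name is illustrative and is not part of the deposited code.

```python
def bits_per_base_pair(data_bytes: float, base_pairs: float) -> float:
    """Effective density: digital bits represented per base pair of DNA."""
    return data_bytes * 8 / base_pairs


# Genome demonstration: about 1.9 MB of data for a 4.8 Mbp genome.
genome_density = bits_per_base_pair(1.9e6, 4.8e6)

# Audio demonstration: about 18 KB of data in about 76 kbp of DNA.
audio_density = bits_per_base_pair(18e3, 76e3)

print(f"genome: {genome_density:.2f} bits/bp")
print(f"audio:  {audio_density:.2f} bits/bp")
```

For reference, an uncompressed four-letter alphabet carries at most 2 bits per base, so these figures depend on how the data size is counted relative to compression and error-correction overhead.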

File: The_algorithm_for_storing_text_in_DNA_fragment_(The_Sea).zip

Description: This dataset contains scripts and supporting files for storing text information (The Sea) in a single plasmid sequence. The folder includes implementations of arithmetic coding, LZW compression, and Reed–Solomon error correction, along with the main encoding and decoding pipelines.

File and Directory Description

.idea/

Project configuration files automatically generated by the Python IDE. Not required for reproduction.

__pycache__/

Python cache files automatically generated when running scripts.

source_file/

Folder containing example text input files that are encoded into plasmid DNA sequences.

Scripts

main.py

Entry point script that integrates encoding and decoding functions for plasmid-based text storage.

main_encode.py

Script for encoding short text files into DNA sequences suitable for storage in a single plasmid.

main_decode.py

Script for decoding plasmid DNA sequences back into the original text files.

Arithmetic.py

Implements arithmetic coding for compressing and encoding text into DNA sequences.

Lzw.py

Implements Lempel–Ziv–Welch (LZW) compression for reducing redundancy in text before DNA encoding.

reedsolomon.py

Implements Reed–Solomon error correction to ensure robustness of the encoded DNA sequences.

encoding_frame.py

Defines the DNA sequence frame structure, including data payload, indexing, and error-correction markers.

FileTools.py

Utility functions for file input/output handling, including reading text and writing encoded DNA.

parameters.py

Configuration file specifying key parameters (fragment length, GC-content range, error-correction level, etc.).

sequence_find.py

Helper script for scanning DNA sequences and locating encoded payloads within a plasmid.
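
As a point of reference for the LZW step, here is a minimal sketch of textbook LZW compression and decompression over bytes. It is not the code in Lzw.py, and it does not include the GC-content or homopolymer constraints that the cascade algorithm applies; the function names are hypothetical.

```python
from typing import List


def lzw_compress(data: bytes) -> List[int]:
    """Textbook LZW: replace repeated byte strings with integer codes."""
    dictionary = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = len(dictionary)  # learn the new string
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes: List[int]) -> bytes:
    """Inverse of lzw_compress; rebuilds the dictionary while decoding."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    previous = dictionary[codes[0]]
    output = bytearray(previous)
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # Code refers to the string that is being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        output += entry
        dictionary[len(dictionary)] = previous + entry[:1]
        previous = entry
    return bytes(output)


sample = b"TOBEORNOTTOBEORTOBEORNOT"
sample_codes = lzw_compress(sample)
print(len(sample), "bytes ->", len(sample_codes), "codes")
```

The round trip can be checked with `lzw_decompress(lzw_compress(data)) == data`.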

Usage

Place the text file to be encoded inside the source_file/ directory.

Run main_encode.py to convert the input text into a DNA sequence formatted for single plasmid storage.

Internally, the process applies LZW compression, arithmetic coding, and Reed–Solomon error correction, and wraps the sequence into a predefined encoding frame.

The output DNA sequence is written to the results directory (or displayed in the console).

To decode, run main_decode.py with the encoded plasmid DNA sequence as input.

This script performs sequence identification, error correction, decompression, and recovers the original text file.

Advanced users can modify parameters.py to adjust fragment length, primer binding sites, or error-correction depth.
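
To illustrate the kind of sequence constraints mentioned for this pipeline (a GC-content range and the elimination of homopolymers), the sketch below checks a candidate sequence against both. The thresholds (40% to 60% GC, runs of at most 3 identical bases) and the function names are assumptions for illustration; the actual values are set in parameters.py.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base."""
    longest = 0
    run = 0
    previous = None
    for base in seq.upper():
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def satisfies_constraints(seq: str,
                          gc_min: float = 0.40,   # assumed lower bound
                          gc_max: float = 0.60,   # assumed upper bound
                          max_run: int = 3) -> bool:  # assumed run limit
    """True if the sequence meets both the GC and homopolymer limits."""
    return (gc_min <= gc_content(seq) <= gc_max
            and longest_homopolymer(seq) <= max_run)


print(satisfies_constraints("ATGCATGC"))
print(satisfies_constraints("AAAAGCGC"))
```

The second sequence has balanced GC content but contains a run of four A bases, so it fails the assumed homopolymer limit.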

File: The_algorithm_for_storing_text_in_genome_DNA_(The_Complete_Works_of_William_Shakespeare).zip

Description: This dataset contains source code, genomic data, and processed files used in the study Liquid Crystal-Guided DNA Information Storage: Non-Destructive Recovery and Long-Term Preservation. It includes scripts for encoding/decoding, genome-based data storage, and raw/reference text files used for demonstration.

File and Directory Description

Directories

__pycache__/

Auto-generated Python cache files. Not essential for analysis.

data/

Contains intermediate or supporting data files for running the scripts.

Data Files

JRYM01.1.fsa_nt

Reference genome sequence in FASTA format.

mutated_genome.fasta

Genome sequence with introduced mutations, used for testing error tolerance.

shakespeare.txt

Plain text of Shakespeare's works used as input data.

The Complete Works of William Shake.bin

Full binary code of Shakespeare's works for encoding experiments.

decoded_output/

Folder containing decoded results after applying genome-based retrieval.

encoded_output/

Folder containing encoded DNA sequences or intermediate genome fragments.

Scripts

encoder_v2.py

Main script for converting digital data into DNA-encoded sequences.

decoder_v2.py

Main script for decoding DNA sequences back into digital files.

genome_blast.py

Script for sequence alignment and similarity search using BLAST.

genome_loader.py

Script to load genome files into memory for processing.

genome_read.py

Script to read and parse genomic sequence data.

index_builder.py

Script for building index structures for fast lookup during encoding/decoding.

k_selector.py

Script for selecting optimal k-mer size for encoding.

kmer_gpu_counter.py

GPU-accelerated script for counting k-mers in genome sequences.
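
For readers without a GPU, k-mer counting on a FASTA sequence can be sketched on the CPU with the standard library alone. This is an illustrative stand-in, not the logic of kmer_gpu_counter.py or genome_read.py; the function names and the toy record are made up.

```python
from collections import Counter
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record name: sequence}."""
    parts: Dict[str, list] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {key: "".join(chunks) for key, chunks in parts.items()}


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy input; a real run would iterate over the lines of a genome FASTA file.
toy_fasta = [">demo toy record", "ATGAT", "GA"]
records = read_fasta(toy_fasta)
kmer_counts = count_kmers(records["demo"], k=3)
print(kmer_counts.most_common(2))
```

A sequence of length n contains n - k + 1 overlapping k-mers, which is why the toy 7-base record yields five 3-mers.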

Output Files

decoded_output/

Final results after decoding (text or binary files).

encoded_output/

DNA-encoded sequence files generated by encoder scripts.

Usage

Install Python 3.8+ and required libraries (numpy, biopython, etc.).

Run encoder_v2.py to encode input text files into DNA sequences.

Run decoder_v2.py to recover text files from encoded DNA sequences.

Optional scripts (genome_blast.py, kmer_gpu_counter.py, etc.) are used for additional analysis or acceleration.

File: The_algorithm_for_storing_audio_in_multiple_DNA_(Eine_Kleine_Nachtmusik).zip

Description: This dataset contains scripts, metadata, and DNA sequence files used for encoding an MP3 file into multiple DNA sequences. The folder includes encoding and decoding pipelines, primer design files, and final encoded DNA sequences for data storage experiments.

File and Directory Description

Directories

.idea/

Auto-generated project settings from the Python IDE. Not required for reproduction.

__pycache__/

Python cache files automatically generated during script execution.

test/

Contains test files and intermediate results for validation of encoding/decoding scripts.

Data Files

final_dna_sequences.fasta

FASTA-formatted sequences representing fragments of plasmids storing the encoded MP3 file.

fragment_metadata.xlsx

Metadata table describing each DNA fragment.

fragment_index: Numerical index for each DNA fragment.
block_id: Identifier for the data block that the fragment belongs to.
index_seq: Index sequence (barcode) assigned to the fragment.
order_in_block: The position/order of the fragment within its block.
payload_length: The length (in nucleotides) of the fragment payload region that encodes data.

candidate_primer_sequences_ref.xlsx

Excel file containing candidate primer sequences for PCR amplification.

sequence: Nucleotide sequence of the primer candidate.

decoded_output.bin

Binary file reconstructed after decoding the DNA sequences back into the original MP3 data.
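
The columns of fragment_metadata.xlsx listed above can be modeled as a small record type. The sketch below is one possible in-memory representation; the column types (for example, an integer block_id) and the example values are assumptions, and reading the .xlsx file itself is left out because it needs a third-party library.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FragmentRecord:
    """One row of the fragment metadata table (types are assumed)."""
    fragment_index: int   # numerical index of the fragment
    block_id: int         # data block the fragment belongs to
    index_seq: str        # index (barcode) sequence
    order_in_block: int   # position of the fragment within its block
    payload_length: int   # payload length in nucleotides


def assembly_order(rows: Iterable[FragmentRecord]) -> List[FragmentRecord]:
    """Sort fragments by block, then by position inside the block."""
    return sorted(rows, key=lambda row: (row.block_id, row.order_in_block))


# Made-up rows for illustration only.
rows = [
    FragmentRecord(2, 1, "ACGTAC", 0, 2900),
    FragmentRecord(1, 0, "TGCATG", 1, 2900),
    FragmentRecord(0, 0, "GATCGA", 0, 2900),
]
ordered = assembly_order(rows)
print([row.fragment_index for row in ordered])
```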

Scripts

encode_main.py

Main script for encoding digital files (e.g., MP3) into DNA sequences distributed across multiple plasmids.

decode_main.py

Main script for decoding DNA sequences back into the original digital file.

arith_encode.py

Script implementing arithmetic coding to compress and encode digital data into DNA-friendly formats.

arith_decode.py

Script for decoding DNA sequences that were encoded with arithmetic coding.

raptorg_encode.py

Script implementing RaptorQ forward error correction coding during DNA encoding.

raptorg_decode.py

Script for recovering digital data from DNA sequences using RaptorQ decoding.

rs_codec.py

Script implementing Reed–Solomon error correction to ensure robustness of stored DNA sequences.

fragmenter.py

Splits the full DNA sequence into smaller plasmid fragments with metadata annotation.

primer_generator.py

Script to design primers for plasmid fragments.

primer_selector.py

Script to select optimized primers from candidate lists.

dna_utils.py

Utility functions for DNA sequence manipulation, including GC content calculation and sequence validation.

utils_analysis.py

Supplementary analysis tools for evaluating fragment quality and error correction performance.

Usage

Prepare the input MP3 file.

Run encode_main.py to encode the MP3 into DNA sequences.

arith_encode.py, raptorg_encode.py, and rs_codec.py are called internally for compression and error correction.

Run fragmenter.py to divide encoded DNA into plasmid-sized fragments, generating final_dna_sequences.fasta and fragment_metadata.xlsx.

Use primer_generator.py and primer_selector.py to design primers for amplification and sequencing validation.

To reconstruct the file, run decode_main.py. This will call raptorg_decode.py, arith_decode.py, and rs_codec.py to decode and recover the original MP3 as decoded_output.bin.
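
The fragmentation step can be pictured as cutting the encoded sequence into pieces and flanking each piece with its index and primers. The sketch below is a simplified illustration, not the logic of fragmenter.py: the strand layout (forward primer, index, payload, reverse complement of the reverse primer), the function names, and the toy sequences are all assumptions.

```python
from typing import List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def split_payload(sequence: str, fragment_length: int) -> List[str]:
    """Cut a sequence into consecutive pieces of at most fragment_length."""
    if fragment_length <= 0:
        raise ValueError("fragment_length must be positive")
    return [sequence[i:i + fragment_length]
            for i in range(0, len(sequence), fragment_length)]


def build_strand(payload: str, index_seq: str,
                 forward_primer: str, reverse_primer: str) -> str:
    """Assumed layout: forward primer + index + payload + primer site."""
    return (forward_primer + index_seq + payload
            + reverse_complement(reverse_primer))


pieces = split_payload("ACGTACGTAC", fragment_length=4)
strand = build_strand(pieces[0], index_seq="TT",
                      forward_primer="GGA", reverse_primer="CCA")
print(pieces)
print(strand)
```

The reverse primer anneals to the opposite strand, which is why its reverse complement appears at the 3' end of the strand in this assumed layout.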

File: Sequencing_results_of_27_DNA_strands_encoding_Eine_Kleine_Nachtmusik.zip

Description: This folder contains sequencing results of 27 DNA strands used to store and retrieve segments of the MP3 music file Eine Kleine Nachtmusik. Each file corresponds to one DNA fragment obtained from plasmid-based storage platforms, sequenced to validate the integrity of stored information.

File Description

The files are named sequentially as seqXX.dna, where XX is an integer index (00–26).

Each file contains the nucleotide sequence of a single DNA fragment in plain text format (.dna).

Collectively, the 27 fragments represent the full DNA-encoded form of the MP3 file.

Examples

seq00.dna – First DNA fragment encoding a portion of the MP3 file.

seq01.dna – Second DNA fragment encoding a subsequent portion.

…

seq26.dna – Final DNA fragment completing the set of 27.

Data Format

File extension: .dna

Content: Plain text string of nucleotides (A, T, C, G) representing encoded digital data.

Each DNA sequence has been validated by sequencing, and corresponds to one designed fragment used for storage.

Usage

Collect all 27 .dna files (seq00–seq26).

Input the files into the decoding pipeline (e.g., decode_main.py from The_algorithm_for_storing_audio_in_multiple_DNA_(Eine_Kleine_Nachtmusik).zip).

The decoding process will:

Align fragments according to their index,

Apply error correction (Reed–Solomon, RaptorQ),

Reconstruct the binary file.

The final output is the original MP3 file (Eine Kleine Nachtmusik).
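
Before decoding, it can help to confirm that all 27 files are present and contain only valid bases. The sketch below loads them in index order, assuming the seqXX.dna naming and the plain-text format described above; the function names are hypothetical, and how decode_main.py actually expects its input is not specified here.

```python
from pathlib import Path
from typing import List

VALID_BASES = set("ATCG")


def expected_names(count: int = 27) -> List[str]:
    """File names seq00.dna ... seq26.dna, following the seqXX.dna pattern."""
    return [f"seq{i:02d}.dna" for i in range(count)]


def load_fragment(path: Path) -> str:
    """Read one plain-text .dna file and check it holds only A, T, C, G."""
    sequence = "".join(Path(path).read_text().split()).upper()
    unexpected = set(sequence) - VALID_BASES
    if unexpected:
        raise ValueError(f"{path}: unexpected characters {sorted(unexpected)}")
    return sequence


def load_all(folder: Path, count: int = 27) -> List[str]:
    """Load fragments in index order; a missing file raises FileNotFoundError."""
    return [load_fragment(Path(folder) / name)
            for name in expected_names(count)]


print(expected_names()[0], "...", expected_names()[-1])
```

Calling `load_all(Path("path/to/unzipped/folder"))` would then return the 27 sequences as a list ordered by index.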

License

All these files are released under CC0 public domain dedication in accordance with Dryad’s requirements. Users may reuse, modify, and redistribute without restriction.

Code/software

All programs are developed and executed in Python.

Access information

Other publicly accessible locations of the data:

https://github.com/liu-yangyi/LC_DNA_storage

Data was derived from the following sources:

Liquid crystal-guided DNA information storage: Non-destructive recovery and long-term preservation
