Data from: Diffractive tensorized unit for million-TOPS general-purpose computing
Data files
Aug 18, 2025 version, files total 14.05 GB
- Dataset1_NLG.zip (45.84 MB)
- Dataset2_MSVD.zip (1.83 GB)
- Dataset3_MSRVTT.z01 (5.37 GB)
- Dataset3_MSRVTT.z02 (5.37 GB)
- Dataset3_MSRVTT.zip (341.37 MB)
- Dataset4_MNIST.zip (11.51 MB)
- Dataset5_FashionMNIST.zip (47.47 MB)
- Dataset6_CIFAR10.zip (463.91 MB)
- Dataset7_KTH.zip (555.37 MB)
- Dataset8_MovingMNIST.zip (17.19 MB)
- README.md (7.47 KB)
Abstract
Photonic computing has emerged as a promising next-generation technology for processors, with diffraction-based architectures showing particular potential for large-scale parallel processing. Unfortunately, the lack of on-chip reconfigurability poses significant obstacles to realizing general-purpose computing, restricting the adaptability of these architectures to diverse advanced applications. We propose a diffractive tensorized unit (DTU), which is a fully reconfigurable photonic processor supporting million-TOPS general-purpose computing. The DTU leverages a tensor factorization approach to perform complex matrix multiplication through clustered diffractive tensor cores (DTCs), while each DTC employs a near-core modulation mechanism to activate dynamic temporal diffractive connections. Experiments confirm that the DTU overcomes the long-standing generality and scalability constraints of diffractive computing, realizing general computing with a 10⁻⁶ mean absolute error (MAE) for arbitrary 1,024-size matrix multiplications. Compared with state-of-the-art electronic-based solutions, the DTU not only achieves competitive accuracy on various challenging tasks, such as natural language generation and cross-modal recognition, but also delivers a remarkable 1,000× improvement in throughput over conventional electronic processors. The proposed DTU represents a leap forward in general-purpose photonic computing, paving the way for further advancements in large-scale artificial intelligence.
https://doi.org/10.5061/dryad.7d7wm387c
This repository includes the following datasets:
Dataset 1: NLG (Natural Language Generation) dataset. This dataset contains raw feature data for word2vec and doc2vec training, along with a preprocessing Python file. The dataset includes multiple sub-datasets: the word2vec folder contains novels (Alice's Adventures in Wonderland, Harry Potter, The Little Prince, Wizard of Oz), and the doc2vec folder contains Chinese couplets, poetry, and The Little Prince. The preprocessing script converts text data into vector files that can be read by the training program. The dataset uses an x-in-y-out setting with a fixed number of input words (x) and predicted words (y); sentences shorter than x + y are ignored. In the practical simulations, x is set to 4, 6, 8, 10, 12, 14, and 16, and y is set to 2, 4, and 8. After 1,000 iterations, the embeddings of each selected word or sentence are obtained, which are shaped into a 50-way vector.
Dataset 2: MSVD (Microsoft Video Description) dataset. This dataset contains video clips with corresponding text descriptions, designed for video-to-text generation and cross-modal recognition tasks. The MSVD consists of more than 1,970 video snippets along with approximately 120,000 descriptions. As a practical common setting, we sample 2 frames per second from each video, and every frame is tensorized to a resolution of 128 × 120 × 3. Original dataset: https://www.cs.utexas.edu/~ml/clamp/videoDescription/
Dataset 3: MSRVTT (Microsoft Research Video to Text) dataset. This is a comprehensive multi-modal dataset containing video clips with text descriptions. MSRVTT contains 10,000 video clips and each is annotated with 20 English sentences with 29,000 unique words. This dataset is used for video captioning and cross-modal recognition tasks. Original dataset: https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/
Dataset 4: MNIST dataset. This is the classic handwritten digit recognition dataset containing 60,000 training images and 10,000 test images of handwritten digits (0-9). Each original image is 28x28 pixels in grayscale. Original dataset: http://yann.lecun.com/exdb/mnist/
Dataset 5: FashionMNIST dataset. This dataset contains 70,000 grayscale images of fashion items across 10 categories, including clothing items like shirts, dresses, shoes, and bags. Each original image is 28x28 pixels, designed as a drop-in replacement for the original MNIST dataset. Original dataset: https://github.com/zalandoresearch/fashion-mnist
Dataset 6: CIFAR-10 dataset. This dataset contains 60,000 32x32 color images in 10 different classes, including airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Original dataset: https://www.cs.toronto.edu/~kriz/cifar.html
Dataset 7: KTH dataset. This dataset contains video sequences for human action recognition and video generation tasks. KTH contains 2,391 video clips with coherent actions from 6 categories, and each video is recorded by a static camera with a resolution of 320 × 240 at 25 frames per second. Original dataset: https://www.csc.kth.se/cvap/actions/
Dataset 8: Moving MNIST dataset. This dataset contains sequences of moving handwritten digits, created by animating the original MNIST digits. MovingMNIST is modified from MNIST, in which each digit is randomly moving across a 64×64-pixel grid. The dataset consists of 10,000 sequences, each with 10 consecutive frames of a digit in motion. This dataset is commonly used for video prediction, sequence modeling, and temporal data analysis tasks. Original dataset: https://www.cs.toronto.edu/~nitish/unsupervised_video/
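The x-in-y-out setting described for Dataset 1 can be illustrated with a minimal sketch. It is not the repository's preprocess.py: the function name x_in_y_out is hypothetical, and taking a single window from the start of each sentence is an assumption, since the description does not say how windows are positioned within longer sentences.

```python
from typing import List, Sequence, Tuple


def x_in_y_out(
    sentences: Sequence[Sequence[str]], x: int, y: int
) -> List[Tuple[List[str], List[str]]]:
    """Pair the first x tokens of each sentence with the following y tokens.

    Sentences shorter than x + y are skipped, as stated in the dataset
    description. Using only the leading window is an assumption.
    """
    pairs = []
    for tokens in sentences:
        if len(tokens) < x + y:
            continue  # too short for this (x, y) setting
        inputs = list(tokens[:x])
        targets = list(tokens[x:x + y])
        pairs.append((inputs, targets))
    return pairs


# Illustrative sentences only; they are not taken from the dataset files.
sentences = [
    "the little prince lived on a very small planet".split(),
    "too short".split(),
]
pairs = x_in_y_out(sentences, x=4, y=2)
```

With x = 4 and y = 2, the first sentence yields one input/target pair and the second is dropped because it has fewer than six words.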
The files in this repository are compressed folders for the eight datasets. Please download and extract them for further usage.
For Dataset 1 (Dataset1_NLG.zip), you will find raw feature data for word2vec and doc2vec training, along with a preprocessing Python script. After extraction, you will find a preprocess.py file and two subdirectories, word2vec/ and doc2vec/, containing data files for training. The script is a sample showing how to convert the raw text data into vector files that can be read by the training program.
For Dataset 2 (Dataset2_MSVD.zip), you will find video clips with corresponding text descriptions. After extraction, you will find all_video_features.npy containing video features and all_video_captions.txt containing corresponding captions for cross-modal recognition tasks.
For Dataset 3 (Dataset3_MSRVTT.zip, Dataset3_MSRVTT.z01, Dataset3_MSRVTT.z02), note that you need to download all three files and extract them together, as they form a single split archive. After extraction, you will find 10 separate .npy files containing video features and msrvtt_video_captions.json with corresponding captions for each video. The video features can be loaded using the numpy.load() function, and the captions can be accessed through the JSON file for video captioning tasks.
For Dataset 4 (Dataset4_MNIST.zip), you will find the classic MNIST handwritten digit dataset. After extraction, you will find train_images.npy, train_labels.npy, test_images.npy, and test_labels.npy containing data from 60,000 training images and 10,000 test images.
For Dataset 5 (Dataset5_FashionMNIST.zip), you will find the Fashion-MNIST dataset containing fashion item images. After extraction, you will find fashionmnist_train_images.npy, fashionmnist_train_labels.npy, fashionmnist_test_images.npy, and fashionmnist_test_labels.npy containing 70,000 grayscale images across 10 fashion categories.
For Dataset 6 (Dataset6_CIFAR10.zip), you will find four numpy .npy files containing the training and testing data along with their ground truth data. Use the numpy.load() function to read them.
For Dataset 7 (Dataset7_KTH.zip), you will find video sequences for human action recognition. After extraction, you will find train_kth.npy and test_kth.npy containing various human actions and movements.
For Dataset 8 (Dataset8_MovingMNIST.zip), you will find sequences of moving handwritten digits. After extraction, you will find movingmnist_train.npy and movingmnist_test.npy containing animated MNIST digits for video prediction and sequence modeling tasks.
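Before training, it can help to confirm that the downloads and extracted files are complete. The sketch below uses the file names listed above. The helper names (missing_files, load_train_pair) are hypothetical, and whether each archive unpacks into its own subfolder is an assumption, so the folder argument may need adjusting. Python's built-in zipfile module does not handle multi-disk archives, so extracting the three-part Dataset 3 archive is left to an external archive tool that supports split zip files.

```python
from pathlib import Path

# The three parts of the Dataset 3 split archive; all must sit in one folder.
MSRVTT_PARTS = ["Dataset3_MSRVTT.z01", "Dataset3_MSRVTT.z02", "Dataset3_MSRVTT.zip"]

# Extracted file names as listed in the per-dataset notes above.
# Datasets 3 and 6 are omitted because their .npy names are not listed.
EXPECTED = {
    "Dataset2_MSVD": ["all_video_features.npy", "all_video_captions.txt"],
    "Dataset4_MNIST": [
        "train_images.npy", "train_labels.npy",
        "test_images.npy", "test_labels.npy",
    ],
    "Dataset5_FashionMNIST": [
        "fashionmnist_train_images.npy", "fashionmnist_train_labels.npy",
        "fashionmnist_test_images.npy", "fashionmnist_test_labels.npy",
    ],
    "Dataset7_KTH": ["train_kth.npy", "test_kth.npy"],
    "Dataset8_MovingMNIST": ["movingmnist_train.npy", "movingmnist_test.npy"],
}


def missing_files(folder, names):
    """Return the names from `names` that are not regular files in `folder`."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]


def load_train_pair(folder, prefix=""):
    """Load <prefix>train_images.npy and <prefix>train_labels.npy.

    Only checks that both arrays have the same length; the array layout
    is not documented here, so no shape is assumed.
    """
    import numpy as np  # imported here so the file checks run without NumPy

    folder = Path(folder)
    images = np.load(folder / f"{prefix}train_images.npy")
    labels = np.load(folder / f"{prefix}train_labels.npy")
    if len(images) != len(labels):
        raise ValueError("image and label counts differ")
    return images, labels
```

For example, missing_files("downloads", MSRVTT_PARTS) returns an empty list once all three Dataset 3 parts are present, and load_train_pair(folder, prefix="fashionmnist_") reads the Fashion-MNIST training arrays from the given folder.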
Even though most of the datasets mentioned in this repository are based on public datasets, the actual files may have been modified or preprocessed for specific experiments. The data formats and structures have been adapted to match the requirements of natural language generation, cross-modal recognition and video generation tasks. Please cite the original papers for each dataset as well as this repository if you consider using these data in your research. Thank you very much.
I, the copyright holder of this work, hereby publish it under the following license: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication.
The original data was preprocessed to align with the architectural specifications of our DTU optical computing hardware. This involved image resizing, cropping, and a reorganization of the training-testing splits to optimize computational efficiency.
