6D-ViCuT: Six degree-of-freedom visual cuboid tracking dataset for manual packing of cargo in warehouses
Data files
Oct 16, 2025 version files — 92.17 GB total

| File | Size |
|---|---|
| group1.zip | 5.19 GB |
| group10.zip | 6.06 GB |
| group11.zip | 6.27 GB |
| group12.zip | 6.20 GB |
| group13.zip | 5.63 GB |
| group14.zip | 4.89 GB |
| group15.zip | 4.94 GB |
| group16.zip | 6.78 GB |
| group2.zip | 5.34 GB |
| group3.zip | 5.35 GB |
| group4.zip | 5.24 GB |
| group5.zip | 5.10 GB |
| group6.zip | 6.73 GB |
| group7.zip | 6.58 GB |
| group8.zip | 5.83 GB |
| group9.zip | 6.03 GB |
| README.md | 12.16 KB |
Abstract
Visual tracking of objects is a fundamental technology for Industry 4.0, as it allows digital content to be integrated with real-world objects. The industrial operation known as manual cargo packing can benefit from visual object tracking, yet no dataset exists to evaluate visual tracking algorithms in manual packing scenarios. To close this gap, this article presents 6D-ViCuT, a dataset of images and 6D pose ground truth of cuboids during a manual packing operation in intralogistics. The initial release comprises 28 sessions acquired in a space that recreates a manual packing zone: indoors, 6 × 4 × 2 m³, with warehouse illumination. The data acquisition experiment captures images from fixed and mobile RGB-D devices and a motion-capture system while an operator performs a manual packing operation. Each session contains between 6 and 18 boxes drawn from a set of 10 types that vary in height, width, depth, and texture, and lasts between 1 and 5 minutes. Sessions differ in operator speed and box characteristics (texture, size heterogeneity, occlusion).
The 6D-ViCuT dataset offers synchronized multimodal recordings designed for research on visual tracking and six-degree-of-freedom (6D) pose estimation of cuboid-shaped objects during manual cargo packing operations.
It enables benchmarking and training of computer vision algorithms under realistic industrial conditions, including occlusions, variable lighting, and operator motion.
A total of 28 sessions were recorded in a reconstructed packing area (6 × 4 × 2 m³, indoor illumination) using the following sensors:
- Microsoft HoloLens 2 (HL2) — head-mounted RGB-D sensor capturing RGB, depth, and pose data.
- Microsoft Kinect v2 — fixed RGB-D camera observing the consolidation zone.
- PhaseSpace MoCap system — motion-capture reference for 6D ground truth of boxes and operator.
Each session lasts between 1 and 5 minutes, includes 6–18 boxes (from 10 box types in total), and presents variability in size, texture, occlusion, and operator speed.
Description of the Data and File Structure
Data Structure
The dataset includes 28 sessions. Each session captures the simultaneous manual packing of cuboid boxes within the consolidation zone and the corresponding sensor recordings from all devices.
Data generated during each session is stored in a folder named sessionx, where x represents the session identifier.
The following diagram shows the folder and file structure for Session 1, which generalizes to all sessions. For additional details, see Reference [1].
├── session1 Data from Session 1
│ ├── analyzed Processed data derived from raw data
│ │ ├── HL2 HL2 processed data
│ │ │ └── boxByFrame.txt File that begins with a single line indicating the total number of HL2 frames recorded in the session, followed by a three-column table: (1) target box ID, (2) frame where the operator starts seeking the box, and (3) frame where the operator reaches the target box.
│ │ └── Th2m.txt Transformation matrix (r11 r12 r13 px r21 r22 r23 py r31 r32 r33 pz fscore) mapping points from qh to qm coordinates; fscore indicates fit quality.
│ ├── filtered Data obtained after filtering the raw data
│ │ ├── HL2 Data generated from the HL2 sensor
│ │ │ └── PointClouds.zip Contains non-colored 3D point clouds (PLY format) at each keyframe.
│ │ │     ├── PCDouble2SingleID.txt Two-column text file: (1) point cloud timestamp, (2) single-precision frame ID.
│ │ │     ├── framek.ply Point cloud at instant k.
│ │ │     └── scriptToConvertPLYs.sh Example Bash script for converting double-precision to single-precision point clouds.
│ │ └── MoCap Data acquired from the MoCap system
│ │ └── sessionDescriptor.csv CSV file organized in two sections. First row: (1) number of boxes in the scene, (2) number of MoCap frames, (3) initial occlusion level [1–3], (4) initial texture level [1–3], (5) box heterogeneity level [1–3], and (6) operator speed level [1–2]. Subsequent rows: (1) box ID, (2) corresponding HL2 frame, (3) components of the transformation matrix that projects points from the MoCap coordinate frame to each box coordinate frame, in the sequence r11 r12 r13 px r21 r22 r23 py r31 r32 r33 pz.
│ └── raw Unprocessed sensor data
│ ├── HL2 Data acquired from the HL2 system. See Reference [2] for details.
│ │ ├── 2021-07-01-130927_head_hand_eye.csv Each row begins with a timestamp (FILETIME format), followed by 4×4 transformation matrices. The first matrix represents the head pose in world coordinates. Subsequent matrices correspond to the left- and right-hand joint poses, each given as a homogeneous transformation (r11 r12 r13 px r21 r22 r23 py r31 r32 r33 pz). Non-tracked joints are filled with zeros or a sentinel value (–1.0737418e+08).
│ │ ├── 2021-07-01-130927_pv.txt Text file containing the intrinsic and extrinsic parameters of the HL2 Photo-Video (PV) camera. The first line lists intrinsic parameters (fx, fy, cx, cy). Each subsequent row corresponds to a captured frame and includes: (1) timestamp in FILETIME units, (2–4) translation of the PV camera in world coordinates (x, y, z), and (5–16) components of the 4×4 transformation matrix describing the PV camera pose (rotation and translation). The final four values (0, 0, 0, 1) close the homogeneous matrix.
│ │ ├── Depth Long Throw_extrinsics.txt 4×4 homogeneous transformation matrix that maps 3D points from the Depth Long Throw camera coordinate frame to the HL2 rig coordinate frame. Values are given in row-major order (r11 r12 r13 px r21 r22 r23 py r31 r32 r33 pz 0 0 0 1), with translation in meters.
│ │ ├── Depth Long Throw_lut.bin Binary lookup table containing a per-pixel unit-direction vector for the 320×288 Long-Throw depth camera. Each entry consists of three 32-bit floats (x, y, z) representing the normalized ray direction in the camera frame. The file can be read with NumPy (dtype=float32) and used to back-project depth values into 3D points.
│ │ ├── Depth Long Throw_rig2world.txt Each row contains: (1) the frame timestamp in FILETIME units, followed by (2) the 4×4 homogeneous transformation (row-major) from the HL2 rig coordinate frame to world coordinates (rotation and translation; final row 0 0 0 1). Use together with the Long-Throw extrinsics to place depth points in world space.
│ │ ├── Depth Long Throw.tar Contains the raw depth frames captured by the HL2 Long-Throw depth camera. Each frame includes three aligned image buffers: (1) a distance (depth) map storing 16-bit metric distances for each pixel; (2) an active-brightness (AB) image recording the infrared intensity of the reflected light used for quality and texture reference; and (3) a sigma buffer encoding per-pixel confidence and validity flags (e.g., low signal, saturation, out-of-range).
│ │ └── PV.tar Contains the raw frames captured by the HL2 Photo-Video (PV) camera, which is the RGB sensor. Each frame includes a color image (typically 1280 × 720 pixels, 8-bit RGB) and associated timestamp metadata for synchronization with other sensor streams.
│ ├── Kinect Data acquired from the Kinect system
│ │ ├── Kinect_Color.mp4 Temporal sequence of color images.
│ │ └── Kinect_Depth Contains depth images stored in NumPy .npz format, which can be loaded in Python using the NumPy library. Each array represents a depth map (in millimeters) captured by the Kinect sensor.
│ └── MoCap Data acquired from the MoCap system
│ ├── markerDescriptors.json JSON metadata file generated by the PhaseSpace motion-capture system. It records session-level parameters (recording date, duration, frame count, and capture frequency) and defines each tracked optical marker used during the experiment. The “markers” array lists all marker IDs and their associated names (e.g., HL2 markers attached to the headset and pairs of markers attached to each box, such as box15_m1, box15_m2). This file enables identification of each marker’s role when interpreting the 3D coordinates stored in markerPosition.c3d.
│ └── markerPosition.c3d 3D coordinates (x, y, z) of all active markers in each frame, captured by the MoCap system.
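Per the Long-Throw LUT description above, back-projecting a depth frame can be sketched as follows. Only the layout of three float32 values per pixel comes from the file description; the helper names, the row-major 288 × 320 ordering, and the millimeter depth scale are assumptions to verify against the actual files:

```python
import numpy as np

def load_lut(path, height=288, width=320):
    """Read the per-pixel unit ray directions (three float32 values per pixel)."""
    lut = np.fromfile(path, dtype=np.float32)
    return lut.reshape(height * width, 3)

def backproject(lut, depth_mm):
    """Scale each ray by its depth (assumed to be stored in millimeters).

    depth_mm : (H, W) uint16 distance map from the Long-Throw camera.
    Returns an (N, 3) array of points in the depth-camera frame;
    zero-depth (invalid) pixels are dropped.
    """
    depth_m = depth_mm.reshape(-1).astype(np.float32) / 1000.0
    points = lut * depth_m[:, None]
    return points[depth_m > 0]
```

The resulting camera-frame points can then be chained through Depth Long Throw_extrinsics.txt and Depth Long Throw_rig2world.txt to reach world coordinates.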
Additionally, a miscellaneous folder has the following structure:
├── misc
│ ├── boxDescriptors.csv CSV listing box properties: (1) ID, (2) Type [1–10], (3) Height (cm), (4) Width (cm), (5) Depth (cm), (6) Texture [0–1].
│ ├── colorDepth2PC.py Example script to generate point clouds from Kinect depth images using camera parameters.
│ ├── kinect_intrinsicP.csv Intrinsic parameters of the Kinect depth sensor.
│ └── Tk2m_gross.txt Transformation matrix to project a 3D point from the qm to the qk coordinate system, in the sequence r11 r12 r13 px r21 r22 r23 py r31 r32 r33 pz.
Grouped Data Files
To optimize download efficiency, the dataset is distributed across 16 compressed archives, each containing one or more sessions (at most 6.78 GB per group).
Group numbering does not correspond to session numbering.
| File name | Sessions included |
|---|---|
| group1.zip | 21, misc |
| group2.zip | 18 |
| group3.zip | 6 |
| group4.zip | 7 |
| group5.zip | 3 |
| group6.zip | 35, 39 |
| group7.zip | 32, 33 |
| group8.zip | 10, 36 |
| group9.zip | 17, 27 |
| group10.zip | 15, 20 |
| group11.zip | 13, 19 |
| group12.zip | 5, 54 |
| group13.zip | 2, 45 |
| group14.zip | 4, 25 |
| group15.zip | 16, 53 |
| group16.zip | 1, 12, 52 |
Each group decompresses into its respective session folders, which share the same structure as session1.
Potential Applications
- Algorithm training and validation: Fine-tune object pose estimation or visual tracking models for cuboid objects in realistic warehouse settings.
- Benchmarking: Evaluate performance under motion blur, occlusion, and texture variability.
- Sensor fusion: Compare pose estimates from RGB-D data (HL2/Kinect) against 6D motion-capture ground truth.
- Data augmentation: Generate synthetic datasets using the provided 3D transformations and annotations.
Sharing and Access Information
There are currently no additional access methods available.
Code and Software, Usage Notes
Any program capable of opening the following formats can be used to read the data files.
Raw
- Color images from the mobile sensor (760 × 428 px resolution), PNG format
- Depth images from the mobile sensor (320 × 288 px resolution), PGM format
- HL2 device poses for each keyframe, from the on-device tracker, TXT format
- Color images from the fixed sensor (1920 × 1080 px resolution), MP4 format
- Depth images from the fixed sensor (512 × 424 px resolution), NPZ format
- 3D position of MoCap markers in millimeters, C3D format
- Metadata from MoCap markers, JSON format
Analyzed
- Matrix transformations between HL2 coordinate frame and MoCap coordinate frame, TXT format.
- Target box at each HL2 frame, TXT format
Filtered
- Point Clouds from mobile sensor signals, PLY format.
- 6D pose of boxes from MoCap signals, CSV format.
- Session descriptors (number of boxes, occlusion level, operator’s speed level, and 6D initial pose of boxes from MoCap signals), CSV format
Miscellaneous
- Box descriptors (length, depth, width, type, texture), CSV format
- Intrinsic parameters of depth and color Kinect cameras, CSV format
References
[1] G. A. Camacho-Muñoz, J. C. M. Franco, S. E. Nope-Rodríguez, H. Loaiza-Correa, S. Gil-Parga, and D. Álvarez-Martínez, “6D-ViCuT: Six degree-of-freedom visual cuboid tracking dataset for manual packing of cargo in warehouses,” Data in Brief, p. 109385, 2023. https://doi.org/10.1016/j.dib.2023.109385
[2] D. Ungureanu et al., “HL2 Research Mode as a Tool for Computer Vision Research,” arXiv e-prints, arXiv:2008.11239, 2020. https://doi.org/10.48550/arXiv.2008.11239
[3] G. A. Camacho-Muñoz, S. E. Nope-Rodríguez, H. Loaiza-Correa, J. P. Silva do Monte Lima, and R. A. Roberto, “Evaluation of the use of box size priors for 6D plane segment tracking from point clouds with applications in cargo packing,” EURASIP Journal on Image and Video Processing, vol. 17, 2024. https://doi.org/10.1186/s13640-024-00636-1
Data were acquired indoors, in an area isolated from sunlight and illuminated by tube lights. The workspace recreates a dispatching zone of a warehouse. The operator performs a manual packing operation while the sensors scan the scene. Each session was designed to obtain a different combination of four factors: occlusion, cargo texture, cargo size heterogeneity, and operator speed. The first three factors have three levels each; the last has two.
An operator grasps and transports each box from a consolidation zone to a packing zone. Meanwhile, the scene is scanned by three devices:
1. A Microsoft HoloLens 2 (HL2) attached to the operator’s head acquires RGB-D images and the camera’s 6D pose in space.
2. A Microsoft Kinect v2 was fixed with a line of sight to the consolidation zone to acquire RGB-D images.
3. A PhaseSpace motion-capture (MoCap) system was set up to cover the workspace. Two active markers were attached to the top side of each box, and five additional markers were attached to the HL2 device.
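Based on the marker-naming convention above (two markers per box, e.g. box15_m1 and box15_m2), the marker-to-box pairing can be recovered from markerDescriptors.json. The "markers" array is per the file description; the per-entry field names "id" and "name" used below are assumptions to check against the actual file:

```python
import json
from collections import defaultdict

def group_box_markers(descriptor_text):
    """Group marker IDs by box, using names such as box15_m1 / box15_m2."""
    meta = json.loads(descriptor_text)
    boxes = defaultdict(list)
    for marker in meta["markers"]:
        name = marker["name"]
        if name.startswith("box"):
            boxes[name.split("_")[0]].append(marker["id"])
    return dict(boxes)
```

The returned mapping can then be used to look up each box's two marker trajectories in markerPosition.c3d.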
