IoT network traffic dataset using the custom flow representation
Data files
Nov 24, 2025 version files (2.57 GB total):
- bidirectional.tar.gz (1.78 GB)
- README.md (16.16 KB)
- unidirectional.tar.gz (794.91 MB)
Abstract
This dataset provides Custom Flow representations derived from raw IoT network traffic traces, capturing detailed behavioral characteristics of IoT communications. Each Custom Flow encapsulates network behavior in a structured, vectorized format that includes flow-level metadata, packet sequence timing, direction, and selected payloads. Flows are uniquely identified by a five-tuple (device IP address, remote IP address, protocol, device port, and remote port) and have a fixed one-minute lifetime. To ensure consistent temporal granularity and computational efficiency, long-lived connections (such as persistent IoT–cloud sessions) are segmented into consecutive flow records sharing the same identifier. The dataset was generated from 60 days of packet capture (PCAP) traces obtained from the publicly available UNSW IoT Traffic Analytics platform. Two variants are included: (1) Bidirectional Custom Flows, capturing both upstream and downstream packets (about 6 million flows), and (2) Unidirectional Custom Flows, capturing only upstream packets from the device perspective (about 3.5 million flows). Each day’s data is provided as a separate Parquet file, organized and compressed by direction to facilitate scalable analysis. The result is a fine-grained yet computationally efficient representation of IoT network behavior, supporting research in traffic analysis, anomaly detection, and IoT device identification.
Dataset DOI: 10.5061/dryad.6q573n6c1
Description of the data and file structure
Custom Flow – A Comprehensive Network Traffic Representation
A. Overview
Analyzing patterns in network traffic can be performed at both micro and macro levels.
At the micro level, inspecting byte values within packet headers and payloads provides detailed behavioral insights but is often computationally expensive and limited in capturing a broader communication context.
At the macro level, flow-based aggregation offers a more scalable and cost-effective alternative. A network flow represents a sequence of packets sharing common properties such as source/destination IP addresses, source/destination port numbers, and protocol (e.g., TCP or UDP).
While flow records are efficient for large-scale monitoring, they may lack the granularity needed for fine-grained classification of diverse devices and applications.
To address this, we propose a hybrid representation that combines the strengths of both approaches: integrating flow metadata with selective packet-level information.
B. Custom Bidirectional Flow Design
We introduce custom bidirectional flows to provide a comprehensive representation of network behaviors. Each custom flow aggregates metadata (key header fields), statistical summaries, packet timestamps and directions, and selected payload bytes from the first few packets in the flow.
A custom flow is uniquely identified by a five-tuple (device IP address, remote IP address, protocol, device port, and remote port) and has a fixed lifetime of one minute.
Long-running connections, such as persistent IoT–cloud sessions, are segmented into multiple consecutive flows to maintain computational efficiency and ensure consistent temporal granularity.
When a flow exceeds one minute, a new flow record is created with the same five-tuple.
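The one-minute segmentation can be sketched as follows. This is a minimal illustration rather than the pipeline used to build the dataset: the `Packet` tuple, the `segment_flows` helper, and the rule that a new record opens once a packet arrives 60 seconds or more after the record's first packet are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical packet record; the field names are illustrative only.
Packet = namedtuple("Packet", "ts dev_ip rem_ip proto dev_port rem_port size")

FLOW_LIFETIME_S = 60.0  # fixed one-minute flow lifetime


def segment_flows(packets):
    """Group packets into flow records keyed by five-tuple, each at most one minute long.

    Returns a list of (five_tuple, first_seen, [packets]) in order of creation.
    """
    records = []  # all flow records, in order of creation
    active = {}   # five-tuple -> index of the currently open record
    for pkt in sorted(packets, key=lambda p: p.ts):
        key = (pkt.dev_ip, pkt.rem_ip, pkt.proto, pkt.dev_port, pkt.rem_port)
        idx = active.get(key)
        # Assumed rule: open a new record when none exists or the open one has expired.
        if idx is None or pkt.ts - records[idx][1] >= FLOW_LIFETIME_S:
            records.append((key, pkt.ts, []))
            idx = len(records) - 1
            active[key] = idx
        records[idx][2].append(pkt)
    return records


# A 90-second session is split into two records that share the same five-tuple.
demo = [Packet(t, "192.168.1.10", "203.0.113.5", "TCP", 50000, 443, 100)
        for t in (0.0, 30.0, 59.9, 61.0, 90.0)]
flows = segment_flows(demo)
```

Here `flows` holds two records: the first with the three packets seen before the 60-second mark, the second starting at the packet with timestamp 61.0.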
C. Terminology and Direction Handling
To maintain generality, we replace the conventional client and server terminology with device and remote.
This choice reflects the challenges of determining client–server roles in real time, particularly in UDP communications where connection states are ambiguous.
Even in TCP flows, identifying session initiation and termination can be unreliable due to missing SYN/FIN packets.
For device-to-device communication within the monitored network, we generate two mirrored flow records, each reflecting the perspective of one device endpoint.
Outgoing packets from one device correspond to incoming packets for the other, ensuring consistent bidirectional representation.
For device-to-cloud traffic, only a single bidirectional flow is recorded and attributed to the local device.
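As a rough sketch of the mirroring used for device-to-device traffic, the second record can be derived from the first by swapping the endpoints and flipping every direction flag. The dictionary layout and the `mirror_flow` helper are hypothetical and do not reflect the dataset's column names.

```python
def mirror_flow(flow):
    """Return the same flow as seen from the other endpoint (hypothetical record layout)."""
    return {
        "device_ip": flow["remote_ip"],
        "remote_ip": flow["device_ip"],
        "proto": flow["proto"],
        "device_port": flow["remote_port"],
        "remote_port": flow["device_port"],
        # 1 = device -> remote, 0 = remote -> device, so every flag flips.
        "directions": [1 - d for d in flow["directions"]],
    }


flow_a = {
    "device_ip": "192.168.1.10",
    "remote_ip": "192.168.1.20",
    "proto": "UDP",
    "device_port": 5353,
    "remote_port": 49152,
    "directions": [1, 0, 1],
}
flow_b = mirror_flow(flow_a)  # the record attributed to 192.168.1.20
```

Mirroring twice returns the original record, which is a quick sanity check that the two views stay consistent.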
D. Flow Metadata and Generalization
While TCP headers can exhibit device-specific characteristics, we intentionally exclude them to preserve generalizability.
Our design prioritizes capturing discriminative information from the transport-layer payload, which often embeds unique application-layer behaviors revealing the signatures of device manufacturer or functionality.
Each custom flow includes the following metadata:
- timestamp (µs): Unix timestamp of the first packet
- remote IPv4 address
- protocol (transport layer)
- device-side port and remote-side port
- total byte count and total packet count
E. Fine-Grained Packet-Level Features
Capturing Behavioral Fingerprints in Network Flows
To capture behavioral fingerprints within each flow, we extract fine-grained information from the first i packets of the flow.
Since both the total number of packets and their payload sizes can vary significantly, analyzing every packet is impractical.
Empirical studies suggest that most distinguishing features appear within the early bytes of the initial packets.
Accordingly, for each of the first i packets, we record:
- Time offset from the flow’s first-seen timestamp
- Total packet size
- Direction flag: 1 = device → remote, 0 = remote → device
- Up to j bytes of the transport-layer payload
The parameters i and j are configurable based on system resources and network conditions.
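The per-packet metadata listed above could be computed along these lines. The input format (tuples of timestamp in microseconds, packet size, and a flag for packets sent by the device) and the `packet_metadata` helper are assumptions for illustration; how the dataset fills the slots of flows with fewer than i packets is not specified here, so the sketch simply returns fewer rows.

```python
I_PACKETS = 10  # i: number of leading packets described per flow


def packet_metadata(packets, first_seen, i=I_PACKETS):
    """Describe the first i packets of a flow.

    packets: iterable of (timestamp_us, size, from_device) tuples in arrival order.
    Returns up to i tuples (time_offset_us, size, direction), where
    direction is 1 for device -> remote and 0 for remote -> device.
    """
    rows = []
    for ts, size, from_device in list(packets)[:i]:
        rows.append((ts - first_seen, size, 1 if from_device else 0))
    return rows


pkts = [(1_000_000, 74, True), (1_020_000, 60, False)]
meta = packet_metadata(pkts, first_seen=1_000_000)
```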
In our dataset:
- 92% of 1-minute flows contain ≤ 10 packets
- 96% of packets are ≤ 1,000 bytes
Therefore, we set i = 10 and j = 1,000.
If a packet’s payload is smaller than 1,000 bytes, the remaining bytes are filled with NaN.
Without appropriate data pruning or compaction, this would introduce a substantial number of NaN values, potentially misleading pattern recognition algorithms.
Given the wide variability in payload sizes, a fixed-size representation per packet is suboptimal for our custom flows.
To address this, we introduced delimiters that mark the start and end of each packet’s payload (whether empty or not).
This method prevents the unnecessary insertion of NaN values and maintains a compact, structured payload representation.
We further limit the total payload section of each flow to B bytes (B = 3,000 in our experiment), ensuring the capture of at least three packets that are likely to contain meaningful patterns.
If the total payload size (including delimiters) is less than B, we pad the remaining bytes with a negative value of -255, avoiding overlap with any actual payload values.
This compact representation reduces the overall custom flow size by nearly 70% compared to a fixed-size per-packet encoding.
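A back-of-the-envelope check of this figure compares the number of values per record under each encoding. The field counts used here (7 flow-metadata fields and 3 metadata fields for each of 10 packets) are assumptions read off the description, not numbers reported by the authors.

```python
FLOW_META = 7      # assumed number of flow-level metadata fields
PKT_META = 3 * 10  # time offset, size, direction for i = 10 packets

fixed = FLOW_META + PKT_META + 10 * 1000  # j = 1,000 payload values for each of 10 packets
compact = FLOW_META + PKT_META + 3000     # single payload section of B = 3,000 values

reduction = 1 - compact / fixed
print(f"fixed={fixed}, compact={compact}, reduction={reduction:.1%}")
```

Under these assumptions the compact record has 3,037 values against 10,037 for the fixed-size layout, a reduction of roughly 69.7%, which is consistent with "nearly 70%".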
The delimiters are chosen as two distinct negative values within the range -1 to -254 to prevent collisions with payload or padding values.
In our implementation, we use -4 and -8 as the delimiters for the start and end of each packet’s payload, represented as <P_start> and <P_end> in the custom flow.
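The delimiter-and-padding scheme can be sketched as below. The constants follow the values stated in the text; the `encode_payloads` helper is hypothetical, and its handling of the B-byte limit (stopping at the first packet that no longer fits whole) is an assumption, since the exact truncation rule is not described.

```python
P_START, P_END, PAD = -4, -8, -255  # delimiter and padding values stated in the text
B = 3000                            # length of the payload section
J = 1000                            # maximum payload bytes kept per packet


def encode_payloads(payloads, b=B, j=J):
    """Flatten per-packet payloads (bytes objects) into one list of exactly b integers."""
    out = []
    for payload in payloads:
        # Every packet, even an empty one, contributes a start and an end delimiter.
        chunk = [P_START, *payload[:j], P_END]
        if len(out) + len(chunk) > b:
            break  # assumption: drop packets that do not fit whole
        out.extend(chunk)
    out.extend([PAD] * (b - len(out)))  # fill the remainder with the padding value
    return out


# An empty first payload followed by a three-byte second payload.
encoded = encode_payloads([b"", b"\x16\x03\x01"])
```

The first seven values of `encoded` are `-4, -8, -4, 22, 3, 1, -8` and the rest are `-255`, matching the layout "P_start, P_end, P_start, payload bytes, P_end, Pad".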
F. Payload Considerations
We make no assumptions about payload contents, whether encrypted, encoded, or plaintext.
Including payload bytes in custom flows enables learning models to discover patterns that characterize IoT device behaviors, even in encrypted traffic.
G. Illustration of Custom Flow Structure
```
┌────────────────────────────────────────────────────────────────────────┐
│                              Custom Flow                               │
├───────────┬──────────┬──────────┬─────┬─────────┬──────────────────────┤
│ Flow Meta │ P01_Meta │ P02_Meta │ ... │ Pi_Meta │ B bytes flow payload │
└─┬─────────┴────┬─────┴──────────┴─────┴─────────┴──────────┬───────────┘
  │              │                                           │
  │              ▼                                           │
  │   ┌────────────────────────────────────┐                 │
  │   │          PACKET METADATA           │                 │
  │   ├─────────────┬──────────┬───────────┤                 │
  │   │ Time Offset │ Pkt Size │ Direction │                 │
  │   └─────────────┴──────────┴───────────┘                 │
  │                                                          ▼
  │   ┌─────────────────────────────────────────────────────────────────────┐
  │   │                            FLOW PAYLOAD                             │
  │   ├─────────┬───────┬─────────┬─────────┬─────────┬───────┬───────┬─────┤
  │   │ P_start │ P_end │ P_start │ P2_B001 │ P2_B... │ P2_Bn │ P_end │ Pad │
  │   └─────────┴───────┴─────────┴─────────┴─────────┴───────┴───────┴─────┘
  ▼
┌─────────────────────────────────────────────────────────────────┐
│                          FLOW METADATA                          │
├───────────┬──────┬────────┬────────┬──────────┬───────┬─────────┤
│ Relative  │ IPv4 │ Remote │ Device │ Protocol │ Total │ Total   │
│ Timestamp │      │ Port   │ Port   │          │ Bytes │ Packets │
└───────────┴──────┴────────┴────────┴──────────┴───────┴─────────┘
```
Figure 1: Structure of each custom flow record.
H. Dataset Description
The data was constructed by analyzing a public dataset of PCAP traces from UNSW IoT Analytics, collected by researchers at UNSW Sydney. It contains 60 days of traffic from 22 consumer IoT device types, including cameras, lightbulbs, power plugs, sensors, appliances, and health monitors.
After processing, we extracted over 5.9 million custom flow records.
Table I summarizes the number of flows per device type.
The activity levels vary widely between devices. For instance, Amazon Echo, Insteon camera, and Belkin motion sensor generate significantly more flows than low-activity devices like the Withings scale.
Table I. Summary of Data Records
| Device Make-and-model | Number of Custom Flow Records |
|---|---|
| Amazon Echo | 777K |
| Belkin motion sensor | 971K |
| Belkin power switch | 642K |
| Dropcam | 172K |
| HP printer | 239K |
| iHome power plug | 17K |
| Insteon camera | 1,096K |
| LiFX lightbulb | 191K |
| NEST Protect | 1K |
| Netatmo camera | 308K |
| Netatmo weather | 77K |
| PIX-STAR photoframe | 24K |
| Samsung camera | 621K |
| Smart Things | 196K |
| TP-Link camera | 56K |
| TP-Link power plug | 18K |
| Triby speaker | 164K |
| Withings baby monitor | 63K |
| Withings scale | 1K |
| Withings sleep sensor | 96K |
| IT (Android tablet) | 235K |
Files and variables
File: bidirectional.tar.gz
bidirectional/
├── 16-09-23.parquet
├── 16-09-24.parquet
├── 16-09-25.parquet
│ ...
├── 16-11-20.parquet
├── 16-11-21.parquet
└── 16-11-22.parquet
File: unidirectional.tar.gz
unidirectional/
├── 16-09-23.parquet
├── 16-09-24.parquet
├── 16-09-25.parquet
│ ...
├── 16-11-20.parquet
├── 16-11-21.parquet
└── 16-11-22.parquet
Code/software
Parsing Parquet Data Files
This section explains how to analyze the customFlow data provided in Parquet format within this repository.
Prerequisites
Before proceeding, ensure you have the required Python modules installed. The example below uses pandas with the pyarrow engine, both of which can be installed using pip: `pip install pandas pyarrow`
Reading Parquet Files
Once the Bidirectional and Unidirectional customFlow datasets are extracted, you can parse them in Python as follows:
```python
import pandas as pd
import os

# Define the flow type and file name
flow_type = 'bidirectional'
file_name = '16-09-23.parquet'

# Construct the file path and load the Parquet file
df = pd.read_parquet(os.path.join(flow_type, file_name), engine='pyarrow')

# Display the first few rows of the dataset
print(df.head())
```
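To work with several days at once, the daily files can be listed and concatenated. The sketch below is one possible approach, not part of the published tooling: `daily_files` and `load_days` are hypothetical helpers, and the reading of file names as YY-MM-DD dates is an assumption based on names such as 16-09-23.parquet.

```python
from datetime import datetime
from pathlib import Path


def daily_files(root, flow_type="bidirectional"):
    """Return (date, path) pairs for every daily Parquet file, oldest first."""
    pairs = []
    for path in Path(root, flow_type).glob("*.parquet"):
        # Assumption: file names are dates in YY-MM-DD form, e.g. 16-09-23.parquet.
        day = datetime.strptime(path.stem, "%y-%m-%d").date()
        pairs.append((day, path))
    return sorted(pairs)


def load_days(root, flow_type="bidirectional", limit=None):
    """Concatenate the first `limit` daily files (all of them if None) into one DataFrame."""
    import pandas as pd  # imported here so daily_files() works without pandas installed

    paths = [p for _, p in daily_files(root, flow_type)][:limit]
    frames = [pd.read_parquet(p, engine="pyarrow") for p in paths]
    return pd.concat(frames, ignore_index=True)
```

For example, `load_days(".", "unidirectional", limit=7)` would load the first week of unidirectional flows from the current directory. Loading all days at once may need a large amount of memory, so starting with a small `limit` is a reasonable precaution.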
Data Representation
The Parquet files contain the following columns:
- Device - Device MAC Address
- FirstSeen - Flow timestamp
- RemIP - Remote IP
- Proto - Transport layer Protocol
- DevPort - Device side port number of flow
- RemPort - Remote side port number of flow
- TotalFlowSize - Total byte count of the flow
- PacketCount - Total packet count of the flow
- P00_TO to P10_TO - Time offset of first 10 packets
- P00_PS to P10_PS - Packet size of first 10 packets
- P00_D to P10_D - Direction of first 10 packets
- C_000 to C_2999 - Payload of up to 10 packets. We use -4 and -8 as delimiters for the start and end of each packet's payload
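The payload columns of a record can be turned back into per-packet byte strings by scanning for the delimiters. The following is a minimal sketch: `split_payloads` is a hypothetical helper, the padding value -255 is taken from the description above, and the sketch assumes the payload columns hold integers.

```python
P_START, P_END, PAD = -4, -8, -255  # start delimiter, end delimiter, padding


def split_payloads(values):
    """Split a flat payload section back into a list of per-packet byte strings."""
    packets, current = [], None
    for v in values:
        v = int(v)
        if v == PAD:
            break  # padding marks the end of the used payload section
        if v == P_START:
            current = []  # begin collecting a new packet
        elif v == P_END:
            if current is not None:
                packets.append(bytes(current))
            current = None
        elif current is not None:
            current.append(v)
    # A packet with no end delimiter is discarded in this sketch.
    return packets


row = [-4, -8, -4, 22, 3, 1, -8] + [-255] * 10
payloads = split_payloads(row)
```

With a loaded DataFrame `df`, the payload columns could be selected with `payload_cols = [c for c in df.columns if c.startswith("C_")]` and the first record decoded with `split_payloads(df.loc[0, payload_cols])`. Selecting by prefix avoids hard-coding the exact column names.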
Device List
| Device | MAC Address |
|---|---|
| Amazon Echo | 44:65:0d:56:cc:d3 |
| Belkin motion sensor | ec:1a:59:83:28:11 |
| Belkin power switch | ec:1a:59:79:f4:89 |
| Blipcare Blood Pressure meter | 74:6a:89:00:2e:25 |
| Dropcam | 30:8c:fb:2f:e4:b2 |
| Dropcam | 30:8c:fb:b6:ea:45 |
| HP printer | 70:5a:0f:e4:9b:c0 |
| iHome power plug | 74:c6:3b:29:d7:1d |
| Insteon camera | 00:62:6e:51:27:2e |
| Insteon camera | e8:ab:fa:19:de:4f |
| LiFX lightbulb | d0:73:d5:01:83:08 |
| NEST Protect | 18:b4:30:25:be:e4 |
| Netatmo camera | 70:ee:50:18:34:43 |
| Netatmo weather | 70:ee:50:03:b8:ac |
| PIX-STAR photoframe | e0:76:d0:33:bb:85 |
| Samsung camera | 00:16:6c:ab:6b:88 |
| Smart Things | d0:52:a8:00:67:5e |
| TP-Link camera | f4:f2:6d:93:51:f1 |
| TP-Link power plug | 50:c7:bf:00:56:39 |
| Triby speaker | 18:b7:9e:02:20:44 |
| Withings baby monitor | 00:24:e4:11:18:a8 |
| Withings scale | 00:24:e4:1b:6f:96 |
| Withings sleep sensor | 00:24:e4:20:28:c6 |
| IT (Android tablet) | 08:21:ef:3b:fc:e3 |
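The table above can be used to label flows by device. The sketch below copies the table into a dictionary; `device_name` and `flows_per_device` are hypothetical helpers, and it is an assumption that the Device column stores MAC addresses in the same colon-separated form (case is normalized before lookup).

```python
from collections import Counter

MAC_TO_DEVICE = {
    "44:65:0d:56:cc:d3": "Amazon Echo",
    "ec:1a:59:83:28:11": "Belkin motion sensor",
    "ec:1a:59:79:f4:89": "Belkin power switch",
    "74:6a:89:00:2e:25": "Blipcare Blood Pressure meter",
    "30:8c:fb:2f:e4:b2": "Dropcam",
    "30:8c:fb:b6:ea:45": "Dropcam",
    "70:5a:0f:e4:9b:c0": "HP printer",
    "74:c6:3b:29:d7:1d": "iHome power plug",
    "00:62:6e:51:27:2e": "Insteon camera",
    "e8:ab:fa:19:de:4f": "Insteon camera",
    "d0:73:d5:01:83:08": "LiFX lightbulb",
    "18:b4:30:25:be:e4": "NEST Protect",
    "70:ee:50:18:34:43": "Netatmo camera",
    "70:ee:50:03:b8:ac": "Netatmo weather",
    "e0:76:d0:33:bb:85": "PIX-STAR photoframe",
    "00:16:6c:ab:6b:88": "Samsung camera",
    "d0:52:a8:00:67:5e": "Smart Things",
    "f4:f2:6d:93:51:f1": "TP-Link camera",
    "50:c7:bf:00:56:39": "TP-Link power plug",
    "18:b7:9e:02:20:44": "Triby speaker",
    "00:24:e4:11:18:a8": "Withings baby monitor",
    "00:24:e4:1b:6f:96": "Withings scale",
    "00:24:e4:20:28:c6": "Withings sleep sensor",
    "08:21:ef:3b:fc:e3": "IT (Android tablet)",
}


def device_name(mac):
    """Look up a device label; unknown addresses map to 'unknown'."""
    return MAC_TO_DEVICE.get(str(mac).strip().lower(), "unknown")


def flows_per_device(macs):
    """Count flow records per device label from an iterable of MAC addresses."""
    return Counter(device_name(m) for m in macs)
```

With a loaded DataFrame `df`, `df["Device"].map(device_name)` would add human-readable labels, and `flows_per_device(df["Device"])` would give per-device record counts for that day.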
