Activity recognition from in-the-wild smartwatches (ArWISE)
Data files
Mar 21, 2025 version files 101.20 GB
c01.zip: 511.66 MB
c02-part1.zip: 8.18 GB
c02-part2.zip: 8.27 GB
c02-part3.zip: 9.29 GB
c03-part1.zip: 8.09 GB
c03-part2.zip: 7.42 GB
c04.zip: 6.43 GB
c05.zip: 2.69 GB
c06.zip: 2.92 GB
c07.zip: 1.86 GB
c08.zip: 30.79 MB
c09.zip: 2.15 GB
c10.zip: 3.03 GB
c11.zip: 421.06 MB
c12.zip: 3.90 GB
c13.zip: 1.77 GB
c14.zip: 325.42 MB
c15.zip: 58.76 MB
c16.zip: 710.88 MB
c17.zip: 6.66 GB
c18.zip: 4.88 GB
c19.zip: 4.85 GB
c20-part1.zip: 7.92 GB
c20-part2.zip: 8.84 GB
README.md: 4.44 KB
Abstract
The Activity recognition from in-the-WIld SmartwatchEs (ArWISE) dataset is based on sensor data and activity labels collected from smartwatches as part of several studies, for a total of 854 participants across 20 cohorts. The sensor data consist of 10 Hz accelerometer, gyroscope, and location readings that have been processed into anonymized features computed from one-minute windows of data: local time, date, and day of week; mean and standard deviation of yaw, pitch, roll, x/y/z/total rotation rate, x/y/z/total acceleration, speed, course, distance from home, and bearing from home. The activity label is one of eat, errands, exercise, hobby, housework, hygiene, relax, sleep, socialize, travel, work, other. There are 470M data points in total, of which 37M are labeled.
https://doi.org/10.5061/dryad.jdfn2z3nm
Description of the data and file structure
CSV files for each participant’s data are organized into zip files by cohort (c01-c20). Cohorts c02, c03, and c20 are split into multiple parts to adhere to the 10 GB limit per file. For cohorts c03 and c05, participants alternated between two watches (day and night); these are split across two files with w1 and w2 designations (note that file c03.p053.w2.csv is missing; no data was collected for that watch). The CSV files all include a header with feature names. All features are floats, except for the two timestamps that define the start and end times of the window used to generate each point, and the activity label.
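The file naming can be handled programmatically. The sketch below assumes every per-participant file follows the pattern of the one example given above, c03.p053.w2.csv (cohort, participant, optional watch designation). The function name, the regular expression, and the form assumed for single-watch cohorts (cNN.pNNN.csv) are illustrative guesses and should be checked against the extracted archives.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from the single example "c03.p053.w2.csv".
_NAME_RE = re.compile(r"^c(\d{2})\.p(\d+)(?:\.w([12]))?\.csv$")


class ParticipantFile(NamedTuple):
    cohort: int
    participant: int
    watch: Optional[int]  # 1 or 2 for two-watch cohorts, otherwise None


def parse_participant_filename(name: str) -> ParticipantFile:
    """Split a per-participant CSV name into cohort, participant, and watch."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    cohort, participant, watch = match.groups()
    return ParticipantFile(int(cohort), int(participant), int(watch) if watch else None)
```

Names that do not match raise a ValueError, so unexpected files in an archive are reported instead of being silently skipped.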
Features
stamp_start, stamp_end
Timestamps indicating the start and end of the time window used for computing features. The timestamps follow the ISO 8601 standard, but without the timezone. Times are in the user’s local time.
yaw_mean, pitch_mean, roll_mean
Mean of yaw/pitch/roll angle across window in radians.
rotation_rate_x_mean, rotation_rate_y_mean, rotation_rate_z_mean
Mean of rotation rate around x/y/z axis across window in radians per second.
user_acceleration_x_mean, user_acceleration_y_mean, user_acceleration_z_mean
Mean of user acceleration (gravity removed) along x/y/z axis across window in meters per second^2.
yaw_std, pitch_std, roll_std
Standard deviation of yaw/pitch/roll angle across window in radians.
rotation_rate_x_std, rotation_rate_y_std, rotation_rate_z_std
Standard deviation of rotation rate around x/y/z axis across window in radians per second.
user_acceleration_x_std, user_acceleration_y_std, user_acceleration_z_std
Standard deviation of user acceleration (gravity removed) along x/y/z axis across window in meters per second^2.
rotation_magnitude_mean
Mean of magnitude of rotation rate across window in radians per second.
acceleration_magnitude_mean
Mean of magnitude of user acceleration across window in meters per second^2.
rotation_magnitude_std
Standard deviation of magnitude of rotation rate across window in radians per second.
acceleration_magnitude_std
Standard deviation of magnitude of user acceleration across window in meters per second^2.
speed_mean, speed_std
Mean and standard deviation of speed across window in meters per second. Speed value of -1 indicates GPS speed unavailable.
course_mode, course_std
Mode and standard deviation of course across window in degrees. Course value of -1 indicates GPS course unavailable.
distance_from_home_mean, distance_from_home_std
Mean and standard deviation across window of Haversine distance from home in meters.
distance_from_home_latitude_mean, distance_from_home_latitude_std
Mean and standard deviation across window of distance from home along latitude in degrees.
distance_from_home_longitude_mean, distance_from_home_longitude_std
Mean and standard deviation across window of distance from home along longitude in degrees.
bearing_from_home_mode, bearing_from_home_std
Mode and standard deviation across window of bearing from home in degrees.
time_of_day_radians
Time of day in radians of end of window expressed as 2 * pi * (seconds since midnight) / (seconds in day).
time_of_day_sin
Equals sin(time_of_day_radians).
time_of_day_cos
Equals cos(time_of_day_radians).
day_of_week
Numeric equivalent of day of week for end of window, where Monday=0, Tuesday=1, …, Sunday=6.
activity_label
Activity occurring during time window. One of 12 possible string values: Eat, Errands, Exercise, Hobby, Housework, Hygiene, Relax, Sleep, Socialize, Travel, Work, Other. Missing value indicates no ground truth label available for the time window.
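The four time features can be reproduced from stamp_end alone. The sketch below is illustrative, not the authors' code: the function name is hypothetical, time_of_day_cos is taken to be the cosine as its name indicates, and the timestamp string format (for example 2025-03-21T12:00:00) is an assumption based on the ISO 8601 description above.

```python
import math
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60


def time_features(stamp_end: str) -> dict:
    """Derive the time-of-day and day-of-week features from a window's end time."""
    end = datetime.fromisoformat(stamp_end)  # ISO 8601 without timezone (local time)
    seconds = end.hour * 3600 + end.minute * 60 + end.second + end.microsecond / 1e6
    radians = 2 * math.pi * seconds / SECONDS_PER_DAY
    return {
        "time_of_day_radians": radians,
        "time_of_day_sin": math.sin(radians),
        "time_of_day_cos": math.cos(radians),
        "day_of_week": end.weekday(),  # Monday=0, ..., Sunday=6
    }
```

Python's datetime.weekday() already uses the Monday=0 convention described for day_of_week, so no remapping is needed.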
Missing data
Individual feature values may be missing due to sensor limitations or because the user provided no ground truth activity label. Missing values are indicated by empty strings in the CSV file. Empty strings are used to be consistent with code that uses the Python pandas library, whose default behavior is to read (read_csv) and write (to_csv) empty strings to indicate missing data.
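For readers not using pandas, the same convention can be handled with the Python standard library. The sketch below is a minimal illustration: the function name is hypothetical, and the sample rows are made up and contain only four of the columns.

```python
import csv
import io

# Columns that are not floats, per the file description above.
NON_FLOAT_COLUMNS = {"stamp_start", "stamp_end", "activity_label"}


def read_rows(handle):
    """Yield one dict per data point, mapping empty strings to None."""
    for row in csv.DictReader(handle):
        parsed = {}
        for name, value in row.items():
            if not value:
                parsed[name] = None        # missing value (empty string)
            elif name in NON_FLOAT_COLUMNS:
                parsed[name] = value       # keep timestamps and labels as text
            else:
                parsed[name] = float(value)
        yield parsed


# Made-up sample for illustration only; real files have many more columns.
sample = io.StringIO(
    "stamp_start,stamp_end,speed_mean,activity_label\n"
    "2025-03-21T12:00:00,2025-03-21T12:01:00,1.5,Relax\n"
    "2025-03-21T12:01:00,2025-03-21T12:02:00,,\n"
)
rows = list(read_rows(sample))
```

In the second sample row, both speed_mean and activity_label come back as None, which corresponds to the NaN values pandas would produce.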
Code/software
Code used to generate the dataset and the train/test datasets is available at https://github.com/WSU-CASAS/ArWISE.
We introduce ArWISE (Activity recognition from in-the-Wild SmartwatchEs), a dataset containing labeled and unlabeled data collected by Apple Watches. ArWISE represents readings collected from 20 studies in 2 countries over 8 years.
Data Collection
Data collection followed a consistent protocol for each study. Participants were given an Apple Watch to wear each day on their non-dominant arm. While they wore the watch, a custom app collected 3D accelerometer and gyroscope readings at 10 Hz. Additionally, the app collected the person’s location every minute or when the magnitude of the acceleration vector exceeded a threshold.
At random times throughout each day, the smartwatch prompted the participant to select, from a scroll-down list, the activity that best described what they were currently doing. The distribution of user-provided labels across the 12 activity categories is Eat (6.5%), Errands (3.7%), Exercise (4.7%), Hobby (1.1%), Housework (19.7%), Hygiene (1.9%), Other (3.1%), Relax (37.7%), Sleep (3.0%), Socialize (3.7%), Travel (5.6%), Work (9.1%). Each label was applied to the five minutes of sensor readings ending at the time of the participant’s response.
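The five-minute labeling rule can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function name is hypothetical, and treating a window as labeled when its end time falls in the interval (response - 5 min, response] is one plausible reading of the rule.

```python
from datetime import datetime, timedelta

LABEL_SPAN = timedelta(minutes=5)


def label_windows(window_ends, response_time, label):
    """Map each window end time within five minutes before the response to the label."""
    start = response_time - LABEL_SPAN
    return {end: label for end in window_ends if start < end <= response_time}


# One-minute windows ending at 12:00 through 12:10; the participant answers at 12:07.
ends = [datetime(2025, 3, 21, 12, minute) for minute in range(0, 11)]
labeled = label_windows(ends, datetime(2025, 3, 21, 12, 7), "Eat")
```

With one-minute windows, a single response labels five data points here (the windows ending at 12:03 through 12:07).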
Additionally, an external annotator provided labels for a much greater density of data collected for cohorts 7 and 18. This person used a tool that visualized 3D movement data, a map of visited locations, and time stamps, at arbitrary time frames.
While the data collection mechanism was the same for all study cohorts, other parameters varied. These include the number of participants, participant demographics, length of data collection, and other clinical variables that were collected. A summary of study cohort parameters is given in Table 1, where HOA=healthy older adult, SCD=subjective cognitive decline, and MCI=mild cognitive impairment.
Table 1. Summary of study cohorts.
Cohort | Sample | Study/participant characteristics |
1 | 4 | Younger adults, self-reported activities |
2 | 185 | HOA/SCD/MCI, English and Spanish self-reported activities |
3 | 56 | Younger adults, no activity labels |
4 | 46 | HOA/SCD/MCI, self-reported activities |
5 | 10 | Older adult pairs, no activity labels |
6 | 35 | HOA/SCD/MCI, no activity labels |
7 | 37 | HOA/SCD/MCI, self-reported activities and expert-annotated activities |
8 | 9 | Younger adults, self-reported activities |
9 | 15 | Younger adults, self-reported activities |
10 | 13 | Younger adults, self-reported activities |
11 | 3 | Younger adults, self-reported activities |
12 | 18 | Younger adults, self-reported activities |
13 | 10 | Younger adults, self-reported activities |
14 | 22 | Younger adults, self-reported activities |
15 | 21 | HOA/SCD/MCI, no activity labels |
16 | 6 | Younger adults, self-reported activities |
17 | 103 | HOA/SCD/MCI, self-reported activities |
18 | 16 | HOA/SCD/MCI, self-reported activities and expert-annotated activities |
19 | 16 | HOA/SCD/MCI, self-reported activities |
20 | 229 | HOA/SCD/MCI, no activity labels |
Dataset Characteristics
The ArWISE dataset is unique among the resources that are typically available for human activity recognition (HAR). Some of the most-analyzed datasets reflect movement categories based on data collected in controlled settings [1], [2]. More recent wearable sensor datasets represent activities observed in uncontrolled settings. Capture-24 [3] monitors 150 participants for only 24 hours with movement-only sensors, but includes labels for functional activities (household chores, sports, and sleep) in real-world settings. ExtraSensory [4] monitors a smaller set of 60 participants with up to 20 seconds of movement and location readings but provides diverse activity and location labels. The UK Biobank [5] offers 7 days of accelerometry data for 100,000+ participants, and Intuition [6] longitudinally observes 23,004 participants, though no ground-truth labels are provided for these data.
The ArWISE dataset contains 37,578,059 labeled points from 503 participants across 15 cohorts and 469,881,358 total points for 854 participants across 20 cohorts. Each point represents one minute of data. ArWISE offers unique benefits for HAR analysis, including a large set of participants, functional activity labels, longitudinal observations, and consistency in the data collection mechanism.
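These totals can be cross-checked against Table 1. The short sketch below re-enters the sample sizes from the table (the flag marks cohorts listed with self-reported or expert-annotated activities) and computes the participant totals and the labeled share of data points; the variable names are illustrative.

```python
# Sample sizes from Table 1; True marks cohorts that have activity labels.
COHORTS = {
    1: (4, True), 2: (185, True), 3: (56, False), 4: (46, True),
    5: (10, False), 6: (35, False), 7: (37, True), 8: (9, True),
    9: (15, True), 10: (13, True), 11: (3, True), 12: (18, True),
    13: (10, True), 14: (22, True), 15: (21, False), 16: (6, True),
    17: (103, True), 18: (16, True), 19: (16, True), 20: (229, False),
}

total_participants = sum(n for n, _ in COHORTS.values())
labeled_participants = sum(n for n, has_labels in COHORTS.values() if has_labels)
labeled_cohorts = sum(1 for _, has_labels in COHORTS.values() if has_labels)

# Share of one-minute data points that carry a label, in percent.
labeled_share = 100 * 37_578_059 / 469_881_358
```

The sums reproduce the 854 participants in 20 cohorts and the 503 participants in 15 labeled cohorts stated above, and about 8.0% of all data points are labeled.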
Data Preprocessing
Our functional activity recognition models consider both raw time series data and engineered features. Table 2 summarizes the features that are available for both cases.
Table 2. Raw and engineered features.
Type | Category | Feature |
Raw (10 Hz) | time | date and time |
Raw (10 Hz) | motion | yaw, pitch, roll, rotation rate (x,y,z), acceleration (x,y,z) |
Raw (10 Hz) | location | latitude, longitude, altitude, course, speed |
Engineered (1 min) | time | time of day (radians, sin, cos), day of week |
Engineered (1 min) | motion | mean & stdev (each raw movement variable), mean & stdev (rotation vector magnitude, acceleration vector magnitude) |
Engineered (1 min) | location | mean & stdev (course, speed); mean & stdev (distance from home, latitude distance from home, longitude distance from home); mode & stdev (bearing from home) |
Class label | activity | eat, errands, exercise, hobby, housework, hygiene, relax, sleep, socialize, travel, work, other |
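As an illustration of how an engineered motion feature relates to the raw readings, the sketch below computes the mean and standard deviation of a vector magnitude over one window. It is not the authors' code: the function name is hypothetical, and the use of the population standard deviation is an assumption, since the source does not say whether the sample or population form was used.

```python
import math
import statistics


def magnitude_features(samples):
    """Mean and standard deviation of vector magnitude over one window.

    `samples` is a sequence of (x, y, z) readings; a complete one-minute
    window at 10 Hz holds 600 of them.
    """
    magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    # Population standard deviation is assumed here.
    return statistics.mean(magnitudes), statistics.pstdev(magnitudes)
```

Applied to rotation-rate readings this corresponds to rotation_magnitude_mean and rotation_magnitude_std; applied to user-acceleration readings, to acceleration_magnitude_mean and acceleration_magnitude_std.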
We imputed missing values (with mode for location and median for other features) and dropped data points where there was not a complete minute of sensor readings leading up to the label. We also normalized each feature separately.
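A minimal sketch of these two steps for a single feature column follows. The function names are hypothetical, and the normalization is assumed to be z-score standardization, because the text does not name the method used.

```python
import statistics


def impute(values, strategy):
    """Replace None entries with the median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


def zscore(values):
    """Standardize one feature to zero mean and unit variance (assumed method)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    # A constant feature has zero spread; map it to zeros instead of dividing by zero.
    return [(v - mean) / std if std else 0.0 for v in values]
```

Following the text, the mode strategy would be used for location features and the median strategy for all other features, with each feature normalized separately.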
For the engineered features, we aggregated values over one minute leading up to the user (or expert) label. Time of day was represented as a set of sinusoidal features to maintain the periodic nature. We did not use raw location values here, to preserve user privacy and because the values do not easily generalize between individuals. Instead, we defined a person’s home as the location visited most often at the beginning of each day. We then extracted the Haversine distance and trigonometric bearing from the person’s home location.
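The two location-derived quantities can be sketched with the standard Haversine and initial-bearing formulas. This is an illustrative version under assumptions: the function names and the Earth radius constant are not from the source, and the bearing is taken clockwise from north with home as the starting point.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, degrees clockwise from north, in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlam)
    return math.degrees(math.atan2(y, x)) % 360
```

Passing the home location as point 1 and the current location as point 2 gives a distance and bearing from home. Only these derived values appear in the published dataset; raw coordinates are not included.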
References
[1] O. Napoli et al., “A benchmark for domain adaptation and generalization in smartphone-based human activity recognition,” Scientific Data, vol. 11, p. 1192, 2024.
[2] A. Reiss, D. Stricker, and G. Hendeby, “Towards robust activity recognition for everyday life: Methods and evaluation,” in Pervasive Computing Technologies for Healthcare, 2013, pp. 25–32.
[3] S. Chan et al., “CAPTURE-24: A large dataset of wrist-worn activity tracker data collected in the wild for human activity recognition,” Scientific Data, vol. 11, p. 1135, 2024.
[4] Y. Vaizman, K. Ellis, and G. Lanckriet, “Recognizing detailed human context in the wild from smartphones and smartwatches,” IEEE Pervasive Computing, vol. 16, no. 4, pp. 62–74, 2017.
[5] C. Sudlow, J. Gallacher, N. Allen, V. Beral, P. Burton, and J. Danesh, “UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age,” PLoS Medicine, vol. 12, no. 3, p. e1001779, 2015.
[6] P. M. Butler, J. Yang, R. Brown, M. Hobbs, and A. Becker, “Smartwatch- and smartphone-based remote assessment of brain health and detection of mild cognitive impairment,” Nature Medicine, 2025, doi: https://doi.org/10.1038/s41591-024-03475-9.