Note: Chapters in the YouTube progress bar denote activity labels.
The activities span a wide range of manipulations and action sequences, exploring both lower-level control tasks and higher-level reasoning tasks.
Peeling and slicing are dexterous activities requiring in-hand manipulation with dynamic grasps, coordination between hands, and tool use. They are well-suited to a multimodal dataset since aspects such as hand pose, motion paths, force, and attention are all critical to successful completion. Slicing is repeated with cucumbers, potatoes, and bread, while peeling is repeated with cucumbers and potatoes; the differing hardness and shapes call for different forces, techniques, and even tool selections. Both experts and novices can perform these tasks, but with different techniques and efficiencies. In addition to these low-level motion and high-level reasoning aspects, the tasks are also interesting for computer vision pipelines since the objects change appearance and subdivide.
Spreading almond butter or jelly on bread uses a knife in a different way. It involves two-handed coordination, varying object appearances, and motions that are repetitive yet adapted to the task and object. The differing consistencies of almond butter and jelly also lead to different techniques.
Opening and closing a jar are simpler manipulations but still require precise coordination and subtle motions. Tactile forces and muscle activity are also key components of these operations.
Wiping pans or plates with towels or sponges aims to clean a flat surface, but the approaches can vary widely. For example, large or small circular or linear periodic motions may all accomplish the goal. The amount of force applied throughout the motion is also a key component. Whether a person, or ultimately a robot, chooses a particular strategy may depend on preference or the object's state.
Pouring water can be informative for prediction or classification pipelines by introducing a transparent liquid that can be hard to model, manipulate, or detect. The containers also continuously change weight throughout the pour.
High-level tableware tasks such as setting a table or loading/unloading a dishwasher introduce more task reasoning. They combine longer sequences of dexterous manipulations with abstracted planning, catering to pipelines that focus on motion primitives as well as action sequence prediction.
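As one illustration of how the activity labels might feed such pipelines, the sketch below slices a synchronized sensor stream into per-activity windows for downstream segmentation or sequence prediction. The array layout and the (start_time, end_time, label) annotation format are assumptions made for illustration, not the dataset's actual file schema.

# Minimal sketch of slicing a synchronized sensor stream into labeled activity
# segments for tasks like action segmentation or sequence prediction.
# NOTE: the data layout here (parallel timestamp/sample arrays plus a list of
# (start_time, end_time, label) annotations) is an illustrative assumption,
# not ActionSense's actual file schema.
import numpy as np

def segment_by_activity(timestamps, samples, annotations):
    """Return {label: [sample_windows]} by clipping the stream to each annotation.

    timestamps:  (N,) array of sample times in seconds.
    samples:     (N, D) array of synchronized sensor readings.
    annotations: iterable of (start_time, end_time, label) tuples.
    """
    segments = {}
    for start_time, end_time, label in annotations:
        mask = (timestamps >= start_time) & (timestamps < end_time)
        segments.setdefault(label, []).append(samples[mask])
    return segments

# Example with synthetic data: a 2-channel stream covering two labeled activities.
t = np.linspace(0, 10, 1000)
x = np.random.randn(1000, 2)
labels = [(0.0, 4.0, 'peel_cucumber'), (4.0, 10.0, 'slice_cucumber')]
segs = segment_by_activity(t, x, labels)
print({k: [v.shape for v in vs] for k, vs in segs.items()})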
Joseph DelPreto, Chao Liu, Yiyue Luo, Michael Foshey, Yunzhu Li, Antonio Torralba, Wojciech Matusik, and Daniela Rus, "ActionSense: A Multimodal Dataset and Recording Framework for Human Activities Using Wearable Sensors in a Kitchen Environment," Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, 2022. URL: https://action-sense.csail.mit.edu
@inproceedings{delpretoLiu2022actionSense,
title={{ActionSense}: A Multimodal Dataset and Recording Framework for Human Activities Using Wearable Sensors in a Kitchen Environment},
author={Joseph DelPreto and Chao Liu and Yiyue Luo and Michael Foshey and Yunzhu Li and Antonio Torralba and Wojciech Matusik and Daniela Rus},
booktitle={Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks},
year={2022},
url={https://action-sense.csail.mit.edu},
abstract={This paper introduces ActionSense, a multimodal dataset and recording framework with an emphasis on wearable sensing in a kitchen environment. It provides rich, synchronized data streams along with ground truth data to facilitate learning pipelines that could extract insights about how humans interact with the physical world during activities of daily living, and help lead to more capable and collaborative robot assistants. The wearable sensing suite captures motion, force, and attention information; it includes eye tracking with a first-person camera, forearm muscle activity sensors, a body-tracking system using 17 inertial sensors, finger-tracking gloves, and custom tactile sensors on the hands that use a matrix of conductive threads. This is coupled with activity labels and with externally-captured data from multiple RGB cameras, a depth camera, and microphones. The specific tasks recorded in ActionSense are designed to highlight lower-level physical skills and higher-level scene reasoning or action planning. They include simple object manipulations (e.g., stacking plates), dexterous actions (e.g., peeling or cutting vegetables), and complex action sequences (e.g., setting a table or loading a dishwasher). The resulting dataset and underlying experiment framework are available at https://action-sense.csail.mit.edu. Preliminary networks and analyses explore modality subsets and cross-modal correlations. ActionSense aims to support applications including learning from demonstrations, dexterous robot control, cross-modal predictions, and fine-grained action segmentation. It could also help inform the next generation of smart textiles that may one day unobtrusively send rich data streams to in-home collaborative or autonomous robot assistants.}
}
Joseph DelPreto (delpreto@csail.mit.edu)
Chao Liu (chaoliu@csail.mit.edu)
Yiyue Luo (yiyueluo@csail.mit.edu)
Michael Foshey (mfoshey@csail.mit.edu)
Yunzhu Li (liyunzhu@csail.mit.edu)
Antonio Torralba (torralba@csail.mit.edu)
Wojciech Matusik (wojciech@csail.mit.edu)
Daniela Rus (rus@csail.mit.edu)
ActionSense is offered under a CC BY-NC-SA 4.0 license. You are free to use, copy, and redistribute the material for non-commercial purposes provided you give appropriate credit, provide a link to the license, and indicate if changes were made. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. You may not use the material for commercial purposes.
The dataset and code are made available for research purposes. Anticipated use cases include extracting insights about how humans perform common tasks, analyzing how various sensing modalities relate to each other, analyzing how various sensing modalities relate to specific tasks, and training learning pipelines that can help teach robots to assist or autonomously perform tasks of daily living.
While subjects consented to having their video and audio included in a public dataset, no attempts should be made to actively identify the subjects included in the dataset. The data should also not be modified or augmented in a way that further exposes the subjects' identities.
When using the dataset, societal and ethical implications should be carefully considered. These include safety, privacy, bias, and long-term impact on society. If the data is used to train robot assistants, the immediate safety of any nearby people should be carefully considered. In addition, if new pipelines use personally identifiable sensors similar to those in ActionSense, then the privacy of any new subjects should be protected as much as possible and the privacy practices clearly described to them; this includes how the new learning pipelines store and process any video or audio data.
In general, ActionSense is intended to be a tool for developing the next generation of wearable sensing and robot assistants for the betterment of society. Endeavors using its data or framework should consider the long-term implications of the application. For example, robot assistants have the potential to improve quality of life and mitigate unsafe working conditions, but they can also result in job displacement that could negatively impact people, especially in the short term. How a new robot assistant balances these aspects should be carefully considered before embarking on a novel learning pipeline. In addition, ActionSense and subsequent expansions or reproductions may contain biased data along dimensions such as subject backgrounds, experience, demographics, and hand or eye dominance. This could lead to unanticipated consequences for learning pipelines based on the data. Information about the subject pool is provided along with the dataset and should be taken into account when scoping a new project based on the provided data.
The authors declare that they bear all responsibility in case of any violation of rights during the collection of the data, and will take appropriate action when needed, e.g. to remove data with such issues.
This work was supported by the GIST-MIT Research Collaboration grant funded by the Gwangju Institute of Science and Technology (GIST) in 2021-2022. The authors are also grateful for the support of the Toyota Research Institute (TRI).