Note: Chapters in the YouTube progress bar denote activity labels.
The activities highlight a wide range of manipulations and action sequences, to explore both lower-level control tasks and higher-level reasoning tasks.
Peeling and slicing are dexterous activities requiring in-hand manipulation with dynamic grasps, coordination between hands, and tool use. They are well-suited to a multimodal dataset since aspects such as hand pose, motion paths, force, and attention are all critical to successful completion. Slicing is repeated with cucumbers, potatoes, and bread, while peeling is repeated with cucumbers and potatoes; the differing hardness and shapes of these items call for different forces, techniques, and even tool selection. Both experts and novices can perform these activities, but with different techniques and efficiencies. In addition to these low-level motion and high-level reasoning aspects, the tasks are also interesting for computer vision pipelines since the objects change appearance and subdivide.
Spreading almond butter or jelly on bread uses a knife in a different way. It involves two-handed coordination, varying object appearances, and motions that are repetitive while adapting to the task and object. The consistencies of almond butter and jelly also lead to different techniques.
Opening and closing a jar are simpler manipulations but still require precise coordination and subtle motions. Tactile forces and muscle activity are also key components of these operations.
Wiping pans or plates with towels or sponges always aims to clean a flat surface, but the approach can vary considerably. For example, large or small circular or linear periodic motions may all accomplish the goal. The amount of force applied throughout the motion is also a key component. Whether a person, or ultimately a robot, chooses a particular strategy may depend on preference or the object state.
Pouring water can be informative for prediction or classification pipelines by introducing a transparent liquid that can be hard to model, manipulate, or detect. Each object also continuously changes weight.
High-level tableware tasks such as setting a table or loading/unloading the dishwasher introduce more task reasoning. They combine longer sequences of dexterous manipulations with abstracted planning, catering to pipelines that focus on motion primitives as well as action sequence prediction.
Joseph DelPreto, Chao Liu, Yiyue Luo, Michael Foshey, Yunzhu Li, Antonio Torralba, Wojciech Matusik, and Daniela Rus, "ActionNet: A Multimodal Dataset for Human Activities Using Wearable Sensors in a Kitchen Environment," Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (submitted), 2022.
ActionNet is offered under a CC BY-NC-SA 4.0 license. You are free to use, copy, and redistribute the material for non-commercial purposes provided you give appropriate credit, provide a link to the license, and indicate if changes were made. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. You may not use the material for commercial purposes.
The dataset and code are made available for research purposes. Anticipated use cases include extracting insights about how humans perform common tasks, analyzing how various sensing modalities relate to each other, analyzing how various sensing modalities relate to specific tasks, and training learning pipelines that can help teach robots to assist or autonomously perform tasks of daily living.
While subjects consented to having their video and audio included in a public dataset, no attempts should be made to actively identify the subjects included in the dataset. The data should also not be modified or augmented in a way that further exposes the subjects' identities.
When using the dataset, societal and ethical implications should be carefully considered. These include safety, privacy, bias, and long-term impact on society. If using the data to train robot assistants, the immediate safety of any nearby subjects should be carefully considered. In addition, if the new pipelines use personally identifiable sensors similar to those in ActionNet, then the privacy of any new subjects should be preserved as much as possible and clearly described to the subjects; this includes how the new learning pipelines store and process any video or audio data.
In general, ActionNet is intended to be a tool for developing the next generation of wearable sensing and robot assistants for the betterment of society. Endeavors using its data or framework should consider the long-term implications of the application. For example, robot assistants have the potential to improve quality of life and mitigate unsafe working conditions, but they can also result in job displacement that could negatively impact people, especially in the short term. How a new robot assistant balances these aspects should be carefully considered before embarking on a novel learning pipeline. In addition, ActionNet and subsequent expansions or reproductions may contain biased data along dimensions such as subject backgrounds, experience, demographics, and hand or eye dominance. This could lead to unanticipated consequences for learning pipelines based on the data. Information about the subject pool is provided along with the dataset, and this should be taken into account when scoping a new project based on the provided data.
The authors declare that they bear all responsibility in case of any violation of rights during the collection of the data, and will take appropriate action when needed, e.g. to remove data with such issues.