Goal: provide detailed recordings of humans performing everyday actions,
to enable insights about manipulation, task planning, teaching robots, and guiding the design of smart textiles

Note: Chapters in the YouTube progress bar denote activity labels.


The Sensors

Wearable Sensors

The wearable sensing suite captures motion, force, and attention information.

It includes eye tracking with a first-person camera, forearm muscle activity sensors, a body-tracking system using 17 inertial sensors, finger-tracking gloves, and custom tactile sensors on the hands that use a matrix of conductive threads.
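
To give a sense of how these streams might be accessed in practice, the sketch below loads a few of them in Python. It assumes the recordings are exported as HDF5 files with per-device groups containing data and timestamp arrays; the file name and the group and field names are illustrative placeholders rather than documented dataset keys.

    import h5py
    import numpy as np

    # Hypothetical file name and stream keys; the actual export may differ.
    with h5py.File("subject_00_recording.hdf5", "r") as f:
        # Forearm muscle activity (EMG): samples x channels, with per-sample timestamps.
        emg_data = np.array(f["emg-left"]["data"])
        emg_time = np.array(f["emg-left"]["time_s"])
        # Tactile sensing: samples x rows x columns of the conductive-thread matrix.
        tactile_data = np.array(f["tactile-glove-left"]["data"])
        tactile_time = np.array(f["tactile-glove-left"]["time_s"])
        # Body tracking from the 17 inertial sensors: samples x segments x 3 coordinates.
        body_data = np.array(f["body-tracking"]["data"])
        body_time = np.array(f["body-tracking"]["time_s"])

    print(emg_data.shape, tactile_data.shape, body_data.shape)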

Environment Sensors

The environment sensing suite captures global views of the scene; synchronizing the wearable data with this global data and with ground-truth labels facilitates learning pipelines.

It includes 5 RGB cameras, a depth camera, and 2 microphones.
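
As a concrete illustration of this synchronization, the sketch below resamples the hypothetical streams from the previous sketch onto a shared timeline and attaches activity labels. The label format, a list of (start, end, activity) intervals, and all variable names are assumptions made for illustration.

    import numpy as np

    def resample_to_timeline(time_s, data, target_time_s):
        # Linearly interpolate every channel of `data` onto the reference timestamps.
        flat = data.reshape(len(time_s), -1)
        channels = [np.interp(target_time_s, time_s, flat[:, c]) for c in range(flat.shape[1])]
        return np.stack(channels, axis=1)

    # Use one stream's timestamps (here, the EMG clock) as the reference timeline.
    target_time_s = emg_time
    tactile_aligned = resample_to_timeline(tactile_time, tactile_data, target_time_s)
    body_aligned = resample_to_timeline(body_time, body_data, target_time_s)

    # Hypothetical ground-truth annotations as (start_s, end_s, activity) intervals.
    activity_intervals = [
        (12.0, 95.5, "Peel a cucumber"),
        (101.2, 180.0, "Slice a cucumber"),
    ]

    # Assign an activity label (or "none") to every reference timestamp.
    labels = np.full(len(target_time_s), "none", dtype=object)
    for start_s, end_s, activity in activity_intervals:
        labels[(target_time_s >= start_s) & (target_time_s <= end_s)] = activity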

The Activities

The activities highlight a wide range of manipulations and action sequences, to explore both lower-level control tasks and higher-level reasoning tasks.

Peeling and slicing are dexterous activities requiring in-hand manipulation with dynamic grasps, coordination between hands, and tool use. They are well-suited to a multimodal dataset since aspects such as hand pose, motion paths, force, and attention are all critical to successful completion. Slicing is repeated with cucumbers, potatoes, and bread, while peeling is repeated with cucumbers and potatoes; the differing hardnesses and shapes lead to different forces, techniques, and even tool selection. They can also be performed by both experts and novices, but with different techniques and efficiencies. In addition to these low-level motion and high-level reasoning aspects, the tasks are also interesting for computer vision pipelines since the objects change appearance and subdivide.

Spreading almond butter or jelly on bread uses a knife in a different way. It involves two-handed coordination, varying object appearances, and motions that are repetitive while adapting to the task and object. The consistencies of almond butter and jelly also lead to different techniques.

Opening and closing a jar are simpler manipulations but still require precise coordination and subtle motions. Tactile forces and muscle activity are also key components of these operations.

Wiping pans or plates with towels or sponges always aims to clean a flat surface, but the approaches can vary considerably. For example, large or small circular or linear periodic motions may all accomplish the goal. The amount of force applied throughout the motion is also a key component. Whether a person, or ultimately a robot, chooses a particular strategy may depend on preference or the object state.

Pouring water can be informative for prediction or classification pipelines by introducing a transparent liquid that can be hard to model, manipulate, or detect. Each object also continuously changes weight.

High-level tableware tasks such as setting a table or loading/unloading the dishware introduce more task reasoning. They combine longer sequences of dexterous manipulations with abstracted planning, catering to pipelines that focus on motion primitives as well as action sequence prediction.
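
As one example of such a pipeline, the sketch below cuts the synchronized streams from the earlier sketches into fixed-length windows and fits an off-the-shelf classifier to predict the labeled activity. The window length, features, and classifier are illustrative choices rather than a prescribed method.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def make_windows(features, labels, window_len=200, step=100):
        # Slide a fixed-length window over the samples-x-channels feature matrix,
        # keeping only windows that lie entirely inside a single labeled activity.
        X, y = [], []
        for start in range(0, len(features) - window_len, step):
            window = features[start:start + window_len]
            window_labels = labels[start:start + window_len]
            if len(set(window_labels)) == 1 and window_labels[0] != "none":
                # Simple per-channel statistics serve as window features.
                X.append(np.concatenate([window.mean(axis=0), window.std(axis=0)]))
                y.append(window_labels[0])
        return np.array(X), np.array(y)

    # Stack the streams (already on the shared timeline) into one feature matrix.
    features = np.hstack([emg_data.reshape(len(target_time_s), -1),
                          tactile_aligned, body_aligned])
    X, y = make_windows(features, labels)
    classifier = RandomForestClassifier(n_estimators=100).fit(X, y)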


The Kitchen

The mock kitchen is designed to be as similar to a real kitchen as possible. It has a refrigerator, an island table, cabinets, a sink, a dishwasher, a stove, and various appliances. It also has white tape markings on the floor that indicate the global coordinate system and provide known locations for the subject to stand during the relevant calibration procedures.
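
As an illustration of how such a calibration might be used, the sketch below computes a rigid transform that maps body-tracking positions into the kitchen's global frame, assuming the subject stands on a known floor marking and faces a known direction during calibration. The marking location, headings, and variable names are hypothetical.

    import numpy as np

    def kitchen_frame_transform(pelvis_xy_tracking, heading_tracking_rad,
                                marking_xy_kitchen, heading_kitchen_rad):
        # Rotation that corrects the measured heading to the known heading,
        # plus a translation that places the calibration pose on the marking.
        dtheta = heading_kitchen_rad - heading_tracking_rad
        R = np.array([[np.cos(dtheta), -np.sin(dtheta)],
                      [np.sin(dtheta),  np.cos(dtheta)]])
        t = marking_xy_kitchen - R @ pelvis_xy_tracking
        return lambda points_xy: points_xy @ R.T + t

    # Hypothetical calibration values: pose measured by the body-tracking system
    # and the known tape-marking location/orientation in the kitchen frame.
    to_kitchen = kitchen_frame_transform(
        pelvis_xy_tracking=np.array([0.12, -0.40]),
        heading_tracking_rad=np.deg2rad(3.0),
        marking_xy_kitchen=np.array([2.00, 1.50]),
        heading_kitchen_rad=np.deg2rad(0.0),
    )
    hand_xy_kitchen = to_kitchen(np.array([[0.45, -0.10], [0.50, -0.05]]))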

The subjects interact with a wide range of objects while performing common kitchen tasks. These range from object manipulation tasks (such as cutting and peeling) to complex action sequences (such as setting the table and loading the dishwasher).

Citation and Usage

Citation

Joseph DelPreto, Chao Liu, Yiyue Luo, Michael Foshey, Yunzhu Li, Antonio Torralba, Wojciech Matusik, and Daniela Rus, "ActionSense: A Multimodal Dataset and Recording Framework for Human Activities Using Wearable Sensors in a Kitchen Environment," Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, 2022. URL: https://action-sense.csail.mit.edu


Direct Links to Publication PDFs: Main Paper | Supplementary Materials
View/Download Publication on OpenReview

Contact Us

 
delpreto@csail.mit.edu
chaoliu@csail.mit.edu
yiyueluo@csail.mit.edu
mfoshey@csail.mit.edu
liyunzhu@csail.mit.edu
torralba@csail.mit.edu
wojciech@csail.mit.edu
rus@csail.mit.edu

License

ActionSense is offered under a CC BY-NC-SA 4.0 license. You are free to use, copy, and redistribute the material for non-commercial purposes provided you give appropriate credit, provide a link to the license, and indicate if changes were made. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. You may not use the material for commercial purposes.

Usage and Ethical Considerations

The dataset and code are made available for research purposes. Anticipated use cases include extracting insights about how humans perform common tasks, analyzing how various sensing modalities relate to each other, analyzing how various sensing modalities relate to specific tasks, and training learning pipelines that can help teach robots to assist or autonomously perform tasks of daily living.

While subjects consented to having their video and audio included in a public dataset, no attempts should be made to actively identify the subjects included in the dataset. The data should also not be modified or augmented in a way that further exposes the subjects' identities.

When using the dataset, societal and ethical implications should be carefully considered. These include safety, privacy, bias, and long-term impact on society. If using the data to train robot assistants, the immediate safety of any nearby subjects should be carefully considered. In addition, if new pipelines use personally identifiable sensors similar to those in ActionSense, then the privacy of any new subjects should be protected as much as possible and the data handling clearly described to those subjects; this includes how the new learning pipelines store and process any video or audio data.

In general, ActionSense is intended to be a tool for developing the next generation of wearable sensing and robot assistants for the betterment of society. Endeavors using its data or framework should consider the long-term implications of the application. For example, robot assistants have the potential to improve quality of life and mitigate unsafe working conditions, but they can also result in job displacement that could negatively impact people especially in the short term. How a new robot assistant balances these aspects should be carefully considered before embarking on a novel learning pipeline. In addition, ActionSense and subsequent expansions or reproductions may contain biased data along dimensions such as subject backgrounds, experience, demographics, and hand or eye dominance. This could lead to unanticipated consequences for learning pipelines based on the data. Information is provided about the subject pool along with the dataset, and this should be taken into account when scoping a new project based on the provided data.

The authors declare that they bear all responsibility in case of any violation of rights during the collection of the data, and will take appropriate action when needed, e.g. to remove data with such issues.


Acknowledgments

This work was supported by the GIST-MIT Research Collaboration grant funded by the Gwangju Institute of Science and Technology (GIST) in 2021-2022. The authors are also grateful for the support of the Toyota Research Institute (TRI).