University of California, Berkeley
We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework that uses in-the-wild videos to generate sensorimotor robot trajectories by lifting both the human hand and the manipulated object into a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy that captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to alternate approaches.
The objective of Hand-Object Interaction Pretraining (HOP) is to capture general hand-object interaction priors from videos. Our key intuition is that the basic skills required for manipulation lie on a manifold whose axes are well covered by unstructured human-object interactions. We learn a general manipulation prior implicitly embedded in the weights of a causal transformer, pretrained with a conditional distribution matching objective on sensorimotor robot trajectories.
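The sketch below illustrates this setup: a small causal transformer maps a sequence of sensory features to next-action predictions, and with a fixed-variance Gaussian action head the conditional distribution matching objective reduces to maximum likelihood over actions, i.e., mean-squared error. All dimensions and module names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ManipulationPrior(nn.Module):
    """Causal transformer that autoregressively predicts robot actions."""
    def __init__(self, obs_dim=512, act_dim=22, d_model=256, n_layers=4, n_heads=4, max_len=64):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, d_model)    # embed per-step sensory features
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.act_head = nn.Linear(d_model, act_dim)    # per-step next-action prediction

    def forward(self, obs_seq):
        # obs_seq: (batch, T, obs_dim) sensorimotor features from the simulator
        B, T, _ = obs_seq.shape
        x = self.obs_proj(obs_seq) + self.pos_emb(torch.arange(T, device=obs_seq.device))
        # upper-triangular -inf mask restricts attention to the past (causal)
        causal = torch.triu(torch.full((T, T), float('-inf'), device=obs_seq.device), diagonal=1)
        return self.act_head(self.backbone(x, mask=causal))  # (batch, T, act_dim)

# Matching the conditional action distribution by maximum likelihood, under a
# unit-variance Gaussian head, is equivalent to mean-squared error on actions.
def pretrain_loss(model, obs_seq, act_seq):
    return ((model(obs_seq) - act_seq) ** 2).mean()
```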
These trajectories are generated by mapping 3D hand-object interactions from a subset of 100DOH and DexYCB to the robot's embodiment via a physically grounded simulator. We extract sensorimotor information from videos by lifting the human hand and the manipulated object into a shared 3D space. We then bring these 3D representations into a physics simulator, where we map human motion to robot actions. There are several advantages to using a simulator as an intermediary between videos and robot sensorimotor trajectories: (i) we can add physics, inevitably lost in videos, back to the interactions; (ii) it enables the synthesis of large training datasets without putting the physical platform in danger; and (iii) we can add diversity to the data by randomizing the simulation environment, e.g., varying the friction of the robot's joints, the scene's layout, and the object's location relative to the robot.
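A hypothetical sketch of this loop follows. The `sim` interface, `retarget_hand_to_robot`, and all randomization ranges are placeholder names and values chosen for illustration; the paper's pipeline is conceptually similar but its implementation details differ.

```python
import numpy as np

def generate_trajectory(sim, hand_traj, rng):
    # (iii) randomize the environment before replaying the interaction
    sim.set_joint_friction(rng.uniform(0.5, 1.5))              # scale joint friction
    sim.set_object_pose(position=rng.uniform(-0.05, 0.05, 3))  # jitter object location
    sim.randomize_scene_layout(seed=rng.integers(1 << 31))

    trajectory = []
    for hand_pose in hand_traj:
        action = retarget_hand_to_robot(hand_pose)  # map human motion to a robot action
        obs = sim.step(action)                      # (i) physics is enforced by the simulator
        trajectory.append((obs, action))
    # keep only physically plausible rollouts, e.g., the object was actually moved
    return trajectory if sim.object_displaced() else None
```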
In contrast to previous work, we do not assume a strict alignment between the human's intent in the video and the downstream robot tasks. This allows us to train on a large number of videos and learn an end-to-end, task-agnostic prior. We find that finetuning this prior with either RL or BC enables fast skill acquisition.
We train a depth-based, end-to-end manipulation prior on sensorimotor trajectories extracted from human hand-object interaction videos. We find that this prior can be finetuned to downstream tasks with few demonstrations, outperforming policies trained from scratch and other visual pretraining baselines.
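As a minimal sketch, BC finetuning can reuse the pretraining objective on a handful of task demonstrations, starting from the prior's weights. This reuses `ManipulationPrior` and `pretrain_loss` from the sketch above; the checkpoint name, hyperparameters, and `demo_loader` are assumptions for illustration.

```python
import torch

model = ManipulationPrior()                                 # architecture from the sketch above
model.load_state_dict(torch.load("hop_pretrained.pt"))      # hypothetical checkpoint of the prior
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # small LR keeps the policy near the prior

for epoch in range(50):
    for obs_seq, act_seq in demo_loader:                    # placeholder loader over a few task demos
        loss = pretrain_loss(model, obs_seq, act_seq)       # same objective as pretraining
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```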
We finetune the HOP-initialized base policy using RL on three dexterous manipulation tasks in simulation. Finetuning the learned prior with RL is more sample-efficient and yields more robust, generalizable policies, outperforming training from scratch and other demonstration-guided RL algorithms.
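A hedged sketch of the RL side: the pretrained prior supplies the actor's action mean, and a simple policy-gradient (REINFORCE) update adapts it to the downstream reward. The environment interface is a placeholder, and the paper's actual RL algorithm may differ; this only shows how the prior's weights seed the actor.

```python
import torch

actor = ManipulationPrior()
actor.load_state_dict(torch.load("hop_pretrained.pt"))  # hypothetical checkpoint of the prior
log_std = torch.nn.Parameter(torch.zeros(22))           # learned exploration noise
opt = torch.optim.Adam(list(actor.parameters()) + [log_std], lr=3e-5)

def rollout_update(env, horizon=64):
    obs_hist, log_probs, rewards = [env.reset()], [], []
    for t in range(horizon):
        obs_seq = torch.stack(obs_hist)[None]           # (1, t+1, obs_dim) history so far
        mean = actor(obs_seq)[0, -1]                    # action prediction for the latest step
        dist = torch.distributions.Normal(mean, log_std.exp())
        action = dist.sample()
        log_probs.append(dist.log_prob(action).sum())
        obs, reward = env.step(action)                  # placeholder environment interface
        obs_hist.append(obs); rewards.append(reward)
    returns = torch.tensor(rewards).flip(0).cumsum(0).flip(0)  # undiscounted returns-to-go
    loss = -(torch.stack(log_probs) * returns).mean()          # REINFORCE objective
    opt.zero_grad(); loss.backward(); opt.step()
```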
Below are example trajectories retargeted from human hand-object interaction videos to robots. This approach has the potential to scale data collection for robot learning using in-the-wild videos.
We provide examples of 3D hand-object reconstructions from in-the-wild videos, displaying samples of the extracted hand mesh and the object point cloud in the same 3D space. While occlusions increase detection noise, the higher-level structure of the hand-object interaction, such as affordances and pre- and post-grasp trajectories, is preserved.
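A minimal sketch of the object half of this lifting: back-projecting segmented depth pixels into camera space with the standard pinhole model. The hand is typically recovered with a parametric mesh model such as MANO, which is omitted here; function and argument names are illustrative.

```python
import numpy as np

def depth_to_pointcloud(depth, mask, fx, fy, cx, cy):
    # depth: (H, W) metric depth; mask: (H, W) boolean segmentation of the object
    v, u = np.nonzero(mask)                # pixel coordinates covered by the object
    z = depth[v, u]
    x = (u - cx) * z / fx                  # pinhole back-projection: X = (u - cx) Z / fx
    y = (v - cy) * z / fy                  # Y = (v - cy) Z / fy
    return np.stack([x, y, z], axis=-1)    # (N, 3) points in the camera frame
```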
Below are more demonstrations of our BC-finetuned policy in the real world on a diverse set of objects.
This work was supported by the DARPA Machine Common Sense program, the DARPA Transfer from Imprecise and Abstract Models to Autonomous Technologies (TIAMAT) program, and by the ONR MURI award N00014-21-1-2801. This work was also funded by ONR MURI N00014-22-1-2773. We thank Adhithya Iyer for assistance with teleoperation systems, Phillip Wu for setting up the real robot, and Raven Huang, Jathushan Rajasegaran and Yutong Bai for helpful discussions.
@misc{singh2024handobjectinteractionpretrainingvideos,
  title={Hand-Object Interaction Pretraining from Videos},
  author={Himanshu Gaurav Singh and Antonio Loquercio and Carmelo Sferrazza and Jane Wu and Haozhi Qi and Pieter Abbeel and Jitendra Malik},
  year={2024},
  eprint={2409.08273},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2409.08273}
}