Hand-Object Interaction Pretraining from Videos

University of California, Berkeley

We present an approach to learn general robot manipulation priors from 3D hand-object interaction trajectories. We build a framework that uses in-the-wild videos to generate sensorimotor robot trajectories. We do so by lifting both the human hand and the manipulated object into a shared 3D space and retargeting human motions to robot actions. Generative modeling on this data gives us a task-agnostic base policy that captures a general yet flexible manipulation prior. We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to alternative approaches.



3D hand-object trajectories from in-the-wild human manipulation videos are retargeted to a robot embodiment within a physics simulator, resulting in physically grounded robot data. General manipulation priors are learned from this data with generative modeling of trajectories. This representation enables sample-efficient adaptation to new downstream tasks.

The objective of Hand-Object interaction Pretraining (HOP) is to capture general hand-object interaction priors from videos. Our key intuition is that the basic skills required for manipulation lie on a manifold whose axes are well covered by unstructured human-object interactions. We learn a general manipulation prior implicitly embedded in the weights of a causal transformer, pretrained with a conditional distribution-matching objective on sensorimotor robot trajectories.
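To make this concrete, below is a minimal sketch of such a pretraining setup: a causal transformer that conditions on past observation features and actions and predicts the next action, with a simple MSE surrogate standing in for the distribution-matching objective. The architecture, dimensions, and names (e.g., CausalPolicyPrior) are illustrative assumptions, not the exact model used here.

import torch
import torch.nn as nn

class CausalPolicyPrior(nn.Module):
    def __init__(self, obs_dim=256, act_dim=22, d_model=512, n_layers=6, n_heads=8, max_len=128):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, d_model)   # per-step observation features (e.g., a depth encoding)
        self.act_proj = nn.Linear(act_dim, d_model)   # previous robot action
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.act_head = nn.Linear(d_model, act_dim)   # predicts the next action

    def forward(self, obs_feats, prev_actions):
        # obs_feats: (B, T, obs_dim); prev_actions: (B, T, act_dim)
        T = obs_feats.shape[1]
        pos = self.pos_emb(torch.arange(T, device=obs_feats.device))
        x = self.obs_proj(obs_feats) + self.act_proj(prev_actions) + pos
        causal_mask = torch.triu(torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        h = self.backbone(x, mask=causal_mask)        # causal attention: each step sees only the past
        return self.act_head(h)

def pretraining_loss(model, obs_feats, actions):
    # Condition on past observations and actions; predict the next action.
    prev_actions = torch.cat([torch.zeros_like(actions[:, :1]), actions[:, :-1]], dim=1)
    pred = model(obs_feats, prev_actions)
    return nn.functional.mse_loss(pred, actions)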

These trajectories are generated by mapping 3D hand-object interactions from a subset of 100DOH and DexYCB to the robot's embodiment via a physically grounded simulator. We extract sensorimotor information from videos by lifting the human hand and the manipulated object into a shared 3D space. We then bring these 3D representations into a physics simulator, where we map human motion to robot actions. There are several advantages to using a simulator as an intermediary between videos and robot sensorimotor trajectories: (i) we can add physics, inevitably lost in videos, back to the interactions; (ii) it enables the synthesis of large training datasets without putting the physical platform in danger; and (iii) we can add diversity to the data by randomizing the simulation environment, e.g., varying the friction of the robot's joints, the scene's layout, and the object's location relative to the robot.
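A rough sketch of one such sim-in-the-loop generation step, with domain randomization, is shown below. The env interface, the retarget_hand_pose helper, and the trajectory fields are hypothetical placeholders; they illustrate the structure of the loop, not the actual pipeline.

import numpy as np

def generate_robot_trajectory(env, video_traj, rng):
    # Hypothetical interface: `env` wraps a physics simulator, `video_traj` holds the lifted
    # 3D hand poses and initial object pose from one video, `rng` is a numpy Generator.
    env.reset(
        joint_friction=rng.uniform(0.5, 1.5),                                      # randomize joint friction
        object_pose=video_traj.object_init_pose + rng.normal(0.0, 0.02, size=7),   # jitter object placement
        table_height=rng.uniform(-0.02, 0.02),                                     # randomize scene layout
    )
    robot_traj = []
    for hand_pose in video_traj.hand_poses:
        action = retarget_hand_pose(hand_pose, env.robot)  # map human motion to robot joint targets (hypothetical helper)
        obs = env.step(action)                             # the simulator grounds the interaction physically
        robot_traj.append((obs, action))
    return robot_traj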

In contrast to previous work, we do not assume a strict alignment of the human’s intent in the video and the downstream robot tasks. This allows us to train on a large number of videos and learn an end-to-end, task-agnostic prior. We find that finetuning this prior using either RL or BC allows fast skill acquisition.

HOP enables sample-efficient BC finetuning

We train a depth-based, end-to-end manipulation prior on sensorimotor trajectories extracted from human hand-object interaction videos. We find that this prior can be finetuned to downstream tasks with few demonstrations, outperforming policies trained from scratch and other visual-pretraining baselines.
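As a sketch of this finetuning stage, reusing the pretraining objective from the earlier sketch, the prior's weights are loaded and trained further on a small set of downstream demonstrations; the checkpoint name and data format below are assumptions.

import torch

def finetune_bc(model, demo_loader, epochs=50, lr=1e-4):
    # `model` is the pretrained prior (e.g., CausalPolicyPrior above); `demo_loader`
    # yields (obs_feats, actions) tensors from a handful of downstream demonstrations.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for obs_feats, actions in demo_loader:
            loss = pretraining_loss(model, obs_feats, actions)  # same objective, downstream data
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# model = CausalPolicyPrior()
# model.load_state_dict(torch.load("hop_prior.pt"))  # hypothetical checkpoint name
# finetune_bc(model, demo_loader)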

Initialization with HOP leads to more robust RL policies

We finetune the HOP-initialized base policy using RL on three dexterous manipulation tasks in simulation. Finetuning the learned prior with RL is more sample-efficient and leads to more robust, generalizable policies, outperforming training from scratch and other demonstration-guided RL algorithms.
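A minimal sketch of this initialization is below: the PPO actor is warm-started from the pretrained prior while the critic is trained from scratch, and the actor is then updated with the standard clipped PPO surrogate on task reward. Network sizes and names are illustrative.

import copy
import torch
import torch.nn as nn

def build_ppo_agent(pretrained_prior, obs_dim=256, hidden=512):
    actor = copy.deepcopy(pretrained_prior)   # warm-start the policy with HOP weights
    critic = nn.Sequential(                   # value function is learned from scratch
        nn.Linear(obs_dim, hidden), nn.ELU(),
        nn.Linear(hidden, hidden), nn.ELU(),
        nn.Linear(hidden, 1),
    )
    return actor, critic

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Standard clipped PPO surrogate used to update the warm-started actor.
    ratio = (new_log_probs - old_log_probs).exp()
    clipped = ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()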


Comparison of HOP-initialized actor with baselines. HOP improves sample-efficiency of online RL across multiple tasks, particularly when the downstream task and the behaviors in the data are less aligned. Runs are averaged across three randomly chosen seeds.


Evaluating RL finetuning under out-of-distribution scenarios. (Left) To test grasp robustness, we apply forces to the grasped objects in random directions, with magnitudes equal to their weights. When initialized with HOP, the resulting policy is more than 3x more robust than a PPO policy trained from scratch. (Right) We evaluate grasp success on multiple objects from the YCB dataset that were not part of the training set. When initialized with HOP, the resulting policy is more than 2x more robust than a PPO policy trained from scratch.
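For reference, a small sketch of sampling such a perturbation force (random direction, magnitude equal to the object's weight) is below; how the force is applied to the grasped object depends on the simulator's API and is left out.

import numpy as np

def weight_magnitude_force(object_mass, rng, g=9.81):
    # A force in a uniformly random direction whose magnitude equals the object's weight.
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    return direction * object_mass * g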

Samples from sim-in-the-loop retargeting

Below are example trajectories retargeted from human hand-object interaction videos to robots. This approach has the potential to scale data collection for robot learning using in-the-wild videos.

Retargeting in-the-wild videos

We provide examples of 3D hand-object reconstructions from in-the-wild videos. We display samples of the extracted hand mesh and the object point cloud in the same 3D space. While occlusions lead to increased detection noise, the higher-level details of the hand-object interaction, such as affordances and pre-grasp and post-grasp trajectories, are preserved.

More Demonstrations

Following are more demonstrations of our BC-finetuned policy in the real world on a diverse set of objects.

Acknowledgements

This work was supported by the DARPA Machine Common Sense program, the DARPA Transfer from Imprecise and Abstract Models to Autonomous Technologies (TIAMAT) program, and by the ONR MURI award N00014-21-1-2801. This work was also funded by ONR MURI N00014-22-1-2773. We thank Adhithya Iyer for assistance with teleoperation systems, Phillip Wu for setting up the real robot, and Raven Huang, Jathushan Rajasegaran, and Yutong Bai for helpful discussions.

BibTeX

@misc{singh2024handobjectinteractionpretrainingvideos,
title={Hand-Object Interaction Pretraining from Videos},
author={Himanshu Gaurav Singh and Antonio Loquercio and Carmelo Sferrazza and Jane Wu and Haozhi Qi and Pieter Abbeel and Jitendra Malik},
year={2024},
eprint={2409.08273},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2409.08273}
}