ABM: Attention before Manipulation

Abstract

Vision-language models (VLMs) show promising generalization and zero-shot capabilities, offering a potential solution to the impracticality and cost of enabling robots to comprehend diverse human instructions and scene semantics in the real world. Most existing approaches directly integrate the semantic representations from pre-trained VLMs into policy learning. However, these methods remain tied to the labeled data they were trained on, and therefore generalize poorly to unseen instructions and objects.

To address this limitation, we propose a simple method called "Attention before Manipulation" (ABM), which leverages the object knowledge encoded in CLIP to extract information about the target object in the image. ABM constructs an Object Mask Field that serves as a better representation of the target object, allowing the model to separate visual grounding from action prediction and to acquire specific manipulation skills effectively.

We train ABM on 8 RLBench tasks and 2 real-world tasks via behavior cloning. Extensive experiments show that our method significantly outperforms the baselines in the zero-shot and compositional generalization settings.

ABM

We present ABM, whose key idea is to extract rich commonsense knowledge from a frozen, pre-trained CLIP model to construct an Object Mask Field and to use it as a better representation of the target object for policy learning. Our objective is to enable the agent, guided by instructions, to execute manipulation tasks involving novel object categories that are not included in the training dataset. An overview of the proposed ABM is shown below.

Overview of ABM.
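
To make the idea of grounding before acting concrete, the following is a minimal 2D sketch and not the ABM implementation. It assumes that per-patch image features and a text feature living in the same embedding space are already available (standard CLIP returns a single image embedding, so obtaining dense features requires an extra step that is not shown here). The function names, the cosine-similarity scoring, the threshold of 0.5, and the choice to append the mask as a fourth input channel are all illustrative assumptions; the text above does not specify how the Object Mask Field is computed or fed to the policy.

```python
import numpy as np


def cosine_similarity_map(patch_feats, text_feat, eps=1e-8):
    """Score every image patch against the instruction's target object.

    patch_feats: (H, W, D) array of per-patch features (assumed given).
    text_feat:   (D,) array, text feature in the same space (assumed given).
    Returns an (H, W) array of cosine similarities.
    """
    p = patch_feats / (np.linalg.norm(patch_feats, axis=-1, keepdims=True) + eps)
    t = text_feat / (np.linalg.norm(text_feat) + eps)
    return p @ t


def object_mask(patch_feats, text_feat, threshold=0.5):
    """Binary mask of patches that match the target object.

    The threshold is a hypothetical hyperparameter chosen for illustration.
    """
    sim = cosine_similarity_map(patch_feats, text_feat)
    return (sim >= threshold).astype(np.float32)


def attach_mask(rgb, mask):
    """Append the mask as an extra channel: (H, W, 3) + (H, W) -> (H, W, 4).

    The policy then sees where the target is without having to ground
    the instruction itself (one possible way to decouple the two steps).
    """
    return np.concatenate([rgb, mask[..., None]], axis=-1)


# Toy data standing in for real features: a 4x4 grid of 8-dim patch features.
rng = np.random.default_rng(0)
H, W, D = 4, 4, 8
text_feat = rng.normal(size=D)
patch_feats = rng.normal(size=(H, W, D))
patch_feats[1:3, 1:3] = text_feat  # pretend these patches show the target

mask = object_mask(patch_feats, text_feat)
observation = attach_mask(rng.random((H, W, 3)), mask)
```

In this sketch the feature extractor stays frozen and only the downstream policy would be trained on `observation`, which mirrors the separation of visual grounding from action prediction described above.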

Experiment Results

We trained one ABM model on real-world data and one ABM model on RLBench simulation data. In each setting, that single trained model is evaluated on all tasks.