ABM: Attention before Manipulation

1 Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
2 Shenzhen University

Abstract

Vision-language models (VLMs) show promising generalization and zero-shot capabilities, offering a potential solution to the impracticality and cost of enabling robots to comprehend diverse human instructions and scene semantics in the real world. Most existing approaches directly integrate the semantic representations from pre-trained VLMs with policy learning. However, these methods are limited to the labeled data they were trained on, and so generalize poorly to unseen instructions and objects.

To address this limitation, we propose a simple method called "Attention before Manipulation" (ABM), which fully leverages the object knowledge encoded in CLIP to extract information about the target object in the image. ABM constructs an Object Mask Field, which serves as a better representation of the target object and lets the model separate visual grounding from action prediction and acquire specific manipulation skills effectively.

We train ABM on 8 RLBench tasks and 2 real-world tasks via behavior cloning. Extensive experiments show that our method significantly outperforms the baselines in both the zero-shot and compositional generalization settings.

ABM

We present ABM, whose key idea is to extract rich commonsense knowledge from a frozen pre-trained CLIP model to construct an Object Mask Field and use it as a better representation of the target object for policy learning. Our objective is to empower the agent to successfully execute instruction-guided manipulation tasks involving novel object categories that are not included in the training dataset. An overview of the proposed ABM is shown below.


Overview of ABM.
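
To make the idea of a language-conditioned object mask more concrete, here is a minimal sketch. This page does not specify how the Object Mask Field is computed, so everything below is an assumption for illustration: the function name `object_mask_from_similarity`, the threshold value, and the use of plain cosine similarity between per-patch image features and a text embedding are all hypothetical stand-ins, and the toy arrays take the place of real CLIP features.

```python
import numpy as np

def object_mask_from_similarity(patch_feats, text_emb, threshold=0.5):
    """Hypothetical helper: binary mask from patch/text cosine similarity.

    patch_feats: (H, W, D) array of per-patch image features (assumed to come
                 from a frozen vision-language encoder such as CLIP).
    text_emb:    (D,) array, embedding of the target-object phrase.
    Returns (mask, sim01), both of shape (H, W).
    """
    eps = 1e-8
    # L2-normalize so the dot product is a cosine similarity.
    p = patch_feats / (np.linalg.norm(patch_feats, axis=-1, keepdims=True) + eps)
    t = text_emb / (np.linalg.norm(text_emb) + eps)
    sim = p @ t  # (H, W)
    # Min-max rescale to [0, 1] so the threshold is relative to this image.
    lo, hi = sim.min(), sim.max()
    sim01 = (sim - lo) / (hi - lo + eps)
    mask = (sim01 >= threshold).astype(np.float32)
    return mask, sim01

# Toy stand-in for real features: a 4x4 grid of 8-dim patch features.
H, W, D = 4, 4, 8
text_emb = np.zeros(D)
text_emb[0] = 1.0                      # "target object" direction
patch_feats = np.zeros((H, W, D))
patch_feats[..., 1] = 1.0              # background patches: orthogonal to text
patch_feats[1:3, 1:3] = text_emb       # 2x2 region that matches the text

mask, sim = object_mask_from_similarity(patch_feats, text_emb)
print(mask)
```

In this toy setup the mask is nonzero only on the 2x2 region aligned with the text embedding. A mask of this kind could then be passed to a policy as an extra input channel, which is one way to keep visual grounding separate from action prediction; how ABM actually does this should be taken from the paper rather than from this sketch.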

Experiment Results

We trained one ABM model on real-world data and one ABM model on RLBench simulation data. In each setting, that single trained model is evaluated on all tasks.


Real World Videos (8X Default Speed)

Place Something in Cabinet

Seen: Place the toy in the cabinet

Unseen: Place the toy in the cabinet

Seen: Place the cup in the cabinet

Unseen: Place the mug in the cabinet

Place Something in Plate

Seen: Put the orange in the plate

Unseen: Put the lemon in the plate

Seen: Put the soda in the plate

Unseen: Put the pepsi in the plate

Simulation Videos

Put in Safe

Seen: Put the money away in the safe on the middle shelf

Unseen: Put the yellow block away in the safe on the top shelf

Place Wine

Seen: Stack the wine bottle to the middle of the rack

Unseen: Stack the green bottle to the right of the rack

Water Plants

Seen: Water plant

Unseen: Water plant

Close Jar

Seen: Close the purple jar

Unseen: Close the green jar

Insert Peg

Seen: Put the ring on the orange spoke

Unseen: Put the ring on the black spoke

Meat off Grill

Seen: Take the chicken off the grill

Unseen: Take the carrot off the grill

Pick Block

Seen: Pick the orange block to yellow target

Unseen: Pick the purple block to green target

Place Food

Seen: Put the carrot in the plate

Unseen: Put the green pepper in the plate

Compositional Tasks

Pick Food

Unseen: Pick the tomato to pink target

Place Block

Unseen: Put the orange cube in the plate