Vision-language models (VLMs) show promising generalization and zero-shot capabilities, offering a potential solution to the impracticality and cost of enabling robots to comprehend diverse human instructions and scene semantics in the real world. Most existing approaches directly integrate the semantic representations from pre-trained VLMs with policy learning. However, these methods are limited by the labeled data they are trained on, resulting in poor generalization to unseen instructions and objects.
To address this limitation, we propose a simple method called "Attention before Manipulation" (ABM), which fully leverages the object knowledge encoded in CLIP to extract information about the target object in the image. ABM constructs an Object Mask Field, which serves as a better representation of the target object and lets the model separate visual grounding from action prediction and acquire specific manipulation skills effectively.
We train ABM on 8 RLBench tasks and 2 real-world tasks via behavior cloning. Extensive experiments show that our method significantly outperforms the baselines in the zero-shot and compositional generalization settings.
We present ABM, whose key idea is to extract rich commonsense knowledge from a frozen pre-trained CLIP model to construct an Object Mask Field and use it as a better representation of the target object for policy learning. Our objective is to enable the agent to execute instruction-guided manipulation tasks involving novel object categories not included in the training dataset. An overview of the proposed ABM is shown below.
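As a rough illustration of the masking idea only (not the authors' implementation, and separate from the overview figure), the sketch below thresholds the cosine similarity between per-pixel visual features and a text embedding of the target object to obtain a 2D binary mask. The function names `cosine_similarity_map` and `object_mask`, the threshold value, and the toy one-hot features are all assumptions made for this example. In practice the features would come from a frozen CLIP encoder, and how the mask is lifted into the Object Mask Field is not specified here.

```python
import numpy as np

def cosine_similarity_map(pixel_feats, text_emb, eps=1e-8):
    """Cosine similarity between each pixel feature and a text embedding.

    pixel_feats: array of shape (H, W, D)
    text_emb:    array of shape (D,)
    returns:     array of shape (H, W) with values in [-1, 1]
    """
    pf = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + eps)
    te = text_emb / (np.linalg.norm(text_emb) + eps)
    return pf @ te

def object_mask(pixel_feats, text_emb, threshold=0.5):
    """Binary mask of pixels whose features align with the text embedding.

    The threshold is a hypothetical hyperparameter chosen for this sketch.
    """
    return cosine_similarity_map(pixel_feats, text_emb) > threshold

# Toy stand-in for encoder outputs: the "target object" occupies a 2x2
# patch whose features match the text embedding, while the background
# features are orthogonal to it.
D = 8
text_emb = np.zeros(D)
text_emb[0] = 1.0
pixel_feats = np.zeros((4, 4, D))
pixel_feats[..., 1] = 1.0          # background direction
pixel_feats[1:3, 1:3] = text_emb   # object region aligned with the text
mask = object_mask(pixel_feats, text_emb)
```

Here `mask` is true only on the central 2x2 patch. A policy could then receive this mask as an extra input channel, which is one way to keep visual grounding separate from action prediction.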
We trained one ABM model on real-world data and one ABM model on RLBench simulation data. In both settings, that single trained model is evaluated on all tasks.
Seen: Place the toy in the cabinet
Unseen: Place the toy in the cabinet
Seen: Place the cup in the cabinet
Unseen: Place the mug in the cabinet
Seen: Put the orange in the plate
Unseen: Put the lemon in the plate
Seen: Put the soda in the plate
Unseen: Put the pepsi in the plate
Seen: Put the money away in the safe on the middle shelf
Unseen: Put the yellow block away in the safe on the top shelf
Seen: Stack the wine bottle to the middle of the rack
Unseen: Stack the green bottle to the right of the rack
Seen: Water plant
Unseen: Water plant
Seen: Close the purple jar
Unseen: Close the green jar
Seen: Put the ring on the orange spoke
Unseen: Put the ring on the black spoke
Seen: Take the chicken off the grill
Unseen: Take the carrot off the grill
Seen: Pick the orange block to yellow target
Unseen: Pick the purple block to green target
Seen: Put the carrot in the plate
Unseen: Put the green pepper in the plate
Unseen: Pick the tomato to pink target
Unseen: Put the orange cube in the plate