The transformer architecture has driven major advances in natural language processing and generative modeling with models like BERT, GPT-3, and DALL-E. Now, a new study from Google DeepMind shows that transformer-based vision-language models can also be applied to robotics, enabling more capable robotic control.

The key idea: in order to fit both natural language responses and robotic actions into the same format, the DeepMind team expresses the actions as text tokens and incorporates them directly into the model's training set, treating them the same way as natural language tokens.
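
To make that idea concrete, here is a minimal sketch in Python (not taken from the RT-2 codebase) of how a continuous robot action could be discretized into integer bins and rendered as a plain text string; the bin count, value range, and action layout are illustrative assumptions.

```python
import numpy as np

NUM_BINS = 256  # assumed bin count for illustration only

def action_to_tokens(action, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map each action dimension (e.g. end-effector deltas, rotation, gripper)
    to an integer bin and join the bins into a space-separated token string."""
    action = np.clip(np.asarray(action, dtype=np.float32), low, high)
    bins = np.round((action - low) / (high - low) * (num_bins - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def tokens_to_action(token_str, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Inverse mapping: decode the model's generated text back into a continuous action."""
    bins = np.array([int(t) for t in token_str.split()], dtype=np.float32)
    return bins / (num_bins - 1) * (high - low) + low

# A training example then looks like ordinary image-conditioned text:
#   prompt: "What should the robot do to pick up the apple?"
#   target: the token string produced by action_to_tokens(...)
action = [0.03, -0.55, 0.57, -0.29, -0.89, 0.0, 1.0]
encoded = action_to_tokens(action)
print(encoded)
print(np.round(tokens_to_action(encoded), 2))
```

Because the action is now just another string, the same next-token objective and decoder that produce language can also produce actions.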

From the paper:

We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain of thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is too sleepy (an energy drink).
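The chain-of-thought variant mentioned in the excerpt can be pictured as a text target that pairs a short natural-language plan with the action tokens, so one decoder produces both; the helper names and exact string layout below are assumptions for illustration, not the paper's code.

```python
def build_cot_target(plan: str, action_tokens: str) -> str:
    """Concatenate a semantic plan and an action-token string into a single text target."""
    return f"Plan: {plan} Action: {action_tokens}"

def parse_cot_output(output: str) -> tuple[str, str]:
    """Split generated text back into the plan and the action-token string."""
    plan_part, _, action_part = output.partition("Action:")
    return plan_part.replace("Plan:", "").strip(), action_part.strip()

target = build_cot_target("pick up the rock to use as an improvised hammer",
                          "132 57 200 91 14 128 255")
plan, actions = parse_cot_output(target)
print(plan)     # pick up the rock to use as an improvised hammer
print(actions)  # 132 57 200 91 14 128 255
```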

🔗 Link to the github.io project page