
The bi-arm Franka robot displayed its reasoning skills in the banana test. Source: Google DeepMind/YouTube
Google DeepMind released a video on YouTube on September 25, showing how its robots can now perform complex, multi-step tasks using multimodal reasoning.
In a series of tests carried out by engineers, Google’s robots passed the banana test, which involved sorting different fruits onto plates by color.
The robots’ reasoning capabilities come from the Gemini Robotics 1.5 family of models, which allows the machines to perceive, think, and execute complex real-world tasks autonomously.
The Gemini Robotics 1.5 family consists of two models: Gemini Robotics 1.5, which turns visuals and instructions into robot actions, and Gemini Robotics-ER 1.5, which reasons about the world and creates step-by-step plans to solve tasks.
The banana test: Before and after
In the previous version of Gemini Robotics, the headline task was to have the robot pick up a banana and place it in a bowl; the model could only follow and execute one instruction at a time.
With Gemini Robotics 1.5, the robot sorted three different fruits, including a banana, onto different plates based on color. Jie Tan, a Senior Staff Research Scientist at Google DeepMind, demonstrated the experiment.
In doing so, the bi-arm Franka robot demonstrated its ability to follow multi-step instructions and execute them one by one with precision.
Sorting laundry
Another test involved Apptronik’s humanoid, Apollo, sorting laundry. In this scenario, the humanoid sorted clothes by color into two baskets, one white and one black.
To raise the difficulty after the first successful attempt, the engineers changed the baskets’ positions mid-operation to see whether the robot could detect the shift and respond. Apollo recognized the change and sorted the clothes accordingly.
Exploring agentic capabilities
Gemini Robotics 1.5 also lets robots learn across embodiments, observing their environments and acting on what they see. Google’s ALOHA 2 robot performed most of the tasks in the experiments, but the same tasks can also be carried out by Apollo and the bi-arm Franka robot.
The new AI models also unlock agentic capabilities for robots. For example, a robot might be asked to sort objects into the correct compost, recycling, and trash bins based on local rules.
To do this, it would first need to search online for the area’s recycling guidelines, then visually inspect the objects, decide where each item belongs, and finally carry out the full sequence of steps to put them away.
This type of multi-step reasoning and execution is made possible by an agentic framework in which the two models work together: Gemini Robotics-ER 1.5 handles reasoning and planning, while Gemini Robotics 1.5 turns perception into action, helping robots complete real-world tasks with greater transparency and reliability.
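To make the planner/actor split concrete, here is a minimal sketch of that orchestration loop in plain Python. Everything in it is illustrative: `search_local_rules`, `plan_steps`, and `execute_action` are hypothetical stand-ins for the web search the planner would trigger, the reasoning model (Gemini Robotics-ER 1.5), and the vision-language-action model (Gemini Robotics 1.5); they are not real Gemini Robotics APIs.

```python
# Hypothetical sketch of the recycling example above.
# None of these functions are real Gemini Robotics APIs; they stand in for
# the web-search tool, the planner model, and the action model.
from dataclasses import dataclass


@dataclass
class Step:
    description: str   # natural-language sub-task, e.g. "put the can in recycling"
    target_bin: str    # "compost", "recycling", or "trash"


def search_local_rules(location: str) -> str:
    """Stand-in for the online lookup of local recycling guidelines."""
    return f"Guidelines for {location}: cans -> recycling, peels -> compost."


def plan_steps(instruction: str, rules: str, scene_objects: list[str]) -> list[Step]:
    """Stand-in for the reasoning model: break the goal into ordered sub-tasks."""
    bins = {"banana peel": "compost", "soda can": "recycling", "chip bag": "trash"}
    return [Step(f"place the {obj} in the {bins[obj]} bin", bins[obj])
            for obj in scene_objects if obj in bins]


def execute_action(step: Step) -> bool:
    """Stand-in for the action model: turn one sub-task into robot motion."""
    print(f"[actor] executing: {step.description}")
    return True  # pretend the motion succeeded


def run_agent(instruction: str, location: str, scene_objects: list[str]) -> None:
    rules = search_local_rules(location)                    # 1. look up local rules
    steps = plan_steps(instruction, rules, scene_objects)   # 2. reason and plan
    for step in steps:                                      # 3. act one step at a time
        if not execute_action(step):
            print(f"[planner] step failed, would replan: {step.description}")
            break


run_agent("sort the waste into the right bins",
          location="San Francisco",
          scene_objects=["banana peel", "soda can", "chip bag"])
```

The point of the sketch is the division of labor: the planner reasons over outside information and the scene to produce an explicit, inspectable list of sub-tasks, and the actor executes each one, which is where the transparency described above comes from.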
The responsibility for safety
Google is emphasizing safety in its Gemini Robotics 1.5 models by teaching robots to think about risks before acting, respect human rules, and avoid accidents.
With support from dedicated safety teams and the updated ASIMOV benchmark, Gemini Robotics-ER 1.5 achieved state-of-the-art results on safety tests, supporting safer real-world deployment.