(a): Existing vision systems with fixed RGB-D cameras cannot handle fine-grained visual information across larger spatial extents. (b): The team's EyeVLA system can perceive broader and finer-grained visual information from a fixed position by rotating its viewpoint and zooming in on the target, according to instructions. Credit: arXiv (2025). DOI: 10.48550/arxiv.2511.15279
Embodied artificial intelligence (AI) systems are robotic agents that rely on machine learning algorithms to sense their surroundings, plan their actions and execute them. A key component of these systems is the visual perception module, which allows them to analyze and interpret the images captured by their cameras.
Most existing visual perception modules for embodied AI agents rely on RGB-D cameras, devices that capture both color (RGB) images and depth (D) information. In most cases, however, these cameras are attached to a robot and remain fixed in place, which limits their ability to detect changes in dynamic and complex environments.
Researchers at Shanghai Jiao Tong University, the Chinese Academy of Sciences and Dalian University of Technology recently developed a new robotic system inspired by human eyeballs, which can rotate and zoom in to acquire clearer images of objects without the need for additional sensors or more expensive cameras. This robotic eyeball, called EyeVLA, was presented in a paper published on the arXiv preprint server.
"Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications," wrote Jiashu Yang, Yifan Han and their colleagues in their paper.
"To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent."
The team's robotic eyeball and its underlying model
In contrast with many other robotic systems for visual perception introduced in the past, the eyeball-like system created by these researchers can rotate itself and zoom in to capture specific elements in its surroundings more clearly. In addition, the team created machine learning models that allow the robotic eyeball to process users' instructions and change its viewpoint accordingly.
The models they developed, which were trained via reinforcement learning, convert the camera's movements into 'action tokens,' planning future actions much as other models predict words or image features. The models also place 2D bounding boxes around objects to guide the system toward specific areas of interest.
"EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence," wrote the authors.
"By using the 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision language action (VLA) policy using only minimal real-world data."
Remarkable performance and possible future applications
The researchers have already tested their proposed system in a series of experiments in indoor settings, where they assessed its ability to acquire clearer images and accurately interpret them. They found that the system performed remarkably well, without relying on expensive sensors or cameras.
"Experiments show that our system efficiently performs instructed scenes in real-world environments and actively acquires more accurate visual information through instruction-driven actions of rotation and zoom, thereby achieving strong environmental perception capabilities," wrote Yang, Han and their colleagues.
"EyeVLA introduces a novel robotic vision system that leverages detailed and spatially rich, large-scale embodied data, and actively acquires highly informative visual observations for downstream embodied tasks."
In the future, the eyeball-like vision system created by this team of researchers could be improved further and tested in a broader range of dynamic environments. Eventually, it could be integrated with other robotic components and be deployed in real-world settings.
EyeVLA could ultimately enhance the performance of robots in a wide range of applications, from the inspection of infrastructure, warehouses and public spaces to the monitoring of natural environments and the efficient completion of household chores.
Written by Ingrid Fadelli, edited by Robert Egan.
More information: Jiashu Yang et al, Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception, arXiv (2025). DOI: 10.48550/arxiv.2511.15279
Citation: New robotic eyeball could enhance visual perception of embodied AI (2025, December 3) retrieved 18 December 2025 from https://techxplore.com/news/2025-12-robotic-eyeball-visual-perception-embodied.html