When identifying objects that can be sold at a garage sale, open ad-hoc categorization, or OAK for short, can discover new categories like hats or luggage even when only provided with the concept of 'shoes' and a few shoe image examples during training. The system leverages unlabeled data and sparse labels to identify both known and unknown concepts that fit the garage sale context. Credit: Wang et al., 2025.
A new approach called open ad-hoc categorization (OAK) lets AI systems reinterpret the same image differently depending on the categorization context, rather than relying on a single fixed visual interpretation. The University of Michigan-led study was presented in June 2025 at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in Nashville, Tennessee.
"When people think about using AI for image categorization, they often assume that each image has a fixed, objective meaning. Our work shows that an image can be viewed from multiple perspectives, depending on the task, context or goals. Just like humans don't see an image as static, but adapt its meaning based on what they need, AI should interpret images flexibly, adjusting based on context and objectives," said Stella Yu, a professor of computer science and engineering at U-M and senior author of the study.
Previous AI categorization systems used fixed, rigid categories like "chair," "car" or "dog" that could not adapt to different purposes or contexts. OAK instead interprets the same image differently depending on the desired context. For example, an image of a person drinking could be categorized by the action "drinking," the location "in a store," or the mood "happy."
The research team built their model by expanding on OpenAI's CLIP, a foundation vision-language AI model that learns to associate images with textual descriptions. They added context tokens that work like specialized instruction sets for the AI model. These tokens, learned from both labeled and unlabeled data, are fed into the system alongside the image data to shape how visual features are processed for different contexts. As a result, the model naturally focuses on the relevant image regions, such as hands for actions or the background for locations, without being explicitly told where to look.
Importantly, only the new context tokens are trained; the original CLIP model stays unchanged, allowing the system to adapt to different purposes without losing its existing knowledge.
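To make the mechanism concrete, here is a minimal PyTorch sketch of the idea as described above, not the authors' released code: a small set of learnable context tokens is prepended to the patch tokens of a frozen CLIP-style vision transformer. The names `ContextualizedEncoder` and `clip_vit` (assumed to accept a token sequence directly), the embedding size, and the token count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextualizedEncoder(nn.Module):
    """Prepends learnable context tokens to a frozen CLIP-style ViT.

    `clip_vit` is a hypothetical stand-in for the pretrained visual
    transformer; its parameters are frozen and only the context tokens
    receive gradient updates, so CLIP's original knowledge is preserved.
    """

    def __init__(self, clip_vit, embed_dim=768, n_context_tokens=4):
        super().__init__()
        self.clip_vit = clip_vit
        for p in self.clip_vit.parameters():
            p.requires_grad = False  # keep the original CLIP weights fixed
        # One small set of learnable tokens per categorization context
        self.context_tokens = nn.Parameter(torch.randn(n_context_tokens, embed_dim) * 0.02)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, n_patches, embed_dim) from CLIP's patch embedding
        batch = patch_tokens.shape[0]
        ctx = self.context_tokens.unsqueeze(0).expand(batch, -1, -1)
        # The context tokens attend jointly with the image patches, steering
        # which regions (hands, background, ...) the features emphasize.
        return self.clip_vit(torch.cat([ctx, patch_tokens], dim=1))
```

In such a setup, the optimizer would be built only over `context_tokens`, so supporting a new context (action, location, mood) costs just a few extra vectors rather than a new model.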
"We were surprised by how effectively the system learned to focus attention appropriately and organize data cleanly with such a simple mechanism of only a few tokens and a few labeled examples per context," said Zilin Wang, a doctoral student of computer science and engineering at U-M and lead author of the study.
Further, OAK is able to discover new categories it has never seen during training. For example, when asked to recognize items in an image that can be sold at a garage sale, the system would learn to find items like luggage or hats even if it was only shown examples of shoes.
OAK discovers new categories by combining top-down and bottom-up approaches. Top-down semantic guidance uses language knowledge to propose potential new categories: if the system knows shoes can be sold at a garage sale, it extends that knowledge to propose that hats might be sellable as well, even without seeing an example of a hat during training.
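As a rough illustration of this top-down step, the sketch below scores language-proposed category names against unlabeled image embeddings in CLIP's shared image-text space. The `encode_text` callable, the similarity threshold, and the notion of a support count are assumptions made for the example, not values from the paper.

```python
import torch
import torch.nn.functional as F

def score_semantic_proposals(image_feats, candidate_names, encode_text):
    """Check which language-proposed names (e.g. 'hat', 'luggage') are
    supported by the unlabeled images.

    `encode_text` is a placeholder wrapping CLIP's text encoder, assumed to
    return one embedding per name; `image_feats` is an (N, D) tensor of
    image embeddings from the contextualized encoder.
    """
    text_feats = F.normalize(encode_text(candidate_names), dim=-1)  # (C, D)
    image_feats = F.normalize(image_feats, dim=-1)                  # (N, D)
    sims = image_feats @ text_feats.t()                             # (N, C)
    # A proposal gains support when enough images match it strongly;
    # the 0.3 threshold is purely illustrative.
    support = (sims > 0.3).sum(dim=0)
    return {name: int(count) for name, count in zip(candidate_names, support)}
```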
In addition to its knowledge of language, OAK uses bottom-up visual clustering to discover patterns in unlabeled visual data. The system might notice many suitcases appearing in the unlabeled images for the task at hand, and thus discover a new category relevant to the garage sale, even though no suitcase was ever labeled as a valid item.
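The bottom-up step can be pictured as ordinary clustering of image embeddings. The sketch below uses k-means from scikit-learn as a stand-in for whatever clustering the method actually employs; the cluster count is an illustrative hyperparameter.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_visual_clusters(unlabeled_feats, n_clusters=20):
    """Group unlabeled image embeddings into visual clusters.

    `unlabeled_feats` is an (N, D) array of features from the
    contextualized encoder. Large, coherent clusters (e.g. many suitcase
    images) become candidates for new categories even with zero labels.
    """
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    assignments = kmeans.fit_predict(unlabeled_feats)
    sizes = np.bincount(assignments, minlength=n_clusters)
    return kmeans.cluster_centers_, assignments, sizes
```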
The researchers get these two approaches to work together during training. Semantic proposals such as hats prompt the visual system to search for matching images, and finding them confirms a valid new category. In the other direction, prominent visual clusters are named by tapping CLIP's existing image-text knowledge to identify what to call each cluster.
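Continuing the sketch, cluster centers from the bottom-up step can be matched back to candidate names in the shared embedding space, again using the hypothetical `encode_text` helper from above:

```python
import torch
import torch.nn.functional as F

def name_visual_clusters(cluster_centers, candidate_names, encode_text):
    """Give each visual cluster a tentative name via CLIP's image-text space.

    Both the candidate name list and `encode_text` are placeholders; the
    point is only that naming reduces to a nearest-neighbor lookup between
    cluster centers and text embeddings.
    """
    centers = F.normalize(torch.as_tensor(cluster_centers, dtype=torch.float32), dim=-1)
    text_feats = F.normalize(encode_text(candidate_names), dim=-1)
    best = (centers @ text_feats.t()).argmax(dim=1)  # best-matching name per cluster
    return [candidate_names[i] for i in best.tolist()]
```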
"We are looking for new categories using both the top-down and bottom-up methods, and they have to interact," said Wang.
The research team tested OAK on two image datasets, Stanford and Clevr-4, and compared its performance against two groups of baseline models: CLIP with an extended vocabulary and Generalized Category Discovery (GCD).
OAK achieved state-of-the-art performance in both accuracy and concept discovery across multiple categorization contexts. Notably, OAK reached 87.4% accuracy on novel categories when identifying mood in the Stanford dataset, surpassing CLIP and GCD by over 50%.
While all of the methods generate saliency maps, OAK's maps focus on the right part of the image for each context because it learns where to look from data rather than being explicitly programmed, offering both flexibility and interpretable results.
Moving forward, OAK's contextual approach will be helpful in applications like robotics, where systems need to perceive the same environment differently based on their current task.
The University of California, Berkeley and the Bosch Center for AI also contributed to this research.
More information: Open Ad-hoc Categorization with Contextualized Feature Learning: cvpr.thecvf.com/virtual/2025/poster/34699
Provided by University of Michigan College of Engineering