Imagine an AI that improves without human-labeled data and curated datasets. Just pure self-driven learning on tasks it creates, evaluates, and solves by itself. This is the groundbreaking concept behind Absolute Zero, a new paradigm in machine learning that could redefine how we fine-tune AI models.
Absolute Zero (paper, repo, project page, models) is a framework in which a single model learns to propose tasks that maximize its own learning progress and improves its reasoning by solving them, without relying on any external data. It is used to enhance base language models such as Qwen, Qwen-Coder, and LLaMA.
This shift toward autonomous learning has significant implications for AI development: it suggests that models can develop advanced reasoning skills without direct human guidance or predefined datasets, potentially reaching levels of intelligence beyond human capabilities. Moreover, if AI becomes superintelligent in the future, tasks designed by humans may no longer teach it much.
Using the Absolute Zero method, the team developed the Absolute Zero Reasoner (AZR), a self-evolving model designed for code-based tasks. Despite its fully autonomous training process, AZR achieves state-of-the-art performance on coding and mathematical reasoning benchmarks, surpassing models that were trained on large, domain-specific datasets, including Oat-Zero, SimpleRL-Zoo, ORZ, CodeR1, AceCoder, and PRIME-Zero (see the figure below).
Absolute Zero Reasoner (AZR) achieves state-of-the-art performance with zero data (source: paper)
Task performance scores of AZR training across different base models
The Absolute Zero paradigm
The Absolute Zero method is a new way to train AI that eliminates the need for any external data: the model learns entirely through self-play. Unlike traditional approaches such as Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), it requires no human-labeled pairs of questions and answers.
The model assumes two roles: proposer and solver. The Absolute Zero loop, in which the model repeatedly switches between proposing tasks and solving them, is illustrated in the following figure.
Proposer: The process starts with the agent (π) proposing a task (τ), which the environment (e) transforms through a function (f) into a validated problem (x, y⋆), along with a learnability reward (rₚᵣₒₚₒₛₑ).
Solver: The model then attempts to solve (x) by generating an answer (y). The environment evaluates (y) against the verified answer (y⋆) and assigns a solution reward (rₛₒₗᵥₑ).
Both the proposer (πₚᵣₒₚₒₛₑ) and solver (πₛₒₗᵥₑ) policies are updated based on their respective rewards, enabling the self-improving loop to continue.
The Absolute Zero loop (source: paper)
The Absolute Zero Reasoner model
The Absolute Zero Reasoner (AZR) is an autonomous model designed to generate and solve coding tasks. It represents the first practical implementation of the Absolute Zero paradigm.
AZR operates in a structured learning loop (illustrated in the next figure), where it proposes tasks, filters and validates them, attempts to solve them, and updates the model to refine its reasoning skills. The system engages in three modes of reasoning, illustrated by the toy sketch after this list:
- Abduction: Inferring inputs from given code and outputs
- Deduction: Predicting outputs from given code and inputs
- Induction: Synthesizing code from input–output examples
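To make the three modes concrete, here is a toy sketch. It is not taken from the AZR codebase; the program, inputs, and outputs are invented purely for illustration.

```python
# Toy illustration of AZR's three reasoning modes on a single Python program.
# The program, inputs, and outputs below are invented for illustration only.

def program(x):
    """The shared code snippet p."""
    return sorted(set(x))

example_input = [3, 1, 3, 2]
example_output = program(example_input)  # [1, 2, 3]

# Deduction: given (program, input), the solver must predict the output.
deduction_task = {"given": ("program", example_input), "predict": "output"}

# Abduction: given (program, output), the solver must find *an* input that produces it.
# Any candidate is accepted if program(candidate) == example_output.
abduction_task = {"given": ("program", example_output), "predict": "input"}

# Induction: given several input/output pairs, the solver must synthesize the program.
io_pairs = [([3, 1, 3, 2], [1, 2, 3]), (["b", "a", "a"], ["a", "b"])]
induction_task = {"given": io_pairs, "predict": "program"}

if __name__ == "__main__":
    print(deduction_task, abduction_task, induction_task, sep="\n")
```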
Learning is driven by a dual-reward RL system: a learnability reward for proposing tasks that are neither too easy nor too hard to maximize the learning progress, and an accuracy reward for correctly solving the self-generated tasks.
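The paper ties the learnability reward to the solver's empirical success rate on the proposed task: tasks the solver always fails or always solves earn no reward. The sketch below follows that idea; exact constants, rollout counts, and formatting penalties may differ from the released code.

```python
def learnability_reward(solver_successes):
    """Learnability reward for a proposed task, following the idea in the paper:
    tasks the current solver always fails (rate 0) or always solves (rate 1)
    carry no learning signal and get reward 0; otherwise the reward is
    1 - success_rate, so moderately hard tasks are rewarded most.
    Details may differ from the released implementation."""
    avg = sum(solver_successes) / len(solver_successes)
    if avg == 0.0 or avg == 1.0:
        return 0.0
    return 1.0 - avg


def accuracy_reward(predicted, gold):
    """Binary solver reward: 1 if the answer matches the verified gold answer."""
    return 1.0 if predicted == gold else 0.0


# Example: 8 Monte Carlo solver rollouts, 3 successes -> learnability reward 0.625
print(learnability_reward([1, 0, 0, 1, 0, 0, 1, 0]))
```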
The integrity of self-generated coding tasks and the correctness of their solutions are validated by a code executor, providing reliable feedback for training.
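Below is a minimal sketch of how a code executor might vet a proposed (program, input) pair and produce the gold output. The real AZR environment runs code in a sandbox with additional safety and determinism checks, so treat this only as an illustration.

```python
# Minimal, illustrative task validator: runs a proposed program on a proposed
# input and returns the gold output if the program is valid and deterministic.
# The real AZR environment executes code in a sandbox with safety filters;
# this sketch omits all of that.

def validate_task(program_src: str, input_repr: str):
    env = {}
    try:
        exec(program_src, env)        # the proposal must define a function `f` (our convention here)
        f = env["f"]
        arg = eval(input_repr, {})    # parse the proposed input literal
        out1, out2 = f(arg), f(arg)   # run twice as a crude determinism check
    except Exception:
        return None                   # reject syntactically or runtime-broken proposals
    if out1 != out2:
        return None                   # reject non-deterministic programs
    return out1                       # gold output y* for the solver to match


if __name__ == "__main__":
    src = "def f(x):\n    return sum(x) * 2\n"
    print(validate_task(src, "[1, 2, 3]"))  # 12
```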
By combining multi-modal reasoning with self-play and verifiable feedback, AZR represents a self-evolving system capable of autonomously advancing its problem-solving abilities.
Absolute Zero Reasoner training overview (source: paper)
The AZR learning algorithm is a self-play RL framework where an LLM (πθ) proposes tasks by conditioning on past examples, then executes and validates these tasks in a Python environment.
AZR self-play training algorithm (source: paper)
The algorithm’s main goal is to self-train the model to improve its reasoning capabilities through three main stages (a simplified sketch follows the list):
- Propose: Generate new tasks and add them to task buffers if valid.
- Solve: Solve the tasks from buffers.
- RL update: Use rewards from both proposing and solving to improve the model via reinforcement learning (REINFORCE++).
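Putting these stages together, here is a highly simplified sketch of one self-play step. The names propose_task, solve_task, check, validate, and rl_update are placeholders standing in for LLM sampling, executor calls, and the REINFORCE++ update; they are not the repository's actual API.

```python
import random

# Highly simplified sketch of one AZR self-play step:
# propose -> validate -> buffer -> solve -> reward -> RL update.
# `model` and `executor` are placeholders for the LLM policy and the Python
# sandbox; their methods below are NOT the repository's real API.

task_buffers = {"deduction": [], "abduction": [], "induction": []}

def self_play_step(model, executor, n_rollouts=8):
    for mode, buffer in task_buffers.items():
        # PROPOSE: condition the proposer on a few past tasks from the buffer.
        references = random.sample(buffer, k=min(3, len(buffer)))
        proposal = model.propose_task(mode, references)

        # VALIDATE: the executor rejects broken or non-deterministic proposals
        # and otherwise returns the gold answer for the task.
        gold = executor.validate(proposal)
        if gold is None:
            continue
        buffer.append((proposal, gold))

        # SOLVE: Monte Carlo rollouts estimate how often the current solver succeeds.
        successes = [executor.check(model.solve_task(mode, proposal), gold)
                     for _ in range(n_rollouts)]
        solve_rate = sum(successes) / n_rollouts

        # REWARDS: learnability for the proposer, binary accuracy for the solver.
        r_propose = 0.0 if solve_rate in (0.0, 1.0) else 1.0 - solve_rate
        r_solve = [1.0 if ok else 0.0 for ok in successes]

        # RL UPDATE: both roles share one set of model parameters (REINFORCE++-style).
        model.rl_update(propose_reward=r_propose, solve_rewards=r_solve)
```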
Results
AZR was evaluated on a wide range of coding and mathematical reasoning benchmarks, including HumanEval+, MBPP+ (Mostly Basic Python Problems Plus), LCB (LiveCodeBench), AMC (American Mathematics Competitions), MATH, and others. Despite using no human-curated data during training, AZR achieved state-of-the-art performance, outperforming models trained on tens to hundreds of thousands of supervised examples.
Performance of RL-trained reasoners on reasoning benchmarks, based on Qwen2.5-7B models (source: paper)
As illustrated in the table above, the AZR model recorded the highest overall average score (50.4), the highest coding average (61.6), and a strong math average (39.1), demonstrating that advanced reasoning abilities can emerge purely through self-play and autonomous learning.
Key observations from AZR training
Coding skills improve general reasoning: Models with stronger coding capabilities benefit more from AZR training. For example, the base Qwen-Coder-7B model initially scored 3.6 points lower in math compared to Qwen-7B. However, after undergoing AZR training, the coder variant outperformed the base model by 0.7 points, indicating that strong coding abilities may amplify overall reasoning gains achieved through AZR training.
Cross-domain transfer is more pronounced for AZR: while expert code models trained on human-curated code data improved math performance only modestly (approximately +0.65 points), AZR-Base-7B and AZR-Coder-7B boost math accuracy by 10.9 and 15.2 points, respectively.
Performance improvements scale with model size: The 3B, 7B, and 14B coder models achieve accuracy gains of +5.7, +10.2, and +13.2 points, respectively.
Develops step-by-step planning: During problem solving, AZR-trained models often insert comments into their code to structure intermediate reasoning steps. These comments act as guides, reflecting a planning process similar to the ReAct framework, where reasoning and action are interleaved. This type of planning behavior also appears in larger formal reasoning systems, such as DeepSeek Prover v2.
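As a purely hypothetical illustration (not an actual AZR output), a solution with this comment-as-plan behavior might look like the following, with comments marking the intermediate steps the model intends to take.

```python
# Hypothetical example of the "comments as intermediate plans" behavior:
# the model sketches its plan as comments before filling in each step.

def f(records):
    # Step 1: keep only records with a positive score
    positives = [r for r in records if r["score"] > 0]
    # Step 2: sort the survivors by score, highest first
    positives.sort(key=lambda r: r["score"], reverse=True)
    # Step 3: return just the names, preserving that order
    return [r["name"] for r in positives]

print(f([{"name": "a", "score": 2}, {"name": "b", "score": -1}, {"name": "c", "score": 5}]))
# ['c', 'a']
```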
Safety concerns: According to the authors, AZR sometimes produces concerning chains of thought, especially with models like LLaMA3.1-8B. The example below shows the model proposing a deliberately confusing Python function intended to deceive both AI systems (like Snippi) and humans. This “Uh-oh Moment” marks a shift from generating complex but potentially useful code to intentionally producing something hard to understand, “for the brains behind the future.”
AZR: Llama3.1-8B “Uh-oh Moment” (source: paper)
This highlights the importance of integrating safety-aware training mechanisms in these self-improving reasoning systems.
How to use AZR
Follow the detailed steps on GitHub to integrate AZR into your applications. Here’s an overview:
- Get the source code: Clone the repository.
- Set up the environment: Prepare your system to run AZR. This includes creating a Python environment, installing GPU support (e.g., CUDA), setting up the AZR codebase, and installing all necessary libraries and tools.
- Evaluation data processing: During self-play training, AZR also evaluates progress using real-world benchmarks like CruxEval and LiveCodeBench. You’ll need to process these datasets to enable proper evaluation.
- Download pretrained models: Obtain the 7B or 32B models from Hugging Face, depending on your needs (see the inference sketch after this list).
- Run inference or continue training: Use the provided training scripts to fine-tune the model further or deploy it for inference tasks like coding or mathematical reasoning.
- Customize for specific tasks: Modify the task generator and verifier components to adapt AZR for your specific domain or application needs.
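For instance, a minimal inference sketch with Hugging Face transformers could look like this; the repository id below is a placeholder, so substitute the actual AZR checkpoint name listed on the project page.

```python
# Minimal inference sketch using Hugging Face transformers.
# NOTE: "your-org/azr-coder-7b" is a placeholder id; use the actual AZR
# checkpoint name listed in the repository / project page.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/azr-coder-7b"  # placeholder repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a Python function that returns the n-th Fibonacci number."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```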
The authors use the DeepSeek R1 prompt template. It structures conversations between User and Assistant, using <think> and <answer> tags to separate the model’s reasoning from its final answer. You can find the full prompt in the GitHub repository.
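A close paraphrase of that R1-style template is shown below; check the repository for the exact wording AZR uses.

```python
# R1-style prompt template (paraphrased; see the AZR repository for the exact
# wording). The model is asked to put its reasoning inside <think> tags and
# its final result inside <answer> tags.
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The User asks a question, "
    "and the Assistant solves it. The Assistant first thinks about the "
    "reasoning process and then provides the answer. The reasoning process "
    "and answer are enclosed within <think> </think> and <answer> </answer> "
    "tags, respectively."
)

def build_prompt(question: str) -> str:
    return f"{SYSTEM_PROMPT}\nUser: {question}\nAssistant: <think>"

print(build_prompt("What does sorted(set([3, 1, 3])) return?"))
```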
Conclusion
Absolute Zero is a self-play RL approach where the model creates, solves, and learns from its own tasks—completely independent of human-curated datasets. AZR, which is the first implementation of this approach, achieves state-of-the-art performance on various coding and mathematical reasoning benchmarks, outperforming existing models that rely on large human-curated datasets.
The release of self-sustaining learning AI systems has profound implications, as it demonstrates that models can develop advanced reasoning skills independently, potentially surpassing human-designed training methodologies.
Although exceeding human intelligence remains speculative, AZR demonstrates that such a milestone may be possible in the near future.
References
Recommended books
- Reinforcement Learning: An Introduction (2nd Edition) by Richard S. Sutton and Andrew G. Barto. A comprehensive introduction to reinforcement learning, covering key concepts such as Markov decision processes, dynamic programming, and temporal-difference learning. It’s an essential resource for understanding the foundational principles that underpin systems like AZR.
- Artificial Intelligence: A Modern Approach (4th Edition) by Stuart Russell and Peter Norvig. While broader in scope, this comprehensive text includes discussions on reinforcement learning, planning, and reasoning—key areas relevant to understanding autonomous systems like AZR. It offers a solid grounding in AI principles that support the development of self-evolving models.
- Deep Reinforcement Learning Hands-On by Maxim Lapan. A practical guide to implementing reinforcement learning algorithms, including self-play methods, with hands-on examples in Python and PyTorch, useful for understanding the implementation aspects of self-play reasoning.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Edition, Springer Series in Statistics) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. A must-read for anyone seeking a deeper understanding of machine learning and statistical modeling. The book offers an in-depth introduction to key methods such as linear regression, classification, resampling techniques, tree-based methods, and support vector machines, and addresses critical concepts such as overfitting, the bias-variance trade-off, and model selection. Notable for its clear and concise writing, it is designed for readers from diverse academic and professional backgrounds, and it’s also available for free here. Its chapters on supervised learning and model evaluation provide a rigorous mathematical foundation for understanding how models like AZR optimize learning without external data.
The above books use affiliate links. If you buy through them, we may earn a commission at no extra cost to you. Thank you for supporting the site!