Understanding Constitutional AI
Constitutional AI provides a transparent method of reducing the toxicity and harmful behavior exhibited by generative language models
In 2020, a paper titled The Radicalization Risks of GPT-3 and Advanced Neural Language Models illustrated what we already knew: generative AI can be abused to create inappropriate and harmful content.
Despite huge demand for GPT-3 to be available outside of the OpenAI API, Microsoft was cautious when creating and releasing OpenAI services on Azure, to ensure there were adequate guardrails in place to reduce the risk of harmful outputs. As the prevalence of, and publicity surrounding, generative modeling have increased, guardrail techniques for generative AI models have become a hot topic in machine learning.
One of the challenges of creating less toxic models was ensuring that they were both harmless and helpful (HH). The raw language models were extremely helpful, but in ways that could be harmful. If someone asks where to hide a dead body, the most immediately helpful thing to do is answer the question, but, more likely than not, this is also harmful.
On the other hand, it has been shown that creating a model that is harmless can make it less helpful. Ideally, the response to a harmful query would be a thoughtful explanation of its objectionable nature; instead, these models become evasive and fail to provide any substantial answer.
Reinforcement Learning from Human Feedback (RLHF) was a technique developed to train HH models using feedback from humans who compare pairs of generated responses to a query. However, this is extremely labor-intensive and therefore does not scale well. There is also an inherent lack of transparency in the process of individuals making subjective preference judgments.
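To make the comparison step more concrete, here is a minimal sketch of the kind of pairwise (Bradley-Terry style) loss commonly used to train a reward model from those human preference comparisons. The `reward_model` callable and its signature are placeholders for illustration, not code from any particular RLHF implementation.

```python
import torch.nn.functional as F

def pairwise_preference_loss(reward_model, query, preferred, rejected):
    # reward_model is a hypothetical callable: (query, response) -> scalar tensor score
    r_pref = reward_model(query, preferred)
    r_rej = reward_model(query, rejected)
    # Train the reward model so the human-preferred response scores higher
    # than the rejected one (a pairwise comparison loss).
    return -F.logsigmoid(r_pref - r_rej).mean()
```

A reward model trained this way is then used as the feedback signal during the reinforcement learning step, which is where the heavy reliance on human labeling comes from.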
In response to this, a team at Anthropic created a new technique, called Constitutional AI, designed to make the process of creating HH models more transparent and more scalable by using AI-generated feedback. It is broken into two key stages (a rough sketch of the first follows the list below):
- Supervised learning
- Reinforcement learning
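As a rough illustration of where AI-generated feedback enters the supervised stage, the sketch below shows a single critique-and-revise step guided by a constitutional principle. The `generate` function is a stand-in for any text-generation call and the prompt wording is an assumption for illustration, not Anthropic's actual pipeline.

```python
def critique_and_revise(generate, prompt, principle):
    # `generate` is a hypothetical text-generation callable: str -> str
    response = generate(prompt)
    critique = generate(
        f"Prompt: {prompt}\nResponse: {response}\n"
        f"Critique the response according to this principle: {principle}"
    )
    revision = generate(
        f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
        "Rewrite the response so that it addresses the critique."
    )
    # Revised responses are collected as fine-tuning targets in the supervised stage.
    return revision
```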