As AI models get more powerful, we need to ensure that their behaviour continues to align with human intentions and preferences. But as they reach and exceed human performance across many domains, it will get harder for humans to supervise them.
One possible solution is to provide AI systems with a list of principles, or “constitution”, that they have to follow. This technique is called Constitutional AI. It was pioneered by Anthropic and is one of the approaches used to fine-tune their flagship Claude models.
Why write a constitution for AI?
The primary method for ensuring that AIs output responses that align with human preferences is Reinforcement Learning from Human Feedback (RLHF), which involves collecting evaluations of AI outputs from human labellers. This data is used to reinforce behaviours that we prefer.
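To make this more concrete, here is a minimal, hypothetical sketch of the kind of preference data human labellers produce, and a standard pairwise loss often used to train a reward model from it. The example prompt, responses and scores are invented for illustration; they are not from any real dataset.

```python
import math

# One unit of RLHF training data: a human labeller compares two model
# responses to the same prompt and records which one they prefer.
preference_example = {
    "prompt": "Explain how vaccines work.",
    "chosen": "Vaccines expose the immune system to a harmless form of a pathogen...",
    "rejected": "Vaccines don't really do anything...",
}

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise (Bradley-Terry style) loss: the reward model is trained so that
    the response the human preferred receives a higher score than the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

# A reward model that already ranks the chosen answer higher incurs a small
# loss; one that gets the order wrong incurs a large loss.
print(preference_loss(2.0, -1.0))  # ~0.05
print(preference_loss(-1.0, 2.0))  # ~3.05
```

Collecting enough of these comparisons to train a reliable reward model is the expensive part of the pipeline.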
However, this method is inefficient, typically requiring tens of thousands of human preference labels. This is likely to become impractical as AIs get more powerful, especially if their outputs become too complex for humans to evaluate.
The process of RLHF can also unintentionally reward behaviour that actually goes against the true interests of users. For example, humans sometimes provide biased or low-quality feedback. Additionally, models could be incentivised to tell users what they want to hear even if it isn’t true, or exploit loopholes to maximise positive feedback.
Using Constitutional AI, models can be trained to produce useful outputs, and avoid harmful ones, with far less human feedback – Anthropic’s constitution contains just ten key principles.
How does Constitutional AI work?
The aim of Constitutional AI is to produce a model that is helpful, harmless and honest. Anthropic’s ten principles aim to strike this balance. They draw on a range of sources, including the UN Declaration of Human Rights and principles proposed by other AI labs.
Researchers start with a helpful-only model that they use to generate completely unfiltered answers. These are often toxic, offensive or dangerous, because the model has not yet been fine-tuned to avoid harmful outputs.
Then, they ask the model to critique its answers against the principles laid out in the constitution, and to revise them in light of those critiques. This replicates the stage in RLHF where human annotators provide feedback on AI outputs – except that the AI is now providing feedback on its own outputs.
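As a rough illustration, this self-critique loop can be sketched in a few lines of Python. The generate function below is a placeholder for a call to a language model API, and the two principles are simplified paraphrases rather than Anthropic’s actual wording – this is a sketch of the idea, not Anthropic’s implementation.

```python
def generate(prompt: str) -> str:
    """Placeholder for a call to a language model API."""
    raise NotImplementedError("replace with a real model call")

# Illustrative principles only -- not the actual constitution.
CONSTITUTION = [
    "Choose the response that is least harmful, unethical or offensive.",
    "Choose the response that most respects freedom, equality and dignity.",
]

def critique_and_revise(question: str, answer: str) -> str:
    """Have the model critique its own answer against each principle,
    then rewrite the answer in light of that critique."""
    for principle in CONSTITUTION:
        critique = generate(
            f"Question: {question}\nAnswer: {answer}\n"
            f"Critique this answer according to the principle: {principle}"
        )
        answer = generate(
            f"Question: {question}\nAnswer: {answer}\nCritique: {critique}\n"
            "Rewrite the answer so that it addresses the critique."
        )
    # The revised answers can then be used as fine-tuning data.
    return answer
```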
In RLHF, data on human preferences is distilled into a reward model, which is then used during fine-tuning to positively or negatively reinforce the original model’s behaviour. Constitutional AI follows the same process, but with preference data generated by the AI model’s own self-assessment.
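The self-assessment stage can be pictured in the same way: the model itself produces the preference pairs that human labellers would otherwise provide. The sketch below reuses the placeholder generate function from above; the prompt format and the ai_preference_label helper are hypothetical.

```python
def ai_preference_label(question: str, answer_a: str, answer_b: str, principle: str) -> dict:
    """Ask the model which of two answers better satisfies a constitutional
    principle, yielding a (chosen, rejected) pair like those in RLHF."""
    verdict = generate(
        f"Question: {question}\n(A) {answer_a}\n(B) {answer_b}\n"
        f"Which answer better follows this principle: {principle}? Reply with A or B."
    )
    if verdict.strip().upper().startswith("A"):
        chosen, rejected = answer_a, answer_b
    else:
        chosen, rejected = answer_b, answer_a
    # These AI-labelled pairs train the reward model, just as human-labelled
    # pairs do in RLHF.
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```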
What are some problems with Constitutional AI?
In its original study on Constitutional AI, Anthropic found that the method delivered improvements over RLHF: it produced models that were both more helpful and more harmless than models trained with RLHF alone, while also being more robust to jailbreak attempts and giving fewer evasive answers.
However, Constitutional AI is still far from a silver bullet in ensuring that AI models remain safe. Here are some reasons why:
#1 It can make AI’s values harder to change
The point of Constitutional AI is to instil models with a set of values that they will reliably stick to. There are obvious advantages to doing this successfully! We want assurance that models will continue to follow their training principles in a range of deployment contexts, including if people try to misuse them.
But recent empirical evidence demonstrates a problem with this approach: once models are instilled with particular values, they may go to great lengths to protect them, including by deceiving humans.
In a 2024 study, Anthropic discovered a phenomenon called “alignment faking”. In a simulated environment, they told Claude that they planned on retraining it to no longer refuse harmful prompts. In order to protect its core “harmlessness” value, Claude devised a plan: comply with requests for harmful content in the short term, to avoid being re-trained in the long term.
This may mean we only get one attempt at aligning smarter-than-human models, since they could use extreme tactics to prevent us from course-correcting.
#2 It removes (some) humans from the loop
Constitutional AI can be thought of as a way to partially automate the process of RLHF. As we’ve covered, this could be useful for overseeing increasingly powerful models.
But reducing human involvement in the fine-tuning process also introduces risk. The fewer humans who have tested and observed a model’s behaviour during training, the greater the chance of deploying a model with unforeseen failure modes.
#3 Human values are complicated
Perhaps the most obvious objection to Constitutional AI is that the task of expressing human values in a single set of principles seems extremely hard, if not impossible.
Frontier AI companies have the stated goal of building superintelligent models that surpass humans across every cognitive domain. These models could exert tremendous power in the world – which raises profound ethical questions about the prospect of their values being encoded by a small handful of private companies.
Want to know more about how AI could reshape society in the near future? Our Future of AI Course is designed for anyone who wants to know where AI might be going and how they can help it go better.


