The main approach used today to ensure AI systems behave in accordance with human preferences is Reinforcement Learning from Human Feedback (RLHF). Humans evaluate AI outputs, and this feedback is used to train a reward model that positively reinforces behaviour we like and negatively reinforces behaviour we don’t.
But when AI models get more capable than us, this approach won’t work anymore – because they will produce outputs that are too complicated for humans to evaluate.
One proposed way of solving this problem is using AI models to assist humans with the evaluation process itself. This is called recursive reward modelling (RRM).
The problem
In order to effectively steer AI models using RLHF, we need to understand whether their behaviour is good or bad. But this will become harder as AI surpasses human ability, and as it carries out more and more complex tasks in the real world.
We can imagine many circumstances where humans may not be able to evaluate AI behaviour. An AI with superhuman coding abilities could produce some extremely complicated code that is unintelligible even to a top human expert. Or, a model could propose an action with complex knock-on effects that will take a long time to manifest, such as the introduction of a new gene into an ecosystem.
How could recursive reward modelling help?
The idea behind recursive reward modelling is simple: when AI outputs get too complex or time-consuming for humans to evaluate, we can use AI to help us provide better feedback.
Here’s how it works:
We train Agent A to perform a task, such as writing some code, using standard RLHF.
We train Agent B, a more powerful model than Agent A. But this time, Agent A helps the human user to evaluate the code written by Agent B. It could do this by, for example, summarising different sections of the code and identifying errors or bugs.
Now, we can use Agent B to assist the user in training an even more powerful model, Agent C.
This process is repeated again and again, creating a recursive chain where each agent is used to help evaluate one slightly more powerful than itself.
As models get more powerful, the user provides less fine-grained feedback – because more of the evaluation process is delegated to AI models. At the same time, what feedback the user does provide is amplified at each step, as it cascades through a larger chain of increasingly capable models.
The user is a bit like an employee progressing through the ranks of a company. The manager of a single team has more direct oversight over the day-to-day activities of employees, but the CEO has amplified influence over the big-picture strategy of the organisation.
RRM (and RLHF!) rest on the core assumption that evaluation is easier than generation – it’s easier to tell if a task has been completed than it is to actually do the task. Everybody can tell who has won the 100-metre sprint at the Olympics, but very few people are Olympic-level sprinters. If this assumption holds true, then it should be possible for AI-assisted humans to provide useful evaluation of models that are more capable than them (at least up to a certain level of capability).
A 2022 study by OpenAI showed that AI assistance helped humans spot more errors in text summaries, providing some empirical evidence that AIs could improve human feedback for reward modelling.
Limitations of recursive reward modelling
Recursive reward modelling is a useful research direction, but not a full solution to the problem of controlling powerful AI.
Here are some reasons why:
#1 RRM won’t scale indefinitely
RRM helps humans provide better feedback on AI outputs. This means that it augments RLHF (and potentially extends the period during which it will be sufficient to supervise powerful AI), but shares the same fundamental limitations. Eventually, AIs may generate outputs that are too complex for humans to evaluate, even with AI assistance.
#2 RRM may not fix foundational problems with reward modelling
Whether recursive reward modelling will produce good outcomes depends on the foundation it’s built upon – the quality of the reward model and how well it reflects genuine human preferences.
Standard RLHF already produces problematic behaviours like sycophancy (telling users what they want to hear), and reward hacking (exploiting loopholes to maximise reward). It isn’t clear that applying recursion to reward modelling solves these core issues.
#3 RRM could lead to compounding errors
RRM could amplify the effect of errors in the reward modelling process. A poorly trained AI evaluator that is, for example, prone to sycophancy, could help train a worse agent in the next layer of the recursion, and so on.
There is a growing effort to ensure the transition to powerful AI goes well, but we need more voices in the conversation. We designed our 2-hour Future of AI Course for anyone who wants to add theirs – no prior knowledge required.


