Weak-to-strong generalisation refers to the phenomenon that weaker AI models can sometimes be used to train and improve more powerful ones.
Researchers hope that if weak-to-strong generalisation holds for models far more powerful than today’s, this approach could help us supervise smarter-than-human AI systems.
The problem
Today’s AI models are mainly aligned with human preferences using Reinforcement Learning from Human Feedback (RLHF). RLHF works by having humans evaluate AI outputs and then training a reward model that reinforces the behaviour they rate highly. This is feasible for current systems because they have yet to surpass human capability across most domains, so evaluators can still tell whether model behaviour is good or bad.
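To make the reward-modelling step concrete, here is a minimal sketch of how a reward model might be trained from pairwise human preferences, assuming a simple Bradley-Terry style loss. The tiny scoring network, dimensions and training loop below are illustrative stand-ins, not OpenAI’s actual setup.

```python
# Minimal sketch of reward-model training from pairwise human preferences.
# Illustrative only: a tiny scoring network stands in for a full language model.
import torch
import torch.nn as nn

EMBED_DIM = 16  # stand-in for a response representation (e.g. a pooled LLM embedding)

class RewardModel(nn.Module):
    """Maps a fixed-size representation of a response to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, response_repr: torch.Tensor) -> torch.Tensor:
        return self.score(response_repr).squeeze(-1)

reward_model = RewardModel(EMBED_DIM)
optimiser = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Each comparison is a pair of responses where a human preferred the first one.
# Random vectors stand in for real response representations.
chosen = torch.randn(64, EMBED_DIM)
rejected = torch.randn(64, EMBED_DIM)

for _ in range(100):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Bradley-Terry style loss: push the preferred response's reward above the other's.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

# The trained reward model then provides the reward signal that reinforces
# highly rated behaviour, e.g. in a PPO fine-tuning loop.
```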
But once we develop AI models far more capable than ourselves, we may not be able to reliably assess their outputs. For example, a superhuman model could produce extremely complicated code that even an expert computer scientist could not audit. Several research agendas under the umbrella of scalable oversight aim to solve this problem; studying weak-to-strong generalisation is one of them.
Can stronger models learn from weaker ones?
Weak-to-strong generalisation rests on the idea that humans may be able to guide superhuman AI systems, much as teachers and professors mentor students who ultimately become more capable than they are.
Could this really work? We can collect some evidence on this question by studying how well stronger AI models can learn from weaker ones – which is an analogy for superintelligent AI learning from humans.
Researchers at OpenAI empirically studied weak-to-strong generalisation, using the following process:
Train a smaller model that will act as the weak supervisor.
Get the weak supervisor to answer a set of questions on data that it has not been trained on. Its answers become weak labels.
Fine-tune the strong model on the weak labels that were generated by the supervisor.
Fine-tune a second strong model on ground truth labels (the correct answers to the questions that the weak supervisor was given). This elicits the full capability of the model, as a point of comparison.
Test the two strong models on a series of tasks to see how differently they perform. The aim of the experiment is to measure how much training on weak labels degrades the strong model’s capabilities. Researchers quantified this as the fraction of the performance gap between the weak supervisor and the ground-truth-trained strong model that the weakly supervised strong model manages to close; the larger this fraction, the more weak-to-strong generalisation has occurred (see the code sketch after this list).
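Here is a runnable toy sketch of this pipeline, using scikit-learn models as stand-ins: a small decision tree plays the weak supervisor and a gradient-boosted ensemble plays the strong model. It is meant to show the mechanics of the setup and the performance-gap-recovered metric, not to reproduce the paper’s GPT-scale results; in particular, the toy “strong” model has no pretraining to draw on, so it should not be expected to exhibit much weak-to-strong generalisation itself.

```python
# Toy illustration of the weak-to-strong experimental setup.
# scikit-learn models stand in for the weak supervisor and the strong model.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=6000, n_features=20, n_informative=8, random_state=0)
X_sup, y_sup = X[:2000], y[:2000]          # data for training the weak supervisor
X_mid, y_mid = X[2000:4000], y[2000:4000]  # "questions" the supervisor will label
X_test, y_test = X[4000:], y[4000:]        # held-out evaluation set

# 1. Train the small model that will act as the weak supervisor.
weak = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_sup, y_sup)

# 2. The weak supervisor answers questions it was not trained on; its answers become weak labels.
weak_labels = weak.predict(X_mid)

# 3. Train one strong model on the weak labels.
strong_on_weak = GradientBoostingClassifier(random_state=0).fit(X_mid, weak_labels)

# 4. Train a second strong model on the ground-truth labels, as the performance ceiling.
strong_ceiling = GradientBoostingClassifier(random_state=0).fit(X_mid, y_mid)

# 5. Compare the models and compute the fraction of the performance gap recovered
#    (1.0 = no degradation from weak supervision, 0.0 = no better than the weak supervisor).
weak_acc = accuracy_score(y_test, weak.predict(X_test))
w2s_acc = accuracy_score(y_test, strong_on_weak.predict(X_test))
ceiling_acc = accuracy_score(y_test, strong_ceiling.predict(X_test))

pgr = (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)
print(f"weak={weak_acc:.3f}  weak-to-strong={w2s_acc:.3f}  ceiling={ceiling_acc:.3f}  PGR={pgr:.2f}")
```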
Models were tested on a variety of tasks, including natural language processing (NLP) benchmarks, chess puzzles and ChatGPT reward modelling. Weak-to-strong generalisation occurred most of the time, though to varying extents.
Can weak-to-strong generalisation help make AI safe?
So, how promising is this approach? The results of the OpenAI study are mixed.
On the one hand, it is encouraging that weak-to-strong generalisation occurs at all. It isn’t obvious that this would be the case! We might have expected that strong models fine-tuned on the outputs of weaker ones would learn to imitate their mistakes. But instead, they appear to leverage the capabilities they acquired in pre-training to solve tasks better than their weak supervisor could.
OpenAI also identified some additional methods that can further increase weak-to-strong generalisation, suggesting that progress on this problem is possible. These include bootstrapping, where weak-to-strong learning is used to train progressively larger models, and auxiliary confidence loss, where an extra term is added to the strong model’s loss function to help it retain confidence in its own judgements, even when fine-tuned on weak labels.
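To illustrate the auxiliary confidence loss idea, here is a hedged sketch of what such a loss term could look like for binary classification. The exact formulation in OpenAI’s paper differs in detail; this only shows the general shape, in which the usual loss against the weak labels is blended with a term that rewards the strong model for sticking with its own hardened predictions. The weighting alpha and the 0.5 threshold are illustrative choices.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(strong_logits: torch.Tensor,
                             weak_labels: torch.Tensor,
                             alpha: float = 0.5) -> torch.Tensor:
    """Illustrative auxiliary-confidence-style loss (binary classification).

    strong_logits: raw scores from the strong model, shape (batch,).
    weak_labels:   0/1 labels produced by the weak supervisor, shape (batch,).
    alpha:         weight on the self-confidence term (illustrative value).
    """
    # Standard term: fit the weak supervisor's labels.
    weak_term = F.binary_cross_entropy_with_logits(strong_logits, weak_labels.float())

    # Auxiliary term: treat the strong model's own hardened predictions as targets,
    # encouraging it to stay confident in its own judgements rather than
    # copying the weak labels wholesale.
    hardened = (torch.sigmoid(strong_logits) > 0.5).float().detach()
    confidence_term = F.binary_cross_entropy_with_logits(strong_logits, hardened)

    return (1 - alpha) * weak_term + alpha * confidence_term
```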
On the other hand, the extent of weak-to-strong generalisation is inconsistent across tasks. At best, weak-to-strong learning recovered just over 50% of the performance gap on some NLP tasks. At worst, it recovered almost none of it when very weak supervisors were used on chess puzzles. In ChatGPT reward modelling it never recovered more than 20%, even when the gap in training compute between the weak and strong models was small.
Like many proposed agendas for making smarter-than-human AI safe, weak-to-strong generalisation is a promising idea, but not a full solution. Weak models supervising strong ones is an imperfect analogy for humans supervising superintelligent AIs: for example, superhuman models may imitate human errors more readily than today’s strong models imitate weak-model errors.
Additionally, OpenAI’s research shows that less weak-to-strong generalisation tends to occur when capability gaps are larger (and the capability gap between humans and superintelligence will be very large indeed!). More research is needed to determine how useful weak-to-strong generalisation will be for aligning very powerful models.
AI is getting more powerful – and fast. We still don’t know how we’ll ensure that very advanced models are safe.
We designed a free, 2-hour course to help you understand how AI could transform the world in just a few years. Start learning today to join the conversation.


