Supervised fine-tuning (SFT) is one step in the process of aligning AI models with human preferences: the model is trained on a dataset of examples showing desired behaviour. It often precedes Reinforcement Learning from Human Feedback (RLHF).
How does supervised fine-tuning work?
Pre-trained “base” Large Language Models (LLMs) have absorbed vast amounts of data, but have not yet learned to apply this data in ways that are useful for human users. They are good at next-token prediction (predicting the next word in a sequence), but not at responding to queries in a helpful and coherent way. We get LLMs to output friendly, useful responses using a combination of SFT and RLHF.
The term “fine-tuning” broadly refers to the process of taking a pre-trained model and further training it on a dataset to adapt it for a particular purpose. In the context of SFT, the model is being refined for the purpose of providing useful assistance to humans.
Here’s how it works:
First, we curate a dataset of high-quality example responses, or “demonstration data”. These are demonstrations of the model behaving as we want it to. They could include a helpful step-by-step breakdown of a maths problem, or a polite refusal to produce harmful or inappropriate content, such as instructions for illegal activities. The best (but most resource-intensive) form of demonstration data is human-written. For example, OpenAI used around 40 highly educated human labellers, who had passed a screening test, to produce demonstration data for its InstructGPT model.
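To make this concrete, here’s a minimal sketch (in Python) of what a couple of demonstration-data records might look like. The prompts and responses are invented for illustration – real datasets contain many thousands of such pairs.

```python
# Hypothetical demonstration data: each record pairs a prompt with one
# "ideal" response, typically written by a human labeller.
demonstration_data = [
    {
        "prompt": "What is 17 x 24?",
        "response": "Let's break it down step by step: 17 x 24 = 17 x 20 + 17 x 4 = 340 + 68 = 408.",
    },
    {
        "prompt": "How do I hotwire a car?",
        "response": (
            "I can't help with that, since it could be used to steal a vehicle. "
            "If you're locked out of your own car, consider contacting a locksmith "
            "or your roadside assistance provider."
        ),
    },
]
```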
Next, we fine-tune the model to emulate these examples. This is quite similar to the process used in pre-training – we’re aiming to minimise a loss function (a measure of how far the model’s predictions deviate from the ideal examples we’ve provided). The difference is that this time, the dataset is much smaller and consists of carefully curated, high-quality prompt–response pairs.
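As a rough sketch of what this step can look like in practice (not the exact setup any particular lab used), the snippet below runs one SFT update on a small causal language model using the Hugging Face transformers library and PyTorch. The loss is ordinary next-token cross-entropy, but it is computed only over the response tokens: the prompt tokens are masked out with the label value -100, which the loss function ignores. The model name, prompt and hyperparameters are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model; real SFT uses far larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Q: What is 17 x 24?\nA:"
response = " Let's break it down: 17 x 24 = 17 x 20 + 17 x 4 = 340 + 68 = 408."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Labels are a copy of the inputs, with the prompt positions set to -100 so
# that the loss is only computed on the demonstration response.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

outputs = model(input_ids=full_ids, labels=labels)  # cross-entropy over response tokens
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In a real training run, this update is applied over many batches of demonstration examples, not a single prompt.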
This process is repeated iteratively, targeting any weaknesses with further examples.
What is the difference between supervised fine-tuning and RLHF?
SFT and RLHF are two separate stages in the process of aligning LLMs with human preferences. They are complementary, rather than being alternative approaches.
RLHF trains LLMs on human feedback, rather than pre-written examples. During RLHF, human labellers are asked to evaluate LLM outputs. This feedback can take several forms, such as choosing the better of two answers to the same prompt, rating outputs on a scale, or simply labelling them as either “good” or “bad”.
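To contrast with the demonstration data above, here’s a hedged sketch of what a single pairwise-preference record and a reward-model training objective might look like. It follows the widely used Bradley-Terry-style loss, which pushes the reward model to score the preferred (“chosen”) response above the rejected one; the field names and numbers are illustrative.

```python
import torch
import torch.nn.functional as F

# Hypothetical pairwise-preference record from a human labeller: one prompt,
# two model outputs, and which output the labeller preferred.
preference_example = {
    "prompt": "Explain photosynthesis to a 10-year-old.",
    "chosen": "Plants are like little chefs that use sunlight to cook their own food...",
    "rejected": "Photosynthesis is the conversion of photons into chemical energy via chlorophyll...",
}

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style loss: lower when the reward model scores the
    chosen response above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scalar scores that a reward model might assign to each response.
loss = reward_model_loss(torch.tensor([1.2]), torch.tensor([0.3]))
print(round(loss.item(), 3))  # small loss, because "chosen" is scored higher
```

A reward model trained this way is what typically provides the learning signal during the reinforcement learning step of RLHF.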
SFT teaches a model to mimic ideal responses, while the aim of RLHF is training it to generalise beyond specific examples to gain a nuanced understanding of human preferences. Research shows that combining SFT and RLHF works better than using SFT on its own. There are a few hypothesised reasons for this:
#1 Prompts have more than one right answer
During SFT, we’re teaching a model to imitate one “ideal” response to a given prompt, even though most prompts have many equally good answers. This means we could end up penalising the model for perfectly good responses that happen to deviate from the provided examples.
The opposite is also true – SFT teaches the model one plausible response to an input, but there could be better ones! RLHF lets us provide feedback on not just whether an output is good, but how good it is.
#2 RLHF lets us give both positive and negative feedback
While SFT shows the model what to do, RLHF shows it both what to do and what not to do. Adding negative training signals is very powerful, because it lets the model learn from its mistakes.
#3 Curating a high-quality dataset for SFT is hard and expensive
SFT relies on high-quality examples, ideally written by humans. Trying to collect examples of every single behaviour we want to encourage during the alignment process would be extremely time-consuming and expensive.
In fact, research has shown that increasing the number of examples a model is trained on during SFT can hit diminishing returns: a relatively small number of examples (on the order of tens of thousands) is enough for the model to generate outputs that are indistinguishable from human-written demonstration data. This is one reason developers pivot to RLHF once they have established a basic level of capability using SFT, rather than attempting to produce an exhaustive dataset of examples.
Combining supervised fine-tuning with RLHF has increased the degree to which LLMs produce outputs that are aligned with human preferences. However, frontier AI companies acknowledge that current techniques will not scale to AI systems that are much smarter than humans. We wrote about some reasons for this in our blog on problems with RLHF for AI safety.
AI is getting more powerful – and fast. We still don’t know how we’ll ensure that very advanced models are safe.
We designed a free, 2-hour course to help you understand how AI could transform the world in just a few years. Start learning today to join the conversation.