Machine unlearning is the process of removing specific data or knowledge from AI models without retraining them from scratch or degrading their overall capabilities.
There are many different things we might want AI models to “unlearn”, including private data, copyrighted materials, misinformation, or dangerous capabilities.
What can we use machine unlearning for?
Machine unlearning has several applications, including:
Allowing users to delete personal data
Some privacy regulations, such as the EU’s General Data Protection Regulation (GDPR), mandate that users have the right to request deletion of their personal data. Machine unlearning offers a way to remove that data’s influence from a model that has already been trained.
Avoiding copyright infringements
Today’s Large Language Models are trained on a vast corpus of online data, which inevitably includes copyrighted materials. They can also produce outputs that are extremely similar to others’ intellectual property (this was at the heart of a legal dispute between the New York Times and OpenAI in 2023). If courts find an AI developer to be in violation of copyright law, they could require the removal of the copyrighted material from its models.
Removing dangerous capabilities
The most safety-relevant application of machine unlearning is removing dangerous capabilities. This could help us prevent malicious actors from extracting hazardous information from AI models that could, for example, help them create a bioweapon.
How does machine unlearning work?
The most obvious way of getting a model to “forget” a certain piece of information is to retrain it from scratch with that information removed from the training set – but this would be prohibitively expensive and time-consuming! The field of machine unlearning seeks other methods that either remove the influence of targeted information entirely (exact unlearning) or merely reduce this influence (approximate unlearning).
Here are a few examples:
Slice up your dataset
One approach to machine unlearning is to carve up your training dataset into multiple “slices”. Then, you train a series of smaller models, one on each slice. This means you can erase a chosen data point just by re-training the model associated with its slice – which cuts the retraining cost roughly by a factor of the number of slices.
For example, if you’ve trained an image classifier on 100,000 photos with a dataset split into ten slices, each model only sees 10,000 images. So the cost of retraining is 1/10 what it would have been otherwise.
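To make this concrete, here’s a minimal sketch in Python, loosely inspired by the “SISA” sharding approach from the unlearning literature. Everything here (the slice count, the model choice, the function names) is illustrative rather than a prescribed implementation: each slice gets its own small model, predictions are aggregated by majority vote, and a deletion request only triggers retraining of the slice that contained the deleted point.

```python
# Toy sketch of slice-based unlearning: one small model per data slice,
# deletion handled by retraining only the affected slice.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # toy dataset: 1,000 points, 20 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels

NUM_SLICES = 10
slice_indices = np.array_split(np.arange(len(X)), NUM_SLICES)

def train_slice(idx):
    """Train one small model on a single slice of the data."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X[idx], y[idx])
    return model

models = [train_slice(idx) for idx in slice_indices]

def predict(x):
    """Aggregate the slice models by majority vote."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return int(round(np.mean(votes)))

def unlearn(point_id):
    """Erase one data point by retraining only the slice that contains it."""
    for s, idx in enumerate(slice_indices):
        if point_id in idx:
            slice_indices[s] = idx[idx != point_id]
            models[s] = train_slice(slice_indices[s])  # retrain ~1/10 of the data
            return

unlearn(point_id=42)  # erase point 42 at roughly a tenth of the full retraining cost
```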
Calculate the influence of a chosen data point
A different form of machine unlearning uses influence functions to mathematically estimate the contribution of the unwanted data point to the model’s parameters. Once you’ve estimated this contribution, you can apply a small corrective update to the model’s weights, approximately “undoing” the effect of that data point.
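As a rough illustration, here is what that correction looks like for a simple logistic-regression model, assuming its parameters theta have already been fit to training arrays X and y. The update theta + (1/n) * H⁻¹ * g, where H is the average Hessian of the loss and g is the gradient at the point being removed, is the standard influence-function approximation; the code and names below are a sketch, not a production recipe (applying this to large models requires further approximations to avoid forming the full Hessian).

```python
# Sketch of influence-function unlearning for a fitted logistic regression.
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def grad_point(theta, x, y):
    """Gradient of the log-loss for a single example (x, y)."""
    return (sigmoid(x @ theta) - y) * x

def hessian(theta, X, y):
    """Average Hessian of the log-loss over the full training set."""
    p = sigmoid(X @ theta)
    w = p * (1 - p)                          # per-example curvature weights
    return (X * w[:, None]).T @ X / len(X)

def unlearn_point(theta, X, y, i, damping=1e-3):
    """Approximately remove training point i from the fitted parameters.

    theta_without_i  ~=  theta + (1/n) * H^{-1} * grad_i
    """
    H = hessian(theta, X, y) + damping * np.eye(len(theta))  # damping keeps H invertible
    g = grad_point(theta, X[i], y[i])
    return theta + np.linalg.solve(H, g) / len(X)
```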
Fine-tune your model to unlearn certain information
Another approach is fine-tuning your model to unlearn certain examples. To achieve this, your data is divided into a “retain” set and a “forget” set. Then, you apply fine-tuning methods designed to degrade performance on the “forget” data while preserving it on the “retain” data.
For example, you could use gradient ascent (the opposite of gradient descent) on the forget set, to deliberately increase the model’s errors on data you want it to unlearn. Or, you could perform gradient descent on the forget set, but randomise the data labels in order to confuse the model, effectively corrupting the set.
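Here’s a minimal PyTorch sketch of the gradient-ascent variant, assuming you already have a trained model and a DataLoader forget_loader over the forget set (both hypothetical names). Negating the loss turns gradient descent into ascent, deliberately pushing the model away from the examples you want it to forget.

```python
# Sketch of gradient-ascent unlearning on a "forget" set.
import torch
import torch.nn.functional as F

def unlearn_by_gradient_ascent(model, forget_loader, steps=50, lr=1e-5):
    """Fine-tune `model` to *increase* its loss on the forget set."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    step = 0
    for inputs, labels in forget_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(inputs), labels)
        (-loss).backward()   # negate the loss: descent becomes ascent
        optimizer.step()
        step += 1
        if step >= steps:    # stop early so general abilities aren't destroyed
            break
    return model
```

The random-label variant is a one-line change to the same loop: keep the ordinary (positive) loss, but replace labels with something like torch.randint_like(labels, num_classes) so the model is trained towards meaningless answers on the forget set. In practice, both variants are usually interleaved with ordinary training on the “retain” set so that general performance isn’t destroyed.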
Why is machine unlearning hard?
Although researchers have proposed many methods for machine unlearning, it can be difficult to do effectively, for several reasons:
It can be hard to specify what you want the model to forget
Sometimes, the examples you need a model to unlearn are well-defined. For example, you might need to erase a set of personal data, such as phone numbers, from your model’s memory.
But the scope of unlearning you need to achieve could be much fuzzier. For example, trying to prevent a model from generating images in a particular artistic style is made tricky by the fact that we may not have a clearly delineated set of examples associated with that style.
Capabilities are hard to localise
We might want a model to “unlearn” a capability, rather than a set of facts, which is particularly relevant for preventing dangerous misuse of models. But this produces new challenges, because capabilities are more difficult to isolate than facts. They arise from a complex interaction of knowledge across multiple domains.
This creates two problems. First, it can be hard to eliminate a dangerous capability without also removing information that is essential for benign use cases. For example, an AI system that understands chemical reactions well enough to synthesise novel pharmaceuticals could apply that same knowledge to help a user create illegal drugs or chemical weapons. A 2024 paper introduced a machine unlearning technique that succeeded in reducing hazardous knowledge without compromising performance on general-knowledge benchmarks such as MMLU – but it did hurt performance in closely related domains like virology and computer security.
Second, we may think we’ve successfully trained a model to unlearn a hazardous capability, only for it to reconstruct that capability from its remaining knowledge, in ways that are hard to anticipate.
Models can re-learn information through fine-tuning
Research shows that after unlearning, dangerous capabilities can re-emerge in unexpected contexts. For example, models that have been fine-tuned after deployment – even on seemingly unrelated data – can spontaneously re-learn hazardous information.
Machine unlearning is one approach to solving the privacy, copyright and safety issues posed by AI. But it has serious limitations – and like all measures proposed to mitigate the risks of AI, it remains unclear how well it will scale to models more powerful than today’s.
If you’re interested in how AI could transform the world in the coming years, our free 2-hour Future of AI Course is the perfect place to start.


