In a new paper published on the open access and non-peer reviewed site arXiv.org, co-authors Ronen Eldan of Microsoft Research and Mark Russinovich of Microsoft Azure propose a new way of editing or removing knowledge of copyrighted works from an LLM by erasing specific information from it. As a demonstration, they erased all knowledge of the existence of the Harry Potter books (including characters and plots) from Meta’s open source Llama 2-7B.
As the Microsoft researchers write: “While the model took over 184K GPU-hours to pretrain, we show that in about 1 GPU hour of finetuning, we effectively erase the model’s ability to generate or recall Harry Potter-related content.”
The magic formula
First, they further trained the baseline model on the target data (the Harry Potter books) and compared its predictions with those of the baseline to identify the tokens most related to that data.
Second, they replaced expressions unique to Harry Potter with generic counterparts and generated alternative predictions that approximate those of a model never trained on the books.
Third, they fine-tuned the baseline model on these alternative predictions, effectively erasing the original text from its memory whenever it is prompted with that context.
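To make the first step more concrete, here is a minimal Python sketch of how predictions from a baseline model and a model further trained on the target text might be combined into "generic" predictions. The article does not give the exact formula, so the rule used here (subtracting only the positive part of the difference, scaled by a hypothetical `alpha`) is an assumption for illustration, and the three-word vocabulary and logit values are invented.

```python
def generic_logits(baseline, reinforced, alpha=1.0):
    """Push down tokens whose score rose after extra training on the target text.

    Assumed rule (not taken from the article):
        generic = baseline - alpha * max(reinforced - baseline, 0)
    Tokens whose score did not rise are left unchanged.
    """
    return [b - alpha * max(r - b, 0.0) for b, r in zip(baseline, reinforced)]


# Toy vocabulary and per-token logits, invented purely for illustration.
vocab = ["Hogwarts", "school", "the"]
baseline = [2.0, 1.0, 0.5]      # scores from the original model
reinforced = [5.0, 1.0, 0.25]   # scores after extra training on the books

generic = generic_logits(baseline, reinforced, alpha=1.0)

# The most likely token under the generic predictions.
best_token = vocab[generic.index(max(generic))]
print(generic, best_token)
```

In this toy case the book-specific token is the only one whose score rose after the extra training, so it is the only one penalized, and a generic word becomes the preferred continuation. Fine-tuning the baseline model toward such targets is the idea behind the third step.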
To evaluate, they tested the model’s ability to generate or discuss Harry Potter content using 300 automatically generated prompts, as well as by inspecting token probabilities. As Eldan and Russinovich state, “to the best of our knowledge, this is the first paper to present an effective technique for unlearning in generative language models.”
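As a rough illustration of what inspecting token probabilities can look like, the sketch below converts per-token logits to probabilities with a softmax and compares the probability of a book-specific token before and after unlearning. The vocabulary, the logit values and the helper name `token_probability` are hypothetical, not the authors' evaluation code or results.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_probability(logits, vocab, token):
    """Probability that the model assigns to `token` as the next token."""
    return softmax(logits)[vocab.index(token)]


# Hypothetical next-token logits for one prompt; values are invented.
vocab = ["Hogwarts", "school", "the"]
logits_before = [3.0, 1.0, 0.0]   # original model
logits_after = [-1.0, 1.0, 0.0]   # model after unlearning

p_before = token_probability(logits_before, vocab, "Hogwarts")
p_after = token_probability(logits_after, vocab, "Hogwarts")
print(f"before: {p_before:.3f}, after: {p_after:.3f}")
```

A sharp drop in the probability of book-specific tokens across many prompts would be one sign that the model no longer completes text with Harry Potter content.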
They found that while the original model could easily discuss intricate Harry Potter plot details, after only an hour of fine-tuning with their technique, “it’s possible for the model to essentially ‘forget’ the intricate narratives of the Harry Potter series.” Performance on standard benchmarks like ARC, BoolQ and Winogrande “remains almost unaffected.”
Expelliarmus-ing expectations
As the authors note, more testing is still needed given the limitations of their evaluation approach. Their technique may also be more effective for fictional texts than for non-fiction, since fictional worlds contain more unique references.
In summarizing their findings, the authors state: “Our technique offers a promising start, but its applicability across various content types remains to be thoroughly tested. The presented approach offers a foundation, but further research is needed to refine and extend the methodology for broader unlearning tasks in LLMs.”