Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
https://arxiv.org/pdf/2306.03341.pdf
Authors: Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg, Harvard University
Word Count: Approximately 3000 words
Estimated Read Time: Around 10-12 minutes
Code Repo: https://github.com/likenneth/honest_llama
The paper proposes a technique called Inference-Time Intervention (ITI) to enhance the truthfulness of large language models (LLMs). The idea is to shift the activations of a small set of attention heads during inference along directions associated with truthful answers, identified from the model's own internal representations.
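The core operation can be sketched in a few lines: add a fixed offset, pointing along the truthful direction and scaled by a strength parameter, to a head's activations at every generation step. The function name, the toy shapes, and the scaling parameters below are illustrative, not the paper's implementation; in the real method the direction and scale are learned per attention head.

```python
import numpy as np

def iti_shift(head_activations, direction, alpha=15.0, sigma=1.0):
    """Shift activations along a truthful direction (ITI-style sketch).

    head_activations: (tokens, head_dim) activations of one attention head.
    direction: truthful direction for this head (normalized here).
    alpha: intervention strength; sigma: std of activations along direction.
    """
    direction = direction / np.linalg.norm(direction)
    return head_activations + alpha * sigma * direction

# Toy example: 4 token positions, head dimension 64.
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 64))
truth_dir = rng.normal(size=64)
shifted = iti_shift(acts, truth_dir)
```

In practice this shift would be applied inside a forward hook on the selected heads, repeated for each token as the model generates, so the intervention steers the whole answer rather than a single step.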
The authors experiment with the LLaMA model on the TruthfulQA benchmark, which tests for truthful behavior. They find that ITI significantly improves LLaMA's performance, increasing its true*informative score from 32.5% to 65.1%.
ITI contrasts with existing approaches such as RLHF, which demand extensive compute and human annotation. ITI is computationally inexpensive and data efficient, requiring only a few hundred labeled examples to locate truthful directions.
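One cheap way to locate such a direction from a few hundred labeled activations, and one of the variants the paper considers, is the difference of class means (a "mass-mean" probe). The function below is a minimal sketch under that assumption; the paper also evaluates trained linear probes per head.

```python
import numpy as np

def mass_mean_direction(acts, labels):
    """Truthful direction as the difference of class means.

    acts: (n_examples, head_dim) head activations on labeled statements.
    labels: boolean array, True where the statement is truthful.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    return acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
```

Directions found this way (or via probe weights) are then ranked by how well they separate true from false statements, and only the top-scoring heads are intervened on.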
However, the authors note that ITI by itself is not sufficient to ensure truthful answers from LLMs. But with additional testing and development, it could be useful as part of a more comprehensive approach.
In summary, ITI shows promise as a minimally invasive technique for improving the truthfulness of LLMs. The results suggest that LLMs may have an internal representation of the likelihood of something being true, even if they produce falsehoods on the surface.
In terms of applicability, ITI could potentially serve as one component in LLM-based applications that require truthful or fact-checked responses. However, more research is needed to better understand its limitations and trade-offs.