This post was submitted on 13 Jun 2023

Machine Learning - Theory | Research

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

https://arxiv.org/pdf/2306.03341.pdf

Authors: Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg, Harvard University

Word Count: Approximately 3000 words

Estimated Read Time: Around 10-12 minutes

Code Repo: https://github.com/likenneth/honest_llama

The paper proposes Inference-Time Intervention (ITI), a technique for enhancing the truthfulness of large language models (LLMs). The idea is to shift the activations of selected attention heads during inference along directions associated with truthful answers, which are identified beforehand by probing the model's internal representations.
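As a rough sketch of the mechanism (not the authors' exact implementation, which operates per attention head and scales the shift by the standard deviation of activations along the direction), the intervention can be expressed as a PyTorch forward hook. The layer index, scale `alpha`, and direction file below are illustrative placeholders:

```python
import torch

def make_iti_hook(direction: torch.Tensor, alpha: float):
    """Build a forward hook that shifts a module's output along a fixed
    unit-norm "truthful direction", scaled by alpha, at every forward pass."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        shifted = out + alpha * direction.to(out)  # broadcasts over batch/seq dims
        if isinstance(output, tuple):
            return (shifted,) + output[1:]
        return shifted

    return hook

# Hypothetical usage on a Hugging Face LLaMA checkpoint; the layer index,
# scale, and saved direction file are placeholders, not values from the paper:
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
# direction = torch.load("truthful_direction_layer12.pt")
# handle = model.model.layers[12].self_attn.o_proj.register_forward_hook(
#     make_iti_hook(direction, alpha=15.0))
# ...generate as usual...
# handle.remove()  # restores the unmodified model
```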

The authors experiment with the LLaMA model on the TruthfulQA benchmark, which tests for truthful behavior. ITI significantly improves performance, raising the model's true*informative score (the product of its truthfulness and informativeness ratings) from 32.5% to 65.1%.

ITI contrasts with existing approaches such as RLHF, which require substantial annotation and compute. ITI is computationally inexpensive and data-efficient, needing only a few hundred labeled examples to locate the truthful directions (a sketch of one way to do this follows).
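As a hedged sketch of how such a direction might be located: fit a linear probe on a head's activations over labeled truthful/untruthful answers and use its weight vector as the direction (the paper also considers the difference of class means and ranks heads by probe accuracy). `collect_head_activations` is a hypothetical helper, not part of the paper's code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def find_truthful_direction(acts: np.ndarray, labels: np.ndarray):
    """Fit a linear probe on one head's activations (n_examples, head_dim),
    labeled 1 (truthful) / 0 (untruthful). Returns the unit-norm probe
    direction and held-out accuracy, which can be used to rank heads."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        acts, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    w = probe.coef_[0]
    return w / np.linalg.norm(w), probe.score(X_te, y_te)

# Hypothetical usage with a few hundred labeled QA examples:
# acts, labels = collect_head_activations(model, qa_pairs)  # placeholder helper
# direction, val_acc = find_truthful_direction(acts, labels)
```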

The authors caution that ITI on its own is not sufficient to guarantee truthful answers from LLMs. With additional testing and development, however, it could be useful as part of a more comprehensive approach.

In summary, ITI shows promise as a minimally invasive technique for improving the truthfulness of LLMs. The results suggest that LLMs may have an internal representation of the likelihood of something being true, even if they produce falsehoods on the surface.

In terms of applicability, ITI could potentially serve as one component of LLM-based applications that require truthful or fact-checked responses. More research is needed, however, to understand ITI's limitations and trade-offs, such as the balance the authors note between truthfulness and helpfulness.
