this post was submitted on 28 Aug 2024

AI


I wanted to extract some crime statistics broken down by type of crime and by population group, all of course normalized by population size. I got a nice set of tables summarizing the data for each year that I requested.

When I shared these summaries, I was told they are entirely unreliable due to hallucinations. So my question to you is: how common a problem is this?

I compared results from ChatGPT (GPT-4), Copilot and Grok, and the results are the same (Gemini says the data is unavailable, btw :)

So are LLMs reliable for research like that?
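For anyone checking tables like these, the "normalized by the population size" step is simple enough to redo by hand from raw counts. A minimal sketch, assuming the usual "incidents per 100,000 residents" convention (the post does not say which base was used); the function name and the numbers are made up for illustration and are not real statistics:

```python
def rate_per_100k(incidents: int, population: int) -> float:
    """Return the number of incidents per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply first so the integer arithmetic stays exact before dividing.
    return incidents * 100_000 / population


# Placeholder inputs, not real data.
example_rate = rate_per_100k(incidents=250, population=500_000)
print(example_rate)  # 50.0
```

If a rate in an LLM-generated table cannot be reproduced from a published count and a published population figure, that is a reason to distrust the row.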

simplymath 18 points 2 months ago (1 child)
Fern 3 points 2 months ago* (last edited 2 months ago)

Definitely. The thing you might want to consider as well is what you are using it for. Is it professional? Not reliable enough. Is it to try to understand things a bit better? Well, it's hard to say if it's reliable enough, but it's heavily biased just as any source might be, so you have to take that into account.

I don't have the experience to tell you how to suss out its biases. Sometimes you can push it in one direction or another with your wording or with follow-up questions. Hallucinations are a thing, but they're not the only concern: there's also cherry-picking, lack of expertise, the bias of the company behind the LLM, what data the LLM was trained on, etc.

I have a hard time understanding what a good way to double-check your LLM is. I think this is a skill we are currently learning, just as we have been learning how to suss out the bias in a headline or an article based on its author, publication, platform, etc. But for LLMs, it feels fuzzier right now. It may also be less reliable for certain issues than for others. Anyway, that's my ramble on the issue. Wish I had a better answer, if only I could ask someone smarter than me.
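One concrete way to double-check numeric tables, at least, is to compare them row by row against figures you pulled yourself from a primary source. This is only a rough sketch under that assumption: the function name, the 1% tolerance, and every value below are hypothetical placeholders, not real statistics. Note that several LLMs agreeing with each other is not the same as agreeing with the source.

```python
import math


def find_mismatches(llm_rows, reference_rows, rel_tol=0.01):
    """Return rows where the LLM value is absent from, or differs from, the reference."""
    problems = {}
    for key, llm_value in llm_rows.items():
        ref_value = reference_rows.get(key)
        if ref_value is None:
            # Nothing to verify against: treat as unconfirmed, possibly fabricated.
            problems[key] = "not in reference"
        elif not math.isclose(llm_value, ref_value, rel_tol=rel_tol):
            problems[key] = f"LLM says {llm_value}, reference says {ref_value}"
    return problems


# Placeholder values keyed by (year, category); not real data.
llm_table = {
    ("2020", "burglary"): 50.0,
    ("2021", "burglary"): 48.0,
    ("2022", "burglary"): 47.0,
}
official_table = {
    ("2020", "burglary"): 50.0,
    ("2021", "burglary"): 52.5,
}

for key, issue in find_mismatches(llm_table, official_table).items():
    print(key, "->", issue)
```

This does nothing about bias or cherry-picking in how the question was framed; it only catches numbers that do not match a source you trust.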


Oh, here's GPT-4o's take.

When considering the accuracy and biases of large language models (LLMs) like GPT, there are several key factors to keep in mind:

1. Training Data and Biases

  • Source of Data: LLMs are trained on vast amounts of data from the internet, books, articles, and other text sources. The quality and nature of this data can greatly influence the model's output. Biases present in the training data can lead to biased outputs. For example, if the data contains biased or prejudiced views, the model may unintentionally reflect these biases in its responses.
  • Historical and Cultural Biases: Since data often reflects historical contexts and cultural norms, models might reproduce or amplify existing stereotypes and biases related to gender, race, religion, or other social categories.

2. Accuracy and Hallucinations

  • Factual Inaccuracies: LLMs do not have an understanding of facts; they generate text based on patterns observed during training. They may provide incorrect or misleading information if the topic is not well represented in their training data or if the data is outdated.
  • Hallucinations: LLMs can "hallucinate" details, meaning they can generate plausible-sounding information that is entirely fabricated. This can occur when the model attempts to fill in gaps in its knowledge or when asked about niche or obscure topics.

3. Context and Ambiguity

  • Understanding Context: While LLMs can generate contextually appropriate responses, they might struggle with nuanced understanding, especially in cases where subtle differences in wording or context significantly change the meaning. Ambiguity in a prompt or query can lead to varied interpretations and outputs.
  • Context Window Limitations: LLMs have a fixed context window, meaning they can only "remember" a certain amount of preceding text. This limitation can affect their ability to maintain context over long conversations or complex topics.
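The context window point can be illustrated with a toy example. This is a simplified sketch, not how any particular model or product manages its history: it assumes the oldest messages are dropped first and uses a whitespace word count as a crude stand-in for a real tokenizer. The function name and the messages are hypothetical.

```python
def fit_to_window(messages, max_tokens):
    """Keep the most recent messages whose combined size fits the budget."""
    kept = []
    used = 0
    for message in reversed(messages):
        cost = len(message.split())  # crude stand-in for a real tokenizer
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = [
    "Use the 2019 table as the baseline.",
    "Now compare 2020 against it.",
    "And what about 2021?",
]
visible = fit_to_window(history, max_tokens=10)
print(visible)
```

With a budget of 10 words, only the last two messages fit, so the instruction that defined the baseline is no longer visible to the model even though the later questions depend on it.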

4. Updates and Recency

  • Outdated Information: Because LLMs are trained on static datasets, they may not have up-to-date information about recent events, scientific discoveries, or new societal changes unless explicitly fine-tuned or updated.

5. Mitigating Biases and Ensuring Accuracy

  • Awareness and Critical Evaluation: Users should be aware of potential biases and inaccuracies and approach the output critically, especially when discussing sensitive or fact-based topics.
  • Diverse and Balanced Data: Developers can mitigate biases by training models on more diverse and balanced datasets and employing techniques such as debiasing algorithms or fine-tuning with carefully curated data.
  • Human Oversight and Expertise: Where high accuracy is critical (e.g., in legal, medical, or scientific contexts), human oversight is necessary to verify the information provided by LLMs.

6. Ethical Considerations

  • Responsible Use: Users should consider the ethical implications of using LLMs, especially in contexts where biased or inaccurate information could cause harm or reinforce stereotypes.

In summary, while LLMs can provide valuable assistance in generating text and answering queries, their accuracy is not guaranteed, and their outputs may reflect biases present in their training data. Users should use them as tools to aid in tasks, but not as infallible sources of truth. It is essential to apply critical thinking and, when necessary, consult additional reliable sources to verify information.