this post was submitted on 08 Oct 2023

BBC will block ChatGPT AI from scraping its content

ChatGPT will be blocked by the BBC from scraping content in a move to protect copyrighted material.

[–] [email protected] 0 points 11 months ago

The thing is, realistically it won't make a difference at all, because there are vast amounts of public domain data that remain untapped. The main "problematic" need for OpenAI is new content that reflects up-to-date language and up-to-date facts. My point about the share price of Thomson Reuters is to illustrate that OpenAI is already getting large enough that it can afford to outright buy some of the largest channels of up-to-the-minute content in the world.

As for authors, it might wipe a few works by a few famous authors from the dataset, but they contribute very little to the quality of an LLM, because the LLM can't easily judge quality during training unless you intentionally reinforce specific works. There are several million books published every year. Most of them make <$100 in royalties for their authors (an average book sells ~200 copies). Want to bet how cheap it'd be to buy a fully licensed set of a few million books? You don't need bestsellers, you need many books that are merely sufficiently good to drag the overall quality of the total dataset up.
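To put a rough number on that bet, here is a back-of-envelope sketch. The $100 is the royalty ceiling mentioned above; reading "a few million" as 3 million books and the `buyout_multiple` knob are my own assumptions for illustration, since nobody knows what authors would actually accept for a license.

```python
def licensing_upper_bound(num_books: int,
                          royalty_per_book_usd: float,
                          buyout_multiple: float = 1.0) -> float:
    """Rough ceiling on the cost of licensing a set of low-selling books.

    num_books            -- how many books you want (assumption: 3 million)
    royalty_per_book_usd -- royalties a typical book earns (<$100 per the comment)
    buyout_multiple      -- hypothetical premium paid over those royalties
    """
    return num_books * royalty_per_book_usd * buyout_multiple


# Paying every author exactly what the book earned in royalties.
at_royalty_value = licensing_upper_bound(3_000_000, 100)

# Paying a hypothetical 5x premium on top of that.
at_five_times = licensing_upper_bound(3_000_000, 100, buyout_multiple=5)

print(f"1x royalties: ${at_royalty_value:,.0f}")
print(f"5x royalties: ${at_five_times:,.0f}")
```

Even with a generous multiple, that lands in the hundreds of millions to low billions of dollars, which is the kind of money only the largest players can spend on a dataset.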

The irony is that the biggest beneficiaries of content sources taking a strict view of LLMs will be OpenAI, Google, Meta, and the few others large enough to basically buy datasets or buy the companies that own them, because this creates a moat against those who can't afford to obtain licensed datasets.

The biggest problem won't be for OpenAI, but for people trying to build open models on the cheap.