this post was submitted on 08 Oct 2023
507 points (97.0% liked)
Technology
The thing is, realistically it won't make a difference at all, because there are vast amounts of public domain data that remain untapped. The main "problematic" need for OpenAI is new content that represents up-to-date language and up-to-date facts. My point with the share price of Thomson Reuters is to illustrate that OpenAI is already getting large enough that they can afford to outright buy some of the largest channels of up-to-the-minute content in the world.
As for authors, it might wipe a few works by a few famous authors from the dataset, but they contribute very little to the quality of an LLM, because the LLM can't easily judge quality during training unless you intentionally reinforce specific works. There are several million books published every year. Most of them make <$100 in royalties for their authors (an average book sells ~200 copies). Want to bet how cheap it'd be to buy a fully licensed set of a few million books? You don't need bestsellers, you need many books that are merely sufficiently good to drag the overall quality of the total dataset up.
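To put rough numbers on that bet, here is a back-of-envelope sketch. Everything in it is an assumption for illustration: I'm reading "a few million" as 1 to 5 million books, treating the $100 royalty figure as a per-book ceiling, and inventing the idea that a buyer offers some multiple of that ceiling as a flat licence fee. The function names are made up too. It is not market data.

```python
# Back-of-envelope only: every figure here is an assumption, not market data.
ROYALTY_CEILING_USD = 100  # "most make <$100 in royalties", used as an upper bound


def licensing_cost(num_books: int, fee_per_book: float) -> float:
    """Total cost of paying one flat fee for each book in the set."""
    return num_books * fee_per_book


def cost_table(book_counts, multiples):
    """Cost for each (book count, multiple of the royalty ceiling) pair."""
    return {
        (n, m): licensing_cost(n, ROYALTY_CEILING_USD * m)
        for n in book_counts
        for m in multiples
    }


if __name__ == "__main__":
    # "A few million books" read as 1-5 million; the multiples are hypothetical offers.
    table = cost_table([1_000_000, 3_000_000, 5_000_000], [1, 5, 10])
    for (n, m), cost in table.items():
        print(f"{n:>9,} books at {m:>2}x the royalty ceiling: ${cost / 1e6:,.0f}M")
```

Under these assumptions, 3 million books at the ceiling itself comes to $300 million, and even the most generous case here (5 million books at ten times the ceiling) comes to $5 billion.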
The irony is that the largest beneficiaries of content sources taking a strict view of LLMs will be OpenAI, Google, Meta, and the few others large enough to basically buy datasets or buy companies that own datasets, because this creates a moat against those who can't afford to obtain licensed datasets.
The biggest problem won't be for OpenAI, but for people trying to build open models on the cheap.