this post was submitted on 09 Jan 2025
Technology
Yeah, AI has become good enough at this point that you can provide it with a large blob of context material - such as API documentation, source code, etc. - and then have it come up with its own questions and answers about that material to create a corpus of "synthetic data" to train on. And you can steer the synthetic data to fit the format and style that you want, such as telling it not to be snarky or passive-aggressive or whatever.
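To make that concrete, here is a rough sketch of the kind of pipeline I mean. Everything in it is illustrative: `generate` is a hypothetical stand-in for whatever model call you actually use (it is not a real library function), the JSON reply format is something I am assuming you ask the model for, and `STYLE_RULES`, `chunk_text`, `build_prompt` and `make_synthetic_pairs` are made-up names.

```python
import json
from typing import Callable, Dict, List

# Assumed style instruction; swap in whatever tone and format you want.
STYLE_RULES = "Answer plainly. Do not be snarky or passive-aggressive."


def chunk_text(text: str, max_chars: int = 2000) -> List[str]:
    """Split context material into paragraph-aligned chunks."""
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Start a new chunk once adding this paragraph would exceed the limit.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def build_prompt(chunk: str, n_pairs: int) -> str:
    """Ask the model to invent its own questions and answers about a chunk."""
    return (
        f"Read the material below and write {n_pairs} question/answer pairs "
        "about it. Reply with a JSON list of objects that have 'question' "
        f"and 'answer' keys.\n{STYLE_RULES}\n\nMATERIAL:\n{chunk}"
    )


def make_synthetic_pairs(
    context: str,
    generate: Callable[[str], str],  # hypothetical: prompt in, model text out
    n_pairs: int = 3,
) -> List[Dict[str, str]]:
    """Build a small synthetic Q&A corpus, dropping malformed replies."""
    records: List[Dict[str, str]] = []
    for chunk in chunk_text(context):
        raw = generate(build_prompt(chunk, n_pairs))
        try:
            pairs = json.loads(raw)
        except json.JSONDecodeError:
            continue  # crude curation: skip replies that are not valid JSON
        if not isinstance(pairs, list):
            continue
        for pair in pairs:
            # Keep only complete pairs; empty questions or answers are dropped.
            if isinstance(pair, dict) and pair.get("question") and pair.get("answer"):
                records.append(
                    {
                        "question": pair["question"],
                        "answer": pair["answer"],
                        "source": chunk,
                    }
                )
    return records
```

You would call it as `make_synthetic_pairs(docs_text, generate=my_model_call)` and write the resulting records out as training examples. The filtering here is deliberately minimal; a real setup would presumably add deduplication and quality checks on top.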
It can reproduce an API. It can't solve actual problems. LLMs are completely incapable of innovation.
And yet the synthetic training data works, and models trained on it continue scoring higher on the benchmarks than ones trained on raw Internet data. Claim what you want about it, the results speak louder.
This is the peak, though. They require new data to get better but most of the available new data is adulterated with AI slop. Once they start eating themselves it's over.
You are speaking of "model collapse", I take it? That doesn't happen in the real world with properly generated and curated synthetic data. Model collapse has only been demonstrated in highly artificial circumstances where many generations of models were "bred" exclusively on the outputs of previous generations, without the sort of curation and blend of additional new data that real-world models are trained with.
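If it helps to see the difference between those two setups, here is a toy simulation. It is only an analogy and says nothing directly about LLMs: the "model" is just a Gaussian refit to a small sample each generation, and `simulate_generations` and `fresh_fraction` are names I made up for the sketch. With `fresh_fraction=0.0` each generation trains only on the previous generation's output; anything above zero blends in samples from the original distribution.

```python
import random
import statistics
from typing import List


def simulate_generations(
    generations: int,
    sample_size: int,
    fresh_fraction: float,
    seed: int = 0,
) -> List[float]:
    """Refit a Gaussian each generation and return its std per generation.

    The "real" data distribution is assumed to be N(0, 1). Each generation
    trains on a mix of samples drawn from the previous generation's fit
    (synthetic) and samples drawn from the real distribution (fresh).
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 matches the real distribution
    history = [std]
    n_fresh = round(sample_size * fresh_fraction)
    for _ in range(generations):
        synthetic = [rng.gauss(mean, std) for _ in range(sample_size - n_fresh)]
        fresh = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]
        data = synthetic + fresh
        # "Training" is just re-estimating the two parameters from the sample.
        mean = statistics.fmean(data)
        std = statistics.stdev(data)
        history.append(std)
    return history


if __name__ == "__main__":
    closed_loop = simulate_generations(500, 10, fresh_fraction=0.0)
    blended = simulate_generations(500, 10, fresh_fraction=0.5)
    print(f"closed loop, final std: {closed_loop[-1]:.3g}")
    print(f"blended,     final std: {blended[-1]:.3g}")
```

In the closed-loop run the fitted spread tends to shrink toward zero because estimation error compounds with nothing to correct it, which is the artificial setup I was describing. Mixing in fresh samples each generation keeps pulling the fit back toward the real distribution.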
There is no sign that we are at "the peak" of AI development yet.
We're already seeing signs of incestuous data input causing damage. The more that AI takes over, the less capable it will be.
Are we, though? Newer models almost universally perform better than older ones, adjusted for scale. What signs are you seeing?