this post was submitted on 29 Jan 2025

200 points (97.2% liked)

Chinese firms ‘distilling’ US AI models to create rival products, warns OpenAI (www.theguardian.com)

submitted 1 month ago by withabeard to c/leopardsatemyface

Honestly, an AI firm being salty that someone has potentially taken their work, "distilled" it and is selling it on feels hilariously hypocritical.

Not like they've taken the writings, pictures, edits and videos of others, "distilled" them and created something new from it.

[–] brucethemoose 75 points 1 month ago* (last edited 1 month ago) (3 children)

This is a lie.

Some background:

LLMs don't output words, they output lists of word probabilities. Technically they output tokens, but "words" are a good enough analogy.
So for instance, if "My favorite color is" is the input to the LLM, the output could be 30% "blue.", 20% "red.", 10% "yellow.", and so on, for many different possible words. The actual word that's used and shown to the user is selected through a process called sampling, but that's not important now.
This spread can be quite diverse, something like:
[Image: graph of candidate words and their probabilities; not preserved]
A "distillation," as the term is used in LLM land, means running tons of input data through existing LLMs, writing the logit outputs, aka the word probabilities, to disk, and then training the target LLM on that distribution instead of single words. This is extremely efficient because running LLMs is much faster than training them, and you "capture" much more of the LLM's "intelligence" with its logit output rather than single words. Just look at the above graph: in one training pass, you get dozens of mostly-valid inputs trained into the model instead of one. It also shrinks the size of the dataset you need, meaning it can be of higher quality.
Because OpenAI are jerks, they stopped offering logit outputs a while ago.
I.e., this is a blatant lie! OpenAI does not offer logprobs, so creating distillations from their models is literally impossible.
OpenAI contributes basically zero research to the open LLM space, so there's very little to copy as well. Some do train on the basic output of OpenAI models, but this only gets you so far.
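
To make the "distribution instead of single words" point concrete, here is a minimal sketch, not any lab's actual training code. It compares the training signal a student model gets from one sampled word with the signal it gets from the teacher's full probability spread. The four-word vocabulary, the logit values and the helper names `softmax` and `logit_gradient` are all made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_gradient(target_probs, student_probs):
    """Gradient of cross-entropy loss w.r.t. the student's logits (student - target)."""
    return [s - t for s, t in zip(student_probs, target_probs)]


# Hypothetical vocabulary and made-up logits for the prompt "My favorite color is".
VOCAB = ["blue.", "red.", "yellow.", "green."]
teacher_probs = softmax([2.0, 1.6, 0.9, 0.5])
student_probs = softmax([0.0, 0.0, 0.0, 0.0])  # untrained student: uniform guess

# Training on plain text output: the target is only the one word that was sampled.
hard_target = [1.0, 0.0, 0.0, 0.0]
hard_grad = logit_gradient(hard_target, student_probs)

# Distillation: the target is the teacher's whole distribution.
soft_grad = logit_gradient(teacher_probs, student_probs)

for word, h, s in zip(VOCAB, hard_grad, soft_grad):
    print(f"{word:8s} hard-label grad {h:+.3f}   distillation grad {s:+.3f}")
```

With the hard label, every word except "blue." gets the identical push down, so the student learns nothing about "red." being a more plausible answer than "green.". With the teacher's distribution as the target, each word gets its own signal in the same training pass.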

There are a lot of implications. But basically a bunch of open models from different teams are stronger than a single closed one because they can all theoretically be "distilled" into each other. Hence DeepSeek actually built on top of the work of Qwen 2.5 (from Alibaba, not them) to produce the smaller DeepSeek R1 models, and this is far from the first effective distillation. Arcee 14B used distilled logits from Mistral, Meta (Llama) and I think Qwen to produce a state-of-the-art 14B model very recently. It didn't make headlines, but was almost as eyebrow-raising to me.

[–] Dubiousx99 30 points 1 month ago (1 children)

Posts like yours are why I read comments. It actually has content and I’m able to learn something from it. Thank you for your contribution.

[–] brucethemoose 26 points 1 month ago* (last edited 1 month ago) (1 children)

Thanks! I'm happy to answer questions too!

I feel like one of the worst things OpenAI has encouraged is "LLM ignorance." They want people to use their APIs without knowing how they work internally, and keep the user/dev as dumb as possible.

But even just knowing the basics of what they're doing is enlightening, and explains things like why they're so bad at math or word counting (tokenization), why they mess up so "randomly" (sampling and their serial nature), why they repeat/loop (dumb sampling and bad training, but it's complicated), or even just basic things like the format they use to search for knowledge. Among many other things. They're better tools and less "AI bro hype tech" when they aren't a total black box.
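
As a rough illustration of the sampling point, here is a small sketch of how the same word probabilities can yield different outputs from run to run. It is a toy, not how any particular API implements sampling: the vocabulary, the logit values and the function names `softmax`, `sample` and `greedy` are assumptions made up for this example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the spread."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(vocab, logits, temperature, rng):
    """Pick one word at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


def greedy(vocab, logits):
    """Always pick the single most likely word (no randomness)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]


# Hypothetical vocabulary and made-up logits for "My favorite color is".
VOCAB = ["blue.", "red.", "yellow.", "green."]
LOGITS = [2.0, 1.6, 0.9, 0.5]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample(VOCAB, LOGITS, 1.0, rng) for _ in range(1000)]
counts = {word: draws.count(word) for word in VOCAB}

print("greedy pick:", greedy(VOCAB, LOGITS))
print("sampled counts over 1000 draws:", counts)
```

Greedy picking always returns the same word, while sampling spreads the picks across several words, including the less likely ones. Since each chosen word is fed back in as input for the next step, one unlucky pick early on can steer everything that follows.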

[–] [email protected] 2 points 1 month ago (1 children)

Thx for your insight, very insightful!

So a question: Where do you see this AI heading? Is it just chatbots for customer service, fully functional computer programming, or even fully functional 3D printing and CNC programs with just a few inputs? (for example: here's a 3D model upload that I need for this particular machine with this material, now make me a program)

[–] brucethemoose 3 points 1 month ago* (last edited 1 month ago) (1 children)

Depends what you mean by "AI"

Generative models as you know them are pretty much all transformers, and there are already many hacks to let them ingest images, video, sound/music, and even other formats. I believe there are some dedicated 3D models out there, as well as some experiments with "byte-level" LLMs that can theoretically take any data format.

But there are fundamental limitations, like the long context you'd need for 3D model ingestion being inefficient. The entities that can afford to train the best models are "conservative" and tend to shy away from testing exotic implementations, presumably because they might fail.

Some seemingly "solvable" problems like repetition issues you encounter with programming have not had potential solutions adopted either, and the fix they use (literally randomizing the output) makes them fundamentally unreliable. LLMs are great assistants, but you can never fully trust them as is.

What I'm getting at is that everything you said is theoretically possible, but the entities with the purse strings are relatively conservative and tend to pursue profitable pure text performance instead. So I bet they will remain as "interns" and "assistants" until there's a more fundamental architecture shift, maybe something that learns and error corrects during usage instead of being so static.

And as stupid as this sounds, another problem is packaging. There are some incredible models that take media or even 3D as input, for instance... but they are all janky, half-functional Python repos researchers threw up before moving on. There isn't much integration and user-friendliness in AI land.

[–] [email protected] 1 points 1 month ago (1 children)

I suppose you are right.. they are "learning" models after all.
I just think of the progress with slicers, dynamic infill, computational gcode output with CNC and all the possibilities thereof. There are just so many variables (seemingly infinite). But so are there with LLMs, so maybe there is hope.

[–] brucethemoose 1 points 1 month ago

Basically the world is waiting for the Nvidia monopoly to break and training costs to come down, then we will see...

[–] [email protected] 6 points 1 month ago (1 children)

Wait, so OpenAI's whole kerfuffle here is based on nothing directly stated (e.g. in the paper like I thought), and worse, almost certainly completely unfounded?

Wow just when I thought they couldn't get more ridiculous...

[–] brucethemoose 14 points 1 month ago* (last edited 1 month ago)

Almost all of OpenAI's statements are unfounded. Just watch how the research community reacts whenever Altman opens his mouth.

TSMC allegedly calling him a "podcast bro" is the most accurate descriptor I've seen: https://www.nytimes.com/2024/09/25/business/openai-plan-electricity.html

[–] [email protected] 2 points 1 month ago (1 children)

How does this get used to create a better AI? Is it just that combining distillations together gets you a better AI? Is there a selection process?

[–] brucethemoose 7 points 1 month ago

Chains of distillation are mostly uncharted territory! There aren't a lot of distillations because each one is still very expensive (as in at least tens of thousands of dollars, maybe millions of dollars for big models).

Usually a distillation is used to make a smaller model out of a bigger one.

But the idea of distillations from multiple models is to "add" the knowledge and strengths of each model together. There's no formal selection process, it's just whatever the researchers happen to try. You can read about another example here: https://huggingface.co/arcee-ai/SuperNova-Medius
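
One simple way to picture "adding" teachers together is to blend their probability distributions into a single soft target for the student. The sketch below is only an assumption for illustration, not the method used by SuperNova-Medius or any other project. The helper `blend_teachers`, the shared four-word vocabulary and the probabilities are all hypothetical.

```python
def blend_teachers(teacher_distributions, weights=None):
    """Weighted average of several teachers' probabilities over one shared vocabulary."""
    if weights is None:
        weights = [1.0] * len(teacher_distributions)
    vocab_size = len(teacher_distributions[0])
    if any(len(dist) != vocab_size for dist in teacher_distributions):
        raise ValueError("all teachers must use the same vocabulary")
    total_weight = sum(weights)
    return [
        sum(w * dist[i] for w, dist in zip(weights, teacher_distributions)) / total_weight
        for i in range(vocab_size)
    ]


# Hypothetical shared vocabulary and made-up teacher outputs for one prompt.
VOCAB = ["blue.", "red.", "yellow.", "green."]
teacher_a = [0.50, 0.30, 0.15, 0.05]
teacher_b = [0.20, 0.20, 0.10, 0.50]

# Equal weighting; the student would be trained against this blended target.
blended = blend_teachers([teacher_a, teacher_b])

for word, prob in zip(VOCAB, blended):
    print(f"{word:8s} {prob:.3f}")
```

This assumes every teacher scores the same list of words. Models from different families use different tokenizers, so a real multi-teacher setup first has to map their outputs onto a common vocabulary, which this toy skips entirely.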

[–] breadsmasher 58 points 1 month ago

“waaa other people are using our stolen work!”

[–] [email protected] 33 points 1 month ago (1 children)

[–] SoftestSapphic 5 points 1 month ago

But but... AI isn't just autocorrect!

It searches the web for the most likely response to your queries, like a search engine.

BUT IT'S NOT JUST A SEARCH ENGINE WITH AUTO CORRECT!! WAIT!!

[–] inb4_FoundTheVegan 31 points 1 month ago* (last edited 1 month ago) (1 children)

So can we trot out some of the BS excuses we've been hearing from tech bros the last year?

They have to be trained somehow.

AI will free us to make new better things than what we stole.

Don't be a luddite. Technology makes lots of things obsolete.

[–] [email protected] 4 points 1 month ago

hopefully tech can make "open"ai obsolete

[–] rtxn 21 points 1 month ago* (last edited 1 month ago) (1 children)

Cope. Seethe. Mald.

Seeing all these tech bros collectively lose it is filling my heart with joy.

[–] [email protected] 1 points 1 month ago

This is ~~Thancred~~ certainly the last thread I expected to find a FFXIV meme in.

[–] [email protected] 16 points 1 month ago

get rekt, capitalist pigs!

[–] ieatpwns 16 points 1 month ago

"National security risk" is CEO-speak for lost profits

[–] MushuChupacabra 13 points 1 month ago

It fills my heart with joy to see someone scrape his scraped data, and use it to easily make something better, at a fraction of the cost of OpenAI.

The vacuum sound made when a fuckton of investors' money gets pulled must be unsettling for Sam.

[–] FabledAepitaph 13 points 1 month ago (2 children)

The data never belonged to OpenAI in the first place tho, did it?

[–] [email protected] 3 points 1 month ago

Narrator: No, it did not.

[–] Dkarma -1 points 1 month ago

Not relevant. We are talking models not training data.

The training data is free cuz it's freely found on the internet.

Not hard to understand.

[–] db2 9 points 1 month ago (2 children)

Typical of the type, it isn't a big deal until it's happening to them.

[–] [email protected] 5 points 1 month ago (1 children)

That's literally every conservative. They cannot comprehend hardship unless they suffer it personally.

[–] db2 2 points 1 month ago

I don't think they comprehend it then either tbh, like the 8 second memory a housefly has.

[–] Dkarma 1 points 1 month ago (1 children)

Post is pretending this isn't apples to oranges.

OpenAI never stole anything, just looked at it.

[–] db2 1 points 1 month ago

You wouldn't download a Skynet?

[–] [email protected] 8 points 1 month ago* (last edited 1 month ago)

Suddenly feels like shit when they do it to you, right? I'm not even that impacted by AI, but it seems like as long as people aren't affected directly by something, they have zero compassion. Basically, leopards ate my face is that, but it's so fucking annoying. But the people voting for the leopards is just one thing; the leopards are still worse in my opinion. Using a little wordplay, I wish I could hunt leopards. I don't care what counts as democratic, as THEY are the ones who are a direct threat against democracy. The whole right is. The assumption that someone has more rights than someone else is undemocratic to the core. People shouldn't have a right to vote over others' rights. That's how you solve the paradox of democracy. Elections should be boring, about the economy and how you manage resources, not about who can vote.

Sorry for the rant (and off-topic), had a long day.

[–] Macaroni_ninja 7 points 1 month ago

They are the bad AI people with evil AI, you should listen to us good AI people with the great AI!

[–] Evotech 5 points 1 month ago

Hahahahahahahaha

[–] NotMyOldRedditName 4 points 1 month ago

Gonna need more popcorn.

[–] [email protected] 3 points 1 month ago

Lol, they made a copy of the wrong-o-matic.

[–] [email protected] 3 points 1 month ago (1 children)

ask general electric what china did with their engineering designs for nuclear power plants

a) to expect China not to steal every piece of design they can lay their hands on is foolish; it should be part of every tech company's contingency planning, and of investor consideration

b) given that deepseek seems to have condensed the processing, i can only imagine openai can now use their processes to make the high end chips work just that much more efficiently

[–] UnderpantsWeevil 4 points 1 month ago* (last edited 1 month ago)

ask general electric what china did with their engineering designs for nuclear power plants

GE: "We have a design for a nuclear power plant that we'd like to build."

Chinese Construction Firm: "Great, we'll pay you to help implement a prototype and then we'll use the schema to build more plants"

GE: "No!! That's stealing!"

Chinese Government: Begins building hundreds of new nuclear power plants

American Government: Won't build any new plants

GE: "China made us lose money!"

[–] Tronn4 1 points 1 month ago

https://youtu.be/U1UtRnGn5hc?si=9yHO8G2FNJ1yZfOg