'Meta Torrented over 81 TB of Data Through Anna's Archive, Despite Few Seeders' (torrentfreak.com)

submitted 3 days ago* (last edited 3 days ago) by [email protected] to c/technology

106 comments

[–] [email protected] 26 points 3 days ago* (last edited 2 days ago) (8 children)

As a socialist I believe intellectual property is a falsehood and technological advancement should be for the public good. Open source LLMs are for the public good.

Given the options between having open source LLMs and the US Govt banning non-corpo non-proprietary LLMs and giving a free pass to people like Musk and Altman and Zucc to monopolize, I happily pick the former.

You're delusional if you think they will pay anyone, the only way zucc will pay is with a guillotine.

Corpos will make inter-platform deals that simply make all online data licensable for the right price, enriching each other, so you can't avoid it while still being a career creative. The price will shut out academic researchers and the public sector, so that all the fruits of it stay behind closed R&D doors, free of ethics etc.

Continuing in your role as a useful idiot, you'll most likely also foot the bill for it via subsidies from your taxes to "develop the AI sector" in some anti-China dick measuring contest by the US.

You will then be sold this data back via proprietary chat bots via a monthly subscription and you better pay up because once it gets really good, it will become mandatory to use for just about any job, leaving you with no choice.

Or you can support FOSS LLMs.

[–] FooBarrington 1 points 2 days ago (3 children)

I support FOSS LLMs, but which actually exist? Which LLMs have open-sourced all their training data?

[–] [email protected] 1 points 2 days ago* (last edited 2 days ago) (2 children)

Mistral? Deepseek?

Not an LLM, but there's also SD (Stable Diffusion), which uses a very popular free dataset.

[–] FooBarrington 3 points 2 days ago (1 children)

Can I freely download all the training data for any of those? I was under the impression they were all trained on non-licensed and copyrighted data.

[–] [email protected] 1 points 1 day ago* (last edited 1 day ago)

It's complicated.

I know Stable Diffusion best so I'll speak to that. They used the LAION-5B dataset, which is, in practice, freely available to download and use:

https://www.kaggle.com/code/vitaliykinakh/guie-laion-5b-collect-and-download

https://github.com/opendatalab/laion5b-downloader

It's also on HuggingFace, but it's unavailable there:

https://huggingface.co/datasets/danielz01/laion-5b

But you can use this smaller newer version:

https://huggingface.co/datasets/laion/relaion2B-en-research

Whether it's appropriately licensed is an unresolved question though.

The dataset itself and the text portion of the text-image pairs needed for training are CC-BY-SA; the newer versions linked above are CC-BY-4.0: https://creativecommons.org/licenses/by/4.0/deed.en

The images, however, are technically under their own copyright, which in practice means each of the billions of images may or may not have a licence that implicitly or explicitly forbids AI training use, or forbids it only for commercial use.

Whether such a license is legally binding is unknown at present, though. Licenses primarily deal with reproduction, and the pro-AI folks argue that training isn't reproduction: training an NN is more akin to viewing an image and memorising the patterns and relationships within it, like a person would.

That would make it non-infringing and therefore the model itself libre. In that case Mistral and LLaMa are also libre as long as the model itself is open source, which in this case really means "open weights", so not like GPT and anything by """OpenAI""".

Weights are essentially the result of a model being trained. They're the key bit that makes or breaks it and determines how it works. Given the weights, and knowing the structure of the model and the framework used, you can refine, modify and distribute it.
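
To make "weights" a bit more concrete, here's a toy sketch in plain Python. Everything in it is made up for illustration (the one-line model `y = w * x + b`, the toy data, the function names `train` and `predict`); real models have billions of weights and use a proper framework, but the idea is roughly the same: training turns data into numbers, and those numbers plus the model structure are all you need afterwards.

```python
# Toy illustration only: the model structure (y = w * x + b), the data and
# the helper names are assumptions for this sketch, not any real model.

def train(data, steps=2000, lr=0.01):
    """Fit y = w * x + b by plain gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    # These numbers are the "weights": the only thing that gets shipped.
    return {"w": w, "b": b}

def predict(weights, x):
    """Run the model using nothing but its weights and its structure."""
    return weights["w"] * x + weights["b"]

# "Training data": points on the line y = 3x + 1.
training_data = [(x, 3.0 * x + 1.0) for x in range(-5, 6)]
weights = train(training_data)

# The training data can be thrown away; the weights still work on their own.
del training_data
print(weights)
print(predict(weights, 10))
```

That's the sense in which "open weights" lets you run, refine and redistribute a model without ever getting the training data, and also why open weights alone don't tell you what the model was trained on.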

Those against AI will say that it's more akin to file compression and that in one form or another it's misuse. That would make the model an infringing derivative work and as such not libre, even if the model weights are open source.

In a way, though, you could argue that me vaguely memorising the imagery of a dude dressed in white holding a laser sword is just a lossy compressed copy of the copyrighted work of Star Wars. It'd be absurd to think that's a violation; infringement only occurs if I reproduce a work of substantial similarity commercially from that memory.

If I use Krita and draw a beautiful landscape which has been informed and inspired, at least in part, by a movie I saw, is that copyright infringement or not? What if I use AI?

Well, current laws don't say. We measure infringement by substantial similarity; provenance of information only comes in later (e.g. to prove against accidental similarity).

That's also my own personal stance on the legal side of things, so up to you how you see it.
