this post was submitted on 26 Sep 2023
126 points (89.9% liked)

Technology

[–] [email protected] 4 points 1 year ago (2 children)

The article doesn't explain how that's the case at all.

Aren't all the big AI models trained on publicly available data?

[–] [email protected] 3 points 1 year ago* (last edited 1 year ago)

Books3 is the definition of "not publicly available" because it's all pirated material downloaded from the private torrent tracker Bibliotik.

Books3 is literally why several AI groups are being sued by various authors like Sarah Silverman and George R.R. Martin.

Books3 was always illicitly obtained material, which calls into question whether an LLM trained on it could really fall under fair use. (It most likely does, but it's still a legal question that hasn't been answered yet.)

Books3 Link: https://huggingface.co/datasets/the_pile_books3

Books3 Description from Link:

This dataset is Shawn Presser's work and is part of EleutherAi/The Pile dataset.

This dataset contains all of bibliotik in plain .txt form, aka 197,000 books processed in exactly the same way as did for bookcorpusopen (a.k.a. books1). seems to be similar to OpenAI's mysterious "books2" dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know very little about any differences. People suspect it's "all of libgen", but it's purely conjecture.

[–] [email protected] 2 points 1 year ago (1 children)

I see it more like your address: it's public in the sense that if I knocked on every door and looked through every window, I would eventually find where you live. But I probably couldn't quickly look up where you live, because it isn't meant to be public knowledge.

AI takes everything and makes it easily searchable for itself, even if it wasn't made to be.