Technology

62004 readers

3736 users here now

This is a most excellent place for technology news and articles.

Our Rules

Follow the lemmy.world rules.
Only tech related content.
Be excellent to each other!
Mod approved content bots can post up to 10 articles per day.
Threads asking for personal tech support may be deleted.
Politics threads may be removed.
No memes allowed as posts, OK to post as comments.
Only approved bots from the list below, to ask if your bot can be added please contact us.
Check for duplicates before posting, duplicates may be removed
Accounts 7 days and younger will have their posts automatically removed.

Approved Bots

founded 2 years ago

MODERATORS

435

Google Researchers’ Attack Prompts ChatGPT to Reveal Its Training Data (www.404media.co)

submitted 1 year ago by [email protected] to c/technology

97 comments fedilink hide all child comments

ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

you are viewing a single comment's thread
view the rest of the comments

[–] [email protected] -1 points 1 year ago (2 children)

ChatGPT is a large language model. The model contains word relationships - a nebulous collection of rules for stringing word together. The model does not contain information. In order for ChatGPT to answer flexibly answer questions, it must have access to information for reference - information that it can index, tag and sort for keywords.

[–] [email protected] 3 points 1 year ago* (last edited 1 year ago) (1 children)

I'm honestly not sure what you're trying to say here. If by "it must have access to information for reference" you mean it has access while it is running, it doesn't. Like I said that information is only available during training. Either you're trying to make a point I'm just not getting or you are misunderstanding how neural networks function.

[–] [email protected] 1 points 1 year ago (1 children)

Like I said that information is only available during training.

This is not correct. I understand how neural networks function, I also understand that the neural network is not a complete system in itself. In order to be useful, the model is connected to other things, including a source of reference information. For instance, earlier this year ChatGPT was connected to the internet so that it could respond to queries with more up-to-date information. At that point, the neural network was frozen. It was not being actively trained on the internet, it was just connected to it for the sake of completing search queries.

[–] brianorca 2 points 1 year ago* (last edited 1 year ago)

That is an optional feature, not required to make use of an LLM. And not even a feature of most LLMs. ChatGPT was usable before they added that, but it can help when you need recent data. And they do continue to train It, with the current cutoff being April of this year, at least for some models. (But training is expensive, so we can expect it to be in conjunction with other design changes that require additional training.)

[–] [email protected] 3 points 1 year ago (1 children)

information that it can index, tag and sort for keywords.

The dataset ChatGPT uses to train on contains data copied unlawfully. They're not just reading the data at its source, they're copying the data into a training database without sufficient license.

Whether ChatGPT itself contains all the works is debatable - is it just word relationships when the system can reproduce significant chunks of copyrighted data from those relationships? - but the process of training inherently requires unlicensed copying.

In terms of fair use, they could argue a research exemption, but this isn't really research, it's product development. The database isn't available as part of scientific research, it's protected as a trade secret. Even if it was considered research, it absolutely is commercial in nature.

In my opinion, there is a stronger argument that OpenAI have broken copyright for commercial gain than that they are legitimately performing fair use copying for the benefit of society.

[–] Womble 3 points 1 year ago (1 children)

Every time you load a webpage you are making a local copy of it for your own use, if it is on the open web you are implicitly given permission to make a copy of it for your own use. You are not given permission to then distribute those copies which is where LLMs may get into trouble, but making a copy for the purpose of training is not a breach of copyright as far as I can understand or have heard.

[–] [email protected] 3 points 1 year ago (1 children)

Yes, you do make a copy of a web page. And every time you load a video game you make a copy into RAM. However, this copying is all permitted under user license - you're allowed to make minor copies as part of the process of running the software and playing the media.

Case in point, the UK courts ruled that playing pirated games was illegal, because when you load the game from a disc you copy it into RAM, and this copying is not licensed by the player.

OpenAI does not have any license for copying into its database. The terms and conditions of web pages say you're allowed to view them, not allowed to take the data and use it for things. They don't explicitly prohibit this (yet), but the lack of a prohibition does not mean a license is implied. OpenAI can only hope for a fair use exemption, and I don't think they qualify because a) it isn't really "research" but product development, and even if it is research b) it is purely for commercial gain.

[–] Womble 3 points 1 year ago (1 children)

Could you point to the judgement on playing copied games was illegal in the UK? I can only find articles about specifically DS copy cartridges which are very obviously intended to make/use unlicensed copies of games to distribute.

Even so, that again hinges on right to distribute, not right to make a copy for personal use. If a game is made freely available on the web for you to play it is not illegal to download that game to play offline or study it.

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago)

I'd have to go digging, sorry I don't have the time right now. It was to do with piracy on the OG X-Box. It wasn't the main part of the case, just a tangential point inside the judge's ruling.

Downloading a game to play it would be copyright infringement. Downloading involves making a copy on your device. However one copy really isn't worth the hassle of claiming against, so it never happens. This is why all the Napster cases inflated the counts of infringement by including everyone you connected to as if you had uploaded a complete copy to them, that's the only way to make the claim worthwhile. Also in the US uploading to someone carried punitive damages, similar cases didn't work so well in the UK with actual damages.

Downloading it to study is fair use under the research exemption, particularly if it's a non-commercial activity.

Copyright infringement happens all the time, but the vast majority of cases aren't worth prosecuting, and there's no penalty for a rights holder not to prosecute. Meanwhile, with Trademarks, the rights holder absolutely can lose their rights if they don't prosecute every infringement they become aware of.