this post was submitted on 29 Jun 2023

236 points (96.8% liked)

A lawsuit claims OpenAI stole 'massive amounts of personal data,' including medical records and information about children, to train ChatGPT (www.businessinsider.com)

submitted 2 years ago by L4s to c/technology

47 comments

The lawsuit alleges OpenAI crawled the web to amass huge amounts of data without people's permission.

top 45 comments

[–] Hick 86 points 2 years ago (4 children)

Scraping social media posts and reddit posts doesn’t sound like stealing, they’re public posts.

[–] SamB 30 points 2 years ago (2 children)

I doubt it’s only about some Reddit posts. The scraping was done on the whole web, capturing everything it could. So besides stealing data and presenting it as its own, it seems to have collected some even more problematic data which wasn’t properly protected.

[–] zekiz 23 points 2 years ago (4 children)

But that really isn't OpenAI's fault. Whoever was in charge of securing the patients' data really fucked up.

[–] [email protected] 24 points 2 years ago

Leaving your front door open isn't prudent but doesn't grant permission to others to enter and take/copy your belongings or data.

The security teams may have royally screwed up, but OpenAI has a legal obligation to respect copyright and laws regarding data ownership.

Likewise, they could have scraped pages that included terms of use, copyright, disclaimers, etc., and failed to honor them.

All parties can be in the wrong for different reasons.

[–] almar_quigley 14 points 2 years ago (4 children)

That’s like saying you didn’t lock your front door so whoever robs you is innocent.

[–] [email protected] 6 points 2 years ago (1 children)

I think it's a little closer to being mad that the Google street car drove by and snapped a picture of the front of your house, tbh.

[–] almar_quigley 1 points 2 years ago

Except pii and spi are protected under law, just like your possessions.

[–] Dran_Arcana 6 points 2 years ago (1 children)

But does leaving your front door open allow one to legally take a picture of the inside from across the street? I'd say scraping is more akin to that than it is theft. Nothing is removed in scraping, just copied

[–] BradleyUffner 2 points 2 years ago

Bad analogy. This is like leaving your couch out on the sidewalk, then complaining when someone takes a picture of it.

[–] zekiz 5 points 2 years ago

It's more like leaving an important letter in the open for everyone to read. It's certainly your fault for leaving it that open.

[–] MercuryUprising 2 points 2 years ago

Yeah, but what were all these people whose data was scraped wearing?

[–] [email protected] 7 points 2 years ago (1 children)

It’s certainly their fault that they used it, though.

If they cared, they could have ensured they weren’t using sensitive or otherwise highly problematic information, but they chose not to. That’s on them.

[–] MercuryUprising -3 points 2 years ago (2 children)

It's called "disrupting" the established norms. You wouldn't get it because you're not on the bleeding edge of a revolutionary platform that's seeing scalable vertical growth due to its paradigm shift.

[–] [email protected] 4 points 2 years ago (1 children)

You forgot to mention something about blockchain

[–] assassin_aragorn 1 points 2 years ago

I can't see AI as anything but the next crypto. It seems incredibly overhyped to me

[–] [email protected] 3 points 2 years ago

My sarcasm detector is making strange noises. We may have a false positive here!

[–] [email protected] 1 points 2 years ago

They certainly fucked up, but it might well be OpenAI's fault too.

[–] tallwookie 8 points 2 years ago (1 children)

if it was unsecured it's basically public. whoever put that data on a publicly accessible server is at fault

[–] [email protected] 10 points 2 years ago* (last edited 2 years ago) (2 children)

That's not necessarily true. Even if a company makes the mistake of not securing data correctly, those that make use of this data can still be at fault.

If a company leaves a server wide open, you still can't legally steal information from it.

[–] tallwookie 1 points 2 years ago

that's kind of a grey area - digitally copying something that's public domain isn't stealing.

[–] [email protected] 0 points 2 years ago

> If a company leaves a server wide open, you still can’t legally steal information from it.

I don't see how this is any different than if Google search included text from a page that shouldn't be public.

[–] [email protected] 3 points 2 years ago

@Hick I have one problem with that when it comes to generative AI. It's similar to when Microsoft trained Copilot on GitHub data. Of course it was open source code, and it was on Microsoft's servers, but before this AI revolution nobody could have expected that someone would be able to create such a tool. I mean, we randomly leave our DNA in many different places, but does that mean we agreed to be cloned once the technology that makes it possible arrives?

@L4s

[–] sudneo 2 points 2 years ago (1 children)

This is not just scraping, though. It is also using that data to create other content and potentially to re-publish that data (we have no way of knowing whether ChatGPT will spit out any of it, or where it took what it spits out from).

The expectation that social media data will be read by anybody is fair, but the fact is that the data has been written to be read, not to be resold and published elsewhere too.

It is similar for blog articles. My blog is public and anybody can read it, but that data is not there to be repackaged and sold. The fact that something is public does not mean I can do whatever I want with it.

[–] seasick 5 points 2 years ago (1 children)

I could read your blog post and write my own blog post, using yours as inspiration. I could quote your post, add a link back to your blog post and even add affiliate links to my blog post. I could be hired to do something like that for the whole day.

[–] sudneo 6 points 2 years ago

ChatGPT doesn't get inspired. The process is different, and it could very well spit out the content verbatim. You can do all the rest (depending on the license) without issues, but once again this is not what ChatGPT does, as it doesn't provide attribution.

It's exactly the same with software, in fact.

[–] nH95sp 19 points 2 years ago (3 children)

Likely an unpleasant or possibly infeasible thing to implement, but designing the AI to always be able to “show the receipts” for how it’s formulating any given response could potentially be helpful. I suppose that could cause a sort of micro-royalties industry to crop up for the sourced data being used, akin to movies or TV using music and paying royalties.

[–] [email protected] 13 points 2 years ago (1 children)

The way generative AI works is by using things called "tokens". Usually 1 word == 1 token, but compound words would be 2 tokens, punctuation would be a token, things like "-ed" or "-ing" could be tokens, etc.

When you give an AI a prompt, it breaks your prompt down into tokens. It then finds what tokens were statistically most likely to appear near that content and gives them as a response.
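
To make those two steps concrete, here is a toy sketch in Python. It is only an illustration under heavy assumptions: the tokenizer (words and punctuation marks), the four-sentence corpus, and the helper names `tokenize`, `train_bigrams` and `most_likely_next` are all made up for this example, and a real system like ChatGPT uses a learned subword tokenizer and a large neural network rather than a lookup table of counts.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Toy tokenizer (an assumption for this sketch): each run of letters is
    # one token, and every other non-space character is its own token.
    return re.findall(r"[a-z]+|[^\sa-z]", text.lower())


def train_bigrams(corpus):
    # Count how often each token is immediately followed by each other token.
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = tokenize(sentence)
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def most_likely_next(counts, token):
    # Return the token seen most often right after `token`, or None if the
    # token never appeared in the training text.
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Made-up training text, purely for illustration.
corpus = [
    "The dog barks.",
    "The dog sleeps.",
    "The dog barks loudly.",
    "The cat sleeps.",
]
model = train_bigrams(corpus)

print(tokenize("The dog barks."))
print(most_likely_next(model, "the"))
print(most_likely_next(model, "dog"))
```

Counting which token follows which is about the simplest version of "statistically most likely to appear near that content"; the layered approach described in the next paragraph replaces the count table with something far more flexible.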

This has been the approach for a while, but the modern breakthroughs have come from layering AIs inside of each other. So in our example, the first AI would give an output. Then a second AI would take that output and apply some different rules to it - this second AI could have a different idea of what a "token" is, for example, or it could apply a different kind of statistical rule. This could be passed to a third AI, etc.

You can "train" these AI by looking at their output and telling it if it was good or bad. The AIs will adjust their internal statistical models accordingly, giving more weight to some things and less weight to others. Over time, they will tend towards giving results that the humans overseeing the AI say are "good". This is very similar to how the human brain learns (and it was inspired by how humans learn).

Eventually, the results of all these AI get combined and given as an output. Depending on what the AIs were trained to do, this could be a sentence/response (ChatGPT) or it could be a collection of color values that humans see as a picture (DALL-E, Midjourney, etc.).

Because there are so many layers of processing, it's hard to say "this word came from this source." Everything the AI did came from a collection of experiences, and generally as long as the training data was sufficiently large you can't really pinpoint "yeah it was inspired by this." It's like how when you think of a dog, you think of all the dogs you've experienced in your lifetime and settle on one idea of "dog" that's a composite of all those dogs.

Interestingly, you can sometimes see some artifacts of this process where the AI learned the "wrong" thing. One example: if you asked an AI what 3 + 4 is, it knows from its experiences that statistically it should say "7". Except people started doing things like asking for what "Geffers + HippoLady" was, and the bot would reply "13", consistently.

It seemed there were these random tokens that the bot kept interpreting as numbers. Usually they were gibberish, but sometimes you could make out individual words being treated as 1 token despite being 2 separate words.

It turned out that if you googled these words, you'd get redirected to a subreddit - specifically /r/counting. The tokens were actually the usernames of people who contributed often to /r/counting. This is one way it was determined that the bot was training on Reddit's data - because these usernames appeared near numbers a lot, the bot assumed they were numbers and treated those tokens accordingly.

[–] nH95sp 3 points 2 years ago (1 children)

Such a detailed response, thank you for that. It walked the line well between keeping it fairly simple but still detailed to understand it.

Because of the complexity and “mystery box” nature of ai for me, it’s hard not to just allow it in my mind to just to consider it as another form of intelligence. But people dismiss it as not intelligent because they have a far better understanding of how the AI has been trained and also in how it came to the results it has. This, VS humans where you’re like “oh I know he went to college in the medical field” but you don’t have as intimate an understanding of how thoughts, ideas and responses are formed because of that obscured source information. Also of course, far more complex with the other impacting factors like chemicals in the body, sleep, mood, experiences and perceptions.

I guess this is a bit of a run on, but it still makes me wonder if it’s just a case of creating enough obscured understanding that allows for consciousness to be as accepted. Not saying that ChatGPT is like a genuine consciousness, but more that it’s the underpinnings or beginnings for something like that. But this is said as someone with absolutely no training in the medical field as well as the artificial intelligence field, so yeah.

Thanks again for your response.

[–] [email protected] 4 points 2 years ago (2 children)

it’s hard not to just allow it in my mind to just to consider it as another form of intelligence.

If it makes you feel better, that's probably a biological response that everyone has, to varying degrees. I heard the phrase "textual pareidolia", meaning that if we see text that looks human enough, we'll automatically put a human face on it and want to treat the author like an actual human. Even though it's just a way of creating sentences that mimics human language, and has no form of "intelligence" whatsoever. It has no idea what it's saying and does not understand the meaning of any of the words it's producing. But our lizard brain is still fooled because it sounds good enough. It's like seeing a face in the clouds or Jesus on burnt toast.

Even though I know what it's doing, and can "break" it by making it sound very not-human without too much effort, it's still hard for me not to end a chat session with "Thank you, have a nice day!"

[–] nH95sp 1 points 2 years ago

Interesting - kind of weird how in the visual realm there’s the uncanny valley, but I suppose that would be explained by how significant and instinctual vision has played a role in human evolution to detect faces/weird faces etc

[–] [email protected] 1 points 2 years ago

Throw in a bit of Cold Reading and it really feels better than it is, but honestly it’s a great tool if you understand the limitations.

[–] [email protected] 5 points 2 years ago* (last edited 2 years ago) (1 children)

When they train a neural net, all data that it has ever seen is an input to some degree in generating an output, because all inputs contribute to some degree in affecting edge weights, so the answer is "everything I've ever seen".
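
A tiny sketch of that point, with the caveat that it is a stand-in and not how any real generative model is trained: a "network" with a single weight `w`, fitted by plain gradient descent on three made-up data points. The function name `train`, the learning rate, and the data are all assumptions for the example. Dropping any one example and retraining gives a different final weight, so every example left a mark, yet the finished weight carries no label saying which example contributed what.

```python
def train(examples, lr=0.1, epochs=50):
    # Fit y ≈ w * x with one weight, nudging w a little after every example.
    w = 0.0
    for _ in range(epochs):
        for x, y in examples:
            error = w * x - y
            w -= lr * error * x  # gradient step for squared error
    return w


# Made-up training data, roughly following y = 2x.
examples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
full = train(examples)

# Retrain with each example left out and record how far the weight moves.
influence = {}
for i in range(len(examples)):
    reduced = examples[:i] + examples[i + 1:]
    influence[i] = full - train(reduced)

print(full)
print(influence)
```

With three examples you can afford to retrain once per example to measure its influence. With billions of documents and billions of weights that is not practical, which is a big part of why "show the receipts" is so hard.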

You are capable of learning higher-level structures and reasoning, and could form distinct memories and associate some memories with those higher-level structures, so in some cases you could remember and name an event that let you build up a piece of reasoning.

So, if you were asked "why did you ground yourself before touching that circuit board", you might say "well, when I was an undergrad, I fried a RAM chip by touching it without grounding myself".

The generative AIs out now are too primitive to reason like that. There's no logic being learned in the way you're thinking of. I guess the closest analog would be if your eyeballs were just wired directly to your cerebellum and had enormous numbers of pictures of flowers flashed at you, each with the word "flower" being said, and then someone said "flower" and recorded the kind of aggregate image that came to mind. All flowers contribute a bit to that aggregate image, but AI-you isn't remembering distinct events and isn't capable of forming logical structures and walking through a thought process.

In fact, if generative AIs could do that, we'd have solved a lot of the things that we want AIs to be able to do and can't today.

And even a human couldn't do that. If I said "think of a girl" to human-you, maybe you might think of a specific girl or might remember a handful of the individual girls you have seen in life and think that your abstract girl looks more like one than another. But that's still not listing all of the countless girls you have seen that have affected your mental image.

There will probably come a point where we build AIs that can remember specific events (though they won't have unique memory of every event they've seen any more than you do -- a lot of what intelligence does is choose "important" data and throw out the rest, and AIs don't record everything they experience in full any more than you do). And if they could learn to reason, then they might be able to assign specific events. They might misremember, just like a human could, but they could do something of a human's analog of remembering some events, forming logical thought processes, and trying to create some kind of explanation for what events were associated with that thought process. But all that is off in a future where we build something much more analogous to being as capable as a human.

[–] nH95sp 2 points 2 years ago

Right, and I suppose if you still tried to charge for use of references to source data, it would then be a weird slippery slope of weighting for which source data the AI was trained on first. How would you, say, bill for references to a circuit board if it was trained on things like dictionaries that include “circuit board” as well as, of course, more direct references to circuit boards in tech?

Guess it could be some weird percentage, but I don’t think I would welcome that reality

[–] [email protected] 2 points 2 years ago

Totally unfeasible given current methods, for better or for worse.

[–] 44swagnum 7 points 2 years ago* (last edited 2 years ago) (1 children)

"We have to protect the children"

[–] Protegee9850 1 points 2 years ago

The worst rush to legislation is done in the name of stopping terrorists and saving the children. Always.

[–] [email protected] 7 points 2 years ago

The lawsuit alleges OpenAI crawled the web to amass huge amounts of data without people's permission.

So who exactly is keeping people’s data where it can be easily accessed by a trawler without anybody’s permission? Maybe we should be paying attention to that just as well.

[–] Protegee9850 3 points 2 years ago

Scraping is protected. GPT and the like are more akin to fair use machines than plagiarism machines. This is a lot of hot air to go nowhere. Rage bait

[–] [email protected] 1 points 2 years ago

If they used private personally identifiable info, then they ought to be able to retrieve that info from GPT. What prompt would I need to get that person's info? If they can't name a prompt to use to get their info, their point is invalid.

[+] [email protected] -11 points 2 years ago (1 children)

I would rather an AI have that data, instead of any of the Demons of Google, Twitter, Youtube, etc. The AI won't abuse me the way those companies do on a daily basis.

[–] [email protected] 4 points 2 years ago (1 children)

@Craynak_Zero
But that's why they need this AI trained with an enormous amount of data. Such an AI can be much better at understanding how to keep you engaged, how to convince you to buy something, etc. As long as it's (not so) open AI connected with Microsoft, I'd say it's exactly the same.
@L4s
