this post was submitted on 29 Nov 2023
435 points (97.4% liked)

Technology

60081 readers
3446 users here now

This is a most excellent place for technology news and articles.


Our Rules


  1. Follow the lemmy.world rules.
  2. Only tech related content.
  3. Be excellent to each another!
  4. Mod approved content bots can post up to 10 articles per day.
  5. Threads asking for personal tech support may be deleted.
  6. Politics threads may be removed.
  7. No memes allowed as posts, OK to post as comments.
  8. Only approved bots from the list below, to ask if your bot can be added please contact us.
  9. Check for duplicates before posting, duplicates may be removed

Approved Bots


founded 2 years ago
MODERATORS
 

ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

you are viewing a single comment's thread
view the rest of the comments
[–] pntha 17 points 1 year ago (1 children)

how do we know the ChatGPT models haven’t crawled the publicly accessible breach forums where private data is known to leak? I imagine the crawler models would have some ‘follow webpage-attachments and then crawl’ function. surely they have crawled all sorts of leaked data online but also genuine question bc i haven’t done any previous research.

[–] [email protected] 9 points 1 year ago* (last edited 1 year ago)

We don't, but from what I've seen in the past, those sort of forums either require registration or payment to access the data, and/or some special means to download it (eg: bittorrent link, often hidden behind a URL forwarders + captchas so that the uploader can earn some bucks). A simple web crawler wouldn't be able to access such data.