this post was submitted on 20 Feb 2024

ilinamorato 9 points 9 months ago (last edited 9 months ago)

It's ludicrously cheap for the size and quality of the dataset. A set of 829 academic papers from the University of Michigan is priced at $25,000, about 1/2400 of this sale. If you were to scale that dollar value up to the price of the Reddit dataset, you'd expect the Reddit dataset to contain about 2 million academic papers' worth of data.
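For anyone who wants to check that scaling, here's a rough sketch of the arithmetic. The paper count, the $25,000 price, and the 1/2400 ratio come from the comment; the sale price is only what that ratio implies, not a figure stated here, and the variable names are just for illustration.

```python
# Figures quoted in the comment.
UMICH_PAPERS = 829
UMICH_PRICE_USD = 25_000
SALE_RATIO = 2_400  # the U of M price is "about 1/2400 of this sale"

# Assumption: the sale price is back-calculated from the ratio above.
implied_sale_usd = UMICH_PRICE_USD * SALE_RATIO

# Price per paper in the U of M set, then how many papers the sale would buy at that rate.
price_per_paper = UMICH_PRICE_USD / UMICH_PAPERS
equivalent_papers = implied_sale_usd / price_per_paper

print(f"implied sale price: ${implied_sale_usd:,}")
print(f"price per paper:    ${price_per_paper:,.2f}")
print(f"equivalent papers:  {equivalent_papers:,.0f}")
```

The last figure lands just under 2 million, which is where the "about 2 million papers" estimate comes from.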

But Reddit has almost two decades of text written by 200 million chronically online people. Sure, most Reddit users probably don't write an academic paper's worth of content every year, but the average is probably closer to that than not, especially when you consider that subreddits like AskHistorians and AskScience really are generating the equivalent of dozens of academic papers per day. Based on the amount of text alone, Reddit should've sold us out for 50-100x what they got for just a single year of data, and 1000-2000x for the full twenty years (though, granted, they didn't have that much data for that entire time, so let's say half that).

Furthermore, those 829 papers in the U of M dataset are disconnected, unlinked text representing a tiny fraction of what U of M's 50,000 students generate in even a single year. Reddit has data with links, images, conversational responses, prompt responses, Q&As, flash fiction, slash fiction, historical deep-dives, investigations, memes, inside jokes, a development of style and consensus over time, and a comprehensive understanding of what it means to interact online, generated by people around the world over the course of 18 years. It's much better data for almost any LLM purpose that isn't just writing academic papers from the perspective of students at a medium-sized 4-year undergrad institution in the Midwestern US. The quality of the dataset should've made the value even higher. It's hard to say exactly how much higher, but let's just be extremely conservative and say it should have doubled the total.

That means that, conservatively, the value of Reddit's dataset—or, rather, our dataset, which Reddit freebooted from us—was about 1000x what they were paid, based on the proportional value of the U of M dataset.
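Here's a back-of-the-envelope sketch of how those multipliers stack up. The 200 million users, 20 years, the halving, and the quality doubling are from the comment. Reading "closer to that than not" as 0.5 to 1 paper per user per year is my assumption, and the sale price is again just what the 1/2400 ratio implies.

```python
# Figures quoted in the comment.
SALE_EQUIVALENT_PAPERS = 2_000_000  # "about 2 million academic papers' worth"
USERS = 200_000_000
YEARS = 20
IMPLIED_SALE_USD = 25_000 * 2_400   # assumption: back-calculated from the 1/2400 ratio

# Assumption: 0.5 to 1 paper's worth of text per user per year.
PAPERS_PER_USER_YEAR = (0.5, 1.0)


def multiplier(rate: float) -> float:
    """How many times the sale price the data would be worth at this writing rate."""
    one_year = USERS * rate / SALE_EQUIVALENT_PAPERS  # volume for a single year
    all_years = one_year * YEARS                      # scaled to the full twenty years
    halved = all_years / 2                            # less data in the early years
    return halved * 2                                 # doubled for dataset quality


low, high = (multiplier(rate) for rate in PAPERS_PER_USER_YEAR)
print(f"multiplier range: {low:,.0f}x to {high:,.0f}x")
print(f"implied value at the low end: ${IMPLIED_SALE_USD * low:,.0f}")
```

Under those assumptions the low end of the range is the "about 1000x" figure, so the conservative estimate already sits in the tens of billions.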

They should've sold us out for billions.

Of course, we don't know what exclusivity terms or subset of the data this deal might have included. It might only be one year of data, and only 6 months of exclusivity. But assuming they sold the rights to the entire dataset, we got sold for pennies.