this post was submitted on 25 Aug 2024
316 points (97.9% liked)

Fuck AI


"We did it, Patrick! We made a technological breakthrough!"

A place for all those who loathe AI to discuss things, post articles, and ridicule the AI hype. Proud supporter of working people. And proud booer of SXSW 2024.

founded 7 months ago

Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI model.

The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups.

A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.

While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.

Earlier this year, Mark Zuckerberg, Meta’s cofounder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.

all 30 comments
[–] riodoro1 77 points 2 months ago (3 children)

Fuck the planet, we need another one of those useless chatbots.

[–] [email protected] 42 points 2 months ago

Just another billion parameters bro! I swear if we add another billion it'll fix everything!

[–] [email protected] 15 points 2 months ago

the chatbots are there for them to pretend they're doing something useful for the end user, instead of just building an ever more detailed digital profile of each individual, with thousands of data points, in order to separate you from your money

[–] [email protected] 4 points 2 months ago* (last edited 2 months ago)

Of course we do. A normal customer support agent of a random e-shop wouldn't write me a python script to send an email alert if my raspberry pi overheats!
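For what it's worth, that kind of script really is only a few lines. A minimal sketch (the threshold, the two addresses, and the local SMTP server are all placeholder assumptions; a real Pi exposes the SoC temperature at the sysfs path shown):

```python
import smtplib
from email.message import EmailMessage

THRESHOLD_C = 75.0  # alert threshold in Celsius; tune to taste

def millideg_to_c(raw: str) -> float:
    """Convert the millidegree string sysfs reports to Celsius."""
    return int(raw.strip()) / 1000.0

def read_cpu_temp_c() -> float:
    # On a Raspberry Pi the SoC temperature lives in sysfs.
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        return millideg_to_c(f.read())

def send_alert(temp_c: float) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"Pi overheating: {temp_c:.1f} C"
    msg["From"] = "pi@example.com"   # placeholder sender
    msg["To"] = "you@example.com"    # placeholder recipient
    msg.set_content(
        f"CPU temperature is {temp_c:.1f} C (threshold {THRESHOLD_C} C)."
    )
    with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA
        smtp.send_message(msg)

def check_and_alert() -> None:
    temp = read_cpu_temp_c()
    if temp > THRESHOLD_C:
        send_alert(temp)
```

Run `check_and_alert()` from cron every minute or so and you have the alert the chatbot wrote.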

[–] eskimofry 43 points 2 months ago (2 children)

These hypocritical assholes don't want people accessing their own data on their websites and lay claim to it themselves. Now they want to steal others' data.

It would make my day if they got sued into oblivion for data theft.

[–] [email protected] 15 points 2 months ago (1 children)

Not just data theft, but selling stolen goods (more or less).

They're stealing content and using that to build a service that they sell and profit from.

[–] [email protected] 2 points 2 months ago

They do open source the model at least, which is more than you can say for any of the other major companies doing AI

[–] [email protected] 43 points 2 months ago* (last edited 2 months ago) (1 children)

Ugh, fuck these and their tech bro creators so much. Not only is "AI" enshittifying everything it touches, it's even passively fucking up things it can't touch.

With the line needlessly blurring between search engines and LLMs, and sites rightfully blocking AI scraper bots, I fully believe we're on the cusp of a digital dark age. If you think search engines suck now, just wait until very little of the quality content on the internet is indexable because people don't want it scraped for training data. Or if it is indexed, the actual content is locked up, requiring registration or otherwise no longer being easily accessible.

These "AI" tech bros are basically strip mining the internet while shitting where they eat (and maybe also pissing in the pool if I haven't mixed enough metaphors for your liking). They're exploiting what makes the internet great while simultaneously ruining it for the future.

For as long as search engines have existed, we had a deal going: search providers could crawl and index site data and show ads to support themselves and in exchange, sites gained visibility. Now they're using those same scrapers to steal content for their own purposes while depriving the sources of traffic. They have broken the deal, and with it, the fundamental way the internet has worked for over 30 years.

I say it again: Fuck these AI-pushing tech bros and the horses they rode in on.

[–] [email protected] 9 points 2 months ago* (last edited 2 months ago)

strip mining the internet

That's such a wonderfully succinct way to describe the arc of tech companies over the last decade and a half.

And even earlier than that, I miss the days of actually "surfing" the net. Start on one page you know and get farther and farther down into webrings and personal pages linking to each other. Could really find some awesome things tucked away way back when.

[–] [email protected] 34 points 2 months ago* (last edited 2 months ago) (1 children)

The AI cat is out of the bag. How do they know they’re not feeding AI generated garbage into their models?

Actually I think I'm gonna go into my personal website and add 200 pages of locally generated LLM garbage, with hidden links to those pages that only bots should follow.

[–] [email protected] 7 points 2 months ago (1 children)

How do they know they’re not feeding AI generated garbage into their models?

They don't. Any popular place on the internet which lets users type text for people to publicly view is now full of AI trash. They've fucked it, this shit is just gonna spiral into progressively worse garbage

[–] [email protected] 2 points 2 months ago

They screwed the artificial pooch in a manner of speaking.

[–] GarrulousBrevity 13 points 2 months ago* (last edited 2 months ago) (1 children)

Does that mean this new bot is ignoring sites' robots.txt files? The Internet works because of web crawlers, and I'm not sure how this one is different

Edited to add: Apparently one would need to add Meta-ExternalAgent to their robots file unless they had a wildcard rule, so this isn't as widely blocked by virtue of being new. Letting it run for a few months before letting anyone know it exists is kinda shady.
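For reference, blocking it explicitly looks like this in a robots.txt (the GPTBot entry is shown only for comparison; an existing wildcard `User-agent: *` Disallow rule would already have covered the new bot):

```
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: GPTBot
Disallow: /
```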

[–] [email protected] 7 points 2 months ago (1 children)

Crawling the web has fuck all to do with the function of the internet. Most crawlers range from useless at best to downright disrespectful.

[–] GarrulousBrevity 1 points 2 months ago* (last edited 2 months ago) (1 children)

Have you used a search engine? Crawlers are not generative AI.

[–] [email protected] 7 points 2 months ago* (last edited 2 months ago) (1 children)

The internet is not a search engine, and no - search engines are not generative ai. That's new.

Do you have any idea how many content bot crawlers there are? Most of the corporate sites I host at work are serving content to bots more than half the time.

Do you know AltaVista still has bots??

When was the last time you used that search engine?

[–] GarrulousBrevity -1 points 2 months ago (2 children)

I guess I don't really see the problem with that though. There are configuration levers you could be pulling, but those sites you're hosting are not. There are lots of shady questions about how these models are getting training data, but crawlers have a well defined opt out mechanism.

The web would not be what we know it as without them, because it's how you find sites. Why shouldn't AltaVista have one? I don't object to what AltaVista does with the data.

[–] [email protected] 6 points 2 months ago (1 children)

Mate we have absurdly restrictive robots.txt including a custom WordPress plugin that automatically generates the file and the bots don't give a fuck.

[–] GarrulousBrevity 0 points 2 months ago

But Meta's will, and AltaVista's. I'm not angry at them when a script kiddie makes a bad crawler

[–] [email protected] 2 points 2 months ago (1 children)

Anybody who thinks "well defined opt out mechanisms" are good has no clue how "consent" works.

[–] GarrulousBrevity 1 points 2 months ago (1 children)

I know what you're trying to say, but that phrasing though. Being able to opt out is an important part of consent. No means no, man.

[–] [email protected] 1 points 2 months ago (1 children)

Even more important to being able to opt out, however, is not having to opt out in the first place.

Otherwise you get this script:

Wanna fuck?

No.

How about now?

No.

It's five minutes later. Have you changed your mind?

No.

. . .

Which is exactly what techbrodudes have been doing to us by making us "opt out" of things a dozen times a day.

[–] GarrulousBrevity 1 points 2 months ago (1 children)

I think of this as a problem with opt-in only systems. Think of how sites ask you to opt in to allow tracking cookies every goddamn time a page loads. A rule based system which lets you opt in and opt out, like robots.txt, to let you opt out of cookie requests and tell all sites to fuck off would be great.

@[email protected] is complaining about malicious instances of crawlers that ignore those rules (assuming they're right and that the rules are set up correctly), and lumping that malware in with software made by established corporations. However, Meta and other big tech companies haven't historically had a problem with ignoring configurations like robots.txt. They have had issues with using the data they scrape in ways different from what they claimed they would, or with scraping data from a site that does not allow scraping by coming at it via a URL on a page that was legitimately scraped, but that's not the kind of shenanigans this article is about, as Meta is being pretty upfront about what they're doing with the data. At least after they announced it existed.

An opt-in only solution would just lead to a world where all hosts were being constantly bombarded with requests to opt in. My major takeaway from how Meta handled this is that you should configure any site you own to disallow any action from bots you don't recognize. As much as Reddit can fuck off, I don't disagree with their move to change their configuration to:

User-agent: *
Disallow: /
[–] [email protected] 1 points 2 months ago (1 children)

My take is that if never-ending opt-in requests are a pest, perhaps people should stop doing the pesky activities.

Let's move this from the digital world (where people seem to get easily confused on topics of consent) into the physical. Remember the good old days of door-to-door salesmen? (Probably not. I only barely remember them and I'm likely far older than you.) In any case you had some twat interrupting your daily/evening tasks, your family time, your sleep, etc. all so they could sell you some shit you didn't want. They got so obnoxious that regulations had to be put in place to control them: what time they could arrive, what things they could say, what tactics they could or could not use (the old "foot in the door" shit), etc. Finally, over time, people would put up aggressive signs about sales (which salesmen would cheerily ignore, rather like this robots.txt thing), buy dogs to frighten them off, etc.

And this was being done by people selling the products of "established corporations". When taken to task for it they'd throw the salesmen under the bus, claiming that the tactics used were not countenanced by them (but the fact that their sales targets practically mandated this was quietly left unspoken). "Established corporations" are no more prone to ethical behaviour, or indeed even basic social behaviour, than are small agents. It's just that in this day and age when they commit an ethical breach (like Google's camera trucks siphoning personal data that time) it's an 'accident' or 'just some bad apples' and so on.

The reality is that Meta can be trusted as far as you can throw it. Which is to say zero distance. As can Google, Microsoft, anything Elon Musk foists on us at any point, etc. etc. etc. And this whole "opt out" bullshit is how they get away with being antisocial shits.

... but that’s not the kind of shenanigans this article is about, as meta is being pretty upfront about what they’re doing with the data. At least after they announced it existed.

Uh ... Meta is being pretty upfront about what they're doing after they ran it a while and siphoned off the stuff they wanted. This is not the pass you seem to think it is.

[–] GarrulousBrevity 1 points 2 months ago (1 children)

Oh, no, that wasn't excusing Meta in general. Just giving them a pass on that they've had, to my knowledge, a history of respecting robots.txt, which makes this piece of software better than outright malware. Starting it secretly and not giving site hosts a chance to make sure they had their privacy configured the way they liked first was a shady as hell move, no argument there.

[–] [email protected] 1 points 2 months ago

I don't know I'd call it "respecting robots.txt" if you don't tell people that your robot even exists. Basically if you don't just automatically block any and all robots (and then watch many of them cheerfully ignore you), this is an end-run around user desires.

Which is why I give these institutions the same moral regard I give door-to-door salesmen, telemarketers, slug slime, and other moral vacuums.

[–] [email protected] 12 points 2 months ago

Mega wealthy tech oligarchs hate human beings. They want to replace us all with processes they can kill with fewer problems.

[–] werefreeatlast 5 points 2 months ago (1 children)

We need an automated text generator with generic sentences. Bunch up all the dictionary words grouped by type and then make absolutely nonsensical but valid sentences. Keep updating as often as the AI bots visit. Add questions and fake answers about random images. And we could do the same thing with books. Download volumes from Google, change the meaning of various words and rehash the same big texts with all the wrong stuff. Like everything is correct except for the word "the", now written with a k in place of the h... tke. Tke story about tke cat in tke hat. Then write another big book with the same thing but a different topic... tke excelsior returns!
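A minimal sketch of that kind of word mangler in Python (the "the" → "tke" swap is the commenter's own example; the function name and everything else here are assumptions for illustration):

```python
import re

def mangle(text: str, word: str = "the", old: str = "h", new: str = "k") -> str:
    """Replace one letter inside every occurrence of a target word,
    preserving the capitalisation of the first letter."""
    swapped = word.replace(old, new)  # "the" -> "tke"

    def sub(m: re.Match) -> str:
        w = m.group(0)
        # Keep a leading capital so "The" becomes "Tke", not "tke".
        return swapped.capitalize() if w[0].isupper() else swapped

    # \b...\b so only the whole word is hit, not e.g. "theory".
    return re.sub(rf"\b{word}\b", sub, text, flags=re.IGNORECASE)

print(mangle("The story about the cat in the hat"))
# Tke story about tke cat in tke hat
```

Feed pages of that to a crawler and a human reader can still decode it, while any model trained on it learns the wrong spelling.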

[–] [email protected] 4 points 2 months ago

I wonder if you could do a ton of letter swaps to make things look misspelled, but then provide a custom font that also swaps the glyphs around. A human would read the normal text, but if you changed the font to a normal one you'd see what an AI would see, i.e. garbage.

Probably not very practical though. Copy-pasting from your website would break for example.