this post was submitted on 21 Mar 2025
186 points (98.9% liked)

LLM scrapers are taking down FOSS projects' infrastructure, and it's getting worse.

all 28 comments
[–] [email protected] 2 points 6 hours ago

Can't we just filter them out by iptables rules?
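
For a rough sense of what per-IP filtering or rate limiting buys (whether done with iptables/ipset/hashlimit or in the application), here is a toy Python sketch of a per-source-IP token bucket; the names and numbers are illustrative, and the catch is exactly what the quote in the next comment describes: a crawler rotating through thousands of IPs never trips a per-IP limit.

```python
import time
from collections import defaultdict

# Toy per-source-IP token bucket, roughly what an iptables hashlimit
# rule does at the packet level. Illustrative only.
RATE = 1.0    # requests replenished per second
BURST = 20.0  # bucket size: short bursts allowed

buckets = defaultdict(lambda: (BURST, time.monotonic()))  # ip -> (tokens, last_seen)

def allow(ip: str) -> bool:
    tokens, last = buckets[ip]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)
    if tokens < 1.0:
        buckets[ip] = (tokens, now)
        return False   # over the limit: drop, 429, or tarpit
    buckets[ip] = (tokens - 1.0, now)
    return True

# A single noisy client gets throttled quickly...
print(sum(allow("198.51.100.7") for _ in range(100)))  # ~20 allowed
# ...but a crawler rotating through thousands of IPs never hits the limit.
```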

[–] [email protected] 62 points 1 day ago* (last edited 1 day ago) (1 children)

Wow, that was a frustrating read. I did not know it was quite that bad. Just to highlight one quote:

they don’t just crawl a page once and then move on. Oh, no, they come back every 6 hours because lol why not. They also don’t give a single flying fuck about robots.txt, because why should they. [...] If you try to rate-limit them, they’ll just switch to other IPs all the time. If you try to block them by User Agent string, they’ll just switch to a non-bot UA string (no, really). This is literally a DDoS on the entire internet.

[–] [email protected] 25 points 1 day ago (2 children)

the solution here is to require logins. thems the breaks unfortunately. it'll eventually pass as the novelty wears off.

[–] [email protected] 10 points 1 day ago (2 children)

Next you'll have to invest in preventing automated signups

[–] [email protected] 5 points 1 day ago (1 children)

Signups on most platforms are already quite hard. Either you hand over your phone number straight away and do SMS verification, or at minimum you give an email address, and registering that email will require a phone number anyway. Captchas have become so hard nowadays that even humans struggle with them, and it often takes multiple attempts to get one right.

[–] [email protected] 3 points 1 day ago (1 children)

Provide a phone number to look at this FOSS project's website? Not too sure about that.

[–] [email protected] 3 points 23 hours ago

Honestly if any site demands my phone number it can get fucked.

[–] [email protected] 1 points 1 day ago

not really, just tie it with 2fa SMS style and the hurdle is large enough most companies won't bother.

[–] [email protected] 5 points 1 day ago (2 children)

Alternative: require a proof of work calculation.
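
For context, this is the idea in a nutshell, as a toy Python sketch (not Anubis's actual scheme): the server hands out a random challenge, the client must find a nonce whose SHA-256 hash has enough leading zero bits, and the server verifies with a single hash, so the cost lands on the client.

```python
import hashlib
import secrets

DIFFICULTY_BITS = 18  # avg ~260k hashes for the client, a single hash to verify

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def new_challenge() -> str:
    return secrets.token_hex(16)  # server: random, single-use challenge

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce (the expensive part, a second or two here)."""
    nonce = 0
    while leading_zero_bits(hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()) < DIFFICULTY_BITS:
        nonce += 1
    return nonce

def verify(challenge: str, nonce: int) -> bool:
    """Server side: one hash, essentially free."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

challenge = new_challenge()
nonce = solve(challenge)
assert verify(challenge, nonce)
```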

[–] [email protected] 2 points 1 day ago (1 children)

This is exactly what we need to do. You'd think that a FOSS WAF exists out there somewhere that can do this

[–] [email protected] 2 points 1 day ago (2 children)

There is. The screenshot you see in the article is of a brand-new one, Anubis.

[–] [email protected] 3 points 1 day ago

Yeah, I realised that after posting. I think we need a better one that handles letting legitimate users in more easily, though.

[–] [email protected] 1 points 1 day ago

It kind of sucks but it is the best we have for the moment

[–] [email protected] 0 points 1 day ago

Make them mine a BTC block in the browser!


^Sorry, I'm low in blood and full of mosquito vomit. That's probably making me think weird stuff.^

[–] [email protected] 5 points 1 day ago (1 children)

What's confusing the hell out of me is: why are they bothering to scrape the git blame page? Just download the entire git repo and feed that into your LLM!

9 times out of 10 the best solution is to block nonresidential IPs. Residential proxies exist, but they're far more expensive than cloud proxies and the providers will ask questions; they're sketch AF and basically guarded like munitions. Some rookie LLM maker isn't going to figure that out.
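
Roughly what that looks like, as a hedged Python sketch: the CIDRs below are documentation placeholders standing in for the datacenter ranges that providers actually publish (e.g. AWS's ip-ranges.json), and the handler is made up for illustration.

```python
import ipaddress

# Stand-in CIDRs (documentation ranges). A real deployment would load the
# ranges that cloud providers publish, e.g. AWS's ip-ranges.json, plus ASN data.
DATACENTER_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

def is_datacenter(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in DATACENTER_NETWORKS)

def handle_request(client_ip: str, path: str) -> str:
    if is_datacenter(client_ip):
        return "403 Forbidden"           # or serve a proof-of-work challenge instead
    return f"200 OK {path}"

print(handle_request("203.0.113.7", "/blame/README.md"))  # 403 Forbidden
print(handle_request("192.0.2.10", "/blame/README.md"))   # 200 OK (not in the list)
```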

Anubis also sounds trivial to beat. If it's just crunching numbers and not attempting to fingerprint the browser, then it's just a case of feeding the page into Playwright and moving on.

[–] [email protected] 3 points 1 day ago* (last edited 1 day ago) (1 children)

I don't like the approach of banning nonresidential IPs. I think it's discriminatory and unfairly blocks out corporate/VPN users and others we might not even be thinking about. I realize there is a bot problem but I wish there was a better solution. Maybe purely proof-of-work solutions will get more popular or something.

[–] [email protected] 0 points 21 hours ago (1 children)

Proof of Work is a terrible solution because it assumes computational costs are a significant expense for scrapers compared to proxy costs. It'll never come close to costing the same as residential proxies, and meanwhile every smartphone user will be complaining about your website draining their battery.

You can do something like only challenging data center IPs, but you'll have to do better than proof of work. Canvas fingerprinting would work.

[–] [email protected] 1 points 6 hours ago

Proof of Work is a terrible solution

Hard disagree, because:

it assumes computational costs are a significant expense for scrapers compared to proxy costs

The assumption is correct. PoW has been proven to significantly reduce bot traffic... meanwhile the mere existence of residential proxies has caused an explosion in the availability of easy bot campaigns.

Canvas fingerprinting would work.

Demonstrably false... people already do this with abysmal results. Need to visit a clownflare site? Endless captcha loops. No thanks

[–] [email protected] 18 points 1 day ago

This is the craziest read on the subject in a while. Most articles just talk about hypothetical issues of tomorrow, while this one is actually full of today's problems, and even puts costs on those issues in numbers and hours of pointless extra work. Had no idea it's already this bad.

[–] [email protected] 6 points 1 day ago (1 children)

How much do you wanna bet that at least part of this traffic is Microsoft just using other companies' infrastructure to mask the fact that it's them?

[–] [email protected] 3 points 1 day ago

I doubt it since Microsoft is big enough to be a little more responsible.

What you should be worried about is the fresh college graduates with 200k of venture capital money.

[–] [email protected] 5 points 1 day ago

Sometimes, I hate humanity.

[–] [email protected] 2 points 1 day ago (2 children)

I'm perfectly fine with Anubis but I think we need a better algorithm for PoW
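
One thing "a better algorithm" usually means here is a memory-hard function, so a scraping farm can't grind challenges much more cheaply than a visitor's phone. A toy Python sketch using the standard library's scrypt, with untuned demo parameters:

```python
import hashlib
import secrets

# Toy memory-hard proof of work: find a nonce whose scrypt output falls
# below a target. The N/r parameters force every attempt to touch a few
# MiB of memory, which narrows the gap between a GPU farm and a phone.
# These are demo values; solving takes a few seconds.
N, R, P = 2**12, 8, 1      # cost, block size, parallelism (~4 MiB per call)
TARGET = 2**248            # out of 2**256: roughly 1 nonce in 256 succeeds

def attempt(challenge: bytes, nonce: int) -> bool:
    digest = hashlib.scrypt(nonce.to_bytes(8, "big"), salt=challenge,
                            n=N, r=R, p=P, dklen=32)
    return int.from_bytes(digest, "big") < TARGET

def solve(challenge: bytes) -> int:
    nonce = 0
    while not attempt(challenge, nonce):   # client: hundreds of memory-hard calls
        nonce += 1
    return nonce

def verify(challenge: bytes, nonce: int) -> bool:
    return attempt(challenge, nonce)       # server: one memory-hard call per check

challenge = secrets.token_bytes(16)
assert verify(challenge, solve(challenge))
```

The trade-off versus a plain SHA-256 scheme is that verification is no longer free: the server pays one memory-hard call per visitor.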

[–] [email protected] 1 points 1 day ago* (last edited 1 day ago) (1 children)

Tor has one now

Maybe it can be reused for the clearnet.

[–] [email protected] 1 points 1 day ago (1 children)
[–] [email protected] 1 points 1 day ago* (last edited 1 day ago) (1 children)

And Tor itself

It is part of the denial of service protection

[–] [email protected] 1 points 1 day ago

That's neat