I made a little guide on blocking scrapers from your site that I hope will help
"We did it, Patrick! We made a technological breakthrough!"
Again, it must be stressed that this gives a false sense of security. You can only block crawlers that identify themselves, or block by IP using a list of known offenders. That list only contains crawlers that have already offended, and they can simply change their IP address.
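For anyone wondering what those two kinds of blocking look like, here is a minimal sketch in Python. The token list and the network range are placeholders I picked for illustration (the range is a reserved documentation range), and `should_block` is a hypothetical helper, not something from the guide. It has exactly the weakness described above: a crawler that lies about its User-Agent and rotates IPs passes straight through.

```python
import ipaddress

# Placeholder tokens for crawlers that announce themselves; extend from your own logs.
BLOCKED_AGENT_TOKENS = ("GPTBot", "CCBot", "ClaudeBot")

# Placeholder block list; 203.0.113.0/24 is a reserved documentation range.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def should_block(user_agent: str, remote_addr: str) -> bool:
    """Return True if the request matches a declared crawler or a listed network."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in BLOCKED_AGENT_TOKENS):
        return True
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in network for network in BLOCKED_NETWORKS)
```

A request handler would call `should_block(...)` with the User-Agent header and the client address and return 403 (or something nastier) when it is true.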
You can't block them, but you can make their life harder. Return 200 OK where you would normally return 404 Not Found, so malicious bots doing drive-by probes for random URLs like /admin will think they found something. Make honeypots that redirect and loop, filled with bait wordlists and forms that go nowhere. Poison the well. Deliberately serve incorrect, broken or AI-generated data to known bots.
Waste their time, instead of wasting your own time.
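A rough sketch of the "200 OK instead of 404" idea, using only the Python standard library. Everything named here (`REAL_PAGES`, `BAIT_WORDS`, `decoy_page`, the port) is made up for illustration, and this is one possible way to do it rather than anyone's actual setup. Unknown paths get a 200 response with a page of bait links that point deeper into more decoy paths, plus a form that goes nowhere.

```python
import html
from wsgiref.simple_server import make_server

# Hypothetical real content; anything not listed here is treated as a probe.
REAL_PAGES = {"/": b"<h1>Home</h1>"}

# Hypothetical bait wordlist; every link leads to yet another decoy page.
BAIT_WORDS = ["admin", "backup", "config", "login", "wp-admin"]


def decoy_page(path: str) -> bytes:
    """Build a fake page whose links all point further into the honeypot."""
    base = html.escape(path.rstrip("/"), quote=True)
    links = "".join(f'<a href="{base}/{word}">{word}</a>\n' for word in BAIT_WORDS)
    form = (
        '<form method="post" action="#">'
        '<input name="user"><input name="pass" type="password">'
        "</form>"
    )
    return f"<html><body>{links}{form}</body></html>".encode("utf-8")


def app(environ, start_response):
    """WSGI app: serve real pages normally, answer everything else with 200 and bait."""
    path = environ.get("PATH_INFO", "/")
    body = REAL_PAGES.get(path)
    if body is None:
        body = decoy_page(path)
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def serve(port: int = 8000) -> None:
    """Run a local test server; call this yourself to try it out."""
    with make_server("127.0.0.1", port, app) as server:
        server.serve_forever()
```

The catch is that legitimate visitors and search engines also stop seeing real 404s, so you may only want this behaviour for clients you already consider hostile.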
Yeah, that's true. If they're not using names, then there's not a whole lot you can do. And blocking IPs is impossible, because they use different IPs constantly.
With my post and your suggestions, this is the "something" that's better than doing absolutely nothing.
I suppose it comes down to being offensive or defensive. I don't think being defensive is worth my time. I'm not paying for bandwidth, and compute time is so cheap it's irrelevant, so I'm on the offensive. You can do both if you want. There are definitely more ready-to-go defensive solutions than offensive ones (your own article, for example), but I think tinkering and adapting my own solution is fun. It's like a game of cat and mouse, but they have money to lose and I don't.
Hmm, how would one attempt to actually do this in practice?
Eventually I'm gonna make a proper article about it, but what I'm doing right now boils down to this:
The next iteration of this will include a lot of uncompressed filler data, so hopefully the bots have to download half a gigabyte of data every time they do this. I'm not paying for bandwidth, so it doesn't matter to me.
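One way the filler idea could be sketched in Python is a generator that streams the payload in chunks, so the server never holds half a gigabyte in memory. The function name and chunk size are my own choices, and using random bytes is an assumption on my part: random data does not compress, so a proxy that applies gzip on the fly cannot shrink it.

```python
import os


def filler_chunks(total_bytes: int, chunk_size: int = 64 * 1024):
    """Yield random (incompressible) byte chunks until total_bytes have been produced."""
    remaining = total_bytes
    while remaining > 0:
        size = min(chunk_size, remaining)
        yield os.urandom(size)
        remaining -= size


# Half a gigabyte, as mentioned above.
HALF_GIGABYTE = 512 * 1024 * 1024
```

A WSGI app can return `filler_chunks(HALF_GIGABYTE)` directly as its response body, since WSGI accepts any iterable of bytes.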
See for yourself https://drkt.eu/fdhasklfh
I can see that it works by just looking at my access logs.
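If you would rather count than eyeball the logs, something like this could work. It assumes the common "combined" access log format, and `decoy_hits` is a hypothetical helper: it counts requests per client IP whose path starts with a given decoy prefix.

```python
import re
from collections import Counter

# Assumes the "combined" access log format; adjust the pattern if your server logs differently.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def decoy_hits(lines, decoy_prefix: str) -> Counter:
    """Count requests per client IP whose path starts with decoy_prefix."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(decoy_prefix):
            hits[match.group("ip")] += 1
    return hits
```

Pass it an open log file and the honeypot path, for example `decoy_hits(open("access.log"), "/fdhasklfh")`, where `access.log` stands in for wherever your server writes its log.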