this post was submitted on 28 Jan 2025
128 points (97.8% liked)

Pulse of Truth


Cyber Security news and links to cyber security stories that could make you go hmmm. The content is exactly as it is consumed through RSS feeds and won't be edited (except for the occasional encoding error).

This community is automagically fed by an instance of Dittybopper.


Attackers explain how an anti-spam defense became an AI weapon.

top 15 comments
[–] [email protected] 39 points 1 day ago* (last edited 1 day ago)

Attack

To me this looks like defense. If the site asks you not to scrape and you do it anyway, you are the attacker and deserve the garbage.

[–] [email protected] 18 points 1 day ago (2 children)

I can save you a lot of trouble, actually. You don't need all of this!

Just make a custom 404 page that returns 13 MBs of junk along with status code 200 and has a few dead links (which also 404, so it just loops back into itself).

There are no bots on the domain I do this on anymore. From swarming to zero in under a week.

You don't need tarpits or heuristics or anything else fancy. Just make your website so expensive to crawl that it's not worth it, so they filter themselves out.

[–] Evotech 1 points 23 hours ago (1 children)

Surely any competent web scraper will avoid an infinite loop?

[–] [email protected] 1 points 19 hours ago

Critics debating Nepenthes' utility on Hacker News suggested that most AI crawlers could easily avoid tarpits like Nepenthes, with one commenter describing the attack as being "very crawler 101." Aaron said that was his "favorite comment" because if tarpits are considered elementary attacks, he has "2 million lines of access log that show that Google didn't graduate."

You assume incorrectly that bots, scrapers, and drive-by malware attacks are made by competent people. I have years' worth of stories I'm not going to post on the open internet that say otherwise. I also have months' worth of access logs that say otherwise. AhrefsBot in particular is completely unable to deal with anything you throw at it. It spent weeks in a tarpit I made very similar to the one in the article, looping links, until I finally put it out of its misery.

[–] [email protected] 6 points 1 day ago (1 children)

Just make a custom 404 page that returns 13 MBs of junk along with status code 200

How would you go about doing this part? Asking for a friend who’s an idiot, totally not for me.

[–] [email protected] 7 points 1 day ago* (last edited 1 day ago) (2 children)

I use Apache2 and PHP; here's what I did:

In .htaccess you can set ErrorDocument 404 /error-hole.php (see https://httpd.apache.org/docs/2.4/custom-error.html).
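So the whole thing is one line in the site's .htaccess (a sketch of that directive, using the filename from this comment):

# Serve the junk page for every request that would otherwise 404
ErrorDocument 404 /error-hole.php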

In error-hole.php:

<?php
// Lie to the client: report success so crawlers keep treating the junk as a real page
http_response_code(200);
?>
<p>*paste a string that is 13 megabytes long*</p>

For the string, I used dd to generate 13 MBs of noise from /dev/urandom and then I converted that to base64 so it would paste into error-hole.php
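Roughly this sort of command, assuming GNU coreutils (the output filename here is arbitrary):

# 13 MB of random bytes, base64-encoded onto a single line so it pastes cleanly into the PHP file
dd if=/dev/urandom bs=1M count=13 | base64 -w 0 > junk.b64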

You should probably hide some invisible dead links around your website as honeypots for the bots that normal users can't see.
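One way to do that (just an illustration; the href is made up) is a link styled so people never see or tab to it, pointing at a path that doesn't exist so it drops straight into the 404 handler:

<!-- invisible to humans; crawlers that ignore robots.txt will follow it into error-hole.php -->
<a href="/definitely-a-real-page" style="display:none" aria-hidden="true" tabindex="-1">archive</a>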

[–] [email protected] 2 points 1 day ago (2 children)

How does this affect a genuine user who experiences a 404 on your site?

[–] [email protected] 1 points 18 hours ago

They will see a long string of base64 that takes a quarter of a second longer to load than a regular page. If it's important to you, you can make the base64 string invisible and add some HTML to make it appear as a normal 404 page.
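For example, something along these lines in error-hole.php (a rough sketch of the idea, not the exact page from the comment above):

<?php
http_response_code(200);
?>
<h1>404 - Not Found</h1>
<p>Couldn't find what you were looking for.</p>
<!-- still in the document for crawlers, but hidden from people -->
<div style="display:none">*paste the 13 MB base64 string here*</div>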

[–] [email protected] 3 points 1 day ago* (last edited 22 hours ago) (1 children)

I don't know a lot about this, but I would guess a normal user would like a message that says something along the lines of "404, couldn't find what you were looking for." The status code, the links back to itself, and the 13 MBs of noise should probably not irritate them. Hidden links should also not irritate normal users.

[–] [email protected] 2 points 1 day ago (1 children)

I also "don't know a lot about this", but I do know that your browser receiving a 200 means everything worked properly. From what I can tell, this technique replaces any and every 404 response with a 200, tricking the browser (and therefore the user) into thinking the site is working as expected every time they run into a missing webpage on this site.

[–] [email protected] 3 points 1 day ago

The user doesn’t see the status code, they see what’s rendered to the screen.

[–] [email protected] 1 points 1 day ago (1 children)

For the string, I used dd to generate 13 MBs of noise from /dev/urandom and then I converted that to base64 so it would paste into error-hole.php

That string is going to end up being about 17 MB, assuming it's a UTF-8 encoded .php file, since base64 encodes every 3 bytes as 4 characters (13 MB × 4/3 ≈ 17.3 MB).

[–] [email protected] 1 points 19 hours ago

idk what to tell you.

ls -lha
-rw-rw-r-- 1 www-data www-data  14M Jan 14 23:05 error-hole.php
[–] Blaster_M 5 points 1 day ago

Rather funny that some of these are literally if (useragent == chatgpt) kinds of silliness in the code. You need heuristics to detect scrapers, because they'll just report their user agent as the average user's browser.
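i.e. a check like this (a made-up sketch, not taken from any of the tools in the article) does nothing once the scraper just sends a stock browser User-Agent string:

<?php
// Naive blocking: only catches bots polite enough to identify themselves
if (stripos($_SERVER['HTTP_USER_AGENT'] ?? '', 'GPTBot') !== false) {
    http_response_code(403);
    exit;
}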

[–] [email protected] 2 points 1 day ago

This will be as effective against LLM trainers as Nightshade has been against generative image AI trainers.