More accurately, it traps any web crawler, including regular search engines and benign projects like the Internet Archive. At a minimum, this should not be deployed without an allowlist for known trusted crawlers.
Just put the trap in a space roped off by robots.txt; any crawler that ventures there deserves to be roasted.
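A minimal robots.txt along those lines might look like this (the /maze/ path is a hypothetical location for the trap):

```
# Hypothetical setup: the tarpit lives under /maze/
User-agent: *
Disallow: /maze/
```

Compliant crawlers will never request anything under /maze/; anything that follows links into it is ignoring robots.txt by definition.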
Yup, put all the bad stuff into "not-robots.txt". Works every time.
More accurately, it does not trap any competent crawlers, which have per-domain limits on how many pages they crawl.
You would still want to tell the crawlers that obey robots.txt not to pay attention to that part of the website. Otherwise it's just going to break your SEO.
How exactly would that work? Would trusted crawlers be blocked from accessing the maze?
You can tell which crawler it is by the User-Agent header.
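A rough sketch of that check in Python (the allowlist contents and function name are illustrative, and the substring match is deliberately naive):

```python
TRUSTED_CRAWLERS = ("googlebot", "bingbot", "archive.org_bot")  # illustrative allowlist

def is_trusted_crawler(user_agent: str) -> bool:
    """Naive check: does the User-Agent string claim to be a known crawler?"""
    ua = user_agent.lower()
    return any(bot in ua for bot in TRUSTED_CRAWLERS)
```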
Which can easily be faked.
But then they're probably not going to obey robots.txt anyway, so it doesn't matter.
Most legitimate robots do. Those that don't (among them many AI feeders) deserve to be drowned in the shit that the honeypot delivers.
All of cybersecurity is an arms race of moving targets. It doesn't need to be foolproof to mitigate traffic for a while.
Yeah, and then you allowlist them by blacklisting them from the maze.
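Combining the two ideas, a hypothetical request handler might route like this (reusing the is_trusted_crawler sketch from above; the path and return values are illustrative):

```python
def handle_request(path: str, user_agent: str) -> str:
    """Hypothetical routing: trusted crawlers are kept out of the trap,
    everyone else who ignores robots.txt gets the endless maze."""
    if path.startswith("/maze/"):
        if is_trusted_crawler(user_agent):
            return "404"           # trusted bots never see the tarpit
        return "serve_tarpit"      # untrusted visitors wander forever
    return "serve_real_page"       # normal content for everyone else
```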