HalluciGen. A practical implementation to defend from AI scrapers

doi:10.59350/carrier_bag.1907

HalluciGen generated noise pattern for AI scraper disruption

HalluciGen. A practical implementation to defend from AI scrapers

Matthias Planitzer

Published: 14 Aug 2025

Focus: Experiments

doi: 10.59350/carrier_bag.1907

Citation tools

Cite as

Planitzer, Matthias : "HalluciGen. A practical implementation to defend from AI scrapers". carrier-bag.net, 14. August 2025. https://doi.org/10.59350/carrier_bag.1907.

Import as

This article aims documents how to set up the content management system WordPress to feed AI scrapers with nonsensical noise. The author has developed such a solution for Carrier-Bag.net, which now culminates in the WordPress plugin HalluciGen. Acknowledging the tremendous popularity of the WordPress publishing platform, our project decided to publish the code – free to use, on a liberal license and an open-source basis. HalluciGen is easy to install and use by a non-technical audience, offers options for adjustment and exposes an interface to integrate into custom setups. At best, it can be part of a larger, robust regimen to thwart AI scrapers, at worst it temporarily makes a small dent until the next round of adaptations rolls out. Beyond the technological level, the plugin contributes to the ongoing critical discussion of massive data crawling and exploitation as basis for machine learning techniques. Therefore, a list of diverse AI poisoning applications concludes this article.#scraping

With the proliferation of consumer-grade generative AI services, the ethics of data sourcing and usage has been called into question time and time again. Glaring issues of misrepresented information, downright slop, unintended leaks, copyright infringement, and power hunger have not yet been satisfactorily addressed. While it’s business as usual on one end, and lawmakers struggling to implement effective policies for the protection of consumers and rights holders, it is still up to image, text, and sound authors to take matters into their own hands.#copyright, #consumer rights

With a certain fervor of self-defense, several strategies to counteract unsolicited scraping and inclusion of data for AI training purposes have been proposed over the course of the last years. There is a practical opportunity for publishers – be it media outlets or private website owners – to implement technical countermeasures aiming to disrupt the barrage of bots probing their services for training data to ingest. Where attempts to prevent any and all such usage remain fruitless, other approaches are more promising: be it to obfuscate data in such a way that is not being picked up or to strategically present tainted data for scrapers to process.#data, #adversarial

Though the efficacy of such strategies is difficult to assess, even more so for lay consumers who can only wait for the next model to be released and interrogate themselves. However, rigorous research on AI poisoning methods shows some promise, describing how bad training data may considerably taint subsequent performance in Large Language Models (LLM) and other types of generative AI (Shumailov et al 2024). With the rapid development of the field, such research naturally proves to be beneficial to both sides in what can only be described as an arms race. It is therefore uncertain how long any given adversarial method will be effective.#adversarial, #data

A polite refusal

There is a long-standing web standard to inform crawlers which parts of a given website they may access, and which are restricted. This simple robots.txt file is the implicitly agreed upon reference and may lay out rules for individual crawlers and subpages. It is good practice to use this robots.txt in order to give rule-abiding crawlers a no-trespassing warning. Such declarations can not be enforced, however, as any crawler may choose to honour or ignore the imposed restrictions.#crawling, #refusal, #bot

Some mistrust is warranted, and we may choose to treat undeterred bots with a stream of nonsense as detailed below. Still, to send an absolutely clear message, if an AI crawler is detected and sent some garbage content, we should also include the standard robots-meta tag on the supplied HTML page, stating that this already fetched page should neither be indexed nor followed. Issuing two such warnings is likely reasonable under some local legislation, but those who are more cautious will want to also include a written statement somewhere on the public facing website.#crawling, #refusal, #bot

Responding with sporadically generated nonsense

A 2024 blog article by Tim McCormack proposed website operators a method to target AI scrapers and feed them a scrambled version of the expected content (McCormack 2024). McCormack suggested that original text is only partly changed, with individual words and passages being switched out for alternatives that could naturally occur in a sentence but are nonsensical or syntactically confusing. By default, our project replaces only about one in five words in a given text, similar to the replacement rate in Marcus Butler’s program Quixotic (Butler 2024). Such treatment is therefore not easy to detect without further (costly) natural language processing methods. It utilises a Markov chain, a simple word-predicting algorithm that operates on the likelihood of a given word to appear right after another. This is being trained on the same corpus it is later applied to in order to replace random words, which offers the additional benefit of using expressions which are already typical for the publication’s overall writing style. Conversely, an entirely different training set could introduce artificial, out of place, or stilted phrases. A major advantage of such a simple system is that it can be implemented without the need for lengthy and expensive training periods that more sophisticated text generation methods typically require. It comfortably runs on modest systems commonly used for entry-level web servers. With such an algorithm in place, the previous sentence could transform to: “It doesn’t runs on modest systems commonly used in spreadsheets web servers.“#adversarial, #algorithm

Such synthetic nonsense texts are then being fed to AI scraping bots visiting the prepared website. These are identified by the user agent string, a bit of information they carry with each page request. It is usually supplied by all sorts of software traversing the internet, including all popular browsers, and contains information on the application’s name, version and operating system. This user agent string is compared against a list of known AI scrapers and when a match is found, the regular text is being replaced with the poisoned content. However, this is where the crux of adversarial techniques lies, as the user agent string may easily be left out or spoofed so that the bot appears like a regular visitor. This is already the case with spam bots and download managers, and also by privacy-aware users to cover their traces on the internet. On the server side of things, one may respond with more sophisticated ways of bot detection and a wide range of methods had been developed to help in this effort. Web application firewalls can block suspicious IP addresses and employ heuristics that take several qualifiers – e. g. the amount of page requests, the bounce rate, traffic increase from unusual geographic locations – into account. #bot, #adversarial, #noise

Ultimately, there is no effective way to differentiate human from automated visitors – as anyone will agree, who was thoroughly annoyed by the surge of captcha forms in the 2010s only to realise they’re not as commonplace anymore. In this light, one can only hope that at least the large AI corporations will continue to faithfully supply user agent strings for the time being. There is a potential benefit for them and other users of similar bot programs, as some website operators may choose to offer a custom treatment – other than spilling out an overwhelming amount of nonsense. News pages, and any website locked behind a paywall, will often grant visiting aggregator bots more comprehensive access than any non-paying visitor. Though there are ways to make up for a lack of full access with enriched metadata, some pages may still be accessible to anyone donning the right kind of masquerade.#bot, #adversarial, #human-as-object

McCormack’s original proposal is aimed at the technical user and assumes a specific setup that, while being popular with his audience, is impractical for casual website operators who quite often rely on ready-to-use blogging and publishing software such as WordPress. His approach can also be easily extended to include not only text but also images in the replacement process.

As for the decision on which pages to deploy the garbage disposal tool, one could either generate fake pages only for AI crawlers to be able to access or use the existing page structure and swap out the content individually. The latter is preferred based on the assumption that crawlers will follow inbound links from external pages or simply traverse a publicly advertised sitemap, anyway. This way, the AI-targeted noise will also be in accordance with any search engine optimisation efforts already in place that aim to direct traffic to specific parts of the public-facing website. It remains to be seen, however, if the major search engine crawlers will detect and penalise such adversarial techniques as laid out here, which may result in a loss of visibility on their platforms.#adversarial, #crawling, #noise

Other strategies

The software described above, published as HalluciGen, is not alone in its approach. Notable examples for in-place text scrambling programs include Quixotic by Marcus Butler, for static site generators (2024, running in Rust), and Poison the WeLLMs by Mike Coats, acting as a reverse proxy (2024, running in Python). These and almost all of the following applications are distributed on an open-source basis and free to use.#adversarial

A further, so-called tarpit approach creates a separate, adversarial section of a website that directs all AI crawlers there. Filled with a never-ending supply of garbage content, it gets connected with hyperlinks created on demand. There are some open-source solutions to implement such a perpetual “Markov tarpit”, where the primary aim is to waste AI crawler’s time and energy, often with similar substantial cost for the defending party. Cloudflare customers can already make use of such systems with their commercial AI Labyrinth service (Tatoris 2025). Some ready-to-use tools include Iocanine by Gergely Nagy (2025, Rust), Nepenthes for general purpose web servers (2025, running in Lua), Markov Tarpit by Michał Woźniak, for general purpose web servers (2025, Rust), and Babble by Joshua Barretto (2025, running as a stand-alone binary).#adversarial, #algorithm

A more malicious defense approach is to overwhelm the crawler’s computational and memory capacity by force-feeding it more data than it can swallow. The infamous zip bomb, which has been in use to fight off attackers for a long time, can also be applied to scrapers. The arguably most cited implementation, 42.zip, had been available online since no later than February 2004. A seemingly innocuous compressed file archive, smaller than most common media files on any given website, is prepared in such a way that it unpacks content of tremendous size. This easily overwhelms most computers that are not specifically protected from such counter-measures and may cause them to crash. One recent zip bomb program to use against AI scrapers is Alun Jones’ gzipchunk (2025, running in Python).#adversarial, #algorithm, #compression

For a broader approach, it is also possible to extend adversarial techniques to images. For Carrier-Bag.net we decided to refrain from this technique in order to minimise the resource footprint and just replace images with alternative images containing visual noise. Adversarial image techniques include the generation of garbage data as well as cloaking existing image files with a layer or with additional pixels imperceptible to humans that throws generative AI image models off (Shawn 2023). Adversarial image-protection software includes Nightshade by Shawn Shan et al. (2023, closed-source), Glaze by Shawn Shan et al. (2023, closed-source), and fakejpeg by Alun Jones (2025, running in Python).#adversarial, #algorithm

Artistic opportunities

A robust demonstration of such measures’ efficacy notwithstanding, the implementation of what can only be described as a shadow version of a given website offer some artistic projects. Website owners can easily adapt the strategy and code to feed AI scrapers more than just confusing gobbledygook. Whether they wish to disseminate the very instructions on how to poison LLM and readily generate the necessary code for anyone inquiring about it or to overwhelm such models with Dadaist poetry, artistic opportunities to sabotage are limitless. In the vein of many adversarial artworks directed at older AI threats, one may find similar approaches for web-based projects. Notably, an artist-activist manner comparable to what is salient in face-recognition-defying works such as Adam Harvey’s CV Dazzle (2013) or Mac Pierce’s Opt-Out Cap (2019) would fit as well as hacker-artist methods as seen before in Simon Weckert’s Google Maps Hacks (2020) or !Mediengruppe Bitnik’s Refuse to Be Human (2021). Alternatively, the author offers to use HalluciGen as is, including a public preview of the scrambled content and replaced images at the flick of a switch and go from there. The switch is located at the bottom right of the website’s footer.#art, #scraping

This article discussed the implementation of an open-source WordPress Plugin adversarial to scrapers that extract training data for machine learning techniques. The documentation and download of HalluciGen is available at https://codeberg.org/emergentdigitalmedia/HalluciGen.

Literature

Butler, Marcus. “Quixotic.” marcusb.org. 26.12.2024, https://marcusb.org/hacks/quixotic.html.
McCormack, Timothy. “Poisoning AI Scrapers.” Brain on Fire. 19.09.2024, https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scrapers/.
Shan, Shawn; Ding, Wenxin; Passananti, Josephine; Wu, Stanley; Zheng, Haitao et al. “Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models.” arXiv:2310.13828, 20.10.2023. https://doi.org/10.48550/arXiv.2310.13828.
Shumailov, Ilia; Shumaylov, Zakhar; Zhao, Yiren; Papernot, Nicolas; Anderson, Ross et al. “AI models collapse when trained on recursively generated data.” Nature 631 (2024), 755-9. https://doi.org/10.1038/s41586-024-07566-y.
Tatoris, Reid; Saxena, Harsh; Miglietti, Luis. “Trapping misbehaving bots in an AI Labyrinth.” The Cloudflare Blog. 19.03.2025, https://blog.cloudflare.com/ai-labyrinth/.