AI Labyrinth is a novel mitigation technique that uses AI-generated content to slow down, confuse, and waste the resources of AI crawlers and other bots that disregard “no crawl” directives. When you opt in, Cloudflare automatically deploys an AI-generated set of linked pages whenever it detects inappropriate bot activity, so customers do not need to create any custom rules.
AI Labyrinth is available on an opt-in basis to all customers, including those on the Free plan.
Using Generative AI as a defensive weapon
AI-generated content has exploded: it reportedly accounted for four of the top 20 Facebook posts last autumn, and according to Medium it makes up 47% of all content on their platform. Like any new tool, it can be used for both good and bad.
Meanwhile, the number of new crawlers that AI companies use to scrape data for model training has skyrocketed. AI crawlers send more than 50 billion requests to the Cloudflare network every day, or just under 1% of all web requests. Although Cloudflare offers several tools to detect and block unauthorised AI crawling, it has found that blocking malicious bots can alert the attacker that you are aware of them, prompting a change of tactics and a never-ending arms race. Cloudflare therefore wanted a new way to thwart these unwanted bots without letting them know they have been thwarted.
To do this, Cloudflare chose to use AI-generated content, a new offensive tool in the bot creator’s toolbox that has so far seen little defensive use. When it identifies unauthorised crawling, it does not block the request. Instead, it links to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them.
As an added benefit, AI Labyrinth also serves as a next-generation honeypot. No real person would navigate deep into a labyrinth of AI-generated nonsense, so any visitor who does is very likely a bot. This gives Cloudflare a new way to identify and fingerprint malicious bots, which it adds to its list of known bad actors.
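The honeypot idea above can be sketched in a few lines. This is only an illustration, not Cloudflare's implementation: the `/labyrinth/` path prefix, the `DecoyTracker` class, and the threshold of three pages are all hypothetical choices made for the example.

```python
from collections import defaultdict

DECOY_PREFIX = "/labyrinth/"  # hypothetical path prefix for decoy pages
DEPTH_THRESHOLD = 3           # hypothetical number of decoy pages before flagging


class DecoyTracker:
    """Counts how many distinct decoy pages each client has requested."""

    def __init__(self, threshold=DEPTH_THRESHOLD):
        self.threshold = threshold
        self._seen = defaultdict(set)  # client id -> set of decoy paths

    def record(self, client_id, path):
        """Record one request and return True once the client looks like a bot."""
        if path.startswith(DECOY_PREFIX):
            self._seen[client_id].add(path)
        # .get avoids creating entries for clients that never hit a decoy.
        return len(self._seen.get(client_id, ())) >= self.threshold
```

A client that only requests ordinary pages is never flagged, and repeated requests for the same decoy page count once, so the signal depends on how far a visitor wanders into the decoy pages.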
How the labyrinth was constructed
AI crawlers that follow these links waste valuable computational resources processing irrelevant material instead of extracting data from your real website. As a result, they are much less able to collect the data they need to train their models.
To produce convincingly human-like material, Cloudflare combined Workers AI with an open source model to generate original HTML pages on a variety of subjects. Rather than producing this content on demand, which could affect performance, it uses a pre-generation pipeline that sanitises the content to prevent XSS vulnerabilities and stores it in R2 for faster retrieval.
Cloudflare found that generating a wide selection of topics first, and then creating content for each topic, produced more varied and convincing results. The generated content is real and grounded in scientific fact, but it is neither proprietary nor relevant to the website being crawled. Cloudflare takes care to avoid generating inaccurate content that would contribute to the spread of misinformation online.
Using a custom HTML transformation process, this pre-generated material is seamlessly embedded as hidden links on existing pages without altering the page’s original content or structure. Every generated page includes the appropriate meta directives to prevent search engine indexing and protect SEO.
Through carefully chosen attributes and styling, Cloudflare also made sure that these links are not visible to human visitors. To further reduce the impact on ordinary visitors, the links are shown only to suspected AI scrapers, while legitimate users and verified crawlers browse normally.
What makes this approach especially effective is its role in Cloudflare’s continuously evolving bot detection system. Because human users and legitimate browsers would never see or click these links, Cloudflare can be highly confident that any activity on them comes from automated crawlers. This provides a strong identification signal and generates useful data that feeds its machine learning models.
By examining which crawlers follow these hidden paths, Cloudflare can find new bot patterns and signatures that might otherwise go unnoticed. This proactive approach helps it stay ahead of AI scrapers and improve its detection capabilities without disrupting the normal browsing experience.
By building on its developer platform, Cloudflare has created a system that delivers convincing decoy material quickly and with consistent quality, without affecting the user experience or performance of your website.
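A pre-generation step of this kind might look roughly like the sketch below. It is not the actual pipeline: it assumes the model's output is handled as untrusted plain text (a title and paragraphs) and escaped into a fixed template, which is one simple way to rule out injected markup. A plain dictionary stands in for R2, and the function names are hypothetical.

```python
import hashlib
import html


def render_decoy_page(title, paragraphs):
    """Build a page from untrusted text, escaping every model-produced string."""
    safe_title = html.escape(title)
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return (
        "<!doctype html>\n"
        f"<html><head><title>{safe_title}</title></head>\n"
        f"<body><h1>{safe_title}</h1>\n{body}\n</body></html>"
    )


def pregenerate(store, title, paragraphs):
    """Render a page ahead of time and save it under a content-derived key."""
    page = render_decoy_page(title, paragraphs)
    key = hashlib.sha256(page.encode("utf-8")).hexdigest()[:16]
    store[key] = page  # `store` is a dict standing in for object storage
    return key
```

Because pages are rendered and stored before any request arrives, serving one later is a simple key lookup rather than a model call.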
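To make the idea concrete, here is a minimal sketch of both halves: adding a hidden link to an existing page and adding a robots meta tag to a decoy page. The specific attributes (`display:none`, `aria-hidden`, `tabindex`, `rel="nofollow"`) and the function names are assumptions chosen for illustration; the article does not say which attributes Cloudflare uses, and a production system would use a streaming HTML rewriter rather than regular expressions.

```python
import html
import re

# Hypothetical markup for a link that browsers do not render for human visitors.
HIDDEN_LINK = (
    '<a href="{href}" rel="nofollow" aria-hidden="true" '
    'tabindex="-1" style="display:none">{text}</a>'
)
# Standard robots meta tag asking search engines not to index or follow.
ROBOTS_META = '<meta name="robots" content="noindex, nofollow">'

_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)
_HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def inject_hidden_link(page, decoy_url, text="More"):
    """Insert a hidden link just before the last </body>, leaving the rest as is."""
    matches = list(_BODY_CLOSE.finditer(page))
    if not matches:
        return page  # no closing body tag: leave the page untouched
    link = HIDDEN_LINK.format(
        href=html.escape(decoy_url, quote=True), text=html.escape(text)
    )
    pos = matches[-1].start()
    return page[:pos] + link + page[pos:]


def add_robots_meta(decoy_page):
    """Insert the robots meta tag right after the opening <head> tag."""
    match = _HEAD_OPEN.search(decoy_page)
    if match is None:
        return decoy_page
    return decoy_page[: match.end()] + ROBOTS_META + decoy_page[match.end():]
```

The original markup is only ever extended, never rewritten, which mirrors the goal of not altering the page's own content or structure.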
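The gating described above boils down to a small decision. The sketch below assumes some upstream classifier has already labelled the visitor; the `Visitor` categories and the `should_show_decoy_links` helper are hypothetical names, not part of any Cloudflare API.

```python
from enum import Enum


class Visitor(Enum):
    """Hypothetical output of an upstream bot classifier."""
    HUMAN = "human"
    VERIFIED_CRAWLER = "verified_crawler"
    SUSPECTED_AI_SCRAPER = "suspected_ai_scraper"


def should_show_decoy_links(visitor, opted_in):
    """Decoy links are shown only on opted-in zones and only to suspected scrapers."""
    return opted_in and visitor is Visitor.SUSPECTED_AI_SCRAPER
```

Everyone else, including verified search engine crawlers, receives the page exactly as the origin served it.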
How to use AI Labyrinth to stop AI crawlers
It only takes one toggle on your Cloudflare dashboard to enable AI Labyrinth. Navigate to your zone’s bot management section and switch on the new AI Labyrinth setting.
AI Labyrinth requires no further configuration; once enabled, it starts working immediately.
AI-generated honeypots
AI Labyrinth’s primary benefit is to confuse and distract bots. A further benefit, though, is that it can serve as a next-generation honeypot. A honeypot in this context is simply an invisible link that a human visitor to the website cannot see, but that a bot parsing the HTML would see and follow, revealing itself as a bot.
Honeypots have been used to catch hackers since at least the Cuckoo’s Egg incident in late 1986. In 2004, before Cloudflare was founded, its founders created Project Honeypot to make it simple for anyone to set up free email honeypots and receive lists of crawler IPs in exchange for contributing to the database. This approach is less effective now that bots have evolved to actively look for honeypot techniques such as hidden links.
Rather than just adding invisible links, AI Labyrinth will eventually build entire networks of linked URLs that are far more realistic and harder for automated systems to detect. No person would spend much time on these pages, but AI bots are designed to crawl through them to gather as much information as they can. When bots visit these URLs, Cloudflare can be confident they are not real people, and this information is recorded automatically and fed into its machine learning models to identify bots more accurately. Every scraping attempt therefore helps protect all Cloudflare customers, creating a positive feedback loop.
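As a rough illustration of what a network of connected decoy URLs could look like, the sketch below links each pre-generated page to the next few pages in a ring, so a crawler never reaches a dead end. The ring layout, the function names, and the default of three links per page are assumptions for the example, not a description of Cloudflare's design.

```python
from collections import deque


def build_link_graph(page_ids, links_per_page=3):
    """Link each page to the next few pages in a ring so none is a dead end."""
    n = len(page_ids)
    k = max(0, min(links_per_page, n - 1))  # a page never links to itself
    return {
        pid: [page_ids[(i + step) % n] for step in range(1, k + 1)]
        for i, pid in enumerate(page_ids)
    }


def reachable(graph, start):
    """Breadth-first walk: every page a link-following crawler could reach."""
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

With at least two pages, every page has an outgoing link and the whole set is reachable from any starting point, so a bot that keeps following links stays inside the decoy pages.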
What comes next?
This is only the beginning of Cloudflare’s use of generative AI to stop bots. Although the content it currently produces appears human-written, it will not fit the existing layout of every website. Cloudflare continues to work on making these links harder to spot and on blending them into the existing structure of the website in which they are embedded. You can help by opting in now.