AI and Large Language Model (LLM) web scrapers are automated crawlers that request website pages so an AI system can train on, index, summarize, cite, or act on the content. Some identify themselves clearly and follow robots.txt. Others use browser automation, proxy networks, or misleading user-agent strings to keep collecting data after a site has expressed a preference not to be crawled.
Traditional search crawlers usually provide a direct exchange: they crawl a page, add it to a search index, and send visitors back to the source website. AI crawlers can create a different exchange. They may use the page to improve a model, generate an answer, retrieve a live citation, compare products, or guide an agentic workflow without the user visiting the source page.
Training crawlers collect public content for model training or model improvement. Examples include GPTBot, anthropic-ai, ClaudeBot, CCBot, and other dataset or model-provider crawlers.
The risk is usually content control. A site may be happy to appear in search results but not want articles, product descriptions, documentation, prices, images, reviews, or forum posts copied into training data.
AI search crawlers build indexes used by answer engines and AI search products. Examples include OAI-SearchBot, PerplexityBot, and AI-search crawlers from search or assistant providers.
The risk is mixed. Some traffic may improve visibility in AI search, while other traffic may produce answers that reduce direct visits or copy page content into summaries without enough value returning to the site.
Live retrieval crawlers fetch pages in response to a user prompt. ChatGPT-User is an example of this pattern. These crawlers may be more valuable than bulk training crawlers because a real user is asking for current information.
The risk is that a simple block can also block legitimate AI-assisted discovery. A better policy may allow some live retrieval traffic while blocking training crawlers or aggressive scraping patterns.
Agentic systems use browsers or browser-like automation to research, compare, click, fill forms, or complete tasks. They may look less like a classic crawler and more like a scripted user session.
The risk is intent. A trusted shopping assistant checking one product is different from an automated workflow scraping an entire catalogue, probing checkout, or comparing prices across thousands of pages.
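As a rough illustration, the crawler classes described above can be sketched as a lookup keyed on the tokens this article names. The grouping, the `CRAWLER_CLASSES` table, and the `classify_user_agent` helper are assumptions made for the example, not a maintained registry. The result is only the class a request claims to belong to, because the header is supplied by the client.

```python
from typing import Optional

# Tokens taken from the examples in this article. The grouping is
# illustrative only; it is not an authoritative or complete registry.
CRAWLER_CLASSES = {
    "training": ("GPTBot", "anthropic-ai", "ClaudeBot", "CCBot"),
    "ai_search": ("OAI-SearchBot", "PerplexityBot"),
    "live_retrieval": ("ChatGPT-User",),
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the crawler class a User-Agent header claims, or None."""
    ua = user_agent.lower()
    for crawler_class, tokens in CRAWLER_CLASSES.items():
        # Case-insensitive substring match on each known token.
        if any(token.lower() in ua for token in tokens):
            return crawler_class
    return None
```

Agentic browser automation usually has no stable token to match on, so it falls through to `None` here and has to be assessed from request behaviour instead.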
AI scraping can affect a business in several ways, including loss of control over content, fewer direct visits, and extra load on the origin.
Not every AI crawler is bad. The right policy depends on whether the crawler creates value for the site, whether it is transparent, whether it follows stated rules, and whether its request pattern is proportionate.
Many AI crawlers publish user-agent strings, and those strings are useful for reporting and basic policy. They are not enough for enforcement. A user-agent is just an HTTP header, so it can be changed easily.
Modern AI crawler management should compare the claimed identity with harder-to-fake evidence, such as route, cadence, fingerprint, proxy, and behavior signals.
This is especially important when a crawler changes identity, rotates through proxy infrastructure, or uses a real browser to perform machine-speed browsing.
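One commonly used piece of harder-to-fake evidence is the source address: if a crawler operator publishes its IP ranges, a request that claims that identity from outside those ranges is suspect. The sketch below assumes such a published list exists. The `PUBLISHED_RANGES` table, the `examplebot` name, and the `claim_matches_source_ip` helper are hypothetical, and the networks are documentation-only ranges used as placeholders.

```python
import ipaddress
from typing import Dict, List

# Hypothetical published ranges for a made-up crawler. These are
# documentation-only networks; substitute the ranges an operator
# actually publishes.
PUBLISHED_RANGES: Dict[str, List[str]] = {
    "examplebot": ["192.0.2.0/24", "2001:db8::/32"],
}


def claim_matches_source_ip(claimed_bot: str, client_ip: str) -> bool:
    """Return True only if client_ip is inside a range listed for claimed_bot."""
    try:
        ip = ipaddress.ip_address(client_ip)
    except ValueError:
        # An unparseable address can never confirm a claim.
        return False
    for cidr in PUBLISHED_RANGES.get(claimed_bot.lower(), []):
        # Membership is False when the IP versions differ (v4 vs v6).
        if ip in ipaddress.ip_network(cidr):
            return True
    return False
```

A failed check does not prove malice on its own, but it is a reasonable trigger for treating the request as unknown or spoofed rather than as the crawler it names.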
Start by deciding what outcome you want for each class of traffic:
| Traffic type | Common policy |
|---|---|
| Major search crawlers | Verify and allow with crawl-rate controls |
| AI training crawlers | Block or require explicit commercial approval |
| AI search crawlers | Allow, block, or rate-limit depending on visibility goals |
| Live retrieval crawlers | Often allow with route and rate controls |
| Unknown or spoofed crawlers | Challenge, rate-limit, or block based on risk |
| High-volume scraping patterns | Block or slow at the edge before origin load grows |
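As a rough illustration, the table above can be expressed as a default-action lookup. Several rows allow more than one choice, so the defaults picked here are only one possible reading. The `Action` enum, the traffic-type keys, and the `decide` helper are names invented for this sketch.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    RATE_LIMIT = "rate_limit"
    CHALLENGE = "challenge"
    BLOCK = "block"


# One possible default per traffic type; adjust to your own goals.
DEFAULT_POLICY = {
    "search": Action.ALLOW,           # verify, then allow with crawl-rate controls
    "ai_training": Action.BLOCK,      # or require explicit commercial approval
    "ai_search": Action.RATE_LIMIT,   # depends on visibility goals
    "live_retrieval": Action.ALLOW,   # with route and rate controls
    "unknown": Action.CHALLENGE,      # unknown or spoofed crawlers
    "high_volume": Action.BLOCK,      # block or slow at the edge
}


def decide(traffic_type: str, identity_verified: bool) -> Action:
    """Pick a default action; unverified claims use the 'unknown' policy."""
    if traffic_type == "high_volume":
        # Volume is judged on behaviour, so it applies regardless of identity.
        return DEFAULT_POLICY["high_volume"]
    if not identity_verified:
        return DEFAULT_POLICY["unknown"]
    return DEFAULT_POLICY.get(traffic_type, DEFAULT_POLICY["unknown"])
```

Keeping the verification result as a separate input means a spoofed search-crawler user agent is handled as unknown traffic instead of inheriting the search crawler's allow rule.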
Then enforce the policy in layers, with robots.txt as one of them.
Peakhour's approach is to connect the crawler label to request evidence: route, cadence, fingerprint, proxy, and behavior signals. That gives teams a practical way to allow useful crawlers, slow noisy ones, and block traffic that ignores the site's rules.
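The robots.txt layer is advisory: it states a preference that well-behaved crawlers follow and others ignore. A minimal sketch of generating such a file for the training crawlers named in this article might look like the following. The `build_robots_txt` helper is hypothetical, and the choice of which agents to disallow is an assumption for the example. The standard-library `urllib.robotparser` is used only to sanity-check the output.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents: Iterable[str]) -> str:
    """Render a robots.txt that disallows the listed agents and allows the rest."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    # Catch-all group for every agent not named above.
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


# Training crawler tokens named in this article; adjust to your own policy.
robots_txt = build_robots_txt(["GPTBot", "anthropic-ai", "ClaudeBot", "CCBot"])

# Parse the generated text to confirm it expresses the intended rules.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
```

With this file, `parser.can_fetch("GPTBot", "/pricing")` is disallowed while `parser.can_fetch("ChatGPT-User", "/pricing")` is allowed. Crawlers that do not honour robots.txt still need the evidence-based enforcement described above.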
Important crawler and AI-agent user-agent names include:
- GPTBot
- ChatGPT-User
- OAI-SearchBot
- anthropic-ai
- ClaudeBot
- Claude-Web
- PerplexityBot
- Google-Extended
- Google-CloudVertex
- Applebot-Extended
- CCBot
- Bytespider
- Meta-ExternalAgent
- Amazonbot
- MistralAI-User

See AI crawler user agents for a maintained reference, how to detect AI crawlers for evidence sources, and how to block AI crawlers for enforcement options.
© PEAKHOUR.IO PTY LTD 2025 ABN 76 619 930 826 All rights reserved.