How to defend against AI training bots
Learn which bots scrape the web for AI training data, how aggressive crawling affects your site, and how to block these bots with robots.txt and firewall rules.
With the explosion of Large Language Model providers, there has been a corresponding explosion of bots scraping the web for training material. These bots crawl websites in much the same way as search engine crawlers like Googlebot and Bingbot. Indeed, both Google and Microsoft operate bots dedicated to gathering training material for their AI models.
Here at Peakhour we classify these as Grey bots, potentially even as Malicious. They provide next to no value to the vast majority of websites and typically crawl very aggressively. This aggressive crawling can severely degrade performance for your legitimate users and inflate your cloud bill.
GPTBot (OpenAI): GPTBot is a web scraper used by OpenAI to gather data for training its GPT models. It respects robots.txt directives.
ClaudeBot (Anthropic): Used by Anthropic for training its language models. ClaudeBot respects robots.txt directives.
CCBot (Common Crawl Bot): This bot collects data for the Common Crawl dataset, which is widely used for training language models. It respects robots.txt rules and aims to minimise disruption to websites.
MSBot (Microsoft): Used by Microsoft for various AI and language model training purposes. It adheres to robots.txt directives and is designed to gather useful data while respecting website owners’ preferences.
ByteSpider (ByteDance/TikTok): ByteDance uses this bot for training its language models. ByteSpider does not appear to respect robots.txt and is extremely aggressive.
PerplexityBot: Used by Perplexity AI to gather data to improve its models. This bot respects robots.txt when crawling generally, but not when fetching pages to answer user queries submitted to its LLM. It has also been reported to masquerade as other browsers.
AmazonBot: While not technically an LLM training bot, Amazon says it uses AmazonBot to train Alexa to provide better responses.
ImagesiftBot: Owned by Hive, this bot scrapes the internet for publicly available images. While primarily used for reverse image search, the data can also be used to train image generation models. ImagesiftBot respects the robots.txt file.
You can try disallowing each of these bots in your robots.txt file. Note, however, that robots.txt only deters bots that honour it; bots like ByteSpider ignore it entirely. If you have a firewall service that allows custom rules, you can also deny each bot by its User-Agent value.
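A robots.txt along the following lines covers the bots listed above. This is a minimal sketch: the user-agent tokens mirror the names used in this article, but exact spellings vary between operators (ByteDance's token is commonly written "Bytespider" and Amazon's "Amazonbot"), so verify each token against the bot operator's own documentation before relying on it.

```
# Disallow known AI training crawlers from the entire site
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: MSBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: ImagesiftBot
Disallow: /
```

Remember that this file is purely advisory: it only affects bots that choose to respect it.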
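For bots that ignore robots.txt, a firewall or application-level rule that matches on the User-Agent header is the fallback. The sketch below shows the idea as framework-agnostic Python; the token list, function names, and request-handling shape are illustrative assumptions, not any particular firewall's API.

```python
# Sketch: deny requests whose User-Agent contains a known AI-crawler token.
# The tokens mirror the bots listed above; adjust the list for your own policy.
AI_BOT_TOKENS = (
    "GPTBot",
    "ClaudeBot",
    "CCBot",
    "MSBot",
    "Bytespider",
    "PerplexityBot",
    "Amazonbot",
    "ImagesiftBot",
)


def is_ai_bot(user_agent: str) -> bool:
    """Case-insensitive substring match against the block list."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_BOT_TOKENS)


def handle_request(headers: dict) -> tuple[int, str]:
    """Hypothetical request hook: return 403 for listed bots, 200 otherwise."""
    if is_ai_bot(headers.get("User-Agent", "")):
        return 403, "Forbidden"
    return 200, "OK"
```

A substring match like this is deliberately loose, since real User-Agent strings embed the bot token inside a longer value (e.g. "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"). It will not catch bots that spoof a browser User-Agent, as PerplexityBot has reportedly done; for those you need reputation- or behaviour-based detection.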
© PEAKHOUR.IO PTY LTD 2025 ABN 76 619 930 826 All rights reserved.