When Bots Break Bad

Dan   

The internet is filled with bots. Recent studies estimate that nearly 50% of all internet traffic is generated by these automated programs. Bots come in various forms: some are good, like the search engine crawlers that index your site; some are bad, like scrapers and sneaker bots; and some fall into a grey area, like backlink and marketing bots such as Ahrefs and SEMrush. While bots can serve a useful purpose, they can also cause problems when they crawl websites too aggressively. In this blog post, we'll explore the different types of bots and how you can manage them using robots.txt and bot management tools.

Understanding the Different Types of Bots

'Good Bots'

Good bots are essential to the smooth functioning of the internet. Search engine crawlers like Googlebot and Bingbot index webpages, ensuring that search engine results are up to date and relevant. Other examples of good bots include uptime and performance monitoring bots.

'Bad Bots'

Bad bots are harmful to websites and their users. They engage in malicious activities such as:

  • Content scrapers, which copy and repurpose data from websites.
  • Sneaker bots, which automatically purchase limited-edition products (like sneakers) before human users can.
  • Spam bots, which post unsolicited messages and advertisements in comment sections or forums.
  • Vulnerability scanners, which try thousands of website URLs in an attempt to find security vulnerabilities.
  • Account takeover bots, which attempt to gain access to existing user or admin accounts using credential stuffing or brute-force attacks.

'Grey Bots'

Grey bots fall somewhere in between good and bad. They often serve a useful purpose and will generally follow the crawling directives in your robots.txt file, but they can cause problems when they crawl websites too aggressively. Common examples include:

  • AhrefsBot: A backlink analysis bot used by Ahrefs, an SEO tool.
  • SEMrushBot: A bot used by SEMrush, another popular SEO and digital marketing tool.
  • MJ12bot: A bot used by Majestic, a service that provides backlink data and analysis.
  • ScreamingFrog: An SEO analyzer run from a local desktop.

When Grey Bots (and Even Good Bots) Go Bad

Left unattended and unmanaged, grey bots can lead to a variety of issues, such as:

  • Slow page loading times, impacting user experience.
  • Strain on server resources, potentially causing crashes or downtime, and costing you money!
  • Distorted website analytics, as bot traffic is mistaken for human traffic.

Managing Grey Bots with Robots.txt

The robots.txt file is a simple text file that tells web crawlers which parts of your site they can or cannot access. You can use this file to manage bot behavior and protect your website from aggressive crawling. Some ways to manage grey bots with robots.txt include:

Disallowing specific bots: You can block specific bots from accessing your site by adding a "User-agent" and "Disallow" directive to your robots.txt file. For example:

User-agent: AhrefsBot
Disallow: /

Limiting crawl rate: You can request that bots slow down their crawling by adding a "Crawl-delay" directive:

User-agent: SEMrushBot
Crawl-delay: 10

Not all bots will listen to robots.txt. ScreamingFrog, for example, can be instructed to ignore robots.txt so that it crawls a site as quickly as possible. Naturally, you wouldn't want a competitor doing this to your site!
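
Because such crawlers still announce themselves in the User-Agent header by default, one fallback is to block them at the application or proxy level. Below is a minimal Python sketch of a user-agent check; the "Screaming Frog SEO Spider" string is assumed to be the tool's default user agent, and a determined operator can of course spoof it, which is where the bot management techniques in the next section come in.

# Hypothetical helper: block requests whose User-Agent matches a known
# crawler that ignores robots.txt. The substring list is an assumption;
# spoofed user agents will slip past this check.
BLOCKED_UA_SUBSTRINGS = ("Screaming Frog SEO Spider",)

def should_block(user_agent: str) -> bool:
    """Return True if the request's User-Agent matches a blocked crawler."""
    ua = user_agent.lower()
    return any(s.lower() in ua for s in BLOCKED_UA_SUBSTRINGS)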

Bot Management Tools

In addition to robots.txt, bot management tools (like those provided by Peakhour) can protect your website from abusive bots. Good bot management tools allow you to automatically block the majority of unwanted traffic using a combination of threat intelligence, fingerprinting techniques, reverse DNS verification, and header inspection.
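
As an illustration, reverse DNS verification is one of the simpler checks: look up the PTR record for the connecting IP, confirm the hostname belongs to the search engine's domain, then resolve that hostname forward and check it maps back to the same IP. Here is a minimal sketch using only the Python standard library; the suffix list and function name are illustrative assumptions, so check each vendor's documentation for the current hostnames.

import socket

# Hostname suffixes the major crawlers are expected to resolve to
# (assumed here for illustration; verify against vendor documentation).
VERIFIED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
}

def is_verified_crawler(ip: str, claimed_bot: str) -> bool:
    """True if ip's reverse DNS matches the claimed crawler and the
    forward lookup of that hostname resolves back to the same ip."""
    suffixes = VERIFIED_SUFFIXES.get(claimed_bot)
    if not suffixes:
        return False
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse (PTR) lookup
    except socket.herror:
        return False
    if not hostname.endswith(suffixes):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips

A request claiming to be Googlebot whose IP fails this check is almost certainly an impostor and can be blocked or challenged.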

Advanced techniques like rate limiting and machine learning can help weed out the most sophisticated bad bots.
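
For example, a simple per-IP sliding-window rate limit already catches clients that request pages far faster than any human could browse. The sketch below is illustrative only: the window and threshold values are assumptions, and in practice this logic usually lives in a proxy or bot management layer rather than in application code.

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120                 # illustrative threshold, not a recommendation

_recent = defaultdict(deque)       # ip -> timestamps of recent requests

def allow_request(ip: str) -> bool:
    """Return False once an IP exceeds MAX_REQUESTS in the last WINDOW_SECONDS."""
    now = time.monotonic()
    window = _recent[ip]
    # Drop timestamps that have fallen outside the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False
    window.append(now)
    return True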

Search Bots and Double Crawling

Search bots like Bingbot can sometimes blindly follow links and crawl the same page multiple times under different URL parameters. This double, triple, or worse crawling can lead to increased server load and inefficient indexing of your website's content. It is especially problematic on eCommerce sites that offer many ways of filtering their catalogue of products. We've seen Bing go haywire on a number of sites. Most recently it was issuing around 50,000 requests per day to the search function of a Magento 2 store, cycling through parameters; this dropped to 2-3k requests per day when fixed. On another busy OpenCart store, Bing was responsible for nearly half of all page requests (40k per day); configuring it to ignore parameters dropped this to around 4k per day.
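
Alongside the webmaster tools settings covered below, robots.txt can stop compliant crawlers from requesting parameterised URLs in the first place. As a sketch, assuming a Magento 2 store whose search lives under /catalogsearch/ (adjust the path and parameter names to match your own platform):

User-agent: *
Disallow: /catalogsearch/
Disallow: /*?q=

The wildcard rule relies on pattern matching that Googlebot and Bingbot both honour, and like any robots.txt directive it only helps against bots that respect the file.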

Configuring Search Bots to Ignore Query Parameters

To ensure that search bots crawl your site efficiently, you can configure them to ignore specific query parameters using the following methods:

Configuring Bing Webmaster Tools

Bing Webmaster Tools provides an option to specify URL parameters that should be ignored during the crawling process. To configure this setting, follow these steps:

  1. Sign in to your Bing Webmaster Tools account and select the website you want to manage.
  2. Navigate to the "Configure My Site" section and click on "URL Parameters."
  3. Click on "Add Parameter" and enter the parameter name you want Bingbot to ignore.
  4. Select "Ignore this parameter" from the dropdown menu and click on "Save."

By configuring Bing Webmaster Tools in this way, you can ensure that Bingbot will not double-crawl pages with specific URL parameters, reducing server load and improving the efficiency of the indexing process.

Managing Other Search Bots

For other search engines like Google, you can use their respective webmaster tools to manage URL parameters. In Google Search Console, follow these steps:

  1. Sign in to your Google Search Console account and select the property you want to manage.
  2. Navigate to the "Crawl" section and click on "URL Parameters."
  3. Click on "Add Parameter" and enter the parameter name you want Googlebot to ignore.
  4. Choose "No URLs" from the "Does this parameter change page content seen by the user?" dropdown menu.
  5. Click on "Save."

By specifying the parameters you want search bots to ignore, you can prevent double crawling and ensure a more efficient indexing process for your website.

Conclusion

When good or grey bots crawl too aggressively, they can unintentionally cause problems similar to those caused by malicious bots, including overloading servers, slowing down websites, and negatively impacting user experience. Monitoring your website traffic and server load, implementing thoughtful rules in your robots.txt, and properly using the webmaster tools of the major search engines to ensure efficient crawling can significantly improve your website's performance and lower your infrastructure costs.

If you need help managing traffic to your website then Peakhour can help.

