
Last year we wrote about the problem of excessive crawling from search engine spiders. Search engines such as Google and Bing aim to index as much content as possible. For ecommerce sites, this often means indexing pages with query string parameters used for sorting, filtering, or pagination. Those parameters help users navigate the site, but they can cause a few predictable crawler problems:

  • Over-Crawling: Search engines may spend too much time crawling similar pages with different parameters, wasting crawl budget.
  • Duplicate Content: Pages with different parameters can be treated as duplicate content, weakening SEO performance.
  • Server Load: Excessive crawling can increase server load, slow down your site, and affect user experience. Search engines typically account for 30-50% of page requests to an ecommerce store. Managing their crawling effectively can have a material effect on site speed and server spend.

Another common cause of over-crawling is internal search result pages being indexed.

In our previous article we mentioned using the webmaster tools provided by Google and Microsoft to manage crawler behaviour by adding ignored parameters. Since then, both tools have been updated and no longer allow you to add parameters to ignore during a crawl.

Differences in Crawling and Indexing

Search engines maintain an 'index' of web pages. Pages in this index are what appear in search results. To maintain the index, the search engine crawls a website to 'discover' new content and keep existing entries up to date. Webmasters can control what gets indexed with tags or headers in their web pages. These include:

  • Canonical tags can be used to indicate the preferred version of a page. This helps consolidate link 'juice' and tell the search engine which URL to index.
  • Noindex tags can be used to prevent specific pages from being indexed. This is useful for thank-you pages, admin pages or any content you don't want to appear in search results.
  • Nofollow links can be used to indicate to a search engine not to pass on SEO value to the linked page.

However, controlling what does or does not get indexed does not prevent content from being crawled. The way to do that is via the robots.txt file. You may be familiar with the Disallow directive in the robots.txt file, but you can also use wildcards to prevent crawling of URL parameters.

An example...

Consider an ecommerce store that has a category page which can then be customised with the following parameters:

    orderBy
    colors
    brands
    page
    results

These may appear in any order, and the combinations can result in hundreds or even thousands of variations of essentially the same page. Google is fairly smart when presented with this scenario, but Bing can crawl very aggressively and likes to try everything. In our example above, we may want to stop crawling everything except the page number, in which case an effective way to control crawler behaviour would be:

    User-agent: *
    Disallow: /*?*orderBy=*
    Disallow: /*?*colors=*
    Disallow: /*?*brands=*
    Disallow: /*?*results=*

We need one Disallow rule per parameter because the parameters can appear in any order, so a single pattern can't cover them all. Including the ? in each pattern ensures the parameter name is matched only in the query string, not in the main URL path. Together, these rules stop crawlers from wasting crawl budget and putting unnecessary load on server resources.
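Before deploying rules like these, it can help to check which URLs they would actually block. The Python sketch below is a rough approximation of wildcard matching as the major search engines document it: * matches any run of characters and a trailing $ anchors the end of the URL. The helper names (rule_to_regex, is_blocked) and the sample URLs are our own, made up for illustration. It ignores Allow rules and rule precedence, so treat it as a quick sanity check rather than a full robots.txt parser.

```python
import re

# The Disallow patterns from the example above.
DISALLOW_RULES = [
    "/*?*orderBy=*",
    "/*?*colors=*",
    "/*?*brands=*",
    "/*?*results=*",
]


def rule_to_regex(rule):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is matched literally, starting at the beginning of the path.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' where each '*' was.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path_and_query, rules=DISALLOW_RULES):
    """Return True if any Disallow pattern matches the path plus query string."""
    return any(rule_to_regex(rule).match(path_and_query) for rule in rules)


if __name__ == "__main__":
    # Hypothetical URLs for a category page.
    for url in [
        "/shoes?page=2",
        "/shoes?page=2&orderBy=price",
        "/shoes?brands=acme&colors=red",
        "/orderBy=price/shoes",
    ]:
        print(url, "->", "blocked" if is_blocked(url) else "crawlable")
```

In this sketch the page-only URL stays crawlable, while any URL carrying one of the four parameters in its query string is blocked regardless of order. The last URL is not blocked because the parameter name appears in the path rather than after the ?.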

Final Thoughts

Search engines can often make up 30-50% of the overall page requests to a website. Managing their behaviour helps maximise useful crawling and minimise server utilisation. Keep an eye on your access logs for unwanted behaviour, and use robots.txt where it gives you the right level of control.
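
As a starting point for watching your access logs, the Python sketch below estimates what share of requests comes from each crawler. It assumes the common "combined" log format, where the user agent is the last quoted field on each line. The crawler markers and the function name crawler_share are our own illustrative choices, so adjust them for the bots you actually see. User agents can be spoofed, so read the numbers as a rough guide.

```python
import re
from collections import Counter

# Illustrative markers to look for in the user agent (lower-cased).
CRAWLER_MARKERS = {"Googlebot": "googlebot", "Bingbot": "bingbot"}

# Assumes the user agent is the last double-quoted field on the line.
USER_AGENT = re.compile(r'"([^"]*)"\s*$')


def crawler_share(lines):
    """Return the fraction of requests per crawler, plus 'other'."""
    counts = Counter()
    total = 0
    for line in lines:
        match = USER_AGENT.search(line)
        if not match:
            continue  # skip lines that don't look like combined-format entries
        total += 1
        agent = match.group(1).lower()
        for name, marker in CRAWLER_MARKERS.items():
            if marker in agent:
                counts[name] += 1
                break
        else:
            counts["other"] += 1
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    import sys

    # Usage: python crawler_share.py < access.log
    for name, share in sorted(crawler_share(sys.stdin).items()):
        print(f"{name}: {share:.1%}")
```

Running this before and after a robots.txt change gives a simple way to see whether the new rules are reducing crawler traffic.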