Following on from our introduction to the crawling process used to discover new pages, it is important to understand the main rules and conditions around crawling that search engines incorporate as part of their algorithms. After reading this you will have an understanding of crawl budget, demand and rate.

What is crawl budget?

Crawl budget is a function of crawl rate and crawl demand. The Google Webmaster Central Blog defines crawl budget as the following:

“Taking crawl rate and crawl demand together we define crawl budget as the number of URLs Googlebot can and wants to crawl.”

In other words, crawl budget is the number of URLs on a website that a search engine will crawl in a given time period.

Why is crawl budget limited?

Crawl budget is constrained in order to ensure that a website’s server is not overloaded with too many concurrent connections or too much demand for server resources which could adversely impact the experience of the site’s visitors.

Every IP (web host) has a maximum number of connections that it can handle. Many websites can be hosted on a shared server, so if a website shares a server or IP with several other websites it may have a lower crawl budget than a website hosted on a dedicated server.

Equally, a website which is hosted on a cluster of dedicated servers which responds quickly will typically have a higher crawl budget than a website which is hosted on a single server and begins to respond more slowly when there is a high amount of traffic.

It is worth bearing in mind that just because a website responds quickly and has the resources to sustain a high crawl rate, it doesn’t mean that search engines will want to dedicate a high amount of their own resources if the content is not considered important enough.

What is crawl rate and crawl rate limit?

Crawl rate is defined as the number of URLs per second that search engines will attempt to crawl on a site. This is normally proportional to the number of active HTTP connections that they choose to open simultaneously.

Crawl rate limit can be defined as the maximum fetching rate that can be achieved without degrading the experience of visitors to a site.

There are a couple of factors that can cause fluctuations in crawl rate. These include:

  • Crawl health – Faster responding websites may see increases in crawl rate, whereas slower websites may see reductions in crawl rate.
  • Crawl rate limit – Webmasters can limit the rate at which Google crawls a website in Google Search Console by going into Settings and navigating to the Crawl Rate section.

    What is crawl demand?

    In addition to crawl health and crawl rate limits specified by the webmaster, crawl rate will vary from page to page based on the demand for a specific page.

    The demand from users for previously indexed pages impacts how often a search engine crawls those pages. Pages that are more popular will likely be crawled more often than pages that are rarely visited, or those that are not updated or hold little value. New or important pages are normally prioritised over old pages which do not change often.

    Managing crawl budget

    Issues with larger sites

    Managing crawl budget is particularly important for larger sites with many URLs and a high turnover of content.

    Larger sites may encounter issues with getting new pages which have never been crawled and indexed to appear in a search engine’s results pages. It may also be the case that pages that have already been indexed take longer to be re-crawled, meaning that changes take longer to be detected and then updated in the index.

    Issues with low value URLs

    Another important part of managing crawl budget is dealing with low value URLs, which can consume a large amount of crawl budget. This can be problematic because crawl budget may be wasted on low value URLs while higher value URLs are crawled less often than you would like.

    Examples of low value URLs which may consume crawl budget are:

  • URLs with tracking parameters and session identifiers
  • On-site duplicate content
  • Soft error pages, such as discontinued products
  • Multi-facet categorisation
  • Site search results pages

    When/why/how can I influence crawl budget?

    Most search engines will provide you with statistics about the number of pages crawled per day within their webmaster interfaces (such as Google Search Console or Bing Webmaster Tools).

    Alternatively you can analyse server log files, which record every time a page is requested by a search engine and provide the most accurate data on which URLs are crawled and how frequently.
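
As a rough illustration of the log file approach, the Python sketch below counts requests per URL for lines whose user agent mentions Googlebot. It assumes the widely used "combined" access log format; the sample lines, paths and the `count_bot_hits` helper are hypothetical. User agent strings can be spoofed, so a real analysis would also need to verify that the requests genuinely come from the search engine.

```python
import re
from collections import Counter

# Matches the request, status and user-agent fields of a "combined" format
# access log line (an assumption; adjust to your server's log format).
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def count_bot_hits(lines, bot_token="Googlebot"):
    """Count requests per URL path for user agents containing bot_token."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and bot_token.lower() in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits


# Hypothetical sample log lines, for illustration only.
BOT_AGENT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LOG = [
    f'192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /products/shoes HTTP/1.1" 200 5120 "-" "{BOT_AGENT}"',
    f'192.0.2.10 - - [10/Oct/2023:13:56:02 +0000] "GET /search?q=shoes HTTP/1.1" 200 2048 "-" "{BOT_AGENT}"',
    '203.0.113.7 - - [10/Oct/2023:13:56:10 +0000] "GET /products/shoes HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    f'192.0.2.10 - - [10/Oct/2023:14:01:44 +0000] "GET /products/shoes HTTP/1.1" 304 0 "-" "{BOT_AGENT}"',
]

if __name__ == "__main__":
    # List the most frequently crawled paths first.
    for path, count in count_bot_hits(SAMPLE_LOG).most_common():
        print(f"{count:>3}  {path}")
```

Sorting the counts like this makes it easier to spot low value URLs, such as site search results, that attract a disproportionate share of crawler requests.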

    Do all websites need to consider crawl budget?

    Managing crawl budget isn’t something the majority of websites need to worry about, because a site with fewer than a few thousand URLs can usually be crawled, new pages included, within a day. This means that crawl budget isn’t something that demands attention for smaller sites.

    Influencing crawl budget

    Managing crawl activity is more of a consideration for larger sites and those that auto-generate content based on URL parameters.

    So what can large sites do to influence the crawl activity by search engine bots to ensure their high value pages are crawled regularly?

    Ensuring high priority pages are accessible to crawlers

    Large sites should ensure the .htaccess and robots.txt files don’t prevent crawlers from accessing high priority pages on the website. Additionally, web crawlers should also be able to crawl CSS and JavaScript files.

    Disallowing pages not to be indexed

    Regardless of the size of a site, there are always going to be pages that you will want to disallow from search engine indexes. A few examples include:

  • Duplicate or near duplicate pages – Pages that present predominantly duplicate content should be disallowed.
  • Dynamically generated URLs – URLs such as onsite search results pages should also be disallowed.
  • Thin or low value content – Pages with little content or little valuable content are also good candidates for being excluded from indexes.

    Robots.txt

    The robots.txt file is used to provide instructions to web crawlers using the Robots Exclusion Protocol. Disallowing directories and pages that should not be crawled in the robots.txt file is a good method for freeing up valuable crawl budget on large sites.
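
One way to sanity-check disallow rules before deploying them is to run them through Python's standard `urllib.robotparser` module, as in the minimal sketch below. The rules, paths and example.com URLs are hypothetical. Python's parser uses simple path-prefix matching, so its verdicts may not match every search engine's own interpretation of the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration; the paths are assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /checkout/
"""


def build_parser(robots_txt):
    """Parse robots.txt content from a string rather than fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for url in (
        "https://www.example.com/products/shoes",
        "https://www.example.com/search?q=shoes",
        "https://www.example.com/checkout/basket",
    ):
        allowed = parser.can_fetch("Googlebot", url)
        print(f"{'allowed' if allowed else 'blocked':<8} {url}")
```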

    Noindex robots meta tag & X-Robots-Tag

    Robots.txt disallow instructions do not guarantee that a page will not be crawled and shown in the search results. Search engines use other information, such as internal links, which may guide web crawlers toward a page which should ideally be omitted.

    To prevent most search engine crawlers from indexing a page, the following meta tag should be placed in the <head> section of the page.

    <meta name="robots" content="noindex">

    An alternative to the noindex robots meta tag is to return an X-Robots-Tag: noindex HTTP header in the response to a page request.

    HTTP/1.1 200 OK
    Date: Tue, 25 May 2010 21:42:43 GMT
    (…)
    X-Robots-Tag: noindex
    (…)
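
When auditing a site, it can help to check both signals in one place. The Python sketch below reports whether a response carries a noindex directive in either the X-Robots-Tag header or a robots meta tag. The `is_noindex` helper and `RobotsMetaParser` class are hypothetical names. It is a simplification that ignores user-agent-specific forms such as `googlebot: noindex` in the header and meta tags named after a specific crawler.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def is_noindex(headers, html):
    """Return True if the header or a meta tag carries a noindex directive."""
    header_value = ""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = {part.strip().lower() for part in header_value.split(",")}
    meta = RobotsMetaParser()
    meta.feed(html)
    return "noindex" in header_directives or "noindex" in meta.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex"></head></html>'
    print(is_noindex({"Content-Type": "text/html"}, page))
    print(is_noindex({"X-Robots-Tag": "noindex"}, "<html></html>"))
```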

    Managing parameter/URL sprawl

    A common cause of crawl budget wastage is poor management of parameters and URLs, known as URL sprawl. The best strategy to avoid URL sprawl on a website is to design it so that URLs are only created for unique and useful pages.

    If there is already an issue with URL sprawl on a website, there are several steps that should be taken to address this:

  • Stop using useless parameters – These are parameters that do not make meaningful changes to the content on a page and could include session IDs, tracking parameters and sorting parameters.
  • Uniform casing – Ensure that all URLs share the same casing, i.e. all lower case or all camel case.
  • Trailing slashes – Check that all URLs follow the same trailing slash rule, i.e. either every URL has a trailing slash or none does.
  • Redirects and links – All URLs which don’t follow the above rules should be redirected to their canonical version. You should also ensure all links are updated to point to the canonical versions. Additionally, you should use rel="nofollow" on links to URLs that don’t follow these rules, e.g. links to pages with sorting parameters.
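
The rules above can be sketched as a single normalisation function whose output is the canonical URL to redirect and link to. This is a minimal Python sketch under assumed site rules: the `IGNORED_PARAMS` set, the `canonical_url` helper and the example.com URLs are hypothetical, and lower-casing paths is only safe if the server does not serve different content for differently cased paths.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters treated as not changing the page content.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "utm_campaign", "sort"}


def canonical_url(url, trailing_slash=True):
    """Return a canonical form of url under the assumed site rules."""
    parts = urlsplit(url)
    # Drop parameters that do not meaningfully change the content.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    # Enforce lower-case paths and a single trailing-slash rule.
    path = parts.path.lower().rstrip("/")
    if trailing_slash or not path:
        path += "/"
    # The fragment is discarded because it is not sent to the server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


if __name__ == "__main__":
    messy = "https://www.example.com/Shoes/Red?utm_source=news&colour=red&sessionid=abc"
    clean = canonical_url(messy)
    # A mismatch means the requested URL should redirect to the canonical one.
    print(f"{messy}\n  -> {clean}")
```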

    Nofollow links

    Using rel="nofollow" tells search engines not to pass link equity via that link to the linked URL. There is good evidence to suggest that Googlebot will honour the nofollow attribute and not follow the link to crawl and discover content. This means that nofollow can be used by webmasters to moderate crawl activity within a website.

    It should also be noted that external links which do not use the rel="nofollow" attribute will provide a pathway for search engine bots to crawl the linked resource.

    Fix broken links

    If there are broken links (external and internal) on a site, this will expend crawl budget unnecessarily. The number of broken links should be monitored regularly on a site and kept to an absolute minimum.

    Avoid unnecessary redirects

    Unnecessary redirects can often occur after a page’s URL has been changed, with a 301 redirect being implemented from the old URL to the new one. However, other onsite links may be neglected and not updated to reflect new URLs, resulting in unnecessary redirects.

    Unnecessary redirects can delay crawling and indexation of the target URL, as well as impacting user experience by increasing load time.
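
To find onsite links that still point at redirected URLs, one option is to resolve each link against a map of known redirects and rewrite it to the final target. The Python sketch below assumes such a map is available, for example exported from server configuration. The `resolve_redirect` helper, the `REDIRECTS` mapping and its paths are hypothetical.

```python
def resolve_redirect(url, redirects, max_hops=10):
    """Follow a mapping of old URL -> new URL and return (final_url, hops)."""
    seen = {url}
    hops = 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        # Guard against loops and unreasonably long chains.
        if url in seen or hops > max_hops:
            raise ValueError(f"redirect loop or overly long chain at {url}")
        seen.add(url)
    return url, hops


# Hypothetical redirect map for illustration only.
REDIRECTS = {
    "/old-shoes": "/shoes",
    "/shoes": "/footwear/shoes/",
}

if __name__ == "__main__":
    for link in ("/old-shoes", "/shoes", "/footwear/shoes/"):
        final, hops = resolve_redirect(link, REDIRECTS)
        print(f"{link} -> {final} ({hops} redirect(s))")
```

Any link that resolves with one or more hops is a candidate for updating to point directly at its final URL.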
