Crawling

Before a page can be indexed, and therefore appear within search results, it must be crawled by search engine crawlers such as Googlebot. There are many things to consider in order to get pages crawled and to ensure they adhere to the correct guidelines. These are covered within our Hangout Notes, along with further research and recommendations.

Algorithm Changes May Result in Changes to Crawl Rate

February 21, 2020 Source

The number of pages Google wants to crawl may change following an algorithm update, either because some pages are considered less important to show in search results or because of improvements to crawling optimization.


Specify Timezone Formats Consistently Across Site & Sitemaps

February 18, 2020 Source

Google is able to understand different timezone formats, for example, UTC vs GMT. However, it’s important to use one timezone format consistently across a site and its sitemaps to avoid confusing Google.
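Below is a minimal sketch of one way to keep <lastmod> values consistent when generating a sitemap, formatting every timestamp as a W3C datetime in UTC with an explicit offset. The helper name, URLs and timestamps are illustrative assumptions, not part of the original note.

```python
from datetime import datetime, timezone

# Illustrative helper: format every <lastmod> value the same way,
# here as a W3C datetime in UTC with an explicit +00:00 offset,
# so the site and its sitemaps never mix timezone notations.
def lastmod_utc(dt: datetime) -> str:
    return dt.astimezone(timezone.utc).isoformat(timespec="seconds")

# Hypothetical pages and their last-modified timestamps.
pages = {
    "https://www.example.com/": datetime(2020, 2, 17, 9, 30, tzinfo=timezone.utc),
    "https://www.example.com/blog/": datetime(2020, 2, 14, 16, 5, tzinfo=timezone.utc),
}

entries = "\n".join(
    f"  <url>\n    <loc>{loc}</loc>\n    <lastmod>{lastmod_utc(dt)}</lastmod>\n  </url>"
    for loc, dt in pages.items()
)

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    f"{entries}\n"
    "</urlset>"
)
print(sitemap)
```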


If a Robots.txt File Returns a Server Error for a Brief Period of Time Google Will Not Crawl Anything From the Site

January 31, 2020 Source

If a robots.txt file returns a server error for a brief period of time, Google will not crawl anything from the website until it is able to access the file and crawl normally again. During the period when it is blocked from reaching the file, Google assumes all URLs are blocked and will flag this in Google Search Console. You can identify when this has occurred by reviewing the robots.txt requests in your server logs and checking the response code and size returned for each request.
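As a starting point for that log review, here is a minimal sketch that filters an access log for robots.txt requests and prints the status code and response size of each one, flagging 5xx responses. It assumes the common/combined log format; the file name and regular expression are illustrative and will need adjusting to your server's actual log layout.

```python
import re

# Assumes the common/combined access log format; adjust the pattern
# to match your server's actual log layout.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def robots_txt_requests(log_path: str):
    """Yield (timestamp, status, size) for every robots.txt request in the log."""
    with open(log_path) as handle:
        for line in handle:
            match = LOG_LINE.match(line)
            if match and match.group("path") == "/robots.txt":
                yield match.group("time"), match.group("status"), match.group("size")

# Example usage with a hypothetical log file: flag any 5xx responses,
# which are the periods where Google would have paused crawling.
for time, status, size in robots_txt_requests("access.log"):
    marker = "  <-- server error" if status.startswith("5") else ""
    print(f"{time}  {status}  {size} bytes{marker}")
```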


It is Normal for Google to Occasionally Crawl Old URLs

January 31, 2020 Source

Due to its rendering processes, Google will occasionally re-crawl old URLs in order to check how they are set up. You may see this within your log files, but it is normal and will not cause any problems.


Having a Reasonable Amount of HTML Comments Has No Effect on SEO

January 24, 2020 Source

Comments within the HTML of a page have no effect on SEO unless there is a very large amount of them, as this can make it harder for Google to identify where the main content is and may increase the size and slow the loading of the page. However, John confirmed he has never come across a page where HTML comments have been a problem.


Upper Limit For Recrawling Pages is Six Months

January 22, 2020 Source

Google tends to recrawl pages at least once every six months, so six months acts as a rough upper limit on the time between crawls.


Google is Able to Display Structured Data Results as Soon as the Page Has Been Re-crawled

January 10, 2020 Source

Once structured data has been added to a page, Google is able to display results based on that structured data the next time the page is crawled and indexed.
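For illustration, here is a minimal sketch of what adding structured data to a page can look like, building a schema.org JSON-LD script tag for a hypothetical article. The values and field choices are placeholders, not a recommendation from the hangout.

```python
import json

# Hypothetical structured data for an article page, expressed as
# schema.org JSON-LD; the values are placeholders.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example Article Title",
    "datePublished": "2020-01-10",
    "author": {"@type": "Person", "name": "Jane Doe"},
}

# Embed this markup in the page's HTML; Google can pick it up the next
# time the page is crawled and indexed.
script_tag = (
    '<script type="application/ld+json">\n'
    + json.dumps(article, indent=2)
    + "\n</script>"
)
print(script_tag)
```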


Google Still Respects the Meta Robots Unavailable After Directive

January 10, 2020 Source

Google still respects the meta robots unavailable_after directive, which is used to specify a date after which a page will no longer be available. John explained that their systems are likely to recrawl the page around the specified date in order to make sure they are not removing pages from the index that are still available.
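As an illustration, the sketch below generates an unavailable_after meta robots tag for a hypothetical expiry date. The ISO 8601 date format is an assumption based on the widely adopted formats Google accepts for this directive; check Google's current documentation for supported formats.

```python
from datetime import datetime, timezone

# Hypothetical expiry date for a page, e.g. the end of a promotion.
expiry = datetime(2020, 6, 30, 23, 59, tzinfo=timezone.utc)

# unavailable_after accepts widely adopted date formats; ISO 8601 is
# used here as one option (an assumption, not confirmed in the hangout).
meta_tag = (
    f'<meta name="robots" '
    f'content="unavailable_after: {expiry.isoformat(timespec="minutes")}">'
)
print(meta_tag)
```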


Google Doesn’t Crawl Any URLs From a Hostname When Robots.txt Temporarily 503s

December 13, 2019 Source

If Google encounters a 503 when crawling a robots.txt file, it will temporarily not crawl any URLs on that hostname.
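Since a temporary 503 on robots.txt pauses crawling for the whole hostname, it can be worth monitoring the file's status yourself. Below is a minimal monitoring sketch using a hypothetical hostname; the URL and threshold logic are assumptions for illustration.

```python
import urllib.error
import urllib.request

# Hypothetical hostname to monitor.
ROBOTS_URL = "https://www.example.com/robots.txt"

def check_robots_txt(url: str) -> int:
    """Return the HTTP status code returned for the robots.txt URL."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses raise HTTPError; the status code is still useful.
        return error.code

status = check_robots_txt(ROBOTS_URL)
if 500 <= status < 600:
    print(f"robots.txt returned {status}: Google will pause crawling this host")
else:
    print(f"robots.txt returned {status}")
```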


Related Topics

Indexing, Crawl Budget, Crawl Errors, Crawl Rate, Disallow, Sitemaps, Last Modified, Nofollow, Noindex, RSS, Canonicalization, Fetch and Render