Crawling

Before a page can be indexed, and therefore appear within search results, it must be crawled by search engine crawlers, like Googlebot. There are many things to consider in order to get pages crawled and ensure they adhere to the correct guidelines. These are covered within our Hangout Notes below, along with further research and recommendations.

Google Doesn’t Crawl Any URLs From a Hostname When Robots.txt Temporarily Returns a 503

December 13, 2019 Source

If Google encounters a 503 when crawling a robots.txt file, it will temporarily not crawl any URLs on that hostname.
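A crawler that follows the same convention would check the robots.txt response before fetching anything else from the host. Below is a minimal sketch of that check, assuming the Python `requests` library and a placeholder hostname:

```python
import requests

def robots_txt_allows_crawling(host: str) -> bool:
    """Mirror the behaviour described above: treat a 503 on robots.txt
    as a signal to pause crawling the whole hostname for now."""
    response = requests.get(f"https://{host}/robots.txt", timeout=10)
    if response.status_code == 503:
        # Server error on robots.txt: back off and retry the host later.
        return False
    # A 200 (parse the rules) or a 404 (no restrictions) means crawling can continue.
    return True

if __name__ == "__main__":
    # "example.com" is a placeholder host purely for illustration.
    print(robots_txt_allows_crawling("example.com"))
```

Once the robots.txt file returns a 200 (or a 404) again, normal crawling of the hostname can resume.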


Google May Crawl More Frequently if it Detects Site Structure Has Changed

December 10, 2019 Source

If you remove a large number of URLs, causing Google to crawl a lot of 404 pages, it may take this as a signal that your site structure has changed. This may lead to Google crawling the site more frequently in order to understand the changes.


Google May Still Crawl Parts of a Site With Desktop Crawler

November 15, 2019 Source

Even with the shift to mobile-first indexing, Google may still crawl parts of a site with the desktop crawler. John explained that this will not impact the site as long as things are working well on mobile.


Use View Source or Inspect Element to Ensure Hidden Content is Readily Accessible in the HTML

November 1, 2019 Source

If you have content hidden behind a tab or accordion, John recommends using the view source or inspect element tool to ensure the content is in the HTML by default. Content pre-loaded in the HTML will be treated as normal content on the page; however, if it requires an interaction to load, Google will not be able to crawl or index it.
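As a rough check outside the browser, you can also fetch the raw HTML and search for a snippet of the hidden content. This is a minimal sketch, assuming the Python `requests` library and a placeholder URL and text snippet:

```python
import requests

def content_in_initial_html(url: str, snippet: str) -> bool:
    """Return True if the text snippet is present in the raw HTML response,
    i.e. before any JavaScript or user interaction runs."""
    html = requests.get(url, timeout=10).text
    return snippet in html

if __name__ == "__main__":
    # Placeholder URL and accordion text purely for illustration.
    print(content_in_initial_html("https://example.com/faq", "Our returns policy"))
```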


404 or 410 Status Codes Will Not Impact a Website’s Rankings

November 1, 2019 Source

If Google identifies 404 or 410 pages on a site, it will continue to crawl these pages in case anything changes, but will gradually reduce their crawl frequency to concentrate more on the pages which return 200 status codes.


Last Modification Dates Important For Recrawling Changed Pages on Large Sites

October 29, 2019 Source

Including last modification dates on large sites can be important because they help Google prioritize recrawling changed pages which might otherwise take much longer to be revisited.
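Last modification dates are typically exposed through the `<lastmod>` field of an XML sitemap (or the Last-Modified HTTP header). As an illustration, here is a minimal Python sketch that builds sitemap entries with placeholder URLs and dates:

```python
from datetime import date

def sitemap_xml(entries: list[tuple[str, date]]) -> str:
    """Build a minimal XML sitemap where each URL carries a <lastmod> date."""
    urls = "\n".join(
        f"  <url>\n    <loc>{loc}</loc>\n"
        f"    <lastmod>{modified.isoformat()}</lastmod>\n  </url>"
        for loc, modified in entries
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{urls}\n</urlset>"
    )

if __name__ == "__main__":
    # Placeholder URLs and dates purely for illustration.
    print(sitemap_xml([
        ("https://example.com/products/widget", date(2019, 10, 28)),
        ("https://example.com/blog/launch", date(2019, 9, 2)),
    ]))
```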


Google Has a Separate User Agent For Crawling Sitemaps & For GSC Verification

October 1, 2019 Source

Google has a separate user agent that fetches sitemap files, as well as one that crawls for GSC verification. John recommends making sure you are not blocking either of these.


Blocking Googlebot’s IP is the Best Way to Prevent Google From Crawling Your Site While Allowing Other Tools to Access It

October 1, 2019 Source

If you want to block Googlebot from crawling a staging site, but want to allow other crawling tools access, John recommends whitelisting the IPs of the users and tools that need to view the site and disallowing Googlebot. This is because Google may crawl pages it finds on a site even if they have a noindex tag, or index pages without crawling them even if they are blocked in robots.txt.
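At the application layer, this could be implemented as an IP allowlist that rejects every other request, Googlebot included. Below is a minimal Flask sketch with placeholder IP addresses; in practice the same rule is often applied at the web server or firewall level instead:

```python
from flask import Flask, abort, request

app = Flask(__name__)

# Placeholder addresses: your office network and a crawling tool you want to allow.
ALLOWED_IPS = {"203.0.113.10", "198.51.100.25"}

@app.before_request
def restrict_to_allowlist():
    # Any request from an address not on the allowlist, including Googlebot, gets a 403.
    if request.remote_addr not in ALLOWED_IPS:
        abort(403)

@app.route("/")
def home():
    return "Staging site"
```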


Ensure Google is Able to Crawl All Pages Involved Within Infinite Scroll

September 27, 2019 Source

When implementing infinite scroll, ensure Google is able to reach all of the pages involved. John recommends linking to all of the pages individually through a pagination setup, to ensure each page can be crawled.
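One common pattern is to back the infinite scroll with ordinary paginated URLs connected by plain links, so each batch of content remains reachable without JavaScript. This is a minimal Flask sketch with placeholder content:

```python
from flask import Flask, request

app = Flask(__name__)

# Placeholder content standing in for a real product or article listing.
ITEMS = [f"Item {i}" for i in range(1, 101)]
PAGE_SIZE = 10

@app.route("/articles")
def articles():
    page = max(int(request.args.get("page", 1)), 1)
    start = (page - 1) * PAGE_SIZE
    listing = "".join(f"<li>{item}</li>" for item in ITEMS[start:start + PAGE_SIZE])
    # Plain <a> links make every page reachable by a crawler,
    # even if infinite scroll loads the same content for users.
    links = ""
    if page > 1:
        links += f'<a href="/articles?page={page - 1}">Previous page</a> '
    if start + PAGE_SIZE < len(ITEMS):
        links += f'<a href="/articles?page={page + 1}">Next page</a>'
    return f"<ul>{listing}</ul><nav>{links}</nav>"
```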


Related Topics

Indexing, Crawl Budget, Crawl Errors, Crawl Rate, Disallow, Sitemaps, Last Modified, Nofollow, Noindex, RSS, Canonicalization, Fetch and Render