Crawling

Before a page can be indexed, and therefore appear in search results, it must be crawled by a search engine crawler such as Googlebot. There are many things to consider to get pages crawled and to make sure they adhere to the relevant guidelines. Our Hangout Notes cover these, along with further research and recommendations.

There are several possible reasons a page may be crawled but not indexed

March 17, 2022 Source

John explains that pages appearing as ‘crawled, not indexed’ in Google Search Console (GSC) should be relatively infrequent. The most common scenarios are that Google crawls a page and receives an error code, or crawls a page and finds a noindex tag. Alternatively, Google might choose not to index content after crawling it if it finds a duplicate of the page elsewhere. Content quality may also play a role, but Google is more likely to avoid crawling pages altogether if it believes there is a clear quality issue on the site.
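As a rough way to triage such pages yourself, the checks described above can be sketched in Python. This is a minimal sketch under stated assumptions, not how Google evaluates pages: the function name `indexing_blockers` is hypothetical, the page is assumed to have been fetched already, and a canonical link pointing to a different URL is used only as a crude stand-in for "a duplicate exists elsewhere".

```python
from html.parser import HTMLParser


class _HeadSignals(HTMLParser):
    """Collects robots meta directives and the canonical URL from HTML."""

    def __init__(self):
        super().__init__()
        self.robots = []
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = {k: (v or "") for k, v in attrs}
        if tag == "meta" and attrs.get("name", "").lower() in ("robots", "googlebot"):
            content = attrs.get("content", "")
            self.robots += [d.strip().lower() for d in content.split(",")]
        elif tag == "link" and "canonical" in attrs.get("rel", "").lower().split():
            self.canonical = attrs.get("href")


def indexing_blockers(url, status_code, headers, html):
    """Return likely reasons why a fetched page would not be indexed.

    Hypothetical helper: `headers` is a plain dict of response headers.
    """
    reasons = []
    if status_code >= 400:
        reasons.append(f"error status {status_code}")

    # Header names are case-insensitive, so normalise them first.
    lowered = {k.lower(): v for k, v in headers.items()}
    signals = _HeadSignals()
    signals.feed(html)
    if "noindex" in lowered.get("x-robots-tag", "").lower() or "noindex" in signals.robots:
        reasons.append("noindex directive")

    # Assumption: a canonical pointing elsewhere hints at a duplicate.
    if signals.canonical and signals.canonical != url:
        reasons.append(f"canonical points to {signals.canonical}")
    return reasons
```

An empty list means none of these simple checks fired; it does not mean the page will be indexed.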


How does Google handle infinite scrolling? Well, it depends…

March 17, 2022 Source

One user asked whether Googlebot is advanced enough yet to handle infinite scrolling. John explains that Google renders pages using a fairly tall viewport, which usually means that some amount of infinite scrolling is triggered. However, it all depends on the implementation. The best way to check is to run the page through the URL Inspection tool in GSC to see what Google is actually picking up.


Robots.txt file size doesn’t impact SEO, but smaller files are recommended

January 27, 2022 Source

John confirmed that the size of a website’s robots.txt file has no direct impact on SEO. He did, however, point out that larger files can be more difficult to maintain, which may in turn make it harder to spot errors when they arise.

Keeping your robots.txt file to a manageable size is therefore recommended where possible. John also stated that there’s no SEO benefit to linking to sitemaps from robots.txt. As long as Google can find them, it’s perfectly fine to just submit your sitemaps to GSC (although we should caveat that linking to sitemaps from robots.txt is a good way to ensure that other search engines and crawlers can find them).
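To see which sitemaps a robots.txt file exposes to crawlers in general, Python's standard `urllib.robotparser` can list them. The sketch below assumes Python 3.8 or later (for `site_maps()`); the helper name `sitemaps_in_robots` and the example.com URLs are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content, not taken from a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/

Sitemap: https://www.example.com/sitemap.xml
"""


def sitemaps_in_robots(robots_txt):
    """Return the sitemap URLs declared in a robots.txt string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []
```

An empty result is not a problem for Google if the sitemaps are submitted in GSC, but other crawlers would have no pointer to them.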


Regularly changing image URLs can impact Image Search

January 10, 2022 Source

A question was asked about whether query strings for cache validation at the end of image URLs would impact SEO. John replied that it wouldn’t affect SEO but explained that it’s not ideal to regularly change image URLs as images are recrawled and reprocessed less frequently than normal HTML pages.

Regularly changing image URLs means that it takes Google longer to rediscover them and add them to the image index. He specifically advised against changing image URLs very frequently, for example by adding a session ID or today’s date. In these cases the URLs would likely change more often than Google reprocesses them, so the images would not be indexed. If Image Search is important for your website, avoid regular image URL changes where possible.
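One way to keep a cache-validation query string without changing the URL on every request or every day is to derive it from the image content, so the URL changes only when the image itself does. This is a suggestion of ours rather than something John described; the function name and the `v` parameter are hypothetical.

```python
import hashlib


def versioned_image_url(base_url, image_bytes):
    """Append a content-derived version parameter to an image URL.

    The URL stays stable while the image bytes are unchanged, unlike a
    session ID or today's date, which change regardless of the content.
    """
    digest = hashlib.sha256(image_bytes).hexdigest()[:12]
    return f"{base_url}?v={digest}"
```

Calling it twice with the same bytes yields the same URL, so crawlers keep seeing a URL they have already processed.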


Crawl rate is not affected by a large number of 304 responses

January 10, 2022 Source

A question was asked about whether a large number of 304 responses could affect crawling. John replied that when Googlebot encounters a 304, it can reuse that request to crawl something else on the website, so the crawl budget is not affected. If most pages on a website return a 304, the crawl rate would not be reduced; the focus would simply shift to the pages where Google sees updates happening.
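For context, a 304 is the server's answer to a conditional request: the client sends a validator for the copy it already holds, and the server confirms that copy is still current. The sketch below shows the general HTTP mechanism using ETags, not Googlebot's internals; `conditional_status` is a hypothetical helper and `request_headers` is assumed to be a plain dict.

```python
def _strip_weak(tag):
    """Drop the weak-validator prefix so W/"x" compares equal to "x"."""
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(request_headers, current_etag):
    """Return 304 if the client's cached copy is still valid, else 200."""
    header = request_headers.get("If-None-Match", "")
    candidates = [_strip_weak(t.strip()) for t in header.split(",") if t.strip()]
    if "*" in candidates or _strip_weak(current_etag) in candidates:
        return 304  # nothing changed; no body needs to be sent
    return 200  # send the full, updated response
```

A 304 carries no body, which is why an unchanged page is cheap for both the server and the crawler.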


Blocking Googlebot in robots.txt does not affect AdsBot

January 10, 2022 Source

A participant found that Googlebot was crawling their ad landing pages more than their normal pages. They asked whether they could block Googlebot via robots.txt and whether doing so would impact their ad pages. John responded that blocking the ad landing pages for Googlebot is fine, but advised not blocking AdsBot, as it is used to perform quality checks on those landing pages. He clarified that AdsBot doesn’t follow the normal robots.txt directives: to block it, its specific user agents must be named explicitly in the robots.txt file. Therefore, blocking only Googlebot as suggested would still leave AdsBot with access to those landing pages.
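The difference between naming only Googlebot and naming AdsBot explicitly can be checked with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the `/landing/` path and example.com URL are made up, `AdsBot-Google` is used as the AdsBot user agent token, and the examples deliberately avoid a `User-agent: *` group because the standard-library parser applies wildcard groups to every agent, which is not how AdsBot is described above.

```python
from urllib.robotparser import RobotFileParser

# Blocks the landing pages for Googlebot only.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow: /landing/
"""

# Names AdsBot explicitly as well, which would block its quality checks.
BOTH_NAMED = GOOGLEBOT_ONLY + """
User-agent: AdsBot-Google
Disallow: /landing/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With `GOOGLEBOT_ONLY`, `allowed(..., "Googlebot", ...)` is false for a landing page URL while `allowed(..., "AdsBot-Google", ...)` stays true; with `BOTH_NAMED`, both are false.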


Having a high ratio of ‘noindex’ vs indexable URLs could affect website crawlability

November 17, 2021 Source

Having noindex URLs normally does not affect how Google crawls the rest of your website—unless you have a large number of noindexed pages that need to be crawled in order to reach a small number of indexable pages.

John gave the example of a website with millions of pages, 90% of which are noindexed. Because Google needs to crawl a page before it can see the noindex, it could get bogged down crawling millions of pages just to find the comparatively few indexable ones. If you have a normal ratio of indexable to noindexed URLs and the indexable ones can be discovered quickly, he doesn’t see an issue for crawlability. This is not a quality issue but a technical one, caused by the high number of URLs that need to be crawled to see what is there.


It can take years for crawling on migrated domains to be stopped completely

November 17, 2021 Source

John confirmed that it takes a very long time (even years) for Google’s systems to completely stop crawling a domain, even after its URLs have been redirected.


Speed up re-crawling of previously noindexed pages by temporarily linking to them on important pages

November 17, 2021 Source

Temporarily linking to previously noindexed URLs from important pages (such as the homepage) can speed up recrawling of those URLs if crawling has slowed down due to the earlier presence of a noindex tag. The example given was of previously noindexed product pages, and John’s suggestion was to link to them for a couple of weeks via a special product section on the homepage. Google will see the internal linking changes and then crawl the linked-to URLs; the links help to show that these pages are important relative to the rest of the website. However, he also stated that significant changes to internal linking can cause other parts of your site which are barely indexed to drop out of the index. This is why he suggests using these links only as a temporary measure to get the pages recrawled at the regular rate, before changing the linking back.

