There are several possible reasons a page may be crawled but not indexed
John explains that pages appearing as ‘crawled, not indexed’ in GSC should be relatively infrequent. The most common scenarios are that a page is crawled and Google then sees an error code, or that the page is crawled and a noindex tag is found. Alternatively, Google might choose not to index content after crawling it if it finds a duplicate of the page elsewhere. Content quality may also play a role, but Google is more likely to avoid crawling pages altogether if it believes there is a clear quality issue on the site.
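To make the first two scenarios concrete, here is a rough sketch of how you might check a fetched page for them yourself. It is not how Google decides anything; the function name, the returned strings, and the inputs (a status code, a plain dict of response headers, and the HTML body) are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a robots/googlebot meta tag contains 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            if "noindex" in attr.get("content", "").lower():
                self.noindex = True


def possible_not_indexed_reason(status_code, headers, html):
    """Hypothetical helper: return a likely reason, or None if nothing obvious.

    `headers` is assumed to be a plain dict using the exact key 'X-Robots-Tag'.
    """
    if status_code >= 400:
        return f"error status {status_code}"
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return "noindex in X-Robots-Tag header"
    parser = RobotsMetaParser()
    parser.feed(html)
    if parser.noindex:
        return "noindex in robots meta tag"
    return None
```

A duplicate-content or quality decision cannot be detected this way, so a None result only rules out the two simplest causes.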
If URLs that are blocked by robots.txt are getting indexed by Google, it may point to insufficient content on the site’s accessible pages
Why might an eCommerce site’s faceted or filtered URLs that are blocked by robots.txt (and have a canonical in place) still get indexed by Google? Would adding a noindex tag help? John replied that the noindex tag would not help in this situation, as the robots.txt block means it would not be seen by Google.
He pointed out that URLs might get indexed without content in this situation (as Google cannot crawl them with the block in robots.txt), but they would be unlikely to show up for users in the SERPs, so should not cause issues. He went on to mention that, if you do see these blocked URLs being returned for practical queries, then it can be a sign that the rest of your website is hard for Google to understand. It could mean that the visible content on your website is not sufficient for Google to understand that the normal (and accessible) pages are relevant for those queries. So he would first recommend looking into whether or not searchers are actually finding those URLs that are blocked by robots.txt. If not, then it should be fine. Otherwise, you may need to look at other parts of the website to understand why Google might be struggling to understand it.
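The reason a noindex tag cannot help here can be sketched with Python's standard-library robots.txt parser: if the crawler is not allowed to fetch the URL, it never reads the page, so it never sees the tag. The robots.txt rules, the `/filter/` path, and the example.com URLs below are hypothetical, and the standard-library parser does simple prefix matching, so it may not mirror Google's own handling of more complex patterns.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a shop that blocks its filtered URLs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /filter/
"""


def can_see_noindex(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the crawler may fetch the URL and so could read a noindex tag on it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Blocked URL: the page is never fetched, so any noindex on it goes unseen.
print(can_see_noindex(ROBOTS_TXT, "https://example.com/filter/red-shoes"))
# Accessible URL: a noindex tag here could be read.
print(can_see_noindex(ROBOTS_TXT, "https://example.com/shoes"))
```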
503s can help prevent pages dropping from the index due to technical issues
One user described seeing a loss of pages from the index after a technical issue caused their website to be down for around 14 hours. John suggests that the best way to safeguard your site against outages like this is to have a 503 rule set up and ready for when things go wrong. That way, Google will see that the issue is temporary and will come back later to check whether it’s been resolved. Returning a 404 or another error status code instead means that Google could interpret the outage as pages being removed permanently, which is why some pages drop out of the index so quickly when a site is down temporarily.
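As an illustration of what a "503 rule" might look like, here is a minimal WSGI sketch in Python. The `make_app` factory, the `maintenance` flag, and the one-hour `Retry-After` value are assumptions for this example and were not part of John's answer; in practice the rule would more likely live in your web server or CDN configuration.

```python
from wsgiref.util import setup_testing_defaults


def make_app(maintenance: bool):
    """Build a tiny WSGI app that answers 503 while `maintenance` is True."""

    def app(environ, start_response):
        if maintenance:
            body = b"Temporarily unavailable, please try again later."
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
                # Optional hint (assumed value): retry in an hour.
                ("Retry-After", "3600"),
            ])
            return [body]
        body = b"OK"
        start_response("200 OK", [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    return app


def status_of(app) -> str:
    """Call the app once with a dummy request and return its status line."""
    environ = {}
    setup_testing_defaults(environ)
    seen = {}
    app(environ, lambda status, headers: seen.update(status=status))
    return seen["status"]


print(status_of(make_app(maintenance=True)))
print(status_of(make_app(maintenance=False)))
```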
Regularly changing image URLs can impact Image Search
A question was asked about whether query strings for cache validation at the end of image URLs would impact SEO. John replied that it wouldn’t affect SEO but explained that it’s not ideal to regularly change image URLs as images are recrawled and reprocessed less frequently than normal HTML pages.
Regularly changing image URLs means that it takes Google longer to rediscover them and put them in the image index. He specifically mentioned avoiding very frequent changes to image URLs, such as adding a session ID or today’s date. In these instances, the URLs would likely change more often than Google reprocesses them, so the images would not be indexed. If Image Search is important for your website, regular image URL changes should be avoided where possible.
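One way to keep cache validation without changing URLs on every request or every day is to derive the query string from the image content, so the URL only changes when the image itself does. This is a common approach rather than something John recommended; the `v` parameter name, the 12-character digest length, and the function name are assumptions for this sketch.

```python
import hashlib


def versioned_image_url(base_url: str, image_bytes: bytes) -> str:
    """Append a short content hash so the URL is stable until the image changes."""
    digest = hashlib.sha256(image_bytes).hexdigest()[:12]
    return f"{base_url}?v={digest}"


# The same bytes always give the same URL, unlike a session ID or today's date.
url_a = versioned_image_url("https://example.com/img/shoe.jpg", b"original image bytes")
url_b = versioned_image_url("https://example.com/img/shoe.jpg", b"original image bytes")
url_c = versioned_image_url("https://example.com/img/shoe.jpg", b"edited image bytes")
print(url_a == url_b, url_a == url_c)
```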
There’s generally no SEO benefit to repurposing an old or expired domain
When asked about using old, parked domains for new sites, John clarifies that users will still need to put the work in to get the site re-established. If the domain has been out of action for some time and comes back into focus with different content, there generally won’t be any SEO benefit to gain. In the same vein, it typically doesn’t make sense to buy expired domains if you’re only doing so in the hopes of a visibility boost. The amount of work needed to establish the site would be similar to using an entirely new domain.
Best practices for canonicals on paginated pages can depend on your wider internal linking structure
John tackled one of the most common questions asked of SEOs: how should we be handling canonical attributes on paginated pages? Ultimately, it depends on the site architecture. If internal linking is strong enough across the wider site, it’s feasible to canonicalize all paginated URLs to page 1 without content dropping from the index. However, if you rely on Google crawling pages 2, 3 and so on to find all of the content you want to be crawled, make sure that paginated URLs self-canonicalize.
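The two options can be sketched as a small Python helper that builds the canonical link element for a paginated URL. The `?page=N` URL pattern, the function name, and the `self_canonical` switch are hypothetical; adapt them to however your site builds its pagination URLs.

```python
from html import escape


def canonical_tag(base_url: str, page: int, self_canonical: bool = True) -> str:
    """Return the canonical link element for page `page` of a paginated listing.

    self_canonical=True:  pages 2, 3, ... point at themselves.
    self_canonical=False: every page points at page 1 (the base URL).
    """
    if page < 1:
        raise ValueError("page must be 1 or greater")
    if page == 1 or not self_canonical:
        target = base_url
    else:
        target = f"{base_url}?page={page}"  # assumed pagination pattern
    return f'<link rel="canonical" href="{escape(target, quote=True)}">'


# Relying on pagination for discovery: each page self-canonicalizes.
print(canonical_tag("https://example.com/shoes", 3))
# Strong internal linking elsewhere: everything canonicalizes to page 1.
print(canonical_tag("https://example.com/shoes", 3, self_canonical=False))
```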
Google can only index what Googlebot sees
In response to a question about whether there are cloaking issues around showing Google different content vs. what a user would see on a more personalized page, John clarified that only what Googlebot sees is indexed. Googlebot usually crawls from the US and crawls without cookies, so whatever content is served under those conditions is what gets indexed for the website. So, on personalized pages, make sure that whatever you change for individual users is not critical to how you want to be seen in search.
Speed up re-crawling of previously noindexed pages by temporarily linking to them on important pages
Temporarily linking to previously noindexed URLs from important pages (such as the homepage) can speed up recrawling of those URLs if crawling has slowed down due to the earlier presence of a noindex tag. The example given was of previously noindexed product pages, and John’s suggestion was to link to them for a couple of weeks via a special product section on the homepage. Google will see the internal linking changes and then go and crawl those linked-to URLs. The links also help to show that these are important pages relative to the rest of the website. However, he also stated that significant changes to internal linking can cause other parts of your site which are barely indexed to drop out of the index. This is why he suggests using these links only as a temporary measure to get the URLs recrawled at the regular rate, before changing the internal linking back.
If a page is noindexed for a long period of time, crawling will slow down
Having a page set to noindex for a long time will cause Google to crawl it less often. Once the page is indexable again, crawling will pick up, but it can take time for that initial recrawl to happen. John also mentioned that Search Console reports can make the situation look worse than it actually is, and that you can use things like sitemaps and internal linking to speed up recrawling of these pages.
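As a rough illustration of the sitemap option, the sketch below builds a small XML sitemap listing pages that have just become indexable again. The URLs and dates are made up, and including a `lastmod` value is an assumption of this example rather than something John mentioned.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries) -> str:
    """Build sitemap XML from (url, lastmod) pairs, where lastmod is an ISO date string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical product pages whose noindex tag was recently removed.
reindexable = [
    ("https://example.com/product/1", "2024-01-15"),
    ("https://example.com/product/2", "2024-01-15"),
]
print(build_sitemap(reindexable))
```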