If URLs that are blocked by robots.txt are getting indexed by Google, it may point to insufficient content on the site’s accessible pages
Why might an eCommerce site’s faceted or filtered URLs that are blocked by robots.txt (and have a canonical in place) still get indexed by Google? Would adding a noindex tag help? John replied that a noindex tag would not help in this situation, as the robots.txt block means Google would never see the tag.
He pointed out that URLs might get indexed without content in this situation (as Google cannot crawl them due to the robots.txt block), but they are unlikely to be shown to users in the SERPs, so they should not cause issues. He went on to mention that if you do see these blocked URLs being returned for real queries, it can be a sign that the rest of your website is hard for Google to understand: the visible content may not be sufficient for Google to work out that the normal (and accessible) pages are relevant for those queries. So he would first recommend checking whether searchers are actually finding the URLs that are blocked by robots.txt. If not, then it should be fine; otherwise, you may need to look at other parts of the website to understand why Google might be struggling with it.
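To illustrate the situation being described, a robots.txt block like the following (the paths and parameter names are hypothetical) prevents Google from fetching the filtered URLs at all, which is why any noindex tag on those pages can never be seen:

```txt
# robots.txt — hypothetical example of blocking faceted URLs
User-agent: *
Disallow: /shop/*?color=
Disallow: /shop/*?size=
```

Blocked URLs can still be indexed without content if Google discovers links to them, which is the scenario the question describes.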
Google Only Needs to Crawl Facet Pages That Include Otherwise Unlinked Products
For eCommerce sites, if Google can access and crawl all of your products through the main category pages, it won’t need to crawl any of the facets. However, facets should be made crawlable if they contain products that aren’t linked to from anywhere else on the site.
Ensure All Product Pages Can be Crawled With Considered Use of Noindex
eCommerce sites with facets should be careful about which pages are noindexed, because this can make it difficult for Googlebot to crawl individual product pages (for example, if all category pages are noindexed). Webmasters might instead consider noindexing specific facets, or noindexing everything after a certain number of pages in a paginated set.
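As a sketch of the considered approach described above (the URL is hypothetical), a noindex directive with "follow" keeps a facet page out of the index while still letting Googlebot follow its links through to product pages:

```html
<!-- On a specific facet page, e.g. /shirts?color=red (hypothetical URL) -->
<meta name="robots" content="noindex, follow">
```

This is only appropriate on facets you want excluded; applying it to all category pages could cut off crawl paths to products, as the paragraph above warns.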
Googlebot Can Recognise Faceted Navigation & Slow Down Crawling
Googlebot understands URL structures well and can recognise faceted navigation; it will slow down crawling once it works out where the primary content is and where it has strayed from it. This is aided by parameter handling in Google Search Console.
Canonicalization For Filter Results Pages Isn’t Recommended
Canonicalization shouldn’t be used for filter pages. This is because canonical tags can be ignored by Google, and filter pages aren’t true duplicates of one another, as they return different sets of results.
Canonicalise Faceted Pages to the Non-filtered Version
Google recommends allowing faceted pages to be crawled and canonicalising them to the non-filtered version of the page, rather than blocking them with robots.txt.
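A minimal sketch of that recommendation, with hypothetical URLs: the faceted URL stays crawlable (no robots.txt block) and points its canonical at the unfiltered category page.

```html
<!-- On /shirts?color=red&sort=price (hypothetical faceted URL) -->
<link rel="canonical" href="https://www.example.com/shirts">
```

Note that rel="canonical" is a hint rather than a directive, so Google may not always honour it, as mentioned elsewhere in this summary.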
Indexable Product Variations Should Reflect Search Behaviour
Variations of pages that people are actually searching for should be made indexable; otherwise, the variations should be folded together.
Prevent Excessive Crawling on Filters, Sort Orders and Pagination with Nofollow
Add nofollow to links pointing to filtered, sorted and paginated results pages to prevent excessive crawling.
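As an illustration of the advice above (link targets and anchor text are hypothetical), the rel="nofollow" attribute is applied to the individual facet, sort and pagination links in a category template:

```html
<!-- Hypothetical facet/sort/pagination links in a category template -->
<a href="/shirts?color=red" rel="nofollow">Red</a>
<a href="/shirts?sort=price" rel="nofollow">Sort by price</a>
<a href="/shirts?page=2" rel="nofollow">Next page</a>
```

Note that Google treats nofollow as a hint for crawling and indexing, so this reduces rather than guarantees the prevention of crawling of these URLs.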
Use Noindex or Canonical on Faceted URLs Instead of Disallow
John recommends against using a robots.txt disallow to prevent facet URLs from being crawled, as they may still be indexed. Instead, allow them to be crawled and use a noindex or canonical tag, unless crawling them is causing a server performance issue.
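One way to apply noindex at scale to crawlable facet URLs, without touching individual page templates, is the X-Robots-Tag response header. The snippet below is a hypothetical sketch for Apache (it assumes mod_headers is enabled and that filters are identified by a "filter" query parameter):

```txt
# Hypothetical Apache config: noindex any URL carrying a "filter" parameter
<If "%{QUERY_STRING} =~ /filter=/">
    Header set X-Robots-Tag "noindex"
</If>
```

Because the URLs are not disallowed in robots.txt, Googlebot can still fetch them and see the noindex, which is the behaviour the recommendation above relies on.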