The three words above might sound like SEO gobbledegook, but they’re words worth knowing, since understanding how to use them means you can order Googlebot around. Which is fun.
So let’s start with the basics: there are three ways to control which parts of your site search engines will crawl:
- Noindex: tells search engines not to include your page(s) in search results.
- Disallow: tells them not to crawl your page(s).
- Nofollow: tells them not to follow the links on your page.
Noindex in pages
A ‘noindex’ tag tells search engines not to include the page in search results.
The most common method of noindexing a page is to add a tag in the head section of the HTML, or in the response headers. To allow search engines to see this information, the page must not already be blocked (disallowed) in a robots.txt file. If the page is blocked via your robots.txt file, Google will never see the noindex tag and the page might still appear in search results.
To tell search engines not to index your page, add the following to the <head> section:
<meta name="robots" content="noindex, follow">
The second part of the content attribute here indicates that all the links on this page should be followed, which we’ll discuss below.
Alternatively, the noindex directive can be sent as an X-Robots-Tag in the HTTP response header:
X-Robots-Tag: noindex
For more information see Google Developers’ post on Robots meta tag and X-Robots-Tag HTTP header specifications.
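If you want to check pages for noindex yourself, the two methods described above (the meta tag and the X-Robots-Tag header) can be read with a few lines of Python. This is only a rough sketch using the standard library: the function names are made up for this example, and it assumes you already have the page's HTML and response headers to hand.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html, headers=None):
    """Return every robots directive found in the HTML or the headers.

    Simplification: header values scoped to one crawler
    (e.g. "googlebot: noindex") are not interpreted here.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    found = set(parser.directives)
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            found.update(p.strip().lower() for p in value.split(",") if p.strip())
    return found


def is_noindexed(html, headers=None):
    """True if either the meta tag or the header carries noindex."""
    return "noindex" in robots_directives(html, headers)
```

For example, is_noindexed('<meta name="robots" content="noindex, follow">') is true, and so is is_noindexed("", {"X-Robots-Tag": "noindex"}).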
Noindex in robots.txt
A ‘noindex’ tag in your robots.txt file also tells search engines not to include the page in search results, but is a quicker and easier way to noindex lots of pages at once, especially if you have access to your robots.txt file. For example, you could noindex any URLs in a specific folder.
Here’s an example of a noindex directive that could be placed in the robots.txt file (the folder name is a placeholder):
User-agent: *
Noindex: /example-folder/
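However, Google advise against using this method: John Mueller has stated that ‘you shouldn’t rely on it’, and Google has since announced that it stopped supporting the noindex directive in robots.txt on 1 September 2019.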
Disallowing a page means you’re telling search engines not to crawl it, which must be done in the robots.txt file of your site. It’s useful if you have lots of pages or files that are of no use to readers or search traffic, as it means search engines won’t waste time crawling those pages.
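To add a disallow, add a Disallow rule beneath the relevant User-agent line in your robots.txt file, for example (the path is a placeholder):
User-agent: *
Disallow: /example-folder/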
If the page has external links or canonical tags pointing to it, it could still be indexed and ranked, so it’s important to combine a disallow with a noindex tag, as described below.
A word of caution: by disallowing a page you’re effectively removing it from your site.
Disallowed pages cannot pass PageRank to anywhere else – so any links on those pages are effectively useless from an SEO perspective – and disallowing pages that are supposed to be included can have disastrous results for your traffic, so be extra careful when writing disallow directives.
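Because a careless disallow can be so costly, it's worth checking a rule against a few URLs before it goes live. The sketch below uses Python's built-in urllib.robotparser. The robots.txt content, the example.com URLs and the is_disallowed helper are all invented for illustration, and the standard library parser does not match rules exactly as Googlebot does in edge cases, so treat it as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_disallowed(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text blocks user_agent from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/private/report.html",
        "https://www.example.com/blog/",
    ):
        print(url, "blocked" if is_disallowed(ROBOTS_TXT, url) else "crawlable")
```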
Combining noindex and disallow
Noindex (page) + Disallow: Disallow can’t be combined with noindex on the page, because the page is blocked and therefore search engines won’t crawl it to find out that they’re supposed to leave the page out of the index.
Noindex (robots.txt) + Disallow: This prevents pages appearing in the index, and also prevents the pages being crawled. However, remember that no PageRank can pass through this page.
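To combine a disallow with a noindex in your robots.txt, add both directives to your robots.txt file, for example (the path is a placeholder):
User-agent: *
Disallow: /example-folder/
Noindex: /example-folder/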
A nofollow tag on a link tells search engines not to use a link to decide on the importance of the linked pages (PageRank) or discover more URLs within the same site.
Common uses for nofollows include links in comments and other content that you don’t control, paid links, embeds such as widgets or infographics, links in guest posts, or anything off-topic that you still want to link people to.
Historically SEOs have also selectively nofollowed links, to funnel internal PageRank to more important pages.
Nofollow tags can be added in one of two places:
- The <head> of the page (to nofollow all links on that page): <meta name="robots" content="nofollow" />
- The link code (to nofollow an individual link): <a href="example.html" rel="nofollow">example page</a>
A nofollow won’t prevent the linked page from being crawled completely; it just prevents it being crawled through that specific link. Our own tests, and others, have shown that Google will not crawl a URL which it finds in a nofollowed link.
Google state that if another site links to the same page without using a nofollow tag or the page appears in a Sitemap, the page might still appear in search results. Similarly, if it’s a URL that search engines already know about, adding a nofollow link won’t remove it from the index.
To prevent the page from being indexed, you’ll also need to noindex the page. To stop Google crawling the page completely, you should also disallow it (see above).
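To see which links on a page are actually left open to crawlers, you can list every link and drop the ones covered by either kind of nofollow. Here is a minimal sketch with Python's standard HTML parser. The class and function names are hypothetical, and it only looks at the meta robots tag and the rel attribute, nothing else.

```python
from html.parser import HTMLParser


class LinkAuditParser(HTMLParser):
    """Records each link and whether it, or the whole page, is nofollowed."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False
        self.links = []  # list of (href, is_nofollow) tuples

    def handle_starttag(self, tag, attrs):
        attr_map = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr_map.get("name", "").lower() == "robots":
            tokens = {t.strip().lower() for t in attr_map.get("content", "").split(",")}
            if "nofollow" in tokens:
                self.page_nofollow = True
        elif tag == "a" and "href" in attr_map:
            rel_tokens = attr_map.get("rel", "").lower().split()
            self.links.append((attr_map["href"], "nofollow" in rel_tokens))


def followed_links(html):
    """Return the hrefs that are not covered by a page-level or link-level nofollow."""
    parser = LinkAuditParser()
    parser.feed(html)
    if parser.page_nofollow:
        return []
    return [href for href, nofollow in parser.links if not nofollow]
```

Run against a page containing one plain link and one rel="nofollow" link, followed_links returns only the plain one. With a meta robots nofollow in the head, it returns an empty list.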
For more information on when to use a nofollow, see our Kick-Ass Outbound Link Audit.
There are other ways to tell Google and other search engines how to treat URLs:
- Canonical tags tell search engines which page from a group of similar pages should be indexed. Canonicalized pages (i.e. secondary pages that direct search engines toward a primary version) are not included in the index. If you have separate mobile and desktop sites, you are supposed to canonicalize your mobile URLs to your desktop ones.
- Pagination groups multiple pages together so that search engines know they are part of a set. Search engines should prioritize page one of each set when ranking pages, but all pages within the set will stay in the index.
- Hreflang tells search engines which international versions of the same content are for which region, so that they can prioritize the correct version for each audience. All of these versions will stay in the index.
How much time should you spend on reducing crawl budget?
You might hear a lot of talk on SEO forums about how important crawl efficiency and crawl budget are for SEO. It’s common practice to disallow and noindex large groups of pages that have no benefit to search engines or readers (for example, back-end code that is only used for the running of the site, or some types of duplicate content). Deciding whether to hide lots of individual pages, however, is probably not the best use of time and effort.
Google likes to index as many URLs as possible, so, unless there is a specific reason to hide a page from search engines, it’s usually OK to leave the decision up to Google. In any case, even if you hide pages from search engines, Google will still keep checking to see if those URLs have changed. This is especially pertinent if there are links pointing to that page; even if Google has forgotten about the URL, it might re-discover it the next time a link is found to it anyway.
Testing using Search Console, DeepCrawl and Robotto
Test robots.txt using Search Console
The robots.txt Tester tool in Search Console (under Crawl) is a popular and largely effective way to check a new version of your file for any errors before it goes live, or test a specific URL to see whether it’s blocked.
However, this tool doesn’t behave exactly as Google does: there are subtle differences in how it handles conflicting Allow and Disallow rules of the same length.
The robots.txt testing tool reports these as Allowed. However, Google has said: ‘If the outcome is undefined, robots.txt evaluators may choose to either allow or disallow crawling. Because of that, it’s not recommended to rely on either outcome being used across the board.’
For more detail, read this Webmaster Central Help Forum discussion.
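To make the same-length conflict concrete, here is a simplified sketch of longest-match rule evaluation. It assumes plain prefix matching only (no * or $ wildcards) and deliberately reports a tie as "undefined", in line with the statement quoted above. The evaluate function is invented for this example and is not how any particular crawler is implemented.

```python
def evaluate(rules, path):
    """Decide the outcome for path from (directive, prefix) rules.

    The longest matching prefix wins. If an Allow and a Disallow rule of
    the same length both match, the outcome is reported as "undefined".
    """
    best_len = -1
    verdicts = set()
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules have no effect
        if len(prefix) > best_len:
            best_len = len(prefix)
            verdicts = {directive.lower()}
        elif len(prefix) == best_len:
            verdicts.add(directive.lower())
    if not verdicts:
        return "allowed"  # no rule matched, so crawling is permitted
    if len(verdicts) > 1:
        return "undefined"
    return "allowed" if verdicts == {"allow"} else "disallowed"


if __name__ == "__main__":
    conflicting = [("Disallow", "/page"), ("Allow", "/page")]
    print(evaluate(conflicting, "/page.html"))
```

The two conflicting rules above have the same length, so the sketch prints "undefined" rather than guessing. That is the case where you shouldn't rely on any one tool's answer.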
Find all non-indexable pages using DeepCrawl
Run a Universal crawl without any restrictions (but with the robots.txt conditions applied) to allow DeepCrawl to return all of your URLs and show you all indexable/non-indexable pages.
If you have URL parameters that have been blocked from Googlebot using Search Console, you can mimic this setup for your crawl using the Remove Parameters field under Advanced Settings > URL Rewriting.
You can then use the following reports to check that the site is set up as you’d expect on your first crawl, and then combine them with the built-in change logs on subsequent crawls.
Indexation > Noindex Pages
This report will show you all pages that contain a noindex tag in the meta information, HTTP header or robots.txt file.
Indexation > Disallowed URLs
This report contains all URLs that can’t be crawled because of a disallow rule in the robots.txt file. There are figures for both of these reports in the dashboard of your report.
Use the Matches / Does not match fields at the top of each report to check particular folders and spot patterns in URLs that you might otherwise miss.
Test a new robots.txt file using DeepCrawl
Use DeepCrawl’s Robots.txt Overwrite function in Advanced Settings to replace the live file with a custom one.
You can then use your test version instead of the live version next time you start a crawl.
The Added and Removed Disallowed URLs reports will then show exactly which URLs were affected by the changed robots.txt, making evaluation very simple.
For a complete guide to managing robots.txt changes with DeepCrawl, read our previous guide.
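The idea behind comparing a live and a test robots.txt can also be tried on a small list of URLs. The sketch below is not how DeepCrawl works internally; it is a rough standard-library approximation, and the helper names and sample URLs are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser


def disallowed_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the subset of urls that robots_txt blocks for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url for url in urls if not parser.can_fetch(user_agent, url)}


def compare_robots(live_txt, test_txt, urls, user_agent="Googlebot"):
    """List URLs that become blocked ("added") or unblocked ("removed")."""
    live = disallowed_urls(live_txt, urls, user_agent)
    test = disallowed_urls(test_txt, urls, user_agent)
    return {"added": sorted(test - live), "removed": sorted(live - test)}


if __name__ == "__main__":
    live = "User-agent: *\nDisallow: /tmp/\n"
    test = "User-agent: *\nDisallow: /search/\n"
    urls = [
        "https://www.example.com/tmp/a.html",
        "https://www.example.com/search/?q=shoes",
        "https://www.example.com/blog/",
    ]
    print(compare_robots(live, test, urls))
```

In this run the /search/ URL is newly disallowed and the /tmp/ URL is no longer blocked, while the blog URL is unaffected.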
Monitor robots.txt file changes using Robotto
Keep an eye on any changes made to your robots.txt file or HTTP responses using Robotto, which will automatically run checks on your site and record when changes are made, as well as letting you know when it finds any potential issues.
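If you'd like to understand what this kind of monitoring involves, the core of it is comparing the previous copy of the file with the current one. The following is a bare-bones sketch using Python's difflib. It is not Robotto's method, the robots_changes name is made up, and fetching the file and scheduling the checks are left out.

```python
import difflib


def robots_changes(previous, current):
    """Return the lines removed ("- ") or added ("+ ") between two versions."""
    diff = difflib.ndiff(previous.splitlines(), current.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; skip those.
    return [line for line in diff if line.startswith(("+ ", "- "))]


if __name__ == "__main__":
    before = "User-agent: *\nDisallow: /tmp/\n"
    after = "User-agent: *\nDisallow: /\n"
    for change in robots_changes(before, after):
        print(change)
```

A change from Disallow: /tmp/ to Disallow: / is exactly the sort of small edit that can take a whole site out of the crawl, so it's the kind of difference worth being alerted to.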