The Author
Sam Marsden

Sam Marsden is DeepCrawl's SEO & Content Manager. Sam speaks regularly at marketing conferences, like BrightonSEO, and is a contributor to industry publications such as Search Engine Journal and State of Digital.


Common Robots.txt Mistakes and How to Avoid Them

Robots.txt is a critical tool in an SEO’s arsenal: it establishes rules that tell crawlers and robots which sections of a site should and shouldn’t be crawled. However, when it comes to editing a robots.txt file, we need to remember that with great power comes great responsibility. Even a small mistake could block search engines from crawling an entire site and cause it to drop out of search results.

Given how important it is that a site’s robots.txt file is set up correctly, I quizzed our Professional Services team to uncover some common mistakes you’ll want to avoid so you can ensure search engines and other bots can crawl the pages that you want them to.

1. Not repeating general user-agent directives in specific user-agent blocks

Search engine bots will adhere to the closest matching user-agent block in a robots.txt file and other user-agent blocks will be ignored.

In this example, Googlebot would only follow the single rule specifically stated for Googlebot, and ignore the rest.

User-agent: *
Disallow: /something1
Disallow: /something2
Disallow: /something3

User-agent: Googlebot
Disallow: /something-else

Given this, it is important to repeat general user-agent directives that apply to more specific bots when adding rules for them too.
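As a quick way to see this behaviour, here is a minimal sketch using the robots.txt parser from Python's standard library (urllib.robotparser). The host www.example.com, the bot name OtherBot and the helper allowed() are assumptions made for illustration. This parser is not Google's, and it differs from Google in other respects, but for this example it picks the user-agent block in the same way.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, unchanged.
ROBOTS_TXT = """\
User-agent: *
Disallow: /something1
Disallow: /something2
Disallow: /something3

User-agent: Googlebot
Disallow: /something-else
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Hypothetical helper: can `agent` fetch `path` on the example host?"""
    return parser.can_fetch(agent, "https://www.example.com" + path)


# Googlebot only follows its own block, so the general rules do not apply.
print(allowed("Googlebot", "/something1"))      # True
print(allowed("Googlebot", "/something-else"))  # False
# A bot without its own block falls under the * block.
print(allowed("OtherBot", "/something1"))       # False
```

To keep /something1 blocked for Googlebot as well, the three general Disallow lines would need to be repeated inside the Googlebot block.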

2. Forgetting that the longest matching rule wins

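When an allow rule and a disallow rule both match a URL, the rule with the longer path (the greater number of characters) wins. An allow rule will therefore only override a disallow rule if it is the longer of the two.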

For example:

Disallow: /somewords
Allow: /someword

In the above example, example.com/somewords will be disallowed, as there are more matching characters in the disallow rule.

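However, you can work around this by adding a wildcard (*) character to lengthen the allow rule. In the example below both rules are the same length, and when matching rules tie Google applies the least restrictive one, so the allow rule wins.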

Disallow: /somewords
Allow: /someword*
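The two examples above can be checked with a small sketch of longest-match evaluation. This is a simplified model rather than Google's implementation: the function names are hypothetical, only the * and $ special characters are handled, and a tie between equally long rules is resolved in favour of allow.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end
    of the URL path. Everything else is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, pattern) pairs. The longest matching
    pattern wins; on a tie, allow beats disallow because True > False.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


shorter_allow = [("Disallow", "/somewords"), ("Allow", "/someword")]
padded_allow = [("Disallow", "/somewords"), ("Allow", "/someword*")]

print(is_allowed(shorter_allow, "/somewords"))  # False: disallow is longer
print(is_allowed(padded_allow, "/somewords"))   # True: same length, allow wins
```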

3. Adding wildcards to the end of rules

Rules in robots.txt are broad matching at the end by default, so the * wildcard character doesn’t need to be added to the end of a rule unless you’re using it to make that rule the longest match. While a trailing wildcard doesn’t usually cause any problems, it may cause you to lose the respect of colleagues and family members.

Disallow: /somewords*

4. Not using separate rules for each subdomain and protocol

Robots.txt files should avoid including rules spanning different subdomains and protocols. Each subdomain and protocol on a domain requires its own separate robots.txt file. For example, separate robots.txt files should exist for https://www.example.com, http://www.example.com as well as subdomain.example.com.
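To make the point concrete, here is a minimal sketch that works out which robots.txt file governs a given URL. The function name robots_txt_url and the page paths are assumptions for illustration, as is the https scheme on the subdomain.

```python
from urllib.parse import urlsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the protocol and host of `page_url`."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# Three different protocol and host combinations, three different files.
for url in (
    "https://www.example.com/blog/post",
    "http://www.example.com/blog/post",
    "https://subdomain.example.com/blog/post",
):
    print(robots_txt_url(url))
```

Rules placed in one of these files have no effect on URLs served from either of the other two.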

5. Including relative sitemap directive URLs

In a robots.txt file, sitemaps cannot be indicated using a relative path; the URL must be absolute. For example, /sitemap.xml would not be respected but https://www.example.com/sitemap.xml would.

Sitemap: /sitemap.xml

Sitemap: https://www.example.com/sitemap.xml
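A check like this is easy to automate. The sketch below flags Sitemap lines whose value is not an absolute http or https URL. The function name relative_sitemaps is hypothetical, and the check is deliberately simple: it does not validate that the sitemap actually exists.

```python
from urllib.parse import urlparse


def relative_sitemaps(robots_txt):
    """Return the Sitemap values in `robots_txt` that are not absolute URLs."""
    problems = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons stay intact.
        field, _, value = line.partition(":")
        if field.strip().lower() != "sitemap":
            continue
        value = value.strip()
        parsed = urlparse(value)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(value)
    return problems


example = """\
Sitemap: /sitemap.xml
Sitemap: https://www.example.com/sitemap.xml
"""
print(relative_sitemaps(example))  # ['/sitemap.xml']
```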

6. Ignoring case sensitivity

Matching rules in robots.txt are case sensitive, which means you will need to implement multiple rules in order to match different cases.

Disallow: /something
Disallow: /Something

7. Adding a non-existent trailing slash

Make sure not to add a trailing slash to a rule in robots.txt when the URL doesn’t have one as it won’t be matched. For example, disallowing /path/ when the actual URL is /path will mean that www.example.com/path will not be matched and disallowed.

Disallow: /path/

Disallow: /path

8. Not starting a disallow rule with a slash

If you’re specifying a root path in the robots.txt file, you should start rules with a slash, not a wildcard, to avoid the risk of accidentally disallowing a deeper path.

This rule would only disallow URLs whose path starts with /something, e.g. www.example.com/something.

Disallow: /something

This rule would disallow every URL which contains ‘something’, e.g. www.example.com/stuff/something-else.

Disallow: *something
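The difference between the two rules can be seen by running both against a handful of paths. This is a simplified sketch: rule_matches is a hypothetical helper that only understands the * wildcard, and the sample paths are made up for illustration.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches `path`.

    Matching starts at the beginning of the path and is open at the end;
    '*' stands for any run of characters.
    """
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.match(regex, path) is not None


paths = ["/something", "/something-else", "/stuff/something-else"]

for rule in ("/something", "*something"):
    blocked = [p for p in paths if rule_matches(rule, p)]
    print(rule, "->", blocked)
# /something only matches the first two paths; *something matches all three.
```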

9. Forgetting Googlebot user agents can fall back to more generic user agent tokens

Googlebot user agents will fall back to the more generic user agent token if there are no specific blocks included for that particular one. For example, googlebot-news will fall back to googlebot if there are no specific blocks for googlebot-news.

Google has published a full list of which user agent tokens apply to which crawlers.
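The fallback can be pictured as walking a chain of tokens from most to least specific and stopping at the first one that has a block in the file. The sketch below assumes the chain googlebot-news, googlebot, * for illustration; the function name pick_group is hypothetical, and the real chains for each crawler should be taken from Google's published list.

```python
def pick_group(group_tokens, fallback_chain):
    """Return the first token in `fallback_chain` that has a block in the file.

    `group_tokens` are the user-agent tokens that appear in robots.txt.
    Tokens are compared case-insensitively. Returns None if nothing matches.
    """
    present = {token.lower() for token in group_tokens}
    for token in fallback_chain:
        if token.lower() in present:
            return token
    return None


# Assumed chain for the news crawler, most specific first.
chain = ["googlebot-news", "googlebot", "*"]

print(pick_group(["*", "Googlebot"], chain))                    # googlebot
print(pick_group(["*", "Googlebot", "Googlebot-News"], chain))  # googlebot-news
print(pick_group(["*"], chain))                                 # *
```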

10. Matching encoded URLs to unencoded rules

At least according to the robots.txt testing tool, encoded URLs will match unencoded rules, but unencoded URLs will not match encoded rules. To be safe, keep your rules unencoded.
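One way to act on this is to scan a robots.txt file for rules that contain percent-encoded characters. The sketch below is a simple lint under stated assumptions: the function name find_encoded_rules and the sample paths are hypothetical, and the decoded form it returns is only a suggestion to review, since decoding reserved characters can change what a rule means.

```python
import re
from urllib.parse import unquote

# A percent sign followed by two hex digits, e.g. %C3 or %20.
PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def find_encoded_rules(robots_txt):
    """Return (line number, rule path, decoded path) for encoded rules."""
    findings = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        field, _, value = line.partition(":")
        if field.strip().lower() not in ("allow", "disallow"):
            continue
        path = value.strip()
        if PERCENT_ESCAPE.search(path):
            findings.append((number, path, unquote(path)))
    return findings


example = """\
User-agent: *
Disallow: /caf%C3%A9
Disallow: /plain
"""
print(find_encoded_rules(example))  # [(2, '/caf%C3%A9', '/café')]
```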

Even if you’re an experienced SEO, we hope that the above has unearthed some points about the Robots Exclusion Standard that you didn’t know previously. If you’re interested in learning more about this topic, you might like our introductory guide to robots.txt, or our post on the best-kept secret in SEO: Robots.txt Noindex.
