Google’s Hidden Interpretation of Robots.txt

The original robots.txt syntax was straightforward: the Disallow directive was the only way to exclude pages, and each Disallow acted as a broad match, blocking any URL that began with the specified path. This seemed intuitive to most people, and for a while the world was a happy place.

A few people with large and complicated sites discovered exclusions that couldn't be expressed, so the robots.txt syntax was extended with some new features to allow finer control:

  • Allow: (override a matching Disallow)
  • * (match any number of any characters)
  • $ (ends with)
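For illustration, here is a hypothetical record using all three new features (the paths are our own invention):

User-agent: *
Disallow: /private*
Allow:    /private/overview$

The Disallow blocks every URL beginning with /private, while the Allow re-opens the single URL /private/overview; the trailing $ stops it also matching URLs such as /private/overview2.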

Unfortunately, these extensions were interpreted differently by each search engine, and the supporting documentation is thin on the ground.

Google’s documentation on the Allow directive extends to a single example combining all three features.

The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
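On our reading of that sentence, the directive would match a URL like /search? but not /search?q=robots, because characters follow the ?.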

Wikipedia’s page on robots.txt suggests that Google processes all Allow directives first and only then moves on to the Disallows.

From this limited information, we inferred the following three rules:

  1. An Allow directive would always take effect over a Disallow directive.
  2. A specific match using a $ would always beat a wildcard match using *.
  3. A directive with a * at the end would behave the same as one with nothing at the end, since all matches are broad by default.

If you thought the same as us, then prepare to be surprised. After experimenting with the robots.txt testing tool in Webmaster Tools, we found something completely different.

Example 1

Disallow: /example.html
Allow:    /example*

In example 1, Disallow beats Allow, directly contradicting rule 1.

Example 2

Disallow: */example.html*
Allow:    /example.html$

Example 2 tears rule 2 to pieces: despite the specific $ match, the Allow loses to the Disallow.

Example 3

Disallow: /example.html*
Allow:    /example.html
Disallow: /example.html
Allow:    /example.html*

And by this point it comes as no surprise that the final example rebuts rule 3: depending on the placement of the wildcard, either the Disallow beats the Allow or the Allow beats the Disallow.

It took us a while to figure this out, and it might take a minute to get your head around, but the answer is rather simple: the number of characters in the directive path is what decides an Allow against a Disallow. The rule to rule them all is as follows:

A matching Allow directive beats a matching Disallow only if it contains an equal or greater number of characters in its path.

Just to clarify, we're talking about the number of characters in the directive path after the Allow: or Disallow: statement, including any * and $ characters. For example:

Disallow: /example*      (9 characters)
Allow:    /example.htm$  (13 characters)
Allow:    /*htm$         (6 characters)

In the following example, the URL /example.htm will be disallowed because the Disallow directive contains more characters (7) than the Allow directive (6).

Allow:    /exam*
Disallow: /examp*

If you add a single character to the Allow directive, the number of characters is equal and the Allow wins. An Allow directive with equal or more characters always beats a Disallow.

Allow:    /examp*
Disallow: /examp*

This even applies to exact matches using $. In the example below, the URL /example.htm will be disallowed because the matching Disallow directive contains more characters.

Allow:    /example.htm$
Disallow: */*example*htm

Another interesting side effect is that a broad match with a * at the end becomes more powerful than one without, thanks to the extra character. In the following example, the URL /example.htm will be disallowed because the trailing * gives the Disallow directive one more character (9) than the Allow directive (8).

Allow:    /example
Disallow: /example*
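To make the rule concrete, here's a minimal Python sketch of the evaluation as we understand it from our testing. The function names are ours, and this is our reconstruction of the observed behaviour, not Google's actual code:

import re

def directive_to_regex(path):
    # Escape regex metacharacters, then translate the robots.txt wildcards:
    # '*' matches any run of characters; a trailing '$' anchors the end.
    pattern = re.escape(path).replace(r'\*', '.*')
    if pattern.endswith(r'\$'):
        pattern = pattern[:-2] + '$'
    return re.compile(pattern)

def is_allowed(url_path, rules):
    # rules is a list of ('allow' | 'disallow', path) pairs.
    best_length = -1
    allowed = True  # with no matching directive, a URL is allowed
    for kind, path in rules:
        if directive_to_regex(path).match(url_path):
            length = len(path)  # count every character, including * and $
            # A longer path wins; on a tie, Allow beats Disallow.
            if length > best_length or (length == best_length and kind == 'allow'):
                best_length = length
                allowed = (kind == 'allow')
    return allowed

# Example 1 from above: the 13-character Disallow beats the 9-character Allow.
rules = [('disallow', '/example.html'), ('allow', '/example*')]
print(is_allowed('/example.html', rules))  # False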

Top Tip #1 – Tab space your Allow and Disallow directives

Which of these would win? The directives are not lined up, so it's hard to see.

Allow: /example.htm
Disallow: /********htm

This is better. You can see they are the same length, so the Allow would win.

Allow:    /example.htm
Disallow: /********htm

Top Tip #2 – Retest a list of sample URLs every time you update the robots.txt

Use Robotto for free to monitor your robots.txt file for changes and remind you to re-test a list of sample URLs in Webmaster Tools.

Or use DeepCrawl to crawl your site in full and show you exactly what's indexable and what's disallowed, noindexed or canonicalised.
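If you want a quick sanity check outside of Webmaster Tools, you can script one. Here's a minimal sketch using Python's standard urllib.robotparser, with a hypothetical domain and sample paths; note that the stdlib parser does not replicate Google's wildcard and length-based precedence rules, so the testing tool in Webmaster Tools remains the authority:

import urllib.robotparser

# Hypothetical domain and sample paths; substitute your own.
site = 'http://www.example.com'
parser = urllib.robotparser.RobotFileParser()
parser.set_url(site + '/robots.txt')
parser.read()

for path in ['/example.html', '/example.htm', '/search?']:
    print(path, parser.can_fetch('Googlebot', site + path))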

Bonus Info

After playing around with the robots.txt testing tool for a while, we found two other interesting anomalies. These don't actually affect whether a URL is allowed or disallowed, because they only occur between competing Allow statements or competing Disallow statements of identical length. We've included them for completeness, and because they might help explain behaviour that hasn't been discovered yet. They also suggest that Google's implementation might not have been as carefully planned as one would expect, or could give a clue as to the underlying technology.

Within Allow or Disallow, a * beats a $

In this example, the second Disallow wins because it uses a * whereas the other uses a $. Both have identical numbers of characters in the directive.

Disallow: /example.htm$
Disallow: /example.htm*

Within Allow or Disallow, the highest number of non-wildcard characters wins

In this example, the first Disallow wins because it has a greater number of non-wildcard characters. Both directives are the same length overall.

Disallow: /*xample.htm
Disallow: /****ple.htm
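Both anomalies can be captured as tie-breakers that only kick in between directives of the same type and identical length. Here's a sketch of a sort key consistent with what we saw; since we never observed the two tie-breakers interact, their relative priority here is our assumption:

def tiebreak_key(path):
    # Applied only between directives of the same type and identical length.
    # 1. A directive using * beats one using $ (first anomaly).
    # 2. More non-wildcard characters wins (second anomaly).
    return ('*' in path, len(path.replace('*', '').replace('$', '')))

# First anomaly: /example.htm* outranks /example.htm$
print(max(['/example.htm$', '/example.htm*'], key=tiebreak_key))  # /example.htm*
# Second anomaly: /*xample.htm outranks /****ple.htm
print(max(['/*xample.htm', '/****ple.htm'], key=tiebreak_key))  # /*xample.htm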

Conclusion

Although the way Google handles robots.txt files allows powerful combinations that can cover almost any scenario, the behaviour is neither intuitive nor sufficiently documented, which is likely to result in a number of sites being incorrectly indexed. What do you think?

Should Google provide better documentation for the existing robots.txt behaviour or change it?
