Thursday, February 09, 2006

New robots.txt tool

The Sitemaps team just introduced a new robots.txt tool into Sitemaps. The robots.txt file is one of the easiest things for a webmaster to make a mistake on. Brett Tabke’s Search Engine World has a great robots.txt tutorial and even a robots.txt validator.

Despite good info on the web, even experts can have a hard time knowing with 100% confidence what a certain robots.txt will do. When Danny Sullivan recently asked a question about prefix matching, I had to go ask the crawl team to be completely sure. Part of the problem is that mucking around with robots.txt files is pretty rare; once you get it right, you usually never have to think about the file again. Another issue is that if you get the file wrong, it can have a large impact on your site, so most people don't mess with their robots.txt file very often. Finally, each search engine supports slightly different extra options. For example, Google permits wildcards (*) and the "Allow:" directive.

The nice thing about the robots.txt checker from the Sitemaps team is that it lets you take a robots.txt file out for a test drive and see how the real Googlebot would handle it. Want to play with wildcards to allow all files except for '*.gif'? Go for it. Want to experiment with upper vs. lower case? Go ahead (answer: case doesn't matter). Want to check whether hyphens matter for Google? Go wild (answer: we'll accept either "UserAgent" or "User-Agent", but we'll remind you that the hyphenated version is the correct one).
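For instance, here's the sort of file you might paste in to try that '*.gif' case (my own quick illustration, not output from the tool):

User-Agent: Googlebot
# Keep the whole site open...
Allow: /
# ...but block any URL whose path contains .gif
Disallow: /*.gif

Since Google lets more specific directives override more general ones, the Disallow line trumps the blanket Allow for GIF URLs. (Google also understands a '$' anchor for matching the end of a URL, so 'Disallow: /*.gif$' would catch only URLs that actually end in .gif.)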

The best part is that you can test a robots.txt file without risking anything on your live site. For example, Google permits the "Allow:" directive, and it also permits more specific directives to override more general directives. Imagine that you wanted to disallow every bot except for Googlebot. You could test out this file:

User-Agent: *
Disallow: /

User-Agent: Googlebot
Allow: /
Then you can throw in a URL like http://www.seoservicesgroup.com/ and a user agent like Googlebot and get back a red or green color-coded response:

Googlebot

Allowed by line 5: Allow: /
Detected as a directory: Specific files may have different restrictions
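You can also use the tool to watch the "more specific overrides more general" behavior within a single record. A file like this (again a hypothetical illustration of my own, file name included):

User-Agent: Googlebot
# Block the whole directory...
Disallow: /private/
# ...but carve out a single file
Allow: /private/annual-report.html

should come back red for URLs under /private/ but green for the one specifically allowed file, because the longer, more specific Allow rule wins over the shorter Disallow.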

I like that you can test out different robots.txt files without running any risk, and I like that you can see how Google’s real bot would respond as you tweak and tune it.
