
November 02, 2006

Yahoo! Search Crawler (Yahoo! Slurp) - Supporting wildcards in robots.txt

I was going through my notes from Danny Sullivan's Open Feedback sessions that take place during the "Meet the Crawlers" panel at Search Engine Strategies. One of the items on my list was a request for enhanced syntax in robots.txt to make it easier for webmasters to manage how search crawlers, including Slurp, access their content.

For those who may not be as familiar with search index terminology, webmasters use the robots.txt file to tell robots that visit their site, including search engine crawlers, which files should be crawled and which shouldn't. You can read about our support for robots directives in the help for Yahoo! Slurp.

Well, we can scratch that one off the list, since we have just updated Yahoo! Slurp to recognize two additional symbols in robots.txt directives: '*' and '$'. The semantics of these symbols match what is widely understood for robots.txt files.

'*' - matches a sequence of characters

You can now use '*' in robots directives for Yahoo! Slurp to wildcard match a sequence of characters in your URL. You can use this symbol in any part of the URL string you provide in the robots directive. For example,

User-Agent: Yahoo! Slurp
Allow: /public*/
Disallow: /*_print*.html
Disallow: /*?sessionid

The robots directives above will:


  • allow any directory whose name begins with 'public', such as '/public_html/' or '/public_graphs/', to be crawled
  • disallow any file or directory whose path contains '_print', such as '/card_print.html' or '/store_print/product.html', from being crawled
  • disallow any file with '?sessionid' in its URL string, such as '/cart.php?sessionid=342bca31', from being crawled
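
To make the '*' matching concrete, here is a minimal sketch in Python (purely illustrative, not Slurp's actual implementation; the function name is made up) that turns a wildcard directive into a regular expression and checks it against the example URLs above:

import re

def directive_matches(pattern, url):
    # Illustrative only: each '*' becomes '.*', everything else is
    # escaped literally, and the pattern is matched as a URL prefix.
    regex = ''.join('.*' if ch == '*' else re.escape(ch) for ch in pattern)
    return re.match(regex, url) is not None

print(directive_matches('/public*/', '/public_html/index.html'))          # True
print(directive_matches('/*_print*.html', '/card_print.html'))            # True
print(directive_matches('/*?sessionid', '/cart.php?sessionid=342bca31'))  # True
print(directive_matches('/*_print*.html', '/cards.html'))                 # False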

Note that a trailing '*' is redundant, since Slurp already treats directives as prefix matches. So, the following two directives are equivalent:

User-Agent: Yahoo! Slurp
Disallow: /private*
Disallow: /private
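
With the illustrative sketch above, both forms give the same result for any URL, since matching is already done by prefix:

print(directive_matches('/private*', '/private/data.html'))  # True
print(directive_matches('/private', '/private/data.html'))   # True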

'$' - anchors the match at the end of the URL string

You can now also use '$' in robots directives for Slurp to anchor the match to the end of the URL string. Without this symbol, Yahoo! Slurp treats each directive as a prefix and matches URLs against it accordingly. For example:

User-Agent: Yahoo! Slurp
Disallow: /*.gif$
Allow: /*?$

The robots directives above will:


  • disallow all files ending in '.gif' anywhere on your site from being crawled. Note that without the '$', this directive would also disallow any file merely containing '.gif' in its file path
  • allow all files whose URLs end in '?' to be crawled. This does not automatically allow files that just contain '?' somewhere in the URL string

As you can see, this symbol only makes sense at the end of the pattern. Hence, when we see it, we assume that your directive terminates there, and any characters after that symbol are ignored.
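
Continuing the illustrative sketch from above (again, a rough approximation rather than Slurp's actual code), handling '$' just means anchoring the regular expression at the end of the URL instead of treating the pattern as a prefix:

def directive_matches_anchored(pattern, url):
    # A trailing '$' in the pattern anchors the match at the end of the
    # URL; otherwise the pattern is treated as a prefix, as before.
    anchored = pattern.endswith('$')
    if anchored:
        pattern = pattern[:-1]
    regex = ''.join('.*' if ch == '*' else re.escape(ch) for ch in pattern)
    if anchored:
        regex += '$'
    return re.match(regex, url) is not None

print(directive_matches_anchored('/*.gif$', '/images/logo.gif'))        # True
print(directive_matches_anchored('/*.gif$', '/logo.gif?sessionid=1'))   # False
print(directive_matches_anchored('/*?$', '/cart.php?'))                 # True
print(directive_matches_anchored('/*?$', '/cart.php?sessionid=1'))      # False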

Oh, by the way, if you thought we didn't support the 'Allow' directive, as you can see from these examples, we do.

If you have any questions about the new syntax or any particular cases you are concerned about, please write in at the Site Explorer forums or read up on our help area.

Next time you see me at SES, you should ask me what else is on my list!

Priyank Garg
Product Manager, Yahoo! Search

Comments

Hi,

I am surprised to see:
User-Agent: Yahoo! Slurp

It used to be:
User-Agent: Slurp

Are both versions of the user-agent understood by all Yahoo! bots?

Jean-Luc

I think it'll be very helpful for webmasters to be able to use wildcards, but other search engines should also follow suit if webmasters are really going to use them.

Yes, 'Slurp' also works as the user-agent. Essentially, we look for the presence of 'Slurp' in the user-agent string. Please refer to:

http://help.yahoo.com/help/us/ysearch/slurp/slurp-02.html

I agree with Mike and Jean-Luc above. It would be much more helpful if ALL the search engines followed these directives.

Or you at least need to add the capability to test a robots.txt file against Yahoo! directives in Site Explorer (like Google does for its interpretation of robots.txt in its "Webmaster Tools").

There is already confusion about what works and what doesn't (as Jean-Luc points out), so how else is a webmaster to be certain that changes will work?

This will simplify our work of writing the robots.txt file.

Hi
Yes, Brian, it will really be helpful if we can get the robots.txt approved by some Yahoo! authority.
What I get from the update is:

User-Agent: Yahoo! Slurp
Disallow: /*.gif$

using this will restrict this URL:
/public/images/xyz.gif

and allow
/public/images/xyz.gif?sessid=1234asd..

Please let me know if this is what is intended.
And what if I would like to disallow all files in a folder except one?

User-Agent: Yahoo! Slurp
Disallow: /images
Allow: /images/xyz.gif
Just out of curiosity :)

Google also supports the wildcard characters *, ?, and $.

You wrote: "Oh, by the way, if you thought we didn't support the 'Allow' tag, as you can see from these examples, we do."

We need to know the priority rules that apply when "Allow:" and "Disallow:" directives cover the same URLs (as far as I know, this is not standardized).

Example:
User-agent: Slurp
Allow: /*.html$
Disallow: /backup/

Will /backup/yes_or_not.html be spidered?

Regarding these priority rules, there is one opinion here ( http://www.conman.org/people/spc/robots2.html#format.directives.allow ) and a conflicting opinion here ( http://books.google.com/webmasters/bot.html#robots ). What about Yahoo!'s implementation?

Jean-Luc

Hello

So if we use '/*?', will this exclude all URLs with query strings?

Pete

What about a link from another site that contains tracking code? Even though the robots.txt file tells the spider not to index the page, the visit is not coming from the site initially. If it is the arrival page, are the rules of the robots.txt file still applied, disallowing what is already recorded?