Yahoo! Labs

  • Download Dataset

  • You must have a Yahoo! account, register with the competition and complete a data sharing agreement before you can download the data.


The datasets come from web search ranking and are a subset of what Yahoo! uses to train its ranking function. They consist of feature vectors extracted from query-url pairs, along with relevance judgments. The relevance judgments can take 5 different values, from 0 (irrelevant) to 4 (perfectly relevant). The queries, urls and feature descriptions are not disclosed; only the feature values are. There are two datasets for this challenge, each corresponding to a different country: a large one (labeled set1) and a small one (labeled set2). The two datasets are related but differ to some extent. Each dataset is divided into 3 sets: training, validation, and test.

The statistics for the various sets are as follows:

              Set 1                           Set 2
              Train     Val      Test         Train    Val      Test
# queries     19,944    2,994    6,983        1,266    1,266    3,798
# urls        473,134   71,083   165,660      34,815   34,881   103,174
# features    519                             596

There are 700 features in total. Some are defined only in set1 or only in set2, while others are defined in both sets. When a feature is undefined for a set, its value is 0. All the features have been normalized to be in the [0,1] range.
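
The zero-for-undefined convention means a sparse record can be expanded into a fixed-length vector without loss. The following is a minimal sketch, not part of the official tooling: the helper name `to_dense` is hypothetical, and it assumes feature indices are 1-based and run from 1 to 700.

```python
NUM_FEATURES = 700  # total stated above; 1-based indexing is an assumption


def to_dense(sparse, num_features=NUM_FEATURES):
    """Expand {feature_index: value} into a dense list; absent features stay 0.0."""
    dense = [0.0] * num_features
    for index, value in sparse.items():
        if not 1 <= index <= num_features:
            raise ValueError(f"feature index {index} out of range")
        dense[index - 1] = value  # shift from 1-based index to 0-based position
    return dense
```

Because every feature lies in [0,1] and undefined features are 0, a dense vector built this way has the same layout for set1 and set2, which may be convenient when training on both.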

Relevance labels on the validation and test sets are hidden and replaced by -1.


The format for each of the 6 files is the same as the one used in SVMLight:

<line> .=. <relevance> qid:<qid> <feature>:<value> ... <feature>:<value> 
<relevance> .=. <integer>
<qid> .=. <positive integer>
<feature> .=. <positive integer>
<value> .=. <float>
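
The grammar above can be read with a few lines of code. The sketch below is illustrative only: the function name `parse_line` and the sample line are made up, and it assumes whitespace-separated tokens with no trailing comments and no blank lines.

```python
def parse_line(line):
    """Parse one '<relevance> qid:<qid> <feature>:<value> ...' line."""
    tokens = line.split()
    relevance = int(tokens[0])  # 0..4, or -1 when the label is hidden
    key, _, qid_text = tokens[1].partition(":")
    if key != "qid":
        raise ValueError(f"expected 'qid:<qid>', got {tokens[1]!r}")
    features = {}
    for token in tokens[2:]:
        index, _, value = token.partition(":")
        features[int(index)] = float(value)
    return relevance, int(qid_text), features


# Hypothetical sample line, not taken from the dataset.
relevance, qid, features = parse_line("2 qid:1 10:0.89 11:0.75")
```

Features that do not appear on a line are simply absent from the returned dictionary. Grouping the parsed rows by `qid` recovers the list of urls judged for each query.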