
Task and Datasets

Task description

The task of the ‘Internet Mathematics 2009’ contest is to obtain a document ranking formula using machine learning methods. Real data – feature vectors of query-document pairs and relevance judgments made by Yandex assessors – are used for learning and testing.

Data set

Within ‘Internet Mathematics 2009’ we distribute real relevance tables that are used to learn the ranking formula at Yandex. The tables contain computed and normalized features of query-document pairs as well as relevance judgments made by Yandex assessors. The tables do not contain the original queries or the URLs of the original documents, and the semantics of the features are not revealed (the features are simply numbered). Examples of the features present in the tables are TF*IDF, PageRank, and query length in words.

The data set is divided into two files: a learning set (imat2009_learning.txt) and a test set (imat2009_test.txt). The learning set file contains 97 290 lines that correspond to 9 124 queries. The test set (115 643 lines) is divided into two parts: the first 21 103 lines are used for the preliminary public evaluation, and the rest for the final evaluation. The approximate breakdown of the data set is 45% learning, 10% public testing, and 45% final testing.

Each line in the data files corresponds to a query-document pair, and each pair is described by 245 features. All features are either binary, taking a value from {0, 1}, or continuous; the values of continuous features are mapped to the range [0, 1].

The data are represented in SVMlight format. A feature whose value is equal to zero is omitted. The query ID is given as a comment at the end of the line. The learning set contains relevance judgments with values in the range [0, 4] (4 – ‘highly relevant’, 0 – ‘irrelevant’).
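The 45% / 10% / 45% breakdown quoted above is rounded. As a quick check, the short sketch below recomputes the shares from the line counts given in the text; the constant and function names are illustrative only. The shares come out at roughly 45.7%, 9.9% and 44.4%.

```python
# Line counts taken from the data set description.
LEARNING_LINES = 97_290
TEST_LINES = 115_643
PUBLIC_LINES = 21_103  # first part of the test set


def breakdown():
    """Return each part's share of all query-document pairs."""
    final_lines = TEST_LINES - PUBLIC_LINES
    total = LEARNING_LINES + TEST_LINES
    return {
        "learning": LEARNING_LINES / total,
        "public testing": PUBLIC_LINES / total,
        "final testing": final_lines / total,
    }


if __name__ == "__main__":
    for part, share in breakdown().items():
        print(f"{part}: {share:.1%}")
```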

More formally, the file format of the learning set is as follows:

<line> .=. <relevance> <feature>:<value> <feature>:<value> ... <feature>:<value> # <queryid>
<relevance> .=. <float>
<feature> .=. <integer>
<value> .=. <float>
<queryid> .=. <integer>
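A line in this format can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the names `Example`, `parse_line` and `to_dense` are hypothetical, the sample line in the usage note is made up, and the sketch assumes that feature numbers start at 1, as is conventional for SVMlight files.

```python
from typing import Dict, List, NamedTuple


class Example(NamedTuple):
    relevance: float
    features: Dict[int, float]  # sparse: feature number -> value
    query_id: int


def parse_line(line: str) -> Example:
    """Parse '<relevance> <feature>:<value> ... # <queryid>'."""
    body, _, comment = line.partition("#")
    tokens = body.split()
    relevance = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, _, value = token.partition(":")
        features[int(index)] = float(value)
    return Example(relevance, features, int(comment.strip()))


def to_dense(features: Dict[int, float], n_features: int = 245) -> List[float]:
    """Expand a sparse feature map; omitted features are zero.

    ASSUMPTION: feature numbers are 1-based (SVMlight convention).
    """
    dense = [0.0] * n_features
    for index, value in features.items():
        dense[index - 1] = value
    return dense
```

For instance, `parse_line("3 1:0.5 7:1 245:0.25 # 42")` yields relevance 3.0, three non-zero features, and query ID 42; `to_dense` then produces a 245-element vector in which every other position is 0.0.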


Participants’ relevance estimates are sorted in descending order within each query; when estimates are equal, the document with the worse assessor judgment is ranked higher. The main quality metric is Discounted Cumulative Gain (DCG) averaged over all queries. We use the following formula for DCG:
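As a rough illustration of the ranking and averaging steps described above, the sketch below sorts documents within each query by estimate (descending), breaks ties in favour of the worse judgment, and averages DCG over queries. It assumes one widely used DCG form, a gain of 2^rel - 1 with a 1/log2(position + 1) discount; this is an assumption made for illustration and may differ from the contest's own formula, which is not reproduced here. All function names are hypothetical.

```python
import math
from collections import defaultdict


def rank_pessimistic(pairs):
    """Order (estimate, judgment) pairs: estimate descending,
    ties broken by putting the worse (lower) judgment first."""
    return sorted(pairs, key=lambda p: (-p[0], p[1]))


def dcg(judgments):
    """DCG of judgments listed in rank order.

    ASSUMPTION: gain 2**rel - 1 and discount 1/log2(pos + 1);
    the contest's actual formula may differ.
    """
    return sum(
        (2 ** rel - 1) / math.log2(pos + 1)
        for pos, rel in enumerate(judgments, start=1)
    )


def mean_dcg(estimates, judgments, query_ids):
    """Average per-query DCG over all queries."""
    by_query = defaultdict(list)
    for est, rel, qid in zip(estimates, judgments, query_ids):
        by_query[qid].append((est, rel))
    per_query = [
        dcg([rel for _, rel in rank_pessimistic(pairs)])
        for pairs in by_query.values()
    ]
    return sum(per_query) / len(per_query)
```

The pessimistic tie-break means that assigning the same estimate to every document in a query cannot score better than the worst possible ordering of those documents.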


A contest submission is a file with exactly 115 643 lines; each line contains a number (the obtained relevance estimate) and corresponds to the matching line in the test set file. The first 21 103 lines are used for the preliminary public evaluation, and the rest are used for the final evaluation. The current rating of solutions is compiled from the preliminary public evaluation results. Each team can upload results many times before the deadline, but no more than once every 10 minutes. After the deadline, the final evaluation is conducted over the second part of the test set, and the winners are announced based on its results.
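The submission format described above can be produced and checked with a small helper. This is a sketch under stated assumptions: the function names are hypothetical, and the six-decimal number formatting is an arbitrary choice, since the text only requires one number per line.

```python
N_TEST_LINES = 115_643   # lines in imat2009_test.txt
N_PUBLIC_LINES = 21_103  # leading lines used for public evaluation


def format_submission(estimates):
    """Return submission text: one relevance estimate per line.

    Raises ValueError unless there is exactly one estimate
    per test-set line.
    """
    estimates = list(estimates)
    if len(estimates) != N_TEST_LINES:
        raise ValueError(
            f"expected {N_TEST_LINES} estimates, got {len(estimates)}"
        )
    # ASSUMPTION: fixed six-decimal formatting; any plain number should do.
    return "".join(f"{value:.6f}\n" for value in estimates)


def split_public_final(estimates):
    """Split estimates into the public and the final evaluation parts."""
    return estimates[:N_PUBLIC_LINES], estimates[N_PUBLIC_LINES:]
```

Checking the line count before upload is cheap and avoids wasting one of the rate-limited submission attempts on a malformed file.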

Download data

Data are available only for personal use and exclusively for participation in the ‘Internet Mathematics 2009’ contest.

Download archive .tar.bz2 (37 MB)

Download archive .zip (58 MB)