In celebration of more than three years of delivering the best search experience on the Internet, Google is sponsoring the first annual Google Programming Contest.
Google is providing a selection of about 900,000 web pages in pre-parsed and raw format, together with a "ripper" program that provides a framework for processing the pre-parsed data. Your mission is to write a program (most likely by adding code to the ripper) that does something interesting with the data, in such a way that it would scale to a web-sized collection of documents. Part of your job is to convince us that your program is interesting and that it will scale; other than that, you're free to implement whatever strikes your fancy.
We suggest you fit your entry in one of two different tracks: Systems or Applications.
Entries in the Systems track generally pertain to infrastructure for handling the data, where typical goals are systems-related (i.e., speed/space properties). Some examples of possible projects include:
- Achieving better compression for the repository (starting from either the pre-parsed or raw formats). You might make a case for why your compression scheme saves the most space, or saves space while still allowing quick access to the data.
- Designing and implementing an efficient index structure to quickly find all documents that contain a given word or phrase.
- Constructing a link graph for the data and providing fast access to it.
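As an illustration of the index-structure idea in the list above, here is a minimal sketch of a positional inverted index that answers word and phrase queries. It is not part of the supplied ripper code: the function names (`tokenize`, `build_index`, `find_phrase`) and the whitespace tokenizer are assumptions made for the example, it is written in present-day Python rather than the Python 2.2 named in the entry requirements, and it keeps everything in memory, so it shows the data structure rather than a web-scale design.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a real tokenizer would handle markup)."""
    return text.lower().split()


def build_index(docs):
    """Map each word to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def find_phrase(index, phrase):
    """Return sorted doc ids that contain the words of `phrase` consecutively."""
    words = tokenize(phrase)
    if not words or any(w not in index for w in words):
        return []
    # Only documents containing every word can contain the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in candidates:
        later = [set(index[w][doc_id]) for w in words[1:]]
        # The phrase starts at `start` if each later word sits at the next offset.
        if any(all(start + i + 1 in s for i, s in enumerate(later))
               for start in index[words[0]][doc_id]):
            hits.append(doc_id)
    return sorted(hits)
```

A single-word query reduces to a plain postings lookup, since there are no later words to check. Storing positions costs space, which is exactly the kind of speed/space trade-off a Systems entry would need to argue for.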
Entries in the Applications track generally deal with the semantics of the data. Some examples include:
- Detecting common templates in pages, and separating out the common structure from the individual content.
- Classifying links on a page.
- Detecting pages that are near-duplicates of one another.
- Clustering pages by topic or type.
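To make the near-duplicate idea in the list above concrete, one common approach (an assumption here, not something the contest prescribes) is to break each page into overlapping word windows called shingles and compare the resulting sets with Jaccard similarity. The names `shingles`, `jaccard`, and `near_duplicates`, the window size `k=3`, and the `0.5` threshold are all illustrative choices.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield one shingle holding all their words, or none if empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Return sorted (id1, id2) pairs whose shingle sets meet the threshold.

    Compares every pair, so the cost is quadratic in the number of documents.
    """
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    ids = sorted(sets)
    return [(a, b)
            for i, a in enumerate(ids)
            for b in ids[i + 1:]
            if jaccard(sets[a], sets[b]) >= threshold]
```

The all-pairs comparison is fine for a handful of pages but would not scale to a web-sized collection; an entry along these lines would likely need a sketching technique such as MinHash to avoid comparing every pair directly.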
The supplied repository is several orders of magnitude smaller than the ultimate target repository for the code, because of the limitations of the distribution media and the likely resource constraints of many entrants. Keep this in mind when designing your implementation. You should assume that your code will ultimately run on a collection of networked machines with a reasonable amount of memory (~2-4 gigabytes each), where the data is divided among them. You will probably need to combine partial results from each machine to form a single final result.
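The "combine partial results" step can be sketched as follows, assuming a hypothetical task of counting in-links per URL. Each machine counts links within its own shard of `(source, target)` pairs, and a final step sums the per-machine counters. The function names and the shard layout are assumptions for illustration; the contest materials do not define them.

```python
from collections import Counter


def partial_inlink_counts(shard_links):
    """Count in-links per target URL within one machine's shard of (source, target) pairs."""
    return Counter(target for _source, target in shard_links)


def merge_counts(partials):
    """Combine per-machine counters into one global counter by summing."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing them
    return total


# Two hypothetical shards, as if held on two different machines.
shards = [
    [("a", "b"), ("c", "b")],
    [("d", "b"), ("d", "a")],
]
merged = merge_counts(partial_inlink_counts(s) for s in shards)
```

Because addition is associative and commutative, the partial counters can be merged in any order or in a tree of intermediate merges, which is what makes this pattern suitable for data divided among many machines.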
The limited size of the repository being distributed and the selection of documents may preclude certain interesting kinds of document processing. This repository includes a selection of HTML Web pages from 100 different sites in the "edu" domain.
How to Enter
Read the Contest Rules located at the bottom of this page. By participating in this contest, you agree to be bound by these Contest Rules.
If you reach the point of needing the full data set and are unable to download it, you may request that we mail you the data on a set of five CDs. Note that CDs will not be mailed out until late February, so we strongly encourage you to start your development by downloading the code and sample data provided. E-mail your request for CDs, including a postal address, to email@example.com.
We provide source code in C++. You may also choose to write your code in Java or Python, in which case you are responsible for implementing any necessary interface code. Your submission must include a Makefile and README, and must compile on Linux 2.2 or 2.4 using g++ (for C++ code) or standard Sun tools (for Java code) or Python version 2.2. If your code depends on third-party packages, you must include a complete list of all packages, including exact version information and download URLs. Sorry, we cannot accept entries that require commercial software or other software that is not provided as open source or under GPL.
You may submit multiple entries. Keep copies for your records. Google assumes no responsibility for lost, misdirected, illegible or late entries or for failed computer transmissions or technical failures.
If you want to discuss ideas and problems related to the programming contest with other participants, visit the Google Groups programming contest newsgroup: google.public.programming-contest.
Winners will be selected by a panel of Google staff scientists. The judges will grade entries using the following criteria:
The judges shall have the sole authority and discretion to select the award recipient(s).
To participate in the Google Programming Contest (the "Contest"), you must be at least 18 years old. The Contest is open to individuals or teams of up to 3 people, but not to corporate entries. Employees and contractors of Google, Inc. ("Google") and members of their immediate families are not eligible to enter. Void where prohibited.
With regard to the software and repository that you obtain for the Contest, you agree to the license terms as stated in files you download or receive. With regard to an entry you submit as part of the Contest, you grant Google a worldwide, perpetual, fully paid-up, non-exclusive license to make, sell, or use the technology related thereto, including but not limited to the software, algorithms, techniques, concepts, etc., associated with the entry.
If you are selected as a contest winner, you agree that Google may publicize your name, likeness, and the description of work you did to win the contest. Apart from the prizes associated with being selected as a winner, Google shall not be obligated to compensate you in any way for such publicity.
One $10,000 cash prize will be awarded to the winning entry. If the winning entry is submitted by more than one individual, the $10,000 cash prize will be divided equally among the participants who submit the winning entry. In addition, Google shall provide each member of the winning team a round-trip ticket for a commercial carrier flight to the San Francisco Bay Area, and will reimburse each member of the winning team for up to 3 nights' stay at a hotel to be designated by Google, Inc.
Each entrant shall indemnify, defend, and hold Google harmless from any third party claims arising from or related to that entrant's participation in the Contest. In no event shall Google be liable to an entrant for acts or omissions arising out of or related to the Contest or that entrant's participation in the Contest.
Odds of winning depend on the number and quality of entries received. All taxes, including income taxes, are the sole responsibility of winners. No prize substitution is permitted. Winner(s) may be required to verify their entry.
The winning entry will be announced on the Google.com site by Google Inc. on May 31, 2002. Following the announcement, individual winners will be notified by e-mail. Winners have 14 days from notification to claim the prize. Prize may be claimed by return e-mail. Unclaimed prizes will not be awarded.
Contact Google Inc. at firstname.lastname@example.org.