Thursday, July 30, 2009

2009-07-30: Position Paper Published in Educause Review

The July/August 2009 issue of Educause Review has a position paper of mine entitled "Data Driven Science: A New Paradigm?" This invited paper is essentially a cleaned-up version of my position paper from the 2007 NSF/JISC Workshop on Data-Driven Science and Scholarship held in Arizona, April 17-19, 2007. Prior to the workshop, we were all assigned topics on which we were to write a short position paper. My topic was to address the question of whether "data-driven science is becoming a new scientific paradigm – ranking with theory, experimentation, and computational science?"

You can judge my response by the original paper's more cheeky title of "I Don't Know and I Don't Care". My argument can be summed up as "we've always had data-driven science at whatever was the largest feasible scale; it just happens that the scale is now very large." Scale is important; in fact, some days I might argue that scale is all there is. But partitioning into paradigms does not seem helpful -- every other dimension of our life is now at web-scale, so why not our science?

Thanks to Ron Larsen for co-hosting (with Bill Arms) the workshop in 2007 and for resurrecting the paper for Educause Review.

--Michael

Thursday, July 16, 2009

2009-07-17: Technical Report "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure"

This week I uploaded our technical report, co-authored with Michael L. Nelson, to the e-print service arXiv.org. The underlying idea of this research is to utilize the web infrastructure (search engines, their caches, the Internet Archive, etc.) to rediscover missing web pages - pages that return the 404 "Page Not Found" error. We apply various methods to generate search engine queries based on the content of the web page and on user-created annotations about the page. We then compare the retrieval performance of all methods and introduce a framework that combines such methods to achieve the optimal retrieval performance.
The applied methods are:
  • 5- and 7-term lexical signatures of the page
  • the title of the page
  • tags with which users annotated the page on delicious.com
  • 5- and 7-term lexical signatures of the page neighborhood (up to 50 pages linking to the missing page)
We query the big three search engines (Google, Yahoo and MSN Live) with the output of each method and analyze the result sets to compare retrieval performance.
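As a rough illustration of what computing a lexical signature involves, here is a minimal sketch based on the standard TF-IDF formulation; the function names, the document-frequency table, and the example data are illustrative assumptions, not our actual implementation:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def lexical_signature(page_text, df, num_docs, k=5):
    """Return the k terms of page_text with the highest TF-IDF scores."""
    tf = Counter(tokenize(page_text))
    def tfidf(term):
        # df.get(term, 1) guards against terms missing from the corpus statistics
        return tf[term] * math.log(num_docs / df.get(term, 1))
    return sorted(tf, key=tfidf, reverse=True)[:k]

# Made-up document frequencies for a background corpus of 10,000 pages:
df = {"the": 9000, "and": 8500, "digital": 120, "preservation": 12, "crawler": 8}
print(lexical_signature("the digital preservation crawler and the crawler", df, 10000, k=3))
# -> ['crawler', 'preservation', 'digital']
```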
We have shown in recent work (published at ECDL 2008) that lexical signatures perform very well for rediscovering missing web pages.

We distinguish between four retrieval categories: top (the URL is returned top-ranked), top10 (returned within the top 10 but not top-ranked), top100 (returned between ranks 11 and 100), and undiscovered (not returned in any of the categories above). For the 5- and 7-term lexical signatures, the retrieval performance shows a somewhat binary pattern: the vast majority of URLs are either returned within the top 10 or remain undiscovered.
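A minimal sketch of this bucketing (the function name and the rank convention are assumptions for illustration):

```python
def retrieval_category(rank):
    """Map the 1-based rank at which the missing URL is returned to a
    category; rank is None if the URL was not returned at all."""
    if rank == 1:
        return "top"
    if rank is not None and rank <= 10:
        return "top10"
    if rank is not None and rank <= 100:
        return "top100"
    return "undiscovered"

print([retrieval_category(r) for r in (1, 7, 42, None)])
# -> ['top', 'top10', 'top100', 'undiscovered']
```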

However, in this study we found that the pages' titles perform equally well. We further found that neither the tags about the pages nor the lexical signatures based on the page neighborhood performed satisfactorily. We need to mention, though, that we were not able to obtain tags for all URLs in our data set.
Inspired by the performance of titles, we also conducted a small-scale analysis of the consistency of our titles with respect to their retrieval performance. We looked at title length in terms of the number of terms and of characters, as well as at the number of stop words.
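For concreteness, these three title features could be computed along these lines (a sketch; the stop word list here is a small sample, not the one used in the study):

```python
# Small sample stop word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to"}

def title_features(title):
    """Compute the three title properties examined above."""
    terms = title.lower().split()
    return {
        "num_terms": len(terms),
        "num_chars": len(title),
        "num_stop_words": sum(1 for t in terms if t in STOP_WORDS),
    }

print(title_features("Evaluating Methods to Rediscover Missing Web Pages"))
# -> {'num_terms': 7, 'num_chars': 50, 'num_stop_words': 1}
```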

Since the title of a web page is much cheaper to obtain than a lexical signature, which requires complex computation, we recommend querying with the title first and with the lexical signature second for URLs that were not discovered in the first step. This experiment is both a follow-up to our work published at ECDL 2008 and the basis for a larger-scale study in the future.
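In outline, the recommended two-step strategy might look as follows (a sketch; `search` and `extract_title` are hypothetical helpers, and `lexical_signature` refers to the sketch earlier in this post):

```python
import re

def extract_title(html):
    """Pull the <title> element out of a stored copy of the page (simplified)."""
    m = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else ""

def rediscover(missing_url, html, page_text, df, num_docs, search):
    """Query with the cheap title first; fall back to the lexical
    signature only if the URL is not returned within the top 10.
    `search` is a hypothetical callable returning a ranked URL list."""
    results = search(extract_title(html))
    if missing_url in results[:10]:
        return results
    signature = " ".join(lexical_signature(page_text, df, num_docs, k=7))
    return search(signature)
```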
--
martin

2009-07-16: The July/August issue of D-Lib Magazine has JCDL and InDP reports.

The July/August 2009 issue of D-Lib Magazine includes reports for the 2009 ACM/IEEE JCDL (written by me) and InDP (written by Frank and his co-organizers), as well as several other reports for JCDL workshops and other conferences (such as Open Repositories 2009). Whereas my previous entry about JCDL & InDP focused on our group's experiences, these reports give a broader summary of the events.

--Michael

Tuesday, July 7, 2009

2009-07-07: Hypertext 2009

From June 30th through July 1st I attended Hypertext 2009 (HT 2009) in Torino, Italy. The conference saw a 70% increase in submissions (117 total) compared to last year but, due to an equally increased number of accepted papers (26 long and 11 short) and posters, maintained last year's acceptance rate of roughly 32% (37 of 117 papers). HT 2009 also set a record with 150 registered attendees.

I presented our paper "Comparing the Performance of US College Football Teams in the Web and on the Field" (DOI), joint work with Olena Hunsicker under the supervision of Michael L. Nelson. The paper describes an extensive study of the correlation between expert rankings of real-world entities and search engine rankings of their representative resources on the web.
At JCDL 2009 we published a poster, "Correlation of Music Charts and Search Engine Rankings" (DOI), with the results of a similar but much smaller-scale experiment.
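As an aside, agreement between two such rankings is typically quantified with a rank correlation coefficient. Here is a minimal sketch using Spearman's rho; the example data are made up for illustration, not results from either paper:

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two rankings of the same items,
    given as dicts mapping item -> 1-based rank (assumes no ties)."""
    n = len(rank_a)
    d_squared = sum((rank_a[i] - rank_b[i]) ** 2 for i in rank_a)
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# Made-up example: expert ranking vs. search engine ranking of four teams.
experts = {"team_a": 1, "team_b": 2, "team_c": 3, "team_d": 4}
engine  = {"team_a": 2, "team_b": 1, "team_c": 3, "team_d": 4}
print(spearman_rho(experts, engine))  # 0.8 (1.0 would mean identical orderings)
```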

It was my first time attending HT, and from my point of view there were four highlights that I would like to report on (in order of occurrence):
1) Mark Bernstein gave a very inspiring talk, "On Hypertext Narrative", and also advertised his new book titled "Reading Hypertext". He is also the chief scientist of Eastgate Systems and the designer of Tinderbox.

2) Lada Adamic gave the keynote "The Social Hyperlink" (slides). She talked about various experiments with social networks, e.g., the propagation of knowledge through social networks and how assets (such as dance moves) propagate in Second Life. She argued that it is often hard to differentiate between influence and correlation in social networks.

3) I got to meet and talk to Theodor (Ted) Nelson. Ted coined the term "hypertext" and is the father of the Xanadu project. He has authored various books, including his latest work, "Geeks Bearing Gifts". The best newcomer paper award at HT is named after him.

4) Ricardo Baeza-Yates gave the keynote "Relating Content by Web Usage", in which he argued that web search is no longer about document retrieval (a sad statement for IR fanatics) but about exploiting the wisdom of the crowds, since that provides popularity, diversity, coverage, and quality. Search is moving towards identifying the user's task and enabling its completion. He made a case for search transitioning from returning web documents to returning web objects such as people, places, and businesses, since these objects satisfy the user's intent.

Besides the impressions I took away from the conference, here are a few useless facts that I feel like sharing:
Torino seems like a nice place, but I did not get a chance to walk around and explore the city.
Italians (similar to the French) dine in several courses, so do not make the same rookie mistake I did and fill yourself up on the appetizers assuming it's all you get.
Italian cab drivers, of course, do not understand a single word of English unless it comes to how much of a tip you give them.
There are conference hotels on the face of this planet that do not provide irons for their guests...

Hypertext 2010 will be held in Toronto, Canada, June 14th - 17th, 2010.
--
martin