
Monday, July 22, 2013

2013-07-15: Temporal Intention Relevancy Model (TIRM) Data Set

On the third anniversary of the Haiti earthquake, President Barack Obama held a press conference and discussed the need to keep helping the Haitian community and to invest more in rebuilding its economy. A user who was watching the press conference tweeted about it on the 14th of January and
provided a link to the streamed news. A couple of days later, when I read this tweet and clicked on the link, instead of seeing anything related to the press conference, Haiti, or President Obama, I got a streamed feed of the Mercedes-Benz Superdome in New Orleans in preparation for the 2013 Super Bowl. It is worth mentioning that, at the time of writing this blog post, the tweet above had actually been deleted, proving that social posts don't persist throughout time, as we discussed in an earlier post.

This scenario illustrates the problem we are trying to detect, model, and solve: the inconsistency between what the author intended at the time of sharing and what the reader sees at the time of clicking the link in the tweet.
It is evident that resources change, relocate, or even disappear. In some cases this is tolerable, but in others it is not, especially when the shared content is significantly important (e.g., related to a revolution, a protest, or corruption claims).

From these observations we decided to perform experiments to detect and model this "user intention" of the author at the time of tweeting, and to measure how accurately it is perceived by the reader at any point in time. In our JCDL 2013 paper, we found that the problem of intention is not straightforward, and that to model it correctly a mapping should be performed to transform the intention problem into a relevancy and change problem.

We initially used Amazon's Mechanical Turk in a direct manner, asking workers about intention; unfortunately, this approach produced very low inter-rater agreement.
After a closer look at the most popular tasks on Mechanical Turk, we found that categorization and classification problems are the most prominent: the questions asked of the workers are simpler and require far less explanation.
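
For concreteness, agreement in such labeling tasks is commonly quantified with Fleiss' kappa. The sketch below is a minimal, generic implementation with made-up example ratings; it is not our actual experimental data or code.

    import numpy as np

    def fleiss_kappa(ratings):
        """ratings[i][j] = number of workers who put item i in category j."""
        ratings = np.asarray(ratings, dtype=float)
        n_items = ratings.shape[0]
        n_raters = ratings[0].sum()          # raters per item, assumed constant
        p_j = ratings.sum(axis=0) / (n_items * n_raters)   # category proportions
        P_i = ((ratings ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
        return (P_i.mean() - (p_j ** 2).sum()) / (1 - (p_j ** 2).sum())

    # Three workers labeling five items into three categories (toy data).
    print(fleiss_kappa([[3, 0, 0], [0, 3, 0], [1, 1, 1], [2, 1, 0], [0, 0, 3]]))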

We introduce the Temporal Intention Relevancy Model (TIRM) to illustrate the mapping between intention and relevancy. Consider the following tweet from Pfizer. The tweet has a link which leads to the company's newsletter, which is updated with its latest announcements.

At any point in time this page is still relevant to the tweet, so we can deduce that the intention behind posting it is for the reader to see whatever the current state of the page is. In other words, if the page has changed from its state at the time of tweeting and is still relevant, we can assume the intention is: current state.

Similarly, we notice a different pattern upon inspecting a tweet posted on the day Michael Jackson died, linking to CNN.com. The front page of CNN.com has definitely changed since the time of the tweet, and its content is no longer relevant to the tweet.
Thus, the author's intention was for the reader to see the state of the page at the time he tweeted about it. In short, if the page has changed and is no longer relevant to the tweet, we can assume that the author's intention is: past state of the resource. So, we dig it up from the web archives.
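
As an illustration of "digging it up", here is a minimal sketch that asks the Internet Archive's Wayback Machine availability API for the snapshot closest to a given date. The example URL and timestamp are placeholders, not the actual tweet's data.

    import json
    import urllib.parse
    import urllib.request

    def closest_memento(url, timestamp):
        """Return the Wayback snapshot closest to timestamp (YYYYMMDD...)."""
        api = ("http://archive.org/wayback/available?url=%s&timestamp=%s"
               % (urllib.parse.quote(url, safe=""), timestamp))
        with urllib.request.urlopen(api) as resp:
            data = json.load(resp)
        snap = data.get("archived_snapshots", {}).get("closest")
        return snap["url"] if snap else None

    # e.g., the CNN front page around the day Michael Jackson died
    print(closest_memento("http://www.cnn.com/", "20090625"))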

In a large number of social posts the resource remains unchanged and still relevant to the post. In this case we assume the intention is the state of the resource at the point in time when the author published the post, but since it is unchanged, a current version will do as well.
Finally, when the resource is unchanged and has never been related to the post, we do not have enough information to decide which intention the author wanted to convey. This scenario happens often in spam posts.
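
The four cases above boil down to a simple 2x2 mapping from (changed, relevant) to intention. A sketch, with labels paraphrased from the discussion above; the function name and return strings are ours, not the paper's.

    def tirm_intention(changed, relevant):
        """Sketch of the TIRM mapping; labels paraphrase the cases above."""
        if changed and relevant:
            return "current state"                    # e.g., the Pfizer newsletter
        if changed and not relevant:
            return "past state (consult the web archives)"   # e.g., the CNN tweet
        if not changed and relevant:
            return "past state, but the current version suffices"
        return "undecidable (often spam)"             # unchanged, never relevant

    print(tirm_intention(changed=True, relevant=False))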


We use Mechanical Turk to collect the training data for our model, along with multiple features related to the social post, such as its nature, archivability, social presence, and the resource's content.


The resulting dataset was used to extract 39 different textual and semantic features, which were used to train a classifier implementing the TIRM. We argue that this gold standard dataset will pave the way for future temporal-intention-based studies. Currently, we are extending the experiments and refining the features.
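
As a rough sketch of that final step, assuming the gold standard labels and the 39 features have been exported to a table (the file name, column names, and choice of a random forest are illustrative assumptions, not the paper's exact setup), the training could look like:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("tirm_gold_standard.csv")     # hypothetical export
    X = data.drop(columns=["intention"])             # the 39 feature columns
    y = data["intention"]                            # worker-derived labels

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    print(cross_val_score(clf, X, y, cv=10).mean())  # 10-fold accuracy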

For further details, please refer to the paper:

Hany M. SalahEldeen and Michael L. Nelson, "Reading the Correct History? Modeling Temporal Intention in Resource Sharing," Proceedings of the Joint Conference on Digital Libraries (JCDL 2013), Indianapolis, Indiana, 2013. Also available as a technical report: http://arxiv.org/abs/1307.4063

- Hany SalahEldeen

Friday, June 17, 2011

2011-06-17: The "Book of the Dead" Corpus

We are delighted to introduce the "Book of the Dead", a corpus of missing web pages. The corpus contains 233 URIs, all of which are dead, meaning they result in a 404 "Not Found" response. The pages were collected between 2004 and 2006 during a crawl conducted by the Library of Congress for web pages related to the topics of federal elections and terror.
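
To verify for yourself that a URI is dead in this sense, you can dereference it and check for a 404 response. A minimal sketch; "book_of_the_dead.txt" (one URI per line) is a placeholder file name, not the corpus's actual format.

    import urllib.error
    import urllib.request

    def is_dead(uri):
        try:
            urllib.request.urlopen(uri, timeout=10)
            return False                  # 2xx/3xx: the page still resolves
        except urllib.error.HTTPError as e:
            return e.code == 404          # dead in the corpus's 404 sense
        except urllib.error.URLError:
            return False                  # unreachable host, not a clean 404

    with open("book_of_the_dead.txt") as f:
        for uri in (line.strip() for line in f):
            print(uri, "dead" if is_dead(uri) else "alive")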

We created the corpus to test the performance of our methods for rediscovering missing web pages, introduced in the paper "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure", published at JCDL 2010. In addition, we now thankfully have Synchronicity, a tool that can help overcome the 404 detriment to everyone's browsing experience in real time.

To the best of our knowledge, the Book of the Dead is the first corpus of its kind. It is publicly available, and we hope fellow researchers can benefit from it in their related work. The corpus can be downloaded at: http://bit.ly/Book-of-the-Dead

And one more thing... not only does the corpus include the missing URIs, it also contains a best guess of what each URI used to be about. We used Amazon's Mechanical Turk and asked workers to guess what the content of the missing pages used to be. We provided only the URIs and the general topics, elections and terror; the workers were supposed to analyze just the URI and draw their conclusions. Sometimes this can be an easy task, for example the URI:

http://www.de.lp.org/election2004/morris.html

is clearly about an election event in 2004. One might also recognize that "lp" stands for Libertarian Party and "de" for Delaware. Now this URI makes real sense, and most likely "Morris" was a candidate running for office during the elections.
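
The same kind of reasoning a worker applies can be crudely mechanized by splitting the URI into tokens that hint at its "aboutness". A toy sketch, not part of the corpus tooling:

    import re
    from urllib.parse import urlparse

    def uri_tokens(uri):
        """Split host and path into candidate 'aboutness' tokens."""
        parsed = urlparse(uri)
        raw = parsed.netloc.split(".") + re.split(r"[/_\-.]", parsed.path)
        return [t for t in raw if t and t.lower() not in {"www", "html", "htm"}]

    print(uri_tokens("http://www.de.lp.org/election2004/morris.html"))
    # -> ['de', 'lp', 'org', 'election2004', 'morris']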

Altogether, the Book of the Dead offers missing URIs and their estimated "aboutness", which makes it a valuable dataset for retrieval and archival research.
--
martin