Web Science and Digital Libraries Research Group: IA

Friday, October 28, 2016

2016-10-27: UrduTech - The GeoCities of Urdu Blogosphere

On December 12, 2008, an Urdu blogger, Muhammad Waris, reported an issue in Urdu Mehfil about his lost blog, which was hosted on UrduTech.net. Not just Waris, but many other Urdu bloggers of that time were anxious about their lost blogs due to a sudden outage of the blogging service UrduTech. The downtime lasted for several weeks and changed the shape of the Urdu blogosphere.

Before diving into the UrduTech story, let's have a brief look at the Urdu language and the role of the Urdu Mehfil forum in promoting Urdu on the Web. Urdu is a language spoken by more than 100 million people worldwide (about 1.5% of the global population), primarily in India and Pakistan. It has a rich literature and has been one of the premier languages of poetry in South Asia for centuries. However, the digital footprint of Urdu has been relatively smaller than that of some other languages such as Arabic or Hindi. In the early days of the Web, computers were not easily available to the masses of the Urdu-speaking community. Urdu input support was often not built in or would require additional software installation and configuration. The right-to-left (RTL) direction of the text flow in Urdu script was another obstacle to writing and reading it on devices that were optimized for left-to-right languages. There were not many fonts that supported the Urdu character set completely and properly. The most commonly used Nastaleeq typeface was initially only available in a proprietary page-making software called InPage, which did not support Unicode and locked in the content of books and newspapers. Early online Urdu news sites used to export the content as images and publish those on the Web.

The Urdu community initially wrote Urdu text in Roman script on the Web, while efforts to promote Unicode Urdu were happening on a small scale; one such early effort was the Urdu Computing Yahoo Group by Eijaz Ubaid. In the year 2005, some people from the Urdu community, including Nabeel, Zack, and many others, took the initiative to build a platform to promote Unicode Urdu on the Web and created UrduWeb and, under it, a discussion board named Urdu Mehfil. This quickly became the hub for Urdu-related discussions, development, and idea exchange. The community created tools to ease the process of reading and writing Urdu on computers and on the Web. They created many beautiful Urdu fonts and keyboard layouts, translated various software and CMSes and customized themes to make them RTL-friendly, created dictionaries and encyclopedias, developed plugins for various software to enable Urdu in them, developed Urdu variants of the Linux OS, provided technical help and support, digitized printed books, created an Urdu blog aggregator (Saiyarah) to promote blogging and increase the visibility of new bloggers, and provided a platform to share literary work. These are just a few of the many contributions of UrduWeb. These efforts played a significant role in shaping the presence of Urdu on the Web.

I, Sawood Alam, have been associated with UrduWeb since early 2008 through my continuing interest in getting the language and culture online. For the last seven years I have been administering UrduWeb. In this period I have mentored various projects, developed many tools, and taken various initiatives. I recently collaborated with Fateh, another UrduWeb member, to publish a paper entitled "Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages" (PDF), in an effort to enable easy and fast lookup in many classical and culturally significant Urdu dictionaries that are available in scanned form in the Internet Archive.

To give a sense of the increased activity and presence of Urdu on the Web, we can take a couple of examples. In the year 2007, when UrduTech was introduced as a blogging platform, Urdu Wikipedia was in the third group of languages on Wikipedia based on the number of articles, with only 1,000+ articles. Fast forward eight years: in 2016 it has jumped to the second group of languages, with 100,000+ articles, and is actively growing.

In May 2015, the Google Translate Community hosted a translation challenge in which the Urdu language surfaced among the top ten most contributing languages, which Google Translate highlighted as, "Notably Bengali and Urdu are in the lead along with some larger languages."

Now, back to the Urdu blogging story. In the year 2007, the WordPress CMS was the most popular blogging software for those who could afford to host their own site and make it work. For those who were not technically sound or did not want to pay for hosting, WordPress and Blogger were among the most popular free hosted blogging platforms. However, when it came to Urdu, both platforms had some limitations. WordPress allowed flexible options for plugins, translations, theming, etc., but only if one ran the CMS on their own server. The free hosted service, in contrast, had a limited number of themes, none of which were RTL-friendly, and it did not allow custom plugins either. This meant that changing the CSS to better suit the rendering of mixed bidirectional content was not allowed, so lines containing bidirectional text (which is not uncommon in Urdu) were rendered in an unnatural and unreadable order. The lack of custom plugin support also meant that providing JavaScript-based Urdu input support in the reply form was not an option; as a result, articles would receive more comments in Roman script than in Urdu. On the other hand, Blogger allowed theme customization, but the comment form was rendered inside an iframe, and there was no way to inject external JavaScript into it to allow Urdu input support. As a result, those Urdu bloggers who chose one of these free hosted blogging services had to make some compromises.
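To make the "mixed bidirectional content" problem a little more concrete, here is a minimal Python sketch that checks whether a line mixes strongly left-to-right and strongly right-to-left characters, which is the kind of line that needs proper direction handling to render in a readable order. The function name `has_mixed_direction` is hypothetical and is not part of WordPress, Blogger, or any UrduWeb tool; the sketch only uses the bidirectional classes exposed by Python's standard `unicodedata` module.

```python
import unicodedata

# Strong bidirectional classes from the Unicode Character Database:
# "L" is left-to-right; "R" and "AL" (Arabic Letter) are right-to-left.
LTR_CLASSES = {"L"}
RTL_CLASSES = {"R", "AL"}


def has_mixed_direction(text):
    """Return True if text contains both strong LTR and strong RTL characters."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    return bool(classes & LTR_CLASSES) and bool(classes & RTL_CLASSES)


# "\u0627\u0631\u062f\u0648" is the word "Urdu" written in Urdu script.
urdu_word = "\u0627\u0631\u062f\u0648"
print(has_mixed_direction(urdu_word + " blog"))  # Urdu text with a Latin word
print(has_mixed_direction(urdu_word))            # Urdu text only
```

A theme that sets the base direction to RTL handles such lines sensibly; without that option, the browser falls back to a left-to-right base direction and the pieces of a mixed line end up in the wrong visual order.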

The technical friction of getting things to work for Urdu was a big reason for the slow adoption of Urdu blogging. To make it easier, Imran Hameed, a member of UrduWeb, introduced the UrduTech blogging service. People from UrduWeb, including Mohib, Ammar, Mawra, and some others, encouraged many people to start Urdu blogging. UrduTech used WordPress MU to allow multi-user blogging on a single installation. It was hosted on a shared hosting service. Creating a new blog was as simple as filling in an online form with three fields and hitting the "Next" button. From there, one could choose from a handful of beautiful RTL-friendly themes and enable pre-installed add-ons to allow Urdu input support, both in the dashboard for post writing and on the public-facing site for comments. By removing all the friction WordPress and Blogger had, UrduTech gave a big boost to the Urdu community, and many people started creating their blogs.

It turned out that creating a new blog on UrduTech was easy not just for legitimate people, but for spammers as well. This is evident from the earliest capture of UrduTech.net in the Internet Archive. Unfortunately, the stylesheets, images, and other resources were not well archived, so please bear with the ugly-looking (damaged Memento) screenshots.

Later captures in the web archive show that as the Urdu blogger community grew on UrduTech, so did the attacks from spam bots. This increased the moderation burden of actively and regularly cleaning up spam registrations.

The service ran for a little over a year with occasional minor downtimes. The Urdu blogosphere started evolving slowly, and the diversity of the content increased. During this period, some people slowly started migrating to other blogging platforms, such as their personal free or paid hosting, other Urdu blogging offerings, or the free hosted services of WordPress and Blogger. This is evident from the blogrolls of various bloggers in their archived copies.

Increasing activity on UrduTech from both humans and bots led to the point where the shared hosting provider decided to shut the service down without any warning. People were anxious about the sudden loss of their content and demanded backups. Who makes backups? (Hint: Web archives!) Imran, the founder of the service, was busy with other priorities, and it took him more than a month to bring the service back online. In the interim, people either decided never to blog again or swiftly moved on to other, more robust options to start over from scratch (as did Waris), having learned the hard way to back up their content regularly.

"Did Waris really lose all his hard work and the hundreds of valuable articles he wrote about Urdu and Persian literature and poetry?" I asked myself. The answer was perhaps to be found somewhere in the 20,000 hard drives of the Internet Archive. I didn't know his lost blog's URL, but the Internet Archive was there to help. I first looked through a few captures of UrduTech in the archive, and from there I was able to find his blog link. I was happy to discover that his blog's home page was archived a few times; however, the permalinks of individual blog posts were not. Also, the pages of the blog home with older posts were not archived either. This means that, from the last capture, only the 25 latest posts can be retrieved (without comments). When other, earlier captures of the home page are combined, a few more posts can be retrieved, but perhaps not all of them. Although the stylesheet and various template resources are missing, the images in the posts are archived, which is great.

What happened to the UrduTech service? When it came back online after a long outage, many people had already lost their interest and trust in the service. In less than three months, the service went down again, but this time it was the ultimate death of the service, and it stayed down until the domain name registration expired.

Due to its popularity and search engine ranking, the domain was a good target for drop catching. Mementos (captures) between November 27, 2011 and December 18, 2014 show a blank page when viewed using the Wayback Machine. A closer inspection of the page source reveals what is happening there. Using JavaScript, the page is loaded in the top frame (if it is not already there), and the page has frames to load more content. Unfortunately, the resources in the frames are not archived, so it is difficult to say how the page might have looked during that period. However, there is some plain text in the "noframes" fallback that reveals that the domain drop catchers were trying to exploit the "tech" keyword present in the UrduTech name, though they had nothing to do with Urdu.
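The kind of source inspection described above can be sketched in a few lines of Python. This is only an illustration, using the standard library's `html.parser`, of how the plain-text frames fallback could be pulled out of a frameset page; the class name `NoframesText`, the helper `noframes_text`, and the sample markup are all hypothetical and are not taken from the archived UrduTech captures.

```python
from html.parser import HTMLParser


class NoframesText(HTMLParser):
    """Collect the text found inside <noframes> ... </noframes>."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # > 0 while we are inside a <noframes> element
        self.chunks = []  # text fragments seen inside the fallback

    def handle_starttag(self, tag, attrs):
        if tag == "noframes":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "noframes" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def noframes_text(html):
    """Return the whitespace-normalized fallback text of a frameset page."""
    parser = NoframesText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical frameset page; the real captures are not reproduced here.
sample = (
    "<html><frameset rows='100%'>"
    "<frame src='http://example.com/content'>"
    "<noframes>Tech news\n and gadgets</noframes>"
    "</frameset></html>"
)
print(noframes_text(sample))
```

Because the framed resources themselves were not archived, this fallback text is the only human-readable content left in such a memento.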

Sometime before March 25, 2015, the domain name presumably went through another drop catch. Alternatively, it is possible that the same domain name owner decided to host a different type of content on that domain. Whatever the case, since then the domain has been serving a health-related "legitimate-looking fake" site, which is still live and adds new content every now and then. However, the content of the site has nothing to do with either "Urdu" or "tech".

UrduTech simplified what was a challenging task at that time, made it accessible to people with little technical skill, and grew the community. The service itself died, but the community moved on (though the hard way) and transformed into a more mature and stable blogosphere. It played the same role for Urdu blogging that GeoCities played for personal home page hosting, only on a smaller scale and for a specific community. Over time, Web technology matured, support for Urdu on computers and smartphones became better, awareness of the tools and technologies grew in the community in general, and various new communication media such as social media sites helped spread the word and connect people. Now, the Urdu blogosphere has grown significantly, and people in the community organize regular meetups and Urdu blogger conferences. Manzarnamah, another initiative from UrduWeb members, introduces new bloggers to the community, publishes interviews with regular bloggers, and distributes annual awards to bloggers. Bilal, another member of UrduWeb, is independently creating tools and guides to help new bloggers and the Urdu community in general. UrduTech was certainly not the only driving force for Urdu blogging, but it did play a significant role.

On the occasion of the 20th birthday celebration of the Internet Archive (#IA20), on behalf of the WS-DL Research Group and the Urdu community, I extend my gratitude for preserving the Web for 20 long years. Happy Birthday, Internet Archive; keep preserving the Web for many, many more years to come. I could only wish that the preservation were more complete and less damaged, but having something is better than nothing and, as DSHR puts it, "You get what you get and you don't get upset". Without these archived copies I would not have been able to augment my own memories and tell the story of the evolution of a community that is very dear to me and to many others. I can only imagine how many more such stories are buried in the spinning discs of the Internet Archive.

--
Sawood Alam

Friday, August 21, 2015

2015-08-20: ODU, L3S, Stanford, and Internet Archive Web Archiving Meeting

Two weeks ago (on Aug 3, 2015), I was glad to be invited to visit the Internet Archive in San Francisco to share our latest work with a group of web archiving pioneers from around the world.

The attendees were Jefferson Bailey and Vinay Goel from IA, Nicholas Taylor and Ahmed AlSum from Stanford, and Wolfgang Nejdl, Ivana Marenzi and Helge Holzmann from L3S.

First, we briefly introduced ourselves to each other, describing the purpose and nature of our work to IA.

Then, Nejdl introduced the Alexandria project and demoed the ArchiveWeb project, which aims to develop tools and techniques to explore and analyze Web archives in a meaningful way. In the project, they develop tools that will allow users to visualize and collaboratively interact with Archive-It collections by adding new resources in the form of tags and comments. Furthermore, the project includes a collaborative search and sharing platform.

I presented the off-topic detection work with a live demo of the tool, which can be downloaded and tested from https://github.com/yasmina85/OffTopic-Detection.

Detecting Off-Topic Pages in Web Archives from Yasmina Anwar

The off-topic tool aims to automatically detect when an archived page goes off-topic, that is, when the page has changed over time and moved away from its initial scope. The tool suggests a list of off-topic pages based on a threshold supplied by the user. Based on our evaluation of the tool, we suggest threshold values for detecting off-topic pages in a research paper*.
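The thresholding idea described above can be sketched in plain Python. This is a minimal illustration and not the tool's actual implementation: it assumes that each memento's text is compared with the text of the first memento, that similarity is the cosine of raw term-count vectors (the real tool may weight terms differently), and that a memento is flagged when its similarity falls below the threshold. The function names and the sample strings are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-count vectors."""
    va, vb = term_counts(text_a), term_counts(text_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page shares nothing with the first memento
    return dot / (norm_a * norm_b)


def off_topic_indexes(first_text, later_texts, threshold=0.15):
    """Indexes of later mementos whose similarity is below the threshold."""
    return [
        i
        for i, text in enumerate(later_texts)
        if cosine_similarity(first_text, text) < threshold
    ]


# Hypothetical page texts standing in for extracted memento content.
first = "egypt election campaign"
later = ["egypt election news", "this domain is for sale"]
print(off_topic_indexes(first, later))
```

A lower threshold flags fewer pages, so the value trades missed off-topic mementos against false alarms.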

A site for one of the candidates for Egypt's 2012 presidential election. Many of the captures of hamdeensabahy.com are not about the Egyptian Revolution. Later versions show an expired domain (as does the live Web version).

Examples for the usage of the tool:
--------

Example 1: Detecting off-topic pages in 1826 collection

python detect_off_topic.py -i 1826 -th 0.15
extracting seed list
…
http://agroecol.umd.edu/Research/index.cfm
http://casademaryland.org
…
50 URIs are extracted from collection https://archive-it.org/collections/1826
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://agroecol.umd.edu/Research/index.cfm
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://casademaryland.org
…
Downloading 4 mementos out of 306
Downloading 14 mementos out of 306
…
Detecting off-topic mementos using Cosine Similarity method

Similarity memento_uri
0.0 http://wayback.archive-it.org/1826/20131220205908/http://www.mncppc.org/commission_home.html/
0.0 http://wayback.archive-it.org/1826/20141118195815/http://www.mncppc.org/commission_home.html
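The memento URIs in the output above carry the collection ID, the capture time, and the original URI. A small Python sketch, assuming the layout seen in these lines (`wayback.archive-it.org/<collection>/<14-digit timestamp>/<original URI>`), shows how a reported URI could be split into those parts when reviewing results. The helper name `parse_memento_uri` is hypothetical and is not part of the off-topic tool.

```python
import re
from datetime import datetime

# Assumed layout, read off the output above:
# http://wayback.archive-it.org/<collection>/<YYYYMMDDhhmmss>/<original URI>
MEMENTO_RE = re.compile(
    r"^https?://wayback\.archive-it\.org/(\d+)/(\d{14})/(.+)$"
)


def parse_memento_uri(uri):
    """Split a memento URI into (collection ID, capture datetime, original URI)."""
    match = MEMENTO_RE.match(uri)
    if match is None:
        raise ValueError("not an Archive-It memento URI: " + uri)
    collection, stamp, original = match.groups()
    return collection, datetime.strptime(stamp, "%Y%m%d%H%M%S"), original


collection, captured, original = parse_memento_uri(
    "http://wayback.archive-it.org/1826/20131220205908/"
    "http://www.mncppc.org/commission_home.html/"
)
print(collection, captured.isoformat(), original)
```

Sorting flagged mementos by the parsed datetime makes it easier to see when a page first drifted away from its original topic.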

Example 2: Detecting off-topic pages for http://hamdeensabahy.com/

python detect_off_topic.py -t https://wayback.archive-it.org/2358/timemap/link/http://hamdeensabahy.com/ -m wcount -th -0.85

Downloading 0 mementos out of 270
http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140602131307/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140528131258/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130617131324/http://www.hamdeensabahy.com/

…
Downloading 4 mementos out of 270
…
Extracting text from the html
…
Detecting off-topic mementos using Word Count method

Similarity memento_uri
-0.979434447301 http://wayback.archive-it.org/2358/20121213102904/http://hamdeensabahy.com/
-0.966580976864 http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
-0.94087403599 http://wayback.archive-it.org/2358/20130526131402/http://www.hamdeensabahy.com/
-0.94087403599 http://wayback.archive-it.org/2358/20130527143614/http://www.hamdeensabahy.com/
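The negative scores above are consistent with a relative change in word count, where a value near -1 means the page lost almost all of its text. The following Python sketch illustrates that reading. It is an assumption about how the `wcount` method scores pages, not the tool's actual code, and the function names and sample texts are hypothetical.

```python
def word_count_change(first_text, later_text):
    """Relative change in word count with respect to the first memento."""
    base = len(first_text.split())
    if base == 0:
        raise ValueError("the first memento has no text to compare against")
    return (len(later_text.split()) - base) / base


def off_topic_by_word_count(first_text, later_texts, threshold=-0.85):
    """Indexes of mementos whose word count dropped below the threshold."""
    return [
        i
        for i, text in enumerate(later_texts)
        if word_count_change(first_text, text) < threshold
    ]


# Hypothetical texts: a 100-word page, then a 50-word and a 2-word capture.
first = " ".join(["word"] * 100)
later = [" ".join(["word"] * 50), "domain expired"]
print(off_topic_by_word_count(first, later))
```

Under this reading, the threshold of -0.85 used in Example 2 flags captures that retain less than 15% of the words of the first capture, which is typical of expired-domain or error pages.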

Nicholas emphasized the importance of the off-topic tool from a QA perspective, while the Internet Archive folks focused on the required computational resources and how the tool could be shared with Archive-It partners. The group discussed some user interface options for displaying the output of the tool.

After the demo, we discussed the importance of the tool, especially for crawl quality assurance practices. During the demo of the ArchiveWeb interface, some of the visualizations for pages from different collections showed off-topic pages. We all agreed that it is important that those pages do not appear to users when they browse the collections.

It was amazing to spend time at IA and learn about the latest trends from other research groups. The discussion showed the high reputation of WS-DL research in the web archiving community around the world.

*Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, Detecting Off-Topic Pages in Web Archives, Proceedings of TPDL 2015, 2015.

----
Yasmin
