Connecting Infrastructure, Connecting Research

Accelerating the Processing of Large Corpora: Using Grid Computing Technologies for Lemmatizing 176 Million Words Arabic Internet Corpus

Name: Majdi Sawalha
Institution: University of Leeds
Research: Accelerating the Processing of Large Corpora: Using Grid Computing Technologies for Lemmatizing 176 Million Words Arabic Internet Corpus

The Arabic Internet Corpus is one of several large electronic collections of text (corpora) compiled for Translation Studies research. It consists of about 176 million words of raw text with no further annotation. Majdi wanted to add the lemma (the dictionary headword form of a word) and the root (the three- or four-letter origin of a word) for each word in the Arabic Internet Corpus.

Arabic differs from English and other European languages. Hundreds of Arabic words can be derived from the same root, and a lemma can appear in a text in many different forms because of attached prefixes and suffixes: morphemes added to the beginning or end of a word to form a new word, like English pre-, -ness, plural -s and past-tense -ed. Adding the lemma and extracting the root are therefore necessary for search applications, so that inflected forms of a word can be grouped together, as the sketch below illustrates.
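The toy Python sketch below (the word forms and analyses are simplified assumptions for illustration, not SALMA output) shows the idea: once each surface form is annotated with its lemma and root, a single lemma or root query groups all the inflected forms together.

    from collections import defaultdict

    # (surface form, lemma, root) triples for a few inflected forms built on the
    # root k-t-b ("writing"); simplified analyses for illustration only
    analyses = [
        ("الكتاب", "كتاب", "كتب"),    # al-kitab, "the book"
        ("وكتابهم", "كتاب", "كتب"),   # wa-kitabuhum, "and their book"
        ("كتب", "كتاب", "كتب"),       # kutub, "books" (broken plural)
        ("كاتب", "كاتب", "كتب"),      # katib, "writer": different lemma, same root
    ]

    by_lemma = defaultdict(list)
    by_root = defaultdict(list)
    for surface, lemma, root in analyses:
        by_lemma[lemma].append(surface)
        by_root[root].append(surface)

    # A search on the lemma now retrieves every inflected surface form,
    # which plain string matching on the raw text would miss.
    print(by_lemma["كتاب"])   # the three forms of "book"
    print(by_root["كتب"])     # all four forms sharing the root k-t-b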

Majdi used the SALMA Tagger (Sawalha Atwell Leeds Morphological Analyses Tagger) to add information to each word in the Arabic Internet Corpus at two levels: the lemma and the root. The SALMA Tagger is relatively slow; in initial tests it processed 7 words per second. The lack of speed is because it has to deal with the way in which Arabic words are written, spell-check each word's letters, short vowels and diacritics (marks added above or below letters to indicate correct pronunciation), and consult the large dictionaries provided to the analyzer.
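For illustration, the short calculation below turns that measured speed into an estimate of the serial running time; it matches the figure quoted in the next paragraph.

    # Back-of-the-envelope check: 176 million words at roughly 7 words per
    # second on a single processor
    words = 176_000_000
    words_per_second = 7

    seconds = words / words_per_second
    days = seconds / (60 * 60 * 24)
    print(f"{days:.0f} days")  # ~291 days, consistent with the ~300-day estimate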

The estimated execution time for lemmatizing the full Arabic Internet Corpus was 300 days on an ordinary uni-processor machine. To reduce the processing time, Majdi used the National Grid Service (NGS) to lemmatize the corpus, gaining a massive reduction in execution time. He did this by dividing the corpus into half-million-word files and then writing a program that generates scripts to run the lemmatizer on each file in parallel. The output files were combined into one lemmatized Arabic Internet Corpus comprising 176 million word-tokens, 2,412,983 word-types, 322,464 lemma-types and 87,068 root-types. By using the NGS he reduced the execution time for processing the 176M-word corpus to only 5 days. This could have been reduced further had he been able to allocate enough CPUs to process all files strictly in parallel; the NGS provides virtual parallel processing on a reduced set of CPUs.
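The sketch below outlines the split-and-parallelise approach in Python. It is a minimal illustration only: the corpus file name, the one-token-per-line format, the SALMA command line and the qsub submission command are all assumptions, and the real NGS job scripts would differ.

    import os

    CHUNK_WORDS = 500_000  # half-million-word files, as described in the case study
    os.makedirs("chunks", exist_ok=True)
    os.makedirs("jobs", exist_ok=True)

    def write_chunk(index, words):
        path = f"chunks/chunk_{index:04d}.txt"
        with open(path, "w", encoding="utf-8") as out:
            out.write("\n".join(words))
        return path

    paths, chunk, index = [], [], 0
    with open("corpus.txt", encoding="utf-8") as corpus:  # assumed: one token per line
        for line in corpus:
            chunk.append(line.strip())
            if len(chunk) == CHUNK_WORDS:
                paths.append(write_chunk(index, chunk))
                chunk, index = [], index + 1
    if chunk:
        paths.append(write_chunk(index, chunk))

    # Generate one batch-job script per chunk so all chunks can run in parallel.
    for i, path in enumerate(paths):
        script = f"jobs/job_{i:04d}.sh"
        with open(script, "w") as job:
            job.write("#!/bin/bash\n")
            # placeholder command line; not the real SALMA Tagger interface
            job.write(f"python salma_tagger.py {path} {path}.lemmatized\n")
        # The submission command is also an assumption (a local batch system's
        # qsub); on the NGS the jobs might instead be submitted via Globus tools.
        print(f"qsub {script}")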

To evaluate the accuracy of the lemmatizer, 10 random samples of 100 words each were selected from the lemmatized corpus. For each sample, Majdi computed the accuracy of the root and lemma analyses and found that both were consistent across samples: the average root accuracy was about 81.20% and the average lemma accuracy was 80.80%.
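A small sketch of this sample-based evaluation is given below. It assumes the per-word correctness judgements (which in the case study were made by hand) have already been recorded as booleans; the data structure is illustrative, not Majdi's actual format.

    import random

    def evaluate(analysed_words, samples=10, sample_size=100, seed=1):
        """analysed_words: list of dicts with boolean keys 'root_correct' and
        'lemma_correct' recording the manual judgement for each analysed word."""
        rng = random.Random(seed)
        root_acc, lemma_acc = [], []
        for _ in range(samples):
            sample = rng.sample(analysed_words, sample_size)
            root_acc.append(sum(w["root_correct"] for w in sample) / sample_size)
            lemma_acc.append(sum(w["lemma_correct"] for w in sample) / sample_size)
        # Return the average root and lemma accuracy across the samples.
        return sum(root_acc) / samples, sum(lemma_acc) / samples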

Majdi said: “Roughly, an estimated execution time for lemmatizing the full Arabic Internet Corpus was 300 days using an ordinary uni-processor machine. By using the computational power of the NGS a massive reduction in execution time was gained – instead it only took 5 days.” It wasn’t just Majdi who benefited from using NGS resources. He explained: “It made the processed Arabic Internet Corpus available to other Translation Studies and Arabic and Middle Eastern Studies researchers at the University of Leeds and other institutions worldwide.”

Majdi’s supervisor, Dr Eric Atwell, added: “I hope we convinced, at least some, that Arabic is interesting and challenging! I must think about how to make more use of HPC resources in future Arabic-computing research proposals...”

Image - A Wordle visualisation of the Quran

PI - Dr. Eric Atwell

Funding body - University of Jordan

Download a summary slide of this case study.