
Tuesday, March 22, 2016

2016-03-22: Language Detection: Where to start?

Language detection is not a simple task, and no method achieves 100% accuracy. You can find different packages online to detect different languages. I have used several methods and tools to detect the language of websites and of short texts. Here is a review of the methods I came across while working on my JCDL 2015 paper, How Well are Arabic Websites Archived?. I discuss detecting a webpage's language using the HTTP Content-Language header and the HTML language attribute. In addition, I review several language detection packages: Guess-Language, Python-Language Detector, LangID, and the Google Language Detection API. Since Python is my favorite coding language, I focused on tools written in Python.

I found that a primary way to detect the language of a webpage is to use the HTTP Content-Language header or the HTML language attribute. However, only a small percentage of pages include them, and sometimes the reported language is affected by the browser settings. Guess-Language and Python-Language Detector are fast, but they are more accurate with more text, and you have to strip the HTML tags before passing the text to them. LangID detects the language and gives you a confidence score; it is fast, works well with short texts, and is easy to install and use. The Google Language Detection API is also a powerful tool, available for different programming languages, and it also provides a confidence score; however, you need to sign up, and if your dataset is large (more than 5,000 requests a day or 1 MB/day), you must choose a paid plan.

HTTP Language Header:
If you want to detect the language of a website, a primary method is to look at the HTTP response header Content-Language. The Content-Language header tells you which language(s) the requested page is provided in. The value is a two- or three-letter language code (such as 'fr' for French), sometimes followed by a country code (such as 'fr-CA' for French as spoken in Canada).

For example:

curl -I --silent http://bagergade-bogb.dk/ |grep -i "Content-Language"

Content-Language: da-DK,da-DK

In this example the webpage's language is Danish (Denmark).

In some cases a site offers content in multiple languages, but the Content-Language header only specifies one of them.

For example:

curl -I  --silent http://www.hotelrenania.it/ |grep -i "Content-Language"

Content-Language: it

In this example, the webpage offers three languages in the browser (Italian, English, and Dutch), yet it only states Italian as its Content-Language.

Note that the Content-Language header does not always match the language displayed in your browser, because the displayed language depends on the browser's language preference, which you can change.

For example:

curl -I --silent https://www.debian.org/ |grep -i "Content-Language"

Content-Language: en

This webpage offers its content in more than 37 different languages. Here my browser's language preference was set to Arabic, yet the Content-Language returned was English.

In addition, in most cases the Content-Language header is not included at all. From a random sample of 10,000 English websites in DMOZ, I found that only 5.09% have the Content-Language header.

For example:

curl -I --silent http://www.odu.edu |grep -i "Content-Language"


In this example we see that the Content-Language header was not found.
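
If you want to check this header from Python rather than curl, here is a minimal sketch using the requests library (the library choice is my assumption; any HTTP client will do):

import requests  # third-party HTTP client, assumed to be installed

response = requests.head("http://bagergade-bogb.dk/", allow_redirects=True)
# The header is missing on most sites, so supply a fallback value
print(response.headers.get("Content-Language", "not specified"))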

HTML Language:
Another indication of a webpage's language is the lang attribute on the HTML tag (such as <html lang='en'>…</html>). Using this method requires you to save the HTML source first and then search for the language attribute.

For example:

curl --silent http://ksu.edu.sa/ > ksu.txt

grep "<html lang=" ksu.txt

<html lang="ar" dir="rtl" class="no-js">

However, from a random sample of 10,000 English websites in the DMOZ directory, I found that only 48.6% have the HTML language attribute.
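
Instead of grepping the saved HTML, you could also read the lang attribute with Beautiful Soup; here is a minimal sketch (assuming the requests and beautifulsoup4 packages are installed):

import requests
from bs4 import BeautifulSoup

html = requests.get("http://ksu.edu.sa/").text
soup = BeautifulSoup(html, "html.parser")
html_tag = soup.find("html")
# Only about half of the pages carry the attribute, so this may return None
lang = html_tag.get("lang") if html_tag is not None else None
print(lang or "no language attribute found")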

Guess-Language:
One tool to detect language in Python is Guess-Language. It infers the language of Unicode text and supports over 60 languages. Two important notes: 1) the tool works better with more text, and 2) do not include HTML tags in the text or the result will be flawed. So if you want to check the language of a webpage, I recommend filtering out the tags with the Beautiful Soup package and then passing the text to the tool (a sketch combining these steps appears at the end of this section).

For example:

curl --silent http://abelian.org/tssp/|grep "title"|sed -e 's/<[^>]*>//g'

Tesla Secondary Simulation Project

python
from guess_language import guessLanguage
guessLanguage("Tesla Secondary Simulation Project")
'fr'
guessLanguage("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.")

'en'

This example checks the title language of a randomly selected English webpage from DMOZ, http://abelian.org/tssp/. The Guess-Language package detects the title's language as French, which is wrong; however, when we extract more text, the result is English. In order to determine the language of short texts you need to install PyEnchant and additional dictionaries; by default only three languages are supported: English, French, and Esperanto. You need to download a dictionary for any additional language you may need.
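
Putting these steps together, here is a minimal sketch that fetches a page, strips the markup with Beautiful Soup, and passes the remaining text to Guess-Language (requests and beautifulsoup4 are my assumptions; any fetching and tag-stripping approach works):

import requests
from bs4 import BeautifulSoup
from guess_language import guessLanguage

html = requests.get("http://abelian.org/tssp/").text
# get_text() drops the tags and keeps only the visible text
text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
print(guessLanguage(text))  # expected: 'en'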

Python-Language Detector (languageIdentifier):
Jeffrey Graves built a very lightweight tool in C++, based on language hashes and wrapped in Python, called Python-Language Detector. It is simple and effective, and it detects 58 languages.

For example:

python
import languageIdentifier
languageIdentifier.load("decultured-Python-Language-Detector-edc8cfd/trigrams/")
languageIdentifier.identify("Tesla Secondary Simulation Project",300,300)
'fr'
languageIdentifier.identify("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.",300,300)

'en'

Here we also notice that the length of the text affects the result. When the text was short, we falsely got French as the language; when we added more text from the webpage, the correct answer appeared.

In another example, we check the title of a Korean webpage, selected randomly from the DMOZ Korean directory.

For example:

curl --silent http://bada.ebn.co.kr/ | grep "title"|sed -e 's/<[^>]*>//g'

EBN 물류&조선 뉴스

python
import languageIdentifier
languageIdentifier.load("decultured-Python-Language-Detector-edc8cfd/trigrams/")
languageIdentifier.identify("EBN 물류&조선 뉴스",300,300)

'ko'

Here the correct answer, Korean, appeared even though the title contains some English letters.

LangID:
Another tool is LangID, which detects 97 different languages. Along with the detected language, it reports a confidence score for the prediction; the scores are re-normalized so that the output falls in the 0-1 range. This is one of my favorite language detection tools because it is fast, handles short texts, and gives you a confidence score.

For example:

python
import langid
langid.classify("Tesla Secondary Simulation Project")

('en', 0.9916567142572572)

python
import langid
langid.classify("Home Page An Internet collaboration exploring the physics of Tesla resonatorsThis project was started in May 2000 with the aim of exploring the fascinating physics of self-resonant single-layer solenoids as used for Tesla transformer secondaries.We maintain a precision software model of the solenoid which is based on our best theoretical understanding of the physics. This model provides a platform for theoretical work and virtual experiments.")

('en', 1.0)

Using the same text as above, this tool identified the short text correctly with a confidence score of 0.99, and when the full text was provided, the confidence score was 1.0.


For example:

python
import langid
langid.classify("السلام عليكم ورحمة الله وبركاته")

('ar', 0.9999999797315073)

Testing another language, in this case an Arabic greeting (roughly, "Peace be upon you, and the mercy of God and His blessings"), it returned Arabic with a 0.99 confidence score.
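
Because LangID reports a confidence score, a simple pattern is to accept a prediction only when the score clears some threshold; here is a minimal sketch (the 0.9 cutoff is an arbitrary choice of mine, not a recommendation from the LangID authors):

import langid

def detect_language(text, threshold=0.9):
    # classify() returns a (language code, confidence) pair, with confidence in the 0-1 range
    lang, confidence = langid.classify(text)
    if confidence < threshold:
        return None  # too uncertain to trust
    return lang

print(detect_language("Tesla Secondary Simulation Project"))  # 'en'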

Google Language Detection API:
The Google Language Detection API detects 160 different languages. I have tried this tool and I think it is one of the strongest tools available. It can be downloaded for different programming languages: Ruby, Java, Python, PHP, Crystal, and C#. To use it you have to obtain an API key after creating an account and signing up. A language test returns three outputs: isReliable (true or false), confidence (a rate), and language (a language code). The tool's website mentions that the confidence rate is not a range and can be higher than 100; no further explanation of how the score is calculated is given. The API allows 5,000 free requests a day (1 MB/day); if you need more than that, there are different paid plans you can sign up for. You can also detect text language in an online demo. I recommend this tool if you have a small dataset, but it takes some time to set up and figure out how it runs.

For example:

curl --silent http://moheet.com/ | grep "title"| sed -e 's/<[^>]*>//g' > moheet.txt

python
file1=open("moheet.txt","r")
import detectlanguage
detectlanguage.configuration.api_key="Your key"
detectlanguage.detect(file1)
[{'isReliable': True, 'confidence': 7.73, 'language': 'ar'}]

In this example, I extract the title text from an Arabic webpage listed in the DMOZ Arabic directory. The tool detected its language as Arabic, with isReliable set to True and a confidence of 7.73. Note that you have to remove the newlines from the text, otherwise the tool treats the input as a batch and returns a result for each line.
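
To avoid that batch behavior, you can collapse the file into a single string before calling the API; here is a minimal sketch (the file name and API key are placeholders):

import detectlanguage

detectlanguage.configuration.api_key = "Your key"  # placeholder

with open("moheet.txt") as f:
    # Collapse newlines and extra whitespace so the API sees one piece of text, not a batch
    text = " ".join(f.read().split())

print(detectlanguage.detect(text))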

In Conclusion:
Before you start looking for the right tool, you have to determine a few things first:
  • Are you trying to detect the language of a webpage or some text?
  • What is the length of the text? Usually more text gives more accurate results (see this article on the effect of short texts on language detection: http://lab.hypotheses.org/1083)
  • What language do you want to detect (if it is known or expected)? Certain tools only support certain languages
  • What programming language do you want to use?

Here is a short summary of the language detection methods I reviewed:

  • HTTP Content-Language header and HTML language attribute — Advantage: states the language directly. Disadvantage: not always present, and sometimes affected by browser settings.
  • Guess-Language — Advantage: fast and easy to use. Disadvantage: works better on longer text.
  • Python-Language Detector — Advantage: fast and easy to use. Disadvantage: works better on longer text.
  • LangID — Advantage: fast, gives you a confidence score, and works on both long and short text.
  • Google Language Detection API — Advantage: gives you a confidence score and works on both long and short text. Disadvantage: requires creating an account and setting up an API key.


--Lulwah M. Alkwai

Wednesday, February 24, 2016

2016-02-24: Acquisition of Mementos and Their Content Is More Challenging Than Expected

Recently, we conducted an experiment using mementos for almost 700,000 web pages from more than 20 web archives.  These web pages spanned much of the life of the web (1997-2012). Much has been written about acquiring and extracting text from live web pages, but we believe that this is an unparalleled attempt to acquire and extract text from mementos themselves. Our experiment is also distinct from AlNoamany's work and Andy Jackson's work, because we are trying to acquire and extract text from mementos across many web archives, rather than just one.

We initially expected the acquisition and text extraction of mementos to be a relatively simple exercise, but quickly discovered that the idiosyncrasies between web archives made these operations much more complex.  We document our findings in a technical report entitled:  "Rules of Acquisition for Mementos and Their Content".

Our technical report briefly covers the following key points:
  • Special techniques for acquiring mementos from the WebCite on-demand archive (http://www.webcitation.org)
  • Special techniques for dealing with JavaScript Redirects created by the Internet Archive
  • An alternative to BeautifulSoup for removing elements and extracting text from mementos
  • Stripping away archive-specific additions to memento content
  • An algorithm for dealing with inaccurate character encoding
  • Differences in whitespace treatment between archives for the same archived page
  • Control characters in HTML and their effect on DOM parsers
  • DOM-corruption in various HTML pages exacerbated by how the archives present the text stored within <noscript> elements
Rather than repeating the entire technical report here, we want to focus on the two issues that may have the greatest impact on others acquiring and experimenting with mementos: acquiring mementos from WebCite and inaccurate character encoding.

Acquisition of Content from WebCite


WebCite is an on-demand archive specializing in archiving web pages used as citations in scholarly work.  An example WebCite page is shown below.
For acquiring most memento content, we utilized the cURL data transfer tool.  With this tool, one merely types the following command to save the contents of the URI http://www.example.com:

curl -o outputfile.html http://www.example.com

For WebCite, the output from cURL for a given URI-M results in the same HTML frameset content, regardless of which URI-M is used.  We sought to acquire the actual content of a given page for text extraction, so merely utilizing cURL was insufficient.  An example of this HTML is shown below.


Instead of relying on cURL, we analyzed the resulting HTML frameset and determined that the content is actually returned by a request to the mainframe.php file.  Unfortunately, merely issuing a request to the mainframe.php file is insufficient because the cookies sent to the browser indicate which memento should be displayed. We developed custom PhantomJS code, presented as Listing 1 in the technical report, for overcoming this issue.  PhantomJS, because it must acquire, parse, and process the content of a page, is much slower than merely using cURL.

The requirement to utilize a web browser, rather than HTTP only, for the acquisition of web content is common for live web content, as detailed by Kelly and Brunelle, but we did not anticipate that we would need a browser simulation tool, such as PhantomJS, to acquire memento content.

In addition to the issue of acquiring mementos, we also discovered reliability problems with WebCite, seen in the figure below.  We would routinely need to reattempt downloads of the same URI-M in order to finally acquire its content.

Finally, we experienced rate limiting from WebCite, forcing us to divide our list of URI-Ms and download content from several source networks.

Because of these issues, the acquisition of almost 100,000 mementos from WebCite took more than a month to complete, compared to the acquisition of 1 million mementos from the Internet Archive in two weeks.

Inaccurate Character Encoding


Extracting text from documents requires that the text be decoded properly for processes such as text similarity or topic analysis.  For a subset of mementos, some archives do not present the correct character set in the HTTP Content-Type header.  Even though most websites now use the UTF-8 character set, a subset of our mementos comes from a time before UTF-8 was widely adopted, so proper decoding becomes an issue.

To address this issue, we developed a simple algorithm that attempts to detect and use the character encoding for a given document.

  1. Use the character set from the HTTP Content-Type header, if present; otherwise try UTF-8.
  2. If a character encoding is discovered in the file contents, as is common for XHTML documents, then try to use that; otherwise try UTF-8.
  3. If any of the character sets encountered raise an error, raise our own error.

We fall back to UTF-8 because it is an effective superset of many of the character sets for the mementos in our collection, such as ASCII. This algorithm worked for more than 99% of our dataset.
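
Here is a minimal Python sketch of this kind of fallback (a sketch under assumptions, not the exact code from the technical report); header_charset and declared_charset are assumed to have been parsed out of the HTTP Content-Type header and the document itself, and either may be None:

def decode_memento(raw_bytes, header_charset=None, declared_charset=None):
    # Try the HTTP header charset, then any charset declared in the document, then UTF-8
    for charset in (header_charset, declared_charset, "utf-8"):
        if not charset:
            continue
        try:
            return raw_bytes.decode(charset)
        except (LookupError, UnicodeDecodeError):
            continue
    # Mirror step 3: if every candidate fails, raise our own error
    raise ValueError("unable to decode memento with any candidate character set")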

In the future, we intend to explore the use of confidence-based tools, such as the chardet library, to guess the character set when extracting text.  Using such tools takes more time than merely reading the Content-Type header, but they are necessary when that header is unreliable and algorithms such as ours fail.
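
For reference, a confidence-based guess with chardet looks roughly like this (the file name and the 0.5 threshold are placeholders of mine):

import chardet

with open("memento.html", "rb") as f:  # hypothetical file name
    raw_bytes = f.read()

guess = chardet.detect(raw_bytes)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
if guess["encoding"] and guess["confidence"] > 0.5:
    text = raw_bytes.decode(guess["encoding"])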

Summary


We were able to overcome most of the memento acquisition and text extraction issues encountered in our experiment.  Because we were initially unaware of the problems we would encounter, we felt it would be useful to detail our solutions so that others can benefit from them in their own research and engineering.

--
Shawn M. Jones
PhD Student, Old Dominion University
Graduate Research Assistant, Los Alamos National Laboratory
- and -
Harihar Shankar
Research & Development Engineer, Los Alamos National Laboratory