Xerox Arabic Morphological Analysis and Generation
Romanization, Transcription and Transliteration
Kenneth R. Beesley
Copyright © The Document Company - Xerox 1997, 1998. All Rights Reserved.
In any discussion of Arabic Romanization, one is entering a field where there is little practical or terminological agreement. We will first define the terms Natural Language, Orthography, Romanization, Transcription and Transliteration as they are used in the Xerox Arabic project. In particular, we lead up to making a very clear distinction between what we call a Transcription and what we call a Transliteration. The reader is warned that these same terms may be used by other writers with different senses.
Natural Language :
Without getting too technical about it, a natural language (henceforth just "language") is the kind of communication system used by all normal human beings; we call some of these languages French, English, Spanish, Arabic, Afrikaans, Navajo, etc. Language is usually transmitted by sound, but large numbers of deaf humans, cut off from sound, use sign languages that are transmitted in the visual medium. Linguists have long accepted sign languages as fully viable and expressive natural languages.
All normal adult humans speak or sign at least one language, and language is so much a part of our humanity that the lack of language is a clear indication of a severe handicap or, in rare cases, a feral or abused childhood in which the person had little or no exposure to a language.
In contrast, there is nothing natural or essentially human about literacy,
the ability to read and write. One can be a completely normal human being
and be illiterate; and indeed illiteracy is still the norm in many places
in the world.
It is important to make a clear distinction between Language, possessed by all normal adult human beings, and Orthography.
An orthography is a learnable human technology consisting of 1) a set of characters and 2) conventions for using them to make language "visible". Prototypically these characters take the form of marks on paper, parchment, bark or some similar medium, but the notion is easily extended to notches cut in a stick, carvings in stone or metal, raised Braille dot patterns that are felt rather than seen, magnetized bits in a computer file, etc.
Many language communities currently have no culturally-accepted or "standard" orthography, but only because too few people have ever wanted to write them badly enough. A single language may have multiple orthographies in reasonably common use (at least three separate orthographies, all romanizations, have been proposed and used for Aymará; and Serbian and Croatian are essentially the same language, with two different orthographies). Orthographies can change, through evolution or cultural revolution: up until the 1920s Turkish was written using Arabic letters and conventions; since then it has been written in a Roman orthography. But the Turkish language is still the Turkish language, no matter what orthography is used. English speakers have used many orthographies, including a fairly standard Roman version (with regional variations), Pitman shorthand, Gregg shorthand, Shavian and dozens of other proposed orthographical reforms. A competent linguist can invent a new, viable orthography for any human language.
Many language communities adopt their standard orthography more or less by historical accident. English and most of the languages of Western Europe have a Roman orthography culturally associated with them because these areas were conquered by the Romans and later proselytized by the Roman Catholic Church. Polish (a Slavic language) uses a Roman orthography because it too was proselytized from the Roman Catholic side. But Slavic Russian and Bulgarian speakers traditionally use a Cyrillic orthography because historically they were proselytized from the Greek Orthodox side. Serbo-Croatian is for all practical purposes a single language, but the Serbs use a Cyrillic orthography while the Croats use a Roman one; the difference is again the result of which missionaries got there first.
In a similar fashion, the conquest of Islam carried both religion and, to a
lesser extent, the Arabic language; but where the local languages survived,
as in Persia, Turkey, Indonesia and India, the speakers often adopted an
orthography for their local language based on traditional Arabic
orthography.
The term Transcription, as used here, denotes an orthography devised and used by linguists to characterize the phonology or morphophonology of a language. Trained linguists often use the International Phonetic Alphabet to transcribe languages. The vast majority of Arabic "romanizations", including the Library of Congress system and the romanizations in the respected Hans Wehr dictionary, are transcriptions in the present sense. (The romanization used in the Wehr dictionary is the "official" transcription, based on the Deutsche Morgenländische Gesellschaft proposal, adopted by the International Convention of Orientalist Scholars in 1936 in Rome.)
The typical transcription of Arabic has as its purpose to convey the
pronunciation of Arabic words, usually to foreigners who are not
comfortable with traditional Arabic orthography. Given their previous
schooling in the orthographies used for their native languages, Western
Europeans are more comfortable with a Roman-based transcription; Russians
and Bulgarians would obviously prefer a Cyrillic transcription, etc. In
any case, traditional Arabic orthography includes silent letters,
superficially ambiguous letters like waaw and yaa', and usually an absence
of vowel signs and other diacritics necessary to convey the pronunciation
reliably. For all these reasons, it is useful and proper for linguists,
teachers and dictionary editors to devise and use whatever kinds of
transcription are suited for their ends. These transcriptions are possible
orthographies for Arabic, possible ways of making Arabic visible, but
because they use different character inventories and different conventions,
they are different from the standard Arabic orthography.
Transliterations, for present purposes, are orthographies which must be clearly distinguished from Transcriptions, as just defined. The purpose of a Transliteration (sometimes called a "strict transliteration" or "orthographical transliteration") is to write a language in its customary orthography, using the exact same orthographical conventions, but using carefully substituted orthographical symbols. Transliterations are appropriate when one wants to use the traditional orthography (with all its strengths and weaknesses, all its distinctions and ambiguities) but where writing or displaying or storing the original characters is impossible or inconvenient.
For an orthography to qualify as a transliteration, it must use the same orthographical conventions and a symbol set which has a one-to-one, fully reversible mapping with the symbol set of the original orthography. Symbols may include carefully defined ngrams, as shown below.
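The reversibility requirement can be checked mechanically. The following is a minimal Python sketch, not part of the Xerox system: the tables `TOY_TABLE` and `LOSSY_TABLE` and the helper `invert_table` are hypothetical names chosen for illustration, and the original symbols are represented by placeholder letter names rather than by real Arabic characters.

```python
def invert_table(table):
    """Return the inverse of a symbol table, or raise if it is not one-to-one."""
    inverse = {}
    for original, substitute in table.items():
        if substitute in inverse:
            raise ValueError(
                f"{substitute!r} stands for both "
                f"{inverse[substitute]!r} and {original!r}"
            )
        inverse[substitute] = original
    return inverse


# Hypothetical toy tables; the keys are placeholder names, not real characters.
TOY_TABLE = {"siin": "s", "Saad": "S", "daal": "d", "Daad": "D"}
LOSSY_TABLE = {"siin": "s", "Saad": "s"}  # two originals share one substitute

inverse = invert_table(TOY_TABLE)
text = ["Saad", "daal", "siin"]
encoded = [TOY_TABLE[symbol] for symbol in text]
decoded = [inverse[symbol] for symbol in encoded]
print(encoded, decoded == text)
```

A table such as `LOSSY_TABLE` is rejected by `invert_table`, because once two original symbols share a substitute the original text can no longer be recovered.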
This usage of the terms transliteration vs. transcription is consistent with that of Hans [Hanan] Wellisch in his book Transcription and Transliteration: an annotated bibliography on conversion of scripts, 1975, Silver Spring, Maryland: Institute of Modern Languages, where he writes:
The standard Arabic orthography is a clear case where writing, displaying and storing the original character shapes is often inconvenient for many people working with European-language text editors, email systems and networks; in these cases, there is often a practical need for a Roman transliteration that allows standard Arabic orthography to be represented faithfully using the available ASCII letters. Russians might devise a Cyrillic transliteration for the same reason.
Many linguists, including Arabists, have not learned to distinguish transcriptions from transliterations, and this leads to considerable confusion. The vast majority of Arabic romanizations are transcriptions, designed to convey the surface pronunciation or the deeper morphophonology of words; and many Arabists see no purpose in a romanization that doesn't serve this purpose. But in commercial Arabic natural-language processing systems, such as the Xerox Morphological Analyzer, where the input and output consist of written text in the traditional orthography, there is often a need for a true Roman transliteration. The Buckwalter Transliteration is used by the developers of the Xerox Arabic Morphology system when they need to communicate Arabic text, consistent with Arabic orthographical conventions but with substituted letter shapes, via common email and other media where it is inconvenient or impossible to display real Arabic script.
Inside the Xerox Arabic applets that display real Arabic script, strings
are stored as UNICODE characters.
The Banality of Transliteration :
Despite the difficulty of getting some Arabists to appreciate the legitimacy and practical value of transliterating Arabic orthography, the notion is inherent to encoded computer text for all languages. For example, a well-formed ASCII encoding of English text is nothing more or less than a transliteration of standard English orthography, using the same orthographical conventions, but carefully substituting integers (or bit patterns) for the set of graphic characters traditionally used to write English. By ASCII convention, the lowercase 'a' is substituted by 97, 'b' by 98, 'c' by 99, 'd' by 100, etc. with one character stored per byte.
When a printer or terminal is directed to display a file of ASCII codes as English text, the codes are converted to letter shapes from a font and are then displayed appropriately on paper or terminal screen by an English Rendering program.
    cat                          [display]
     ^
     |
    English Rendering Program
     ^
     |
    99, 97, 116                  [integers in a file]
The character distinctions of English orthography are reflected faithfully (and reversibly) in the ASCII encodings themselves, but some of the facts of English orthography are relegated to the Rendering Program, in particular the fact that English is rendered from left to right. Computer files have a beginning and an end, but they don't have any inherent left-to-right or right-to-left orientation; they're just sequences of byte values.
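The substitution of integers for characters can be observed directly in most programming languages. The sketch below uses Python's built-in `str.encode` and `bytes.decode`; the variable names are illustrative only.

```python
# Standard English orthography: the word "cat".
word = "cat"

# "Transliterate" each character into its ASCII integer code.
codes = list(word.encode("ascii"))
print(codes)  # [99, 97, 116]

# The mapping is fully reversible: the integers recover the original text.
restored = bytes(codes).decode("ascii")
print(restored)  # cat

# The stored sequence has an order (first to last) but no direction;
# left-to-right display is a decision made by the rendering program.
assert restored == word
```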
The banality of ASCII-transliterated English resides in the fact that it is unambiguously mappable to and from standard English orthography. For all practical purposes, ASCII-transliterated English texts are "the same thing" as traditionally typewritten or printed texts. This banality extends to all true transliterations: a transliteration of an Arabic orthographical text into ISO8859-6 or UNICODE characters is effectively the same as the original except that numbers are carefully substituted for the original characters. For exactly the same reason, a true transliteration of traditional Arabic orthography using Roman letters (or carefully defined ngrams) is again the same thing as traditional Arabic orthography except that the shape of each letter is different.
Of course, the rendering (on paper or computer screen) of Arabic script
from a UNICODE or ISO8859-6 file is somewhat more difficult than the
parallel rendering of encoded English. The Arabic encoding
systems properly employ only a single character encoding for shiin, one for
miim, one for daal, etc. and yet the bitmap fonts contain multiple "glyphs"
for each character, representing the isolated, initial, medial and final
shapes for rendering each character appropriately in context. An
Arabic-script rendering program, such as the Java applets that display
Arabic script in the Xerox Arabic Morphology System, must accept a string
of input codes (UNICODE in this case), compute which glyph is appropriate
in each context, and then display the appropriate glyphs right to left.
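The glyph-selection step can be sketched as follows. This is a deliberately simplified Python illustration and not the algorithm used in the Xerox applets: it assumes the input contains only base letters (no diacritics, hamza, spaces or ligatures), it treats just six letters as unable to connect to a following letter, and it only names the required form rather than drawing anything. The names `RIGHT_JOINING` and `contextual_forms` are hypothetical.

```python
# Letters that join to the preceding letter but never to the following one.
# Simplified list for illustration: alif, daal, dhaal, raa', zaay, waaw.
RIGHT_JOINING = {"\u0627", "\u062F", "\u0630", "\u0631", "\u0632", "\u0648"}


def contextual_forms(word):
    """Return 'isolated', 'initial', 'medial' or 'final' for each letter."""
    forms = []
    for i, letter in enumerate(word):
        # A letter connects backward only if the previous letter connects forward.
        joins_prev = i > 0 and word[i - 1] not in RIGHT_JOINING
        # A letter connects forward only if it can, and another letter follows.
        joins_next = i < len(word) - 1 and letter not in RIGHT_JOINING
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# kaaf + taa' + baa': every letter connects to its neighbours.
print(contextual_forms("\u0643\u062A\u0628"))  # ['initial', 'medial', 'final']
# baa' + alif + baa': alif does not connect forward, so the last baa' is isolated.
print(contextual_forms("\u0628\u0627\u0628"))  # ['initial', 'final', 'isolated']
```

The stored string contains one code per letter in reading order; the choice of glyph and the right-to-left placement are both left to the rendering step.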
Roman Transliterations of Arabic :
As pointed out above, most Arabic romanizations are transcriptions and are
not unambiguously mappable to and from traditional Arabic orthography; they
serve a purpose other than faithful communication of the facts of Arabic
orthography. Although it is possible to devise an unlimited number of
proper Roman transliterations for Arabic, few are genuinely in use, and
some orthographies identified as "transliterations" do not satisfy the
requirements as stated above.
Both transcription and transliteration have their uses, but the two
seldom resemble each other for Arabic. Because of unwritten vowels and
other diacritics, and because of ambiguous and silent letters, standard
Arabic orthography is a poor clue to pronunciation, especially for
non-Arabic speakers who can't reliably guess which reading of a word is
appropriate in syntactic context; conversely, a good phonological
transcription is often a poor clue to standard orthography. It's a serious
mistake to try to do Arabic transcription and transliteration at the same
time.
Where Many Attempts to Devise Transliterations Fail :
Even when limited, by entirely practical considerations, to using 7-bit ASCII characters in a transliteration, it is usually desirable for the system to be easily learnable, which means being at least reasonably legible in ASCII displays.
Thus almost everyone uses the obvious equivalents like s for siin, d for daal, z for zaay, t for taa', w for waaw, y for yaa', etc. In the Buckwalter Transliteration we use uppercasing to distinguish pharyngealized (aka "emphatic") consonants: S for Saad, D for Daad, T for Taa', Z for Zaa' (DHaa'). (The same convention was adopted quite independently by Terry Regier in his atex transliteration for specifying Arabic strings in LaTeX.) We use a for fatHa, i for kasra, u for Damma, o for sukuun, etc. Eventually we (and anyone else devising a 7-bit ASCII transliteration) must start grasping for motivated character substitutes, and we freely recognize that many equivalent and equally good Roman transliterations could be devised.
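The substitutions named above can be written as a lookup table. The following Python sketch covers only the fourteen symbols mentioned in this paragraph, paired with the corresponding UNICODE code points; it is not the complete Buckwalter table, and the names `ROMAN_TO_ARABIC`, `to_arabic` and `to_roman` are hypothetical.

```python
# Subset of the substitutions named in the text; the full table is larger.
ROMAN_TO_ARABIC = {
    "s": "\u0633",  # siin
    "d": "\u062F",  # daal
    "z": "\u0632",  # zaay
    "t": "\u062A",  # taa'
    "w": "\u0648",  # waaw
    "y": "\u064A",  # yaa'
    "S": "\u0635",  # Saad
    "D": "\u0636",  # Daad
    "T": "\u0637",  # Taa'
    "Z": "\u0638",  # Zaa'
    "a": "\u064E",  # fatHa
    "i": "\u0650",  # kasra
    "u": "\u064F",  # Damma
    "o": "\u0652",  # sukuun
}
ARABIC_TO_ROMAN = {arabic: roman for roman, arabic in ROMAN_TO_ARABIC.items()}

# One-to-one: inverting the table loses no entries.
assert len(ARABIC_TO_ROMAN) == len(ROMAN_TO_ARABIC)


def to_arabic(roman_text):
    """Replace each Roman symbol by its Arabic counterpart."""
    return "".join(ROMAN_TO_ARABIC[ch] for ch in roman_text)


def to_roman(arabic_text):
    """Replace each Arabic character by its Roman counterpart."""
    return "".join(ARABIC_TO_ROMAN[ch] for ch in arabic_text)


# Sample string built only from the symbols above.
sample = "Sawot"
assert to_roman(to_arabic(sample)) == sample
print(to_arabic(sample))
```

Because every symbol is a single character and the table is one-to-one, the conversion needs no lookahead in either direction.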
When the obvious letter substitutions run out, the usual course is to adopt Roman digraphs (or more generally ngraphs) to represent particular Arabic characters, but sloppy use of ngraphs disqualifies many orthographies from being transliterations. Classic blunders include using 'sh' for shiin, while at the same time using 's' for siin and h for haa'; and using th for unvoiced thaa' and dh for voiced dhaal. The use of such ngraphs creates ambiguities in the transliterated text that were not present in the standard orthography.
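The ambiguity is easy to demonstrate. In the hypothetical Python sketch below, Arabic letters are represented by their names, and the table `SLOPPY` reproduces the first blunder described above; two different letter sequences produce the same romanized string, so the original cannot be recovered from it.

```python
# Hypothetical romanization that uses the digraph "sh" alongside "s" and "h".
SLOPPY = {"siin": "s", "haa'": "h", "shiin": "sh"}


def romanize(letters):
    """Concatenate the Roman substitutes for a sequence of letter names."""
    return "".join(SLOPPY[name] for name in letters)


one_letter = romanize(["shiin"])          # a single Arabic letter
two_letters = romanize(["siin", "haa'"])  # two separate Arabic letters
print(one_letter, two_letters, one_letter == two_letters)  # sh sh True
```

Note that the table itself is one-to-one; the distinction is lost only when the substitutes are concatenated into running text.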
The use of ngraphs does not necessarily disqualify an orthography from being a faithful transliteration. If, for example, the exclamation mark has no independent role, then it could be used legitimately and unambiguously to distinguish siin, written just as "s", from Saad, written "s!", and daal, written "d", from Daad, written "d!", etc. Or digraphs can be bracketed, for example always writing "[sh]" for shiin, "[th]" for thaa' and "[dh]" for dhaal to preclude any possibility of confusing the digraphs with two separate characters. Nevertheless, bracketing is a nuisance, users can't always be expected to remember the conventions, and we found it safest to eschew ngraphs in the Buckwalter Transliteration.
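The bracketing convention can be sketched in Python as follows. The table `BRACKETED`, the pattern `TOKEN` and the two helper functions are hypothetical names for illustration, and letters are again represented by their names. The sketch assumes that the square brackets have no other role in the transliteration.

```python
import re

# Hypothetical table using bracketed digraphs, as suggested in the text.
BRACKETED = {
    "siin": "s",
    "haa'": "h",
    "taa'": "t",
    "shiin": "[sh]",
    "thaa'": "[th]",
}
DECODE = {roman: name for name, roman in BRACKETED.items()}

# A token is either a bracketed group or a single character outside brackets.
TOKEN = re.compile(r"\[[^\]]*\]|[^\[\]]")


def romanize(letters):
    """Concatenate the Roman substitutes for a sequence of letter names."""
    return "".join(BRACKETED[name] for name in letters)


def deromanize(text):
    """Split the text into tokens and map each back to its letter name."""
    return [DECODE[token] for token in TOKEN.findall(text)]


print(romanize(["shiin"]))         # [sh]
print(romanize(["siin", "haa'"]))  # sh
print(deromanize("[sh]sh"))        # ['shiin', 'siin', "haa'"]
```

The brackets make every token boundary explicit, which is what restores reversibility, at the cost of the extra typing noted above.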
Other ways in which most transcriptions of Arabic fail to qualify as transliterations:
As pointed out at the beginning of this file, a single language can have many orthographies defined for it, and Arabic could be very reasonably written for all purposes using any number of Roman- or Cyrillic-based transcriptions. Similarly, morphological analyzers, parsers and other computer systems could be written to accept and analyze text written in a Roman or Cyrillic transcription. However, given that existing Arabic text is (or should be) represented in encodings like ISO8859-6 and UNICODE, the requirement for a commercial natural-language-processing system to handle traditional Arabic orthography is every bit as real as the requirement to handle traditional French or English orthography.
Transcriptions will naturally be used in pedagogical situations, where
there is a genuine need, distinct from orthography, to convey phonetics,
phonology and morphophonology; whereas in Arabic natural-language
processing, there will more commonly be a need for strict orthographical
transliterations like the Buckwalter Transliteration.