How to use Diacritics for Romanized Indic Text on the WWW

This page is currently in the middle of being revised and updated. Please come back in a few days!


This page demonstrates how to use the diacritic characters commonly used to transliterate Sanskrit and other Indic languages in HTML 4.0 documents on the WWW.

Note: The text of this document contains only markup conforming to the HTML 4.0 specification. If your browser has problems displaying any part of it, then it is not fully HTML 4.0 compatible. However, you may need to install the IndicTimes font.

How to use Diacritics for Romanized Indic Text on the WWW

by Christopher J. Fynn

Limitations of font based character encoding

Indologists, Sanskritists and others have long wanted a standard way to encode, in electronic documents and word-processor files, the diacritic characters commonly used for the transliteration of Sanskrit and other Indic languages. Previously these characters were not found in any standard character set, so scholars had to resort to one of two workarounds. The first was an ASCII representation of the characters (e.g. the Kyoto Harvard Convention). The second was an "HTML" FONT FACE tag (which does not actually conform to the HTML standard) combined with ad-hoc conventions such as "Classical Sanskrit" (CS) and "Classical Sanskrit Extended" (CSX), previously the nearest thing that existed to an agreed encoding standard for Romanised Indian text. These conventions rely on fonts in which the glyphs normally found at a given position have been replaced by glyphs for the diacritic characters; in other words, fonts with a non-standard glyph encoding.

It has also often been difficult to exchange electronic documents with people who use a different type of computer system or a different character set, or even with those who simply do not have the same fonts installed. A non-standardised character encoding causes other problems too. Searching documents on the WWW for words containing such characters is difficult, since the same word may be encoded in different ways in different documents, and sorting, indexing and spell-checking applications designed to work with standard character sets produce incorrect results when they encounter these characters.

The Solution

Internationalization of HTML

HTML 4.0 was designed so that documents may be written unambiguously in every language and transported easily around the world. This was accomplished by incorporating RFC 2070, which deals with the internationalisation of HTML.

One important step was the adoption of the ISO/IEC 10646 Universal Multiple Octet Coded Character Set (UCS) standard as the basic document character set for HTML 4.0.

The ISO/IEC 10646 Standard

This is the world's most inclusive standard dealing with the representation of international characters, text direction, punctuation, and other world language issues. The ISO/IEC 10646 standard contains a set of characters which is inclusive of all the unique characters found in previous internationally recognised character encoding standards. With HTML version 4.0 the ISO/IEC 10646 character set has been adopted as the document character set.

Unicode

Unicode is a standard with a character repertoire identical to that of ISO/IEC 10646, but it additionally specifies a set of character properties and algorithms for handling these characters. Hence applications designed to work with the ISO/IEC 10646 character set are usually designed to conform to the Unicode Standard.

In the ISO 10646 / Unicode standards each unique character has a unique code point. With the greater support this provides for diverse human languages within an HTML document, more effective indexing of documents for search engines, higher-quality typography, better text-to-speech conversion, better hyphenation, etc. will be possible.

Within a few years, these features will also allow us to use complex Indic scripts such as Devanagari, Tamil and Tibetan in a standard way on the WWW. However, the practical use of these scripts in a standard way requires that the application (i.e. the browser) or rendering system handle complex context-sensitive glyph substitution and character substitution transparently. Such features have not yet been widely implemented across a broad range of computer operating systems and applications, let alone web browsers.

These complex rendering problems do not exist for the "Indic" diacritic characters, since each of these characters has a unique encoding in Unicode. All you need is to understand the proper way of representing or encoding these characters within HTML, and to have access to a simple text editor, an HTML 4.0 compliant Web browser such as Microsoft's Internet Explorer 4.x or Netscape's Navigator 4.7, and a font with the necessary characters properly encoded.
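
As a small illustration of the point that each of these letters is a single character with its own code point, the following Python sketch looks up the code point and official Unicode name of a few of them. The function name `describe` and the choice of sample letters are assumptions made for this example only; `unicodedata` is part of the Python standard library.

```python
import unicodedata

# A sample of the transliteration letters discussed on this page,
# written as escapes so the code points are explicit.
SAMPLE = "\u0100\u0101\u1E5A\u1E5B\u1E42\u1E43\u1E24\u1E25\u015A\u015B\u1E62\u1E63"

def describe(ch):
    """Return (code point as U+XXXX, official Unicode name) for one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

for ch in SAMPLE:
    code, name = describe(ch)
    print(ch, code, name)
```

Each letter prints as one character with one code point, which is why no context-sensitive glyph substitution is needed to display it.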

Font Issues

With the wide availability of multi-script computing over the next few years, many commercial Roman-script fonts will be updated to include a comprehensive repertoire of diacritic character glyphs, including those used for the transliteration of Sanskrit. Right now there are a few partial "Unicode" fonts, but these are not widely available and many do not yet contain all the diacritic characters used for the transliteration of Sanskrit.

The IndicTimes Font

So that people can start using Sanskrit diacritics with HTML 4, I have put together a Roman font, IndicTimes, containing glyph outlines for these diacritic characters along with those for the standard ISO Latin-1 character set. This font is available for download from this site and will work on systems using Windows 95 and NT 4 and above. (Macs using System 8.5+ are supposed to support TrueType and OpenType "data fork" fonts created for PCs, so this font should also work on those systems, but I have not tested this.)

N.B.: This font, which I have called IndicTimes, is based on URW's NimbusRomNo9L font, which they have kindly made available under the terms of the GNU General Public License (GPL). The chief provisions of the GPL are that software licensed under it may be freely redistributed provided the author's copyright is properly acknowledged. As permitted under Section 2 of the GPL, I have modified the fonts to include the extra diacritic characters with their ISO 10646 code points and converted the outlines to TrueType format. The modified fonts, like the originals, are distributed under the GPL. However, the copyright of the font remains with URW++ Design and Development Incorporated, Poppenbuetteler Bogen 29A, D-22399 Hamburg, Germany.

Character Representation

A simple one-byte-per-character encoding is not sufficient for text drawn from a character repertoire as large as ISO 10646. There are several ways to represent or encode ISO 10646 characters. To maintain backward compatibility with older browsers, and to allow documents containing multibyte characters to be transmitted safely over the Internet (which was designed only for 7-bit or 8-bit character encodings), HTML 4.0 lets multibyte characters either be referenced by their numeric code point in the ISO 10646 / Unicode standard or be encoded in UTF-8, a transformation format that represents multibyte characters as a series of single (8-bit) bytes.

   

Note: To view the rest of this page properly you should be using a browser that supports HTML 4 and, if you are not using Internet Explorer 4.0 or later, you should first download the IndicTimes font and install it on your system. If you don't have this font installed, any other font with the necessary characters properly encoded should work - provided it is set up as the default font for your browser. Internet Explorer should automatically display an embedded version of this font created using Microsoft's free Web Embedding Font Tool (WEFT).

If you understand the basics of HTML and want to see how the characters on this page are encoded, you will probably want to look at it with your browser's "View Page Source" feature or in a plain text editor. Caution: Simply opening this page in some HTML editors will destroy the character encoding; others may convert the decimal character references to UTF-8.


1. Decimal Character References:

The syntax for this is "&#D;", where D is a decimal number; it refers to the character whose Unicode decimal code point is D.

Ā ā Ī ī Ṛ ṛ Ṝ ṝ Ḷ ḷ Ḹ ḹ Ṃ ṃ Ḥ ḥ Ñ ñ Ṭ ṭ Ḍ ḍ Ṅ ṅ Ṇ ṇ Ś ś Ṣ ṣ

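  Ref.      Char.     Ref.      Char.
  &#256;    Ā         &#257;    ā
  &#298;    Ī         &#299;    ī
  &#362;    Ū         &#363;    ū
  &#274;    Ē         &#275;    ē
  &#332;    Ō         &#333;    ō
  &#7770;   Ṛ         &#7771;   ṛ
  &#7772;   Ṝ         &#7773;   ṝ
  &#7734;   Ḷ         &#7735;   ḷ
  &#7736;   Ḹ         &#7737;   ḹ
  &#7746;   Ṃ         &#7747;   ṃ
  &#7716;   Ḥ         &#7717;   ḥ
  &#209;    Ñ         &#241;    ñ
  &#7788;   Ṭ         &#7789;   ṭ
  &#7692;   Ḍ         &#7693;   ḍ
  &#7748;   Ṅ         &#7749;   ṅ
  &#7750;   Ṇ         &#7751;   ṇ
  &#346;    Ś         &#347;    ś
  &#7778;   Ṣ         &#7779;   ṣ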
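
For longer passages, writing decimal character references by hand is tedious. The following is a minimal Python sketch, not part of the original page, that converts every non-ASCII character in a string to its decimal reference. The function name `to_decimal_refs` is an assumption for this example; the `xmlcharrefreplace` error handler is a standard feature of Python's codecs.

```python
def to_decimal_refs(text):
    """Replace each non-ASCII character with a decimal reference of the form &#D;."""
    # Encoding to ASCII with "xmlcharrefreplace" substitutes &#D; for
    # every character that ASCII cannot represent.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# U+1E43 is m with dot below, U+1E5B is r with dot below.
print(to_decimal_refs("sa\u1E43sk\u1E5Bta"))
```

The output contains only 7-bit ASCII, so it survives editors and transports that would damage multibyte text.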

2. Hexadecimal Character References:

The syntax for this is "&#xH;" or "&#XH;", where H is a hexadecimal number; it refers to the character whose Unicode hexadecimal code point is H. Hexadecimal numbers in numeric character references are supposed to be case-insensitive. See: HTML 4 Spec: 5.3.1 Numeric character references

At the time this page was created (June, 1998) neither Netscape Navigator 4 nor Microsoft Internet Explorer 4 supported hexadecimal character representation. Therefore the samples in this section have been commented out since they will not display properly in either of these browsers.

Note. Although the hexadecimal representation is not defined in [ISO8879], it is expected to be in the revision, as described in [WEBSGML]. This convention would be particularly useful if it worked since encoding standards documents generally use hexadecimal representations of character encoding values.
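
Where browser support is in doubt, hexadecimal references can be checked outside the browser. The sketch below, offered as an illustration only, uses `html.unescape` from the Python standard library to decode references and a small helper to produce them; the names `to_hex_refs` and `decode_refs` are assumptions for this example.

```python
import html

def to_hex_refs(text):
    """Replace each non-ASCII character with a hexadecimal reference &#xH;."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)

def decode_refs(markup):
    """Decode decimal and hexadecimal character references back to characters."""
    return html.unescape(markup)

encoded = to_hex_refs("k\u1E5B\u1E63\u1E47a")
print(encoded)

# Both spellings of the prefix, and both letter cases, name the same character.
print(decode_refs("&#x1E5B;") == decode_refs("&#X1e5b;") == decode_refs("&#7771;"))
```

The hexadecimal digits match the code points printed in the Unicode charts, which makes this form easier to check against the standard than the decimal one.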


3. UTF-8 Character references:

The 2 byte (UCS-2) encoding of ISO 10646 is identical to the Unicode standard.

The UTF-8 encoding of ISO 10646/Unicode allows standard 8-bit systems, ASCII text editors, older browsers and so on to continue to work with HTML 4.0 and other text files containing encoded multibyte characters.

With UTF-8 encoding, ASCII text (0x0..0x7F) continues to appear without any changes, and all characters from 0x80 to 0x7FFFFFFF are encoded as a series of two to six bytes. If the most significant bit of the first byte is 0, the remaining seven bits are interpreted as an ASCII character. Otherwise, the number of leading 1 bits in the first byte gives the total number of (8-bit) bytes in the sequence, including the first byte itself. There is always a 0 bit between the count bits and any data.

First byte could be one of the following. The X indicates bits available to encode the character.

  0XXXXXXX  only one byte        0..0x7F (ASCII)
  110XXXXX  two bytes            Maximum character value is 0x7FF 
  1110XXXX  three bytes          Maximum character value is 0xFFFF
  11110XXX  four bytes           Maximum character value is 0x1FFFFF
  111110XX  five bytes           Maximum character value is 0x3FFFFFF
  1111110X  six bytes            Maximum character value is 0x7FFFFFFF

All following bytes have this format: 10XXXXXX

A two-byte example: the encoding position for N tilde (Ñ) is 209 in both ISO Latin-1 (ISO 8859-1) and ISO 10646. In hexadecimal, it is 0xD1. In HTML, it is &#209; or &Ntilde;. In UTF-8 it has the following two-byte encoding: 0xC3, 0x91.
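
The bit layout described above can be sketched in Python as follows. This is an illustration written for this page, not a reference implementation: the function name `utf8_encode` is an assumption, and the sketch follows the original scheme of up to six bytes, whereas current UTF-8 (RFC 3629) stops at four bytes and U+10FFFF. In practice Python's built-in `str.encode("utf-8")` does this work.

```python
def utf8_encode(code_point):
    """Encode one code point using the first-byte table (up to six bytes)."""
    if code_point < 0:
        raise ValueError("code point must not be negative")
    if code_point < 0x80:
        return bytes([code_point])  # 0XXXXXXX: plain ASCII
    # (maximum value, total bytes, marker bits for the first byte)
    layouts = (
        (0x7FF, 2, 0xC0), (0xFFFF, 3, 0xE0), (0x1FFFFF, 4, 0xF0),
        (0x3FFFFFF, 5, 0xF8), (0x7FFFFFFF, 6, 0xFC),
    )
    for limit, length, marker in layouts:
        if code_point <= limit:
            break
    else:
        raise ValueError("code point out of range")
    out = []
    # Build the 10XXXXXX continuation bytes from the low six bits upwards.
    for _ in range(length - 1):
        out.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    out.append(marker | code_point)  # remaining bits go in the first byte
    return bytes(reversed(out))

# N tilde, code point 209 (0xD1), as in the two-byte example.
print([hex(b) for b in utf8_encode(0xD1)])
```

Comparing the result with `chr(0xD1).encode("utf-8")` is a quick way to confirm that the hand-built bytes agree with the built-in encoder.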

For more information on UTF-8 see: ISO/IEC JTC1/SC2/WG2 N 1036.

For background information which led to the development of UTF-8, see the proposal that describes the File System Safe UTF (FSS-UTF).

A full set of UTF-8 Test Pages can be found at: http://titus.uni-frankfurt.de/unicode/unitest.htm

Ā ā Ī ī Ṛ ṛ Ṝ ṝ Ḷ ḷ Ḹ ḹ Ṃ ṃ Ḥ ḥ Ñ ñ Ṭ ṭ Ḍ ḍ Ṅ ṅ Ṇ ṇ Ś ś Ṣ ṣ

  Char.  UTF-8 Hex       Dec          Char.  UTF-8 Hex       Dec
  Ā      0xC4 0x80       196 128      ā      0xC4 0x81       196 129
  Ī      0xC4 0xAA       196 170      ī      0xC4 0xAB       196 171
  Ū      0xC5 0xAA       197 170      ū      0xC5 0xAB       197 171
  Ē      0xC4 0x92       196 146      ē      0xC4 0x93       196 147
  Ō      0xC5 0x8C       197 140      ō      0xC5 0x8D       197 141
  Ṛ      0xE1 0xB9 0x9A  225 185 154  ṛ      0xE1 0xB9 0x9B  225 185 155
  Ṝ      0xE1 0xB9 0x9C  225 185 156  ṝ      0xE1 0xB9 0x9D  225 185 157
  Ḷ      0xE1 0xB8 0xB6  225 184 182  ḷ      0xE1 0xB8 0xB7  225 184 183
  Ḹ      0xE1 0xB8 0xB8  225 184 184  ḹ      0xE1 0xB8 0xB9  225 184 185
  Ṃ      0xE1 0xB9 0x82  225 185 130  ṃ      0xE1 0xB9 0x83  225 185 131
  Ḥ      0xE1 0xB8 0xA4  225 184 164  ḥ      0xE1 0xB8 0xA5  225 184 165
  Ñ      0xC3 0x91       195 145      ñ      0xC3 0xB1       195 177
  Ṭ      0xE1 0xB9 0xAC  225 185 172  ṭ      0xE1 0xB9 0xAD  225 185 173
  Ḍ      0xE1 0xB8 0x8C  225 184 140  ḍ      0xE1 0xB8 0x8D  225 184 141
  Ṅ      0xE1 0xB9 0x84  225 185 132  ṅ      0xE1 0xB9 0x85  225 185 133
  Ṇ      0xE1 0xB9 0x86  225 185 134  ṇ      0xE1 0xB9 0x87  225 185 135
  Ś      0xC5 0x9A       197 154      ś      0xC5 0x9B       197 155
  Ṣ      0xE1 0xB9 0xA2  225 185 162  ṣ      0xE1 0xB9 0xA3  225 185 163
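
Byte values like those tabulated above need not be worked out by hand. As a rough sketch, the following Python helper prints the UTF-8 bytes of a character in both hexadecimal and decimal; the name `utf8_row` is an assumption for this example.

```python
def utf8_row(ch):
    """Return the UTF-8 bytes of one character as (hex string, decimal string)."""
    data = ch.encode("utf-8")
    hex_bytes = " ".join(f"0x{b:02X}" for b in data)
    dec_bytes = " ".join(str(b) for b in data)
    return hex_bytes, dec_bytes

# U+0100 (A with macron) and U+1E5A (R with dot below).
for ch in "\u0100\u1E5A":
    print(ch, *utf8_row(ch))
```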

Remarks:

Other than encoding these characters in HTML 4.0 as described above, the main thing you have to do in order to use these characters in your WWW pages is to understand HTML 4.0 syntax and tags, and to include the line: <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in the HEAD section of the HTML file. It is also recommended that you learn a little about Cascading Style Sheets and the STYLE element and CLASS attribute used in HTML 4.0.
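
The declaration only helps if the bytes of the file really are UTF-8. The following Python sketch, given as an assumption-laden illustration rather than a recommended tool, builds a minimal page containing the meta line quoted above and returns it as UTF-8 bytes. The function name `build_page` and the sample text are invented for this example.

```python
import html

def build_page(title, body_text):
    """Return a minimal HTML 4.0 page as UTF-8 bytes with a matching charset line."""
    page = (
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">\n'
        "<html>\n<head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"<p>{html.escape(body_text)}</p>\n"
        "</body>\n</html>\n"
    )
    # The declared charset and the actual byte encoding must agree.
    return page.encode("utf-8")

data = build_page("Sample", "sa\u1E43sk\u1E5Bta")
print(len(data), "bytes")
```

Writing `data` to a file in binary mode keeps the bytes exactly as produced, so the file on disk matches its own charset declaration.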

Unicode-compatible fonts (including IndicTimes) with glyph outlines for these characters work in Word 97 and other applications which can save files in "Universal Alphabet" / UTF-8 format.

You may feel that it is a little premature to use this method of encoding and displaying "Indic" Roman diacritic characters on your web site, since some visitors will not yet be using HTML 4.0 compliant web browsers. However, if you are creating web pages which you don't want to have to modify at a later date, web pages where you need to use a standard and unambiguous encoding, or web pages that may be accessed by clients running on a diverse range of computer operating systems, then you should seriously consider marking up and encoding your web pages in accordance with the international features of HTML 4.0 outlined above.

Of Related Interest...

(updated on 13 April, 1999)

 

This page maintained by: chris_fynn@hotmail.com.
Material Copyright © 2000 Christopher J. Fynn