How to use Diacritics for Romanized Indic Text on the WWW
by Christopher J. Fynn
Limitations of font based character encoding

Indologists, Sanskritists and others have long wanted a standard way of encoding, in electronic documents and word-processor files, the diacritic characters commonly used for the transliteration of Sanskrit and other Indic languages. Previously these characters were not found in any standard character set, so scholars had to resort to one of two workarounds. The first was an ASCII representation of these characters (e.g. the Kyoto Harvard Convention). The second was the "HTML" FONT FACE tag (which is not actually conformant to the HTML standard) combined with ad-hoc conventions such as "Classical Sanskrit" (CS) and "Classical Sanskrit Extended" (CSX), previously the nearest thing that existed to an agreed encoding standard for Romanised Indian text. These conventions rely on fonts in which the glyphs normally found at a given position have been replaced by glyphs for the diacritic characters; in other words, fonts with a non-standard glyph encoding.

It has also often been difficult to exchange electronic documents with people using a different type of computer system or a different character set, and even with those who simply do not have the same fonts installed. Non-standardised character encoding causes other problems too. Searching documents on the WWW for words containing such characters is unreliable, since the same word may be encoded in different ways in different documents. Sorting, indexing and spell-check applications designed to work with standard character sets also produce incorrect results when they encounter these characters.
The Solution

Internationalization of HTML

HTML 4.0 was designed so that documents can be written unambiguously in every language and transported easily around the world. This was accomplished by incorporating RFC 2070, which deals with the internationalisation of HTML. One important step was the adoption of the ISO/IEC 10646 Universal Multiple-Octet Coded Character Set (UCS) standard as the basic document character set for HTML 4.0.

The ISO/IEC 10646 Standard

This is the world's most inclusive standard dealing with the representation of international characters, text direction, punctuation and other world language issues. Its character set includes all the unique characters found in previous internationally recognised character encoding standards. With HTML version 4.0 the ISO/IEC 10646 character set has been adopted as the document character set.

Unicode

Unicode is a standard with a character repertoire identical to that of ISO/IEC 10646, but which additionally specifies a set of character properties and algorithms for handling these characters. Hence applications designed to work with the ISO/IEC 10646 character set are usually designed to conform to the Unicode Standard. In the ISO 10646 / Unicode standards each unique character has a unique code point.

The greater support this provides for diverse human languages within an HTML document will make possible more effective indexing of documents for search engines, higher-quality typography, better text-to-speech conversion, better hyphenation and so on. Within a few years, these features will also allow us to use complex Indic scripts such as Devanagari, Tamil and Tibetan in a standard way on the WWW. However, the practical use of these scripts in a standard way requires that the application (i.e. the browser) or rendering system handle complex context-sensitive glyph substitution and character substitution transparently. Such features have not yet been widely implemented across a broad range of computer operating systems and applications, let alone web browsers.

These complex rendering problems do not exist for the "Indic" diacritic characters, since each of these characters has a unique encoding in Unicode. All you need is to understand the proper way of representing or encoding these characters within HTML, and to have a simple text editor, an HTML 4.0 compliant Web browser such as Microsoft's Internet Explorer 4.x or Netscape's Navigator 4.7, and a font with the necessary characters properly encoded.
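As a small illustration of the point that each transliteration letter has its own code point, the following Python sketch lists the code point and official Unicode name for a handful of these letters. The helper name `describe` and the choice of sample letters are illustrative assumptions; only the standard `unicodedata` module is used.

```python
import unicodedata

# A few transliteration letters, written as escapes so the source stays ASCII:
# a-macron, i-macron, r-dot-below, s-dot-below, s-acute.
SAMPLE = "\u0101\u012b\u1e5b\u1e63\u015b"

def describe(text):
    """Return (character, 'U+XXXX' code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Print only the ASCII parts so the demo works on any console encoding.
for _ch, code_point, name in describe(SAMPLE):
    print(code_point, name)
```

Because every letter is a single code point, no context-sensitive glyph substitution is needed to display it; a font only has to contain a glyph at that code point.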
Font Issues

With the wide availability of multi-script computing over the next few years, many commercial Roman-script fonts will be updated to include a comprehensive repertoire of diacritic character glyphs, including those used for the transliteration of Sanskrit. Right now there are a few partial "Unicode" fonts, but these are not widely available and many do not yet contain all the diacritic characters used for the transliteration of Sanskrit.

The IndicTimes Font

So that people can start using Sanskrit diacritics with HTML 4, I have put together a Roman font, IndicTimes, containing glyph outlines for these diacritic characters along with those for the standard ISO Latin-1 character set. This font is available for download from this site and will work on systems using Windows 95 and NT 4 and above. (Macs running System 8.5 or later are supposed to support TrueType and OpenType "data fork" fonts created for PCs, so this font should also work on those systems, but I have not tested this.)

N.B.: This font, which I have called IndicTimes, is based on URW's NimbusRomNo9L font, which they have kindly made available under the terms of the GNU General Public License (GPL). The chief provisions of the GPL are that software licensed under it may be freely redistributed provided the author's copyright is properly acknowledged. As permitted under Section 2 of the GPL, I have modified the fonts to include the extra diacritic characters with their ISO 10646 code points and converted the outlines to TrueType format. The modified fonts, like the originals, are distributed under the GPL. However, the copyright of the font remains with URW++ Design and Development Incorporated, Poppenbuetteler Bogen 29A, D-22399 Hamburg, Germany.
Character Representation

A simple one-byte-per-character encoding technique is not sufficient for text strings over a character repertoire as large as ISO 10646, and there are several ways to represent or encode ISO 10646 characters. The Internet was designed only for 7-bit or 8-bit character encodings, so to maintain backward compatibility with older browsers and to allow documents containing multibyte characters to be transmitted safely, HTML 4.0 handles multibyte characters in one of two ways. They are either referenced by their numeric code point in the ISO 10646 / Unicode standard, or encoded using the UTF-8 transformation format, which represents each multibyte character as a series of single (8-bit) bytes.
Note: To view the rest of this page properly you should be using a browser that supports HTML 4 and, if you are not using Internet Explorer 4.0 or later, you should first download the IndicTimes font and install it on your system. If you don't have this font installed, any other font with the necessary characters properly encoded should work, provided it is set up as the default font for your browser. Internet Explorer should automatically display an embedded version of this font created using Microsoft's free Web Embedding Font Tool (WEFT).

If you understand the basics of HTML and want to see how the characters on this page are encoded, you will probably want to look at it with your browser's "View Page Source" feature or in a plain text editor.

Caution: Simply opening this page in some HTML editors will destroy the character encoding; others may convert the decimal character references to UTF-8.
1. Decimal Character References:

The syntax for this is "&#D;", where D is a decimal number. The reference stands for the character whose Unicode code point, written in decimal, is D.

Ā ā Ī ī Ṛ ṛ Ṝ ṝ Ḷ ḷ Ḹ ḹ Ṃ ṃ Ḥ ḥ Ñ ñ Ṭ ṭ Ḍ ḍ Ṅ ṅ Ṇ ṇ Ś ś Ṣ ṣ
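A minimal Python sketch of how such references can be generated: every non-ASCII character is replaced by "&#D;" with D taken from its code point. The function name `to_decimal_references` is an illustrative assumption. The sketch does not escape markup characters such as "&" or "<", which a real page would also need.

```python
import html

def to_decimal_references(text):
    """Replace each non-ASCII character with a decimal reference such as &#257;."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

word = "K\u1e5b\u1e63\u1e47a"  # "Krsna" with dots below r, s and n
encoded = to_decimal_references(word)
print(encoded)                         # K&#7771;&#7779;&#7751;a
print(html.unescape(encoded) == word)  # True: the references decode back to the word
```

Python's built-in `word.encode("ascii", "xmlcharrefreplace")` produces the same references as bytes, so the hand-written loop is only there to make the rule visible.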
2. Hexadecimal Character References:

The syntax for this is "&#xH;" or "&#XH;", where H is a hexadecimal number. The reference stands for the character whose Unicode code point, written in hexadecimal, is H. Hexadecimal numbers in numeric character references are supposed to be case-insensitive. See: HTML 4 Spec: 5.3.1 Numeric character references.

At the time this page was created (June, 1998) neither Netscape Navigator 4 nor Microsoft Internet Explorer 4 supported hexadecimal character references. The samples in this section have therefore been commented out, since they will not display properly in either of these browsers.

Note. Although the hexadecimal representation is not defined in [ISO8879], it is expected to be in the revision, as described in [WEBSGML]. This convention would be particularly useful if it worked, since encoding standards documents generally use hexadecimal representations of character encoding values.
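For comparison, here is a hedged Python sketch of the hexadecimal form. The function name `to_hex_references` is an illustrative assumption; `html.unescape` from the standard library is used only to check that both the "&#x" and "&#X" spellings, and both letter cases of the digits, decode to the same character.

```python
import html

def to_hex_references(text):
    """Replace each non-ASCII character with a hexadecimal reference such as &#x101;."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)

word = "\u015a\u0101nti"  # "Santi" with S-acute and a-macron
print(to_hex_references(word))  # &#x15A;&#x101;nti

# The prefix and the hexadecimal digits are case-insensitive.
print(html.unescape("&#x15a;") == html.unescape("&#X15A;") == "\u015a")  # True
```

The hexadecimal digits match the values printed in Unicode code charts (U+015A, U+0101), which is what makes this form convenient when working from the standards documents.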
3. UTF-8 Character references:

The two-byte (UCS-2) encoding of ISO 10646 is identical to the encoding used by the Unicode standard. The UTF-8 encoding of ISO 10646 / Unicode allows standard 8-bit systems, ASCII text editors, older browsers and so on to continue to work with HTML 4.0 and other text files containing encoded multibyte characters.

With UTF-8 encoding, ASCII text (0x0..0x7F) continues to appear without any changes, while characters from 0x80 to 0x7FFFFFFF are encoded as a sequence of two to six bytes. If the most significant bit of the first byte is 0, the remaining seven bits are interpreted as an ASCII character. Otherwise, the number of leading 1 bits gives the total number of (8-bit) bytes in the sequence, including the first. There is always a 0 bit between the count bits and any data.

The first byte can be one of the following. The X indicates bits available to encode the character.

First byte   Sequence length   Character values
0XXXXXXX     one byte          0..0x7F (ASCII)
110XXXXX     two bytes         maximum character value is 0x7FF
1110XXXX     three bytes       maximum character value is 0xFFFF
11110XXX     four bytes        maximum character value is 0x1FFFFF
111110XX     five bytes        maximum character value is 0x3FFFFFF
1111110X     six bytes         maximum character value is 0x7FFFFFFF

All following bytes have this format: 10XXXXXX

A two-byte example: the encoding position for N tilde (Ñ) is 209 in both ISO Latin-1 (ISO 8859-1) and ISO 10646. In hexadecimal, it is 0xD1. In HTML, it is &#209; or &Ntilde;. In UTF-8 it has the following two-byte encoding: 0xC3, 0x91.

For more information on UTF-8 see: ISO/IEC JTC1/SC2/WG2 N 1036. For background information which led to the development of UTF-8, see the proposal that describes the File System Safe UTF (FSS-UTF). A full set of UTF-8 Test Pages can be found at: http://titus.uni-frankfurt.de/unicode/unitest.htm

Ā ā Ī ī Ṛ ṛ Ṝ ṝ Ḷ ḷ Ḹ ḹ Ṃ ṃ Ḥ ḥ Ñ ñ Ṭ ṭ Ḍ ḍ Ṅ ṅ Ṇ ṇ Ś ś Ṣ ṣ
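The first-byte patterns listed above can be turned into a short encoder. The Python sketch below is illustrative only: the function name `utf8_encode` is an assumption, and it stops at four-byte sequences because Unicode code points end at U+10FFFF, so the five- and six-byte forms are never needed for Unicode text. In practice `str.encode("utf-8")` does this work.

```python
def utf8_encode(code_point):
    """Encode one code point as UTF-8 bytes using the lead-byte patterns above."""
    if code_point < 0x80:            # 0XXXXXXX
        return bytes([code_point])
    if code_point < 0x800:           # 110XXXXX 10XXXXXX
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 1110XXXX 10XXXXXX 10XXXXXX
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x110000:        # 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point outside the Unicode range")

print(utf8_encode(0xD1).hex())    # c391   (N tilde, the two-byte example)
print(utf8_encode(0x1E5B).hex())  # e1b99b (r with dot below, three bytes)
```

Each continuation byte carries six data bits behind the fixed "10" prefix, which is why the code shifts by multiples of six and masks with 0x3F.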
Remarks:

Other than encoding these characters in HTML 4.0 as described above, the main thing you have to do in order to use these characters in your WWW pages is to understand HTML 4.0 syntax and tags and to include the line: <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in the header of the HTML file. It is also recommended that you learn a little about Cascading Style Sheets and the STYLE element and CLASS attribute used in HTML 4.0.

Unicode compatible fonts (including IndicTimes) with glyph outlines for these characters work in Word '97 and other applications which can save files in "Universal Alphabet" / UTF-8 format.

You may feel that it is a little premature to use this method of encoding and displaying "Indic" Roman diacritic characters on your web site, since some visitors will not yet be using HTML 4.0 compliant web browsers. However, you should seriously consider marking up and encoding your web pages in accordance with the international features of HTML 4.0 outlined above if you are creating web pages which you don't want to have to modify at a later date; web pages where you need to use a standard and unambiguous encoding; or web pages that may be accessed by clients running on a diverse range of computer operating systems.
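To show how the meta line and the UTF-8 encoding fit together, here is a minimal Python sketch that builds a small page and saves it as UTF-8. The function names `build_page` and `write_page`, the sample text and the choice of the HTML 4.01 Strict doctype are illustrative assumptions, not part of the method described above.

```python
import html
from pathlib import Path

# The header line quoted in the remarks above.
META_LINE = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'

def build_page(title, paragraph):
    """Return a small HTML document as a string, with markup characters escaped."""
    return "\n".join([
        '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd">',
        "<html>",
        "<head>",
        META_LINE,
        f"<title>{html.escape(title)}</title>",
        "</head>",
        "<body>",
        f"<p>{html.escape(paragraph)}</p>",
        "</body>",
        "</html>",
        "",
    ])

def write_page(path, title, paragraph):
    """Save the page as UTF-8 so the bytes on disk match the declared charset."""
    Path(path).write_text(build_page(title, paragraph), encoding="utf-8")

page = build_page("Sample", "K\u1e5b\u1e63\u1e47a")
data = page.encode("utf-8")
print(b"\xe1\xb9\x9b" in data)  # True: r-dot-below is stored as three UTF-8 bytes
```

The important design point is that the declared charset and the encoding actually used to save the file must agree; declaring utf-8 while saving in another encoding produces the kind of garbled output this page warns about.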
Of Related Interest... (updated on 13 April, 1999)
This page maintained by: chris_fynn@hotmail.com. Material Copyright © 2000 Christopher J. Fynn