This guide discusses the structure and role of coded character sets used in IT systems and for information interchange between IT systems. The following subjects are covered:
A coded character set defines specific bit representations for a specific character repertoire. The number and appearance of the bit combinations themselves are identical in all coded character sets based on a given number of bits. Starting with 5 bits per character (e.g. the TTY [teletype] code), the character sets extended first to 6 bits (e.g. the IBM BCD [Binary Coded Decimal] code), then 7 bits (e.g. the basic US ASCII [American Standard Code for Information Interchange]) and 8 bits (e.g. the IBM EBCDIC [Extended Binary Coded Decimal Interchange Code] and the ISO/IEC 8859 series). Encoding schemes that are based on using one octet (eight-bit byte) per character often have national variants, in order to meet the needs of particular languages and language combinations. More recently multi-octet coding schemes have been introduced (particularly ISO/IEC 10646 and Unicode). The specific coded character set that is used in a given environment must be known before you can correctly identify the represented characters.
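For illustration, the point that the same bit combinations carry different characters in different coded character sets can be shown with a short sketch (Python is used here purely as a demonstration vehicle; the codec names follow Python's conventions):

```python
# The same character has different octet values in different 8-bit sets,
# and the same octet value names different characters in different sets.
ch = 'é'
print(ch.encode('latin-1'))     # b'\xe9' in ISO/IEC 8859-1
print(ch.encode('cp850'))       # b'\x82' in IBM CP850
print(b'\xe9'.decode('cp850'))  # 'Ú' — the octet 0xE9 means another character in CP850
```

Without knowing which coded character set produced a given octet stream, the octet 0xE9 above could legitimately be read as either “é” or “Ú”.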
Characters don’t usually appear in a coded form outside IT systems, although the 6- and 8-dot Braille symbols (which also have cultural and language dependent meaning, somewhat similar to sign languages for the deaf) are a notable exception to this. Even within an IT system, character coding is really needed only to allow for processing of the data, in addition to reducing memory requirements for storing the data. Processing, be it calculation, search, ordering, etc., requires a level of knowledge of the data content, as does any flexibility in the rendering of the data.
The number of bits defines the maximum number of possible individual codes (5 bits allow for 32 codes, 6 for 64, 7 for 128, 8 for 256, 16 for 65536, etc.). In practice the number of assignable graphic characters is considerably lower than this maximum, since some positions may be reserved and some are used for control characters. For example, the widely used 8-bit ISO/IEC 8859 series only allows for 191 graphic characters per set. Proprietary solutions such as the Microsoft Windows Western CP1252 or the IBM Multinational CP850 code sets allow for more graphic characters to be represented in 8 bits by assigning graphic characters to bit combinations in a range that is reserved for control characters in the standard sets. This reduces their language orientation, although only slightly, since most of the added characters are for punctuation or special symbols. As a result, standardized, single-octet coded character sets are often used for external interchange only. They are, however, also used internally, e.g. in UNIX and Linux type systems.
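The arithmetic above can be checked directly (a minimal sketch; the control-character accounting for ISO/IEC 8859 assumes the standard C0 range, C1 range and DEL):

```python
# Maximum number of distinct codes for an n-bit character set is 2**n.
for bits in (5, 6, 7, 8, 16):
    print(bits, 'bits ->', 2 ** bits, 'codes')

# In the ISO/IEC 8859 series, two control ranges (C0: 0x00-0x1F and
# C1: 0x80-0x9F, 32 positions each) plus DEL (0x7F) are reserved,
# leaving 256 - 32 - 32 - 1 = 191 positions for graphic characters.
graphic_positions = 2 ** 8 - 32 - 32 - 1
print(graphic_positions)  # 191
```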
In addition to the coding in the character set, each character is represented by a “glyph” that provides the basis for the physical presentation of the character for display on screen or printing. The actual presentation form is highly dependent on font and type specifications. More than one glyph can be mapped to a single character code, if they are considered as glyph variants of the same character. Also, more than one character code may be used together to form a single composed output character, because it represents a combination of characters (e.g. a ligature such as fi or some highly accented Lithuanian letters for legal use).
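The ligature case can be made concrete with Unicode's compatibility decompositions (sketch only; this uses Python's `unicodedata` module and the separately encoded “fi” ligature, which is one presentation form corresponding to two characters):

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI: a single code whose rendered glyph
# corresponds to the two-character sequence 'f' + 'i'.
lig = '\ufb01'
decomposed = unicodedata.normalize('NFKC', lig)
print(decomposed)                 # 'fi' (two characters)
print(len(lig), len(decomposed))  # 1 2
```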
Graphic characters have a number of different properties. They can be alphabetic, numeric or special characters of various categories. Alphabetic characters may have lower case and upper case variants, or be case independent. Characters also have a default reading direction used for rendering (such as left-to-right). Special rules may need to be applied when presenting characters with different reading directions. Definitions of coded character sets have usually not discussed such properties but left them to be defined elsewhere. In Unicode, the trend is to develop rigorous definitions for special properties, even application-oriented properties such as basic behavior in line-breaking rules.
There are thousands of languages spoken by the nations of the world. Many nations (although not necessarily their individual members) are multilingual, and virtually the same language is often also spoken by members of several nations. In addition to the variety in the native languages there are differences in the scripts used to write them. Within Europe six modern scripts are in traditional use (Latin, Cyrillic, Greek, Armenian, Georgian, and Hebrew) and a number of other scripts are used by large immigrant populations. The script used is often dependent on the language, although a given language may be written using more than one script. In addition, a number of both historic and artificial scripts have their own user communities, such as researchers and hobbyists.
For practical purposes, the full character repertoire of a given language should be available for processing within a single coded character set, although at times it is difficult to agree on what constitutes the full repertoire, particularly for languages for which an established orthography has not been well documented. Even for widely known languages (such as English), the question arises as to which letters are:
Thus it is difficult to know where to draw the line for the full repertoire.
To illustrate the problems with using single-octet code sets, even the official languages of Europe have such unique characteristics that no single 8-bit coded character set can cover them, not even those using only the Latin script. For this reason, there are multiple Latin parts for ISO/IEC 8859. Most of these parts have no free space, which was the reason for creating “Latin-9” (8859-15) in 1999 for use, in lieu of “Latin-1” (8859-1), when the euro sign (€) was added to the repertoire.
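The Latin-1/Latin-9 difference is easy to demonstrate (illustrative sketch; codec names are Python's):

```python
# The euro sign exists in Latin-9 (ISO/IEC 8859-15) but not Latin-1
# (ISO/IEC 8859-1), where the same position holds the currency sign ¤.
euro = '\u20ac'
print(euro.encode('iso-8859-15'))  # b'\xa4'
try:
    euro.encode('iso-8859-1')
except UnicodeEncodeError:
    print('no euro sign in Latin-1')
```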
It should be noted that there can be no direct understanding between systems and applications using different single-octet encodings, including various parts of the same standard. Some extension schemes have been developed to allow switching between encodings within a data stream or file. However, none of these schemes have been widely supported, primarily because of the severe difficulties in defining and implementing processing rules for them, particularly at the time when processing cycles were a scarce resource. Several “multilingual” systems already allow for processing of data elements (e.g. e-mail messages) with different, known encoding schemes without any loss of data, as long as the various elements don’t need to be combined.
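The lack of direct understanding between single-octet encodings, including different parts of the same standard, can be seen in one line (a sketch; codec names are Python's):

```python
# The octet 0xE4 is 'ä' in ISO/IEC 8859-1 (Latin-1) but the Cyrillic
# letter 'ф' in ISO/IEC 8859-5 — the data is only meaningful together
# with the identification of its coded character set.
octet = b'\xe4'
print(octet.decode('iso-8859-1'))  # 'ä'
print(octet.decode('iso-8859-5'))  # 'ф'
```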
To overcome the limitations of the single-octet world, the Universal Multiple-Octet Coded Character Set (UCS — ISO/IEC 10646) has been created by ISO/IEC JTC1/SC2 (WG2). As the name implies, the intent is to cover all the characters of all the scripts of the world. A parallel effort, Unicode, had been started by major IT companies. These two efforts are now fully synchronized and the bodies responsible for them are committed to maintaining the synchronization. Specifically, the repertoire and coding of the currently published ISO/IEC 10646-1:2000 with Amendment 1 and Unicode 3.2 are identical. The Unicode standard, however, also deals with a number of character semantics and implementation related items, whereas ISO/IEC 10646 is essentially limited to coding and identifying the script and the relevant collections.
In spite of the strong desire by the industry to concentrate on the implementations of UCS/Unicode and not to have to make any new investments in old technology, there is still some standardization activity in ISO/IEC JTC1/SC2 (WG3) on the 8-bit coded character sets, although the Working Group has not met since 2000. This activity includes a revision of ISO/IEC 6937, Information technology -- Coded graphic character set for text communication -- Latin alphabet, which provides a way to code an extended yet limited repertoire by using a particular coding scheme with selected sequences (unlike UCS/Unicode, where the architecture does not restrict the use of combining characters).
ISO/IEC JTC1/SC2 (WG3) also deals with the registration of character sets for the purposes of a user community being able to formally refer worldwide to either a specific coding scheme or repertoire. The relevant standard, ISO/IEC 2375 Information technology -- Procedure for registration of escape sequences and coded character sets, is under revision. The purpose of the revision is to clarify the rules and also provide a mapping from future registrations to the UCS, which is becoming the reference point for all character set specifications. The revised standard is expected to be published in 2002. The register is available on-line.
The IT industry is active in supporting the UCS/Unicode. The more recent versions of many operating systems already use UCS/Unicode in their internal implementation. Code conversions are routinely done using UCS as the reference point. Rendering devices, such as printers and displays, also support UCS/Unicode in lieu of several more limited coded character sets.
The first edition of Part 1, Architecture and Basic Multilingual Plane, was originally published in 1993, followed by a multitude of amendments; the second edition was published in 2000, followed by one amendment. Part 1 deals exclusively with 16-bit (2-octet) codes, except for the architecture, which extends to the maximum of four octets. Part 2, Supplementary Planes, first published in 2001, defines a supplementary multilingual plane for scripts and symbols, a supplementary plane for CJK [Chinese, Japanese and Korean] ideographs, and a supplementary special purpose plane. This part has greatly expanded the coverage of the UCS in terms of both character repertoire and user community. The Working Group, in May 2002, initiated an effort to combine the two parts in one.
The UCS started off by defining code positions and thus new encoding for characters in existing coded character sets. Because of this, in addition to a very large number of useful characters, it also contains a number of obscure characters, some of which have never been widely used or have for various reasons been deprecated. Nevertheless, in order to ensure continued interoperability, no existing UCS codes will be removed or even renamed, although relevant footnotes on the usage and alternative naming are attached to a number of entries.
Many characters can only be encoded as such; for the purposes of this discussion, these are called base characters. In addition, the UCS also defines a large number of pre-composed characters as well as their components. Thus, for example, many characters based on the Latin letter A are encoded in both capital and small letter forms by themselves (e.g. Áá, Àà, Ââ, Ää, Åå, etc.) or as the letter A followed by the corresponding combining diacritical mark (i.e. grave, acute, circumflex, tilde, diaeresis, ring above, etc.). This was seen to be necessary to provide for ease of transition, using relatively simple processing and font design rules, when moving from existing character sets. Since a base character can carry several diacritical marks, even in parallel, it was necessary for the purposes of automatic sorting to define unambiguously the decomposition of pre-composed characters, including the default mutual order of combining diacritical marks. This unambiguity has helped to create new processing rules also for other than ordering purposes.
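The pre-composed/decomposed duality described above can be sketched with Unicode's canonical normalization forms (using Python's `unicodedata` for illustration):

```python
import unicodedata

# U+00C1 (pre-composed Á) and 'A' + U+0301 (combining acute accent)
# encode the same abstract character in two different ways.
precomposed = '\u00c1'
decomposed = 'A\u0301'
print(precomposed == decomposed)  # False — different code sequences
print(unicodedata.normalize('NFD', precomposed) == decomposed)  # True
print(unicodedata.normalize('NFC', decomposed) == precomposed)  # True
```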
The internationalization of the Internet imposes requirements on the character repertoire that can be used in contexts where US ASCII has been the only repertoire available or the only one with reliable support. Within Internet e-mail, a general mechanism (MIME) that allows messages to be written in any character encoding, among other things, is relatively well established in principle. Similarly, for Web page content, well-defined techniques exist for communication between client and server as regards character encoding. However, there are still serious practical limitations and problems, especially in interactivity such as form submission. Internet domain names and URLs, which often have great symbolic and marketing value too, have even in principle been limited to US ASCII. The challenge is to make it possible to use different scripts and characters (and hence different languages) without too much overhead.
Also, for generalized fall-back schemes for searching and browsing, the pre-composed characters would have to be decomposable, as many schemes are based on dropping some or all of the diacritical marks. In particular, in searches it is often desirable to treat e.g. “é”, “è”, “ê”, etc, all as equivalent to “e”, especially since such characters are often mistyped in data. The composition of these characters, however, is not evident from the encoding.
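A common fall-back scheme of the kind mentioned, dropping diacritical marks after decomposition, can be sketched as follows (illustrative only; the helper name `strip_marks` is not from any standard):

```python
import unicodedata

def strip_marks(text):
    # Decompose pre-composed characters, then drop the combining
    # diacritical marks (Unicode general category 'Mn').
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed
                   if unicodedata.category(c) != 'Mn')

print(strip_marks('résumé'))  # 'resume'
print(strip_marks('è ê é'))   # 'e e e' — all treated as equivalent to 'e'
```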
The ambiguity caused by the redundancy of both pre-composed and decomposed encoding can be resolved by using “normalization tables”, which need to be relatively stable in order to remain valid over time. Different methods can be applied in normalization; the normalization algorithms are discussed in the Unicode Standard Annex #15. For example, the W3C Working Draft for the Character Model for the World Wide Web defines a solution based on a particular normalization method and “early normalization”.
As a consequence of the composition possibilities and rules, since the finalization of the second version of UCS Part 1, there is a strong reluctance to accept any new pre-composed Latin characters for encoding if they can be composed using the existing repertoire of base characters and combining diacritical marks. A new notation, the UCS Sequence Identifiers (USI), is intended to provide visibility for the encoding rules of characters that will remain decomposed and also for their processing and rendering rules. For a large user base needing support for a number of oriental languages, decomposed characters are the norm rather than an exception.
The encoding of a character and the method for inputting it are quite independent. For example, for keyboard entry, it is natural to make the most common characters easiest to enter, using a single keystroke. It is up to the keyboard driver program to create the appropriate encoding, which could, in principle, create a pre-composed character, a decomposed character or a sequence of characters from one or more keystrokes. New technologies for small devices create great challenges, especially since the devices can accommodate perhaps just a dozen or so keys. Software techniques that assist typing work best when the repertoire of characters is small.
Data can also be entered into the system in machine readable form. Thus data can be scanned from documents, in which case glyph recognition is required if character data and not just images need be processed. Special fonts, such as OCR-B [Optical Character Recognition, character set B], have been created for reliable, high speed character recognition of specific repertoires. Although current technology has advanced to a high degree of tolerance for the intricacies of font design, character recognition is always repertoire dependent. Other machine readable forms of input include bar codes and magnetic stripes. The resulting character encoding, on the other hand, should always be independent of the form of input.
The rendering of visible output is also repertoire dependent, although with advances in font design technology, it does not necessarily have to be at the level of the final, composed characters. As the UCS repertoire is extended, increasingly advanced font technologies will be needed for full support.
The repertoires supported by a given system or device for both input and output can routinely be extended by downloading upgrades from the various vendors’ websites. However, in practice this poses problems to ordinary users, and there is a growing need for sufficiently wide character repertoires to be supported by “factory defaults”.
Certain processing rules are only applicable for a specific repertoire and cultural environment. Thus, the growth of the UCS repertoire creates needs for defining smaller sets of characters: subsets of the repertoire for adequate processing in specific environments. Multilingual European Subsets [MES] of UCS have been defined in a CEN Workshop Agreement (CWA 13873:2000). MES repertoires are being used as the base for a number of European specifications.
An example of proper application of defined repertoires in a layered manner is the ordering of characters. Locally, in any given environment, the ordering of the native characters is fully dependent on and defined within that environment. The ever expanding multilingual Information Society, however, requires that non-native characters also be ordered in a predictable manner, even within local applications. Thus, the non-native characters not covered by the local ordering rules should be ordered in Europe using the European Ordering Rules (ENV 13710:2000). Since this European standard has been defined for use in pan-European applications, it only covers the MES repertoire, and any characters outside that repertoire should be ordered according to the worldwide standard for ordering UCS (ISO/IEC 14651:2001). To facilitate this, the European standard is defined as an overlay profile of the international standard (and future national standards should be defined as overlay profiles of the European standard).
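The layered, overlay-profile principle can be sketched with a toy tailoring: characters covered by a local table get local weights, and everything else falls back to a default order (here simply the code point — a gross simplification of ISO/IEC 14651, for illustration only):

```python
# Toy Swedish tailoring: å, ä, ö sort after z; all other characters
# fall back to their default (code-point) weight. Real collation
# standards use multi-level weight tables, not single weights.
TAILORING = {'å': ord('z') + 1, 'ä': ord('z') + 2, 'ö': ord('z') + 3}

def collation_key(word):
    return [TAILORING.get(c, ord(c)) for c in word.lower()]

words = ['öl', 'apa', 'ängel', 'zebra', 'ära']
print(sorted(words, key=collation_key))
# ['apa', 'zebra', 'ängel', 'ära', 'öl']
```

A national profile would extend only the tailoring table, leaving the fallback layer (the European or international ordering) untouched.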
Similar considerations should be given to transformation, i.e. transliteration, transcription and fall-backs. In a local application, local rules should apply: even when the system is able to present the proper characters, they may not be comprehensible to the users. Thus Greek, Russian or Yiddish words or expressions are likely to be readable (whether understandable or not) by the majority of western Europeans only when written using the Latin script. Also, even in the Latin script, a number of characters with or even without diacritical marks (e.g. thorn: Þ,þ) may be totally unfamiliar to a large number of users of the script, and a replacement notation (like th for thorn) is sometimes adequate; the same applies to other scripts, e.g. Cyrillic, particularly Slavic vs. non-Slavic. It should also be noted that the same character may be a base character in one language and a “refinement” in another and that the proper fall-back is often dependent on the source language.
Global applications such as multilingual search of the web cause particular problems whenever names are transliterated. Definite transliteration rules usually apply from one language to another, and there are very few standards in this area. A case in point is how differently Russian names are written in the different European languages using the Latin script, because these languages pronounce particular combinations of Latin characters differently.
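The language dependence of transliteration can be sketched with per-language mapping tables (the tables below are toy examples, not from any standard, and cover only the letters of the sample name):

```python
# Toy per-target-language transliteration tables for a Cyrillic name.
# English renders 'ч' as 'ch' and 'в' as 'v'; German uses 'tsch' and 'w'.
EN = {'ч': 'ch', 'а': 'a', 'й': 'i', 'к': 'k', 'о': 'o',
      'в': 'v', 'с': 's', 'и': 'i'}
DE = {'ч': 'tsch', 'а': 'a', 'й': 'i', 'к': 'k', 'о': 'o',
      'в': 'w', 'с': 's', 'и': 'i'}

def transliterate(word, table):
    return ''.join(table.get(c, c) for c in word.lower())

name = 'чайковский'
print(transliterate(name, EN))  # 'chaikovskii'
print(transliterate(name, DE))  # 'tschaikowskii'
```

A multilingual search that treats these two renderings as the same name needs to know, or guess, the transliteration rules of each source language.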
The use of UCS and Unicode as the conceptual basis is recommended. Whether the internal representation and processing of data uses UCS/Unicode depends on the systems and applications software in use. In effect, many systems already use UCS/Unicode internally in many applications that appear to be more restrictive. Particularly for transmission of data, practical considerations still often lead to the use of other alternatives, such as 8-bit encodings. They are then to be selected within the capabilities of the communicating systems according to the needs of the languages and cultural environments to be supported.
The primary objective is to be able to process together all the characters of a given language with the right character properties and the right ordering sequence and searching rules for the characters (and, eventually, combinations of characters). Normally the requirement is extended to being able to process together all the characters of all the languages that are commonly used in a given multilingual environment. The processing rules for a given document should be common and follow those for the “primary” language rather than those for ancillary languages.
In most modern systems, default values are being set on a number of what are often termed cultural elements (see CWA 14094) at the time of installation. These values include the coded character set to be used. In most cases, the system can be tailored by overriding the default values either permanently or temporarily. For the coded character set, the set to be used can often be specified for individual outgoing messages, etc, and in some instances even different fields of a database may be encoded differently.
The basic UCS/Unicode, in spite of its many virtues, is not as such a coding scheme suitable for information exchange over communication facilities, not only because of potential performance implications of using long codes but also because the operation of various communication protocols could be confused by the codes. There exist, however, a number of standardized UCS Transfer Formats (e.g. UTF-8 and UTF-16) which provide for the transfer of the full UCS repertoire. Their performance depends on what sub-repertoires are being transmitted, since the characters have different lengths in different formats. For predominantly Latin character data, UTF-8 is the preferred choice for performance reasons. These transfer formats or other, typically mnemonic schemes, can be used between systems even when the coding used by the systems themselves is not UCS/Unicode. One such mnemonic scheme is used in HTML, where e.g. the euro sign can be expressed as the character “€” or referred to as “&euro;” or by a notation for its hexadecimal or decimal UCS code, i.e. “&#x20AC;” or “&#8364;”. The character repertoire of HTML Version 4 and XML is that of UCS/Unicode.
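The length differences between transfer formats can be checked directly (a sketch; codec names are Python's, and UTF-16 is shown big-endian without a byte order mark):

```python
# The euro sign (U+20AC) takes three octets in UTF-8 but two in UTF-16.
euro = '\u20ac'
print(euro.encode('utf-8'))      # b'\xe2\x82\xac' — 3 octets
print(euro.encode('utf-16-be'))  # b' \xac' — 2 octets

# ASCII letters stay one octet each in UTF-8, which is why UTF-8 is
# preferred for predominantly Latin character data.
print(len('abc'.encode('utf-8')), len('abc'.encode('utf-16-be')))  # 3 6
```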
If the system allows the specification of the coded character set to be attached to outgoing messages and data, the facility should be used. For example, E-mail programs and Web servers should use specific headers designed for this purpose. In addition, it is often useful to specify the language of the message. Data encoded in a different code than the one normally used in the receiving system can subsequently be converted according to the rules set by the recipient, e.g. by replacing characters or character sequences by others, dropping diacritical marks or replacing unprintable characters by a specific symbol. Systems that use UCS/Unicode internally may also offer to dynamically load the required fonts so that they can render the data as sent.
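As an example of attaching the coded character set to outgoing data, the MIME Content-Type header carries a charset parameter (sketch using Python's standard `email` package for illustration):

```python
from email.mime.text import MIMEText

# Label an outgoing e-mail body with its character encoding so the
# recipient can decode (or convert) it correctly.
msg = MIMEText('Prix: 10 \u20ac', _charset='utf-8')
print(msg['Content-Type'])  # text/plain; charset="utf-8"
```

Web servers do the equivalent with an HTTP header such as `Content-Type: text/html; charset=utf-8`.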
All CEN and ISO originated standards material, including CEN Workshop Agreements (CWAs), is sold by the European national standards organizations. All ISO originated material is also available directly from ISO, some free of charge on the web. Some of the CWAs, including those discussed here, are freely downloadable from the CEN web site.
Unicode standards and technical reports are available free of charge on the Unicode web site.
The W3C Working Draft for the Character Model for the World Wide Web is continuously updated.
Statskontoret, the Swedish Agency for Administrative Development has published Comparisons of Standardized Character Sets for Europe (2000:2) (ISBN 91-7220-374-9) on a representative number of 7- and 8-bit coded standardized and proprietary character sets and registrations. Code tables for selected pairs indicate which characters exist in both sets with the same encoding, which with different encoding and which don't exist in the comparison set.
The Diffuse Project is funded under the European Commission's Information Society Technologies programme. Diffuse publications are maintained by TIEKE (the Finnish IS Development Centre), IC Focus and The SGML Centre.