DIFFUSE: Dissemination of InFormal and Formal Useful Specifications and Experiences to research, technology development & demonstration communities

Character Set Standards

Project funded under the European Commission's 5th Framework IST Programme

Purpose of Section
This section of the Diffuse Standards and Specifications List provides information on character sets that can be used for data interchange.

Subjects Covered

  1. ASCII - American Standard Code for Information Interchange
  2. EBCDIC - Extended Binary Coded Decimal Interchange Code  
  3. ISO 646 - ISO 7-bit coded character set for information interchange  
  4. ISO/IEC 2022 - Character code structure and extension techniques
  5. ISO/IEC 4873 - 8-bit coded character set for information interchange  
  6. ISO/IEC 6429 - Control functions for coded character sets
  7. ISO/IEC 6937 - Coded graphic character set for text communication - Latin alphabet  
  8. ISO/IEC 8859 - 8-bit single-byte coded graphic character sets
  9. ISO 9036 - Arabic 7-bit coded character set for information interchange
  10. ISO 9541 - Font information interchange  
  11. ISO/IEC 10367 - Standardized coded graphic character sets for use in 8-bit codes  
  12. ISO/IEC 10538 - Control functions for text communication
  13. ISO/IEC 10646 - Universal multiple-octet coded character set (UCS)
  14. JIS X 0201 - Japanese Industrial Standard code for information interchange
  15. JIS X 0202 - Extension techniques for use with the code for information interchange
  16. JIS X 0208/0212 - Code of the Japanese graphic character set for information interchange
  17. JIS X 0213 - 7-bit and 8-bit double byte coded extended Kanji sets for information interchange  
  18. OCR
  19. Other character set standards

Active Fora
The standards in this section have been prepared by both private and public organizations. The following public bodies have been involved in their preparation:

  • ISO/IEC JTC1/SC2 -- JTC1 is the first (and only) Joint Technical Committee of ISO and IEC, and deals with Information Technology. SC2 is the subcommittee of JTC1 responsible for the description of Coded character sets and Code extension techniques
  • ISO/IEC JTC1/SC34 -- SC34 is the subcommittee of JTC1 responsible for Document description and processing languages
  • ITU -- International Telecommunication Union
  • CEN -- European Committee for Standardization TC 304, ICT - European Localization Requirements (formerly called Character Set Technology)
  • TERENA -- Trans-European Research Networks Association Working Group on Character Sets and Internationalization of Networks Services
  • ANSI -- American National Standards Institute 
  • JISC -- Japanese Industrial Standards Committee.
Related Initiatives
The Unicode Consortium, an industry consortium, produces the Unicode Standard, which has the same character repertoire and coding as ISO/IEC 10646 (UCS). It makes Version 3 of the Standard and related Technical Reports available on its web pages. The Unicode Standard also defines character properties and provides implementation guidelines that are not part of the UCS.

Statskontoret, the Swedish Agency for Administrative Development, has published Comparisons of Standardized Character Sets for Europe (2000:2), a revised report (ISBN 91-7220-374-9) on a representative number of 7- and 8-bit coded standardized and proprietary character sets and registrations. Code tables for selected pairs indicate which characters exist in both sets with the same encoding, which exist with a different encoding and which do not exist in the comparison set.

Further information on certain standardized character sets, and other, proprietary, character sets, can be obtained from http://www.dkuug.dk/i18n/charmaps/

Part 5 of the Netherlands Ministry of the Interior's series on Standards for the electronic exchange of personal data (ISBN 90-5414-019-4) provides a tutorial on the character set standards and their historical development.

Information on character set conversion software can be obtained from the TERENA project at http://www.nada.kth.se/i18n/c3/.

The Diffuse Guide to Character Sets provides an overview of the role of character sets.



ASCII

Expanded name
American Standard Code for Information Interchange

Area covered
7-bit coded character set for information interchange

Sponsoring body
American National Standards Institute (ANSI)

Source documents
Information Systems – Coded Character Sets – 7-Bit American National Standard Code for Information Interchange (7-Bit ASCII)

Characteristics/description
Specifies coding of space and a set of 94 characters (letters, digits and punctuation or mathematical symbols) suitable for the interchange of basic English language documents. Forms the basis for most computer code sets and is the American National Version of ISO/IEC 646.

Usage
Used as the basic US code set for personal and workstation computers.

The following IST RTD projects use this standard: M-PIRO.

Further details available from
ANSI, 25 West 43rd Street, New York, NY 10036, USA

Other references
A list of ASCII codes can be obtained from http://www.dkuug.dk/i18n/charmaps/ANSI_X3.4-1968.



EBCDIC

Expanded name
Extended Binary Coded Decimal Interchange Code

Area covered
8-bit coded character set for information interchange between IBM computers

Sponsoring body
Proprietary specification developed by IBM

Characteristics/description
A set of national character sets for interchange of documents between IBM mainframes. Most EBCDIC character sets do not contain all of the characters defined in the ASCII code set but there is a special International Reference Version (IRV) code set that contains all of the characters in ISO/IEC 646 (and, therefore, ASCII). Several national versions have been updated to support the encoding of the euro sign (in lieu of the currency sign).

Usage
Not much used outside of IBM and similar mainframe environments. When transmitting EBCDIC files between systems care needs to be taken to ensure that the systems are set up for the relevant national code set.
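
The need to agree on the national code set can be illustrated with a minimal Python sketch. It assumes the standard library codecs `cp037` and `cp500`, which are Python's names for two EBCDIC code pages and are not named in this entry; the helper function is hypothetical.

```python
# Sketch: text encoded with one EBCDIC code page and decoded with another.
# "cp037" and "cp500" are Python codec names, chosen here as assumptions.

def ebcdic_mismatches(text, sender="cp037", receiver="cp500"):
    """Return (sent, received) pairs for characters that change when bytes
    encoded with the sender's code page are decoded with the receiver's."""
    decoded = text.encode(sender).decode(receiver)
    return [(a, b) for a, b in zip(text, decoded) if a != b]

sample = "TOTAL[1] = 10!"
print(ebcdic_mismatches(sample))                    # some punctuation changes
print(ebcdic_mismatches(sample, "cp037", "cp037"))  # empty when both sides agree
```

Letters, digits and the space survive the mismatch in this sketch, which is why such configuration errors can go unnoticed until punctuation is checked.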

Further details available from
Your local IBM office.

Other references
Details of the most commonly used sets of EBCDIC codes can be obtained from http://www.dkuug.dk/i18n/charmaps. This site has not necessarily been updated to cover the new code pages that also support the euro sign.

Unicode Consortium report on EBCDIC-Friendly UCS Transformation Format
OII Standards and Specifications Activity Report, December 1998


ISO 646

Expanded name
ISO 646: 7-bit coded character set for information interchange

Area covered
Unaccented Latin letters, digits and punctuation characters

Sponsoring body
ISO/IEC JTC1/SC2 and ITU

Source documents
ISO 646:1991/ITU-T Recommendation T.50 (09/92) Information technology -- 7-bit coded character set for information interchange

Characteristics/description
Specifies 7-bit coding for space and 94 characters. There is an International Reference Version (IRV), which is identical to ASCII, and national variants that provide accented and other special characters required in different countries.

Character positions 00-31 (ISO positions 0/0 to 1/15) and 127 (ISO position 7/15) are reserved for control codes. Code 32 (2/0) identifies a space. The sequence in which other codes appear in the IRV is:

! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ `
a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
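
The column/row notation used above (for example 2/0 for code 32) can be sketched in Python as follows. The function names are hypothetical and the sketch covers only the layout described in this entry, not the national variants.

```python
# Sketch: ISO column/row notation for 7-bit code positions.

def iso_position(code):
    """Return the 'column/row' position of a 7-bit code, e.g. 32 -> '2/0'."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit code")
    return f"{code >> 4}/{code & 0x0F}"

def classify(code):
    """Classify a 7-bit code as a control code, the space or a graphic character."""
    if code <= 31 or code == 127:
        return "control"
    return "space" if code == 32 else "graphic"

graphics = [chr(c) for c in range(128) if classify(c) == "graphic"]
print(len(graphics))                          # 94
print(iso_position(31), iso_position(127))    # 1/15 7/15
```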
Usage
Base character set used by most systems. Contains all characters provided by the standard shift positions of basic US QWERTY keyboards.

Further details available from
ISO and national standards bodies.



ISO/IEC 2022

Expanded name
ISO/IEC 2022: Character code structure and extension techniques

Area covered
Structure of 7-bit or 8-bit code tables, rules for code extension

Sponsoring body
ISO/IEC JTC1/SC2

Source documents
ISO/IEC 2022:1994 Information technology -- Character code structure and extension techniques. A technical corrigendum was published in 1999.

Characteristics/description
ISO standard for switching between code sets in 7-bit and 8-bit environments. Describes the role of the Escape, Shift-Out (SO) and Shift-In (SI) codes in the base control code set for controlling which character sets are used in a 7-bit environment, and how the role of these characters changes in an 8-bit environment to provide a locking shift code swapping function.

Up to 4 code sets (G0-G3) can be mapped into the left-hand side of an 8-bit ISO code set. Three of these (G1-G3) can also be used on the right-hand side. Escape code sequences are used to identify which code sets are to be used. Users can also select variant control code sequences using Escape code sequences. Escape code sequences are also used to provide a single character change of character sets.
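
The use of Escape code sequences to identify the code set in use can be observed with a short Python sketch. It relies on Python's `iso2022_jp` codec, one 7-bit encoding built on this mechanism; the codec choice and the helper name are assumptions made for illustration, and the sketch shows only code set identification, not the shift functions.

```python
# Sketch: escape sequences switching code sets inside a 7-bit byte stream.
# The "iso2022_jp" codec is used only as a convenient illustration.

ESC = b"\x1b"

def show_designations(text, codec="iso2022_jp"):
    """Encode text and return (encoded bytes, list of 3-byte escape sequences)."""
    data = text.encode(codec)
    escapes = [data[i:i + 3] for i in range(len(data)) if data[i:i + 1] == ESC]
    return data, escapes

data, escapes = show_designations("abc 日本 xyz")
print(data)      # 7-bit bytes with escape sequences around the Japanese text
print(escapes)   # one sequence to enter the Japanese set, one to return
```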

Usage
Forms the basis for code switching in other standards, including SGML, but Shift functions are not used on standard hardware platforms, making use of this standard problematical.

Further details available from
ISO and national standards bodies.



ISO/IEC 4873

Expanded name
ISO/IEC 4873: ISO 8-bit code for information interchange

Area covered
Rules for developing 8-bit code sets

Sponsoring body
ISO/IEC JTC1/SC2

Source documents
ISO/IEC 4873:1991 Information technology -- ISO 8-bit code for information interchange -- Structure and rules for implementation

Characteristics/description
Standard explaining the structure of 8-bit coded character sets based on the concepts of ISO/IEC 2022. Three levels of implementation are specified, 1 for No Shifts, 2 for Single Shifts, 3 for Locking Shifts.

Note: ISO/IEC 4873 has not been updated to conform to changes made to ISO/IEC 2022.

Usage
Provides basic rules for later ISO standards.

Further details available from
ISO and national standards bodies.



ISO/IEC 6429

Expanded name
ISO/IEC 6429: Control functions for coded character sets

Area covered
Control codes for 7-bit and 8-bit coded character sets

Sponsoring body
ISO/IEC JTC1/SC2

Source documents
ISO/IEC 6429:1992 Information technology -- Control functions for coded character sets

Characteristics/description
Defines 163 control functions, including the control characters that can be used in the C0 (0/0 - 1/15) positions in 7-bit and 8-bit environments and C1 (8/0 - 9/15) positions in 8-bit environments.
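
As a minimal sketch of the C0 and C1 positions given above, the following Python function reports which area of an 8-bit code table a byte falls in. The labels "GL" and "GR" for the remaining left-hand and right-hand columns are an assumption borrowed from common ISO/IEC 2022 terminology, and the function name is hypothetical.

```python
# Sketch: locate a byte value in the 8-bit code table by its column number.

def code_area(byte):
    """Return 'C0', 'GL', 'C1' or 'GR' for a byte value 0..255."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not an 8-bit value")
    column = byte >> 4
    if column <= 1:
        return "C0"      # positions 0/0 to 1/15
    if column <= 7:
        return "GL"      # left-hand columns 2 to 7 (assumed label)
    if column <= 9:
        return "C1"      # positions 8/0 to 9/15
    return "GR"          # right-hand columns 10 to 15 (assumed label)

print(code_area(0x1B), code_area(0x9B))   # C0 C1
```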

Usage
Forms the basis for control code definitions in many systems.

Further details available from
ISO and national standards bodies.



ISO/IEC 6937

Expanded name
ISO/IEC 6937: Coded graphic character set for text communication -- Latin alphabet

Area covered
Defines a character set supporting most western European languages in a limited fashion

Sponsoring body
ISO/IEC JTC1/SC2 and ITU

Source documents

Characteristics/description
The left-hand side of this 8-bit code set is based on ISO 646. The code set for the right-hand set of 94 characters contains a set of diacritical marks that can form predefined combinations with letters on the left-hand side to produce accented characters, together with other characters used in European languages based on the Latin script that are not suitable for splitting into letter plus diacritic, such as the thorn (þ) used in Icelandic. The set also includes a set of single, double and French style angle open and closing quotation marks, Copyright and Registered symbols, the Spanish inverted question mark, some maths signs, fractions, superior numbers (2 and 3 only) and a set of arrows.
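
The combination of a diacritical mark with a letter can be sketched in Python. This is an illustration only: the four byte values in the table are an excerpt recalled from the code table and should be checked against the published standard, the sketch assumes that the diacritic byte precedes its base letter, and the function name is hypothetical. Python has no built-in codec for this code set.

```python
import unicodedata

# Assumed excerpt of diacritic codes mapped to Unicode combining marks;
# verify these values against the published code table before relying on them.
DIACRITICS = {
    0xC1: "\u0300",  # grave accent
    0xC2: "\u0301",  # acute accent
    0xC3: "\u0302",  # circumflex accent
    0xC8: "\u0308",  # diaeresis
}

def decode_6937_sketch(data):
    """Decode bytes made of left-hand (7-bit) letters and the diacritics above."""
    out = []
    pending = None
    for byte in data:
        if byte in DIACRITICS:
            pending = DIACRITICS[byte]       # remember mark until the letter arrives
        elif byte < 0x80:
            out.append(chr(byte))
            if pending:
                out.append(pending)          # Unicode puts the mark after the letter
                pending = None
        else:
            raise ValueError(f"byte 0x{byte:02X} not covered by this sketch")
    return unicodedata.normalize("NFC", "".join(out))

print(decode_6937_sketch(b"caf\xc2e"))   # café
```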

Usage
Basic character set used on teletext, videotext and related systems. For computers this code set has mostly been superseded by ISO/IEC 8859 and ISO/IEC 10646.

ISO/IEC 6937 also provides the character set repertoire used for X.400 message handling systems and X.500 directory services, and its repertoire was the basis for the first version of the ISO/IEC 9995-2 international keyboard layout standard.

The ITU version of the standard is based on the 1994 version of the ISO standard.

Further details available from
ISO and national standards bodies.

Other references
Details of the ISO/IEC 6937 code set can be obtained from http://www.dkuug.dk/i18n/charmaps/ISO_6937-2-ADD.



ISO/IEC 8859

Expanded name
ISO/IEC 8859: 8-bit single-byte coded graphic character sets

Area covered
Defines accented and non-Latin characters used in European languages

Sponsoring body
ISO/IEC JTC1/SC2

Source documents
ISO/IEC 8859 Information processing -- 8-bit single-byte coded graphic character sets

Characteristics/description
Specifies coding for sets of accented characters that cover the needs of most European languages, including limited sets of Greek, Hebrew and Arabic characters and some Cyrillic characters. Part 1 covers Western European languages, some of which are more fully covered by Part 15, which also supports the euro sign. Part 2 covers Eastern European (Slavic, Albanian, Hungarian and a variation of Romanian) languages, Part 3 covers Southern European languages (Maltese) and Esperanto, and Part 4 covers Northern European languages. Part 9 covers the characters used for Turkish, which replace those provided in Part 1 for Icelandic, while Part 10 deals with the Icelandic, Nordic and Baltic character sets. Part 11 combines Latin and Thai characters. Part 16 (Latin No. 10) replaces Part 2 for Romania; it supports some characters with comma below, as opposed to cedilla, and also the euro sign. Part 7 is under revision (to include, among others, the euro sign).

Usage
Used by a few systems as the underlying code set. ISO/IEC 8859-1 has been commonly used as the basis of extended 8-bit code sets within the European Community. The parts cannot be mixed, so problems arise when data moves between environments that use different parts of the standard (e.g. Greece, where Part 7 is used, and the Netherlands, where Part 9 is officially preferred).
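
The problem of moving data between environments that use different parts can be sketched in Python, whose standard codecs `iso8859_1`, `iso8859_7`, `iso8859_9` and `iso8859_15` correspond to Parts 1, 7, 9 and 15. The helper name and the sample bytes are chosen for illustration only.

```python
# Sketch: the same bytes read under different parts of ISO/IEC 8859.

def decode_with_parts(data, parts=(1, 9, 15)):
    """Decode the same bytes under several parts; return {part: text}."""
    return {part: data.decode(f"iso8859_{part}", errors="replace")
            for part in parts}

raw = bytes([0xA4, 0xFE])
for part, text in decode_with_parts(raw).items():
    print(f"Part {part}: {text}")
# Part 1: ¤þ
# Part 9: ¤ş
# Part 15: €þ
```

Nothing in the byte stream itself records which part was intended, so the receiving system has to be told by other means.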

The following IST RTD projects use this standard: M-PIRO.

Further details available from
ISO and national standards bodies.

Other references
Details of the ISO/IEC 8859 code sets can be obtained from http://www.dkuug.dk/i18n/charmaps.



ISO 9036

Expanded name
ISO 9036: Arabic 7-bit coded character set for information interchange 

Area covered
Defines set of Arabic characters

Sponsoring body
ISO/IEC JTC1/SC2

Source documents

  • ISO 9036:1987 Information processing -- Arabic 7-bit coded character set for information interchange
  • ISO 11822:1996 Information and documentation -- Extension of the Arabic alphabet coded character set for bibliographic information interchange
Characteristics/description
ISO 9036 defines the stand-alone version of Arabic characters in a form that can be used for interchange between computer systems using a 7-bit code set.

ISO 11822 covers the use of the Arabic alphabet in bibliographic entries.

Usage
Unknown.

Further details available from
ISO and national standards bodies.

Other references
Details of the ISO 9036 code set can be obtained from http://www.dkuug.dk/i18n/charmaps/ASMO_449.



ISO 9541

Expanded name
ISO 9541: Font information interchange 

Area covered
Provides mechanism for the interchange of information related to the metrics and drawing of glyphs used to display characters

Sponsoring body
ISO/IEC JTC1/SC34

Source documents

  • ISO/IEC 9541:1991 Information technology -- Font information interchange
  • ISO/IEC 10036:1996 Information technology -- Font information interchange -- Procedure for registration of font-related object identifiers
  • ISO/IEC TR 15413:2001 Information technology -- Font services -- Abstract service definition

Characteristics/description
While other standards describe the numeric codes to be assigned to "characters" within a computer, this standard defines how information about the representation of these characters on a screen or a printed sheet should be interchanged. A single coded character can have many different physical representations, depending on the type face (font) being used. Each such representation forms a unique "glyph".

Part 1 of the standard explains the general architecture of the font information interchange standard. Part 2 defines the metrics used to describe the weight, width, height, etc, of a glyph. Part 3 defines how the information needed to generate a glyph should be interchanged, and defines ASN.1 and SGML interchange formats for this information.

ISO 9541 has been defined so that metric information can be interchanged separately from the more commercially sensitive glyph generation information. The metrics defined in Part 2 are of relevance to composition and other software that needs to calculate the relative position of glyphs. Only when the characters are actually being displayed/printed does access need to be provided to the much bulkier glyph drawing information. ISO 9541 Type 1 fonts are compatible with Version 23.0 of the PostScript interpreter.

A register of glyph identifiers is maintained on behalf of ISO by the Graphics Communication Association of America (GCA). This register is based on ISO/IEC 10036.

Usage
Not widely adopted.

Further details available from
ISO and national standards bodies.

Other references

An operational model for characters and glyphs (ISO/IEC TR 15285)
OII Multimedia and Hypermedia Standards Activity Report, September 1996


ISO/IEC 10367

Expanded name
ISO/IEC 10367: Standardized coded graphic character sets for use in 8-bit codes

Area covered
Defines graphic characters used for general purpose applications in typical office environments

Sponsoring body
ISO/IEC JTC1/SC2

Source documents
ISO/IEC 10367:1991 Information technology -- Standardized coded graphic character sets for use in 8-bit codes

Characteristics/description
Specifies a unique coded graphic character set for use as the G0 set and a series of coded graphic character sets of up to 96 characters for use as the G1, G2 and G3 sets defined in ISO 4873 when shifting levels 2 or 3 are implemented. It provides a comprehensive repertoire, including all characters from ISO/IEC 6937, 8859 Parts 1-9 and a box character set.

Registration of character repertoires is carried out using the procedures laid down in ISO/IEC 7350:1991.

Usage
Adopted as national standard in Austria. Uptake elsewhere restricted by limited support for ISO/IEC 2022.

Further details available from
ISO and national standards bodies.

Other references
Details of the ISO/IEC 10367 character set that allows box drawing characters to be used in conjunction with ISO/IEC 8859 can be obtained from http://www.dkuug.dk/i18n/charmaps/ISO_10367-BOX.



ISO/IEC 10538

Expanded name
ISO/IEC 10538: Control functions for text communication

Area covered
Control functions required for text in page-image format, and for mixed formatted and formattable text

Sponsoring body
ISO/IEC JTC1/SC2

Source documents
ISO/IEC 10538:1991 Information technology -- Control functions for text communication

Characteristics/description
Describes the role of ISO 6429 control characters when used in page images or in text that has been, or is capable of being, formatted prior to presentation. Applies to text characters only, not graphics. The codes are defined for interchange purposes only: they are not intended for the actual processing of text.

Usage
Unknown.

Further details available from
ISO and national standards bodies.



ISO/IEC 10646

Expanded name
ISO/IEC 10646: Universal Multiple-Octet Coded Character Set (UCS)

Area covered
Multilingual, multi-octet character set covering all major trading languages. The intent is to provide coding for all the characters of all the scripts of the world.

Sponsoring body
ISO/IEC JTC1/SC2 and ISO/IEC JTC1/SC22 WG20

Source documents

  • ISO/IEC 10646-1 Information technology -- Universal Multiple-Octet Coded Character Set (UCS)
    • Part 1: Architecture and Basic Multilingual Plane
    • Part 2: Supplementary Planes
  • ISO/IEC DIS 14651  International string ordering and comparison -- Method for comparing character strings and description of the common template tailorable ordering

  • ISO/IEC PRF TR 14652 Information technology -- Specification method for cultural conventions

  • ISO/IEC 14755:1997 Information technology -- Input methods to enter characters from the repertoire of ISO/IEC 10646 with a keyboard or other input devices
  • Unicode 3.2
  • RFC 2279 UTF-8, a transformation format of ISO 10646
Characteristics/description
Integrates previous internationally/nationally agreed character sets into a single code set, together with additional characters for previously encoded scripts and new scripts, both current and ancient. ISO/IEC 10646 is based on a 4-octet (32-bit) coding scheme known as the "canonical form" (UCS-4), but a 2-octet (16-bit) form (UCS-2) is used for the Basic Multilingual Plane (BMP), where the missing two high-order octets are assumed to be 00 00.

The code set is split into 128 "groups" of 256 "planes", each containing 256 "rows" with 256 "cells" for characters. Each character is given a code position using multiple octets. In the 4-octet form the first octet identifies the group, the second the plane, the third the row containing the character and the fourth its cell number; in the 2-octet form the first octet identifies the row and the second the cell.
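
The group/plane/row/cell structure can be sketched in Python as follows. The function names are hypothetical; the comparison with Python's `utf-32-be` and `utf-16-be` codecs is an assumption that holds for the BMP characters used here.

```python
# Sketch: split a UCS code position into its four octets.

def ucs_position(code_point):
    """Return (group, plane, row, cell) for a UCS code position."""
    if not 0 <= code_point <= 0x7FFFFFFF:      # 128 groups of 2**24 positions
        raise ValueError("outside the UCS-4 code space")
    return (code_point >> 24, (code_point >> 16) & 0xFF,
            (code_point >> 8) & 0xFF, code_point & 0xFF)

def ucs4(code_point):
    """4-octet canonical form."""
    return bytes(ucs_position(code_point))

def ucs2(code_point):
    """2-octet form, valid only in group 0, plane 0 (the BMP)."""
    group, plane, row, cell = ucs_position(code_point)
    if group or plane:
        raise ValueError("not in the Basic Multilingual Plane")
    return bytes((row, cell))

print(ucs_position(ord("中")))        # (0, 0, 78, 45)
print(ucs4(ord("中")).hex(), ucs2(ord("中")).hex())   # 00004e2d 4e2d
```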

The first 128 code positions of the Basic Multilingual Plane (BMP), the plane that can be encoded in 16 bits, are those of the ISO 646 International Reference Version (ASCII). The second half of the first row contains the characters used in ISO/IEC 8859-1, the Latin-1 character set. Other rows provide encoding for:

  • extended Latin characters
  • the International Phonetic Alphabet (IPA)
  • Greek (including accented characters, "monotoniko" and "polytoniko")
  • Cyrillic, Georgian and Armenian
  • Hebrew, Ethiopic
  • all four forms of Arabic characters (initial, medial, final and stand-alone)
  • Indic languages, mostly used on the Indian subcontinent (including Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Myanmar, Oriya, Sinhala, Tamil and Telugu)
  • Khmer, Lao, Mongolian, Thai, Tibetan
  • Chinese/Japanese/Korean (CJK) unified ideographs, radicals, letters and months; Bopomofo, Hangul syllables, Hiragana, Kangxi radicals, Katakana, Yi and Yi radicals
  • Cherokee, unified Canadian aboriginal syllabics
  • Ogham, Runic, Syriac, Thaana
  • currency symbols
  • mathematical symbols and operators and special character forms
  • box and line drawing characters, blocks and arrows
  • geometric shapes and Dingbats
  • special OCR characters used on cheques, Braille patterns
  • encircled characters and numbers
  • etc.
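
The relationship between the first row of the BMP and ISO/IEC 8859-1, described before the list above, means that Latin-1 data can be widened to the 2-octet form simply by prefixing each octet with a zero row octet. A minimal Python sketch, with a hypothetical function name and Python's `iso8859_1` and `utf-16-be` codecs used only as a cross-check:

```python
# Sketch: Latin-1 octets become 2-octet codes in row 0 of the BMP.

def latin1_to_ucs2(data):
    """Prefix every Latin-1 octet with the row octet 00."""
    return b"".join(bytes((0x00, octet)) for octet in data)

latin1 = "café".encode("iso8859_1")
print(latin1_to_ucs2(latin1).hex())   # 00630061006600e9
```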

The planes specified in Part 2 are:

  • Plane 1: SMP, Supplementary Multilingual Plane
  • Plane 2: SIP, Supplementary Plane for CJK Ideographs
  • Plane 14: GPP, General Purpose Plane

ISO/IEC 14651 defines a "reference comparison method" that allows programs to determine the relative order of two UCS strings. It also defines a Common Template Table that describes an order for all characters encoded in the first edition of ISO/IEC 10646-1 up to Amendment 7.

ISO/IEC Technical Report 14652 defines a general mechanism to specify cultural conventions, and formats for a number of specific cultural conventions in the areas of character classification and conversion, sorting, number formatting, monetary formatting, date formatting, message display, addressing of persons, postal address formatting, and telephone number handling.

Usage
This standard has become the basic coding form for all 16-bit and 32-bit computer systems.

Users of Internet Explorer 5, and XLink-aware XML browsers, can obtain more details about applications of ISO 10646 from our Diffuse Topic Map service.

Further details available from
ISO and national standards bodies.

Other references
Details of the Unicode standard, the repertoire and coding of which are identical to those of the ISO/IEC 10646 code set, can be obtained from http://www.unicode.org.

European Ordering Rules (for the Multilingual European Subsets of ISO/IEC 10646-1, CWA 13873:2000) have been published by CEN TC304 as ENV 13710:2000. More information can be found at: http://www.stadlar.is/TC304/EOR/eorhome.html
Requirements for String Identity Matching and String Indexing for ISO 10646 coded documents
OII Standards and Specifications Activity Report, July 1998
New languages to be covered in next edition
OII Standards and Specifications Activity Report, October 1998
Unicode Consortium report on EBCDIC-Friendly UCS Transformation Format
OII Standards and Specifications Activity Report, December 1998
Unicode 3.0 to be based on 2nd Edition of ISO 10646
OII Standards and Specifications Activity Report, August 1999
Unicode in XML and other Markup Languages
OII Standards and Specifications Activity Report, September 1999
Unicode 3.0 published
Information Management Standardization Activity, March 2000
Character Normalization in IETF Protocols
Information Management Standardization Activity, September 2000
ISO 10646-1:2000 published
Information Management Standardization Activity, October 2000
Unicode in XML and other Markup Languages
Information Management Standardization Activity, December 2000
Character Model for the World Wide Web 1.0
Unicode 3.1
Information Management Standardization Activity, January 2001
Unicode 3.1
Information Management Standardization Activity, March 2001
Use for e-business standardization
Electronic Commerce Interoperability Report, December 2001
Use within Character Model for the World Wide Web
Information Management Standardization Activity, December 2001
Unicode 3.2
Information Management Standardization Activity, June 2002
UTF-8 specification updated to conform to Unicode 3.2
Information Management Standardization Activity, October 2002


JIS X 0201

Expanded name
Japanese Industrial Standard Code for Information Interchange

Area covered
Interchange of Latin and Katakana characters

Sponsoring body
JISC - Japanese Industrial Standards Committee

Source documents
JIS X 0201:1976 (reaffirmed 1984) Code for Information Interchange (published in Japanese and English)

Characteristics/description
Provides 7-bit and 8-bit code sets for Latin characters (based on ISO 646) and the simple Katakana letters used to aid phonetic interpretation of Kanji ideograms. (Katakana is used for teaching Japanese children to read.)

In 7-bit environments the SO (0/14) and SI (0/15) codes are used to switch between the Latin and the Katakana code sets. In 8-bit environments the Katakana characters form the right-hand sector (10/1 to 13/15).
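
The SO/SI switching in 7-bit environments can be sketched in Python. The function name is hypothetical, and two simplifications are assumptions of the sketch: the Latin set is treated as plain ASCII, and the Katakana codes are mapped onto the Unicode halfwidth Katakana block starting at U+FF61.

```python
# Sketch: decode a 7-bit JIS X 0201 stream that uses SO/SI shifts.

SO, SI = 0x0E, 0x0F

def decode_jis_x0201_7bit(data):
    """Return text for a 7-bit stream; SO selects Katakana, SI selects Latin."""
    out = []
    katakana = False
    for byte in data:
        if byte == SO:
            katakana = True
        elif byte == SI:
            katakana = False
        elif katakana and 0x21 <= byte <= 0x5F:
            # Assumed mapping onto Unicode halfwidth Katakana.
            out.append(chr(0xFF61 + byte - 0x21))
        else:
            out.append(chr(byte))   # Latin set simplified to ASCII here
    return "".join(out)

print(decode_jis_x0201_7bit(b"A\x0e\x31\x0fB"))   # AｱB
```

Setting the high bit of a shifted Katakana code gives the corresponding 8-bit code, so no shift codes are needed in an 8-bit environment.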

Usage
Used to transfer Japanese information between early Japanese computer systems.

Further details available from
Japanese Industrial Standards Committee, c/o Standards Department, Ministry of International Trade and Industry, 1-3-1 Kasumigaseki, Chiyoda-ku, Tokyo 100, Japan.

Other references
Details of the JIS X 0201 code set can be obtained from http://www.dkuug.dk/i18n/charmaps/JIS_X0201.



JIS X 0202

Expanded name
Extension techniques for use with the Code for Information Interchange 

Area covered
Switching of Japanese character code sets

Sponsoring body
JISC - Japanese Industrial Standards Committee

Source documents
JIS X 0202 Extension techniques for use with the Code for Information Interchange (published in Japanese and English)

Characteristics/description
Japanese equivalent of ISO 2022.

Usage
Used on 8-bit Japanese word processors to call in multiple character sets.

Further details available from
Japanese Industrial Standards Committee, c/o Standards Department, Ministry of International Trade and Industry, 1-3-1 Kasumigaseki, Chiyoda-ku, Tokyo 100, Japan.


JIS X 0208/0212

Expanded name
Code of the Japanese Graphic Character Set for Information Interchange

Area covered
Interchange of Latin, Kanji, Hiragana and Katakana characters

Sponsoring body
JISC - Japanese Industrial Standards Committee

Source documents

  • JIS X 0208:1990 Code for the Japanese graphic character set for information interchange (published in Japanese and English)
  • JIS X 0212:1990 Code of the supplementary Japanese graphic character set for information interchange

Characteristics/description
Multiplane standard providing access to 6355 Kanji ideographs, 86 Katakana character and sound identifiers, 83 Hiragana character and sound identifiers, 52 Roman, 48 Greek and 66 Cyrillic letters, together with associated numeric, punctuation and line drawing codes.
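
Characters in a 94 by 94 code table such as JIS X 0208 are commonly addressed by row and cell number. The sketch below, using hypothetical helper names, shows how such a position maps to a 7-bit two-byte code and to its 8-bit EUC-JP form. The row and cell numbers are supplied as assumptions, since this entry does not list them, and Python's euc_jp codec is used only to check the result.

```python
def kuten_to_jis(row: int, cell: int) -> bytes:
    """Convert a row/cell position (1 to 94 each) to a 7-bit two-byte code."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in the range 1 to 94")
    return bytes([0x20 + row, 0x20 + cell])


def kuten_to_euc(row: int, cell: int) -> bytes:
    """Convert a row/cell position to EUC-JP by setting the top bit of each byte."""
    return bytes(b | 0x80 for b in kuten_to_jis(row, cell))


# Assumed sample positions, one from each of three scripts.
samples = {
    "Hiragana": (4, 2),
    "Greek": (6, 1),
    "Kanji": (16, 1),
}

for script, (row, cell) in samples.items():
    jis = kuten_to_jis(row, cell)
    char = kuten_to_euc(row, cell).decode("euc_jp")
    print(f"{script}: row {row}, cell {cell} -> {jis.hex()} -> {char}")
```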

Usage
Whilst providing access to only a portion of the extensive Japanese ideograph set, this standard is used by many Japanese word processing and general computing systems.

Further details available from
Japanese Industrial Standards Committee, c/o Standards Department, Ministry of International Trade and Industry, 1-3-1 Kasumigaseki, Chiyoda-ku, Tokyo 100, Japan.


JIS X 0213

Expanded name
7-bit and 8-bit double byte coded extended Kanji sets for information interchange

Area covered
Interchange of Kanji characters

Sponsoring body
JISC - Japanese Industrial Standards Committee

Source documents
JIS X 0213:2000 7-bit and 8-bit double byte coded extended Kanji sets for information interchange

Characteristics/description
JIS X 0213 specifies 11,223 characters and their bit combinations. It extends the coded character set of JIS X 0208 with an additional 4,344 characters. JIS X 0213 also defines two implementation levels, level 3 and level 4, in addition to the two defined in JIS X 0208.
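
The relationship between the two standards can be illustrated with a small Python sketch. It assumes Python's euc_jp and euc_jis_2004 codecs as stand-ins for encodings based on JIS X 0208 and JIS X 0213 respectively. The helper name and the choice of the circled digit one as an example of a JIS X 0213 addition are also assumptions.

```python
def encodable(text: str, codec: str) -> bool:
    """Return True if every character in text can be encoded with codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# A Kanji from JIS X 0208 keeps the same code in the extended set.
base = "亜"
print(base.encode("euc_jp") == base.encode("euc_jis_2004"))

# A character assumed to be among the JIS X 0213 additions.
added = "①"
for codec in ("euc_jp", "euc_jis_2004"):
    print(codec, encodable(added, codec))
```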

Usage
Used in Japanese word processors.

Further details available from
Japanese Industrial Standards Committee, c/o Standards Department, Ministry of International Trade and Industry, 1-3-1 Kasumigaseki, Chiyoda-ku, Tokyo 100, Japan.


OCR

Expanded name
Optical Character Recognition

Area covered
Coding of machine readable characters by defining the repertoire and the related glyphs 

Sponsoring body
ISO/IEC JTC1/SC31

Source documents

  • ISO 1073 Alphanumeric character sets for optical recognition
    • Part 1: Character set OCR-A -- Shapes and dimensions of the printed image
    • Part 2: Character set OCR-B -- Shapes and dimensions of the printed image
  • JIS X 9003:1980 Katakana character set for optical recognition
  • JIS X 9005:1979 Handprinted Katakana characters for optical character recognition
  • JIS X 9006:1979 Handprinted numerals for optical character recognition
  • JIS X 9007:1981 Handprinted alphabets for optical character recognition
  • JIS X 9008:1981 Handprinted symbols for optical character recognition
  • JIS X 9009:1991 Handprinted Hiragana characters for optical character recognition
  • JIS X 9010:1984 Coding of machine readable characters (OCR and MICR)

Characteristics/description
Limited character sets that are designed to be machine readable. OCR-A provides numbers and other characters needed for automated cheque handling. OCR-B allows alphabetic characters to be used in machine-readable data. The Japanese Industrial Standards (JIS) committee has a number of extensions for the recognition of Japanese characters and for the recognition of handwritten numbers, characters and symbols.

The revision decided upon in June 1995 by ISO/IEC JTC1/SC2, which would have extended the OCR-B glyph set to include a range of accented and related characters, has been cancelled. The transfer of the maintenance of ISO 1073 (together with ISO 1831 - Printing Specifications) from SC2 to SC31 was completed in May 2001.

CEN TC304 is working on an ENV that extends the normative OCR-B repertoire of ISO 1073-2 with the Euro sign and adds some of the European characters from the previous SC2 attempt to an informative annex. Once completed, this ENV is expected to be fast-tracked as a revision of the ISO standard.

Usage
Widely used to enable accurate machine scanning of information. With current progress in character recognition technologies, however, considerable flexibility is now available, particularly when scanning less critical information.

Further details available from
ISO and national standards bodies.

Other references

  • Proposed addition of Euro to OCR-B
  • OII Standards and Specifications Activity Report, July 1998


Other Character Sets

Area covered
Machine readable characters for non-European languages not covered elsewhere

Sponsoring body
Various national standards bodies

Source documents

  • CNA GB-2312-80 Code of Chinese ideograms for information interchange -- Basic set
  • CNA GB-7590-87 Code of Chinese ideograms for information interchange -- 4th supplementary set
  • CNA GB-8565-88 Coded character set for text communication
  • CNA GB-12345-90 Code of Chinese ideograms for information interchange -- Supplementary set
  • IS 13194:1991 Indic Script Code for Information Interchange (ISCII)
  • KS C5601-1992 Code for information interchange (Korean)
  • KS C5636-1993 Code for information interchange (Latin characters)
  • KS C5627-1991 Extension code sets for information interchange
  • MS 1362:1983 Jawi character set (Malaysian)
  • TIS 620:1990 Thai character codes for computers

Characteristics/description
Character sets whose use is normally specific to one or two countries.

Usage
Used within local markets. These sets often form the basis of the corresponding part of an ISO/IEC 10646 code plane.
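
The relationship between a national set and ISO/IEC 10646 can be sketched in Python. The example assumes that Python's gb2312, euc_kr and tis_620 codecs are acceptable stand-ins for encodings of GB 2312, KS C 5601 and TIS 620. The sample characters and the helper tis620_to_ucs are hypothetical illustrations, not part of any of these standards.

```python
# One sample character per national set (assumed examples).
SAMPLES = [
    ("gb2312", "中"),
    ("euc_kr", "가"),
    ("tis_620", "ก"),
]

for codec, ch in SAMPLES:
    raw = ch.encode(codec)
    print(f"{codec}: {raw.hex()} <-> U+{ord(ch):04X}")


def tis620_to_ucs(byte: int) -> str:
    """Map a TIS 620 byte to its UCS character by a fixed offset.

    The Thai block of the UCS keeps the ordering of TIS 620, so a single
    subtraction is enough. Undefined positions are not checked here.
    """
    if not 0xA1 <= byte <= 0xFB:
        raise ValueError("outside the TIS 620 graphic range")
    return chr(0x0E00 + (byte - 0xA0))


print(tis620_to_ucs(0xA1))
```

For the Thai set the national code order carries over directly, whereas the ideographic sets need a lookup table.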

Further details available from
The relevant national standards body

File last updated:
October 2002
The Diffuse Project is funded under the European Commission's Information Society Technologies programme. Diffuse publications are maintained by TIEKE (the Finnish Information Society Development Centre), IC Focus and The SGML Centre.