This article describes the new GB18030-2000 standard in terms of its evolution, structure, and properties. It then discusses how Sun has embraced this new standard, how the standard is supported in Sun's products, and Sun's official certification of compliance from the People's Republic of China (PRC).


GB18030-2000 is a new character set standard from the PRC that specifies an extended codepage and a mapping table to Unicode.

On March 17, 2000, the Chinese government issued regulations mandating that all operating systems on non-handheld computers sold in the PRC after January 1, 2001 would have to comply with the new multibyte GB18030-2000 standard. However, the initial implementation deadline of January 1, 2001 was later postponed until September 1, 2001.

Evolution of GB18030-2000

All character set standards that originate in the PRC have designations that begin with "GB". GB is an abbreviation for Guojia Biaozhun, meaning "national standard". The GB 2312-1980 character set standard was established in 1981 to represent simplified Chinese characters. GB 2312-1980 is a coded character set that contains 7,445 characters, including 6,763 Hanzi and 682 non-Hanzi characters. With the release of ISO 10646-1 in 1993 (the edition that corresponds to Unicode 1.1), the PRC expressed its fundamental consent to support the combined efforts of the ISO/IEC and the Unicode Consortium by publishing a Chinese National Standard that was code- and character-compatible with ISO 10646-1. This standard was named GB 13000.1. Whenever the ISO and the Unicode Consortium changed or revised their common standard, GB 13000.1 subsequently adopted these changes.

To accommodate all additional Hanzi characters specified in GB 13000.1 that are not included in GB 2312-1980, a new specification known as GBK was then introduced. GBK is an abbreviation for "Guojia Biaozhun Kuozhan" ("national standard extension"), and the title of the specification translates as "Rules/Specifications defining the extensions of internal codes for Chinese ideograms". GBK is an extension of GB 2312-1980. Its key property is that it leaves the characters and codes defined in GB 2312-1980 untouched and positions all additional characters around them. The additional characters are mainly those of the Unified Han portion of Unicode 2.1 that go beyond the character repertoire of GB 2312-1980. Thus, code and character compatibility between GBK and GB 2312-1980 is ensured while, at the same time, the complete Unicode Unified Han character set is made available. When GBK was defined, it also added some characters that were not yet available in Unicode.

GBK defines 23,940 code points containing 21,886 characters. GBK also provides mappings to the code points of Unicode 2.1. However, because the GBK code space is so densely packed, it became obvious that there was no space left for a major addition. The 1,894 code points of GBK's three user-defined areas were not even close to providing sufficient space for the CJK Unified Ideographs Extension A, which defines 6,582 new characters in plane 0, the Basic Multilingual Plane (BMP), of Unicode 3.0.

Therefore, GB18030-2000 was created as an update of GBK for Unicode 3.0 with an extension that covers all of Unicode. It is fully backward-compatible with GB 2312-1980 and GBK. The mapping table from GB18030-2000 to Unicode is backward-compatible with the mapping table from GB 2312-1980 to Unicode. The GBK to Unicode table, however, has a few differences, because GBK contains characters that were not defined in Unicode 2.1 but were added in later versions of Unicode.

GB18030-2000 specifies a mapping table that covers all Unicode code points and maintains compatibility of GB-encoded text with GBK and GB 2312-1980.
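
The backward compatibility described above can be checked with a short sketch. It assumes Python's built-in gb2312, gbk, and gb18030 codecs, which are not part of the standard's text or of any Sun product, and the sample strings are chosen here purely for illustration.

```python
# Text made of ASCII plus Hanzi from the GB 2312-1980 repertoire.
sample = "GB 中文字符"

# Encode the same text with each of the three codecs.
encoded = {name: sample.encode(name) for name in ("gb2312", "gbk", "gb18030")}

# Compatibility means all three produce identical bytes for this text.
same_bytes = encoded["gb2312"] == encoded["gbk"] == encoded["gb18030"]

# A traditional Hanzi that GB 2312-1980 lacks but GBK added.
extended = "龍"
try:
    extended.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

# GB18030 keeps the two-byte GBK code for this character.
gbk_matches_gb18030 = extended.encode("gbk") == extended.encode("gb18030")

print(same_bytes, in_gb2312, gbk_matches_gb18030)
```

Text that was valid under the older standards therefore needs no re-encoding; only characters outside their repertoires take on new codes.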

Properties of GB18030-2000

GB18030-2000 has the following significant properties:

  • It incorporates Unicode's CJK Unified Ideographs Extension A completely.
  • It provides code space for all used and unused code points of Unicode's plane 0 (BMP) and its 16 additional planes. While remaining a code- and character-compatible "superset" of GBK, GB18030-2000 also provides space for all remaining code points of Unicode. Thus, it effectively creates a one-to-one relationship between parts of GB18030-2000 and Unicode's complete encoding space.
  • In order to accomplish the Unihan incorporation and code space allocation for Unicode 3.0, GB18030-2000 defines and applies a four-byte encoding mechanism.

GB18030-2000 encodes characters in sequences of one, two, or four bytes. The following are valid byte sequences (byte values are hexadecimal):

  • Single-byte: 0x00-0x7f
  • Two-byte: 0x81-0xfe + 0x40-0x7e, 0x80-0xfe
  • Four-byte: 0x81-0xfe + 0x30-0x39 + 0x81-0xfe + 0x30-0x39
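
The ranges listed above are enough to split a byte stream into sequences without any mapping table. The following is a minimal sketch of that idea; the function names `sequence_length` and `split_sequences` are hypothetical and do not come from the standard or from any library.

```python
def sequence_length(data: bytes, pos: int = 0) -> int:
    """Return 1, 2, or 4 if a valid sequence starts at pos, otherwise 0."""
    b1 = data[pos]
    if b1 <= 0x7F:
        return 1                      # single-byte portion
    if not 0x81 <= b1 <= 0xFE or pos + 1 >= len(data):
        return 0                      # bad leading byte or truncated input
    b2 = data[pos + 1]
    if 0x40 <= b2 <= 0x7E or 0x80 <= b2 <= 0xFE:
        return 2                      # two-byte portion
    if 0x30 <= b2 <= 0x39 and pos + 3 < len(data):
        b3, b4 = data[pos + 2], data[pos + 3]
        if 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39:
            return 4                  # four-byte portion
    return 0


def split_sequences(data: bytes):
    """Split data into a list of one-, two-, and four-byte sequences."""
    sequences = []
    pos = 0
    while pos < len(data):
        length = sequence_length(data, pos)
        if length == 0:
            raise ValueError(f"invalid byte sequence at offset {pos}")
        sequences.append(data[pos:pos + length])
        pos += length
    return sequences


print(split_sequences(b"A\xd6\xd0\x81\x30\x81\x30"))
```

The second byte alone decides between the two-byte and four-byte forms, because 0x30 through 0x39 never appear as a trailing byte in the two-byte portion.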

The single-byte portion applies the coding structure and principles of the standard GB 11383 (identical to ISO 4873:1986) by using the code points 0x00 through 0x7f.

The two-byte portion uses two eight-bit binary sequences to express a character. The code points of the first (leading) byte range from 0x81 through 0xfe. The code points of the second (trailing) byte range from 0x40 through 0x7e and from 0x80 through 0xfe.

The four-byte portion uses the code points 0x30 through 0x39, which are vacant in GB 11383, in the second and fourth byte positions as an additional means to extend the two-byte encodings. The resulting four-byte codes range from 0x81308130 through 0xfe39fe39.
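
One way to work with the four-byte range is to treat each code as a mixed-radix number with 126, 10, 126, and 10 possible values per byte, which gives every code a linear index. The sketch below illustrates this; the function names are hypothetical, and the starting point 0x90308130 for U+10000 is an assumption taken from the published GB18030 mapping rather than from this article.

```python
def four_byte_to_index(seq: bytes) -> int:
    """Linear position of a four-byte code, counting from 0x81308130."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def index_to_four_byte(index: int) -> bytes:
    """Inverse of four_byte_to_index."""
    index, d4 = divmod(index, 10)
    index, d3 = divmod(index, 126)
    d1, d2 = divmod(index, 10)
    return bytes([0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4])


FIRST_INDEX = four_byte_to_index(b"\x81\x30\x81\x30")
LAST_INDEX = four_byte_to_index(b"\xfe\x39\xfe\x39")

# Assumption: the 16 supplementary planes start at 0x90308130 and run linearly.
SUPPLEMENTARY_BASE = four_byte_to_index(b"\x90\x30\x81\x30")


def supplementary_to_gb18030(code_point: int) -> bytes:
    """Encode a code point from U+10000 through U+10FFFF as four bytes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    return index_to_four_byte(SUPPLEMENTARY_BASE + code_point - 0x10000)


print(FIRST_INDEX, LAST_INDEX, supplementary_to_gb18030(0x10000).hex())
```

Because the supplementary planes map linearly, no lookup table is needed for them; a table is only required for the BMP characters placed in the four-byte range.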

GB18030-2000 has 1.6 million valid byte sequences, but there are only 1.1 million code points in Unicode, so there are about 500,000 byte sequences in GB18030-2000 that are currently unassigned.
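
These figures follow directly from the byte ranges given earlier. The arithmetic can be sketched as follows; the variable names are illustrative only, and the Unicode total is taken as 17 planes of 65,536 code points each.

```python
# Number of values each kind of byte can take.
single = 0x7F - 0x00 + 1                          # single-byte codes
lead = 0xFE - 0x81 + 1                            # leading or third byte
trail = (0x7E - 0x40 + 1) + (0xFE - 0x80 + 1)     # two-byte trailing byte
digit = 0x39 - 0x30 + 1                           # second or fourth byte

two_byte = lead * trail
four_byte = lead * digit * lead * digit
total = single + two_byte + four_byte

# Plane 0 plus 16 additional planes, 65,536 code points each.
unicode_code_points = 17 * 0x10000
unassigned = total - unicode_code_points

print(two_byte, four_byte, total, unicode_code_points, unassigned)
```

The two-byte count of 23,940 matches the number of code points that GBK defines, and the difference of 497,556 is the "about 500,000" unassigned sequences.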

Sun Product Support

Sun has been among the first to embrace the new GB18030-2000 standard and has been working closely with the Chinese government in further defining GB18030-2000 and ensuring that Sun's products comply with the new standard.

On a platform level, Sun's Solaris Operating Environment has a single-source, internationalized framework that is extensible and codeset-independent. This allows Sun to overcome any implementation difficulties. Sun has globalization APIs such as mbtowc(), mbstowcs(), and mblen() to help convert GB18030-2000 multibyte strings to wide character format. This helps applications to easily process GB18030-2000 text files. Sun also provides an iconv code conversion module to convert GB18030-2000 multibyte strings to Unicode.
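
As an illustration only, the kind of conversion that an iconv module performs can be sketched with Python's built-in gb18030 codec. This stands in for the Solaris APIs named above and is not Sun's implementation; the sample bytes are chosen here for the example.

```python
# GB18030 bytes: ASCII text followed by two two-byte Hanzi.
raw = b"GB18030: \xd6\xd0\xce\xc4"

# Multibyte to Unicode, comparable in spirit to mbstowcs() or iconv.
text = raw.decode("gb18030")

# Unicode to another encoding, here UTF-8.
utf8 = text.encode("utf-8")

# A four-byte sequence decodes to a single character as well.
four_byte_char = b"\x81\x30\x81\x30".decode("gb18030")

print(text, len(raw), len(text), len(utf8))
```

After decoding, the application works with characters rather than bytes, so the mix of one-, two-, and four-byte sequences no longer matters to string processing.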

On an application layer, Sun products such as Java will have their own support for GB18030-2000 (Java 1.4 has been certified with an A+ rating). Sun ONE products use libraries for character set processing. In order to fully support GB18030-2000, these libraries are being updated. Sun ONE applications will pick up the platform and library updates as new versions are released.

The framework for the GB18030-2000 standard has been finalized; however, the Chinese government continues to refine the requirements for this standard.

Sun Awarded Highest Compliance Rating

To date, the PRC has specified three main requirements that products must meet in order to fully comply with the GB18030-2000 standard:

  1. Products must be able to correctly identify and output all the characters defined in GB18030-2000, including Minority characters such as Mongolian, Tibetan, Yi, and Wei. For the Minority characters, it is sufficient to support the code points, which means that the Minority characters are legal in the system, but there are no fonts to support output.
  2. Products must be able to correctly edit and process all the characters defined in GB18030-2000, except for the Minority characters.
  3. Products must comply with requirements 1 and 2 above and also be able to edit and process all the Minority characters.

Sun has satisfied the PRC's current GB18030-2000 requirements and is among the first to be officially certified for compliance with GB18030-2000. In fact, the Sun Solaris 8 Operating Environment received an A+ rating, the highest possible rating from the PRC government for GB18030-2000 compliance. As the standard evolves, Sun will continue its efforts to track and meet any new requirements that are set.

