Unicode in Japan
Guide to a technical and psychological struggle
This is not a final version and probably contains numerous stylistic and factual disasters. Please correct my faults!
Purpose of this Document
The purpose of this document is to provide background information for the discussion of Unicode in the context of Japanese information processing. Because this has become an emotional subject for some people, misinformation has become common and it's hard to avoid heated debates on topics like 'Why Unicode can never ever work' or 'Why Unicode is the answer to all life's problems'. Hopefully this guide can help distinguish fact from dogma -- and it could also provide useful ammunition, whatever side you want to argue on.
This Unicode Tutorial might be useful for terminology and general information about character sets.
This document probably contains many errors and omissions. Please send corrections to me to help me improve it.
When this document uses language such as 'it is generally considered', it means that as far as I can determine the majority of people with no particular axe to grind are in agreement. It doesn't mean that I'm stating my personal opinion -- when I'm doing that, I label it accordingly.
I use the expression 'Japanese Unicode' or 'the Japanese part of Unicode' in a vague sense to mean those aspects and features of Unicode that are relevant to representing Japanese, especially the Han character areas of the Unicode character set.
Raw Materials -- pre-Unicode Japanese Character Sets
Before the arrival of Unicode on the scene, the Japanese government produced various standard lists of characters for various different purposes. Three government departments (the ministries of industry, culture and justice) have been involved with creating character sets. In order to understand the decisions made by these departments, it is necessary to bear in mind that the Japanese language was dramatically simplified and reorganized after World War 2, and for some decades thereafter the aim of Japanese language standards was to change and simplify the language, not to describe it.
During the 19th century, the number of kanji required for literacy in Japan was perhaps about 4,000. Even at that time, there were many people calling out for a rationalization and pruning of the writing system. In the early 20th century, the Ministry of Education (now the Ministry of Culture) issued a list of common kanji and a new kana system. Newspapers also announced their own plans of restricting kanji to some sensible subset (although these subsets appear very large and baroque by modern standards). However, opposition from traditionalists effectively postponed reform until after World War 2. In 1945, the Yomiuri newspaper announced that the abolition of kanji would now finally be possible, which at the time wasn't too extreme a position -- others were advocating the total abandonment of the Japanese language!
Japanese character sets as we know them, therefore, have arisen from a background of rapid change and strong reformism.
Since 1946, the Ministry of Culture has produced various lists of kanji (these standards were not intended for electronic use, so they are 'lists' alone without any encoding information). The first was the 'Toyo kanji list', a set of 1,850 commonly-used characters. This list abandoned the traditional phonetic readings of many characters, which enraged and confused many users. Adoption of the new standard in official publications was enforced anyway, but many issues were unresolved.
Partly because of the mixed reception of the Toyo list, later lists such as the Joyo ('everyday use') list of 1981 have come as 'recommendations', followed to varying degrees by different organizations. The 1,945 characters of the Joyo list form a core group of kanji that seems unlikely to change much and which is included in every Japanese character set for computer use. However, there are several very common kanji that are not in Joyo.
The second main list of kanji referred to today is the Gakushuu ('educational') list. This is a list of 1,006 kanji divided into levels based (approximately) on frequency. This list specifies the minimum number of kanji to be known by students at various school levels. It is a subset of the Joyo list and therefore a subset of all common Japanese coded character sets.
In the immediate postwar period, the range of kanji used in newspapers and official publications was strongly restricted, by government-imposed guidelines, to the Joyo repertoire. This is no longer the case, now that reformism has died down, and a recent study suggested that only about 90% of the kanji occurring in a sample of major daily newspapers were Joyo kanji.
The number of kanji (and other characters) in common use in Japan appears to have been increasing since the 1980s, and some would argue that the Joyo and Gakushuu sets are less relevant now than they used to be. However, they are the nearest thing to a definition of 'commonly used kanji' and are likely to remain so for a long time.
While the Ministry of Culture has been responsible for recommending kanji repertoires for education and newspapers, electronic character sets are the responsibility of the Ministry of Economy, Trade and Industry. This Ministry is responsible for Japanese Industrial Standards (JIS), which include standard coded character sets.
In 1969, the JIS C 6220 standard was created, specifying two sets of characters: a localized ASCII (yen sign instead of backslash) and a set of half-width katakana, suitable for writing basic Japanese on dot-matrix era devices. The 'C' in the name refers to electronics; the standard was later renamed to JIS X 0201, where 'X' refers to information technology.
JIS X 0208 (originally JIS C 6226) of 1978 was the first JIS character set to include kanji. It specified 6,335 kanji, arranged by frequency into two levels. The arrangement of the JIS X 0208 kanji does not seem to be connected with that of the Joyo kanji. The initial release of this standard was extremely problematic, and it wasn't until the 1983 revision that it was reasonably error-free. The selection of kanji was also odd: the character set was assembled from various existing proprietary sets, but prewar forms and other variants were suppressed, while many unusual variants were included in order to write personal and place names -- although not enough to write all personal names. Many bizarre mistakes were made in transcribing names, resulting in several new kanji coming into existence. The result was that JIS X 0208 was a slowly-adopted and controversial standard, which competed with a number of proprietary character sets (e.g. Fujitsu's JEF).
The 1983 revision, while correcting most of the errors, also replaced some kanji present in the actual language with simplified forms that existed nowhere outside the Ministry. Also, because JIS X 0208 had been made without reference to either the Joyo (see above) or Jinmei (see below) lists, which were growing, various new characters had to be added. During the 80's the National Language Council (part of the Agency for Cultural Affairs, nothing to do with the Ministry of Economy, Trade and Industry) was called in to solve some of the problems that had arisen from JIS X 0208. This example of liaison between two unrelated departments was considered a great step forward.
The problematic nature of JIS X 0208 created many problems for Unicode. The needs of round-trip compatibility meant that some kanji which had been split into two variants in JIS X 0208 had to be given two Unicode code points, even though almost everyone would agree that the 'variants' are just slightly different glyphs (e.g. U+5294 and U+5292). Unification was also made harder by the need to work out which JIS X 0208 characters were really unique and which were just very eccentric forms of ordinary characters.
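The consequence for conversion can be checked with a short sketch. It assumes that Python's built-in `euc_jp` codec is an acceptable stand-in for a JIS X 0208 implementation; the variable names are mine and the snippet is illustrative only.

```python
# Sketch: two variants that JIS X 0208 keeps apart must also stay apart in Unicode.
# Assumption: Python's built-in "euc_jp" codec stands in for a JIS X 0208 implementation.
VARIANTS = ["\u5294", "\u5292"]  # the pair cited in the text above


def legacy_bytes(ch: str) -> bytes:
    """Encode one character into its legacy (EUC-JP) byte sequence."""
    return ch.encode("euc_jp")


encoded = [legacy_bytes(ch) for ch in VARIANTS]

# Each variant has its own legacy code, so neither can be folded into the other
distinct_legacy_codes = len(set(encoded)) == len(VARIANTS)

# Converting legacy -> Unicode -> legacy gives back the same bytes
round_trips = all(b.decode("euc_jp").encode("euc_jp") == b for b in encoded)

for ch, b in zip(VARIANTS, encoded):
    print(f"U+{ord(ch):04X} -> {b.hex()}")
print(distinct_legacy_codes, round_trips)
```

If Unicode had merged the two, one of the two legacy codes would have had nothing distinct to convert to, and the second check could not hold.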
JIS X 0212, the 'hojokanji' (auxiliary kanji) list, was created in 1990 to silence the complaints about kanji missing from JIS X 0208. Unfortunately, by this time the encodings (e.g. Shift-JIS) that had already become widespread did not have room for the 5,801 kanji of the new set. Instead people were adding the kanji they needed into their own particular Shift-JIS implementations, a tendency which continues to this day. Although barely ever used, JIS X 0212 was quite controversial in the way it 'modernized' some kanji and created ambiguous variants which appear to cover a range of real-world variants.
In 1995, JIS X 0221 was defined. It was basically a Japanese translation of ISO/IEC 10646-1:1993, i.e. identical to Unicode 1.1. However, Unicode 1.1 was not used widely (if at all). Subsequent revisions of JIS X 0221 will presumably track the development of Unicode.
In 1996/7, JIS X 0208 was reviewed very effectively under the guidance of Kouji Shibano, with several results:
- A revision of JIS X 0208, reassigning the code points of several hundred kanji while only adding two new ones.
- Publication of the JIS Kanji Jiten, a tome explaining the JIS kanji in depth.
- The identification of hundreds of necessary kanji beyond those in JIS X 0208.
The newly identified kanji were collected into JIS X 0213, a standard intended both to satisfy those discontented with the JIS repertoire and to help get the new kanji into Unicode 3.1. This new JIS standard had many advantages: it was easy to integrate with Shift-JIS encoding, it was well-researched, and it was developed in close co-operation with Unicode. Despite these properties, it is rarely encountered.
There was a certain amount of disagreement as to whether JIS X 0213 was intended as a step on the way to Unicode, or as an alternative to Unicode. Unicode has round-trip compatibility with JIS X 0213 and is a superset of it, so the main reason to use JIS X 0213 now is probably compatibility with Shift-JIS encoding and Shift-JIS based fonts (which are still numerous). Some US companies refused to support JIS X 0213 until it could be made round-trip compatible with Unicode; whether this was dutiful attention to international standards or neo-colonial bullying is occasionally debated.
The relative obscurity of JIS X 0212, the tendency to just use Unicode 3.1 instead of JIS X 0213, and the deficiencies of JIS X 0208 mean that at the moment, when a system is said to use "JIS", a proprietary extension of JIS X 0208 is often what is meant.
Around the mid 90s the general anger of the Japanese IT and typography community seems to have shifted away from JIS and toward Unicode, which perhaps reflects the increasing irrelevance of new JIS standards. The accusations made against Unicode often seem like a slightly expanded version of those made toward JIS (i.e. issues relating to the unification of Japanese with non-Japanese Han characters were added).
The Ministry of Justice became involved in the character set mudwrestle because the job of deciding what constitutes a valid personal name goes to the Ministry of Justice. The Jinmei kanji list (jinmeikanjiyou) is the list of kanji which may appear in given names.
Originally, the Jinmei list was determined without reference to any other standard list of kanji, and had no particular relationship to any other list. However, by 1983 it was a subset of the JIS X 0208 character set.
The original Jinmeiyou list was widely considered to be too narrow and it has been expanded at various points in the past. This reflects not only the discovery of various reasonable names that are not on the list, but also a gradual erosion of the language-reforming zeal of the Japanese government.
In 2001, after the list had remained static for some years, a minor squabble over a curious name gradually developed into a reform movement. At issue was whether the kanji U+8235, 'rudder', could form part of a name. The Ministry eventually decided on a new revision of the Jinmei list, more extensive than any before, and made the following statement:
It is expected that the Jinmei kanji will at least double, and if possible increase to about the level of 1,000. Informal and incorrect characters will, as before, be forbidden but we will base our decisions on the following attitude: Except for very difficult kanji, everything that is requested should be allowed.
In this remark (my rough translation), I think we see the final disappearance of the reformist enthusiasm of the 40s and 50s.
Since this liberalization, hundreds of new name kanji have been suggested ('strawberry' seems to be a common one). Most of these are already JIS kanji but some are not and there has been speculation -- just speculation -- that a new Joyo revision and a new JIS revision may be necessary if the Ministry accepts exotic kanji variants.
There is one interesting property of Japanese names that, while not directly relevant, sometimes gets thought of as a character set issue. Most Japanese people have a hanko, a seal which has the individual's name carved on it and works like a signature. To be valid on legal documents, a hanko must have a certain level of complexity and uniqueness. The same variants of the same characters written in the same style still won't count as a signature; the exact precise glyph (including wear and damage) that appears on the hanko is the one that constitutes the individual's signature. Therefore, not merely a character and a variant but an actual glyph is recorded for many Japanese people's names -- a unique situation. Luckily, character sets are not concerned with particular glyphs (except possibly Mojikyo) so this issue does not affect us.
How Unicode Was Made
Many fascinating problems were addressed in the creation of Unicode, but two of the design considerations are especially relevant here:
- Han Character Unification
- Round-Trip Compatibility with existing character sets
Between them, these two aspects of Unicode's design account for the majority of the problems people have with the Unicode character set. Unfortunately, although both of them are essential considerations, in cases such as Japanese kanji they have often proved difficult to reconcile.
What is Han Character Unification?
When compiling a character repertoire from several existing repertoires, you have two strategies:
You can unify the character repertoire, in which case characters from existing sets are examined, and if two or more are felt to represent the same abstract 'real life' character, they are represented with only one character in the new set.
You can choose not to unify the character set, in which case every single character that you take from a source set becomes one distinct character in the set you build.
A non-unified character set is easy to build: you just add all the characters from existing sets together to make your new set. In practice, though, this results in a very large set with a lot of duplicates. For instance, if you were making a character set for Europe, you wouldn't include a capital 'A' for every single European language; instead you would unify the 'A's found in the various national character sets into a single 'A'.
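The difference between the two strategies can be sketched in a few lines. The source sets below are invented for the example, and 'the same character' is reduced to plain string equality, which is far cruder than the judgement calls real unification involves.

```python
# Toy illustration of the two strategies; the source sets below are invented.
SOURCE_SETS = {
    "french": ["A", "B", "É"],
    "german": ["A", "B", "Ä"],
    "english": ["A", "B"],
}


def build_non_unified(sources):
    """Every (set, character) pair becomes its own entry in the new set."""
    return [(name, ch) for name, chars in sources.items() for ch in chars]


def build_unified(sources):
    """Characters judged to be 'the same' share a single entry.

    Here 'the same' simply means string equality.
    """
    unified = []
    for chars in sources.values():
        for ch in chars:
            if ch not in unified:
                unified.append(ch)
    return unified


non_unified = build_non_unified(SOURCE_SETS)
unified = build_unified(SOURCE_SETS)
print(len(non_unified), len(unified))  # 8 entries versus 4
```

The non-unified set carries three separate 'A's and three separate 'B's; the unified set carries one of each.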
Although unification is important in all areas of Unicode, it is most important with Han characters (i.e. Chinese-derived ideograms such as Japanese kanji) because their sheer number means that it is impossible to keep a separate set for all the languages that use them. Han characters can have any of the following forms:
- Traditional Chinese, as used in Taiwan and in older documents elsewhere.
- Simplified Chinese: thousands of common Han characters have simplified forms used in mainland China.
- Japanese Kanji: Japanese uses thousands of Han characters, often with forms subtly different from the Chinese original.
- Korean Hanja: Korean uses a limited number of Han characters on a day-to-day basis, and a much larger number in historical/literary contexts. Again, the Korean form often differs slightly from the Chinese original.
- Vietnamese Chu Han: Vietnamese was once written with Chinese characters and special forms evolved for this purpose can still be required.
- Historical Forms: Chinese 'seal script' and Japanese cursive calligraphy are two examples of situations where particular variants of a character, not part of the common language, are often used. Most people would say that these 'artistic' variants are not separate characters, in the same way that a complicated medieval manuscript 'a' is not a different character from a regular printed 'a'.
In practice, Japanese kanji often have a recognizably different pre-war and a post-war form. It could well be argued that these forms are just typographic differences, i.e. differences in representation, not in the underlying character. It could equally well be argued that the creator of a document often wants to use the pre-war form and not the post-war, or vice versa, and that they therefore have to be kept separate. They are not generally kept separate in Unicode except in particular cases where a JIS standard keeps them separate.
Ideally, if unification were perfect, Unicode would provide a single character, having a single code point, and the display system would come up with an appropriate Traditional Chinese, Simplified Chinese, Japanese, Korean, or Vietnamese glyph to represent it. However, unification has not been perfect...
Issues With Han Character Unification
Character unification is difficult at the best of times, and the Unicode Han character repertoire often seems to reflect expediency and personal opinion more than academic value. For instance, some characters are completely un-unified; the three characters for 'spirit' or 'mind', U+6C14, U+6C23 and U+6C17, are retained as separate characters because although they are the typical Simplified Chinese, Korean and Japanese versions of the same underlying character, they are visually distinct, very common, and appear separately in several pre-Unicode character sets (such as JIS X 0212). In other cases, however, groups of variants that many would think of as quite separate have been unified and stuck on the same code point; U+9AA8 is a frequently-cited example of this.
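The un-unified case is easy to inspect with Python's standard `unicodedata` module. This is only a quick check, not a statement about how Unicode itself was compiled; the region labels follow the order given in the paragraph above.

```python
import unicodedata

# The three code points cited above for the 'spirit' character.
SPIRIT = {
    "Simplified Chinese": "\u6c14",
    "Korean": "\u6c23",
    "Japanese": "\u6c17",
}

for region, ch in SPIRIT.items():
    print(region, f"U+{ord(ch):04X}", unicodedata.name(ch))

# Compatibility normalization leaves each of them untouched, so no standard
# folding step will merge them on the application's behalf
unchanged = all(unicodedata.normalize("NFKC", ch) == ch for ch in SPIRIT.values())
all_distinct = len(set(SPIRIT.values())) == 3
print(unchanged, all_distinct)
```

A search for one of the three will therefore not find the other two unless the application supplies its own variant table.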
When unification is over-enthusiastic, and variants that are really thought of as separate are assigned the same code point, an unfortunate sequence of events tends to take place:
- Considering, for instance, the Han character for 'bone', Unicode scholars decide that there is one basic character involved, and assign it a code point, U+9AA8
- Users are aghast to find that their national variant of the character is not displayed on their screens, because there is no way to specify the particular variant of the character needed (and their font usually just displays a variant close to the reference glyph provided by Unicode).
- Hair-pulling ensues and new characters are assigned for local variants; in this case U+3947 for Simplified Chinese, U+586C for Traditional Chinese, U+397C for Japan and U+4D69 for Korea. In some cases this step happens only long afterward and the local variant has to be represented with surrogate code points. Note: This is not a special problem just for Asian characters. It is analogous to what Unicode does for ASCII: punctuation marks that have local standard forms are present as the 'general' ASCII form and as 'specific' forms, e.g. the hyphen (U+002D) which is specialized in U+2010 'hyphen', U+2011 'nonbreaking hyphen', U+2013 'en dash' and so on.
- The unlucky creator of text now has to decide whether they mean 'bone' in general or whether they mean 'bone, the Japanese version of it in particular', bearing in mind that the display system may or may not know what to do with the regional versions. Characters now exist whose range of meanings overlap or contain each other. This is inelegant and creates work.
In theory, the users who were upset because they saw the wrong region's variant of a character could have benefited from the implementation of plane 14 language tags, special characters which indicate a preference for rendering in the style of one particular region or another. However, these language tags have never actually been implemented -- and purists would say that meta-information like that has no business in a 'character set' anyway.
Often, people who favor 'aggressive' unification of the sort that produced U+9AA8 argue that there is just one character involved and that what is annoying the users is just a glyph difference, which should be sorted out at the display level. Unicode was generally built on the following paradigm:
One Character has many Glyphs
However, for Han characters it is often better to think of it this way:
One Character has one or more Variants, which have many Glyphs
Unfortunately, many of those arguing in favor of Unicode have, historically, been non-users of Han characters who have tended to just assume that differences between character variants are generally insignificant display-level details, like differences between fonts. Some Asian character sets, notably the Taiwanese CCCII, have been cleverly built to take variants into account, but this is not a widespread paradigm.
A point worth noting is that sometimes, a pair of variants are interchangeable in one context but considered different in another context. For instance, the characters U+9AD8 and U+9AD9 would be considered interchangeable by all but the most finicky when used in Japanese text, but when used in a family name the difference is taken seriously.
In general, it is by no means easy to judge when a difference is between two characters, or between two variants of one character, or between two visual representations of one variant, and there is nowhere you can draw the line without irritating someone somewhere. Unicode has often drawn the line very aggressively (relegating entities that are very different in some situations to the status of mere glyphs), and then redrawn it later, and the result can be confusing for particular characters.
This page at Mojikyo (Japanese) describes this issue from the point of view of the Mojikyo organization, which tends to go the exact opposite way to Unicode and counts even glyph variants as separate characters. I can't find a useful English discussion on the subject.
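For readers curious about the plane 14 language tags mentioned earlier in this section, the following sketch shows what a tagged string looks like at the code point level. The helper names are hypothetical. As far as I know, the Unicode Standard now discourages this use of tag characters, so treat this as an illustration of the idea and not as a recommended practice.

```python
TAG_BASE = 0xE0000           # tag characters mirror ASCII at this offset in plane 14
LANGUAGE_TAG = "\U000E0001"  # the character that introduces a language tag


def language_tag(code: str) -> str:
    """Build the invisible tag sequence for an ASCII language code such as 'ja'."""
    if not code.isascii():
        raise ValueError("language codes must be ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(c)) for c in code)


def strip_tags(text: str) -> str:
    """Remove all plane 14 tag characters, leaving only the visible text."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)


# 'bone' (U+9AA8) marked as Japanese
tagged = language_tag("ja") + "\u9aa8"
print([f"U+{ord(c):04X}" for c in tagged])
```

The tag adds three invisible code points in front of the single visible one, which gives some idea of why purists regard it as markup smuggled into the text stream.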
What is Round-Trip Compatibility?
In order to get people to use Unicode, it had to work well with existing character sets. Specifically, it had to be possible to convert text in a legacy character set to Unicode and back without losing any information. In other words, Unicode had to have round-trip compatibility with all the major existing character sets.
Unfortunately, the need for round-trip compatibility often means that the quirks of every existing character set are perpetuated in a new set. For instance, to have round-trip compatibility, the new character set must have the following property: for each character in a given legacy set, there must be a character in the new set that represents that legacy character and no other character from that legacy set. This often makes unification very difficult. Suppose the legacy character set has two slightly different 'a' characters. You would like to unify all the 'a's into one single 'a' in your new character set, but you won't be able to, because when converting back to the legacy character set, you wouldn't know which 'a' to use. Thus, your new set must contain an extra character that has no place in your repertoire of 'real' characters, but is needed for round-trip compatibility.
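The property described above amounts to requiring that the legacy-to-new mapping never sends two legacy characters to the same target. A minimal sketch, using an invented legacy set with two 'a' characters (all names here are hypothetical):

```python
def is_round_trip_safe(to_new: dict) -> bool:
    """True if no two legacy characters share a target in the new set."""
    targets = list(to_new.values())
    return len(targets) == len(set(targets))


def round_trip(text, to_new):
    """Convert legacy -> new -> legacy using the given mapping."""
    to_legacy = {new: old for old, new in to_new.items()}
    return [to_legacy[to_new[ch]] for ch in text]


# Hypothetical legacy set with two slightly different 'a' characters
unified_mapping = {"a1": "A", "a2": "A", "b": "B"}         # both 'a's merged
compat_mapping = {"a1": "A", "a2": "A_COMPAT", "b": "B"}   # extra character kept

legacy_text = ["a1", "a2", "b"]
print(is_round_trip_safe(unified_mapping))
print(is_round_trip_safe(compat_mapping))
print(round_trip(legacy_text, unified_mapping) == legacy_text)
print(round_trip(legacy_text, compat_mapping) == legacy_text)
```

With the merged mapping the text comes back with one 'a' replaced by the other; only the mapping that keeps the extra compatibility character returns the original.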
Because of this need, Unicode has no pretensions toward being a set of actual 'real language' characters and only 'real language' characters. It contains numerous code points that are assigned to vague abstract concepts that formed part of some earlier regional text system and had to be put in Unicode for round-trip compatibility (compatibility with existing text display systems is another source of odd 'characters'). For instance, the Khmer area contains a pair of non-printing characters whose role was to switch between religious and non-religious mode in some now-rare local encoding. The encoding is gone but this pair of characters will now remain, specified but unused, forever.
Issues with Round-Trip Compatibility
Unicode relied heavily on the JIS standards as the basic repertoire of Japanese kanji. This was for two reasons: first, one of the goals of Unicode was to have round-trip compatibility with major existing character sets, among which JIS figured highly. The second reason was simply the lack of any other Japanese electronic character repertoire.
However, the JIS standards had been created without any particular regard for the kind of ambitious unification that Unicode was trying to do, and therefore, even though their repertoire was very small compared to Chinese standards or to the present-day repertoire of Unicode, they contained many 'characters' which would usually be thought of as mere variant glyphs (as noted above). This prevented the unification of those characters that had separate variants in JIS X 0208, and even in JIS X 0213 there are code points (61, in fact) assigned to what most commentators say are characters already in Unicode, but which now have to be given their own code points for round-trip compatibility.
Alternatives to Unicode
The fascinating Konjaku Mojikyo project is a character set -- or more properly a glyph set -- whose design philosophy does not include unification. This frees up Mojikyo to include vast amounts of variant characters -- for instance, the variants written on bones for divination in former times. The Mojikyo character set is an invaluable tool for anyone trying to represent classical Japanese literature or scripture electronically.
Mojikyo has some features that make it very well suited for literary or scholarly work; for instance there are powerful tools for finding and selecting kanji variants, and adding new variants is relatively easy (compared to Unicode). However, it also has some properties that make it difficult to recommend for industrial use: it focuses heavily on east Asia, it is not as well supplied with conversion and normalization tables as Unicode, and it is mainly supported with Japanese language documentation and tools. There is no chance of Mojikyo becoming standard on systems originating outside Japan.
The Mojikyo character set and font are free and make excellent reading for anyone who needs to write in the language of Xixia, an exceptionally beautiful system which sadly has been disused for some centuries.
GT Code is a combination of character set, font, and searching tool. Found here, it resembles Mojikyo in that it is more of a glyph set than a character set, but unlike Mojikyo it has no global ambitions and focuses entirely on kanji, of which it encodes about 70,000. GT Code is maintained by Tokyo University and can be a valuable academic tool, but is probably not used for bulk storage of text. GT Code is widely enough used that there is a small market for tools based on it.
This combined character set and font is being actively maintained and contains some obscure characters, but it does not appear to have either the user base or the rich meta-information of Mojikyo.
The TRON project is not primarily concerned with character sets, although it is occasionally pushed as an international text representation standard due to the founder (Ken Sakamura) objecting to Unicode. The main part of TRON concerns the development of ITRON, a standard for embedded OSes, but there is also a desktop OS standard, BTRON, and there have been some BTRON implementations such as DOS/V and Chokanji. BTRON uses the 'TRON multilanguage environment' which among other things defines a character set that is extensible by adding in pre-existing character sets -- hence the relevance of TRON to Japanese text processing.
Chokanji is an excellent tool and perhaps the best way to work with ancient Japanese on a PC, but it has been some time since the desktop side of the TRON project showed much activity. The situation seems to have settled down with Chokanji being used by people who need to use the Mojikyo characters, plus various Japanese-oriented editing tools, while the TRON project works mainly on embedded OSes for handheld devices (at which it has been very successful: 80% of home electronics in Japan are thought to use TRON). This is in accordance with the main goals of TRON. It is therefore hard to see how TRON's multilanguage layer could become a global character set standard -- although in many ways it would make a very interesting one.
TRON's representation of characters consists of a number of 16-bit planes and a set of control codes that switch between them. It is thus similar to ISO 2022 and other systems, and unlike Unicode which uses surrogate code pairs to address the problem of supporting many characters. A feature of TRON's system is that new character sets can be 'plugged' in or out of a system, with the relevant control codes being enabled or disabled. TRON makes no attempt to unify the various component character sets; therefore an 'A' in a TRON system may be the 'A' in JIS, or the 'A' in Mojikyo, or the 'A' in any other character set. There were originally plans to incorporate a large amount of context information, e.g. about the age or type of characters, into the TRON system but this seems not to have happened.
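For comparison, the surrogate mechanism that Unicode uses in UTF-16 involves no mode switching at all: a code point beyond U+FFFF is split arithmetically into two 16-bit units. A minimal sketch follows; the function name is mine, and U+20000 is chosen purely as an example.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


cp = 0x20000  # first code point of plane 2, used here purely as an example
high, low = to_surrogates(cp)
print(f"U+{cp:X} -> {high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder
assert chr(cp).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Each pair is self-describing, so a decoder can start anywhere in the stream, whereas a plane-switching scheme has to know which control code was seen last.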
The TRON system is described in some detail here. As these tables show, the following character sets were included by the year 2000:
- JIS standards up to JIS X 0213
- KS X 1001 (Korean)
- GB 2312
- CNS 11643
- i-Mode emoji (see below)
- The Daikanwa Dictionary repertoire
- Various regions of Unicode 2.0 (but not Han characters or Korean hangul)
However, this may not reflect what is actually available on a real TRON-based system; in particular Mojikyo is often included.
Japan's flavor of ISO-2022 is conceptually a little like TRON (it includes other character sets whole), but like all ISO-2022 standards it specifies neither character repertoire nor encoding itself; it uses escape sequences to shift between various permitted existing charset/encoding pairs.
ISO-2022-JP2 is different from all other ISO-2022 systems in that (unlike the fairly short-lived ISO-2022-JP1) it encodes more than one language; in fact, by incorporating not only the JIS standards of the time but also the major Korean and Chinese character sets, ISO-2022-JP2 was a kind of forerunner of Unicode, a character set that ambitiously aimed to include a high proportion of the world's languages (remember that JIS already included alphabets such as Greek and Cyrillic).
Truly ahead of its time, ISO-2022-JP2 had a strong influence on the way the Japanese IT community thought of international character sets -- notably, it was both modal and non-unified. These properties are exactly opposed to the properties of Unicode, and some people found it difficult to switch between the two (e.g. the abuse of ISO-2022 encoding indicator sequences to indicate language can be a handy shortcut). The issues of modal versus non-modal encodings and unified versus non-unified character sets are addressed elsewhere in this document.
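What 'modal' means in practice can be seen by encoding a short mixed string. The sketch uses Python's built-in `iso2022_jp` codec, which implements the basic ISO-2022-JP and not the multilingual JP2 variant discussed above; the helper name is hypothetical.

```python
ESC = b"\x1b"


def show(text: str) -> bytes:
    """Encode with Python's 'iso2022_jp' codec and print the raw bytes."""
    data = text.encode("iso2022_jp")
    print(repr(text), "->", data)
    return data


ascii_only = show("abc")           # no escape sequence needed
mixed = show("abc\u6f22\u5b57")     # ASCII followed by two kanji

# The kanji run is bracketed by a shift into JIS X 0208 and a shift back to ASCII
print(mixed.count(ESC), "escape sequences")
```

The meaning of any given byte depends on the most recent escape sequence, which is exactly the property that Unicode's encodings avoid.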
ISO-2022-JP is not normally used on desktop computers or on the internet so its visibility was never as great as that of Shift-JIS.
Shift-JIS is an encoding system for JIS X 0208 developed by Microsoft. The original encoding specified by JIS was a modal encoding too inefficient for the personal computers of the 1980s, so Microsoft created Shift-JIS, and specified a character repertoire initially consisting of JIS X 0208 and JIS X 0201 (half-width katakana). This is the 'Codepage 932' that Windows users know and love today. At the time, the efficient (compared to JIS encoding) and standard EUC-JP encoding was already available, but Microsoft claimed that some customers had demanded half-width katakana, which could not be encoded with EUC.
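The byte-level difference between the two encodings can be compared with Python's built-in `shift_jis` and `euc_jp` codecs. Note that Python's `euc_jp` codec does accept half-width katakana, using a two-byte form, so this sketch shows a difference in size and not an outright impossibility. The sample characters are my own choice.

```python
SAMPLES = {
    "ASCII letter": "A",
    "half-width katakana": "\uff71",
    "kanji": "\u6f22",
}


def byte_lengths(ch: str) -> dict:
    """Number of bytes one character occupies in each encoding."""
    return {codec: len(ch.encode(codec)) for codec in ("shift_jis", "euc_jp")}


for label, ch in SAMPLES.items():
    print(label, byte_lengths(ch))
```

In Shift-JIS the half-width katakana fits in a single byte, the same width as an ASCII letter, which suited fixed-width text displays.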
Despite its relative complexity compared to EUC-JP, Shift-JIS became an important encoding in Japan relatively early and was the focus of most attempts to fix the deficiencies in JIS X 0208. New proprietary Shift-JIS extensions are still appearing. These typically consist of ordinary Shift-JIS encoding used with JIS X 0208 and a set of vendor-specific characters in an unoccupied part of the Shift-JIS-encodable space. Although in the past extensions usually consisted of extra kanji not found in JIS X 0208, new character sets now often include special characters or versions of characters used in mobile phones or other handheld devices.
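The gap between plain Shift-JIS and a vendor extension can be probed with Python's `shift_jis` and `cp932` codecs, taking the latter as one example of an extended Shift-JIS. The helper name is hypothetical; U+2460 (circled digit one) is a well-known vendor addition, and U+9AD9 is the name variant of 'tall' mentioned earlier in this document.

```python
def encodable(ch: str, codec: str) -> bool:
    """True if the codec has a code for this character."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for ch in ("\u2460", "\u9ad9"):
    print(
        f"U+{ord(ch):04X}",
        "shift_jis:", encodable(ch, "shift_jis"),
        "cp932:", encodable(ch, "cp932"),
    )
```

Text that uses such characters is only portable between systems that happen to share the same extension, which is the practical cost of the proprietary approach.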
For example, the widespread system used in NTT Docomo's i-Mode mobile phones defines a set of emoji ('picture characters', some more like icons and some more like characters) and assigns them code points in the 0xf800-0xf9ff range. The competing J-Sky system is completely different, using an escape character to indicate that the ensuing Shift-JIS encoded characters are emoji. NTT Docomo publish their character set, divided into basic and extended areas, here for the curious.
A very large amount of traffic uses these extended Shift-JIS systems, so they will probably continue to be a factor in Japanese IT for a long time to come. Since many web-based systems are used from mobile phones, and many database back-ends are in turn used with these websites, there is a tendency for these character sets to spread to all kinds of computer systems. Some pseudo-standards for representing these characters in Unicode in the private-use area have evolved but at this time I know of no-one working on a proposal for formal inclusion in Unicode as a unified character set.
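To give an idea of what a private-use mapping for vendor characters can look like, here is a minimal sketch. It assumes the linear layout that Microsoft's CP932 tables use for the user-defined area (0xF040-0xF9FC onto U+E000-U+E757); whether any particular handset vendor or gateway follows this layout is an assumption, and the function names are hypothetical.

```python
def in_imode_emoji_range(code):
    """True if a two-byte Shift-JIS code lies in 0xF800-0xF9FF."""
    return 0xF800 <= code <= 0xF9FF


def sjis_user_area_to_pua(code):
    """Map a two-byte code in 0xF040-0xF9FC to a private-use code point.

    Assumes the CP932 user-defined-area layout; other mappings exist.
    """
    lead, trail = code >> 8, code & 0xFF
    if not 0xF0 <= lead <= 0xF9:
        raise ValueError("lead byte outside the user-defined area")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC):
        raise ValueError("invalid Shift-JIS trail byte")
    # Each lead byte has 188 valid trail bytes: 0x40-0x7E and 0x80-0xFC.
    index = trail - 0x40 if trail < 0x80 else trail - 0x41
    return 0xE000 + 188 * (lead - 0xF0) + index


if __name__ == "__main__":
    code = 0xF89F  # an arbitrary code inside the range given above
    print(in_imode_emoji_range(code), hex(sjis_user_area_to_pua(code)))
```

Because the mapping is a private agreement rather than part of Unicode, both ends of an exchange have to know which convention is in use.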
It can also be pointed out that the Chinese GB, Korean KS and Japanese JIS character sets that were considered for unification had been unified internally in three different ways: the Japanese by meaning, the Chinese by actual shape, and the Korean by sound (meaning that there were several characters that were visually identical but were associated with a different sound -- tricky!).
Unicode is a Western standard and therefore less appropriate than Japanese standards
For many reasons, it is occasionally believed in Japan and elsewhere that Unicode is a standard created by Americans and Europeans and foisted upon an innocent Asia. Although it may be true that Japanese involvement in Unicode has not been as assertive as it could have been, there are many possible interpretations of the situation. To some extent this is a matter of opinion; rather than try to prove or disprove the popular belief I would like to collect together some facts which shed light on how opinions have developed.
What is now usually just called 'Unicode' was developed in parallel by two groups: the ISO and the Unicode Consortium, the latter being (at the time) a group of mostly American industrial interests. Since the early 1980s, the ISO had been discussing the possibility of an international character set. It proved difficult to decide whether Han characters in this set should be unified or not, although this was just one among many problems faced by ISO/IEC JTC1/SC2 (the snappily-named committee in question). By 1990, there was concern in the ISO about duplication of effort between the Unicode Consortium and JTC1/SC2, and a majority of SC2 voted (in Seoul, which thus became the Jerusalem of Han unification) to coordinate ISO standards and work with the Unicode Consortium. Since then, the Unicode standard has been identical with the ISO's ISO 10646 standard and the terms are usually used interchangeably.
People who are generally positive about Unicode like to point out that the Han unification effort in Unicode evolved from the efforts of a group called CJK-JRG (China, Japan, Korea Joint Research Group), formed in 1990 from the original JTC1/SC2, which met in Japan, Korea and China and was composed of representatives from those countries -- and was thus fully in the hands of Han character users. On the other hand, Japanese people who oppose Unicode like to point out that the CJK-JRG actually favored the DIS 10646 proposal, a proposal for a non-unified character set, which was thrown out in favor of unification with the Unicode Consortium's unified character set by the votes of American and European ISO members. More pro-Unicode people, however, like to point out that within the CJK-JRG, the Japanese voted against DIS 10646 because they apparently preferred a unified character set. Those positive about Unicode, again, will mention that the editor of 10646, who favored the fusion between 10646 and Unicode that was eventually adopted, was in fact Japanese. In reply, anti-Unicode people will bitterly state that he was an employee of DEC Japan, a strongly pro-Unicode and pro-unification faction, and therefore not a free agent. In other words, there are plenty of facts to support either side.
Some claim that the work of unification was carried out by Americans unfamiliar with Han characters. This is a fabrication, since the actual unification was done by the CJK-JRG, as noted above. However, some claim (and this time with a foundation in reality) that in some cases the CJK-JRG was bullied or circumvented by the Unicode Consortium; for instance it is said that IBM managed to insert 32 characters from its own character set into the standard even after the CJK-JRG rejected them.
As these bitter references to DEC and IBM suggest, the original participants in the Unicode Consortium included many large US-based companies, whose expected return on the creation of Unicode was easier access to worldwide markets. The fusing of the ISO efforts with those of the Unicode Consortium, therefore, looked to some non-US observers like the selling out of international standards to US commercial interests. The issue of Han unification became identified with the idea of American corporations forcing globalization on a vulnerable Asian market, to an extent that often obscured academic and technical arguments (and, in many cases, still does).
With the release of JIS X 0213, a related issue appeared -- JIS X 0213 allocated separate characters for the Ainu language (the language of the pre-Japanese inhabitants of Japan), but Unicode preferred to represent them with combining characters, which has little impact on actual information processing but quite a lot on the people's feelings. Supporters of the decision would argue that the characters in question really are combined forms rather than unique signs with their own identity, and that Ainu was handled in the same way as many other non-ideographic Asian languages. Others would argue that the decision was railroaded through as the result of American companies refusing to use JIS X 0213 until it could be made compatible with Unicode.
In fact, Japanese opinions during the late 80s and early 90s were deeply divided on the subject of whether Han unification would be a good thing, but they were united in their opinion of the existing JIS standards. This speech (English) delivered at a conference around 1990 presents a moderate point of view.
Note that there are other areas of Unicode where there is much stronger reason to believe that standards affecting one group have been dictated by other groups. Languages that are not the official language of any nation, languages such as Thai which have many unique dialects spoken only outside the national territory, and languages such as Khmer which are the languages of very weak nations, are very vulnerable to this situation. In Japan's case, however, representatives from an IT-aware nation were participating from the very beginning. It would be reasonable to say that Japanese has been better represented than many other languages in Unicode decision-making.
Unicode/ISO 10646 has been an international effort by an enormously diverse group of people. At the time this effort began, Japan was subject to several unique disadvantages: the language was in a state of ongoing reform, the domestic character set standard (JIS) was weak, and Japanese industry, unlike American IT companies, showed little inclination to become involved. Efforts like TRON and Mojikyo were still in the future at that time, and the Japanese community was divided on the subject of unification (whereas the Europeans and Americans were united in favor). These factors may have resulted in a standard that does not have a Japanese feel to it, but if so, this has happened in spite of the efforts of the ISO and Unicode Consortium.
To sum up, I would suggest that the image of Unicode as something imposed by the West on Japan has something to do with these factors, in descending order of importance:
- Domination of the early Unicode Consortium by American corporations
- The choice of a Chinese font, and only a Chinese font, for the reference glyphs in the Han character area
- The decline of Japanese computing technologies, at least in the desktop market, compared to American based technology (especially Windows) during the time period in question
- The absence, in the late 80s, of a strong Japanese national standard or unified body of thought on what the standard should be like
Unicode does not have enough code points to represent every kanji
People who say this are probably thinking of Unicode version 1, which was never used and which indeed allowed only 65,536 code points, all in the BMP (Basic Multilingual Plane). Since 2.0, Unicode has allowed for code points up to 0x10FFFF (over a million in all), more than the number of Han characters in existence by any measure. It is necessary to use surrogates to access most of these code points, surrogates being pairs of code points which, taken together, specify a character in a higher plane.
The strange belief that Unicode consists only of the BMP is still heard but is easy to correct. There are also many people who consider that surrogates are obscure and difficult to process and that kanji assigned to higher planes and accessed via surrogates are being 'swept under the rug'. Although it is true that some systems (e.g. early Java versions) adopted UCS2, an unofficial encoding system that does not handle surrogates, surrogates are not generally difficult to handle.
The surrogate system has the following properties:
- Surrogate code points cannot be confused with regular BMP code points
- It is easy to tell whether a surrogate code point is the first or last of a pair
- It is computationally easy to find the real character value given the surrogate pair, and vice versa
- Processing surrogate pairs to get a character is vastly simpler than processing diacritical marks and ligatures to get a display character, which text processing systems have to do frequently
It is extremely important to bear in mind that the system of using surrogate pairs to represent high-plane characters only applies to UTF-16 encodings. UTF-8 expresses the scalar Unicode code-point value in the same way whatever plane the value is on (higher planes require more bytes, though). If UTF-32 encoding is used, then all characters are represented directly by their code point (but each character takes up 4 octets). This is the kind of tradeoff that is always made in encoding systems, and is a property not of the Unicode character set but of the Unicode encoding systems. Having worked with many encoding systems, my personal opinion is that, although not perfect, Unicode surrogates are probably the least painful way to access more than 16 bits' worth of code points to appear so far.
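The arithmetic behind surrogate pairs is short enough to sketch in a few lines. The function names below are my own; the constants are those of UTF-16. The last helper compares how many bytes one character occupies in UTF-8, UTF-16 and UTF-32.

```python
def is_high_surrogate(unit):
    """True for the first code unit of a pair (0xD800-0xDBFF)."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    """True for the second code unit of a pair (0xDC00-0xDFFF)."""
    return 0xDC00 <= unit <= 0xDFFF


def to_surrogates(cp):
    """Split a code point above the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are written with surrogates")
    offset = cp - 0x10000  # 20 bits: top 10 go in the high unit, low 10 in the low
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Recombine a surrogate pair into the code point it stands for."""
    if not (is_high_surrogate(high) and is_low_surrogate(low)):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def encoded_sizes(ch):
    """Byte length of one character in each Unicode encoding form."""
    return {name: len(ch.encode(name))
            for name in ("utf-8", "utf-16-be", "utf-32-be")}


if __name__ == "__main__":
    high, low = to_surrogates(0x20B9F)  # a kanji in plane 2
    print(hex(high), hex(low), hex(from_surrogates(high, low)))
    print(encoded_sizes("\u9aa8"))       # a BMP kanji
    print(encoded_sizes(chr(0x20B9F)))   # a plane 2 kanji
```

The two range checks are what make the first two properties in the list hold: the high and low ranges overlap neither each other nor any ordinary BMP character.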
There is no reason to support any more kanji, because the Japanese government only recognizes a limited set
This statement is usually heard from people who like Unicode and are wary of any suggestion that it is incomplete. They believe that, while more kanji could be added to Unicode, this would go against Japanese government policy and that therefore the current repertoire is all that should exist. This belief appears to refer to two Japanese standards:
- The Jinmei kanji list
- The Toyo/Joyo lists, as applied to restrict the range of kanji in newspapers and official documents
As it happens, however, neither of these rules has particularly broad application. The Jinmei list, as noted above, is subject to change, is probably about to expand, and only refers to given names anyway. The Joyo kanji restriction was once enforced strictly for newspapers, and is still often referred to today, but usually only in newspapers and official documents. Nobody has ever expected novels or poems to conform to this abridged character repertoire, for instance. To represent even popular modern works of literature, it is necessary to look far beyond the Joyo list.
Variant kanji are purely a display issue
This statement is usually made for one of two reasons: either because the person making the statement does not use Han characters and does not understand that two variants of a character can have a very different effect in some contexts, or because the person is thinking of Chinese usage, which is usually much looser than Japanese. In particular, Japanese users are usually extremely picky about the variants used for personal names, whereas Chinese users will often accept any variant provided the meaning is clear. (This statement is based on anecdotal evidence but the trend seems quite clear).
In fact, the importance attached in some contexts to the particular choice of variant is easily demonstrated by the large amount of software available for specifying and representing variants -- for instance this product. The variants produced by products like this are generally beyond the ability of Unicode, JIS and other 'technical' character sets, but are found in the 'academic' character sets like Mojikyo and GTKanji.
Han Unification makes it impossible to have Chinese and Japanese text on the same page
In Japanese, it is normal to quote Chinese text using Japanese character shapes. You don't shift to Chinese shapes for the quotation, just as you don't shift to a Fraktur font when quoting German. It is therefore not generally true that Unicode creates problems for mentioning one CJK language within another.
Nevertheless, if writing a Chinese to Japanese dictionary, you would need some 'out of band' information to indicate that a certain character in the headword should be written with a Chinese glyph, while the same character in the definition should be written with a Japanese glyph. This would normally be done by having different fonts for the two pieces of text. This is in contrast to systems like ISO 2022, in which the actual character stream would maintain state to indicate what encoding is to be used and therefore, as a side effect, what language the text is in. There are both advantages and disadvantages to including this type of information in the character stream, depending on the exact use case.
In systems like Shift-JIS, the issue does not arise, simply because Shift-JIS is not capable of representing, or intended to represent, more than one language. You would therefore have to use Shift-JIS for the Japanese text and something else, perhaps a GB encoding, for the Chinese headwords. Most people agree that this is quite a bit more problematic than either the ISO 2022 or the Unicode approach.
To sum up the above three paragraphs, the work required to switch between (for example) Chinese and Japanese is as follows:
- In Unicode: Change font (but you need extra information to tell you when)
- In ISO-2022: Change font and switch encoding (but the relationship between these two is not always 1-to-1, so you still need extra information)
- In single-language systems: Change font, then tell the display system that text in a new encoding follows (requires lots and lots of extra information to specify new encoding system)
Some confusion on this subject appears to have arisen because the reference glyphs for Han characters in Unicode are Chinese, and many Japanese users got the impression that the character (not just the glyph) is Chinese and should not be used in Japanese. This is essentially a confusion between 'character' and 'glyph'. The Unicode Consortium could probably have done more to provide alternative reference glyphs, emphasise that what glyph is used for a character is a typographical issue, and prevent this mistake from arising.
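The 'extra information' needed in the Unicode case can be as simple as a language label carried alongside each run of text. The following is only a sketch: the run structure, the function names and the font names in FONT_FOR_LANG are all hypothetical, and the codecs are the ones in Python's standard library.

```python
# Hypothetical mapping from a language label to a display font.
FONT_FOR_LANG = {"zh": "ExampleSong", "ja": "ExampleMincho"}


def decode_run(data, codec, lang):
    """Decode legacy bytes and keep the language as out-of-band data."""
    return (lang, data.decode(codec))


def fonts_for(runs):
    """Choose a font for each run from its language label, not its text."""
    return [(FONT_FOR_LANG[lang], text) for lang, text in runs]


# U+76F4 stored in a Chinese and in a Japanese legacy encoding.
zh_bytes = "\u76f4".encode("gb2312")
ja_bytes = "\u76f4".encode("shift_jis")

zh_run = decode_run(zh_bytes, "gb2312", "zh")
ja_run = decode_run(ja_bytes, "shift_jis", "ja")

if __name__ == "__main__":
    # The bytes differ, the decoded code point does not.
    print(zh_bytes.hex(), ja_bytes.hex(), zh_run[1] == ja_run[1])
    print(fonts_for([zh_run, ja_run]))
```

Once decoded, nothing in the character itself says which glyph to use, so the label has to travel with the text (in markup, metadata or a similar side channel).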
Frequently Asked Questions
Are all Japanese characters in common use included in Unicode?
Well, until recently the Unicode Consortium's own FAQ answered this question 'no', although as of 2003 it has been changed. However, it depends on the definition of 'common use'. The number of kanji in common religious use, for instance, is extremely high and extends beyond the Unicode repertoire, because the sutras are studded with peculiar characters. It is also possible to find several placenames that are normally represented by characters or distinct variants not found in Unicode, and abbreviated or informal shapes found on shop signs are absent too; this site contains evidence of several such forms. On the other hand, if by 'common use' you mean things that people are actually likely to have to write down on a daily basis, then in practice Unicode would normally be found to contain them all.
A special issue is the hentaigana, old intermediate forms between kanji and modern-day kana, vital in cursive calligraphy (sousho). These are absent from Unicode (and it would be very complicated to add them), but even if they were present there would still be
Note that because Unicode contains so many Chinese characters, you are much more likely to be able to represent an archaic Japanese text in Unicode than in any JIS standard. However, it's only Mojikyo and GTCode that provide enough literary and ancient characters to have a chance of representing, say, the Heike Monogatari correctly.
What character sets should be considered 'safe' for Japanese communication?
This depends on who you're communicating with. PC systems use Microsoft's codepage 932, a Shift-JIS encoding and character set which should be considered standard for PC documents. Mainframes and old Unix systems, such as are often found in the Japanese government and large companies, usually handle only JIS X 0208, often with unique extensions. These two levels (CP932 and JIS X 0208) are probably the most important levels to bear in mind. Modern software based on Windows, Java or XML will usually be using Unicode as UTF-8 or UTF-16, and in practice a high proportion of Japanese text is handled in Unicode already simply because it is handled on Windows.
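One way to check which of these levels a piece of text fits into is simply to try encoding it. The sketch below uses Python's standard 'shift_jis' codec as a stand-in for the plain JIS X 0208 level (it also accepts JIS X 0201 characters) and 'cp932' for Microsoft's code page; the function name and the choice of levels are my own assumptions.

```python
# Levels ordered from most restrictive to least restrictive.
LEVELS = ("shift_jis", "cp932", "utf-8")


def encodable_in(text):
    """Return the codecs from LEVELS that can represent every character."""
    usable = []
    for codec in LEVELS:
        try:
            text.encode(codec)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this level
        usable.append(codec)
    return usable


if __name__ == "__main__":
    samples = (
        "\u65e5\u672c",   # ordinary kanji
        "\u2460",         # circled digit one, a vendor extension character
        "\U00020BB7",     # a kanji outside the BMP
    )
    for sample in samples:
        print(ascii(sample), encodable_in(sample))
```

A check like this only tells you that the characters exist at a given level; it says nothing about vendor-specific extensions on the receiving system.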
What do I need in order to represent *every* Japanese character?
There is no one character set, nor any one display system, that can correctly represent every character that is used in Japanese. The following are some recommendations:
- Ancient and exotic kanji and other Asian characters: Definitely Mojikyo
- Rare and literary kanji variants: Mojikyo or GTCode
- Local and proper-name variants: Mojikyo or GTCode
- Mobile phone characters/icons: One of the proprietary Shift-JIS extensions
- Anything else: Unicode will almost always suffice
Even then, there are marks and ligatures used in vertical writing that only specially-designed text display systems are likely to handle, and then there's cursive writing (sousho), still widely used for poetry, which has no satisfying binary representation short of an image file. Such are the limitations of the character stream as a model of text.
What characters can be used as examples of Unicode problems?
Need an example to back up a criticism of Unicode? These are popular:
The 'bone' character U+9AA8 mentioned elsewhere in this document is often used as an example of a unified character whose national glyphs look very different. In fact, it is not a terribly good example since although the Chinese version is partly a mirror-image of the Japanese version, they are still obviously the same character. A better, and equally popular example is U+76F4, whose Chinese and Japanese forms are not mutually recognisable; the average monolingual user, confronted with the two glyphs, would usually say they were two different characters, one common and one unknown. Whether you consider this to really be a problem depends on what you want to do; I am merely mentioning that it is an example of two signs that look like different entities to users being unified onto the same code point.
For an example of missing variants, the name 'watanabe' has often been used. There are three main variants of the second character in this name; one new and two older ones (U+908A and U+9089). The original JIS standard included both older ones. However, the older ones each have several subvariants used by various families, which were not in JIS and are therefore, as far as I know, still not in Unicode. These variants (and many other name variants) can be represented by many products sold in Japan designed specially for representing variant kanji, but there is no standard encoding for such products.
For an example of a character whose unification in Unicode has been inconsistent, you could use U+6808 and U+685F. These characters have separate code points even though exactly analogous variants were unified onto code points U+6B8B and U+6D45. This is probably caused by the need for round-trip compatibility with some existing character set in which U+6808 and U+685F were already separate.
Appendix A: Timeline
This timeline of Japanese character set development is color coded as follows:
- Unicode is green
- JIS is red
- Other Japanese Government standards are orange
- Everything else is black
- 1716: K'ang Hsi dictionary published -- the most important attempt to standardize kanji. The forms in this dictionary were standard for centuries.
- 1946: Toyo kanji list defined.
- 1951: Jinmei kanji list defined (92 kanji).
- 1958: Gakushuu kanji list defined.
- 1969: JIS C 6220 (Japanese ASCII) was defined.
- 1976: Jinmei kanji list revised.
- 1978: JIS C 6226 (Basic kanji) was defined.
- 1981: Jinmei kanji list revised.
- 1981: Joyo kanji list defined.
- 1983: JIS C 6226 was revised and heavily corrected.
- 1983: Shift-JIS developed.
- 1984: TRON project started.
- 1987: JIS C 6220 was renamed to JIS X 0201.
- 1987: JIS C 6226 was renamed to JIS X 0208 and revised heavily.
- 1988: TRON association founded and begins to popularise TRON.
- 1990: JIS X 0208 was revised.
- 1990: JIS X 0212 (auxiliary kanji) was defined.
- 1990: Jinmei (name) kanji list reaches its current form (Joyo + 285 kanji).
- 1991: Unicode Consortium founded, Unicode 1.0 published.
- 1993: Unicode Consortium and ISO publish Unicode 1.1.
- 1995: JIS X 0221 (JIS Unicode) was defined.
- 1996: Unicode 2.0 published, the first useful (and used) version of Unicode.
- 1997: Mojikyo institute begins to compile the Mojikyo character set.
- 1997: JIS X 0208 was revised again.
- 1997: Jinmei kanji list has one character added.
- 1997: JIS Kanji Jiten, a definitive resource on the JIS kanji, published.
- 1998: Unicode 2.1 published.
- 2000: JIS X 0213 (more auxiliary kanji) was defined.
- 2000: Unicode 3.0 published.
- 2001: Unicode 3.1 has round-trip compatibility with JIS X 0213.
- 2002: Unicode 3.2 published, adding many Han characters.
- 2002: Revised JIS Kanji Jiten published.
- 2002: Expansion of Jinmei list announced.
- 2003: Unicode 4.0 published, adding many historical Han characters.
Appendix B: Glossary
In Japanese commentary and standards documents, the words used to describe characters and to discuss whether two entities are really different characters do not match the English words, and reflect the fundamental differences between ideographic and alphabetic writing. To help non-Japanese speakers get a handle on how Japanese writers may see the issues, a few relevant words are described here.
The jikei is the visual shape of a character, analogous to a glyph. It often reflects a particular calligraphic tradition; for instance, where a dot appears in a character, that dot might appear as a teardrop shape in a Chinese-style jikei or as a short vertical tick in a Japanese-style jikei, reflecting old brush usage. Some jikei are characteristically Chinese, Japanese or Korean; others, such as 'Mincho' (the Ming dynasty standard writing style) look natural in many locales.
The jitai is the topography of a character. Glyphs with different jikei have equal numbers of strokes, but the style of the strokes is different. Glyphs with different jitai, however, may have different numbers and types of strokes. Many Japanese standards speak in terms of 'jitai' rather than of characters, especially the Jinmei list. Different jitai of one character may be called 'variants' in English. Sometimes, two different jitai may have exactly the same meaning, as with the various ways of making a lowercase 'g' in English. At other times, two different jitai may differ in correctness or connotations.
JIS X 0213, which defines various terms, defines Jitai rather differently: The abstract ideal of the shape of a graphical representation of a graphical character. (my rough translation).
It is possible for displayed kanji to differ from each other by jitai only (a pair of variants displayed in the same font), by jikei only (one variant, displayed in Mincho and in gothic style) or by both.
Two entities that have the same jitai and jikei may still be different in 'dezain'. The JIS X 0213 standard says that differences of 'dezain' will be ignored, which seems reasonable. The exact relative size or location of the different elements in a kanji is a matter of 'dezain', and so is the exact angle of a line; it's a font difference too small to constitute a different writing style.
'Mojizukei' is usually used as the equivalent of 'abstract visible representation'; thus two displayed forms that differ only by boldness, not by jikei, jitai or dezain, are forms of the same mojizukei. JIS has a long and unique definition of this word -- too complicated for this humble translator.
'Itaiji' is usually translated as 'variant'. It usually carries the sense of a non-standard variant of some other standard character. Many personal and place names in Japanese are written with itaiji.
This means 'character'. It tends to be used in the sense of 'one element occurring in a string of writing' rather than the sense of 'one abstract character which may be instantiated in writing'. It does not intrinsically have a shape.
Defined in JIS X 0213 as One unit in a collection of elements used in the organization, control and representation of data. My translation is unreliable as usual, but it's interesting that the writers don't use any text or language related terms in the definition -- it's seen as a data entity, not a language entity.
A 'shape character', i.e. a character which refers to some visible entity. This would often be called a 'printable character' in English. Again, the term refers to the character (data entity) as found in a string, not specifically a language character.
These terms mean respectively 'abbreviated character' and 'colloquial character'. In fact, they are nearly synonymous. The antonymous term 'seiji' refers to the form of a character found in a dictionary. Although zokuji are not uncommon, most Japanese people would not consider them distinct entities requiring separate coding, any more than a rapidly-handwritten 'E' is coded separately from a typed 'E'.