Han Unification in Unicode

Otfried Cheong

October 12, 1999

There seems to be a lot of confusion, even among Unicode enthusiasts, about Han unification. Perhaps future versions of the standard should add more explanations of this controversial issue. Let me explain my view of the affair. I am not an expert at all, so please correct me where I am wrong!

First of all, there has NOT been a single unification between a (PRC) simplified and a traditional Chinese character. Similarly, I am not aware of a single unification between a simplified Japanese and a traditional character. (That could not happen, since many non-simplified versions are still in use in Japan and encoded in JIS X 0212, so there would be no round-trip compatibility for documents encoded in a mixture of JIS X 0208 and JIS X 0212.)

So what has been unified? Roughly speaking, characters that have the same meaning and shape. One has to understand that there are variations of many of the primitive elements that appear in Chinese characters, and writers have traditionally been free to write as they preferred--the more elaborate form when carving a giant character in stone, the simpler form when writing a shopping list... Different Chinese fonts have used different variants of some characters before Unicode even existed.

My last name, for instance, is U+912d:

The two dots in the top left can be written either like this /\ or like this \/, with no difference in meaning. Everybody can easily recognize both variants. Different fonts will show them in different ways, even within the same locale, and so the characters have been unified in Unicode. (There is, by the way, a PRC-simplified version, which is encoded as U+90d1--no unification there!)
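That the traditional and the PRC-simplified form really are two separate code points is easy to check. The following is only a small Python sketch, not part of the original discussion; it assumes nothing beyond the standard unicodedata module, and the names `traditional`, `simplified`, and `describe` are mine.

```python
import unicodedata

# The traditional form of the surname and its PRC-simplified counterpart.
traditional = "\u912d"
simplified = "\u90d1"

def describe(ch):
    """Return (code point, Unicode name, UTF-8 bytes in hex) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())

# Two different code points, two different names, two different byte sequences.
for ch in (traditional, simplified):
    print(describe(ch))
```

The /\ versus \/ variation, by contrast, cannot be observed this way at all: both shapes are U+912d, and which one you see depends only on the font.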

Another example of a character element that appears in two variants is the "black" radical. It can be written either in its traditional form as U+9ed1 (with two little dots), or in its simplified form as U+9ed2 (where the two dots have been replaced by a single stroke):

"Simplified", by the way, doesn't refer to a political process here, it is simply the way people have written for centuries to write faster--just like the syllable "un" in handwritten English looks like a single wiggled line.

I'm not sure why these two characters have not been unified--other characters that contain this element have been. For instance, you will find U+9edb with both variants of the radical in different fonts, and nobody will have any difficulty recognizing them as variants of the same character. (But compare the distinction U+9ed8 versus U+9ed9, which are again variants of the same character. They are not unified since there is a structural change in the makeup of the character.)

A less obvious example is U+76f4, which has two variants:

Here the difference is really one of locale: you'll find the right-hand glyph in Japanese or Korean fonts, the left-hand one in Chinese fonts. Most Japanese won't recognize them as the same character, while I believe Chinese and Koreans would (but I didn't).
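One way to see what unification means in practice is to look at how the national character sets encode this character. The sketch below is an illustration under assumptions of my own: it uses Python's built-in codecs as stand-ins for the national standards (with `big5` standing in for the character set used in Taiwan), and the helper names are hypothetical.

```python
# Legacy national encodings, given by their Python codec names (my choice).
LEGACY_CODECS = ["gb2312", "big5", "shift_jis", "euc_kr"]

def legacy_bytes(ch):
    """Map each codec name to the hex bytes it uses for the character ch."""
    return {codec: ch.encode(codec).hex() for codec in LEGACY_CODECS}

def unified_code_points(ch):
    """Decode each legacy byte sequence again and collect the resulting characters."""
    return {ch.encode(codec).decode(codec) for codec in LEGACY_CODECS}

# Each standard has its own byte sequence for the character...
print(legacy_bytes("\u76f4"))
# ...but all of them convert to the single code point U+76f4.
print(unified_code_points("\u76f4"))
```

The glyph a reader expects differs by locale, yet every one of these encodings converts to the same Unicode code point, so the choice of shape is left entirely to the font.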

Is it okay to unify character variants that ordinary people wouldn't recognize? Yes, of course, just as Sütterlin script should be encoded with the Latin-1 repertoire, even though most non-German speakers (and even most young German speakers) cannot read it easily.

So what is all the controversy about?

There are several issues here:

"This is not my name"

People can be attached to particular shapes of particular characters. Japanese would be unhappy if you wrote their name using a variant different from what they consider "their name". (Interestingly, they do not mind at all if you don't know how to pronounce the name.) There is a market in Japan for font software that allows you to modify glyphs to be able to typeset exactly the variant you have in mind!

I believe the JIS standard actually prescribes the shapes of the glyphs for each character, and this is perhaps exactly the grief that Japanese have with Unicode. If you are used to thinking of a codepoint as being associated with a well-defined shape, the loose view that Unicode takes seems rather careless.

Chinese seem much less fixed on a particular variant, and are exposed to more variations in daily life. The difficulty is thus that what is a negligible font variation to a Chinese is a major shape change to a Japanese observer.

"Preserving our legacy"

Chinese, Koreans, and Japanese have been keeping records of their lives for between one and two thousand years, and there is a large amount of literature that is slowly but steadily being transcribed into digital form. These documents contain archaic characters that are no longer used--so these have to be encoded--and archaic variants of characters still in use. Should these be replaced by the modern "unified" variant? Not if we want to faithfully preserve the document, right? The CNS character set defined in Taiwan, where a huge effort is under way to digitize the classics, now contains about 50000 characters, for exactly this reason. The CCCII character set has a 94 x 94 x 94 code space arranged in multiple layers that contain variants of the same characters.

I am not sure what direction Unicode is taking with respect to this issue. Unicode 3.0 improves support for CNS, but gives up the source separation rule, so that CNS/Unicode roundtrip compatibility is no longer possible.

"What do these Chinese characters do in my letter?"

When I view a page written in ShiftJIS on my web browser, all the characters will have the same style, as they come from the font that the browser uses for the ShiftJIS encoding. If I view the same page saved as Unicode (UTF-8), the browser will suddenly show my characters with different styles, taken from different fonts. This is a technical problem--the browser doesn't have a font for Unicode, so it will map each character to some other character set, such as GB, CNS, JIS, or KSC, and the web page appears in a patchwork of styles.
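The patchwork effect can be imitated in a few lines. This is only a sketch of the behaviour described above, not how any particular browser is implemented: the fallback order, the use of Python codecs as stand-ins for GB, CNS, JIS, and KSC (with `big5` in place of CNS), and the function names are all assumptions of mine.

```python
# Order in which this hypothetical browser tries its legacy character sets.
FALLBACK_ORDER = ["gb2312", "big5", "shift_jis", "euc_kr"]

def charset_for(ch):
    """Return the first legacy charset that can encode ch, or None if none can."""
    for codec in FALLBACK_ORDER:
        try:
            ch.encode(codec)
            return codec
        except UnicodeEncodeError:
            continue
    return None

def font_runs(text):
    """Pair every character with the charset (and hence the font) chosen for it."""
    return [(ch, charset_for(ch)) for ch in text]

# A short Japanese word, U+65e5 U+672c U+8a9e, given as escapes.
print(font_runs("\u65e5\u672c\u8a9e"))
```

With this ordering, any character that exists in GB 2312 is drawn from the mainland Chinese font, while a character missing there falls through to one of the later fonts. A single Japanese word can therefore end up in two different styles, even though the Shift-JIS original would have used one font throughout.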

This problem is caused by the Han unification, and it is a serious problem. Not only does the patchwork look ugly, people also don't like to see alien fonts. Even disregarding what we said above about glyph variations, a typical mainland Chinese font is easily recognizable as such by its style, even if there is not a single simplified character in a sentence. A Japanese, Korean, or even Hong-Konger would be quite displeased to see characters in this font style appear in their letters. Conversely, some Japanese styles aren't appreciated very much in Hong Kong (although there are so many Japanese articles for sale here in original packing that one would hardly notice).

You'll say "of course, everybody knows that there is no such thing as a Unicode font." Well, actually that's not quite true either. For instance, the style called Mincho in Japan, Myeongjo in Korea, and Song in China originated in Ming-dynasty China and is universally acceptable in the CJK countries. The Japanese fontmaker Typebank has a Chinese Mincho font with PRC-simplified characters approved by the Chinese government that shares glyphs with their Japanese Mincho font, and it is quite feasible to make a Mincho font that will serve the mainland Chinese/Korean/Japanese users decently.

The main difficulty in making a "Unicode font" is not the style, but the character variations discussed above. Since Japanese are most concerned about using a particular variant, and since Chinese will recognize most variations, one can indeed make a kind of Unicode ideographic font that uses Mincho style and the Japanese variant where applicable. This is exactly what the CJK Dictionary Publishing Society is doing for their "Dictionary of Unified CJK Characters" that shows a single glyph for each character (the font being made by Dynalab in Taiwan), and I believe this is also how Bitstream made their Cyberbit font (the CJK glyphs for Cyberbit were also made by Dynalab). I'm not claiming that this is the perfect solution--certainly not for an application for a specific market--but the font will be readable and acceptable for all CJK users, even though people may be surprised by some unfamiliar glyph shapes.

Despite all the apparent differences, CJK cultures have a common heritage, and Chinese characters form a strong part of that. Despite differences in writing style and changes in meaning, a Japanese can travel through mainland China without speaking a word of Chinese, communicating by notes in Chinese characters. If I see the same character on road signs in Japan or in Hong Kong, I don't think of them as being any more different than seeing the same word in roman letters.

Otfried Cheong