A complete introduction to Japanese character encodings

Alexandre Elias
http://www.cs.mcgill.ca/~aelias4/
aelias4@cs.mcgill.ca

v1.0

1 Introduction


1.1 Goal of this document

This document is a complete technical introduction to all the ways Japanese text is encoded on PCs. After reading this document, you will have an in-depth understanding of all common Japanese encoding schemes. Reading this text should be enough to be able to program software that manipulates any given JIS or Unicode file. (Though I do not go into enough nitty-gritty detail for someone wishing to program a complete display system from the ground up.)

In contrast to the simple elegance of ASCII, Japanese text encoding systems are a ghastly farrago of unstable and incompatible standards. Even Unicode is far more complicated than most people think. I hope this document will help you get your bearings in this mess and use Japanese confidently on your PC, including relatively new developments such as Plane 2 Unicode and JIS X 0213.

Although I did my best to verify every claim, a few errors might have slipped into this document. Please e-mail me for any corrections or additions.


1.2 A crash course in Japanese writing

Note: Proficient Japanese speakers may skip this section. You must have a Japanese font installed to display the Japanese characters properly. The characters are encoded in HTML plain-ASCII ampersand-Unicode form (e.g. &#x306E; for の) and should display properly regardless of which encoding you tell your browser to use.

Japanese has 3 main writing systems: hiragana, katakana and kanji. Roman letters (a, b, c ...), arabic numerals (0, 1, 2 ...) and various punctuation marks (。, !, ?, 「, 」...) are also commonly found in Japanese writing. All of these different writing systems can be found in the same sentence. Here is a typical example:

UNIX での日本語文字コードを扱うために使用されている従来の EUC は次のようなものでした。
(The traditional EUC encoding used to handle Japanese character codes on Unix looked like this.)

Unlike our roman letters, Japanese characters, whether hiragana, katakana, or kanji, are square and of identical size. There are a few uncommon exceptions: numbers, roman letters, punctuation and katakana are sometimes given a half-width rectangular form.

Japanese was traditionally written in columns, from top to bottom and right to left. However, modern Japanese can also be written from left to right, just like Western text. Indeed, outside of word processing programs, left-to-right writing is almost always used on computer displays.

Hiragana and katakana put together are known collectively as the "kana". Depending on how you count, there are roughly 50 hiragana and 50 katakana. Each hiragana has a katakana equivalent: they can be thought of as analogous to our small and capital letters. Unlike roman letters, their size does not vary, but hiragana have a smooth curvy appearance, whereas katakana are blocky and jagged. Each kana corresponds to one syllable. Here are a few kana:

            ka    ki    ku    ke    ko
Hiragana    か    き    く    け    こ
Katakana    カ    キ    ク    ケ    コ

The kanji are a complex ideographic writing system stolen from China. They remain very similar to Chinese to this day --- this is why the kanji are unified with Chinese and Korean in the Unicode character set. There are thousands of kanji in common use in Japan. The sheer quantity of kanji is what makes Japanese writing so difficult, both to learn and to encode on computers. Kanji can be recognized by their complex, intricate appearance: for example, 通, 飛, and 治 are common kanji. (But a few kanji look simple, such as 人 and 女; if in doubt, pore over a table of all kana to make sure the character isn't in it. Be careful: some kanji and kana look almost exactly the same, such as the kanji 力 and the katakana カ, or the kanji 一 and the katakana sound extender ー. The only sure way to tell these apart is by context.)

Note that because the kana can express all possible sounds in Japanese, it is possible to write any Japanese sentence using only one of the two kana writing systems. Thus, the earliest standard Japanese character set for computers (JIS X 0201, described later) supported only katakana, which, although inconvenient, was sufficient for the limited applications of the day.

There is much else to be said about Japanese writing, but this is all you need to know to understand how Japanese text is encoded.


1.3 Character set vs. encoding

A character set is a one-to-one mapping between a set of distinct integers and a set of written symbols. For example, I could define a new character set FOOBAR that maps the alphabet {A, B, C} to the digits 1, 2, and 3, respectively. A character set is an abstract concept that exists only in the mind of the programmer: computers do not directly manipulate character sets.

In contrast, an encoding is a concrete, specific way characters are stored into actual 0s and 1s of computer memory. If I wanted to implement FOOBAR support on a real computer, the most obvious way to encode my data would be to represent one character per byte, following the usual way of encoding integers in binary. In this scheme, the string "AABC" would become:

00000001 00000001 00000010 00000011

This is how ASCII is normally encoded. But since we only have 3 distinct characters in FOOBAR, it seems rather wasteful in this case. Alternatively, we could use only 2 bits for each character in the string. This would allow us to cram the entire string "AABC" into one byte:

01011011

We have changed the encoding, but the character set remains the same. In general, no matter how twisted and convoluted our encoding scheme becomes, conceptually {A,B,C} still maps to {1,2,3}.
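
To make the difference concrete, here is a minimal Python sketch of the 2-bit packed encoding just described (FOOBAR and encode_packed are names invented for this example):

FOOBAR = {"A": 1, "B": 2, "C": 3}  # the toy character set defined above

def encode_packed(text):
    """Pack FOOBAR text at 2 bits per character, 4 characters per byte."""
    out, acc, nbits = bytearray(), 0, 0
    for ch in text:
        acc = (acc << 2) | FOOBAR[ch]
        nbits += 2
        if nbits == 8:
            out.append(acc)
            acc, nbits = 0, 0
    if nbits:
        out.append(acc << (8 - nbits))  # zero-pad a final partial byte
    return bytes(out)

assert encode_packed("AABC") == bytes([0b01011011])  # one byte, as above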


1.4 Overview of the encoding schemes

In real life, encodings tend to multiply uncontrollably as implementors accommodate the quirks of their systems, but character sets remain few. In the English-speaking world, the only two character sets on the radar are ASCII and EBCDIC, and EBCDIC is long dead. Similarly, there are only 2 character sets used to write Japanese: JIS (Japanese Industrial Standard) and Unicode. (And the Japanese part of Unicode is actually derived from JIS.)

Unicode is the new, superior standard, but JIS continues to be more popular. Japanese computers have been using JIS for decades, and Unicode has only appeared in the past few years, so it has barely made a dent in JIS' mindshare. It is possible that all Japanese text will be written in Unicode somewhere in the very distant future, but --- as much as we would all like to jump on the Unicode bandwagon --- for the moment everyone should know how to deal with both JIS and Unicode text.

As needs have evolved, both standards have undergone several revisions. This is mainly a problem with JIS, since its revision process has been somewhat chaotic. In contrast, Unicode follows a policy that each new revision must be a strict superset of previous ones, so version conflicts rarely cause problems.

There are essentially 3 JIS encodings (Shift-JIS, EUC, ISO-2022-JP) and 3 Unicode encodings (UTF-8, UTF-16, UTF-32) in widespread use. In a nutshell:

Shift-JIS    - JIS; backwards-compatible with JIS X 0201, but messy; endorsed by Microsoft and the most common on the web
EUC-JP       - JIS; clean and simple; standard on Unix
ISO-2022-JP  - JIS; 7-bit, escape-code based; standard for e-mail
UTF-8        - Unicode; ASCII-compatible and byte-oriented; popular on Unix and for interchange
UTF-16       - Unicode; 16-bit codes plus surrogate pairs; used in the Windows APIs
UTF-32       - Unicode; one flat 32-bit code per character; mainly for internal use


2 JIS

JIS stands for "Japanese Industrial Standard". JIS is a blanket term used to describe all non-Unicode Japanese character sets and encodings, all of which are based on standards by the JSA (Japanese Standards Association).


2.1 JIS character sets

Unlike Unicode, there is no single JIS character set. A JIS encoding actually involves several standard character sets used in combination:

JIS X 0201 - roman characters and half-width katakana
JIS X 0208 - the main kanji character set
JIS X 0212 - the supplemental character set (rare kanji)
JIS X 0213 - the new unified character set, successor to JIS X 0208 and JIS X 0212

JIS encodings employ various schemes to use these overlapping character sets together in the same text.

The above four JIS standards are the ones you need to remember, but for reference, here is the meaning of all the JIS X 0200-series codes:

JIS X 0201 - Roman/katakana (JG)
JIS X 0202 - ISO-2022-JP
JIS X 0203, 0204, 0205, 0206, 0207 - Obsolete/withdrawn standards
JIS X 0208 - Main kanji character set (JH)
JIS X 0209 - How to write JIS X 0211 characters
JIS X 0210 - How to encode numbers
JIS X 0211 - Standard ASCII control codes
JIS X 0212 - Supplemental character set (JJ)
JIS X 0213 - New unified JIS character set
JIS X 0218 - Definition of standard prefixes (JG, JH, JJ, JK, JL)
JIS X 0221 - Unicode (JK for UCS-2, JL for UCS-4)


2.1.1 JIS X 0201

JIS X 0201 is a rudimentary 8-bit character set, supporting half-width katakana in addition to ASCII characters. JIS X 0201 hexadecimal character codes can be prefixed with "JG" to distinguish them from other JIS character sets. It was designed in the 1960s (long before the other standards), back when computers were not powerful enough to store kanji.

The 7-bit part (i.e. 0x00 to 0x7f) of JIS X 0201 is identical to ASCII, with two exceptions: the backslash character '\' (0x5c) is replaced by a yen symbol, and the tilde character '~' (0x7e) becomes an overline. Thus, displaying ASCII text in a Japanese font will work almost perfectly --- except that all your backslashes will turn into yens. This problem is so pervasive that it is no longer really a "problem" at all: Japanese people all know and expect to find yen symbols instead of backslashes in such things as DOS/Windows path separators. This bug has been around for so long, it is now embedded into the very fabric of the universe. It will never be fixed, and will probably live on even when Unicode becomes the standard. So you can either travel back in time to the 60s to assassinate the guy who decided this, or learn to think of the yen and the backslash as just variant ways of writing the same character.

The 8-bit part is divided in this fashion:

0x80 to 0x9f inclusive: Reserved
0xa0 to 0xdf inclusive: Japanese-style punctuation and half-width katakana
0xe0 to 0xff inclusive: Reserved

The katakana are half-width because that makes them the same size as roman characters and thus easy to display on primitive fixed-width terminals.


2.1.2 JIS X 0208

JIS X 0208 is by far the most important of the standards. When people say "the JIS standard", they mean JIS X 0208. JIS X 0208 hexadecimal character codes can be prefixed with "JH" to distinguish them from other JIS character sets. It has gone through 4 official versions from the Japanese Standards Association.

Revision history:

1978 - Standard created. This version is known as "Old-JIS" and is now more or less dead.
1983 - Standard changed heavily to make room for the 1981 Jouyou kanji list.
1990 - 2 new kanji added to the end.
1997 - 6 (?) new kanji added to the end.

The 1983, 1990 and 1997 standards can be considered essentially the same, each being a close superset of the previous one. They are collectively known as "New-JIS", or usually just "JIS". Thus it is possible to talk about "JIS X 0208" without mentioning the year.

JIS X 0208 is set up as a 2-dimensional, 94x94 grid. The position of a character on this grid is called its "kuten". Here is a description of each horizontal line on the grid:

Line   Content
01-02  punctuation, symbols
03     ISO 646 (alphanumerics only)
04     hiragana
05     katakana
06     Greek
07     Cyrillic
08     line drawing
16-47  kanji level 1 (2965, ordered by on'yomi)
48-83  kanji level 2 (3390, ordered by Kangxi radical, then stroke)
84     miscellaneous kanji (6)

This 94x94 grid fits nicely between 33 and 126 inclusive, almost completely overlapping the non-control part of ASCII. A "raw" JIS code is the 16-bit code resulting from adding 32 (0x20) to each 1-based kuten coordinate and concatenating the two resulting bytes, the vertical coordinate becoming the high byte.
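
As a quick illustration, here is that arithmetic in Python (kuten 4-2 is the hiragana あ, whose raw JIS code works out to the well-known value 0x2422):

def kuten_to_raw_jis(ku, ten):
    """Convert 1-based kuten coordinates to a 16-bit raw JIS code."""
    return ((ku + 0x20) << 8) | (ten + 0x20)

assert kuten_to_raw_jis(4, 2) == 0x2422  # hiragana "a"; row 4 is hiragana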

The kanji here are enough for the vast majority of writing, but every so often you might need a rarer kanji (to write names especially). This is why the following standards exist.


2.1.3 JIS X 0212

JIS X 0212 was introduced in 1990 to accommodate the demand for rare kanji. JIS X 0212 hexadecimal character codes can be prefixed with "JJ" to distinguish them from other JIS character sets. It is meant to be used in the same encoding alongside JIS X 0208. It contains 5801 obscure level 3 kanji. Even educated native Japanese people will not be familiar with most of them, and a foreigner like me recognizes almost none.

Like JIS X 0208, it is organized in a 94x94 grid. Here is a description of each horizontal line on the grid:

Line   Content
02     more punctuation, symbols
06     accented Greek
07     non-Russian Cyrillic
09     extended Latin
10     uppercase accented Latin
11     lowercase accented Latin
16-77  kanji level 3 (5801, ordered by Kangxi radical, then stroke)

Because of its overlap with JIS X 0208, encodings have to jump through hoops to use both JIS X 0208 and JIS X 0212 in the same text. Moreover, it leads to confusion. If I tell you I am using JIS kanji 0x6666, you have to ask: are you talking about JH6666 or JJ6666? The only nice thing is that, though the kanji don't, at least the non-kanji parts of JIS X 0208 and JIS X 0212 occupy disjoint code ranges.


2.1.4 JIS X 0213

JIS X 0213 is the successor to JIS X 0208. It is sometimes called JIS2000 (having been standardized in the year 2000). It is not yet in wide use: for now, you can safely ignore it. However, for future reference, here is a brief description.

The design of JIS X 0213 is clever. The new characters in JIS X 0213 are mostly kanji and a few miscellaneous other characters. The standard is divided into two parts. Level 3 kanji in JIS X 0213 are said to be in "Plane 1", and Level 4 kanji are said to be in "Plane 2". Both planes are a 94x94 grid.

Here is the trick: Plane 1 kanji occupy the unused codespace in JIS X 0208, whereas Plane 2 kanji occupy the unused codespace in JIS X 0212. To visualize what is happening, imagine that you printed out the JIS X 0208 grid, the JIS X 0212 grid, the JIS X 0213 plane 1 grid and the JIS X 0213 plane 2 grid on four separate sheets of transparent plastic. Then you could place the JIS X 0213 plane 1 sheet over the JIS X 0208 sheet without any overlapping characters, and you could place the JIS X 0213 plane 2 sheet over the JIS X 0212 sheet with no overlap.

To accommodate JIS X 0213, all 3 major JIS encodings have undergone revision. Their new MIME names are EUC-JISX0213, Shift_JISX0213, and ISO-2022-JP-3. Fortunately, because of the clever design of JIS X 0213, the new encodings are only slightly different from their original version. As of September 2002, these new encodings are not yet widely supported, but they are likely to grow more popular.


2.1.5 NEC/IBM proprietary extension

NEC/IBM defined roughly a thousand new characters for use with CP932, Microsoft's proprietary extension of the Shift-JIS encoding. These cause no end of problems, even having duplicates with standard characters. As these characters are tightly bound to CP932, they are described in section 2.2.2 on Shift-JIS.


2.2 JIS-based Encodings
2.2.1 EUC-JP (a.k.a. UJIS)

Ways to recognize this encoding
  1. If BOTH of the following hold:
    • Japanese text has the 8th bit of EVERY byte set
    • at least one Japanese character (kana or kanji) in the text takes exactly 2 bytes

EUC (Extended Unix Code) is a simple and clean encoding, standard on Unix systems, which can encode all characters from JIS X 0201, JIS X 0208 and JIS X 0212. It is backwards-compatible with ASCII (i.e. valid ASCII implies valid EUC). However, it is NOT backwards-compatible with raw JIS X 0201: EUC does not support 1-byte half-width katakana/punctuation (though it does support it in 2 bytes).

EUC has the nice property that ASCII characters are encoded as just ASCII, and every other character is multibyte and has the top bit of each byte set.

Here is how each character set is encoded:
JIS X 0201: Two-byte encoding. 1st byte: 0x8E. 2nd byte: raw JIS X 0201 byte.
JIS X 0208: Two-byte encoding. Just take the raw JIS X 0208 two-byte code and set the top bit of each byte.
JIS X 0212: Three-byte encoding. 1st byte: 0x8F. 2nd and 3rd bytes: take the raw JIS X 0212 code and set the top bit of each byte.
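
To show how mechanical this is, here is a minimal Python sketch (the function name is invented) that builds the EUC-JP bytes of a JIS X 0208 character from its kuten coordinates and checks the result against Python's built-in euc-jp codec:

def kuten_to_euc(ku, ten):
    """Encode 1-based JIS X 0208 kuten coordinates as EUC-JP bytes."""
    # Raw JIS adds 0x20 to each coordinate; EUC then sets each top bit (0x80).
    return bytes([ku + 0x20 + 0x80, ten + 0x20 + 0x80])

assert kuten_to_euc(4, 2) == "あ".encode("euc-jp")  # b'\xa4\xa2'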

Note, though, that it may be a bad idea to use characters from JIS X 0212 with EUC, as some brain-damaged software might not recognize them. Because almost all characters in EUC take up only 2 bytes, it is all too easy for careless programmers to build software that will break when it encounters a 3-byte EUC character.

EUC's standard MIME label is "EUC-JP".

JIS X 0213 note: With the advent of JIS X 0213, EUC was extended to support the new characters, and given the MIME label "EUC-JISX0213" (not yet standard). The extension is simple. Recall that JIS X 0213 plane 1 fits into the unused codespace of JIS X 0208, and JIS X 0213 plane 2 fits into the unused codespace of JIS X 0212. EUC-JISX0213 is identical to ordinary EUC-JP, except that it allows you to encode JIS X 0213 plane 1 characters just as you would encode JIS X 0208 characters (in 2 bytes), and JIS X 0213 plane 2 characters just like JIS X 0212 characters (in 3 bytes).


2.2.2 Shift-JIS (a.k.a. SJIS)

Ways to recognize this encoding
  1. If it has any 1-byte Japanese punctuation or half-width katakana between 0xA1 and 0xDF.
  2. If
    • it isn't ISO-2022-JP (at least one Japanese character has highest bit set)
    • it isn't EUC (the second byte of at least one Japanese character has its top bit cleared)
    • it isn't UTF-8 (all Japanese characters take up only 2 bytes) and
    • it isn't UTF-16 (take a kana and check whether it falls into the UTF-16 kana range 0x3040-0x30FF)
(This encoding is so chaotic there's unfortunately no easier way to tell from looking at the hex dump. But just remember that if you found it on a web page, it's probably Shift-JIS.)

The selling point of Shift-JIS is that, unlike EUC, it is backwards-compatible with not only ASCII, but also JIS X 0201. One-byte half-width katakana/punctuation is valid Shift-JIS. Unfortunately, this compatibility comes at a steep price: Shift-JIS is the messiest encoding of all. These half-width katakana are hardly used nowadays, so this tradeoff turns out in hindsight to have been a mistake. At any rate, because of Microsoft's endorsement of it, this encoding is the most popular.

Shift-JIS can be used to encode JIS X 0201 and JIS X 0208 (but not JIS X 0212).

The first byte of a Japanese character in Shift-JIS always has the top bit set and, in order to permit the use of the wonderful JIS X 0201 characters, avoids ever being in the high JIS X 0201 range of 0xa0 to 0xdf. Unfortunately, because there would not be enough codespace otherwise, the second byte has no such guarantee: its top bit is not necessarily set (!). It can go all the way from 0x40 to 0xFC (overlapping with ASCII).

The algorithm to convert raw JIS codes into Shift-JIS is hideous, and can't be explained in a nutshell. It involves dividing things by 2 and adding arbitrary magic numbers. See the end of this page for a precise description.
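
To give a taste of it anyway, here is a sketch in Python of the usual conversion formulas, with the standard magic numbers (the function name is invented; treat this as an illustration rather than a reference implementation):

def raw_jis_to_sjis(j1, j2):
    """Convert a raw JIS X 0208 byte pair to a Shift-JIS byte pair."""
    if j1 % 2:                    # odd first byte
        s1 = (j1 + 1) // 2 + 0x70
        s2 = j2 + 0x1F + (1 if j2 >= 0x60 else 0)  # hop over 0x7F
    else:                         # even first byte
        s1 = j1 // 2 + 0x70
        s2 = j2 + 0x7E
    if s1 >= 0xA0:
        s1 += 0x40                # skip the JIS X 0201 katakana range
    return bytes([s1, s2])

assert raw_jis_to_sjis(0x24, 0x22) == "あ".encode("shift_jis")  # b'\x82\xa0'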

As if that weren't bad enough, there exists a proprietary extension of Shift-JIS from Microsoft, called CP932 (a.k.a. Windows-31J). It is a perfect superset of the standard, so for most practical purposes it can be considered identical to ordinary Shift-JIS. It contains the NEC/IBM extended characters mentioned earlier. The extension consists of:

NEC special characters (83 characters in SJIS row 13),
NEC-selected IBM extended characters (374 characters in SJIS rows 89..92),
and IBM extended characters (388 characters in SJIS rows 115..119).

Since Windows is so pervasive in Japan, Japanese people often really mean "CP932" when they say "Shift-JIS". Note that Apple also has its own proprietary extension, which is to my knowledge very similar to CP932.

Shift-JIS's standard MIME label is "Shift_JIS".

JIS X 0213 note: The new JIS X 0213-compatible extension of Shift-JIS, known by its (nonstandard) MIME label "Shift_JISX0213", can also be used to encode both planes of JIS X 0213, but still can't encode JIS X 0212.


2.2.3 ISO-2022-JP

Ways to recognize this encoding
  1. If you see an escape byte (0x1B) followed by ( or $.
  2. If it looks like garbled 7-bit ASCII characters.

The most widely supported encoding for e-mail is the ancient, 7-bit ISO-2022-JP standard, which has been used for Japanese e-mail since the beginning. This encoding is almost certain to be understood by a Japanese recipient. It has also been standardized by the JSA under the name "JIS X 0202", but nobody calls it that.

ISO-2022-JP is essentially a mix of plain ASCII and raw, 7-bit JIS. Like all JIS encodings, it stores 16-bit characters in big-endian byte order. Escape codes are used to switch between the conflicting character sets.

There are 3 revisions of ISO-2022-JP. Their MIME labels are "ISO-2022-JP", "ISO-2022-JP-1", and "ISO-2022-JP-2".

ISO-2022-JP - Supports ASCII and JIS X 0208.
ISO-2022-JP-1 - Also supports JIS X 0212.
ISO-2022-JP-2 - Also supports other languages like Chinese and Greek.

The original ISO-2022-JP will be good enough most of the time. For maximum compatibility, you should prefer it to the others.

Here are the relevant escape sequences:
ISO reg#  Character set        ESC sequence  Standard?
6         ASCII                ESC ( B       ISO-2022-JP
42        JIS X 0208-1978      ESC $ @       ISO-2022-JP
87        JIS X 0208-1983      ESC $ B       ISO-2022-JP
none      JIS X 0201 katakana  ESC ( I       Nonstandard
14        JIS X 0201-Roman     ESC ( J       ISO-2022-JP
159       JIS X 0212-1990      ESC $ ( D     ISO-2022-JP-1

The text begins in ASCII by default, and you must switch back to ASCII when you are finished. It is also recommended for newlines to always be encoded in ASCII. If this is e-mail, you must also use the escape codes in your Subject or From lines if they contain Japanese, again switching back to ASCII when done.

ESC ( B should be preferred over ESC ( J. The latter is a legacy code whose use is discouraged today. Also, avoid ESC ( I unless you *really* want those half-width katakana.

Note that by adding escape codes by hand, it is possible to easily send ISO-2022-JP encoded e-mail with any e-mail client, even if it has no support at all for Japanese.
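
For instance, assuming a Python 3 interpreter is at hand, its built-in iso2022_jp codec shows the escape structure directly:

data = "Hello, こんにちは!".encode("iso2022_jp")
print(data)
# b'Hello, \x1b$B$3$s$K$A$O\x1b(B!'
# ESC $ B (b'\x1b$B') switches to JIS X 0208, the five kana appear as
# their raw 7-bit JIS bytes ("$3$s$K$A$O"), and ESC ( B switches back
# to ASCII before the final "!".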

ISO-2022-JP-2 has a variety of other escape codes, having been extended to support random other languages. As this doesn't concern me, I'm not going into it here. See RFC1554 if you're interested.

JIS X 0213 note: There is a new encoding for JIS X 0213, known by its nonstandard MIME name "ISO-2022-JP-3". It is identical to the original ISO-2022-JP, with the following additional codes.

ISO reg#  Character set       ESC sequence  Standard?
none      JIS X 0213 plane 1  ESC $ ( O     ISO-2022-JP-3
none      JIS X 0213 plane 2  ESC $ ( P     ISO-2022-JP-3


3 Unicode

Unicode is a character set begun circa 1990 with great fanfare. It is designed to be the ultimate character set, assigning a unique code to every character in every living writing system. Unicode is a very important technology, and it will likely be the standard character set used worldwide in the future. Therefore, Unicode encodings are worth learning even though they are not very common in today's Japanese text.

Unicode, the character set, is sometimes called UCS (Universal Character Set), and encodings of Unicode are called UTF-something. Pedants draw the distinction, but in casual speech the two terms are interchangeable.

Many people, when they first hear about Unicode, assume that Unicode encoding is as simple and clean as ASCII, except that each character code maps to 2 bytes instead of just 1. This is wrong. Wrong, wrong, wrong! If anybody tells you this, whack them with a cluebat. Unicode encodings are much more complicated than this. Making it as simple as ASCII was the naive idea when Unicode was first conceived, but then reality caught up with the standard. There are now no less than 3 popular encodings for Unicode.

Unlike JIS, UCS is not organized in a grid. It is a flat, one-dimensional sequence, with character codes ranging from 0 to 0x10FFFF (note that a code in this range can almost, but not quite, fit into 20 bits of storage). Each writing system has its own range of codes.

The creators of Unicode thought when they started that 16 bits would be enough to contain all useful writing systems, but they were wrong. The latest Unicode standard goes up to (a little more than) 20 bits, and a kludge was designed to fit the new high-plane characters into what was previously 16-bit-only text (UTF-16, described below). Unicode is now separated into 17 planes, from Plane 0 to Plane 16, the plane number being the value of the bits above the low 16.

Plane  Content
0      Basic Multilingual Plane (BMP)
1      Obscure character sets like ancient Egyptian
2      Obscure kanji
3-13   Unused
14     Meta-characters
15-16  Private-use

Here are the character ranges of interest to us:

Range            Content
0x0020-0x007F    ASCII
0x3000-0x303F    Japanese-style punctuation
0x3040-0x309F    Hiragana
0x30A0-0x30FF    Katakana
0xFF00-0xFFEF    Full-width roman characters and half-width katakana
0x4E00-0x9FAF    CJK unified ideographs - Common and uncommon kanji
0x3400-0x4DBF    CJK unified ideographs Extension A - Rare kanji
0x20000-0x2A6DF  CJK unified ideographs Extension B - Very rare kanji


3.1 Endianness and the BOM

Unlike all the JIS-based standards, which have the guts to enforce big-endian encoding for all, Unicode panders to the little-endian people. Although endianness problems don't appear in UTF-8, which is 8-bit based, they rear their ugly head in UTF-16 and UTF-32, both of which can be either big-endian or little-endian.

Endianness, or byte order, means the order of the bytes when a 16-bit or higher integer is changed into a series of 8-bit bytes. There are two commonly used orders: big-endian is the order that makes sense, and little-endian is the order that makes no sense. Observe:

In big-endian:
0x1234 -> 0x12, 0x34
0x12345678 -> 0x12, 0x34, 0x56, 0x78

In little-endian:
0x1234 -> 0x34, 0x12
0x12345678 -> 0x78, 0x56, 0x34, 0x12

Since the x86 architecture sucks, x86 PCs are little-endian internally. However, big-endian is widely used for Internet transmission and on better-designed CPUs.

Thus, both UTF-16 and UTF-32 have 3 MIME labels: "UTF-16", "UTF-16BE", "UTF-16LE", and "UTF-32", "UTF-32BE", "UTF-32LE". "BE" and "LE" stand for big-endian and little-endian, respectively.

When one of the qualified MIME labels is used (and this is preferable), there is no ambiguity, and that is the end of the discussion. But when the ambiguous "UTF-16" and "UTF-32" labels are used (or when the Unicode is found in a file with no external meta-data), the byte order can be either. How do we tell which it is?

The answer is the BOM (Byte Order Mark), a two-byte (in UTF-16) or four-byte (in UTF-32) code which can optionally be put at the beginning of a serialization (i.e. file) to specify its endianness. The BOM is considered meta-data, and not part of the actual Unicode text.

The BOM has 4 different forms:
BOM                     Meaning
0xFE, 0xFF              Big-endian UTF-16
0xFF, 0xFE              Little-endian UTF-16
0x00, 0x00, 0xFE, 0xFF  Big-endian UTF-32
0xFF, 0xFE, 0x00, 0x00  Little-endian UTF-32

Note that the code U+FEFF (what the BOM would be if it was a real Unicode character instead of meta-data) is the ZERO WIDTH NO-BREAK SPACE. Not by coincidence, this is an invisible character which does absolutely nothing, so that putting a BOM outside the first character of a file by mistake (for example, by naively concatenating two Unicode files together) will generally have no serious consequences. The code 0xFEFF is always interpreted as a BOM in the first position of the file, and always interpreted as a ZERO WIDTH NO-BREAK SPACE anywhere else.

And if the file has no BOM? You'll have to ask the user or use some other heuristic.
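
Sniffing the byte order from the table above takes only a few lines; here is a minimal Python sketch (sniff_bom is an invented name; note that the little-endian UTF-32 BOM begins with the little-endian UTF-16 BOM, so UTF-32 must be tested first):

def sniff_bom(data):
    """Guess the Unicode serialization of 'data' from its BOM, if any."""
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "UTF-32BE"
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "UTF-32LE"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    return None  # no BOM: fall back to asking the user or a heuristic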

If all this makes you wince, you're not alone. It's hard to see what conceivable reason the Unicode committee found to do things this way. All these issues could have been avoided if big-endianness were simply obligatory, like in JIS --- as far as I know, this JIS policy has never caused any problems. The existence of these infuriating endianness issues in UTF-16 is one of the reasons why I generally prefer to use UTF-8.


3.2 Unicode-based encodings


3.2.1 UTF-8 (a.k.a. UTF-2, UTF-FSS)

Ways to recognize this encoding
  1. If common characters like kana or basic kanji take up 3 bytes.
  2. If you spot the telltale pattern in the high bits: "1110", "10", "10"

UTF-8 has the following good properties, which make it popular among the Unix crowd (summarized from the UTF-8 and Unicode FAQ):

  • ASCII characters encode as themselves, one byte each, so valid ASCII text is already valid UTF-8.
  • The bytes 0x00-0x7F never appear inside a multibyte sequence, so byte-oriented code that scans for ASCII characters such as '/' or NUL keeps working.
  • The first byte of each character announces how many bytes follow, so the encoding is self-synchronizing: a program can find the start of the next character from any point in a stream.
  • Sorting UTF-8 strings bytewise gives the same order as sorting by Unicode code point.

The encoding is as follows. Take the UCS code in big-endian order and map its bits to the 'x'es in the following table:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The only thing to remember here is that it is an error to use more bytes than the minimum, and programs may bomb if they see this (the reason for this rule is that we want each character to have only one representation). Note also that most of the BMP (Basic Multilingual Plane) takes 3 bytes to encode, so even kana will take up 3 bytes in UTF-8.
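
Here is the table turned into a minimal Python sketch (utf8_encode is an invented name) that emits only the shortest form, as the rule above requires:

def utf8_encode(cp):
    """Encode one code point per the bit-pattern table above."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

assert utf8_encode(0x3042) == "あ".encode("utf-8")  # kana need 3 bytes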

UTF-8 has no endianness problems. Although it is somewhat complicated and inefficient, UTF-8 is clean, unambiguous, and interoperates well with legacy systems. Of all the encodings described here, I find it the most appropriate for interchange on the Internet.

UTF-8 uses "UTF-8" as its standard MIME label.


3.2.2 UTF-16 (a.k.a. UCS-2)

Ways to recognize this encoding
  1. If ASCII characters look like: [null] H [null] e [null] l [null] l [null] o
  2. If the file begins with the BOM, namely 0xFEFF or 0xFFFE.
  3. If a kana is in the range 0x3040-0x30FF.

UTF-16 is the "obvious" encoding of Unicode. The code of each Plane 0 character is translated directly to a 16-bit integer.

Before the addition of the new planes, this was the whole story, but now we need a kludge to reach the higher-plane characters. To access them, we exploit the reserved codespace of Unicode between 0xD800 and 0xDFFF inclusive. There are no characters in this range, which leaves us 11 bits of free codespace per 16-bit code.

If we omit plane 0, any Unicode character in the higher planes can be expressed in 20 bits. Thus we can express any higher-plane character using 2 16-bit codes between 0xD800 and 0xDFFF, of which the 10 lowest bits are used in each to encode half the character --- always in big-endian order (high 10 bits are in the first code), regardless of the endianness of the file itself.

The Unicode character number is not encoded directly in these 2 sets of 10 bits: we need to subtract 0x10000 first. As a result, we can't use Plane 0 with this scheme, but Plane 16 becomes available to us. I am skeptical of the wisdom of this (do we really need to go to this extra trouble just to have one more plane?), but in any case we're stuck with it.

What do we do with the unused 11th bit (0x0400)? We use it to increase the robustness of UTF-16 text. In the first 16-bit code, the 11th bit is always 0, and in the second 16-bit code, it is always 1. That way, if a file is for example truncated in the middle of a higher-plane character, a program can easily tell that it is reading the second half, and not stupidly corrupt the character after it.
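
In code, the surrogate arithmetic comes out to a few lines; a minimal Python sketch (to_surrogates is an invented name; U+20000, the first Extension B kanji, serves as a check):

def to_surrogates(cp):
    """Split a code point above Plane 0 into a UTF-16 surrogate pair."""
    v = cp - 0x10000                # 20 bits remain
    high = 0xD800 | (v >> 10)       # first code: the 0x0400 bit is clear
    low = 0xDC00 | (v & 0x3FF)      # second code: the 0x0400 bit is set
    return (high, low)

assert to_surrogates(0x20000) == (0xD840, 0xDC00)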

This is a fairly well-designed kludge. The only big danger with this scheme is that so few characters require 32 bits that some programs may falsely assume that all characters are 16 bits, and then explode when they encounter one of the rare 32 bit characters.

Important note: UTF-16 is affected by endianness issues. See section 3.1 for details.

UTF-16 was adopted by Microsoft for use in their APIs. It seems the Microsoft and Unix worlds will diverge on this issue, with UTF-16 popular on Windows and UTF-8 popular on Unix.

UTF-16 has three possible MIME labels. They are "UTF-16BE" for big-endian byte order, "UTF-16LE" for little-endian, and just "UTF-16" for ambiguous (not recommended).

See RFC2781 for an official standard.


3.2.3 UTF-32 (a.k.a. UCS-4)

Ways to recognize this encoding
  1. If each character takes 4 bytes, mostly nulls.

UTF-32 is the simplest of all the Unicode encodings, but also the least efficient. In UTF-32, every character is a direct translation of its Unicode character code to a 32-bit integer. That's all there is to say about it. The only tricky part is the endianness issues (see section 3.1).

Unfortunately, UTF-32 is not widely used for interchange because it is very wasteful of space. UTF-16 always outperforms or matches the space performance of UTF-32, usually by a factor of 2. UTF-8 can outperform it by a factor of 4. In any given UTF-32 file, most high bits will be all zeroes, because the vast majority of Unicode characters (including kanji) are in Plane 0, the one that can be encoded with only 16 bits. However, its simplicity makes it very appropriate for internal representation inside programs, which is the main justification for its existence.

UTF-32 has three possible MIME labels. They are "UTF-32BE" for big-endian byte order, "UTF-32LE" for little-endian, and just "UTF-32" for ambiguous (not recommended).

See the Unicode Standard Annex #19 for the official standard.


3.3 Dead Unicode encodings

These encodings never really caught on. This section is mostly of historical interest, for those who have seen these names around and would like to know what they are. The practical-minded can skip over it.


3.3.1 UTF-7

Ways to recognize this encoding
  1. If Japanese text looks like +g4rBlEdL3TtErS-

UTF-7's major property and reason for existence is that it is 7-bit, and thus safe for e-mail. But because MIME Content-Transfer-Encodings already make it possible to send 8-bit encodings through e-mail anyway, it never caught on.

UTF-7, like its JIS sibling ISO-2022-JP, has the property that most ASCII text can be plunked right into it without modification, and an escape character is used to indicate the beginning of "real" Unicode. But instead of using the ESC control code to do this, UTF-7 simply uses the + character. The text following the + is big-endian UTF-16 encoded in a close variant of Base64.

(In a nutshell, Base64 encodes 24 bits into 4 bytes, spreading it out to 6 bits per byte. Each of these bytes is mapped into the following alphabet:

      Value Encoding  Value Encoding  Value Encoding  Value Encoding
           0 A            17 R            34 i            51 z
           1 B            18 S            35 j            52 0
           2 C            19 T            36 k            53 1
           3 D            20 U            37 l            54 2
           4 E            21 V            38 m            55 3
           5 F            22 W            39 n            56 4
           6 G            23 X            40 o            57 5
           7 H            24 Y            41 p            58 6
           8 I            25 Z            42 q            59 7
           9 J            26 a            43 r            60 8
          10 K            27 b            44 s            61 9
          11 L            28 c            45 t            62 +
          12 M            29 d            46 u            63 /
          13 N            30 e            47 v
          14 O            31 f            48 w         (pad) =
          15 P            32 g            49 x
          16 Q            33 h            50 y

UTF-7's variant is different only insofar as the pad character '=' is not used.)

When the UTF-16 is finished, the - character acts as terminator. Any character not in the Base64 list will also work as terminator, but only - is swallowed. Therefore, UTF-7 encoded Japanese text will look something like +6shGa5Hp-, or just +6shGa5Hp. An example given in the standard is that the Unicode sequence "Hi Mom -<WHITE SMILING FACE>-!" becomes "Hi Mom -+Jjo--!".

As this encoding is not widely supported and won't ever be, DO NOT use it in your outgoing mail.

UTF-7 uses "UTF-7" as its standard MIME label.

See RFC2152 for the official standard.


3.3.2 SCSU

Another failed encoding scheme whose only strong point is efficiency. SCSU is actually more of a compression scheme than an encoding. The gist of it is that it switches into small 128-character "windows" for the duration of a string, within which characters are encoded only by their offset to the beginning of the window rather than their entire code. Needless to say, this method is worthless for encoding kanji, which are spread out all over the character set. SCSU didn't catch on because, after all, if you care so much about efficiency you might as well just gzip your text instead of using this kludge.


3.3.3 Other dead encodings

You may occasionally hear about "UTF-1" and "UTF-7,5". UTF-1 is an inferior encoding proposed in the early days of Unicode but never much used, now completely superseded by UTF-8. "UTF-7,5" is a variant of UTF-8, proposed long after UTF-8 had already become standard. It offers a few small advantages over UTF-8, but not enough to merit switching over to it. You should never see either of these in the wild.


4 Encodings in the wild


4.1 Conversion issues

The list of Japanese characters in Unicode was ripped straight from JIS, so JIS can be converted into Unicode without many problems. But don't expect perfectly sensible conversions. CP932, Microsoft's proprietary extension of Shift-JIS, causes the most trouble. It has duplicates and bad mapping tables that don't do proper round-tripping (if you convert something from Unicode to CP932 and back, you'll get something different from what you started with --- kind of like Babelfish).

The best conversion tool I know of is iconv, distributed freely by GNU at gnu.org under the LGPL. It supports *all* important encodings described in this document, even ones as obscure as UTF-7 and the new JIS X 0213 encodings. It can convert from any encoding to any other encoding. It is available as a command-line tool for Unix or as a C library for both Windows and Unices.

If you don't have access to a Unix machine and don't have the time or skills to program a frontend for iconv, consider also nkf, the "Network kanji filter", which is available for Windows. It only supports EUC, Shift-JIS and ISO-2022-JP.
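
If neither tool is at hand, most modern languages can handle the common cases through their own codec libraries. For example, this minimal Python sketch (the file names are placeholders) converts a Shift-JIS file to UTF-8:

# Read a Shift-JIS (strictly, CP932) file and rewrite it as UTF-8.
with open("input_sjis.txt", encoding="cp932") as src:
    text = src.read()
with open("output_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(text)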


4.2 Text editors and Vim

There is no shortage of easy-to-configure Japanese-specific editors. If you're not fanatical about your text editor, consider NJStar or one of the gazillion editors called "jEdit".

However, those of us who have become addicted to Vim or Emacs would like to continue using our favorite editor with Japanese. I don't use Emacs, so I can't offer any help (please e-mail me if you would like to contribute your Emacs configuration).

The latest version of Vim, Vim 6.1, supports the Microsoft Windows IME, my preferred method for inputting Japanese. If you want to use it, you will need to recompile vim to add support for the features you need, as they are still experimental (but I have had no problems). Also, you need to install the iconv library (be careful! if you don't have it, vim will not give a clear error message). For anyone interested, here is my configuration, which lets me work in Win32 gvim with UTF-8, Shift-JIS, EUC and French extended ASCII:

" *** Japanese language support ***
" - "cp20932" is EUC (MS calls it 'JIS X'), but it seems to work only
"   some of the time.  iconv() supports "euc-jp" so that's what I use.
" - "utf-8" for unicode
" I don't know how to read plain JIS with vim.  Convert it to EUC with
" nkf.

" Make sure to compile vim with support for this
if has('multi_byte_ime')
	set encoding=utf-8

	" Use the command :J to switch to a Japanese font
	command Japfont set guifont=MS_Gothic:h11:cSHIFTJIS
	command Normalfont set guifont=

	" termencoding should be latin to type French accented characters
	set termencoding=latin1

	" Notes:
	" - fileencodings only works on systems with the iconv library installed.
	"   All systems I know of do *not* come with iconv installed by default.
	"   If you don't have it, you will get a mysterious conversion
	"   error with no explanation.
	" - latin1 is incompatible with utf-8 when french accents are used
	" - since any file is valid latin1, it should always be last in the list
	set fileencodings=utf-8,euc-jp,sjis,latin1

	" set default input mode to English, not Japanese
	set iminsert=0  
	set imsearch=0
	
	" light gray cursor = English
	" purple cursor = Japanese
	highlight Cursor guibg=LightGray guifg=black
	highlight CursorIM guibg=Purple guifg=black

	autocmd BufRead * Normalfont
	autocmd BufRead *.{sjis,sjs,euc,jis} Japfont
endif


4.3 On the web

Japanese web pages are mainly encoded in Shift-JIS, EUC or UTF-8. Shift-JIS is by far the most popular, but EUC is found on some Unix sites. As for Unicode, you will normally find it only on multilingual sites that provide Japanese as one option among many (such as Google). This state of affairs is a bit depressing --- if even the Japanese don't want to support Unicode, who will? --- but since UTF-8 is supported on all popular web browsers (i.e. Internet Explorer), hopefully it will catch on and grow more widespread with time. In principle it would be possible to see ISO-2022-JP on web pages, since Internet Explorer supports it, but I've never seen it in the wild. I would guess that the only reason IE supports it is for e-mail rendering in Outlook.

The MIME label for the encoding is normally specified by the web server inside the HTTP headers. If you don't have access to your web server configuration, the second-best is to specify it with a META tag in your HTML head section. Strictly speaking, the HTTP header is supposed to take precedence over the META tag when the two disagree, and (infuriatingly) browsers are not always consistent about either. The tag should look like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

In fact, it can't hurt to use such a META tag in all your Japanese HTML even if your web server is well-configured. That will make it easier to switch to another web server, and someone looking at the source will immediately know which encoding you are using, without needing to resort to the rules of thumb in this document.

In addition to the above encodings, HTML allows you to enter Unicode characters by number, using only plain ASCII: for example &#xABCD; (including the semicolon) for the character with the hex code ABCD. This is the most reliable method and nice when you only want to use a few Japanese characters in an otherwise English document (and thus it's the one I used for the little bit of Japanese in section 1.2), but it can be painful to type.


4.4 In e-mail

ISO-2022-JP uses only 7 bits and was designed specially for e-mail. It's the most popular encoding for e-mail, and as far as I know all Japanese cellphones use ISO-2022-JP, so in most cases this is what you want and what you'll see in the wild. If for some reason you want to try another encoding, read on.

E-mail presents special difficulties, because (apparently) some archaic e-mail relays support only transfers of 7-bit data. If a message contains a byte where the 8th bit is set, it could potentially be corrupted or rejected. Since most of the above methods exploit the 8th bit of each byte, it might not work to naively put Japanese text encoded using one of them directly in an email.

However, this restriction can be circumvented. Email already provides means of getting around the 7-bit problem, for example for binary attachments. The MIME standard for e-mail headers provides, in addition to a means of specifying the text encoding, a "Content-Transfer-Encoding", a standard means of re-encoding text to avoid using the 8th bit. For example, if you want to send a Shift-JIS encoded e-mail, you might specify this in your header:

Content-Type: text/plain; charset=Shift_JIS
Content-Transfer-Encoding: base64

The text of your mail would first be encoded in Shift-JIS, and then this encoded text would in turn be encoded in Base64 (see section 3.3.1 on UTF-7 for an explanation of Base64). This method could potentially be used to send mail using any of the codings described above. A detailed discussion of MIME content transfer encodings is outside of the scope of this document; see RFC1521 for the MIME standard.
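
Assuming Python, such a message can be built by hand in a few lines (a sketch only; a real mailer would use a proper mail library and fold long Base64 lines):

import base64

# Shift-JIS bytes are 8-bit, so re-encode them in 7-bit-safe Base64.
body = base64.b64encode("こんにちは".encode("shift_jis")).decode("ascii")

message = ("Content-Type: text/plain; charset=Shift_JIS\r\n"
           "Content-Transfer-Encoding: base64\r\n"
           "\r\n"
           + body + "\r\n")
print(message)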

So, if your correspondent for some reason doesn't support ISO-2022-JP, you can try using this method alongside Shift-JIS for your e-mail. This is a good bet for Windows and Mac users, since Shift-JIS is the standard encoding of those systems. As for EUC-JP and Unicode, they are at present rarely supported and I would advise against them. As much as we all love Unicode, don't use it for your e-mails or your recipient very likely will only see a garbled mess of characters.