Markus Scherer, 2004-may-28
Binary data, that is, arbitrary data using unrestricted byte values, is frequently transported in text data formats where the syntax limits the available byte values. (Examples: Binary data in SMTP emails and in XML.) There are commonly used encodings for representing binary data in text; they map binary values to a small repertoire of characters that is available in almost all legacy codepages. Unicode presents an incentive and an opportunity for efficient encoding of binary data by using a larger repertoire of characters, resulting in a more efficient encoding than common methods.
The base64 encoding [RFC3548] segments binary data into 6-bit codes and represents each of them with one of 64 characters that are common to most legacy codepages including US-ASCII, namely lower- and uppercase latin letters and decimal digits, and a couple of punctuation characters. The repertoire is sometimes modified depending on the text format syntax.
This works well with legacy codepages where all of these characters are represented using single bytes. With 8-bit bytes and 6 bits per byte for the "payload", the efficiency is 75%.
Base64 is equally efficient in UTF-8, which is commonly used for Unicode text data exchange, but in many applications the base64 text will first expand into UTF-16, together with the rest of the document, before it is decoded into the original binary form. While in UTF-16 form, the effficiency drops to 6 bits per 16-bit code unit, or 37.5%. Using UTF-16 for the text data exchange as well, which is quite reasonable or even desirable for non-Latin scripts, yields this same low efficiency during transmission.
(A variant of base64 is actually part of the UTF-7 encoding for representing Unicode text for very restricted media.)
The large character repertoire of Unicode and ISO 10646 together with their near-universal use in modern software allow for very efficient encoding of binary data in Unicode. Such encoding can segment binary data into wider codes to use more of the code unit bits, using a large repertoire of character code points.
The character repertoire should have the following properties:
These properties point to using 16384 (=16*1024=16k) contiguous character code points out of the Unified Han repertoire. For the base16k encoding, the range U+5000..U+8fff is chosen to encode 14-bit codes from the segmented binary data. This yields an efficiency of 87.5% for the UTF-16 encoding form (for processing) as well as for the UTF-16, UTF-16BE/LE, SCSU and BOCU-1 charsets (for data exchange). When transformed into UTF-8, the efficiency is still 58.3% (14 bits per Han character encoded in 3 bytes), which is not as good as base64 in UTF-8, but much better than base64 in UTF-16.
As usual, the binary data bytes are encoded in memory order, and the most significant bit in the first byte is aligned with the most significant bit in the 14-bit code word, etc.
By segmenting the binary data into 14-bit codes, a byte boundary in the binary data corresponds to a code/character boundary every 7 bytes, or every 4 codes/characters. The following table shows how many Han character code points are used in the encoding, with the numbers being multiples of N (a whole number).
|Number of bytes||Number of characters|
Unlike in base64, where the number of characters per binary byte is unique (because the code segments are shorter than bytes), base16k cannot be unambiguously terminated with a character from outside the chosen repertoire. For example, 6 characters could encode either 9 or 10 bytes. It is possible to choose a secondary, distinct repertoire of 64 characters to carry the last up to 6 bits of the last binary byte if the number of bytes is not a multiple of 7. This is an unaesthetic complication.
A more elegant method to terminate the sequence precisely is to precede it with the number of binary bytes, which is anyway desirable to pre-allocate space in the decoder. For base16k, the encoding of the binary data with Han characters shall be preceded by the number of bytes expressed as a decimal number, using ASCII digits U+0030..U+0039 and no punctuation.
Further, to simplify decoding and transport, base16k shall be treated leniently:
It is possible to extend the encoding by adding further data fields, for example for a checksum. Such extensions are not essential to the encoding and thus left to higher-level protocols. The leniency of the decoder allows to append such data fields without disturbing the decoder.
Read Unicode characters or UTF-16 code units in memory order and process as follows. (Surrogates in UTF-16 can be ignored because supplementary code points are ignored.)
In theory, base16k works with legacy codepages which contain all of the necessary characters. By design, this is true for GBK/windows-936 and GB 18030. Since they also encode Han characters with 2 bytes each, they would achieve the same efficiency of 87.5%. However, due to the variations of conversion tables and the size of such tables and the performance of table-based conversion, the use of base16k with legacy codepages is not advisable.
[RFC3548] RFC 3548 - The Base16, Base32, and Base64 Data Encodings (http://www.faqs.org/rfcs/rfc3548.html)
Oren Ben-Kiki asking in 2001 whether a "base4096" exists: http://www.geocrawler.com/archives/3/12303/2001/5/0/5850798/
Rick Jelliffe arguing in 1998 that binaries in XML can use something like
base64, or "you might want to invent your own Base4K encoding": http://lists.xml.org/archives/xml-dev/199806/msg00378.html
John Cowan replied, proposing base-256 (with U+f000..U+f0ff).