2006-08-26 20:16 EEST

The UTF-8 monoculturists

Over the last year or so, as UTF-8 has finally started to gain some acceptance, I've run into a lot of UTF-8 zealots who think that UTF-8 should be the single global one-size-fits-all standard; that it is the Final Encoding and there will be nothing after it. They seem to think that programs should assume that all input and output is, will be and should be UTF-8 or, if the program doesn't need to deal with individual characters, that it should ignore character sets and encodings altogether, assuming a single global standard: the UTF-8 monoculture.

Have they not learned that assumption is the mother of all fuck-ups? The world is still suffering from the ASCII assumption, the codepage 437 assumption, the codepage 850 assumption, the Latin1/ISO-8859-1 assumption, the Win-Latin1 assumption, and so on. Many or even most programs still don't properly support even all the 8-bit character sets. Sure, UTF-8/Unicode cover much more ground than these character sets, but are they perfect? While UTF-8 as an ASCII-preserving mapping from 16/32-bit numbers to 8-bit streams is quite nice indeed, the Unicode behind it is, to put it bluntly, awful shit. It has a lot of redundancy and is difficult to process. If programs need specific support even for UTF-8 to go beyond ASCII, why take away the option of switching to improved character mappings later by hard-coding the UTF-8/Unicode assumption?

What happened to one of the most useful recipes for good software design: abstraction? Programs should be encoding-agnostic as far as possible. They should use the encoding specified by the Unix locale in use for input and output, unless the input or output format or protocol can itself convey the encoding, or is specified (often a mistake in itself) to use a particular encoding and character set. Almost everyone in their right mind will use a UTF-8 locale for now (unless some obsolete but important piece of software has other requirements), but nobody should be forced to do so, neither now nor in the future. That way, if something better or more suited to some environment comes along, it will be easier to switch to it.
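
To make the point concrete, here is a minimal sketch in Python of what "use the locale's encoding" might look like. The helper names (`locale_encoding`, `read_text`, `write_text`) are made up for illustration, and the sketch assumes the interpreter is not running in a mode that overrides the locale (such as Python's UTF-8 mode); treat it as one possible shape, not a definitive implementation.

```python
import locale


def locale_encoding():
    """Return the character encoding named by the user's locale."""
    try:
        # Adopt whatever LANG / LC_* say in the environment.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # Unknown locale name: keep the locale the process already has.
        pass
    # False: do not let this call change the locale again on its own.
    return locale.getpreferredencoding(False)


def read_text(raw, encoding=None):
    """Decode input bytes; the locale decides unless the caller knows better."""
    return raw.decode(encoding or locale_encoding())


def write_text(text, encoding=None):
    """Encode output; characters the target cannot represent become '?'."""
    return text.encode(encoding or locale_encoding(), errors="replace")
```

The explicit `encoding` argument is the escape hatch for formats and protocols that convey their own encoding; everything else follows the locale, so switching encodings later means changing the locale, not the program.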

Likewise, programs, formats and protocols that are used to transfer data between people with potentially differing locales (and the reality is polycultural despite the aspirations of the UTF-8 imperialists) should either include information about the encodings used in their (own) data formats and store data in the locale's encoding or some other encoding, or at least fix the encoding in the specification of the format and convert any input in other encodings to this encoding. In practice, for now, the latter option implies the use of UTF-8 or 16/32-bit Unicode, but only for the format. Refusing minimal locale support and encoding conversions, where possible and useful, and assuming instead that everyone who wishes to partake in an exchange uses the same locale and encoding, should not be an option, not if diversity has any value to you. And if that is not the case, I don't want to have anything to do with you or your programs.
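
The two options for exchange formats can be sketched roughly as follows, again in Python. The one-line `encoding: NAME` header is a toy format invented for this example, and `pack`, `unpack` and `transcode` are hypothetical names; a real format would carry its encoding label in whatever way its own specification defines.

```python
HEADER = "encoding: "


def pack(text, encoding="utf-8"):
    """Option 1: the data carries a label naming its own encoding."""
    header = (HEADER + encoding + "\n").encode("ascii")
    return header + text.encode(encoding)


def unpack(blob):
    """Read the label back and decode the payload with it."""
    header, _, payload = blob.partition(b"\n")
    label = header.decode("ascii")
    if not label.startswith(HEADER):
        raise ValueError("missing encoding header")
    return payload.decode(label[len(HEADER):])


def transcode(raw, source, target="utf-8"):
    """Option 2: the format fixes one encoding; convert other input to it."""
    return raw.decode(source).encode(target)
```

For example, `unpack(pack("p\xe4iv\xe4\xe4", "iso-8859-15"))` gives the original string back regardless of the reader's locale, and `transcode(b"\xe4", "iso-8859-1")` converts a Latin-1 byte to its two-byte UTF-8 form. Either way the program does the conversion at the boundary instead of assuming that both ends share a locale.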


Posted by tuomov | Categories: Information technology