Re: Unicode Encoding

Evan Champion
Sun, 03 Aug 1997 17:02:33 +0100

Martin J. Duerst wrote:
> The problem is that MIME is extremely strict on text messages
> (i.e. anything with text/* as a Content-Type). In particular,
> "binary" is not allowed (while "8-bit" is, but UCS-2 would need
> "binary"), and lines have to be only so-and-so long and have to
> end with CRLF. In UCS-2, however, this gets expanded to
> NUL CR NUL LF, and therefore it's impossible to send UCS-2
> as text in emails.

The MIME strictness isn't so much of a problem IMHO, but the end of line
is. Every server works on a line basis, and they'll never find the EOL
if it is NUL CR NUL LF.
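
To make the end-of-line problem concrete, here is a rough Python sketch. The split_lines helper is hypothetical and only stands in for a line-oriented server; big-endian byte order without a BOM is assumed.

```python
# Sketch only: split_lines is a made-up stand-in for a line-based server.
text = "line one\r\nline two\r\n"

# For characters in the Basic Multilingual Plane, UTF-16BE produces the
# same bytes as big-endian UCS-2.
ucs2 = text.encode("utf-16-be")

def split_lines(data: bytes) -> list[bytes]:
    """Split on the 8-bit CR LF pair, the way a line-oriented server would."""
    return data.split(b"\r\n")

ascii_lines = split_lines(text.encode("ascii"))
ucs2_lines = split_lines(ucs2)

# CR LF in UCS-2 is NUL CR NUL LF, so the 8-bit pair never appears.
crlf_ucs2 = "\r\n".encode("utf-16-be")

print(len(ascii_lines), len(ucs2_lines), crlf_ucs2)
```

The ASCII article splits into its lines (plus an empty trailing piece), while the UCS-2 article comes back as one unbroken chunk.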

Unfortunately, the Unicode code point U+0D0A is assigned in the Malayalam
block, so we couldn't really force 8-bit CR LF as the line terminator
irrespective of the character set. Then again, how many people are
posting messages in Malayalam, and how many would otherwise benefit from
UCS-2 encoding?
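
A small Python check of the collision, again assuming big-endian UCS-2 with no BOM. The sample string is invented purely for illustration.

```python
import unicodedata

# U+0D0A sits in the Malayalam block.
letter = "\u0d0a"
letter_name = unicodedata.name(letter)

# In big-endian UCS-2 its two bytes are 0x0D 0x0A, i.e. an 8-bit CR LF.
letter_bytes = letter.encode("utf-16-be")

# A made-up line with the letter in the middle of it.
sample = "ab" + letter + "cd"
pieces = sample.encode("utf-16-be").split(b"\r\n")

print(letter_name, letter_bytes, pieces)
```

A server that treats every 0x0D 0x0A byte pair as a line terminator would cut this single line in two and lose the letter in the process.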

We're starting to talk about the server having to deal with articles
entirely as binary, not text. That is a huge shift, and one I don't
think NNTP is prepared to handle.

> This is an interesting idea, and something I see for the first time
> in connection with character encodings. It's worth examining.
> But it's probably doomed, because:
> - It puts a lot of weight on correctly working software, even
> after the incorrect ones already have been weeded out.
> - Once UCS-2 gets seriously used, problems will be very quickly
> detected. If a line is eaten every 1000 messages, that's
> difficult to detect, but if whole messages get truncated
> all the time, such an alert device is not necessary.

Even beyond character set issues, I think a simple article checksum is
very useful. At some point, signed messages may take care of this, but
verifying S/MIME signatures would be _very_ heavy for a server. MD5 is
really quite light in the grand scheme of things.
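
As a minimal sketch of what such a checksum could look like, here it is in Python. The function names are hypothetical, and how the digest would travel with the article (a header, for instance) is left open since nothing has been specified.

```python
import hashlib

def article_digest(body: bytes) -> str:
    """Return the MD5 digest of the article body as a hex string."""
    return hashlib.md5(body).hexdigest()

def verify_article(body: bytes, expected: str) -> bool:
    """Check a received body against the digest computed at posting time."""
    return article_digest(body) == expected

original = b"First line\r\nSecond line\r\nThird line\r\n"
posted_digest = article_digest(original)

# Simulate a relay that silently eats one line.
truncated = b"First line\r\nThird line\r\n"

print(verify_article(original, posted_digest))
print(verify_article(truncated, posted_digest))
```

This catches accidental damage in transit; it says nothing about who wrote the article, which is the part signatures would add.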

On servers that can only handle 7 bit, and I'm sure they're out there,
even UTF-8 is unsafe. Sure, the broken servers will be found, but
there's no requirement that they be fixed, and more importantly in the
meantime my messages are being trashed. My expectation is that any
message posted will arrive at all recipients fully intact. I don't
think that is unreasonable.
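
To illustrate the 7-bit case, here is a Python sketch that assumes the server simply clears the high bit of every byte. That is only one possible failure mode, and strip_to_7bit is a hypothetical helper.

```python
def strip_to_7bit(data: bytes) -> bytes:
    """Clear the high bit of every byte, as a 7-bit-only relay might."""
    return bytes(b & 0x7F for b in data)

plain = "cafe".encode("utf-8")
accented = "caf\u00e9".encode("utf-8")  # e-acute is 0xC3 0xA9 in UTF-8

plain_out = strip_to_7bit(plain)
accented_out = strip_to_7bit(accented)

print(plain_out, accented_out)
```

Pure ASCII passes through untouched, but the two-byte UTF-8 sequence turns into two unrelated ASCII characters, and nothing in the article says it happened.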