I was going to blog about the fact that using org-mode, referred to in my previous
diary entry, made the important but non-urgent tasks
more visible.
I was going to use sorting out backups for interesting data
(say, my
e-mail archives) as an example, and discuss the solution I
came up
with, but I have just realised that the trust model is exactly
backwards (my server trusts root on the backup machine).
This is
annoying, because I thought I'd got it right, and because
getting it
right would have been equally easy.
Oh well. So, instead, I'll return sorting out backups to the
TODO (or maybe STARTED)
state, and
(prompted by some recent discussion on #lisp IRC)
I'll blog
about SBCL's interpretation of Unicode characters, with the
up-front caveat that I'm Not An Expert in all this: languages,
glyphs, graphemes, characters and all that jazz.
Common Lisp's string
type is defined to be a vector specialized to hold only
characters or
a subtype thereof. This definition is already hard to wrap
your head
around, and has amusing consequences documented
here in the past, but I don't want to get into it too
much; merely
to say that already this definition restricts to a fairly
large extent
the possible implementation strategies for supporting Unicode.
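For a flavour of the definition's quirks, here's a quick
illustration at the REPL (results as in SBCL, though any
conforming implementation should agree):

  (typep "hello" 'string)               ; => T
  ;; a general vector that merely happens to hold characters is
  ;; not a STRING, because it isn't specialized:
  (typep (vector #\a #\b #\c) 'string)  ; => NIL
  ;; whereas a vector with a character element type is:
  (typep (make-array 3 :element-type 'character
                       :initial-element #\a)
         'string)                       ; => T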
Why so? Because in Unicode there are several notions of
‘character’,
and we have to decide which of them we're going to use as
our Lisp character type (and hence as string
constituents). The simple answer from the implementation
point of
view (and the route that SBCL currently takes) is to define
a Lisp
character as an entity corresponding directly to a
Unicode
code point. This is simple and straightforward to
implement, but
unfortunately has the side effect of making various Common Lisp
string functions less useful to the user.
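Concretely, at the SBCL REPL:

  ;; a Lisp character corresponds directly to a code point:
  (char-code #\LATIN_SMALL_LETTER_E_WITH_ACUTE) ; => 233, i.e. #xE9
  (code-char #xE9)     ; => #\LATIN_SMALL_LETTER_E_WITH_ACUTE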
How so? Well, consider the string comparison functions,
such as string=.
As specified, string= compares two strings,
character by character. In SBCL, then, this
compares two sequences of Unicode code points, character by
character,
for equality. The problem is that this operation doesn't in
general
have the semantics of ‘string equality’, because in Unicode
there is
more than one way to encode the same abstract
character: for
example, the e-acute ‘abstract character’, or possibly
‘grapheme’,
e-acute (which is usually displayed ‘é’) can be
represented
either as the single code point U+00E9, or as the
combining character sequence U+0065 U+0301.
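To make that concrete, here's a sketch at the REPL (both strings
would typically be rendered identically, as ‘é’):

  (let ((precomposed (string (code-char #xE9)))       ; U+00E9
        (decomposed  (coerce (list (code-char #x65)   ; U+0065, 'e'
                                   (code-char #x301)) ; U+0301, combining acute
                             'string)))
    (string= precomposed decomposed))
  ; => NIL, because codepoint-by-codepoint the strings differ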
So, that's OK; the Unicode FAQ on Combining
Marks
says that characters and combining character sequences are
different,
and even implies that programmers should be dealing with
Unicode code
points (SBCL characters). Unfortunately, Lisp has been
around for longer than Unicode, and code has been written
essentially
assuming that string= performs a language-string
equality
comparison rather than a codepoint-by-codepoint equality
comparison,
simply because (pre-Unicode) these two concepts were conflated.
What about the alternative? We could try defining Lisp
characters to be abstract characters, represented as
combining character sequences. One problem with this idea
is that
there's the char-code
function to implement: for every Lisp character
there must be
a corresponding unique integer. That's not so much
a problem
– Lisp has bignums after all – but it
will make
char-code-limit surprisingly large (in principle, I
think arbitrarily many combining marks can be applied to a given
base character).
This means that we'd lose the ability to represent an arbitrary
character as an immediate object, meaning that accessing
characters
from strings would in general cause heap allocation, and lead to
surprises elsewhere in the system.
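By contrast, the code-point interpretation keeps
char-code-limit small and fixed; at an SBCL REPL:

  char-code-limit ; => 1114112, i.e. #x110000, one more than
                  ;    the largest code point U+10FFFF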
So, given that we stay with a Lisp character
corresponding to
a Unicode code point, what other pitfalls and details are
there to
consider? The memory representation of strings of type
(simple-array character (*)) is worth
mentioning;
because there's a fairly strong cultural expectation of O(1)
access
time in vectors, we don't do any compression, but simply
store each
Unicode codepoint in a 32-bit cell. SBCL has a separate
base-string representation, where each ASCII
codepoint is
stored in an 8-bit cell; a long time ago I gave a
talk about this.
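Both representations are visible from type-of (the type names
here are as SBCL prints them):

  (type-of (make-string 5 :element-type 'character))
  ; => (SIMPLE-ARRAY CHARACTER (5)), 32 bits per element
  (type-of (make-string 5 :element-type 'base-char))
  ; => (SIMPLE-BASE-STRING 5), 8 bits per element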
Also, interpretation of the contents of strings has
caused
confusion recently. Granted that a string is a vector of
(effectively) code points, what does that mean for strings
containing
surrogate characters (code points in the range
U+D800–U+DFFF)? These code points do not
correspond to
any abstract characters directly; instead, pairs of
surrogates are (in certain Unicode encodings, such as UTF-16)
interpreted as characters beyond the Basic Multilingual
Plane. Some
Lisp implementations (such as OpenMCL, now Clozure Common
Lisp) go so far as to resolve this ambiguity by forbidding the
creation of a Lisp character with a surrogate codepoint. In
SBCL,
however, we take the view that those characters exist, but
should
not be interpreted in any way; a string containing
surrogate
pairs should be considered to have individual surrogate
characters in
it, and no attempt should be made to combine them. If there
is data
in an encoding which uses surrogate pairs (such as UTF-16),
then that
data should be read in using the :utf-16 external
format, so
that no surrogates are present at the Lisp level; an attempt
to write
out a surrogate Lisp character in a Unicode encoding should
generate
an error. (NB: not all of this is implemented yet).
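A sketch of how this is meant to look, assuming the :utf-16le
spelling of the external format, and with a made-up file name:

  ;; surrogate code points exist as uninterpreted characters
  ;; (printed representation as in SBCL):
  (code-char #xD800)  ; => #\UD800
  ;; but UTF-16 data should be decoded at the boundary, so that no
  ;; surrogates appear at the Lisp level:
  (with-open-file (s "data.txt" :external-format :utf-16le)
    (read-line s))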
All of this merely scratches the surface of Unicode support; I'm
hoping to find time to implement better support for looking up
character properties from the Unicode Character Database, and to
implement Unicode
algorithms
for normalization, collation and so on; I'm also planning to
tighten
up support for the Unicode encodings (to address the potential
security issues that exist from nonconforming decoders) and
generally
to improve support for doing useful things with non-ASCII.
As usual,
there's likely to be a significant lag between planning and
doing...