Older blog entries for Ankh (starting at number 171)

Too many blogs. What to do? I wanted to put together some notes on buying and owning a home but rather than start a new blog, I am just making static Web pages for now. Once I'm up to a dozen or so pages I'll maybe rethink things.

Also working on two papers, one for the Unicode conference next Spring (on XQuery) and one for XTech 2006 (the conference is subtitled "Building Web 2.0").

fxn, yes, I agree strongly with Tim's comments there. Before the FSF started, the Unix community used to share "public domain" software. However, I should also give Richard and the FSF credit for a unifying vision of a complete freely-available operating system -- I'll say freely available because like Tim I think the politics of the Free part caused some problems.

I actually do support Free software, but I also prefer to try and form consensus and agreement with all parties, and the FSF at that time wasn't known for flexible compromise. I don't think there are easy answers, though.

Spent some time once more thinking about fonts and typography in the open source and Free software community. I wrote a little more of an article on the subject and then got distracted again.

It's hard to reconcile the feature needs of professionals with ease of use for others. The About Face book talks about perpetual intermediates being good people to bear in mind when designing software. I'd like to ask for more powerful font choosing software, but most people don't care about fonts enough to want the options. So the right answer is to make more of the font environment "just work".

Also remembered my xml blog which is rather barren right now. I need to say something controversial, such as Linux wears white socks or sort is better than cat because it has more options and then I'd get lots of comments. But then I'd wish I had adverts on my blog :-) Maybe I'll blog about efficient XML, but for now it's about opaqueness and meaning.

The adverts on fromoldbooks.org/ now more than pay for the hosting, plus a sandwich for lunch each day. It's a balance: I don't want them to be too obtrusive, although the Google ads have two advantages over others I've tried: they seem to cause Google to index your page more frequently, and they also display relatively interesting information, so I have kept them for now. I also noticed in trying another company that the impressions per day figures were very different: Google said I had many more page hits than the other company. Since I have access to my Web server's logs, Google won out.

TordJ, did you know that cellar door was the phrase that got Tolkien started down the path of the Elvish language and the Lord of the Rings? He loved the sound of it. So do I, at least when said with an English or Welsh accent: celadaw.

Binary XML Politics has been interesting of late. I ran sessions at a number of conferences on three different continents, and found that people attending were in favour of W3C defining a more efficient transfer mechanism for XML, but vigorously opposed to "binary XML".

The reasons for opposition varied widely. Very few were stated clearly or coherently, so it's difficult to agree or disagree with them. As best I understand it, people are concerned about W3C introducing a second representation of XML documents into a world that already has dozens of widespread representations and probably thousands all told.

For instance, as far as an XML processor that doesn't understand EBCDIC knows, an XML document marked as encoded in EBCDIC might as well be some form of binary goop -- it's perhaps well-formed XML, but the processor can't do anything with it at all, not even pass it back to an application or check to see if it's well-formed. It just rejects it.

There are already standards (more than one at ISO, at least one at IETF, probably others elsewhere) specifying ways for XML to be interchanged, e.g. over a network, in various non-textual forms, ranging from gzip to the ASN.1 used in Fast Infoset. Most of these will stick around for a long time, although perhaps some of them will be used much less often if W3C defines a spec for exchanging XML documents efficiently.

The politics is irritating because it seems to be based on spreading distrust rather than on technical arguments. Joe Gregorio wrote an article that doesn't allow comments back (fear? fear of spam? I don't know) but that seems pretty paranoid, says in essence (as I read it) "W3C is saying they are doing one thing but really doing something sinister and evil" without ever explaining why the thing is actually sinister or evil, and without justifying the claim in the slightest. I don't really know how to respond to paranoia apart from suggesting therapy and medical help. Of course, Joe could join the Efficient XML Interchange Working Group, but it's presumably more fun to make snide comments at a distance. I'm not sure saying "no, we're not doing something evil" would have much effect.

I'm singling out Joe here, whom I have never actually met. There are quite a few other people, some of whom I have met, and some of whom work for organizations with a reputation for spreading FUD, helping to make sure anyone with sensible arguments doesn't get heard. I've actually tried quite hard, as have others at W3C and elsewhere, to understand the arguments. I really have. I've flown to Japan, been to Europe and the US and Canada, spoken with (and listened to) many people, and in the end the strong, coherent, well-researched and technically supported arguments on the one side seem to me to outweigh the gibberish, emotional arguments and ranting on the other.

Even that doesn't mean the side who can communicate clearly is right in any useful sense, but only that I'm in a position to try to evaluate whether they are right. Nor would it be fair to paint everyone (on either side) with this over-simplifying brush. There have been clear arguments. I remember one from Michael Rys of Microsoft, for example, being very clearly stated and being against anything except defining any efficient format as a variant encoding (like <?xml version="goop1.0"?>) so that we are not partitioning the world into two camps, and not, as he put it (this from memory) weakening the foundations. It's an argument we heard clearly from a number of people and organisations, and have heeded.

I spent some effort this year to try and help XMLers have a clearer perception of what we're trying to do at W3C, and of the processes we use, some of which have come in part from the IETF, some from open source projects, some from ISO and other standards organisations, and some from within the W3C and its participants. I don't think W3C is perfect. Neither do I think all of our specs are good (although some are definitely above average as specs go). But neither are we evil demons seeking to destroy the XML we have created.

Oh well.

Transcriptions of texts from old books have interested me for years. I have had an eighteenth century dictionary of underworld slang on my Web site for several years now, and it gets quite a few hits, is linked to by Wikipedia, etc. I recently added a second one, by Captain Francis Grose; it's a little later, The Dictionary of the Vulgar Tongue. The interesting thing about this one is that Project Gutenberg has a text edition, so I wrote a Perl script to convert that to XML, some XSLT to split the result, and compared it to the original book.

As an aside, it pains me that the terms of Project Gutenberg are such that I'm not allowed to give them credit for the work they did, since I have fixed an average of a little over one typo per page, including some misspelt entry headings. I kept a log of changes and will send them back in case they are of use.

XSLT 2.0 (currently a candidate recommendation) has some useful new features that include regular expression substitution, and which make it easier to do conversion with fewer Perl scripts and more XSLT. I've been using Mike Kay's open source Saxon, and also his commercial SaxonSA which is Schema-aware. The extra type checking this provides can be very useful.

I linked the two dictionaries together, so words in one point to the other. I didn't do the reverse linking yet, because I want to resurrect the code I used to add internal links by looking at phrases in the definitions and comparing them to possible target headwords, and then checking for words in common in the two possibly-linked entries.

For some reason the other people who have copied the Grose dictionary of slang have mostly kept it in one file, or at most split it into one file per letter, but this makes it hard for people to bookmark entries, and also really confuses Web search engines that try and work out what each HTML document is about based on keywords inside it!

I used my lq-text text retrieval system on some of these texts, including an encyclopædia, to do things like look for words that only occur once (finding possible typos), as well as to help find links.

On this subject, I'm still working on making a new release of lq-text. If you would like to help, let me know. I think importing the RCS files into some versioning system or other (CVS, subversion, arch) and maybe some sort of autoconfigure support are the highest priorities right now, although having HTML documentation rather than SGML and PDF might also be good.

OK, I know I should post more entries instead of a few huge ones. This is what moving house can do to you!

Chromatic, I'm with Tim Bray: stopwords are a bug, not a feature. I admit, as I say that, that my own text retrieval package, lq-text, supports stop words: sometimes the bug is in limited disk and memory.

I found, though, that even if you eliminate stop words, remembering where a stop word was eliminated, but not which one, can be a useful compromise. Hence, lq-text can distinguish "printed in The Times" from "printed times".
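To make the compromise concrete, here is a rough sketch in shell. It is not how lq-text stores its index; the function name mark_stop_words and the "*" placeholder are invented for illustration. Each stop word is replaced by the same anonymous marker, so its position survives even though its identity does not.

```shell
# mark_stop_words STOPLIST: read text on standard input and print the words
# on one line, lower-cased, with every stop word replaced by "*".
# STOPLIST is a space-separated list of lower-case stop words.
# Hypothetical helper for illustration only; not lq-text's index format.
mark_stop_words() {
    tr -cs '[:alpha:]' '\n' |        # one word per line
    tr '[:upper:]' '[:lower:]' |     # fold case
    awk -v stops="$1" '
        BEGIN { n = split(stops, s, " ")
                for (i = 1; i <= n; i++) stop[s[i]] = 1 }
        NF == 0 { next }             # skip empty lines left by tr
        { printf "%s%s", sep, (($0 in stop) ? "*" : $0); sep = " " }
        END { print "" }'
}
```

With a stop list of "in the", the phrase "printed in The Times" comes out with two markers between "printed" and "times", while "printed times" has none, so the two phrases remain distinguishable.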

Stemming tends to conflate senses: you might have a document in which recording is common, and another in which records is common, and you can no longer distinguish them. This may or may not matter to you, of course.

I hope you are familiar with the work by the late Gerald Salton's group at Cornell in document similarity.

One way to improve perceived performance can be to pre-compute things. I found that vector cosine differences were much more useful if you used phrases rather than words, but you can eliminate a lot of potential document pairs and make the work much faster that way too.

What I did was to treat each new document as a query against the indexed corpus before adding it. But this was more than ten years ago, when I was hoping to get involved in TREC.

Liam

31 Aug 2005 (updated 10 Nov 2007 at 03:12 UTC)

[update: 2 years later and we had a Summer without rain...]

All the way up here in Canada we're getting rain from Hurricane Katrina, now a tropical storm. We're getting maybe 50mm (2 inches) of rain in a few hours. One of our windows blew in (the whole frame, not the glass) during the night. Luckily, the cats didn't leap through the open window and go out. Or if they did, they leapt back inside. And I don't think any other animals came inside either (the perils of living in the country!)

I've done a little more work on lq-text, the text retrieval system that I first released (for Unix) in 1989. I'd like to teach it XPath, but for now I barely have enough time to work on making sure the documentation is up to date, and that the software actually builds. I see a few people downloading it each month but I rarely hear back from them. As far as I know, lq-text is still one of the better text indexing packages for plain text, but it doesn't do word processing files, PDF, etc. It does index HTML/XML/SGML but only by ignoring the element structure.

I've played a little with OCR programs recently. The GNU gocr turned out to be no help at all for old books (e.g. I tried one printed in 1845, and also saw samples others had tried). Here's some gocr output that's better than average:

Iu a mvoode;: box, in the cl;oir, Do?v lie.s a ?yen:8?Ҁ¢bably- _i;e emgg-., of wood, of ,a Cr;;,s_,adeT; mml3o he ww it is_ í;npossible to tell 8vit); any certaii:ty, but mh-e v.ei;ture to tl;í;3k it rejirt_,R._ents ui;e uf tl(e_ t?h-o

Here is the same passage as read by Abbyy.com's reader:

In a wooden box, in the choir, now lies a remarkably fine effigy. of wood, of a Crusader: who he was it is impossible to tell with any certainty, but we venture to think it represents one of the two distinguished persons

So you can guess which program I'm using. Frankly, if gocr had a user interface as clean as that of Abbyy's program, the quality might be more nearly tolerable: you can click anywhere on the image to go to the corresponding place in the text draft, and vice versa, and the spell checker aligns both text and image as you go, highlighting regions in both very clearly.

I made a transcription (is that the right word here?) using OCR of several pages from Sir Charles Knight's Old England, averaging less than five minutes per page, although careful proof-reading takes longer. I made a simple XML format that preserves all of the typographic distinctions in the original that I can discern and that appear to have been deliberate (e.g. I am not recording where a piece of metal type broke and lost a serif).

This preservation of distinctions is something Project Gutenberg doesn't seem to take care to do. For example, the `Encyclopedia Gutenberg' (actually the OCR'd text from the 1911 Encyclopaedia Britannica) has lost all the small caps, which were used to denote implicit cross references. As an experiment I have ordered a DVD with scanned images, and I'll see (if the images are good enough) how long it takes me to get something as good. Probably not long if I use their text as a baseline, although some rudimentary analysis of the published Project Gutenberg text found a lot of obvious errors that I doubt are in the original. This is not to say I would not also have many errors, of course, but I don't have a team of people doing proofreading.

When I worked at SoftQuad we did conversion of texts into SGML, often charging US$50,000 or more for a project, but still undercutting some of the competition. The trick was extensive analysis and a lot of scripting. For example, the abbreviation q.v. usually marks a cross-reference, so check for the longest phrase before that marker to find a plausible target for a link. Of course, if there are typographical distinctions it's easier. So now I'm using some of that experience. The transcription I mentioned earlier has thumbnails of pictures. These are pictures I had already scanned over the past five or six years, but because I used consistent filenames I was able to connect them to the text, which has references like (Fig. 12), automatically. This in turn gives me a list of figures not referenced, which helps me look for errors in the script or in the OCR'd text.
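The figure check can be sketched with standard tools. This is only an illustration under some assumptions, not the script actually used: it assumes references always look like "(Fig. 12)", that image filenames begin with a zero-padded figure number and a hyphen, and the function name unreferenced_figures is made up.

```shell
# unreferenced_figures TEXTFILE IMAGEDIR: print the figure numbers that
# appear at the start of filenames in IMAGEDIR but are never cited as
# "(Fig. N)" in TEXTFILE. Hypothetical helper for illustration.
unreferenced_figures() {
    text=$1
    dir=$2
    refs=$(mktemp) && have=$(mktemp) || return 1
    # Figure numbers cited in the text: "(Fig. 12)" gives 12.
    grep -oE '\(Fig\. [0-9]+\)' "$text" | tr -cd '0-9\n' | sort -u > "$refs"
    # Figure numbers from filenames such as 071-name-581x857.jpg,
    # with leading zeros removed so that 071 compares equal to 71.
    ls "$dir" | sed -n 's/^0*\([0-9][0-9]*\)-.*/\1/p' | sort -u > "$have"
    comm -13 "$refs" "$have"     # numbers present only in the image list
    rm -f "$refs" "$have"
}
```

Swapping the comm option to -23 would give the opposite list: figures cited in the text for which there is no scan yet.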

Combining threads, I made an lq-text index to the Gutenbergopedia, and then I could get a keyword-in-context index of "q.v.":

$ lqphrase "q.v." | lqkwic
==== Document 1: vol1/aargau.xml ====
  1:ower course of the river Aar (q.v.), whence its name. Its total area is 541
  2:hot sulphur springs of Baden (q.v.) and Schinznach, while at Rheinfelden th
  3:pital of the canton is Aarau (q.v.), while other important towns are Baden
  4:er important towns are Baden (q.v.), Zofingen (4591 inhabitants), Reinach (
==== Document 2: vol1/aaron.xml ====
  5: distinct from the Decalogue (q.v.) (Ex. xxxiii. seq.). Kadesh, and not Sin
  6:o the Mosaite founder of Dan (q.v.). This throws no light upon the name, wh

Another good error-checking technique is to look for words that only occur once, or whose frequency is very different than one might expect. You need more than just one volume to do frequency analysis really, but I can already see words like a11erican (should be American), ciimate, AAstotle (Aristotle) and so on. In a way you can think of this as debugging: doing experiments that might reveal errors, and then correcting them.
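The once-only check does not need a text retrieval system to try out. Here is a minimal sketch using standard Unix tools; it is not what lq-text does internally, the function name once_only_words is invented, and it simply treats any run of letters and digits as a word.

```shell
# once_only_words: read text on standard input and print, one per line,
# every word that occurs exactly once (ignoring case).
# Hypothetical helper for illustration; lq-text works differently.
once_only_words() {
    tr -cs '[:alnum:]' '\n' |            # split into one word per line
    tr '[:upper:]' '[:lower:]' |         # fold case so "The" matches "the"
    sort | uniq -c |                     # count occurrences of each word
    awk '$1 == 1 && NF == 2 { print $2 }'   # keep words seen exactly once
}
```

Digits are kept as part of a word on purpose, so that an OCR error such as a11erican shows up as a single suspicious token. Run over marked-up text it will also report element names, so it is worth stripping the markup first.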

There are some other interesting things about OCR'd text to do with grammars and metadata, with links and expressing relationships, but I should put those in my XML blog when I get a chance.

On a tangentially related topic: I remember working at an aircraft company and seeing a junior consultant spend a day doing some editing that I could have done in under five minutes. He didn't know about regular expressions. I may have mentioned this here before, but another thing people often don't think of is to use regular expressions to generate shell scripts.

When I scan images I name the files with the figure number (or page number, if figures are not numbered) at the start, so they sort together, e.g.


-rwxr-xr-x  2 liam liam 200947 Aug  4  2003 071-Penshurst-Place-Kent-the-great-hall-1032x1522.jpg
-rwxr-xr-x  2 liam liam  54461 Aug  4  2003 071-Penshurst-Place-Kent-the-great-hall-581x857.jpg
-rwxr-xr-x  2 liam liam  68865 Aug  4  2003 071-Penshurst-Place-Kent-the-great-hall-774x1142.jpg

(You can see these at fromoldbooks.org.) I use a shell script to extract the image size and rename the files with the widthxheight. It also extracts the JPEG compression quality and adds that if it's not 75%.

Now, suppose I got the figure number wrong, and I have a bunch of files to rename from 071- to 017- (or whatever).

I can use sed (no, don't panic) like this:

ls 071* | sed 's/^071-/017-/'

This gives me the new filenames:

017-Penshurst-Place-Kent-the-great-hall-1032x1522.jpg
017-Penshurst-Place-Kent-the-great-hall-581x857.jpg
017-Penshurst-Place-Kent-the-great-hall-774x1142.jpg

But really I need to generate a set of Unix commands to rename the files:

ls 071* | sed 's/^071-\(.*\)/mv -i & 017-\1/'

If the expression intimidates you, take off your shoes and read it again :-) The \1 in the replacement part means whatever was matched by the \(...\). The & means the whole thing that was matched. So we get this:

mv -i 071-Penshurst-Place-Kent-the-great-hall-1032x1522.jpg 017-Penshurst-Place-Kent-the-great-hall-1032x1522.jpg
mv -i 071-Penshurst-Place-Kent-the-great-hall-581x857.jpg 017-Penshurst-Place-Kent-the-great-hall-581x857.jpg
mv -i 071-Penshurst-Place-Kent-the-great-hall-774x1142.jpg 017-Penshurst-Place-Kent-the-great-hall-774x1142.jpg

I have put the -i option to mv so that, if I make a mistake, mv will prompt me before overwriting files.

Now I'm ready to run it, and I can do that by piping my command to the shell:

ls 071* | sed 's/^071-\(.*\)/mv -i & 017-\1/' | sh

If all this sounds pointless compared to issuing three mv commands and using filename completion with tabs, I'll mention that I usually end up doing it in three or four directories, since I want to rename the original scans as well as the JPEG files I put on the Web, and also that I used a real but short example deliberately.
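One caveat: a script generated by sed is parsed again by the shell, so it only works when the filenames contain no spaces, quotes or other shell metacharacters, as is the case with the hyphenated names above. For names that might, the same rename can be written as a plain loop. This is a sketch of an alternative, not a replacement for the technique; the function name renumber_prefix is made up.

```shell
# renumber_prefix OLD NEW: in the current directory, rename every file
# whose name starts with "OLD-" so that it starts with "NEW-" instead.
# Hypothetical helper for illustration.
renumber_prefix() {
    old=$1
    new=$2
    for f in "$old"-*; do
        [ -e "$f" ] || continue      # the pattern matched nothing
        rest=${f#"$old"-}            # filename with the old prefix removed
        mv -i -- "$f" "$new-$rest"   # -i prompts before overwriting
    done
}
```

Calling renumber_prefix 071 017 in each directory does the same job as the sed pipeline, at the cost of not being able to inspect the generated commands before running them.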

The technique of constructing programs on the fly is a very powerful one, and is also used with XSLT, but with shell scripts you get the added benefit that reuse is just an up-arrow away in your history! (or a control-P away if, like me, you don't use the arrow keys much because it's faster to use the control-key equivalents).

OK, enough rambling for now.

aristeu, roughly speaking Toronto is level with Nice in the south of France. So no, it doesn't have three months of darkness in the Winter, and neither does Ottawa :-) It does get too cold for bare feet, though, sometimes as low as -40C.

Rich Salz kindly had a go at adding configure/automake support for my somewhat antiquated lq-text text retrieval package. Unfortunately he broke something in the process, so I agreed to make a new release with some code cleanups and then we'd try to get automake working together. I'm hoping to get something out later this month, although with moving house still in full swing it's looking unlikely. I spent some time reading the GNU Autoconf, Automake, and Libtool book [referrer link]; the GNU autoconf and configure always seemed to me an example of wrong engineering: the complexity of M4 also makes the scripts fragile, although they are a lot better these days than they used to be. I just want to drag a bunch of tests into a folder and ship the result :-) which is what I used to do with my previous configure script, now long-lost. Ah well, the price of conformity!

There are now almost seven hundred index images on my Words and Pictures From Old Books Web site. I'm trying to add at least one per day. Before long I'm going to have to revamp the search interface again; it still uses XML Query but maybe I should start using OWL to make an ontology. I'm not sure, since the search is currently faceted. I'll also be adding a full text search interface soon, using lq-text, but I don't want to get too distracted from releasing the software.

Been busy working on words and pictures from old books - it's mostly pictures right now, partly because I haven't decided exactly how it should work. I need to experiment. Up to almost 700 pictures though, all with at least a little metadata.

Also started an XML Blog as an experiment. I don't think what is there now is very controversial, but I don't promise it will stay that way.

Today Clyde and I went to a Pow-wow near Napanee, ON, which was a lot of fun despite the gently drizzling rain, and then afterwards ate at a new and not altogether inexpensive restaurant in Bloomfield, where the food was very good.

There are more restaurants in the expensive range (say, Cad$40/head and up without wine) than I'd expect in this rural community. I think it's because we're within weekend range of Toronto, as evidenced also by the number of bed and breakfast places.

We're still moving in to the new house. Next is to get permits to do some floor and window work. I was slightly worried that our new scanner wouldn't survive the trip: it's an Epson E10000, which does 3,200dpi at A3+ (12x18" roughly), although if you try that you get awfully large files. The scanner is working fine and I've added several new images to the fromoldbooks site.

Next big purchases will be a new computer for me, a paper-making machine for Clyde, and a giclee printer for us to share. Probably an Epson stylus pro printer with third-party inks, but I don't know for sure.

Now, back to Skobo Deluxe Level 50!

I'm at Extreme Markup in Montreal this week; I think it the best of the XML conferences for philosophy of markup. It's the only conference where people talk about the meaning of meaning itself.

Amusingly, I found a #XML channel on irc.sorcery.net, but when I went there I got banned on sight. Maybe it's not about markup.

Found two more XSL-FO implementations to add to the XSL Web Page today, with thanks to Ken Holman for pointing them out.

Also been watching the slow growth of my Words and Pictures From Old Books Web site. This reminds me...

One of the saddest things in the markup world is Project Gutenberg. Thousands of people involved in typing books into dumb plain text, and not even preserving metadata about which edition they used. It's not scholarship. And many of these books might never be digitised again. Once you've rekeyed/OCRed the public domain text of the 1911 Encyclopaedia Britannica, and in the process you've lost all the small caps, which were used to mark cross-references, how do you restore the cross references?

In the same way, I've seen people (including the New York Public Library) scanning old out-of-copyright engravings. This is wonderful, but they are using only 300dpi, which they call "high resolution". The beautiful engraved lines become a grey mush. Surely librarians would know this? It's simply not archival quality, not useful for research at all. But that's OK, they sell prints, so maybe that's the purpose of libraries these days.

Back to Extreme Markup, the meaning of meaning, and conversations in the hallway about situational semantics.

redi, thanks!

I'm in Redmond/Bellevue America for a week for W3C XML Query/XSLT 2 meetings hosted by Microsoft.

Meanwhile still working on my pictures from old books Web site; the search is getting better. The new server is running Fedora Core 2; we ordered FC3 but they made an error. We may upgrade. It's several years since I've run a Red Hattish server; generally I prefer Mandrake/Mandriva these days, and it's taking a while to get used to it. Yum helps, although I prefer urpmi. One thing I miss is bash completion; e.g. if on my laptop I type urpmi perl- and hit tab, I'll get a completion list of all packages whose names begin with perl that I could install; if I type perldoc XML I get a list of all XML modules for which there might be perl POD documentation, and so forth. The bash-completion package that provides this doesn't seem to be around for FC2 though.

Travel is tiring!

Liam
