A sure sign that you have been ignoring your weblog is when Mozilla bugs mentioned on your home page get fixed between posts. Mozilla 1.7 will support the Content-Location: HTTP header. I would also like to thank Ian Hickson for reminding the world that the things I write in this weblog are not spec text. Apparently there has been some confusion about that. Specs are things with important-sounding words like W3C Recommendation, RFC 2616, or ISO 8879 at the top. Weblogs are things with cat pictures at the top. If specs are unclear, they should be fixed by things with errata at the top. If weblogs are unclear… hey look, cat pictures.

I was hoping to write a followup to my piece on relative URIs in HTML called Relative URIs in XML, which would talk about XML Base, but I haven’t yet. It’s important though; we’re using it in Atom, so it would be nice if everybody knew how it worked. A short little test suite would not be entirely out of line. There’s a comprehensive XML test suite with over 2000 tests, which I believe includes a few tests for XML Base, but it would be nice to have an Atom-specific one too, to show exactly which elements and attributes in a feed can be or contain relative URIs.
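
For anyone who hasn't met XML Base yet, here is a minimal Python sketch of the inheritance rule: each xml:base attribute is resolved against the base inherited from its parent, and relative references inside that element are resolved against the result. The function name resolve_links and the choice to look only at href attributes are my assumptions for illustration; which Atom elements and attributes actually carry relative URIs is precisely what a test suite would need to pin down.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes the reserved xml: prefix under this namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def resolve_links(xml_text, document_uri):
    """Return (tag, absolute href) pairs, honoring nested xml:base attributes.

    Hypothetical helper: it only inspects href attributes.
    """
    results = []

    def walk(element, base):
        # An xml:base value may itself be relative to the inherited base.
        base = urljoin(base, element.get(XML_BASE, ""))
        href = element.get("href")
        if href is not None:
            results.append((element.tag, urljoin(base, href)))
        for child in element:
            walk(child, base)

    walk(ET.fromstring(xml_text), document_uri)
    return results
```

The document's own URI is the starting base, so a feed with no xml:base attributes at all still resolves its relative links against wherever it was fetched from.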

There have been a number of unhelpful suggestions recently on the Atom mailing list. One suggestion was that in the Atom API, we should just use basic HTTP authentication over SSL. This doesn’t work for Bob. Don’t talk to me about Atom authentication until you have read that article. You can talk to me about man-in-the-middle attacks, you can talk to me about spoofing attacks, you can talk to me about DNS poisoning attacks, but you can not talk to me about alternatives that do not work for Bob.

Another suggestion was that we do away with the Atom autodiscovery <link> element and just use an HTTP header, because parsing HTML is perceived as being hard and parsing HTTP headers is perceived as being simple. This does not work for Bob either, because he has no way to set arbitrary HTTP headers. It also ignores the fact that the HTML specification explicitly states that all HTTP headers can be replicated at the document level with the <meta http-equiv="..."> element. So instead of requiring clients to parse HTML, we should just require them to parse HTTP headers… and HTML.
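
To make the point concrete, here is a rough Python sketch of what a client ends up doing either way: walk the HTML once, collecting both autodiscovery <link> elements and any <meta http-equiv="..."> pairs that stand in for HTTP headers. The class name HeadScanner and the default media type application/atom+xml are assumptions on my part, not anything the Atom drafts mandate here.

```python
from html.parser import HTMLParser


class HeadScanner(HTMLParser):
    """Collect feed autodiscovery links and meta http-equiv pairs."""

    def __init__(self, feed_type="application/atom+xml"):  # assumed media type
        super().__init__()
        self.feed_type = feed_type
        self.feed_links = []   # href values of matching <link> elements
        self.http_equiv = {}   # lowercased header name -> content

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "link":
            rels = attrs.get("rel", "").lower().split()
            is_feed = attrs.get("type", "").lower() == self.feed_type
            if "alternate" in rels and is_feed and "href" in attrs:
                self.feed_links.append(attrs["href"])
        elif tag == "meta" and "http-equiv" in attrs:
            name = attrs["http-equiv"].lower()
            self.http_equiv[name] = attrs.get("content", "")
```

Feed it a page with scanner.feed(html) and read scanner.feed_links and scanner.http_equiv afterwards. Note that the HTML parsing did not go away; it just picked up a second job.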

See also: “just” is a dangerous word, alarm bell phrases.

It also occurs to me that people who say parsing HTML is too hard probably aren’t sanitizing embedded HTML properly. Do not parse HTML with regexes. Here’s an example feed that illustrates the problem. My Ultraliberal Feed Parser 2.7 sanitizes this correctly, and in the true spirit of hacking, I have no sympathy for people who can’t be bothered to write code I’ve already written.
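
For the regex crowd, here is roughly what parser-based sanitizing looks like in Python. This is a sketch under my own assumptions, not the code in my feed parser: the allowlists (ALLOWED_TAGS, ALLOWED_ATTRS, SAFE_SCHEMES) are illustrative and deliberately tiny, and relative links are simply dropped rather than resolved.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative allowlists (assumptions); a real sanitizer needs longer ones.
ALLOWED_TAGS = {"a", "b", "blockquote", "code", "em", "i", "p", "pre", "strong"}
ALLOWED_ATTRS = {"href", "title"}
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content is still kept
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_ATTRS:
                continue  # drops onclick, style, and friends
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: URLs (and, in this sketch, relative ones)
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text arrives unescaped, so escape it again on the way out.
            self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = Sanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

The design choice that matters is the allowlist: anything the parser reports that is not explicitly permitted never reaches the output, which is a guarantee a pile of regexes cannot give you.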

Another entire class of unhelpful suggestions that seems to pop up on a regular basis is unproductive mandates about how producers can produce Atom feeds, or how clients can consume them. Things like let’s mandate that feeds can’t use CDATA blocks (runs contrary to the XML specification), or let’s mandate that feeds can’t contain processing instructions (technically possible, but to what purpose?), or let’s mandate that clients can only consume feeds with conforming XML parsers.

This last one is interesting, in that it tries to wish away Postel’s Law (originally stated in RFC 793 as “be conservative in what you do, be liberal in what you accept from others”). Various people have tried to mandate this principle out of existence, some going so far as to claim that Postel’s Law should not apply to XML, because (apparently) the three letters X, M, and L are a magical combination that signal a glorious revolution that somehow overturns the fundamental principles of interoperability.

There are no exceptions to Postel’s Law. Anyone who tries to tell you differently is probably a client-side developer who wants the entire world to change so that their life might be 0.00001% easier. The world doesn’t work that way.

I maintain a feed parser. Real people rely on it. It is used in several end-user products, including Chandler and Straw, and lots of other people use it in their own homegrown aggregators. It is as liberal as possible because that is what clients need to be. It handles the 7 different versions of RSS seamlessly and equally. It handles Atom. It even goes so far as to try to abstract away the differences between RSS and Atom, duplicating RSS elements into Atom fields and Atom elements into RSS fields, RSS 2.0 fields into RSS 1.0 fields, and so forth and so on. If all you care about is title/link/description, you can get that from any feed, even a souped-up Atom feed. If you want to use the more advanced content model of Atom, you can do that too, even from the most minimal RSS feed.

My feed parser resolves relative links. It maps non-standard elements to standard ones. It parses 10 different types of dates and then normalizes them in case somebody claims their latest entry was last modified on June 31st. It handles many common cases of non-well-formed XML, because many feeds contain XML well-formedness errors, even the feeds of people who should really know better, and a feed parsing library that can’t parse the feeds that exist in the real world is simply a waste of everyone’s time. Don’t whine to me that parsing feeds is hard. I know how hard it is.
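
Here is one way a liberal consumer might cope with June 31st, as a small Python sketch. The rollover rule (let the overflowing day spill into the next month) and the name normalize_date are assumptions for illustration; other normalizations, such as clamping to the last day of the month, are equally defensible.

```python
from datetime import date, timedelta


def normalize_date(year, month, day):
    """Accept an out-of-range day by rolling it forward into later months.

    Hypothetical helper: the rollover policy is an assumption.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    if day < 1:
        raise ValueError(f"day out of range: {day}")
    # Start at the first of the month and count forward, so day 31 of a
    # 30-day month lands on the 1st of the following month.
    return date(year, month, 1) + timedelta(days=day - 1)
```

Valid dates pass through unchanged, so the liberal path costs well-behaved feeds nothing.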

I also help maintain a feed validator, and I strongly advocate its use among producers. It tickles me whenever I go to a new site and see a valid RSS or valid Atom banner. I have spent an ungodly amount of time making the validator as easy to use as possible and also as strict as possible. It has almost 1000 test cases backing up its rules; 300 of those were added in the last release alone. It checks for June 31st. I swear to God it does. One day I was writing test cases and Sam was writing code to pass them, and when he saw that test case fail he almost reached through his cable modem and strangled me. He almost removed the test case out of spite. He gave in and coded it anyway, and checked it in, and we deployed, and three days later I got a bug report from someone who couldn’t figure out why his feed wasn’t validating. And I couldn’t figure it out either, until he mentioned that it only seemed to choke on the date for one specific entry, and I looked at it one more time and I swear to God it said 2003-06-31.
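
The strict side of that story fits in a few lines of Python. This is a sketch, not the validator's actual code: is_valid_date is a hypothetical name, and it only handles the bare YYYY-MM-DD form, where real feeds carry full timestamps.

```python
import re
from datetime import date

# Bare calendar date only (an assumption); time and zone parts are ignored.
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def is_valid_date(text):
    """Return True only if text is YYYY-MM-DD and names a real calendar day."""
    match = DATE_RE.fullmatch(text)
    if not match:
        return False
    try:
        # date() raises ValueError for impossible days such as June 31st.
        date(*map(int, match.groups()))
    except ValueError:
        return False
    return True
```

Checking the shape of the string is not enough; the date has to be constructed before you know whether it exists.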

There are no exceptions to Postel’s Law.





© 2001–present Mark Pilgrim