dive into mark

January 16, 2004

The history of draconian error handling in XML

I suspect that most of the people discussing liberal XML parsing today are unaware that Tim Bray was the singular force behind the fail-on-first-error behavior of XML. Virtually everyone in the XML working group disagreed with him, and many people pleaded for a sane method of error recovery, or at least the application-specific option to provide error recovery that was suitable for the application. (XML is uniquely suited for such error-tolerant applications. Because it is text-based and has so much redundant information, like verbose end tags, it provides easier re-entry points to recover after a parsing error, unlike most binary formats.)

In the end, Tim basically said “there are two camps here, they both have good points, we aren’t going to convince each other on this one,” and then proceeded to compromise by doing it his way. Seven years later, we are still paying the price for his dogmatic draconianism.

Update: Tim agrees with the following timeline but disagrees with my conclusion. I would tend to believe him, since he was, you know, there. But we agree on my fundamental point: XML’s error handling has always been controversial, and lots of smart people disagreed with it from the beginning for lots of good reasons.

April 18, 1997. Tim Bray: Error Handling in XML

Well-formedness should be easy for a document to attain. In XML, documents will carry a heavy load of semantics and formatting, attached to elements and attributes, probably with significant amounts of indirection. Can any application hope to accomplish meaningful work in this mode if the document does not even manage to be well-formed!?!?

I suggest that we add language to section 5, “conformance”, which says:

“An XML processor which encounters a violation of the constraints of well-formedness must not thereafter pass any information about text or markup to the application. It must pass to the application a notification of the first such violation encountered. It MAY thereafter, at user option, pass to the application information about well-formedness violations encountered after the first.”

[or in English: you gotta tell the app about the first syntax botch you hit; you're allowed to send the app more error messages, but you're not allowed to send anything but error messages after you've detected an error]
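To make the proposed rule concrete, here is a minimal Python sketch of what it looks like from the application's side. It is my own illustration, not anything from the 1997 thread: it uses the standard library's `xml.sax` module (an expat-based parser), and the names `Collector` and `parse_draconian` are made up for this example. The application sees the elements reported before the first well-formedness error, then an error notification, then nothing else.

```python
import xml.sax


class Collector(xml.sax.ContentHandler):
    """Hypothetical handler: records every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)


def parse_draconian(text):
    """Return (elements reported before the error, error message or None)."""
    handler = Collector()
    try:
        xml.sax.parseString(text.encode("utf-8"), handler)
    except xml.sax.SAXParseException as exc:
        # The default error handler raises on the first fatal error,
        # so no further markup reaches the handler after this point.
        return handler.seen, exc.getMessage()
    return handler.seen, None


# </c> does not match the open <b>; <d/> appears only after that error.
broken = "<a><b>text</c><d/></a>"
seen, error = parse_draconian(broken)
print(seen)   # elements delivered before the error
print(error)  # the parser's description of the first violation
```

Note that the `d` element never reaches the application, even though a tolerant parser could plausibly have recovered it. That lost content is exactly what the rest of this thread argues about.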

April 19, 1997. Sean McGrath: Re: Error Handling in XML

Programming languages that barf on a syntax error do so because a partial executable image is a useless thing. A partial document is *not* a useless thing. One of the cool things about XML as a document format is that some of the content can be recovered even in the face of error. Compare this to our binary document friends where a blown byte can render the entire content inaccessible.

As I said in a previous post, I can think of a number of useful apps that can work sensibly with broken XML. I think the problem with M [Microsoft] and N [Netscape] is that there is no way to say “warnings = high” and get told about WF [well-formedness] problems.

April 19, 1997. Paul Prescod: Re: Error Handling In XML

I would like to weigh in on the side of moderation: require the user agent to alert the user that the parse was invalid, but don’t require it to throw away the rest of the data. Vendors will just ignore that rule anyhow.

Error recovery in HTML is a product differentiator. No matter how much they bitch moan and complain, nobody would ever unilaterally move to a “validate or reject” model. And if they had started out with that model, some product would just have removed the “rejection” part in the race to be the most “flexible” and “user friendly” and the rest would have inevitably followed.

April 20, 1997. Tim Bray: Error handling: yes, I did mean it

The vendors and *serious* information providers are at one in wanting to create a non-HTML-like culture of publish-it-right on the Net; one way to do this is to shout, loudly, that there are a few (simple, thank goodness) rules, and they must be obeyed.

April 21, 1997. James Clark: Re: Error handling: yes, I did mean it

If the parser tells you about the error, then you, as an application builder, can choose to ignore any data sent by the parser after the error. The parser may even provide you with a way to do that automatically (nsgmls -E1 will stop after the first error). I think users and application builders should have a choice with what they do with invalid data. I cannot see how a user or application builder can be disadvantaged by being provided with this choice, and I therefore plan to continue to provide it even if the spec says that this is non-conforming.

April 22, 1997. Paul Prescod: Re: Error handling: yes, I did mean it

Being strict on export is laudable. Being strict on import is a hassle. I don’t want the spec. to REQUIRE that you cause me a hassle. Nor do I want it to require Netscape to cause me a hassle when some bozo leaves out some easily implied quotes. I want it to notify me that he is a bozo, but let me at the data anyhow. I think that in that scenario everybody wins.

April 22, 1997. Paul Prescod: Re: Error handling: yes, I did mean it

People must have the option to decide for themselves. They have different applications and different needs. Hopefully the business-critical application people know how to capture stderr and know how to pipe to /dev/null if that’s what they decide is best. Let’s please leave the whole class of business-mission-life-critical applications out of this discussion because those people can take care of themselves. If they can’t we have much bigger problems than well-formedness.

April 24, 1997. Rick Jelliffe: Sudden Death (Re: Error handling: yes, I did mean it)

Tim’s policy is not a strengthening of XML’s well-formedness, but a discarding of its ability to resynchronise after an error. The ability to resynchronise, by not having context dependent delimiters or CDATA and RCDATA declared content types or STAGO in text, was always, to me, not so much to allow a simpler production rule, but also to allow robustness, a major fault in SGML. I *really* hope this is not being abandoned.

April 26, 1997. Michael Sperberg-McQueen: Re: Error handling: yes I did mean it

The arguments of the Draconian camp are all centered around the unquestioned observations that

there are applications where ill-formed data is useless or worse than useless, and where ill-formedness must be detected

by their unwillingness to issue error messages, and their determination to provide attractive displays even of badly ill-formed documents, HTML browser makers have made their own lives very difficult

Neither of these observations supports a blanket ban on error recovery by XML processors.

April 29, 1997. Bill Smith: Re: Error handling: yes, I did mean it

The draconian XML model says religion is more important than ease-of-use. That’s backwards.

April 28, 1997. Dave Hollander: Re: Sudden death: request for missing input

The argument seems to be, ‘don’t worry. Since most if not all XML documents will be machine generated they will all be well formed.’ I don’t buy it! Programmers are human too and make as many errors as prose authors.

April 29, 1997. Terry Allen: Re: Error handling: yes, I did mean it

We cannot play Canute. XML is envisioned as the data format for an unimaginable range of applications, and some of those will benefit from error recovery. Humans do error recovery almost continuously (I know it’s one of my specialties), why should not their software? and if it’s useful, what chance have we of forbidding it successfully?

April 29, 1997. Tim Bray: Re: Sudden death: request for missing input

We went to a lot of work to make well-formedness easy. It is a very low bar to get over… much easier than producing valid HTML. I cannot for the life of me see why so many people here are willing to tolerate gross error, and run the risk of another race-to-the-bottom a la HTML, when the standard required to achieve reliable interoperability is so easy to explain and to achieve.

April 30, 1997. Murray Altheim: Re: Sudden death: request for missing input

I don’t think anyone is advocating tolerance for gross error, as we’ve all seen what that has done with HTML. I think some of us are simply trying to leave *exactly* what happens up to the vendors. Some sort of error notification is essential, but in certain applications the method of error “recovery” may require sending the XML source on through; in others, sudden death makes sense.

… Error notification is a “must”, but *how* it is done is application-specific. Error recovery is a “maybe”, depending on the application.

May 1, 1997. David Durand: Re: Sudden death: request for missing input

The race to the bottom can be prevented several ways. One way is simply that XML is simply more useful if it’s correct — and people can always fall back to HTML if they don’t care.

Why is specifying mandatory error notification harder to enforce than specifying mandatory refusal to process erroneous documents?

May 6, 1997. Terry Allen: Re: Jon on Error

Anyone who has a single error in his document is a bozo? Ahem. I don’t buy any of this.

May 6, 1997. Paul Prescod: Re: Jon on Error

If [Microsoft and Netscape] want to “solve the HTML problem” they can. They can launch a “Web Correctness Initiative” within W3C. They will get lots of good press in the trade rags. They can agree to add validators to both of their HTML browser products. They can agree that their editor products will not make bad HTML. This is all entirely within their power and does not require any new specifications.

May 6, 1997. Tim Bray: Final words, I think, on error handling

I think that the draconians and the tolerants really do understand each others’ positions, and at the same time can’t fathom why each other can possibly think the way they do.

… Bottom line: we aren’t going to convince each other on this one.

May 7, 1997. Paul Prescod: Re: Final words, I think, on error handling

Browsers do not just need a well-formed XML document. They need a well-formed XML document with a stylesheet in a known location that is syntactically correct and *semantically correct* (actually applies reasonable styles to the elements so that the document can be read). They need valid hyperlinks to valid targets and pretty soon they may need some kind of valid SGML catalog. There is still so much room for a document author to screw up that well-formedness is a very minor step down the path. The idea that well-formedness-or-die will create a “culture of quality” on the Web is totally bogus. People will become extremely anal about their well-formedness and transfer their laziness to some other part of the system.

May 7, 1997. Arjun Ray: Re: Final words, I think, on error handling

The basic point against the Draconian case is that a single (monolithic?) policy towards error handling is a recipe for failure. … The Good News for XML is that DTD conformance is not an (immediate) issue; the Bad News is that there are nevertheless enough merely lexical/syntactic gotchas to be fertile sources of errors — and not every XML document put on the wire will be the output of a smart editor.
