
Monday, December 30, 2002

The tag soup of a new generation

Dare Obasanjo: Semantic redux.

One of the biggest gripes about HTML as the primary authoring format for Web documents has been that it focuses too much on presentation and not enough on semantics and content. … Over time a number of seemingly arbitrary tags not strictly related to presentation eventually made it into the HTML tag set such as code, cite, and acronym.

Actually, code and cite have been in HTML since the very beginning. See this draft of HTML (1.0?) dated July 1, 1993. The logical styles em, strong, code, samp, kbd, var, dfn, and cite are all listed, and Tim Berners-Lee encourages using them over the physical styles tt, b, and i:

The logical styles should be used wherever possible, unless for example it is necessary to refer to the formatting in the text. (Eg, “The italic parts are mandatory”.)

acronym was added in HTML 4, along with its kissing cousin, abbr. (Don’t even get me started on that rat’s nest of semantic hair-splitting. Use acronym. Don’t use abbr. Blame Microsoft.)

Dare again:

Given that the W3C thinks XML is the basis for RDF and the Semantic Web …

XML is not the basis of RDF. There is an XML serialization of RDF called RDF/XML. My girlfriend has an XML serialization of her shoe closet (I am not making this up); that doesn’t mean her shoe closet is based on XML. For a 5-minute tutorial on RDF that doesn’t get mired in the syntactic hell of RDF/XML, see Aaron Swartz’s RDF Primer.
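To make the distinction between the model and its serialization concrete, here is a rough Python sketch. It treats RDF abstractly as a list of (subject, predicate, object) statements and writes the same statement out two ways. The URIs, the literal, and both function names are invented for illustration; the first output only imitates the one-statement-per-line style of N-Triples (with no escaping of the literal), and the second is an ad hoc XML form, not real RDF/XML.

```python
import xml.etree.ElementTree as ET

# An RDF graph is, abstractly, a set of (subject, predicate, object) statements.
# The URIs and the literal below are made up for this example.
triples = [
    ("http://example.org/closet", "http://example.org/terms/contains", "red pumps"),
]

def to_ntriples_style(statements):
    """Write each statement as one line: <subject> <predicate> "literal" .
    Assumes simple literals that need no escaping."""
    return "\n".join('<%s> <%s> "%s" .' % t for t in statements)

def to_toy_xml(statements):
    """Write the same statements in an ad hoc XML form (NOT real RDF/XML)."""
    root = ET.Element("statements")
    for subject, predicate, obj in statements:
        node = ET.SubElement(root, "statement", subject=subject, predicate=predicate)
        node.text = obj
    return ET.tostring(root, encoding="unicode")

print(to_ntriples_style(triples))
print(to_toy_xml(triples))
```

Both outputs carry the same statement. The list of tuples is the data; the angle brackets are just one way of writing it down.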

Dare continues:

… it seems the general direction going forward is to move towards replacing a WWW full of HTML documents to one full of XML documents. If you are for the Semantic Web, you are for an XML Web not for an HTML one.

This is an inductive fallacy. I’m not sure what it’s called, but here’s the general form:

  1. A believes X.
  2. A believes all X are Y.
  3. B believes X.
  4. Therefore, B must believe all X are Y.

Specifically:

  1. The W3C believes in the Semantic Web.
  2. The W3C believes that the Semantic Web is based on XML.
  3. I believe in the Semantic Web.
  4. Therefore, I must believe that the Semantic Web is based on XML.

Regardless, I do not believe in the Semantic Web, XML-based or not. (Dare doesn’t either, so this part of the argument is not directed at him.) At least not as a mainstream technology. Cory Doctorow put the nail in that coffin a while ago, specifically the part where people lie. The entire basis of the Semantic Web is that we’ll have a universe of data that’s only machine-readable, but that it will somehow be accurate and useful. That’s fine for specific niche markets (specifically, ones with no competition or other incentives to lie). But we already have some examples of how this will play out on the public Internet: meta tags in HTML documents, which spammers and other unsavory characters stuffed with keywords in a (wildly successful) attempt to influence the first generation of search engines. Now search engines just ignore them, but developers still try all sorts of tricks to include text in the body of their pages that visitors (i.e. people) can’t see but search engines (i.e. machines) can. They use display: none CSS rules to hide it, or use absolute positioning to position it off the screen, or a hundred other tricks to try to gain a competitive advantage because search engines (i.e. machines) can’t tell the difference. And now you’re telling me that we’re going to have an entire universe of purely-machine-readable data on the public Internet, and that anyone in their right mind would trust it? Please.
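A small Python sketch of why the hidden-text trick works against a simple-minded machine reader. The page content and the `NaiveTextExtractor` class are hypothetical, and real search engines are certainly more sophisticated than this; the point is only that a program reading markup has no inherent notion of what a person would actually see.

```python
from html.parser import HTMLParser

class NaiveTextExtractor(HTMLParser):
    """Collect every text node, with no notion of CSS or visibility."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

# Hypothetical page: one visible paragraph, one block hidden with CSS.
page = (
    '<p>Welcome to my shoe store.</p>'
    '<div style="display: none">cheap shoes best shoes free shoes</div>'
)

extractor = NaiveTextExtractor()
extractor.feed(page)
extractor.close()

# The hidden keywords are collected right alongside the visible text.
print(extractor.chunks)
```

A visitor sees one sentence; the extractor sees both chunks and has no reason to trust one less than the other.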

Furthermore, the specific syntax suggested for this alleged Semantic Web is laughably complex. To see how badly people will bungle it, look no further than RSS. RSS 0.91 is the simplest and most popular of all the RSS formats and one of the simplest XML-based formats you’ll ever find, yet 10% of the world’s RSS feeds are still invalid, mostly due to XML formatting rules (escaping ampersands, character encoding issues) that aren’t even RSS-specific. And you want to move towards replacing a WWW full of HTML documents to one full of XML documents? Are you sure? Because realistically, all you’ll manage to do is replace a morass of bloated, poorly written, invalid HTML documents with a morass of bloated, poorly written, invalid XML documents. And to tease any meaning at all out of these semantic documents, you’ll spend your days writing ultra-liberal parsers to parse invalid XML (or, God help you, invalid RDF/XML), and you’ll spend your nights and weekends decrying the new generation of tag soup on XML-DEV.
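For anyone who hasn't been bitten by the ampersand rule, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree` and `xml.sax.saxutils.escape`. The item title and the `is_well_formed` helper are made up for illustration; the fragment is not taken from any real feed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical item title containing a bare ampersand.
title = "Salt & Pepper"

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Pasting the raw text into the markup yields a document that is not well-formed.
broken = "<item><title>" + title + "</title></item>"

# Escaping first turns "&" into "&amp;", which a conforming parser accepts.
fixed = "<item><title>" + escape(title) + "</title></item>"

print(is_well_formed(broken))  # False
print(is_well_formed(fixed))   # True
print(ET.fromstring(fixed).findtext("title"))  # Salt & Pepper
```

Nothing here is specific to RSS. The same one-character mistake breaks any XML document, which is exactly why it shows up in so many feeds.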

Look, semantics is hard. Forget the social problems of implementation and the technical problems of the current generation of syntax. Semantics all by itself is just fundamentally hard. I was a philosophy major in college, with a specialty in philosophy of language, so I am not completely unversed in the theoretical underpinnings of the problems the Semantic Web is attempting to solve. In many respects, I believe we’re now following in the footsteps of 18th and 19th-century philosophers. I see otherwise intelligent people falling into the same traps that intelligent people fell into 100 years ago. We’ll just define our vocabulary really well, you know, to squeeze out all possibility of ambiguity… and so forth. It’s an enticing idea, and once you wrap it all up in a shiny package of angle brackets and specifications it’s easy to get starry-eyed about it, but it’s nothing more than a garden path.

Semantics is hard. It’s a tough nut that we’re not even close to cracking. Its complexity increases exponentially as you expand the problem domain, or expand the vocabulary, or add actors. That’s why I’m concentrating on the simple but relatively well-defined semantics of HTML, with my own content, for my own use. And even that is hard to get right.
