Stian Soiland

Remember SGML

Do you remember SGML? The predecessor of XML? No? Well, you should; it's a very funny standard indeed. Think of all the possibilities!

Background

First, some quick background. SGML, like XML, is a meta-language: it describes a way to define markup languages, as the name Standard Generalized Markup Language indicates. The best-known application of SGML is the language used by basically all web pages today, namely HTML. There is also a "new" version of HTML called XHTML, so named because it follows the rules of XML instead of SGML. XML was proposed as an "upgrade" of SGML, since following the complete SGML specification was cumbersome and difficult, while having a common way to encode and describe data was still useful. XML has become very popular among Java programmers, probably because it's simpler to write XML than Java. Python programmers hate XML, as it's much easier to write Python than XML. But anyway, let's take a look at this SGML-thingie.

History

If you ask any CS student about SGML, they will probably think of some bearded old hacker and say it's really outdated and was never really used, except maybe on some IBM mainframes. In a way that's true: the basis of SGML is, according to Wikipedia, a language called GML (Generalized Markup Language), developed at IBM in the 1960s by Charles Goldfarb, Edward Mosher and Raymond Lorie. (Notice that the three authors' surnames also spell out GML.) Here's an example from Goldfarb's site:

:h1.Chapter 1:  Introduction
:p.GML supported hierarchical containers, such as
:ol
:li.Ordered lists (like this one),
:li.Unordered lists, and
:li.Definition lists
:eol.

SGML was later developed and made an ISO standard, ISO 8879, in 1986. ISO standards cost money if you want a copy, but as an NTNU student, I was able to get a copy of the almost identical British Standard electronically through our university library. This article will describe some of the funny and remarkable aspects of SGML that young people like myself had no idea existed, as we've been brought up with HTML and XML and have never seen any of that fancy stuff.

Now, what about SGML?

First of all, SGML is not that different from XML. Anything that can be done in XML can be done in SGML. The description language (DTD) of SGML is also one of the possible description languages available for XML. What I'll show here are all those weird things that can be done in SGML but were cleaned away to make XML a simpler language to parse and use. For some reason, the SGML people wanted to do everything at once, so the language is full of possibilities.

Omitting end-tags

Anyone who has fought with HTML and XHTML has noticed the difference: in XHTML, a well-formed document must have both opening and closing tags for every element. This even includes empty elements, which have a shorthand syntax: <br /> is equivalent to <br></br>.

<ol>
<li>First element</li>
<li>Second element, contains a <br />
line break
</li>
</ol>
<p>That was our list.</p>

In old-fashioned HTML, based on SGML, you didn't need those closing tags when the definitions (the DTD) make it clear where an element has to end. As an <li> element can't directly contain another <li> element (they're only allowed within <ol> and <ul>), it is not necessary to include the end-tag: a new <li> means the previous one is finished. Likewise, if <p> is not allowed within <ol> or <li>, a <p> means that the whole ordered list is finished, so </ol> isn't needed either. (The real HTML DTD actually allows <p> inside <li>, so read this as a simplified DTD.) The empty element <br> is defined as such: nothing is allowed inside, so it only needs the start-tag, and the closing tag is inserted automatically. Here's the shortened equivalent:

<ol>
<li>First element
<li>Second element, contains a <br>
line break
<p>That was our list.

Notice how everything's shorter and less verbose. This is the argument stressed all over the SGML standard: allowing such shorthand syntax was supposed to make SGML easier. For whom? Well, the secretaries, of course, who were going to type in documents in SGML at their IBM terminals!

Now, everyone who tried to make a half-decent web page in the 1990s and dived into HTML (you know, before we got these nice content management systems like Plone) has struggled with tables and tags that suddenly take over the page because, in that particular situation, the end-tag was for some reason needed anyway. Having to think about when to include the end-tag wasn't really that useful, so people started always using the closing tags. And that's the way it has become with XML and XHTML too. But our SGML fathers say it's a good thing to have a choice.
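
To make the end-tag rule concrete, here is a minimal Python sketch of how a parser could put the omitted end-tags back. The ALLOWED_CHILDREN table is a made-up, heavily simplified stand-in for a DTD (it is not the real HTML DTD), and insert_end_tags is a hypothetical helper, not part of any SGML tool.

```python
import re

# Toy stand-in for a DTD: which child elements each element may contain.
# This table is an assumption for the example, not the real HTML DTD.
ALLOWED_CHILDREN = {
    "body": {"ol", "p"},
    "ol": {"li"},
    "li": {"br"},
    "p": {"br"},
    "br": set(),
}
EMPTY_ELEMENTS = {"br"}  # declared empty: no content, no end-tag

# A start-tag or end-tag without attributes, or a run of text.
TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")


def insert_end_tags(markup, root="body"):
    """Return markup with every omitted end-tag written out.

    No error handling: the input is assumed to fit the toy table above.
    """
    stack = [root]  # currently open elements, innermost last
    out = []
    for match in TOKEN.finditer(markup):
        closing, name, text = match.groups()
        if text is not None:
            out.append(text)
        elif closing:
            # Close any elements still open inside the one being ended.
            while stack[-1] != name:
                out.append(f"</{stack.pop()}>")
            out.append(f"</{stack.pop()}>")
        else:
            # Close open elements until one may contain the new element.
            while name not in ALLOWED_CHILDREN[stack[-1]]:
                out.append(f"</{stack.pop()}>")
            out.append(f"<{name}>")
            if name not in EMPTY_ELEMENTS:
                stack.append(name)
    # End of input closes whatever is still open, except the implied root.
    while len(stack) > 1:
        out.append(f"</{stack.pop()}>")
    return "".join(out)


print(insert_end_tags("<ol><li>First element<li>Second element<p>That was our list."))
```

Fed the shortened list, the sketch closes each <li> when the next one starts, closes the list when the <p> arrives, and leaves <br> alone since it is declared empty.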

Short tags and empty tags

Let's take a look at how far we can take this with SGML. As some of us know, only <li> elements are allowed inside <ol>. So with some optional SGML features (not enabled in normal HTML), we can have some fun. First, a feature called shorttag, here shown with the empty tag. The empty tag closes </> or starts <> the most recent open element.


<ol>
<li>First element</>
<>Second</>
<>Third</>
</ol>

There is also a variation called the unclosed short tag, and it can be used like this (notice the missing > characters):

<p<em>An emphasized paragraph.</em</p>

There's even a null end tag that can be used for really small markup, with the content delimited by slashes right after the element name, like:

<p>Only <em/these three words/ are emphasized</p> 
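
To see how the empty tags in the list example resolve, here is a small Python sketch that writes them out in full. It assumes that </> ends the innermost open element and that <> repeats the element that was closed most recently, which is the reading the list example relies on; the standard's exact rule may differ depending on which other features are enabled. expand_empty_tags is a hypothetical helper, and the unclosed and null end tag forms are not handled.

```python
import re

# A start-tag or end-tag (possibly with an empty name), or a run of text.
TOKEN = re.compile(r"<(/?)(\w*)>|([^<]+)")


def expand_empty_tags(markup):
    """Replace the empty tags <> and </> with fully named tags.

    Assumes properly nested input without attributes.
    """
    stack = []          # currently open elements, innermost last
    last_closed = None  # name of the element that was ended most recently
    out = []
    for match in TOKEN.finditer(markup):
        closing, name, text = match.groups()
        if text is not None:
            out.append(text)
        elif closing:
            if not stack:
                raise ValueError("end-tag with no open element")
            # "</>" ends the innermost open element.
            name = name or stack[-1]
            stack.pop()
            last_closed = name
            out.append(f"</{name}>")
        else:
            # "<>" repeats the most recently closed element (assumption).
            name = name or last_closed
            if name is None:
                raise ValueError("empty start-tag with nothing to repeat")
            stack.append(name)
            out.append(f"<{name}>")
    return "".join(out)


print(expand_empty_tags("<ol><li>First element</><>Second</><>Third</></ol>"))
```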

Omitting start-tags

We've shown how end-tags can be omitted if the next element isn't allowed within the element, or a closing-tag closes some element higher in the hierarchy. Now let's take a look at how start-tags can be omitted. If an element is required by the DTD, the start-tag might be omitted:

<ol>
First element (what else?)
<li>Second element
<li>Third element
</ol>

It's this feature that allows you to drop start-tags and write "valid" HTML like:

<td>First cell, first row</td>
Second cell, first row
<tr>
First cell, second row
<td>Second cell, second row
</table>

Notice how we've omitted both start-tags and end-tags. We didn't start the table or the table row, but since <td> is only allowed inside a <tr>, and a <tr> is only allowed inside a <table>, the SGML parser should be able to straighten things out for us. You might understand now why people wanted to write XML parsers instead...

Now, let's put it all together into a nice, incomprehensible syntax (I'm actually not quite sure whether that last line is valid, but as far as I can see from the standard, it is):

<ol>
First element (what else?)
<>Second element</>
<>Third element
<p<em/Emphasized paragraph outside list/

Attributes

What about attributes? In XML, all attribute values must be quoted, like:

<p align="right">Right-aligned paragraph</p>

In SGML, the quotes are only needed if special characters are used, so one can say:

<p align=right>Right-aligned paragraph</p>

And "of course" we can avoid the whole align= thing altogether, as the only allowed values are left, center, right and justify:

<p right>Right-aligned paragraph</p>

This might help explain why HTML attributes like <td nowrap> are written like <td nowrap="nowrap"> in XHTML.
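
The way a bare value finds its attribute can be sketched in a few lines of Python. The ATTRIBUTE_VALUES table is a hypothetical, simplified set of declarations loosely modelled on HTML (not the real DTD), and expand_minimized_attribute is a made-up helper name.

```python
# Hypothetical attribute declarations: element -> attribute -> allowed values.
# Loosely modelled on HTML; an assumption for the example, not the real DTD.
ATTRIBUTE_VALUES = {
    "p": {"align": {"left", "center", "right", "justify"}},
    "td": {
        "align": {"left", "center", "right"},
        "nowrap": {"nowrap"},
    },
}


def expand_minimized_attribute(element, value):
    """Return (name, value) for a bare value such as the right in <p right>."""
    value = value.lower()
    matches = [
        name
        for name, allowed in ATTRIBUTE_VALUES[element].items()
        if value in allowed
    ]
    # The shorthand only works if exactly one attribute accepts the value.
    if len(matches) != 1:
        raise ValueError(
            f"{value!r} does not identify exactly one attribute of <{element}>"
        )
    return matches[0], value


print(expand_minimized_attribute("p", "right"))
print(expand_minimized_attribute("td", "nowrap"))
```

Since the lookup goes from value to attribute name, the shorthand only works when the value belongs to exactly one attribute of the element, which is why the sketch raises an error otherwise.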

Entities

You've probably seen entities sometime in your HTML career. For instance, if you have trouble writing the Norwegian letter ø, you might do it by using an HTML entity called "o slash", written &oslash;. Such entities are defined by the DTD. Now, not only characters can be defined by such entities, but whole phrases:

<!ENTITY stain "Stian Soiland">
<p>This example is made by &stain;</p>

Tags are allowed within such entities as well:

<!ENTITY stain "<a href='http://soiland.no/'>Stian Soiland</a>">
<p>This example is made by &stain;</p>
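
Entity expansion is plain text substitution, which a short Python sketch can imitate. html.unescape is the standard library function that knows the predefined HTML entities; the ENTITIES table and the expand_entities helper are made up for this example and only handle simple named references.

```python
import html
import re

# The predefined HTML entity for the Norwegian letter.
O_SLASH = html.unescape("&oslash;")

# Hypothetical user-defined entities, as an <!ENTITY ...> declaration would give.
ENTITIES = {"stain": "Stian Soiland"}


def expand_entities(text, entities):
    """Replace each &name; reference that has a definition in entities."""

    def replace(match):
        # Unknown references are left untouched.
        return entities.get(match.group(1), match.group(0))

    return re.sub(r"&(\w+);", replace, text)


print(O_SLASH)
print(expand_entities("<p>This example is made by &stain;</p>", ENTITIES))
```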

Now, what other things can we do? Well, what about replacing certain characters like TAB and LINE FEED with appropriate tags?

<!ENTITY ptag "<p>">
<!SHORTREF map1 "&#RS;&#RE;" ptag>
Some paragraph starts
about here, but ends
here.

The next paragraph is here, it
is only two lines long.

First, we define an entity named ptag for <p>. Then we define a short reference map that maps the sequence "record start" followed directly by "record end" to ptag. In other words, an empty "record" (in our case, a blank line) is replaced by <p>, so a <p> is inserted wherever there is a blank line. I think. I haven't tested this. Looks pretty cool, huh?
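
The intended effect of that mapping can be imitated in Python. This is only a sketch of the outcome described above, not of how an SGML parser processes short references; apply_blank_line_shortref is a hypothetical helper that treats each line as one record.

```python
def apply_blank_line_shortref(text, replacement="<p>"):
    """Replace every empty line (an empty record) with the replacement text."""
    return "\n".join(
        replacement if line == "" else line
        for line in text.split("\n")
    )


document = (
    "Some paragraph starts\n"
    "about here, but ends\n"
    "here.\n"
    "\n"
    "The next paragraph is here, it\n"
    "is only two lines long."
)
print(apply_blank_line_shortref(document))
```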

It is possible to make a larger example, replacing TAB characters with <td>, and thereby allowing the "secretary" to enter tables by just using linefeeds and tabs. However, the SGML code for this is rather long, so I will not include it here, but it is included in an appendix of the SGML standard if you really want to give it a try...

Summary

I've tried to show what kind of weird things are possible with SGML. Note that probably none of these examples will work with an SGML parser, and definitely not with any normal web browser. However, I've used HTML as the example application to have familiar elements.

If you were ever thinking about using SGML instead of XML, I hope that you will now think again. However, note that most of the funny features mentioned here are optional and not always used in SGML. As mentioned, XML is itself a restricted form of SGML.

Disclaimer

I am not an SGML expert; actually, I've never really done anything related to SGML except using HTML. I've used XML, and I've read the SGML standard. That's about it. So I cannot guarantee that all examples mentioned here are valid SGML. Some examples are copied almost verbatim from the SGML standard.

Created by stain
Last modified 2004-12-13 12:32