Remember SGML
Background
First, some quick background. SGML, like XML, is a meta-language: it describes a way to define markup languages, as indicated by its name, "Standard Generalized Markup Language". The best-known application of SGML is the language used by basically all webpages today, namely HTML. There is a "new" version of HTML called XHTML, which follows the rules of XML instead of SGML. XML was proposed as an "upgrade" of SGML, since following the complete SGML specification was cumbersome and difficult, but having a common way to encode and describe data was useful. XML has become very popular among Java programmers, probably because it's simpler to write XML than Java. Python programmers hate XML, as it's much easier to write Python than XML. But anyway, let's take a look at this SGML-thingie.
History
If you ask any CS student about SGML, he will probably think of some bearded old hacker and say it's really outdated and was never really used except maybe on some IBM mainframes. In a way, that's true: the basis of SGML is, according to Wikipedia, a language called GML (Generalized Markup Language) developed at IBM in the 1960s by Charles Goldfarb, Edward Mosher and Raymond Lorie. (Notice that the three authors' surnames also abbreviate to GML.) Here's an example from Goldfarb's site:
:h1.Chapter 1: Introduction
:p.GML supported hierarchical containers, such as
:ol
:li.Ordered lists (like this one),
:li.Unordered lists, and
:li.Definition lists
:eol.
SGML was later developed and made an ISO standard, ISO 8879, in 1986. ISO standards cost money if you want a copy, but as an NTNU student, I was able to get a copy of the almost identical British Standard electronically through our university library. This article describes some of the funny and remarkable aspects of SGML that young people like myself had no idea existed, as we've been brought up with HTML and XML and have never seen any of that fancy stuff.
Now, what about SGML?
First of all, SGML is not that different from XML. Anything that can be done in XML can be done in SGML. The description language of SGML (the DTD) is also one of the description languages available for XML. What I'll show here is all the weird things that can be done in SGML but were cleaned away to make XML a simpler language to parse and use. For some reason, the SGML people wanted to do everything at once, so the language is full of possibilities.
Omitting end-tags
Anyone who has fought with HTML and XHTML has noticed the difference: in XHTML, a well-formed document must have both opening and closing tags for every element. This even includes empty elements, which have a shorthand syntax, <br />, equivalent to <br></br>:
<ol>
<li>First element</li>
<li>Second element, contains a <br />
line break
</li>
</ol>
<p>That was our list.</p>
In old-fashioned HTML, based on SGML, you didn't need those closing tags if the definitions (the DTD) make it clear which tags are allowed where. As <li> elements can't contain other <li> elements (they're only allowed within <ol> and <ul>), it is not necessary to include the end-tag: a new <li> means the previous one is finished. Likewise, if <p> is not allowed within <ol> or <li>, a <p> means that the whole ordered list is finished, so </ol> isn't needed either. (The real HTML DTDs do allow <p> inside <li>, so take that part as an illustration of the principle.) The empty element is defined as such, nothing is allowed inside it, and therefore <br> only needs the start-tag as well; the closing tag is inserted automatically. Here's the shorter equivalent:
<ol>
<li>First element
<li>Second element, contains a <br>
line break
<p>That was our list.
Notice how everything's shorter and less verbose. This is the argument stressed all over the SGML standard: allowing such shorthand syntax was supposed to make SGML easier. For whom? Well, the secretaries, of course, who were going to type in documents in SGML at their IBM terminals!
Now, everyone who tried to make a half-decent webpage in the 1990s and dived into HTML (you know, before we got these nice content management systems like Plone) has struggled with tables and tags that suddenly take over the page because, in that particular situation, the end-tag was for some reason needed anyway. Having to think about when to include the end-tag wasn't really that useful, so people started always writing the closing tags. And that's the way it has become with XML and XHTML too. But our SGML fathers say it's a good thing to have a choice.
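To make the inference concrete, here is a minimal Python sketch of how a parser could put the omitted end-tags back. The ALLOWED table is a hypothetical, heavily simplified stand-in for a DTD (it is not the real HTML DTD, and it assumes <p> may not appear inside a list), and EndTagInserter and insert_end_tags are names made up for this illustration. Python's html.parser is only used to split the input into tags and text.

```python
from html.parser import HTMLParser

# Hypothetical, simplified content model: which children each element allows.
# This is an assumption for illustration, not the real HTML DTD.
ALLOWED = {
    "ol": {"li"},
    "li": {"br"},
    "p": {"br"},
    "br": set(),
}
EMPTY = {"br"}  # elements declared as empty: the end-tag follows immediately


class EndTagInserter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []  # names of the currently open elements
        self.out = []    # pieces of the fully tagged output

    def _close_top(self):
        self.out.append(f"</{self.stack.pop()}>")

    def handle_starttag(self, tag, attrs):
        # Close open elements until one is found that may contain `tag`.
        while self.stack and tag not in ALLOWED.get(self.stack[-1], set()):
            self._close_top()
        self.out.append(f"<{tag}>")
        if tag in EMPTY:
            self.out.append(f"</{tag}>")
        else:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # An explicit end-tag also closes everything opened inside it.
        if tag in self.stack:
            while self.stack[-1] != tag:
                self._close_top()
            self._close_top()

    def handle_data(self, data):
        self.out.append(data)

    def close(self):
        super().close()  # flush any buffered text first
        while self.stack:
            self._close_top()


def insert_end_tags(markup):
    parser = EndTagInserter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Calling insert_end_tags on the short list example (written on one line) should give back the fully tagged version with every </li>, </ol> and </p> in place. A real SGML parser works from the full content models in the DTD, so this only captures the basic idea.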
Short tags and empty tags
Let's take a look at how far we can take this with SGML. As some of us know, only <li> elements are allowed inside <ol>. So with some optional SGML features (not enabled by normal HTML), we can have some fun. First, a feature called shorttag, here shown with the empty tag. The empty end-tag </> closes the most recent open element, and the empty start-tag <> starts a new element of the same type as the previous one:
<ol>
<li>First element</>
<>Second</>
<>Third</>
</ol>
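As a rough sketch of the expansion, the following Python function rewrites empty tags into full ones. It assumes that </> closes the innermost open element and that <> repeats the most recently closed element, which is the reading the list example relies on; the standard's exact rule depends on which features are enabled. TAG and expand_empty_tags are hypothetical names, and the regular expression only understands tags without attributes.

```python
import re

# Matches <name>, </name>, <> and </> (no attributes: a simplification).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)?>")


def expand_empty_tags(markup):
    open_elements = []  # stack of currently open element names
    last_closed = None  # name of the most recently closed element
    out = []
    pos = 0
    for match in TAG.finditer(markup):
        out.append(markup[pos:match.start()])  # copy text between tags
        pos = match.end()
        is_end, name = match.group(1) == "/", match.group(2)
        if is_end:
            if name is None:
                if not open_elements:
                    raise ValueError("</> with no open element")
                name = open_elements[-1]  # </> closes the innermost element
            while open_elements and open_elements.pop() != name:
                pass  # drop elements left open inside the closed one
            last_closed = name
            out.append(f"</{name}>")
        else:
            if name is None:
                if last_closed is None:
                    raise ValueError("<> before any element was closed")
                name = last_closed  # <> repeats the last closed element
            open_elements.append(name)
            out.append(f"<{name}>")
    out.append(markup[pos:])
    return "".join(out)
```

Run on the list above (joined onto one line), it should turn every <> into <li> and every </> into </li>.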
There is also a variation called the unclosed short tag, which can be used like this (notice the missing > at the end of <p and </em):
<p<em>An emphasized paragraph.</em</p>
There's even a null end tag that can be used for really small markups, with the content contained inside the start-tag, like:
<p>Only <em/these three words/ are emphasized</p>
Omitting start-tags
We've shown how end-tags can be omitted if the next element isn't allowed within the element, or a closing-tag closes some element higher in the hierarchy. Now let's take a look at how start-tags can be omitted. If an element is required by the DTD, the start-tag might be omitted:
<ol>
First element (what else?)
<li>Second element
<li>Third element
</ol>
It's this feature that allows you to drop start-tags and write "valid" HTML as:
<td>First cell, first row</td>
Second cell, first row
<tr>
First cell, second row
<td>Second cell, second row
</table>
Notice how we've omitted both start-tags and end-tags. We didn't start the table or the first table row, but since <td> is only allowed inside a <tr>, and a <tr> is only allowed inside a <table>, the SGML parser should be able to straighten things out for us. You might understand now why people wanted to write XML parsers instead.
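The "straightening up" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: REQUIRED_PARENT is a hypothetical table saying which parent each element needs, and implied_start_tags is a made-up helper that reports which start-tags a parser would have to supply before accepting a given tag. It does not handle bare text implying a <td>, as in the example above.

```python
# Hypothetical table (an assumption, not taken from a real DTD):
# each element and the only parent it is allowed in.
REQUIRED_PARENT = {"td": "tr", "tr": "table"}


def implied_start_tags(tag, open_elements):
    """Return the start-tags to insert before `tag`, outermost first."""
    top = open_elements[-1] if open_elements else None
    missing = []
    parent = REQUIRED_PARENT.get(tag)
    # Walk up the chain of required parents until we reach the element
    # that is already open, or run out of requirements.
    while parent is not None and parent != top:
        missing.append(parent)
        parent = REQUIRED_PARENT.get(parent)
    return missing[::-1]
```

For a <td> met with nothing open, this reports that <table> and <tr> must be implied first; with a <table> already open, only <tr> is missing.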
Now, let's put it all together into a nice incomprehensible syntax (I'm actually not quite sure whether that last line is valid, but as far as I can see from the standard, it is):
<ol>
First element (what else?)
<>Second element</>
<>Third element
<p<em/Emphasized paragraph outside list/
Attributes
What about attributes? In XML, all attribute values must be quoted, like:
<p align="right">Right-aligned paragraph</p>
In SGML, the quotes are only needed if special characters are used, so one can say:
<p align=right>Right-aligned paragraph</p>
And "of course" we can avoid the whole align= thing altogether, as the only allowable values are left, center, right and justify:
<p right>Right-aligned paragraph</p>
This might help explain why HTML attributes like <td nowrap> are written as <td nowrap="nowrap"> in XHTML.
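Here is a small Python sketch of how such a minimized attribute could be expanded back into the name="value" form. The ENUMERATED table is a hypothetical excerpt of attribute declarations, written by hand for this example, and expand_attribute is a made-up helper name. The trick only works because each bare value belongs to exactly one attribute of the element.

```python
# Hypothetical excerpt of enumerated attribute declarations (an assumption):
# element -> attribute name -> set of allowed values.
ENUMERATED = {
    "p": {"align": {"left", "center", "right", "justify"}},
    "td": {"nowrap": {"nowrap"}},
}


def expand_attribute(element, token):
    """Turn `right`, `align=right` or `align='right'` into `align="right"`."""
    if "=" in token:
        # The name is already given: only normalise the quoting.
        name, value = token.split("=", 1)
        return f'{name}="{value.strip(chr(34) + chr(39))}"'
    # Bare value: find the attribute whose allowed values include it.
    for name, values in ENUMERATED.get(element, {}).items():
        if token in values:
            return f'{name}="{token}"'
    raise ValueError(f"{token!r} is not a known value for <{element}>")
```

With this table, a bare right on <p> becomes align="right", and a bare nowrap on <td> becomes nowrap="nowrap", which is exactly the form XHTML makes you write out by hand.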
Entities
You've probably seen entities at some point in your HTML career. For instance, if you have trouble typing the Norwegian letter ø, you can write it using an HTML entity called "o slash", written &oslash;. Such entities are defined by the DTD.
Now, not only characters can be defined by such entities, but whole
phrases:
<!ENTITY stain "Stian Soiland">
<p>This example is made by &stain;</p>
Tags are allowed within such entities as well:
<!ENTITY stain "<a href='http://soiland.no/'>Stian Soiland</a>">
<p>This example is made by &stain;</p>
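Entity replacement is essentially text substitution, which a short Python sketch can show. It assumes the simple declaration form used above (a name followed by a double-quoted replacement text); ENTITY_DECL, ENTITY_REF and expand_entities are hypothetical names, and real SGML entity declarations have many more variants than this handles.

```python
import re

# Only the simple form <!ENTITY name "replacement text"> is recognised here.
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\w+)\s+"([^"]*)">')
ENTITY_REF = re.compile(r"&(\w+);")


def expand_entities(declarations, text):
    """Replace each &name; in `text` using the given declarations."""
    table = dict(ENTITY_DECL.findall(declarations))
    # References to undeclared entities are left untouched.
    return ENTITY_REF.sub(lambda m: table.get(m.group(1), m.group(0)), text)
```

Given the first declaration above, the paragraph with &stain; comes out with the full name in its place, while a reference this sketch knows nothing about, such as &oslash;, passes through unchanged.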
Now, what other things can we do? Well, what about replacing certain characters like TAB and LINE FEED with appropriate tags?
<!ENTITY ptag "<p>">
<!SHORTREF map1 "&#RS;&#RE;" ptag>
Some paragraph starts
about here, but ends
here.

The next paragraph is here, it
is only two lines long.
First, we define an entity named ptag for <p>; then we define a mapping between "record start" followed by "record end" and ptag. This mapping says that an empty "record" (in our case, a blank line) will be replaced by <p>. So what happens is that <p> is inserted where the blank line was. I think. I haven't tested this. Looks pretty cool, huh?
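If that reading is right, the effect of the mapping is roughly what this little Python sketch does: every empty line is replaced by the text of the ptag entity. This only imitates the intended result; it is not how an SGML parser implements short references, and apply_blank_line_shortref is a hypothetical helper name.

```python
def apply_blank_line_shortref(text, replacement="<p>"):
    """Replace each empty line (an empty "record") with `replacement`."""
    return "\n".join(
        replacement if line == "" else line
        for line in text.split("\n")
    )
```

Two blocks of text separated by one blank line would come out with a <p> on the line between them.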
It is possible to make a larger example, replacing TAB characters with <td>, and thereby allowing the "secretary" to enter tables using just linefeeds and tabulators. However, the SGML code for this is a bit large, so I will not include it here; it is included in an appendix of the SGML standard if you really want to give it a try...
Summary
I've tried to show what kind of weird things are possible with SGML. Note that probably none of these examples will work with an SGML parser, and definitely not with any normal web browser. However, I've used HTML as the example application to have familiar elements.
If you were ever thinking about using SGML instead of XML, I hope that you will now think again. However, note that most of the funny features mentioned here are not always used in SGML; after all, XML itself is a restricted form of SGML.
Disclaimer
I am not an SGML expert; actually, I've never really done anything related to SGML except using HTML. I've used XML, and I've read the SGML standard. That's about it. So I cannot guarantee that all examples mentioned here are valid SGML. Some examples are copied almost verbatim from the SGML standard.