Vol. 16.5

Academic Computing
Editor Linda Simons

SGML: A Textual Representation for Information Structure
by Robin Cover, Academic Computing robin@acadcomp.sil.org

Part 2: The Axiological Foundations of SGML

Part 1 of this serialized article on SGML identified four broad questions to be answered for the NOC readership: (1) What is SGML?, (2) Why do people use it?, (3) How does it work — in detail?, and (4) Who uses it? The first question was addressed in Part 1 (NOC 15.6.39, October 1996). There, we introduced the key features of SGML as a markup metalanguage and provided an annotated example of a simple SGML document to illustrate those features. We also explained in detail why SGML, representing something of a paradigm shift, is not what most casual observers think it is. This second article addresses the question "Why SGML?" in terms of SGML’s underlying values and goals — the philosophical commitments and foundational principles which form the core design of the metalanguage.

Why SGML? Using a direct approach to this question, we could enumerate and elaborate upon the alleged benefits of SGML, expressed in the language of SGML users. We have rejected this strategy in favor of an indirect approach to the question, focusing instead upon the core values, principles, and goals which underlie the SGML standard, and which account for many distinctive design features of the metalanguage. These characteristics of SGML do yield certain benefits when exploited in an appropriate application design. Approaching the question in this manner will enable us to see that the precise formal specification of SGML authorized in ISO 8879:1986 itself is not what is most important about the metalanguage. What is at issue, rather, is the embrace of a constellation of principles axiologically conceived. People who share these values and goals tend to use SGML, knowing full well, however, that it is an imperfect standard.

This article first identifies and explains five key principles which form the axiological foundations of SGML. It then discusses certain additional benefits which result from the use of SGML by markup language architects who share the core values expressed in the metalanguage facilities.

Principle #1: The Primacy of Formal Models and Declarations

Many software tools present the user with a kind of blank slate: the user may "just start typing" as the initial gesture in creating document information. Philosophically, SGML does not countenance this possibility of beginning without a design; formally, an SGML document cannot be complete without a document model. SGML’s requirement for a schema or document model reflects a fundamental belief that the structure of information in a document is critical to its machine intelligibility, data integrity, and long-term usefulness. Thus, one of the most visible differences between SGML and other markup languages is SGML’s detailed mechanisms for formal declaration of the markup language rules. These rules govern SGML documents from top to bottom — from the level of characters and character sets to the top-level document structure. As overviewed in our first article, the SGML (document type) declaration formally identifies a document type and establishes its lexical rules (character set and writing system); it also defines a hierarchical model for subelements, along with attributes and default attribute values for each subelement. This declaration of lexical and syntactic rules through formal language specification yields a grammar notation that is unambiguous and robust, and one which can be exploited to make SGML documents processable in general ways.

Structure validation

Formal definition of a document schema in a DTD (Document Type Definition) provides a means of automatically checking the structural validity of a document instance. By comparing an SGML document to the corresponding document schema, a validating SGML parser can report any errors in a document’s structure, including invalid values for enumerated data types and broken links encoded by attributes. The validation principle itself constitutes a key feature of SGML as an enabling technology. For example, when data is transformed in successive stages through multiple processes, the output from each processing stage can be checked (as an SGML representation) for structural integrity. Placing the burden of validation upon the SGML parser rather than upon other processing applications simplifies the writing of conversion scripts, relaxes the demands upon processing software (e.g., allowing programmers to ignore large classes of error conditions), and prevents the propagation of common types of data errors created in workflow.
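The validation idea can be illustrated with a minimal Python sketch. It is not an SGML parser: the content models below are hypothetical, they are written as regular expressions rather than as DTD declarations, and the sample documents use well-formed XML syntax only so that Python's standard library can read them. Character data inside elements is ignored; only the order and names of child elements are checked.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models, written as regular expressions over the
# space-terminated names of an element's children. A real SGML DTD
# expresses the same idea with ELEMENT declarations.
CONTENT_MODELS = {
    "DOCUMENT": r"(SECTION )+",               # one or more SECTIONs
    "SECTION": r"(TITLE )(PARA |SECTION )+",  # a TITLE, then PARAs or subsections
    "TITLE": r"",                             # no child elements
    "PARA": r"(PHRASE )*",                    # optional PHRASEs
    "PHRASE": r"",
}


def validate(element, path=""):
    """Return a list of structural errors found at or below `element`."""
    here = f"{path}/{element.tag}"
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return [f"{here}: undeclared element"]
    errors = []
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        errors.append(f"{here}: children [{children.strip()}] do not match model")
    for child in element:
        errors.extend(validate(child, here))
    return errors


good = ET.fromstring(
    "<DOCUMENT><SECTION><TITLE>Intro</TITLE><PARA>Text</PARA></SECTION></DOCUMENT>"
)
bad = ET.fromstring(
    "<DOCUMENT><SECTION><PARA>No title</PARA><NOTE>?</NOTE></SECTION></DOCUMENT>"
)

good_errors = validate(good)
bad_errors = validate(bad)
print(good_errors)
print(bad_errors)
```

Because the checker is driven entirely by the table of models, a downstream conversion script can assume that any document which passes has the declared shape, which is the division of labor the paragraph above describes.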

Structured editing

Generic SGML editors are widely available, including some editing tools in the public domain. If one adopts such an editor, the work which must be done to create a structure-sensitive editing tool for a particular authoring and editing task is exactly equivalent to defining the document grammar, or alternately, nominating an existing DTD. Nothing more. The SGML editor understands the document grammar, and based upon its declarations, can both constrain editing and assist in structured editing operations. Menu-driven SGML editors guide an author in supplying the required information, whether in elements or attributes; conversely, they can be adjusted to prevent or flag user errors. Authors supported by structured document editors are frequently more productive. Liberated from the demands of page layout and type design, they are free to concentrate on the task central to their domain expertise: providing information content.

Structure-based navigation

Just as an SGML document grammar encodes sufficient information for syntax-directed editing, so the hierarchical tree structure represented by the elements in the conforming document can be used by a browsing or editing interface as a means of helping the user navigate through an SGML document. Many SGML-aware software tools, including DTD editors and browsers, as well as hypertext document displays, support interactive navigation features based upon the SGML tree structure and generic (ID-IDREF) link structures.

Structure-based searching

A surprisingly sophisticated degree of search functionality can be delivered to the user through the implementation of a query language which simply knows about generic SGML structure (element structure, attributes, typed links, etc.) and about a particular SGML document type. The publicly-available Panorama Free browser from SoftQuad Inc. currently implements some structure-based search functionality, as do other free utility packages and many commercial SGML software applications. A structure-sensitive searching tool can assist the user in formulating a meaningful query because it knows about the allowable document structure and relevant attributes.

For example, if a DOCUMENT is declared to have a hierarchical model of nested SECTION, PARA, and PHRASE, then one can query a DOCUMENT for "all pairs of adjacent second-level SECTIONs whose final PARA is in English and contains more than one PHRASE object containing Tod or Gift in German." Because SGML document subelements are ordered and hierarchically structured, object sequence and object hierarchy can be specified in queries as easily as flat text strings. Regular expression syntax and extended Boolean logic can be applied to the text objects (tree nodes) in terms of their hierarchical and link relationships, as well as to their explicit and default attribute values, in addition to the string data contained within the SGML elements. As will be noted later, support for generic structure-based queries has some limitations, but the information obtained from structure-sensitive queries will be far superior to results yielded by document-level string searches.
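A query of this kind can be sketched in a few lines of Python. The sample document, the `lang` and `id` attributes, and the function names are all assumptions made for illustration, and the sample is written in well-formed XML syntax so that the standard library can parse it. The sketch is not a general query language; it hard-codes the one query quoted above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: one top-level SECTION holding three second-level ones.
SAMPLE = """
<DOCUMENT>
  <SECTION>
    <SECTION id="s1.1">
      <PARA lang="en">Intro.</PARA>
      <PARA lang="en">Compare <PHRASE lang="de">der Tod</PHRASE> with
        <PHRASE lang="de">das Gift</PHRASE>.</PARA>
    </SECTION>
    <SECTION id="s1.2">
      <PARA lang="en">Both <PHRASE lang="de">Gift</PHRASE> and
        <PHRASE lang="de">Tod</PHRASE> are false friends.</PARA>
    </SECTION>
    <SECTION id="s1.3">
      <PARA lang="en">Only one: <PHRASE lang="de">Tod</PHRASE>.</PARA>
    </SECTION>
  </SECTION>
</DOCUMENT>
"""

WORDS = re.compile(r"\b(Tod|Gift)\b")


def final_para_matches(section):
    """True if the section's last PARA is English and holds 2+ German hits."""
    paras = section.findall("PARA")
    if not paras or paras[-1].get("lang") != "en":
        return False
    hits = [
        phrase for phrase in paras[-1].findall("PHRASE")
        if phrase.get("lang") == "de" and WORDS.search(phrase.text or "")
    ]
    return len(hits) > 1


def adjacent_matching_pairs(document):
    """Return id pairs of adjacent second-level SECTIONs that both match."""
    pairs = []
    for top in document.findall("SECTION"):
        second_level = top.findall("SECTION")
        for left, right in zip(second_level, second_level[1:]):
            if final_para_matches(left) and final_para_matches(right):
                pairs.append((left.get("id"), right.get("id")))
    return pairs


pairs = adjacent_matching_pairs(ET.fromstring(SAMPLE.strip()))
print(pairs)
```

Hierarchy (second-level SECTION), sequence (adjacent, final), attribute values (`lang`), and string content (Tod, Gift) all take part in the same query, which a flat string search over the file could not express.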

Principle #2: The Separation of Information Representation from Information Processing

Philosophical justification for a "processing-neutral" representation

One feature of SGML often misunderstood by people attempting to classify the metalanguage on the basis of analogy to other markup languages is the notion of SGML’s neutrality with respect to processing. Whereas most markup languages have predefined application level processing semantics, SGML does not. Simply stated, SGML allows one to encode a representation of information structure and content, but it does not formally support any means of declaring what processing is to be done other than validation, nor any means for effecting processing. A central polemic of SGML is that this radical separation of information representation from information processing is not an inherent weakness, but a critically important design feature. Failure to adequately separate these concerns, SGML users believe, significantly reduces the information value in a marked-up document and the potential for document re-use.

Newcomers frequently ask where they can find "an SGML viewer," failing to understand why the question makes no sense: SGML makes no presumption that encoded information will ever be viewed by humans, or even that it could be visually represented in any meaningful way. What makes SGML processing-neutral is that it is based upon a commitment to descriptive (structural, declarative) markup, and more profoundly, that it has no mechanism to formally express or enforce processing semantics.

The meaning of "descriptive markup"

Precisely what do we mean by "descriptive markup?" Descriptive markup is sometimes also characterized as "structural markup," or "declarative markup," or "general(ized) markup," or "content markup"; these qualifiers focus upon certain aspects of information "description" as opposed to specific data processing tasks.

SGML markup might be called "structural markup" because its markup model, above all, facilitates the representation of information structure. However, the hierarchical element structure of an SGML document is just one kind of information structure which can be encoded. Elements can have attributes that store additional information about the structural elements, and attributes that encode pointing relationships within the document. The phrase "declarative markup" (credits to Michael Sperberg-McQueen, co-editor of the Text Encoding Initiative Guidelines) is therefore perhaps more accurate. SGML exploits declarative markup in that the elements, attributes and encoded links declare or predicate something as true about the document objects which are delimited or referenced in markup. Although SGML markup thus declares something to be true (by the authority of the markup act itself), it does not signify formally what significance or relevance this truth might have in terms of information processing.

In a similar vein, "descriptive markup" is an apt characterization because a markup element in SGML describes or identifies an information unit in terms of what it is, and what its special properties are, not in terms of what processing should be done to it (e.g., not how it should be made to appear). In this sense, the markup is "general" and not (processing) "specific." The phrase "content markup" connotes a similar idea: markup notations are based upon the identification of objects by keywords or unique content-descriptive names.

How does declarative (structural, descriptive, general, content) markup increase the longevity and portability of digital information? To answer that question is to reflect upon the characteristics of markup schemes which are not declarative, and the implications for data re-use. Several terms used to identify nondeclarative markup systems are these: "specific markup," "procedural markup," "presentational markup" and "punctuational markup." The first means simply "markup specific to a particular piece of software or for a pre-determined processing goal." Procedural markup languages use codes which instruct a processing engine (a script, program, or hardware device) to do something to or with the encoded object, usually in terms of its appearance. Examples of markup notations that are not declarative in the manner we have described: (a) in Waterloo Script, the markup notation .ce might mean "center this line"; (b) in RTF, the markup string \qr might mean "right align the following text"; (c) in UNIX nroff/troff, .cu might mean "turn on continuous underline mode"; (d) in TeX, the command \eject means "create a page break here"; (e) in SIL Standard Format, \b would mean "create a blank line" and |b would mean "turn on bold print effect"; (f) in WordStar, .pa might mean "create a page break."

These nondescriptive markup notations, as well as their equivalent proprietary binary encodings used in most word processor file formats, are immediately useful and efficient for the specific processing goal envisioned by the creator of the electronic data. They do not provide a sound basis for conversion, transformation, or editing of an encoded document based upon the intellectual structure; they encode a rendered projection of the document and not the intellectual structure itself. Historically, these approaches have proven to be shortsighted and immensely expensive shortcuts. Specific (procedural, presentational, punctuational) markup fails to support information longevity and information portability because its processing directives cannot readily be used to support other processing goals which were not immediately in focus when the electronic information was created: goals such as information retrieval, structured editing, linguistic analysis, content analysis, representation in Braille, and document database management.

A fundamental conviction of descriptive markup theory, parallel to the observation that electronic data outlives the computing technology which was used to produce it, is this: the creators of electronic information cannot foresee with clarity what value and re-use their digital information might come to have in the future. They cannot anticipate what will become possible later. A central concern of SGML, as with other descriptive markup languages, is to prepare for this unforeseeable future through the rigorous separation of structure and format, or more precisely, through the separation of specifications for information representation and information processing. This is achieved by encoding information objects in terms of their names, and providing for processing specifications in a different, separate endeavor.
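The separation can be made concrete with a small Python sketch. Everything in it is hypothetical: the ENTRY document type, the style table, and the function names are invented for illustration, and the sample is written in well-formed XML syntax so that the standard library can parse it. The point is only that one descriptively tagged source can feed two unrelated processing specifications, neither of which is stored in the document.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: names say what each object is, not how to render it.
SOURCE = (
    "<ENTRY><HEADWORD>Gift</HEADWORD><LANGUAGE>German</LANGUAGE>"
    "<GLOSS>poison</GLOSS></ENTRY>"
)

# Processing specification 1 (hypothetical): a print-style rendering,
# kept in a table outside the document.
PRINT_STYLE = {"HEADWORD": "**{}**", "LANGUAGE": "({})", "GLOSS": "'{}'"}


def render(entry, style):
    """Format each child according to a style table keyed by element name."""
    return " ".join(style[child.tag].format(child.text) for child in entry)


def index_record(entry):
    """Processing specification 2: a retrieval record, ignoring appearance."""
    return {child.tag.lower(): child.text for child in entry}


entry = ET.fromstring(SOURCE)
printed = render(entry, PRINT_STYLE)
record = index_record(entry)
print(printed)
print(record)
```

Had the source instead been encoded with directives such as "bold on" and "italic on", the second function would have had nothing reliable to key on; a third use, devised years later, needs only a new function and no change to the data.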

Principle #3: The Commitment to Information Longevity

The single most important principle guiding SGML design was perhaps this: users should be provided a means of protecting their electronic data from obsolescence due to changes in computing technology. They need a robust but flexible data representation which will preserve the value of digital information over a long period of time, free from the control of hardware and software manufacturers. Erik Naggum, formerly the maintainer of The SGML Repository, characterized SGML as follows:

... a way of life once you have realized that the information we create takes on a life of its own and it can die if we don’t care for and feed it properly. In ancient times, you had to burn down a major library to destroy information, but you got to be remembered for it. Today, you need only upgrade to the latest version of a particular software product, change a printer, use patented software in the compression of the data, etc., to destroy many orders of magnitude more information, but the history books have yet to notice that the previous generation was the last to leave permanent traces of its tools. ("Usenet News: comp.text.sgml", 07-February-1995)

Recognizing that digital information has value, and that the information in an electronic document customarily outlives the software and hardware used to create it, SGML users insist that the underlying representation insulate that information from the predictable software and hardware obsolescence. In the following sections, we will explore in greater detail some of SGML’s features which represent the outworking of this philosophical commitment to information longevity.

SGML as a nonproprietary, vendor-neutral standard

Question: if Microsoft owns the RTF standard, if Adobe owns the PostScript and PDF standards, and if Sun Microsystems owns Java: who owns the SGML standard? Answer: the international community, specifically ISO/IEC JTC1/SC18/WG8, representing ultimately some hundreds of commercial and noncommercial entities. The distinction is critical because representatives on the ISO subcommittees and working groups do not wish to see the SGML standard changed in ways that put encoded information and established workflows at risk. Most researchers who have used proprietary software (or public domain software which supports a proprietary data format) have suffered at some point from the decision of the controlling entity to "change the standard." In terms of protecting electronic information, what matters is not whether a data standard is "open" (viz., openly published), but who controls the standard. Of course, if electronic information created in a given setting has no lasting value, and cannot possibly be regarded as a corporate asset, then it’s inconsequential whether the data format conforms to a standard at all. Many companies today are electing to use SGML because they require that their information assets be insulated from the twists of economic fate and monopolistic stratagems that are commonplace in the competitive world of commercial software development.

SGML as system independent

A second feature of SGML which reflects a commitment to data longevity and portability is the notion of system independence. Facilities were built into the structure of SGML at two levels to account for the fact that different hardware platforms and operating systems use different low-level conventions which make data interchange hazardous. These conventions include default character sets, encodings for newline, end-of-file marking, escape sequences, (in)significance of trailing white space, and other features of internal file formats. SGML’s general entity mechanism exploits a robust and entirely flexible kind of indirection which supports the representation of characters or symbols that are not encoded the same way on different platforms. Judicious use of facilities accessible in the SGML declaration can also enhance system independence. Injudicious use of these facilities can decrease data portability — at least in terms of blind interchange. The fact that SGML requires one to make declarations about the character sets and other system conventions provides a critical piece of documentation for information encoded in SGML.
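The indirection provided by entity references can be sketched as follows. This is a simplified illustration in Python, not SGML entity processing as defined by the standard: the two per-system tables and the `resolve` function are hypothetical, and only the entity names `eacute` and `uuml` are defined. The byte values are those used for é and ü in ISO 8859-1 and in IBM PC code page 437.

```python
import re

# Hypothetical per-system entity tables: the same entity name resolves to
# whatever byte each platform's character set uses for that character.
ENTITY_TABLES = {
    "latin1": {"eacute": b"\xe9", "uuml": b"\xfc"},  # ISO 8859-1
    "cp437": {"eacute": b"\x82", "uuml": b"\x81"},   # IBM PC code page 437
}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def resolve(text, system):
    """Replace each &name; reference with the bytes used on `system`."""
    table = ENTITY_TABLES[system]
    pieces = []
    position = 0
    for match in ENTITY_REF.finditer(text):
        pieces.append(text[position:match.start()].encode("ascii"))
        pieces.append(table[match.group(1)])
        position = match.end()
    pieces.append(text[position:].encode("ascii"))
    return b"".join(pieces)


# The interchanged text contains only ASCII plus named references.
portable = "caf&eacute; M&uuml;ller"
as_latin1 = resolve(portable, "latin1")
as_cp437 = resolve(portable, "cp437")
print(as_latin1.decode("latin-1"))
print(as_cp437.decode("cp437"))
```

The stored bytes differ between the two systems, yet both decode to the same text, because the interchanged document names the character instead of committing to one platform's code for it.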

SGML as application-independent

The application independence of SGML document data has been explained in detail above, under Principle #2. As a further illustration, we note that some large documentation projects have a 15-30 year life cycle. It is known in advance that the life cycle of the technical or legislative manuals will be measured in decades and not years, and that many different pieces of software, some not invented yet, will be used to maintain and deliver the electronic information. In such an environment, corporate managers now understand, it is imperative that one employ a robust but flexible information representation format capable of protecting the data and facilitating their migration into many different software applications. It turns out that computing technology is surprisingly short lived, and that digital information has a surprisingly long life. Each new generation of computing technology presents opportunities to re-use electronic information in ways that could not have been envisioned in the previous era. SGML’s goal of application independence embraces this reality, seeking to leverage the value of the encoded information.

SGML as human-readable

SGML and Standard Format share the notion that data files should be encoded in a manner which makes their interpretation plain and unambiguous to human readers. Human-readability does not mean that an SGML document is intended for human readers, but that it can be read and edited without the assistance of proprietary software. For this reason, the use of nonprintable characters such as control characters (ASCII decimal 0-31, other than the newline characters CR and LF) is generally deprecated. Likewise, the names given to the SGML markup units are supposed to be mnemonic and unambiguous, using words and phrases from natural languages. The markers and markup constructs are supposed to be described informally in a manner that clarifies their purpose and usage. Thus, in being descriptive, an SGML markup system sacrifices the compactness that comes with binary or other proprietary encoding, in favor of a verbose robustness that lends itself to unambiguous human (and machine) interpretation.

The notion of SGML’s human-readability must be qualified in two important ways. First, SGML does allow for the inclusion of proprietary/binary data types (e.g., music, graphics) in documents through the use of a NOTATION declaration, which associates the non-SGML data with an application. Such data, at one level, are not human-readable. Second, human-readability does not imply that SGML documents ought to be easily editable using a plain text editor. The world of HTML in 1997 provides an example: though it is possible to edit HTML 4.0 source code by hand, most users do not.

Principle #4: The Distinction between Elements and Attributes

At several junctures we have observed that the central concern of SGML is not printing or other processing: its central concern is information representation. A design feature of SGML reflecting this concern for information richness is the formal distinction between elements and attributes. Attributes provide the mechanism for storing additional information about elements. As implemented in ISO 8879:1986, attributes allow the encoding of a unique identifier for any element, for referencing any other element or group of elements, and for providing a wide range of other information about the element: subclassification, description, qualification, or other attributive notions which can be expressed in a name-value pair. A validating SGML parser uses the attribute declarations to check, at least nominally, whether the values given for attributes in a document instance conform to the declared data types (e.g., whether an element identifier is unique; whether a pointer has a target; whether a value is a proper token from a declared list of possible values, etc.). Processing systems, depending upon the particular processing goals, can make use of the additional information stored in attributes to intelligently (selectively) make use of the elements in the SGML document.
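The attribute checks mentioned above (unique identifiers, pointers with targets, values drawn from a declared list) can be sketched in Python. The declaration table, the element and attribute names, and the `check_attributes` function are hypothetical; a real SGML parser reads the same information from ATTLIST declarations in the DTD. The sample is written in well-formed XML syntax so that the standard library can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute declarations: (element, attribute) -> declared type.
# A tuple stands for an enumerated list of permitted name tokens.
ATTRIBUTE_TYPES = {
    ("PARA", "id"): "ID",
    ("XREF", "target"): "IDREF",
    ("PARA", "status"): ("draft", "final"),
}


def check_attributes(root):
    """Return error strings for ID, IDREF, and enumerated-value violations."""
    errors = []
    seen_ids = set()
    references = []
    for element in root.iter():
        for name, value in element.attrib.items():
            declared = ATTRIBUTE_TYPES.get((element.tag, name))
            if declared is None:
                errors.append(f"{element.tag}: undeclared attribute {name}")
            elif declared == "ID":
                if value in seen_ids:
                    errors.append(f"duplicate ID {value}")
                seen_ids.add(value)
            elif declared == "IDREF":
                references.append(value)
            elif value not in declared:
                errors.append(f"{element.tag}: {name}={value} not in {declared}")
    # IDREFs are checked last because a pointer may precede its target.
    errors.extend(
        f"IDREF {ref} has no target" for ref in references if ref not in seen_ids
    )
    return errors


doc = ET.fromstring(
    '<DOC><PARA id="p1" status="final">See <XREF target="p2"/>.</PARA>'
    '<PARA id="p1" status="done">Also <XREF target="p9"/>.</PARA></DOC>'
)
attribute_errors = check_attributes(doc)
for line in attribute_errors:
    print(line)
```

The checks are nominal in the sense used above: the sketch can tell that a pointer has a target, but not whether it points at the right one.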

It is freely admitted that the notion of attribute is weak in SGML when compared to modern object-oriented database systems, for example. We may also justifiably claim that the additional validation services provided by SGML through attributes are significant, and that the formal distinction between elements and attributes, analogous to Objects and Attributes in object-oriented systems, is well-motivated in terms of SGML’s focus upon the richness of structured information.

Principle #5: Flexibility through Extreme Generality

The widespread applicability of SGML throughout industry, government, and academic sectors is due, in part, to the fact that SGML is completely information neutral. Virtually every major industry now uses SGML — not simply for generic documents, but for the management of industry-specific information types. If SGML’s weakness is that it cannot validate information content at a semantic level — and indeed it was designed not to do so — then its strength is that it can be used to represent the structure of information within any domain of human knowledge. Databases that are part of dedicated processing systems normally have particular strengths that make them optimally suited for creating and maintaining certain kinds of information, but utterly useless for information that does not fit. SGML as a notation for representing information structure has no similar limitation.

Concomitant with this generality and flexibility is SGML’s complete indifference to information complexity: there are no intrinsic, artificial restrictions placed upon levels of nesting and embedding, granularity of markup, link density, document size, markup density, number of subdocuments, number of disk files, etc. Thus, SGML has scaled tolerably well for ten years with advances in hardware and software technology, as well as with the rapidly-accumulating terabytes of corporate data that need to be made accessible to document database management systems.

Key Benefits Derived from SGML’s Philosophical Commitments

SGML is used today as an enabling technology: used in conjunction with other integrated processing components, it participates in an "open information management" philosophy. SGML plays a key role by protecting the neutrality, integrity, modularity, and long-term usability of document component information as it is managed within a central repository and accessed by dedicated processing systems. Some manifestations of the beneficial role of SGML have been mentioned in conjunction with the five foundational principles above. Some additional benefits are elaborated in this final section.

Support for information interchange

Multi-national corporations as well as smaller enterprises often need to be able to share electronic documents or document components across multiple applications and machine architectures, perhaps in several different countries. Collaborative work environments frequently demand that several authors contribute to a writing project at the same time. Given these requirements and the reality of computing platform pluralism, it makes eminent sense to define a single interchange format that can serve as a database standard. SGML serves this purpose well. Given N different applications that make use of the information, having a single interchange format means having to build 2N transformation modules (one into and one out of the interchange format for each application) rather than N(N-1), one for each ordered pair of applications.
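The arithmetic behind this argument can be worked through in a few lines of Python. The counting rule is an assumption: each transformation module is taken to be one-way, so a direct conversion between two applications needs one module in each direction. The function names are illustrative only.

```python
def hub_converters(n):
    """Modules needed when every application converts to and from one hub format."""
    return 2 * n


def pairwise_converters(n):
    """Modules needed when each application converts directly to every other."""
    return n * (n - 1)


for n in (3, 5, 10, 20):
    print(n, hub_converters(n), pairwise_converters(n))
```

Under this counting rule the two approaches break even at three applications; beyond that the hub grows linearly while direct conversion grows quadratically, so ten applications need 20 modules through a hub against 90 without one.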

Support for data archiving

Archiving digital information is analogous to data interchange: corporations need a machine-neutral and open format that can survive years or decades of technology change. SGML provides such a format, particularly to the extent that the information is stored in a textual representation. Of course, SGML of itself does nothing to ensure that digital information will not be trapped on physically unreadable media. However, with today’s ubiquitous networks and sophisticated digital copying technologies (multiple validation phases using cyclic redundancy checks), the problem of infinite information longevity is now effectively solved at the technology level. The archive integrity problem now amounts to corporate or personal discipline in scheduling the creation of multiple storage copies of archived data on a timely basis. SGML helps ensure the long-term accessibility, interoperability, and meaningfulness of this archived information by storing it in an open, documented standard representation.

Support for information re-use

Corporate entities in industry and government are now discovering that 90% or more of their information assets reside in documents. The percentage is not so high in financial institutions, where transaction data is the core asset. Some 20% of the GNP in industrialized countries now involves the creation and management of new information. In almost all enterprises, documents and document database repositories are being discovered as information banks, and information management is becoming the focus in enterprise productivity. In many cases, corporations have only gradually become aware of the importance and magnitude of the information asset represented by their legacy documents.

Given these changes, we observe a fundamental mismatch between the individual worker’s (traditional) view of a "document" as a deliverable which finds its final expression on paper and the corporation’s (new) view of that document as an electronic information asset. Thus, a critical task in corporate information management has now become — and increasingly with the Intranets of 1997 — the rescue and conversion of word processor documents, integrating them in a standard format into centralized document databases for wider delivery. From the central document component repository, new documents as virtual objects may be generated. The bottleneck in this process is conversion: conversion is a painfully expensive proposition.

The terms "rescue" and "expensive" in the previous paragraph oblige us to rehearse what has now become classical polemic against the traditional view of the paper document as a final deliverable. Conventional word processors are used to compose these documents, and the associated word processing files typically contain a conglomeration of structurally underspecified textual and graphical data in any sequence or arrangement an author elects. This situation obtains, the availability of named styles in some word processors notwithstanding, because word processor design has been dominated by assumptions and goals which are commensurate with the creation of formatted character text, but which are fundamentally antithetical to the creation of structured, normalized digital information. Authoring software has been designed primarily for the purpose of creating and manipulating data which will find their ultimate expression outside the computing environment: on paper (or its electronic page equivalent), on overhead transparencies, on film, on television screens, or on other media for which visual presentation of the data is of paramount importance. In most such WYSIWYG (alias "What you see is all you’ve got!") applications, the internal representation of the "digital stuff" within the disk files matters little so long as the visible product (the book, the advertising copy, the presentation graphic) communicates effectively and is delivered on schedule.

This traditional model’s goals and assumptions support the immediate felt needs of authors and traditional publishers tolerably well within closed systems where the formatted text, prepared for screen or paper, is thought to be the final deliverable. Increasingly, however, the assumptions of this model are being found inadequate for the versatile electronic environment in which a document, as a network-deliverable information object with revisable and re-usable content, is seen to constitute a valuable corporate resource. The document information, corporate managers discover, needs to be re-purposed and re-used.

Information re-use is realized in a variety of ways. It may be discovered that document information needs to be indexed, or published on CDROM, or revised for use in a different documentation effort, or abridged, or analyzed for natural language properties, or converted into Braille, or atomized for incorporation into a document-component database. The critical feature of SGML which makes document re-use possible is that information units are descriptively tagged — demarcated and linked — in ways that support intelligent indexing, conversion, analysis, and related operations. The use of specific markup, by contrast, almost always gets in the way of data re-use because transformations need to be based upon identifiable intellectual structures, not upon presentational or procedural markup commands.

The re-use of digital information discussed above focused upon the corporate perspective: how corporate bodies can best leverage their information assets by encoding digital information in a way that reflects the underlying intellectual structure. The argumentation applies no less to individual scholars and researchers who use computers to manage data electronically. If they are committed in principle to the long-term relevance of their research databases, they will seek to encode the data in a standard, robust, application-neutral, machine-independent way. The widespread use of SGML for structuring literary and linguistic data in universities and digital libraries already bears testimony to this verdict.

Conclusion

Such are the theoretical and practical benefits that come from SGML, articulated in light of the core values and principles reflected in SGML’s metalanguage design. In the next article, we will dig deeper into the SGML mechanisms themselves, explaining how the different levels of abstraction support information re-use through document database repositories.


Copyright © Summer Institute of Linguistics, Inc.

Any use of the materials contained herein must acknowledge NOC and the Summer Institute of Linguistics, and the authors.