A Metadata Architecture for Digital Libraries

 
Ron Daniel Jr.
Los Alamos National Laboratory
rdaniel@lanl.gov
Carl Lagoze
Cornell University
lagoze@cs.cornell.edu
Sandra D. Payette
Cornell University
payette@cs.cornell.edu
 

1 Introduction

Our society and economy have long been based on the production and consumption of physical goods. Computers were initially developed to model, and later to control, physical processes. But the development of computers had the unforeseen effect of spurring the migration of our society and economy to an information basis. Now, large segments of our economy are devoted to producing, enhancing, and consuming data. In order to make this more efficient, we are applying computers to the modeling and control of information processes. Thus, the increasing importance of metadata (data about data) can be seen as an inevitable development. Seen in this light, we do not anticipate any slowdown in the uses and abuses of the term "metadata".

We deliberately mention abuses, because the simple definition of metadata as "data about data" causes a number of problems. To be more precise, these problems arise not from the definition, but from unspoken assumptions about the data being described and the purposes the description is to serve. For example, the massive storage community considers metadata to be the information on how a terascale dataset has been sliced and diced for storage onto a set of tapes. Statisticians consider metadata to be information on how the samples for a set of experimental measurements were prepared, as well as information on known biases in experimental measurements that need to be corrected. Database designers consider metadata to be relational schemas and data dictionaries, while librarians consider it to be catalog records. Even within a community there will be a number of different purposes to be achieved, and forms of metadata developed to meet those purposes. As an example, librarians not only have catalog records for the items in their collection, they have controlled vocabularies for subject descriptions and author names, and special statements on the terms and conditions for accessing a work.

This diversity of datasets and purposes holds little hope of a universal taxonomy for metadata, short of the development of a theory of everything. Yet there are application areas, such as digital libraries, where implementers will need to be able to deal with all the varieties of metadata mentioned above, and no doubt many more. As one example, Los Alamos National Laboratory is developing a Scientific Data Management system that must deal with:

Managing these various forms of data, especially in the face of changing requirements over the expected lifetime of the system, is a broad technical challenge. Coming up with a metadata architecture that is general enough to accommodate these different forms of data, while still being of assistance to implementers, is a particularly interesting facet of that challenge. The Warwick Framework [WF1] was developed to deal with just this diversity of task-specific and community-specific metadata sets.

Other applications, such as a system of interoperable digital library repositories, also require a reasonably general approach to the metadata problem. We already see that documents are becoming more like assemblies of software components [Extreme]. This is best seen in modern web documents, where Java applets and JavaScript functions are used to interact with remote servers. The goal of this componentization is two-fold: to better communicate with the reader, and to allow the reuse of the components in many documents. A repository architecture that can tailor documents to the capabilities of the reader's system, as well as control how components are re-used, will need to be very flexible if it is to remain viable for any length of time.

What is needed is a model for metadata that allows new forms to be dealt with in a fashion similar to existing forms, a model that fully exploits the unique combination of computation and connectivity that characterizes the digital library. In this paper, we describe an extension of the Warwick Framework that we call Distributed Active Relationships (DARs). DARs provide a powerful model for representing data and metadata in digital library objects. They explicitly express the relationships between networked resources. New relationships may be defined anywhere in the network, and may even be defined in such a way that they could be dynamically downloaded and executed in a manner analogous to a Java applet.

The DAR model is based on the following principles, which our examination of the "data about data" definition has led us to regard as axiomatic:

The remainder of this paper describes the development and consequences of the DAR model. Section 2 reviews the Warwick Framework, which is the basis for the model described in this paper. Section 3 examines the concept of the Warwick Framework Catalog, which provides a mechanism for expressing the relationships between the packages in a Warwick Framework container. With that background established, section 4 generalizes the Warwick Framework by removing the restriction that it only contains "metadata". This allows us to consider digital library objects that are aggregations of (possibly distributed) data sets, with the relationships between the data sets expressed using a Warwick Framework Catalog. Section 5 further extends the model by describing Distributed Active Relationships (DARs). DARs are the explicit relationships that have the potential to be executable, as alluded to earlier. Section 6 describes two ongoing implementations of these concepts, and section 7 concludes the paper.
 

2 The Warwick Framework: Modularizing the Metadata Universe

Metadata efforts often fall into the trap of trying to create a universal metadata schema. Such efforts fail to recognize the basic nature of metadata: namely, that it is far too diverse to fit into one useful taxonomy. Within the digital library domain alone there exist a variety of metadata forms -- bibliographic description, content rating, rights management, and many others -- that correspond to the interests of unique communities of expertise. An investigation of other domains -- such as mass storage, database, and statistical -- uncovers reasonable, but widely divergent, definitions of what constitutes metadata.

Coordinating metadata development across all those domains is impossible. Therefore the creation, administration, and enhancement of individual metadata forms should be left to the relevant communities of expertise. Ideally this would occur within a framework that will support interoperability across data and domains. The Warwick Framework (WF) provides just such a modular approach to metadata.

The Warwick Framework originated from an attempt, at the Second Invitational Metadata Workshop [WW], to define an extension mechanism for the Dublin Core Metadata Element Set [DC] in order to prevent unrestricted growth in its complexity. Named after the site of the workshop in Warwick, the WF tackles the extension problem by aggregating typed metadata packages into containers. The WF defines three types of package:

 
Figure 1 - Simple Warwick Framework Container

Figure 1 illustrates a simple example of a Warwick Framework Container. The container contains three logical packages of metadata. The first two, a Dublin Core record and a MARC record, are physically in the container. The third metadata package, which defines the terms and conditions for access to a content object, is referenced in the container indirectly via a URI.

The framework is a simple concept, but it has important implications for interoperation, and as the basis for long-lived metadata systems. By factoring complex descriptions into simpler components, interoperation can be addressed at a component level, rather than at an "all or nothing" monolithic level. The framework also allows for lowest-common-denominator descriptions, such as the Dublin Core, to exist beside complex descriptions from specialized communities, such as MARC. Thus, members of the same community can exchange their rich descriptions in preference to more general ones. System evolution is facilitated since, as new purposes for datasets emerge, new metadata schemas and formats can be developed instead of trying to evolve an already-established schema. Instances of the new schemas can be added as new packages to the container(s) associated with the dataset. New handlers can be added to utilize the new package, and this can occur without significant disruption to the metadata system architecture as a whole.
 

3 Warwick Framework Catalog: Relationships among Metadata Packages

The Warwick Framework offers significant benefits, but imposes a burden on the client (or agent) that accesses a complex container. The client encounters a richly structured set of packages, which presumably are related, but those relationships are not made explicit. The client must try to infer the relationships from the data types of the packages, or by looking inside packages that have a known format. This is not a difficult problem in constrained systems, but it is an impossible task for more general-purpose systems. Such systems need a catalog of the packages in the container that makes the relationships explicit.

To meet this need, we defined a new abstraction called the Warwick Framework Catalog (WFC). A WFC is a list of assertions about individual packages and the relationships between packages. Example relations are one package acting as a digital signature, bibliographic description, or access control specification for another package.

        (bibliographic-description package-1 package-2)
        (terms-for-accessing package-1 package-3)
        (derived-via-transformation package-1 package-6 package-5)
        (digital-signature package-1 package-4)
        (digital-signature package-6 package-7)

Listing 1 - Contents of a Warwick Framework Catalog

Listing 1 illustrates an example Warwick Framework Catalog. It shows that package-2 is a bibliographic description of package-1, while package-3 provides the terms for gaining access to package-1. Relations need not be binary; we might state that package-5 is derived from package-1 by a transformation that is specified in package-6. The same relation might hold between different sets of resources, as shown by the digital-signature relation in the last two lines. Figure 2 shows a simple Warwick Framework Container with a relationship package.
 

Figure 2 - Relationships for a Warwick Framework Container
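
As an illustration only, a client could read a catalog written in the s-expression syntax of Listing 1 along the following lines. This is a minimal sketch, not part of the Warwick Framework itself: it assumes flat (un-nested) expressions, and the names Relation, parse_catalog, and relations_about are hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple


class Relation(NamedTuple):
    """One catalog assertion: a relation name plus its package arguments."""
    name: str
    args: Tuple[str, ...]


def parse_catalog(text: str) -> List[Relation]:
    """Parse flat s-expressions such as '(name arg1 arg2)' into Relations."""
    relations = []
    for body in re.findall(r"\(([^()]*)\)", text):
        tokens = body.split()
        if not tokens:
            continue  # ignore empty "()" forms
        relations.append(Relation(tokens[0], tuple(tokens[1:])))
    return relations


def relations_about(catalog: List[Relation], package: str) -> List[Relation]:
    """Return every assertion that mentions the given package."""
    return [r for r in catalog if package in r.args]


# The catalog from Listing 1, reproduced verbatim.
LISTING_1 = """
(bibliographic-description package-1 package-2)
(terms-for-accessing package-1 package-3)
(derived-via-transformation package-1 package-6 package-5)
(digital-signature package-1 package-4)
(digital-signature package-6 package-7)
"""

catalog = parse_catalog(LISTING_1)
for relation in relations_about(catalog, "package-6"):
    print(relation.name, relation.args)
```

With the catalog in hand, a client no longer has to inspect package-6 to learn that it takes part in both a transformation and a signature relation.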

The WFC could be provided as the first package in a container, and would provide enough information to the receiver to allow proper treatment of the remaining packages. Although the example above uses an s-expression syntax, the WFC is essentially another conceptual model that can be expressed in a number of ways. The key contribution of the WFC is that it leads to some far-reaching generalizations to the Warwick Framework. Those generalizations are described in the next two sections.
 

4 From Metadata Containers to Digital Objects

Our original motivation for developing the Warwick Framework was to aggregate multiple independent metadata sets. But there is no essential difference between data and metadata. Metadata is data, no more and no less. As an example, consider a movie review by the well-known critics Gene Siskel and Roger Ebert. From one perspective, such a review is clearly metadata about the movie. From another point of view, it is a piece of intellectual content (data) with its own copyright and other metadata. If metadata did have an essential difference from data, then the copyright statement "about" the movie review would be meta-metadata. Information "about" when that statement was drafted, and what team of lawyers did the drafting, would be meta-meta-metadata. This fruitless proliferation in degrees of meta-ness is another indication that the metadata/data distinction is pointless.

A better approach is to consider the information architecture as a collection of inter-related resources. While these resources may have a type, such as PostScript, HTML, or a Java program, this type is orthogonal to whether the resource is acting as data vs. metadata in some context. That contextual information is specified by the relationships between the resources. We can model these inter-related resources using directed graphs, where nodes represent the resources and the labeled arrows between nodes represent the relationships. Since a resource may be related to many other resources, nodes may have many arcs originating from or terminating at them. Looking at the direction of an arrow, it is easy to see whether a resource is playing the role of data or metadata in the context of that particular relationship. We can easily accommodate such a model by generalizing the Warwick Framework so that it may contain any resources, not just those considered "metadata". Thus, we can use the Warwick Framework Catalog to specify the relationships between various resources, both inside and outside the container.

As a simple example (we will use more complex examples later), assume that the relationship arcs are uni-directional and that the only relationship they specify is "has-metadata". Figure 3 shows a set of resource nodes and relationship arcs that correspond to the Siskel and Ebert movie review mentioned earlier. For the moment, ignore the three overlapping ovals in the figure. As illustrated, certain resource nodes have both outgoing and incoming arcs; thus they are "data" in one context and "metadata" in another. For example, the Siskel and Ebert review is metadata for the movie "Men in Black", but the review has metadata of its own (it is acting as "data" relative to a Dublin Core record and a Terms and Conditions specification).

 
Figure 3 - Data Nodes with Simple Relationships
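
The directed-graph view can be sketched in a few lines of code. The sketch below is illustrative only: it encodes just the "has-metadata" arcs that the text states explicitly for the review example, and the node labels and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Each arc reads "source has-metadata target". Only arcs stated in the text
# are listed; the node labels are hypothetical.
ARCS: List[Tuple[str, str]] = [
    ("movie", "review"),
    ("review", "dublin-core-record"),
    ("review", "terms-and-conditions"),
]


def roles(arcs: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Derive each node's role(s) purely from the direction of its arcs."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for source, target in arcs:
        result[source].add("data")      # something describes this node
        result[target].add("metadata")  # this node describes something
    return dict(result)


def aggregate(arcs: List[Tuple[str, str]], root: str) -> Dict[str, object]:
    """Group a resource with the resources that directly describe it."""
    return {"data": root, "metadata": sorted(t for s, t in arcs if s == root)}


node_roles = roles(ARCS)
print(sorted(node_roles["review"]))  # the review plays both roles
print(aggregate(ARCS, "review"))
```

The role of a node is never stored; it falls out of the arcs, which is the sense in which a resource's type is orthogonal to whether it acts as data or metadata.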

We can take a different perspective on Figure 3 and formulate three digital library resources, which can be found through resource discovery and accessed using unique identifiers (such as URLs and URNs). Each of these resources aggregates data and related metadata. These aggregations, shown by the overlapping ovals, are:

  1. The movie "Men in Black", with three metadata objects: a review, a Dublin Core record, and a terms and conditions specification.
  2. The Siskel and Ebert review, with two metadata objects: a Dublin Core record and a terms and conditions specification.
  3. A terms and conditions record, with one metadata object: an administrative record (i.e. who is responsible for its creation and maintenance).

This notion of digital library resources being aggregations (containers) of data packages that are connected via relationships leads us back to the ideas already discussed in this paper: the Warwick Framework container and the Warwick Framework Catalog. Released from the restriction that it is merely a container for metadata, it makes sense to consider Warwick Framework containers as a framework for aggregating datasets into identifiable digital library objects (or digital objects as in [ARMS]). Figure 4 shows a Warwick Framework container that represents the "Siskel and Ebert" review digital object referred to above.

 

Figure 4 - Digital Object as a Warwick Framework Container

In generalizing the Warwick Framework as a digital object container, we emphasize two features and then introduce a significant extension.

First, recall that the Warwick Framework places no locality restriction on the packages that it "contains". A package may either be physically in a container or indirectly referenced via a URI (thus, it might be located anywhere in the global information space). This is demonstrated in Listing 2, in which the relationships in a Warwick Framework Catalog refer to resources using URIs as well as internal package references. Figure 5 illustrates a digital object container that references, through the relationship catalog, a component of an external digital object. One interesting manifestation of this is that a container, or digital object, may actually have no physically contained data sets, but may act merely as a logical container with only relationships that reference remote data sets.

Second, the example in Figure 3 illustrates only one simple type of uni-directional relation, the "has-metadata" relation. However, as we have emphasized throughout our work on the Warwick Framework, the notion that something "is metadata" does little to convey its actual meaning and, therefore, such a simple relationship should be avoided. The Warwick Framework Catalog can include a variety of relationships with much richer semantics, such as "terms-for-accessing", "bibliographic-description", and the like.

            (bibliographic-description package-1 URI-1)
            (terms-for-accessing package-1 URI-2)

Listing 2 - Relationships with indirect references
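
A client that processes a catalog like Listing 2 has to treat the two kinds of argument differently. The following sketch shows one way this might look, under stated assumptions: any argument with a URI scheme is taken to be an indirect reference, the fetch function is a stand-in for real network retrieval, and the example URIs and class names are hypothetical.

```python
from typing import Callable, Dict, NamedTuple
from urllib.parse import urlparse


class Reference(NamedTuple):
    kind: str   # "internal" or "indirect"
    value: str


def classify(argument: str) -> Reference:
    """Assumption: anything with a URI scheme is an indirect reference."""
    scheme = urlparse(argument).scheme
    return Reference("indirect" if scheme else "internal", argument)


class Container:
    """A container whose packages may be held locally or referenced by URI."""

    def __init__(self, packages: Dict[str, bytes],
                 fetch: Callable[[str], bytes]) -> None:
        self.packages = packages
        self.fetch = fetch  # used only for indirect references

    def resolve(self, argument: str) -> bytes:
        ref = classify(argument)
        if ref.kind == "internal":
            return self.packages[ref.value]
        return self.fetch(ref.value)


# Stand-in for the network: a hypothetical remote store keyed by URI.
REMOTE_STORE = {
    "urn:example:dc-record": b"<dublin core record>",
    "http://example.org/terms": b"<terms and conditions>",
}

container = Container({"package-1": b"<content>"}, REMOTE_STORE.__getitem__)
catalog = [
    ("bibliographic-description", "package-1", "urn:example:dc-record"),
    ("terms-for-accessing", "package-1", "http://example.org/terms"),
]
for name, subject, target in catalog:
    print(name, classify(subject).kind, classify(target).kind)
```

A purely logical container is simply the case where the packages dictionary is empty and every catalog argument is an indirect reference.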

Up to this point, we have assumed that the relationships in the Warwick Framework catalog are identified with simple names, which might be listed in some registry. A more general solution is to let the relationship names be URIs. This provides a scoping mechanism to preclude name clashes. More interestingly, it opens up the possibility of making the relations into resolvable first-class resources in their own right. These "relation resources" might have their own "metadata", including access controls and descriptions. In this scenario, the simple relationship arcs illustrated in Figure 3 become nodes in their own right, with possible relationships to other data nodes. In the next section, we extend this notion even further by describing executable relationships that enable dynamic and interpretable data and metadata.
 

Figure 5 - Distributed Digital Object Container
 

5 Distributed Active Relationships: Enabling Dynamic Data and Metadata

Two characteristics that distinguish the digital domain from its physical counterpart are connectivity and computation. The concepts we have presented so far have exploited the connectivity component. In this section, we exploit the computation component by proposing Distributed Active Relationships (DARs). These are relations that not only may be drawn from anywhere on the network, but may also be executable.

The best way to describe the motivation and use of DARs is to apply them to a well-known problem, rights management. Managing intellectual property rights for digital library objects is complex, and we refer the reader to [GLAD] for a more thorough treatment of the subject. At one end of the spectrum, rights management metadata may be a simple textual description, such as that used in "shrink-wrap" licenses. At the other end, there are complex access control schemes that may involve interaction and negotiation with authentication services, billing services, agents, etc. Any reasonable architecture for networked information management must accommodate the full set of rights management possibilities.

One approach to this problem is executable rights management metadata. The metadata returned to a client could be an executable object, or a handle to an executable object using distributed object technology such as CORBA. Using this executable metadata, the client may present, obtain, or negotiate the proper certificates or authorization to access the content of the digital object. During this process, the executable metadata may contact other services that are necessary to obtain the certification or authorization.

Figure 6 illustrates a Distributed Active Relationship that manages the access rights to a resource. In this case, the rights management scheme is based on the notion of an access control list. Note the separation between the access control list in the package labeled P2 and the mechanism for the enforcement, which is in the external relation object. Also note that the relation object is a digital object in its own right, referenced via a global identifier, URN1 in the relationship catalog. The activation package in the figure stands for an executable component of the relation that would be invoked when a client accesses the content in the package labeled P1. The description package in the relationship container might be some textual description of the relationship. Section 6.1 describes one possible implementation of such a rights management mechanism.

An important component of this rights management scheme, and of the DAR concept in general, is that the executable aspect of the DAR is external to the resource being accessed and to the repository containing the resource. This level of modularization maximizes code reuse and extensibility. This means that not all contingencies and consequences need be anticipated before an object is released. Rather, a rights holder may add to or subtract from the metadata as circumstances change and new services become available. Section 6.1 describes a digital library repository architecture that implements this scheme for rights management.
 

Figure 6 - Distributed Active Relationship for Rights Management
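
The separation that Figure 6 describes, between the access control list held in P2 and the enforcement mechanism held in an external relation object, can be sketched as follows. This is an illustration under assumptions, not the mechanism of any actual system: the registry dictionary stands in for resolving URN1 over the network, and the function and class names are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple


def acl_enforcement(acl: Set[str], user: str) -> bool:
    """Activation code of the external relation object: a plain ACL check."""
    return user in acl


# Hypothetical stand-in for resolving a relation object by its identifier.
RELATION_REGISTRY: Dict[str, Callable[[Set[str], str], bool]] = {
    "URN1": acl_enforcement,
}


class Container:
    def __init__(self) -> None:
        # P1 holds the content; P2 holds only the policy data (the ACL).
        self.packages: Dict[str, object] = {
            "P1": b"protected content",
            "P2": {"alice", "bob"},
        }
        # Catalog entry: (relation identifier, protected package, policy package)
        self.catalog: List[Tuple[str, str, str]] = [("URN1", "P1", "P2")]

    def access(self, package: str, user: str) -> object:
        """Run every relation that guards the package before releasing it."""
        for relation_id, protected, policy in self.catalog:
            if protected != package:
                continue
            activate = RELATION_REGISTRY[relation_id]
            if not activate(self.packages[policy], user):
                raise PermissionError(f"{user} may not access {package}")
        return self.packages[package]


container = Container()
print(container.access("P1", "alice"))
```

Because the container holds only the policy data and an identifier, a rights holder could point the catalog at a different relation object without touching P1 or P2.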

Another consequence of the DAR model is that metadata packages can be virtual or dynamic [LAG]. That is, the package data may only exist as the result of a computation on some other resource. For example, we might state that both MARC and Dublin Core descriptions of a resource are available. The Dublin Core description could be computed on-demand from the MARC description. Active relationships can capture the dependency of the virtual Dublin Core package on the MARC package. This is similar in concept to, and could be applied to, the notion of "Just-in-time Conversion" addressed in [PW]. For this purpose, a single underlying format, such as a scanned image, could be associated with several different DARs that can convert the object on demand to a variety of formats such as JPEG, GIF, or OCR-ed text.
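
A virtual package of this kind might be sketched as below. The sketch is illustrative only: the three-entry crosswalk from MARC tags to Dublin Core elements is heavily simplified, the record values are invented, and the class and function names are hypothetical.

```python
from typing import Callable, Dict

# Simplified, illustrative crosswalk; a real MARC to Dublin Core mapping
# works at the subfield level and covers many more fields.
MARC_TO_DC: Dict[str, str] = {
    "245": "title",
    "100": "creator",
    "260": "publisher",
}

Record = Dict[str, str]


def marc_to_dublin_core(marc: Record) -> Record:
    """Derive a Dublin Core description from the mapped MARC fields."""
    return {MARC_TO_DC[tag]: value
            for tag, value in marc.items() if tag in MARC_TO_DC}


class VirtualPackage:
    """A package whose content exists only as the result of a computation."""

    def __init__(self, source: Record,
                 derive: Callable[[Record], Record]) -> None:
        self._source = source
        self._derive = derive

    def content(self) -> Record:
        # Computed on every request, so it always reflects the source.
        return self._derive(self._source)


# Invented record, for illustration only.
marc_package: Record = {"245": "An Example Title", "100": "Author, A.", "650": "Metadata"}
dublin_core_package = VirtualPackage(marc_package, marc_to_dublin_core)
print(dublin_core_package.content())
```

The dependency of the Dublin Core package on the MARC package is explicit in the object itself: a change to the MARC record is visible the next time the Dublin Core content is requested.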

While the DAR model is intriguing, there are three problem areas that must be addressed in practical implementations.
 

  1. Efficiency: Downloading code to implement all relationships is a nightmare from the standpoint of efficiency.
    1. We assume that actual implementations of DARs will use techniques such as hard-coding often-used relationships into repositories or caching them as means of improving efficiency. Only in rare cases will a novel relationship be downloaded and run.
  2. Security: Security considerations are another reason for avoiding the execution of unknown code.
    1. Just as Java applets are restricted in the actions they can perform, any reasonable implementation of DARs would place restrictions on the capabilities of the downloaded relationships. They might only load relations that come from a small number of known sources, and not allow those relations a great deal of access to the system.
  3. Semantics: One of the hardest conceptual questions is how to determine the semantics of a novel relationship. Even if we can determine the name of an executable relationship and the types of its arguments and return value, how do we know if the relation is important to us?
    1. In a purely machine interpretable sense, we cannot. At some point, we have to rely on a human -- a system designer, a user, or an intermediary such as a librarian -- to determine meaning and importance. Using inheritance hierarchies for the relationship types is one way that a system designer can help a system to deal with novel relationships in a useful manner. This approach is discussed further in section 6.2.
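
The first two mitigations can be combined in a single loading step, sketched below. This is an illustration under assumptions rather than a description of any implementation: the trusted host, the download stub, and the RelationLoader class are all hypothetical, and the sketch does not attempt to restrict what a loaded relation may do once it runs.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

Relation = Callable[..., object]

# Hypothetical allowlist of hosts from which relations may be loaded.
TRUSTED_HOSTS = {"relations.example.org"}


class RelationLoader:
    def __init__(self, builtin: Dict[str, Relation],
                 download: Callable[[str], Relation]) -> None:
        self._builtin = dict(builtin)  # hard-coded, often-used relations
        self._cache: Dict[str, Relation] = {}
        self._download = download
        self.downloads = 0  # counts actual network fetches

    def load(self, uri: str) -> Relation:
        if uri in self._builtin:       # efficiency: nothing to download
            return self._builtin[uri]
        if uri in self._cache:         # efficiency: reuse an earlier download
            return self._cache[uri]
        if urlparse(uri).hostname not in TRUSTED_HOSTS:  # security
            raise PermissionError(f"untrusted relation source: {uri}")
        self.downloads += 1
        relation = self._download(uri)
        self._cache[uri] = relation
        return relation


def fake_download(uri: str) -> Relation:
    """Stand-in for fetching executable relation code over the network."""
    return lambda *args: f"ran {uri}"


SIGNATURE = "http://relations.example.org/digital-signature"
NOVEL = "http://relations.example.org/novel-relation"

loader = RelationLoader({SIGNATURE: lambda *args: "built-in check"}, fake_download)
loader.load(SIGNATURE)
loader.load(NOVEL)
loader.load(NOVEL)
print(loader.downloads)  # only the first request for NOVEL reaches the network
```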

6 Implementing the Framework: RDF and FEDORA

The DAR model provides a very rich conceptual structure for metadata systems, but it is important to consider how it can be reduced to practice. This section of the paper discusses two approaches to implementing the framework. The first is FEDORA (Flexible and Extensible Digital Object Repository Architecture), a CORBA implementation of a digital library repository system based on DARs. The second approach maps the DAR model onto the facilities provided by the Resource Description Framework (RDF) being specified by the World Wide Web Consortium (W3C).

6.1 FEDORA

The Digital Library Research Group at Cornell University has a primary interest in promoting open, distributed digital libraries through the development of interoperable services and protocols that can form the basis of a larger digital library infrastructure. A core component of this research is a DAR-based repository architecture called FEDORA (Flexible and Extensible Digital Object Repository Architecture) [DL]. FEDORA is an interoperable, distributed repository service that can serve as a fundamental component in an open digital library infrastructure. It is intended to operate with other service modules that support searching, information discovery, name resolution, and rights management. Combining concepts from the Kahn/Wilensky Framework [KWF] and the Warwick Framework, FEDORA uses the abstraction of a "container" or a "wrapper" to aggregate distinct packages of data that can be either local or remote. Once assembled, these individual packages become part of a "Digital Object" to which one can attach behaviors and rights enforcement mechanisms. Distributed Active Relationships are the basis for implementing FEDORA components called "Interfaces" and "Enforcers" which are linked to Digital Objects. Interfaces define relationships and behaviors, and are attached to Digital Objects to enable them to produce various outputs (or "disseminations") of their content packages. Enforcers are a special type of Interface that protects the intellectual content in a Digital Object.

The FEDORA architecture is designed to enable interoperability by three means: (1) supporting the aggregation of heterogeneous, distributed content, (2) providing a means for attaching extensible behaviors to a digital object, and (3) providing a mechanism for associating externally-supplied rights enforcement mechanisms with the digital object to protect intellectual content.

Aggregation of heterogeneous content

The ability to handle disparate digital resources and accommodate emerging and future forms of digital content is a key requirement of any digital library repository service that strives to cross institutional boundaries and to stand the test of time. The FEDORA architecture allows for the integration of any type of digital content through the creation of wrapper objects called "Digital Objects." Digital Objects provide a level of core interoperability above the level of the individual content packages. While an individual piece of content, such as an HTML document or a MARC record, can have a life outside the scope of FEDORA, that same piece of content can attain the benefits of interoperability when it is housed in a FEDORA Digital Object container. In FEDORA, content packages become opaque, MIME-typed byte streams (called "Datastreams") that are accessed using the default behaviors of a Digital Object. Digital Object content can also be disseminated through Interfaces, which extend the default behavior of the object. Thus, the FEDORA digital object architecture unifies heterogeneous content, making it accessible through a consistent set of methods that invoke the default and extended behaviors of the object, without exposing the underlying structure of the content packages. Figure 7 shows a simple Digital Object with four MIME-typed content packages which illustrates the co-existence of heterogeneous content. An intelligible entity in its own right, the Digital Object is composed of multifarious data, including specialized bibliographic records, pure document-like content, access-control data, and a remote executable piece of software. These Datastreams are not distinguishable as either metadata or data, except through the relationships and behaviors defined by the Interfaces that are attached to a Digital Object.

Behavioral evolution

The modular nature of Digital Objects also allows for the graceful addition and accommodation of new content types and new behaviors over time. Interfaces provide the "public view" of a digital object, and provide different ways of accessing the content, without exposing the underlying structure of the object. From a client perspective, a Digital Object simply announces a list of supported operations or disseminations. These can be invoked irrespective of the physical form in which content is maintained in the object. Interfaces can be attached to a Digital Object to produce any possible computable derivation of its base content.

Essentially, FEDORA Digital Objects are designed to avoid functional obsolescence through this distinction between the internal form of digital content, and the disseminated form. While the raw content of an object will persist over time, the behaviors of the object (and the requests users can make upon the objects) will change. So, not only can Digital Objects evolve by incorporating new content forms, but also they can exhibit new behaviors through the ability to "plug in" new Interfaces. For instance, today an object may assert its ability to produce a Dublin Core record and a Postscript version of its content. Later, through the unplugging of old or obsolete interfaces, and the addition of newly developed interfaces, the same base object may have a different set of behaviors. For example, when accessed, it may announce that it can now disseminate a Dublin Core record, an XML-wrapped document, and a newly developed high-compression image format of the content.

As previously mentioned, Interfaces are one way in which FEDORA uses the abstraction of DARs. An Interface embodies all the requisite relationship information expressed in the semantics of a DAR. It endows a Digital Object with the ability to disseminate content by specifying which Datastreams are related, the nature of the relationships, and the operational semantics of the relationships. In Figure 7, for example, there is a Distributed Active Relationship that returns a MARC record describing the content in DS4, and one that returns a Dublin Core record that is computed on-the-fly from the MARC record. Any number of Interfaces can be linked to a Digital Object enabling it to perform specialized operations. Without knowing any of the structural details of a digital object, a client could discover that the object will produce a number of views of itself, such as: a watermarked image of an identified graphic; a particular page of document content in Postscript format; or a visualization of a dataset.
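
The way a Digital Object announces and evolves its behaviors can be sketched in a few classes. The sketch is illustrative only and does not reproduce FEDORA's CORBA interfaces: every class and method name below is hypothetical, and the "Text" Interface is an invented stand-in for a newly developed behavior.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Datastream:
    """An opaque, MIME-typed byte stream."""
    mime_type: str
    data: bytes


Dissemination = Callable[[Dict[str, Datastream]], bytes]


@dataclass
class Interface:
    """A named set of disseminations computed from an object's Datastreams."""
    name: str
    methods: Dict[str, Dissemination]


class DigitalObject:
    def __init__(self, datastreams: Dict[str, Datastream]) -> None:
        self._datastreams = dict(datastreams)  # never exposed to clients
        self._interfaces: Dict[str, Interface] = {}

    def attach(self, interface: Interface) -> None:
        self._interfaces[interface.name] = interface

    def detach(self, name: str) -> None:
        del self._interfaces[name]

    def list_disseminations(self) -> List[str]:
        """What a client sees: supported operations, not internal structure."""
        return sorted(
            f"{iface.name}.{method}"
            for iface in self._interfaces.values()
            for method in iface.methods
        )

    def disseminate(self, interface: str, method: str) -> bytes:
        return self._interfaces[interface].methods[method](self._datastreams)


obj = DigitalObject({"DS4": Datastream("application/postscript", b"%!PS example")})
obj.attach(Interface("Postscript", {"getContent": lambda ds: ds["DS4"].data}))
print(obj.list_disseminations())

# Later: unplug the old Interface and plug in a new one over the same content.
obj.detach("Postscript")
obj.attach(Interface("Text", {"getLength": lambda ds: str(len(ds["DS4"].data)).encode()}))
print(obj.list_disseminations())
```

The Datastream in DS4 is unchanged throughout; only the set of announced disseminations differs before and after the Interfaces are swapped.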

Adaptable, extensible security enforcement

As indicated in the Kahn/Wilensky framework, digital object repositories must permit mechanisms to protect the intellectual content they store and disseminate. Since Datastreams are opaque components of a Digital Object, rights management occurs at the level of the container abstraction. This can be thought of as a behavior-centric approach to rights management where security is applied to digital object behaviors instead of to the individual content entities themselves. By architecting Enforcer objects into the container level, FEDORA establishes a means for moderating the invocation of Digital Object operations. Similarly, each dissemination (as specified in Interfaces) is performed in the context of an Enforcer object.

To maximize interoperability and long-term viability of digital objects, FEDORA provides a modular, extensible rights management architecture that is not dependent on any particular security scheme. Just as behaviors can be added or removed from Digital Objects over time, the means of securing these behaviors can change and adapt as security applications mature. Enforcers are first-class objects that are stored persistently with each Interface linked to a Digital Object. As such, they can take advantage of enforcement mechanisms that live outside of FEDORA; Enforcers serve as the means of connecting FEDORA Digital Objects to any number of external rights management services.
 

Figure 7 - A FEDORA Digital Object

Figure 7 shows an example of a simple Enforcer that secures the Postscript behaviors of the depicted Digital Object. The relevant Datastream (DS4) is protected by an Enforcer that is wrapped around the Postscript Interface. Again, the Enforcer directly secures the behaviors of the object (e.g., getPage, getContent), thus indirectly protecting the content. Note also that the Enforcer implements the same rights management DAR described earlier (see Figure 6). It links an Access Control List (DS1) with an external enforcement engine (a remote Datastream pointing to URN1) and protects the intellectual content (DS4) by running the Enforcer every time the Postscript Interface methods are invoked. While the Access Control List is stored in the Digital Object as a distinct content package, the mechanism for executing the Enforcer can exist outside the Digital Object and, optionally, outside of the FEDORA repository.
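The behavior-centric arrangement described for Figure 7 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the FEDORA implementation: the Enforcer and PostscriptInterface classes, the invoke method, the AccessDenied exception, and the acl_engine function are hypothetical names, and the enforcement engine is a local function standing in for the external service at URN1. Only the method names getPage and getContent come from the text.

```python
from typing import Any, Callable, Dict, List, Set

# Hypothetical names; a sketch of the behavior-centric idea, not FEDORA code.
ACL = Dict[str, Set[str]]                 # user -> permitted methods (stands in for DS1)
Engine = Callable[[ACL, str, str], bool]  # stands in for the external engine at URN1


class AccessDenied(Exception):
    """Raised when the enforcement engine refuses an invocation."""


class PostscriptInterface:
    def __init__(self, pages: List[str]) -> None:
        self._pages = pages  # stands in for the protected content (DS4)

    def getPage(self, number: int) -> str:
        return self._pages[number]

    def getContent(self) -> str:
        return "\n".join(self._pages)


class Enforcer:
    """Wraps an Interface so every method invocation is checked first."""

    def __init__(self, interface: Any, acl: ACL, engine: Engine) -> None:
        self._interface = interface
        self._acl = acl
        self._engine = engine

    def invoke(self, user: str, method: str, *args: Any) -> Any:
        # The check guards the behavior; the content is protected indirectly.
        if not self._engine(self._acl, user, method):
            raise AccessDenied(f"{user} may not call {method}")
        return getattr(self._interface, method)(*args)


def acl_engine(acl: ACL, user: str, method: str) -> bool:
    # Simplest possible policy: the method must be listed for the user.
    return method in acl.get(user, set())


secured = Enforcer(
    PostscriptInterface(["%!PS page 1", "%!PS page 2"]),
    acl={"alice": {"getPage"}},
    engine=acl_engine,
)
```

Because the engine is passed in as a parameter, the same Enforcer could delegate to a different rights management service without any change to the Interface or to the stored content, which is the modularity the architecture aims for.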

FEDORA is currently being implemented and will be tested in the context of a reference implementation that includes other key services (e.g., searching and name resolution) of an interoperable, distributed digital library architecture.

6.2 Implementing DARs in the Resource Description Framework

The Resource Description Framework (RDF) is being specified by the World Wide Web Consortium (W3C) to provide a unified foundation for several metadata projects. The design was strongly influenced by the Warwick Framework, one indication of which is the statement in a W3C press release that "RDF will allow different application communities to define the metadata property set that best serves the needs of each community." [W3R] A key distinction is that RDF does not enforce the Warwick Framework notion that packages are complete in their own right, nor does it easily allow them to be expressed in a community-specific syntax. Instead, RDF allows a much freer mixing and matching of elements from different schemas, and requires the use of XML [XML] as its syntax.

RDF has four components: the modeling facility, the serialization syntax, schema definitions, and rule definitions. Currently, a public draft for the modeling facility and syntax has been released [RDF], and the schema working group has begun its deliberations. The model and syntax draft will be revised in the near future to add a typing mechanism similar to that of modern object-oriented programming languages once the interactions between typing and schemas have been specified.

Similar to the approach discussed in section 4, RDF models are directed graphs. Nodes represent web resources, arcs state that certain properties (such as "Author") are associated with a node, and arcs terminate either at a node or at a string. As an example, Figure 8 shows a model for some simple Dublin Core bibliographic information associated with a web page.
 

Figure 8 - Simple RDF Model for Dublin Core Elements

Listing 3 shows the serialized version of that model.

        <?namespace href="http://www.purl.org/Metadata/DublinCore/" as="DC"?>
        <?namespace href="http://www.w3.org/Schemas/RDF/" as="RDF"?>
 
        <RDF:Serialization>
        <RDF:Assertions href="http://www.acl.lanl.gov/~rdaniel/">
            <DC:Creator>Ron Daniel Jr.</DC:Creator>
            <DC:Publisher>Los Alamos National Laboratory</DC:Publisher>
        </RDF:Assertions>
        </RDF:Serialization>

Listing 3 - Serialized form of Figure 8

One of the key features of RDF is its pervasive use of URIs. The namespace declarations in Listing 3 provide one indication of this. Tag names like DC:Creator expand to a 2-tuple composed of a URI, such as http://www.purl.org/Metadata/DublinCore, and the identifier "Creator". This gives us scoped names, preventing confusion between differing definitions of terms like "Title". (Legal title is not the same as royal title, which in turn is different from the title of a book.) Using URIs for the terms in a namespace also allows namespace definitions to be fetched from the network.

In order to implement DARs in RDF, we extend the namespace definition slightly by allowing scoped tag names to expand to a URI such as http://www.purl.org/Metadata/DublinCore/Creator. (The XML namespace mechanism is only now being specified [XML2] and neither blesses nor precludes this extension.) With this extension, the arcs in RDF correspond to DARs. For example, the DC:Creator arc in Figure 8 can be expressed as a DAR through the 3-tuple scheme shown in Listing 4.

    (http://www.purl.org/Metadata/DublinCore/Creator  - the arc type
     http://www.acl.lanl.gov/~rdaniel/                - the source of the arc
     "Ron Daniel Jr.")                                - the dest. of the arc

Listing 4 - RDF Relationship Expressed as a DAR

Thus, RDF seems to provide the facilities needed to construct an active metadata system.
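The expansion of scoped tag names into full URIs, and the resulting 3-tuples, can be sketched as follows. This is a minimal Python illustration, not an RDF parser: the function names expand and to_dars and the in-memory form of the assertions are assumptions, while the URIs and values are taken from Listings 3 and 4.

```python
from typing import Dict, List, Tuple

# (arc type, source of the arc, destination of the arc), as in Listing 4
Triple = Tuple[str, str, str]


def expand(tag: str, namespaces: Dict[str, str]) -> str:
    """Expand a scoped tag name such as 'DC:Creator' into a single URI."""
    prefix, local_name = tag.split(":", 1)
    return namespaces[prefix] + local_name


def to_dars(
    source: str,
    properties: List[Tuple[str, str]],
    namespaces: Dict[str, str],
) -> List[Triple]:
    """Turn each (tag, value) assertion about a resource into a DAR 3-tuple."""
    return [(expand(tag, namespaces), source, value) for tag, value in properties]


# Namespace declaration and assertions taken from Listing 3.
NAMESPACES = {"DC": "http://www.purl.org/Metadata/DublinCore/"}

dars = to_dars(
    "http://www.acl.lanl.gov/~rdaniel/",
    [
        ("DC:Creator", "Ron Daniel Jr."),
        ("DC:Publisher", "Los Alamos National Laboratory"),
    ],
    NAMESPACES,
)
```

The first element of dars is the tuple given in Listing 4. Since the arc type is itself a URI, it can be retrieved like any other resource, which is what allows a relationship to carry a definition or executable content.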

Los Alamos National Laboratory is currently prototyping the use of RDF and DARs for a large scientific data management system. One of the issues being considered at this time is how to handle executable relations efficiently. Assume we have a repository similar to that of FEDORA, and that we wish to implement Enforcers and Interfaces. We can pick a particular form of executable content (such as Java class files) to support in our system. Determining the meaning of an executable relationship and deciding whether to run it remains a problem. As mentioned earlier, blindly executing all relationships would be foolish due to performance and security considerations. We can use RDF's typing system to indicate that particular relations are subclasses of known relationships such as "Enforcer" or "Interface". The security manager of our repository could then examine the type of each DAR and execute only those relations that are subclasses of known, pre-approved types. The type also gives us some indication of the meaning of the relation.
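The type check performed by such a security manager can be sketched as follows. This is a hedged Python illustration rather than a description of the prototype: the type names other than "Enforcer" and "Interface", the dictionary encoding of the subclass hierarchy, and the function names is_approved and execute_if_approved are all hypothetical, and a callable stands in for executable content such as a Java class file.

```python
from typing import Any, Callable, Dict, Optional, Set

# Hypothetical type hierarchy: each relation type maps to its parent type.
SUBCLASS_OF: Dict[str, str] = {
    "PostscriptEnforcer": "Enforcer",
    "ThumbnailInterface": "Interface",
    "UnknownRelation": "Relation",
}

# Types the repository has approved in advance.
APPROVED: Set[str] = {"Enforcer", "Interface"}


def is_approved(
    rel_type: str, subclass_of: Dict[str, str], approved: Set[str]
) -> bool:
    """Walk up the hierarchy; approve if any ancestor is a known type."""
    seen: Set[str] = set()
    current: Optional[str] = rel_type
    while current is not None and current not in seen:
        if current in approved:
            return True
        seen.add(current)  # guards against cycles in a malformed hierarchy
        current = subclass_of.get(current)
    return False


def execute_if_approved(
    rel_type: str,
    code: Callable[[], Any],
    subclass_of: Dict[str, str] = SUBCLASS_OF,
    approved: Set[str] = APPROVED,
) -> Optional[Any]:
    """Run the relation's executable content only when its type is approved."""
    if not is_approved(rel_type, subclass_of, approved):
        return None  # never executed: unknown meaning, unknown risk
    return code()
```

In this sketch, a relation typed "PostscriptEnforcer" is executed because it descends from "Enforcer", while one typed "UnknownRelation" is skipped without its code ever being run. A real repository would also need to verify who asserted the type, a question this sketch does not address.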

7 Conclusion

This paper has presented Distributed Active Relationships, a general framework for dealing with metadata issues in digital libraries and other information systems. DARs originated as an extension of the Warwick Framework, an extension that was motivated by several conclusions we reached when considering the definition of metadata as "data about data". By treating metadata as data, rather than giving it a special distinguished role, we allow arbitrary resources to be associated by arbitrary relationships. The relationships are also data, and are identified and retrieved like any other resource. A particularly interesting consequence of this is that a relationship can be defined through executable code; thus, DARs can form the foundation for systems that use dynamic, interpretable data and metadata, as well as more conventional notions of static data and metadata.

This foundation has proven very useful in the design of FEDORA, where it allowed a graceful and promising integration of such divergent notions as the Kahn/Wilensky Digital Library architecture and downloadable code (e.g., Java applets). We are particularly interested in the capabilities of the new Resource Description Framework to facilitate the construction of systems based on DARs. If RDF proves successful in prototypes, it could have an enormous impact on the design of metadata systems.

8 Acknowledgements

This paper is an extension of an earlier paper published in the November issue of D-Lib Magazine (www.dlib.org). Work on this paper was partially funded by the Department of Energy under Contract No. W-7405-ENG-36 and by the Defense Advanced Research Projects Agency under Grant No. MDA 972-96-1-0006 with the Corporation for National Research Initiatives. This paper does not necessarily represent the views of DOE, LANL, CNRI, or DARPA. Thanks to Cliff Lynch, Mic Bowman, Terry Allen, Michael Mealling, and Michelle Baldonado for their discussions on the topics of this paper.

References

[WF1] Lagoze, Carl, Clifford A. Lynch, and Ron Daniel Jr., "The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata", Cornell Computer Science Technical Report TR96-1593, July 1996, http://cs-tr.cs.cornell.edu:80/Dienst/UI/2.0/Describe/ncstrl.cornell/TR96-1593.

[Extreme] Computing and Communications in the Extreme: Research for Crisis Management and Other Applications; National Academy Press; Washington, D.C., 1996.

[WW] Metadata Workshop II, http://www.oclc.org:5046/oclc/research/conferences/metadata2/

[DC] Dublin Core Metadata Element Set Resource Page, http://purl.oclc.org/metadata/dublin_core/

[ARMS] Arms, William Y., "Key Concepts in the Architecture of a Digital Library", D-Lib Magazine, July 1995, http://www.dlib.org/dlib/July95/07arms.html

[GLAD] Gladney, H.M. and J.B. Lotspiech, "Safeguarding Digital Library Contents and Users: Assuring Convenient Security and Data Quality", D-Lib Magazine, May 1997, http://www.dlib.org/dlib/may97/ibm/05gladney.html

[LAG] Lagoze, Carl, "From Static to Dynamic Surrogates: Resource Discovery in the Digital Age", D-Lib Magazine, June 1997, http://www.dlib.org/dlib/june97/06lagoze.html

[PW] Price-Wilkin, John, "Just-in-time Conversion, Just-in-case Collections: Effectively leveraging rich document formats for the WWW", D-Lib Magazine, May 1997, http://www.dlib.org/dlib/may97/michigan/05pricewilkin.html

[DL] Daniel Jr., Ron and Carl Lagoze, "Distributed Active Relationships in the Warwick Framework", Proceedings of the 1997 IEEE Metadata Conference, September, 1997, http://computer.org/conferen/proceed/meta97/papers/rdaniel/rdaniel.pdf

[KWF] Kahn, Robert and Robert Wilensky, "A Framework for Distributed Digital Object Services", Corporation for National Research Initiatives, http://www.cnri.reston.va.us/cstr/arch/k-w.html

[W3R] Press Release, W3C announces RDF, http://www.w3.org/Press/RDF

[XML] Extensible Markup Language (XML), World Wide Web Consortium, http://www.w3.org/XML/

[RDF] Lassila, Ora and Ralph R. Swick, "Resource Description Framework (RDF) Model and Syntax", World Wide Web Consortium, http://www.w3.org/TR/WD-rdf-syntax/

[XML2] Bray, Tim and Dave Hollander and Andrew Layman (eds.), "Name Spaces in XML", W3C XML Working Group White Paper 15-October-1997, http://www.textuality.com/xml/xml-names.html