Copyright 1996 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Metadata to Support Data Quality and Longevity

Jeff Rothenberg

RAND; 1700 Main Street; Santa Monica, CA 90407

Internet: jeff@rand.org

1. OVERVIEW

There are many kinds of metadata that can be associated with a database, including metadata to improve or restrict access to data, to facilitate sharing and interoperability, to characterize and index data, etc. This paper discusses two key needs for metadata: to support data quality and to ensure the longevity of data.

Section 2 below argues that data should be viewed as modeling reality; it follows from this that data quality can be defined as a measure of the suitability of data for its intended purpose (or range of purposes). Section 3 discusses the first key need for metadata--to evaluate and improve data quality. Section 4 outlines a strategy for assessing and improving data quality by means of two equally important, parallel approaches: performing explicit evaluation of data and establishing organizational control over the processes that generate and modify data. The core of this strategy is the definition and maintenance of appropriate data quality metadata. Section 5 proposes a framework for such metadata and discusses metadata issues that arise in connection with data quality.

A second key need for metadata is to improve the effective longevity of digital data by ensuring their future accessibility and readability. As discussed in Section 6, the longevity of digital information is constantly threatened by the combined assault of limited media life and the inexorably rapid evolution (and consequent obsolescence) of the software and supporting hardware systems needed to access and interpret digital data. These factors conspire to limit the effective lifetime of digital records despite the fact that they can be copied perfectly; this has prompted my ironic contention that digital information lasts forever--or five years, whichever comes first [11].

Media longevity issues can be addressed by copying (or "migrating") digital data periodically to new, fresh media; but digital records depend on obsolete software to interpret obsolete formats, codes and file structures, which presents a harder problem. Sections 7 and 8 propose a solution based on the use of encapsulation along with metadata to make the encapsulated information intelligible.

Section 9 summarizes the issues surrounding the use of metadata for both of these key purposes.

2. DATA AS MODEL

Data can best be thought of as a model, i.e., the result of attempting to describe the real world (Figure 1).

(Figure 1: Data as Model)

Any such description of reality is always an abstraction, always partial, always just one of many possible "views" of reality. (This discussion will remain independent of the specific representations chosen for these abstractions.)

Any given real-world entity, process, or phenomenon can be modeled by many different data views, depending on one's purpose. For example, the speed of a ship might be represented by one of several symbolic values (e.g., slow, medium, or fast), by a single numerical value (e.g., 20), by a table of numerical values representing speeds under different sea and load conditions, by a set of parameters to an arbitrarily complex function of such conditions, etc. In each case, a single aspect of reality (the speed of the real ship) is represented by one or more data values. The choice of what kind of data to use to model reality is made prior to generating the data, just as the choice of how to build any kind of model is necessarily made before beginning to build such a model. The appropriateness of this choice (e.g., which of the above ways to represent the speed of a ship) is a crucial aspect of the quality of the resulting data, regardless of what data values are ultimately recorded. Note that this modeling phase includes formal "data modeling" as well as any prior conceptual modeling, in which abstract choices are made about how to view (i.e., model) the world.
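As a minimal sketch of this point, the following Python fragment models the speed of one ship in three of the ways listed above. The names (SpeedClass, speed_table, classify), the units, and all numerical values are hypothetical illustrations rather than values from any real database.

```python
from enum import Enum


class SpeedClass(Enum):
    """Symbolic view: one of a few coarse categories."""
    SLOW = "slow"
    MEDIUM = "medium"
    FAST = "fast"


# View 1: a single symbolic value.
symbolic_speed = SpeedClass.MEDIUM

# View 2: a single numerical value (the unit, knots, is an assumption).
numeric_speed = 20.0

# View 3: a table of speeds under different sea and load conditions.
# Keys are (sea_state, load) pairs; every entry is illustrative only.
speed_table = {
    ("calm", "light"): 24.0,
    ("calm", "heavy"): 21.0,
    ("rough", "light"): 18.0,
    ("rough", "heavy"): 15.0,
}


def classify(speed, slow_below=12.0, fast_from=22.0):
    """Collapse a numerical speed into the symbolic view.

    The thresholds are hypothetical; choosing them is itself a
    modeling decision that affects the quality of the resulting data.
    """
    if speed < slow_below:
        return SpeedClass.SLOW
    if speed >= fast_from:
        return SpeedClass.FAST
    return SpeedClass.MEDIUM
```

All three views describe the same real ship; which one is appropriate depends on the purpose the data are meant to serve, and moving from a richer view to a coarser one (as classify does) discards information that cannot be recovered afterward.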

3. EVALUATING DATA QUALITY IN ORDER TO IMPROVE IT

Ensuring data quality (specifically for use in modeling and simulation studies) is a major concern of the Defense Modeling and Simulation Office (DMSO) [2]. Model and simulation builders and users typically devote considerable effort to collecting suitable, high quality data [1]. A comprehensive approach to evaluating and improving data quality, derived from research sponsored by DMSO, is discussed in detail in Rothenberg [9]. One of the key tenets of this approach is that data quality is not a binary attribute. It is not enough to declare that a data set is (or is not) of high quality; it is necessary to evaluate data in the context of each specific intended use and to feed the results of these evaluations back to improve data quality. This implies that multiple evaluations must be supported, and that the results of each evaluation must be recorded for use by future data users and by data developers and maintainers. Sections 4 and 5 elaborate these ideas and present a metadata structure to support a wide range of data quality evaluation and improvement activities.

Data quality has two distinct aspects, one involving the objective "correctness" of data (such as accuracy and consistency), the other involving the appropriateness of data for some intended purpose [5,7,10]. Data producers and users--as well as Quality Assurance ("QA") and Total Quality Management ("TQM") proponents--generally assume that the purpose of data quality assurance is to provide the best data possible [3,4]. But this obscures the need to evaluate data: The implication is that if a data set is the best available and is as good as it can be made, then there is no other option but to use it, in which case there is no point in worrying about just how good it is. The flaw in this is that merely saying that a data set is as good as it can be made does not tell us how good it is or whether it is any good at all: without explicit evaluation, there may well be an attractive alternative to using it, namely, to avoid using it. Furthermore, most databases are fairly general purpose, which means that assuring their quality in advance of using them can only address the objective aspects of their quality. The appropriateness of using a database for some purpose cannot even be defined--let alone evaluated--until that purpose is specified. For these reasons, it is important to focus on the evaluation and assessment of data quality, in addition to its improvement.

The process of evaluating and assessing data is often referred to as "Verification, Validation and Certification" (VV&C). In brief, Data Verification, Validation, and Certification are those activities that are done to ensure that data are correct and appropriate for their intended use. Verification is generally intended to ensure that data values are consistent with each other and accord with any specifications and requirements applicable to some general or specific intended use. Validation is intended to ensure that data values accord with what they represent and are appropriate for some specific or general intended use. Certification establishes that some appropriate authority has determined that data are correct and appropriate within margins that carry a risk acceptable to the certifying authority for some intended use.

It is also crucial to recognize that data users do not entirely control the quality of the data they use. When a data set is obtained from a data source or intermediate data center, the objective aspects of its quality (accuracy, consistency, etc.) may already have been determined. Though a user may perform verification checks on a database (e.g., to verify referential integrity), users do not often have the resources to perform comprehensive validation checks, which require comparing data values with the real world or with other, independently-derived values that are known to be correct or are considered to be "best-estimate" values. Users must therefore rely on data producers to perform such checks and to document their results as measures of the quality of their data. Moreover, if users are to have confidence in the stated quality of a database, producers must document more than just the results of their quality checks: they must also provide sufficient documentation about how they performed these checks to allow users to understand what they have done and evaluate their conclusions.
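The distinction between these two kinds of check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the record layout, the tolerance, and the sample data are all hypothetical, and real verification and validation procedures would be far more extensive.

```python
def verify_referential_integrity(rows, key_field, valid_keys):
    """Verification: return rows whose foreign key has no matching parent."""
    return [row for row in rows if row[key_field] not in valid_keys]


def validate_against_reference(rows, id_field, value_field, reference, tolerance):
    """Validation: return ids whose value differs from an independently
    derived reference value by more than the tolerance."""
    failures = []
    for row in rows:
        expected = reference.get(row[id_field])
        if expected is None:
            continue  # no reference value available; cannot validate
        if abs(row[value_field] - expected) > tolerance:
            failures.append(row[id_field])
    return failures


# Hypothetical example data.
ship_classes = {"C1", "C2"}
ships = [
    {"id": "S1", "class_id": "C1", "speed": 20.0},
    {"id": "S2", "class_id": "C9", "speed": 31.0},  # dangling class reference
    {"id": "S3", "class_id": "C2", "speed": 14.5},
]
best_estimates = {"S1": 20.5, "S2": 25.0}  # S3 has no reference value

orphans = verify_referential_integrity(ships, "class_id", ship_classes)
suspect = validate_against_reference(ships, "id", "speed", best_estimates, 1.0)
```

The first check needs nothing but the database itself, which is why a user can usually afford to run it. The second depends on the best_estimates table, i.e., on independent knowledge of the real world that a user often lacks; where no reference value exists (as for S3), the value simply cannot be validated.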

Furthermore, since data producers cannot evaluate the appropriateness of a database for an unknown user's purpose, they must provide sufficient documentation about their data to enable the user to perform this evaluation. This documentation must make explicit any implicit assumptions in the data or in the way the database was generated. Although such documentation may not by itself constitute part of a data quality evaluation, it is essential for performing such evaluations. In addition, whenever a database is evaluated--whether by its producer, by an intermediate data center, or by a user--the results of this evaluation must be documented with sufficient context to allow future users to benefit from this evaluation (to the extent that it is relevant to their future purpose) and to allow producers to improve their databases by utilizing detailed evaluations and analyses of their data, performed by intermediaries or data users.

Finally, it is vital to consider the ways that data may be transformed between initial production and final use. Various intermediaries (such as data centers) and users themselves may combine, aggregate, filter, edit and modify data from different sources in order to prepare a data set for a specific use. These transformation processes--along with the processes that generate data initially--all affect the quality of the resultant data. Therefore, in addition to recording information about data values themselves and evaluating the quality of these values, it is important to record information about the processes that affect data, both to ensure that data quality is not corrupted and to allow improving these processes. The organizations that own and perform processes that affect data must be funded and motivated to establish control over their processes to improve them, thereby improving the quality of the data they produce.

The next two sections present an overview of a comprehensive approach to evaluating, assessing, maintaining and improving data quality, based on the use of metadata that should be associated with every database.

4. VV&C PLUS PROCESS IMPROVEMENT

A comprehensive approach to data quality involves more than just "doing the best we can" to provide good data. It requires evaluating the quality of data values (performing VV&C) and evaluating the processes that generate and modify data. This in turn requires documenting databases to facilitate understanding their implicit assumptions and evaluating and assessing their quality, while documenting the performance of the processes that affect data to facilitate improving them. As shown in Figure 2, this can be viewed as two parallel activities: (1) performing explicit evaluation of data and (2) establishing organizational control over the processes that generate and modify data.

Figure 2 is not intended as a flowchart of the activity of an individual data producer or user; rather, it attempts to show the parallel processes that should be followed by a community of producers and users in order to improve the quality of their data. For example, the top left activity ("Identify intended/potential users & purposes") should be interpreted as an action to be undertaken by an entire data community (or, alternatively, by a group of users) prior to defining appropriate metadata requirements for use in data VV&C. Similarly, the top right activity ("Identify sources/owners of data & processes") is intended as an action to be undertaken by a data community (producers and users) to identify which organizations perform data-affecting processes on a required database. As indicated in the figure, the definition and maintenance of appropriate metadata is the key to both recording the specifications needed to perform V&V and recording a full history of the VV&C activities that have been performed on a database. Similarly, it is necessary to define and maintain appropriate metadata to support and record the results of process control.

(Figure 2: A Framework for Improving Data Quality)

The actual VV&C that must be performed on data should be divided into producer and user VV&C as discussed elsewhere [8,9,10]. Essentially, producers must warrant the objective accuracy of a database, whereas users must determine its appropriateness for their intended use. Producers and modifiers of databases must be given the resources and mandate to perform VV&C along with their other responsibilities. Users must be encouraged, motivated, and funded to perform VV&C as an integral part of their work. The processes that generate, modify, transform, and propagate data should be examined, controlled, and improved whenever possible, in order to improve the quality of the resulting data [9,10].

Implementing these parallel activities requires: (i) augmenting databases with metadata to record information needed to assess data quality, record the results of these assessments, and support process control of processes affecting data; (ii) performing explicit VV&C on data, using metadata to both direct this activity and record its results; and (iii) establishing control over processes that affect data to improve the quality of data transformations, again using metadata to both support this activity and record its results.

5. METADATA TO SUPPORT DATA QUALITY

Performing the activities presented above requires that databases be augmented with appropriate data quality metadata. Much of this metadata must be provided by the data producers or data centers that supply the data. This section outlines the kinds of data quality metadata that must be provided to support data quality evaluation and improvement (discussed in further detail in [9]). Note that many of these categories are contextual: they provide information necessary to evaluate and improve data quality, rather than measuring data quality themselves.

Although the potential amount of metadata required to support data quality evaluation and improvement may appear overwhelming, several factors should mitigate the effort required to generate and maintain quality metadata. First, many metadata values for related items in a data set will be identical, since the sources, histories and times of entry of many data values in a given data set will be shared. Data entry and maintenance tools can take advantage of this commonality, allowing data values to "inherit" many of their quality metadata values from those of related data items. In addition, it should be feasible to develop automated tools for data entry, database management, data transformation and data VV&C that create and capture much of the relevant data quality metadata automatically, as data values are generated, transformed or checked.
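One way such inheritance might work is sketched below in Python, using the standard library's ChainMap to fall back from item-level metadata to data-set-level metadata. The metadata field names and values are hypothetical and are not drawn from the tables that follow.

```python
from collections import ChainMap

# Hypothetical metadata recorded once for a whole data set.
dataset_metadata = {
    "source": "survey-A",
    "entered_on": "1996-01-15",
    "entered_by": "operator-7",
}

# Only the values that differ from the data set defaults are stored
# with an individual data item.
item_overrides = {
    "speed_S2": {"source": "estimate", "note": "interpolated"},
}


def quality_metadata(item_id):
    """Return the effective metadata for one data item.

    ChainMap looks in the item's own overrides first and falls back
    to the data set level, so shared values are stored only once.
    """
    return ChainMap(item_overrides.get(item_id, {}), dataset_metadata)
```

Here quality_metadata("speed_S1") yields the data set's values unchanged, while quality_metadata("speed_S2") reports its own source and note but still inherits the entry date and operator. The storage cost of quality metadata then grows with the number of exceptions rather than with the number of data values.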

Tables 1 through 3 list categories of metadata required to support data quality evaluation and improvement. These are shown at three distinct levels: the database level (Table 1), the data-element (or data dictionary) level (Table 2), and the data value (or instance data) level (Table 3). All of the categories shown are necessary for either evaluating or recording data quality; those used to report data quality results per se (and to provide a minimal context for understanding those results) constitute a data quality profile, and are marked with "*" to show where the quality profile fits in the universe of quality metadata. The boundary between the profile and the larger universe of quality metadata is not a rigid one: the quality profile is simply a view (in the database sense) into this universe, which can be extended as required.

Many of the quality metadata items described here are qualitative in nature (e.g., textual): it should not be surprising that the evaluation of quality is necessarily somewhat qualitative. In addition, many of the quality attributes shown here apply to metadata as well as data: that is, there may need to be metadata describing the quality of the metadata in a database.

It would clearly be advantageous to propagate and transform metadata as necessary when deriving new databases from existing ones, to allow "leveraging" metadata for multiple uses; this possibility deserves further research.

Although the cost of developing and enforcing data quality procedures and the collection and maintenance of data quality metadata may be substantial, this cost must be borne if we are to reduce the risk of using data of unknown quality.


Table 1: Database Level Metadata


Table 2: Data-element Level Metadata


Table 3: Data-value Level Metadata

6. THE ASSAULT ON THE LONGEVITY OF DIGITAL DATA

The longevity of digital information is threatened by the combined assault of limited media life and the rapid obsolescence of the software and supporting hardware systems needed to access and interpret digital data. These factors limit the effective lifetime of digital records despite the fact that they can be copied perfectly. Though media longevity can (and must) ultimately be solved by periodically copying data to new, fresh media, the dependence of digital records on obsolete software requires less obvious measures. As discussed in [11], digital records may only be accessible and understandable when viewed using the software that created them--or functionally identical software. The increasing use of graphics, hypertext, linked structures, and multimedia makes records increasingly dependent on specific software for interpretation: a "data file" may be nearly useless without the software to interpret the structure and meaning of the file. Traditional documents and records can often be thought of as having linear content that is relatively independent of their structure, but this is rapidly becoming untrue of modern documents and records. While it may be possible to decipher simple text or numeric formats by using "brute force" methods, this will just not work for more complex file structures, in which there may be no inherent way to know whether to interpret a given sequence of bits as text, number, pointer, image, sound, video, program, or new formats yet to be invented.

It may be tempting to imagine that future software will be able to read old records and recreate the ways they were intended to be seen, but this is unrealistic. Software paradigms change every few years, and record-keeping paradigms themselves are evolving in response (see Michelson and Rothenberg [6]). Nor can we expect to reproduce the precise behavior of obsolete software in the future by interpreting a saved description of that software--the only description of software that can capture its functional richness is the software itself. In general, viewing obsolete records as they were intended to be viewed requires running the actual software that created them.

Standards may seem to alleviate this problem, but they necessarily lag behind current practice, and (as the saying goes) the best thing about standards is that there are so many different ones to choose from. Furthermore, it is naive to think we can standardize data representation or presentation formats when our paradigms are still evolving so fast. Although data can be translated into new forms as standards change, this introduces the risk of corruption as each translation is performed; since it is not always possible to provide backward translation without loss, the original version of a record is often impossible to reconstruct once it is translated. The only general, long-term solution to this problem would appear to be a strategy that makes records self-contained and self-explanatory. The next section proposes such a strategy, based on the use of encapsulation along with metadata to make the encapsulated information intelligible.

7. ENCAPSULATING DATA TO ENSURE LONGEVITY

I have proposed in [11] that the ultimate solution to this problem is to save records (data, documents, etc.) along with the application software that created them, as well as the supporting system software needed to run this application software and a description of the required hardware environment needed to run this entire suite. The hardware description would be used in the future to construct an emulator for the original hardware environment: this would be run on whatever computer happens to be available, in order to access the original records. Figure 3 shows the factors that must hold true for this approach to work. The entire collection--records, software, and hardware description--would be encapsulated to prevent corruption. This in no way alleviates the media longevity problem: encapsulated records would still have to be copied to fresh media periodically, to avoid loss. It does, however, offer a solution to the software/hardware dependence problem, which has largely been ignored (or dismissed by wishful thinking).

The motivation for encapsulating records in this way is two-fold. First, it avoids corruption by marking the entire suite of saved information as inviolable. An encapsulated record must not, for example, be subjected to lossy compression or modified in any way: it must be recognized as a saved bit-stream whose bits must be copied verbatim whenever they migrate to new media. In addition, encapsulation ensures that all of the necessary components (software, hardware environment description, and data) of a record remain contiguous: so long as the encapsulation is not violated, future readers can then be assured of having all they need to access and understand saved records.
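The requirement that an encapsulation be copied verbatim can be checked mechanically. The following Python sketch records a fingerprint of the bit-stream and compares it after each copy; the use of SHA-256 and the function names digest and migrate are assumptions made for illustration, not part of the proposal in [11].

```python
import hashlib


def digest(bitstream: bytes) -> str:
    """Return a SHA-256 fingerprint of an encapsulated bit-stream."""
    return hashlib.sha256(bitstream).hexdigest()


def migrate(bitstream: bytes, recorded_digest: str) -> bytes:
    """Copy an encapsulation to 'new media' and confirm it is unchanged.

    The copy here is just an in-memory duplicate; a real migration
    would write to a new storage medium and read the result back.
    """
    copy = bytes(bitstream)
    if digest(copy) != recorded_digest:
        raise ValueError("encapsulation was altered during migration")
    return copy
```

A fingerprint of this kind detects any change to the bits (for example, lossy compression applied by a well-meaning utility), but it does not by itself say what the bits mean; that remains the job of the explanatory metadata discussed below.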

(Figure 3: Factors Required to Read an Encapsulated Record)

Encapsulation introduces a problem, however: a prospective reader needs to know how to open the encapsulation and read the record inside it. Furthermore, it is unreasonable to expect data administrators to open and read each encapsulated record every time they need to decide where to store it, how to index it, who should be allowed access to it, etc. The solution to these problems is to attach annotation metadata to the "surface" of each encapsulation, both to explain how to decode the obsolete records contained inside the encapsulation and to provide whatever contextual information is desired about those records.

Unfortunately, this introduces a new, recursive problem, since the form and encoding of these metadata annotations themselves can quickly become obsolete. One solution to this problem is to use encapsulation recursively, along with metadata to make each level of encapsulated information intelligible at the next level. That is, whenever the form of the metadata used to annotate a given encapsulated record becomes obsolete itself, it can be wrapped in a new layer of encapsulation, using new metadata (in a new form) to explain how to read the old metadata. Yet this recursive process must terminate somehow: there must be some annotation at the outermost level of encapsulation that can be read without further interpretation--otherwise the original problem recurs at the outermost level, making it impossible to open the outermost encapsulation. Furthermore, it is overkill to encapsulate most annotations, since they consist largely of linear, descriptive text rather than hypermedia records whose structure may be as important as their textual content. The solution is therefore to devise a transparent annotation format that can annotate encapsulated records in a way that does not require significant interpretation effort.

Since most annotations can be restricted to linear text, a simple textual encoding can serve as a transparent annotation format. Such a format can be used to explain how to proceed further: it can consist of a simple, linear stream of fixed-length characters or some similar encoding that would be readily decipherable by anyone armed with a computer and the expectation that the format represents linear text. At any given stage of evolution during the information age, the choice of such an encoding is likely to be obvious. Choosing a particular transparent annotation format has the effect of adopting a "bootstrap standard" that solves the problem of interpreting recursive encapsulations: this bootstrap enables a future reader to decipher some initial portion of the metadata annotating an encapsulated record. This initial explanatory annotation can in turn explain how to read the encapsulated record itself.

Yet a given bootstrap standard--however natural it may be when first chosen--will not last forever. ASCII may have essentially replaced EBCDIC, Baudot, and various other text encodings, but it may soon be replaced by Unicode or some other scheme. It is erroneous to assume that any such standard will reign for long. Fortunately, it is not necessary to make this assumption if we are willing to restrict the transparent annotation format to linear text: doing so ensures (with high probability) that any future bootstrap standard will be able to represent whatever we encode in our current bootstrap standard. If the expressive power of any new bootstrap standard is always a proper superset of that of the older bootstrap standard that it replaces then, although it may not be possible to translate any arbitrary expression from the new standard backward into the old one, it should be possible to translate back anything that was originally expressed in the older standard. We can therefore guarantee that we can translate our current standard into a new one when necessary, while retaining the ability to translate back again without loss. This partially reversible translation capability is the key to unlocking encapsulated records, as elaborated in the next section.
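The partially reversible property can be demonstrated with two encodings available in any current Python installation. In this sketch ASCII plays the role of the older bootstrap standard and UTF-16 stands in for a newer one; the annotation text and the name READER.EXE are hypothetical.

```python
old_annotation = "To open: run READER.EXE under the emulated system."

# Translate from the old bootstrap standard (ASCII) to a newer one
# (here UTF-16, standing in for "Unicode or some other scheme").
old_bytes = old_annotation.encode("ascii")
new_bytes = old_bytes.decode("ascii").encode("utf-16-be")

# Translate back: anything that originated in ASCII survives the round trip.
restored_bytes = new_bytes.decode("utf-16-be").encode("ascii")
round_trip_ok = restored_bytes == old_bytes

# The reverse is not guaranteed: text that uses the newer standard's
# extra expressive power has no ASCII form.
try:
    "naïve résumé".encode("ascii")
    backward_ok = True
except UnicodeEncodeError:
    backward_ok = False
```

The forward-and-back translation succeeds because every ASCII character has a representation in the newer encoding, whereas the attempt to push arbitrary new-standard text back into ASCII fails. This asymmetry is exactly what is meant by "partially reversible".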

8. METADATA TO ANNOTATE ENCAPSULATED RECORDS

Encapsulated records must be annotated with "surface metadata" consisting of three categories, one mandatory, the other two optional. The mandatory category is explanatory metadata that describes how to unwrap the encapsulated information. This can be thought of as the protruding tip of an iceberg of explanatory metadata, the rest of which may be contained within the encapsulation: all that is really needed at the surface of the encapsulation is a description of how to decode and interpret any additional, encapsulated explanatory metadata. This is logically sufficient, since any other required metadata can be encapsulated, to be retrieved from within the encapsulation by following the instructions in the surface explanatory metadata.

Nevertheless, two additional categories of metadata should also be made visible at the surface of the encapsulation, if only for purposes of efficiency. The first of these consists of reference and dependency metadata, specifying any objects referenced or required by the encapsulated object: this allows keeping related objects together when they are copied to new media or repackaged, without having to open the encapsulation to look for such dependencies. If it could be guaranteed that encapsulated objects would never be split apart, this might be unnecessary, but in general this cannot be assumed. Large objects may have to be split to make efficient use of fixed-size storage media, and complex distributed objects (such as hypertext documents that are linked to each other) may be impossible to store contiguously. For these and other reasons, dependencies should be represented explicitly, rather than being wished away.

The final (third) category of surface metadata is indexing information that might include keywords, search terms, provenance, and other context for where an encapsulated record belongs, where it comes from, what it contains, and how it should be accessed. Making this visible at the surface of the encapsulation facilitates data management and allows searching for relevant data without having to delve into encapsulated information every time a query is processed.

The resulting encapsulation scheme is illustrated in Figure 4.

(Figure 4: Encapsulated Record and Metadata)
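The three categories of surface metadata described in this section might be represented as follows. This is a sketch only: the class names, field names, and the convention of storing keywords as a space-separated string are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SurfaceMetadata:
    # Mandatory: how to unwrap and interpret the encapsulated contents.
    explanatory: str
    # Optional: identifiers of objects referenced or required by this one.
    dependencies: List[str] = field(default_factory=list)
    # Optional: keywords, provenance and other indexing information.
    indexing: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        if not self.explanatory.strip():
            raise ValueError("explanatory metadata is mandatory")


@dataclass(frozen=True)
class EncapsulatedRecord:
    surface: SurfaceMetadata
    payload: bytes  # record + software + hardware description, opaque


def find_by_keyword(records, keyword):
    """Search using surface indexing only; payloads are never opened."""
    return [r for r in records
            if keyword in r.surface.indexing.get("keywords", "").split()]
```

Note that find_by_keyword never touches payload: queries, like decisions about storage and access, are answered entirely from the surface of the encapsulation.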

The surface metadata for an encapsulated record must remain readable with minimal effort. This can be ensured by encoding this metadata in the current bootstrap standard for annotation. Since these standards must be expected to evolve, such annotations must be translated into a new bootstrap standard when necessary. This can be done routinely as part of the "refresh cycle" when media are copied to new media. It is crucial, of course, that encapsulated records not be translated: the whole point of encapsulation is to prevent such translation. Only the surface annotations should be translated, and then only when the current bootstrap standard is different from the one currently attached to the surface of the encapsulation. In addition, translations between successive bootstrap standards can be saved in a known, central repository, or they can themselves be encapsulated along with each record, allowing the original annotations to be reconstructed, if necessary, using the partially reversible translation property of successive bootstrap standards discussed above.
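A refresh cycle that obeys these rules can be sketched as a single Python function. The function name refresh, its parameters, and the use of Python codec names to identify bootstrap standards are assumptions made for this illustration.

```python
def refresh(surface: bytes, surface_encoding: str, payload: bytes,
            current_standard: str):
    """Return (surface, encoding, payload) after one refresh cycle.

    Only the surface annotation is ever translated, and only when its
    encoding differs from the current bootstrap standard. The payload
    is passed through untouched.
    """
    if surface_encoding != current_standard:
        surface = surface.decode(surface_encoding).encode(current_standard)
        surface_encoding = current_standard
    return surface, surface_encoding, payload
```

For example, calling refresh on an ASCII-annotated record with current_standard set to "utf-16-be" re-encodes the annotation but returns the payload bytes unchanged; calling it again with the same standard changes nothing at all.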

9. CONCLUSIONS

Data must be of known quality if the results of using them are to be accepted as valid, credible and useful. Understanding the quality of data may ultimately be more important than merely ensuring that the quality is "as good as possible" under the circumstances. In order to evaluate and assess the impact of data quality, it is necessary to augment databases with data quality metadata of the kind outlined above. After defining and populating these quality metadata categories, a concerted, coordinated effort should be undertaken among data producers and users to utilize this quality metadata to perform explicit VV&C on their data and to institute control over the processes that generate and affect their data.

To ensure the longevity of digital data, records should be encapsulated along with explanatory metadata sufficient to allow accessing and deciphering the encapsulated information in the future. Records should be encapsulated along with whatever application and system software is required to view them, as well as metadata describing the hardware environment needed to run the required software, to allow future emulation of the required system. Explanatory metadata must be attached to this collection to explain how to use the emulated system and encapsulated software to view the encapsulated record, while indexing and other descriptive metadata should be included to enable the record to be found by search queries and managed intelligently. The surface of an encapsulated record must be annotated in a transparent form, using textual "bootstrap standards" to ensure that future readers will be able to interpret the encapsulated record with a minimum of effort.

10. REFERENCES

[1] Dick, Captain Lee, Simulation Data and the Need for Standardization, Phalanx, Vol. 28, No. 4 (ISSN 0195-1920), p. 1, ff., December 1995.

[2] DoD, M&S Master Plan, 5000.59-Paa, Draft January 1995.

[3] Kon, Henry B., Lee, Jacob, and Wang, Richard Y., A Process View of Data Quality, TDQM Research Program, MIT, March 1993.

[4] Lee, Jacob, and Wang, Richard Y., On Validation Approaches in Data Production, TDQM Research Program, MIT, August 1993.

[5] Liepins, Gunar E., and V. R. R. Uppuluri (eds.), Data quality control: theory and pragmatics, Marcel Dekker, Inc., New York, 1990.

[6] Michelson, A., and J. Rothenberg, Scholarly Communication and Information Technology: Exploring the Impact of Changes in the Research Process on Archives, The American Archivist, 55:2, 1992, pp. 236-315. ISSN 0360-9081. Reprinted by RAND as RP-187, April 1993.

[7] Redman, Thomas C., Data Quality Management and Technology, Bantam Books (ISBN 0-553-09149-2), 1992.

[8] Rothenberg, J., Verification, Validation and Certification of DIS Exercise Data, Proceedings of the 14th Workshop on Standards for the Interoperability of Distributed Simulation (DIS-14), (paper ID: 96-14-120).

[9] Rothenberg, J., A Discussion of Data Quality for Verification, Validation, and Certification (VV&C) of Data to be Used in Modeling, RAND Draft DRR-1025-DMSO (forthcoming).

[10] Rothenberg, J., and I. Kameny, Data Verification, Validation, and Certification to Improve the Quality of Data Used in Modeling, Proceedings of the 1994 Summer Computer Simulation Conference (SCSC'94), (La Jolla, CA, July 18-20, 1994), pp. 639-44, Society for Computer Simulation (SCS) (ISBN 1-56555-029-3), 1994.

[11] Rothenberg, J., Ensuring the Longevity of Digital Documents, Scientific American, Vol. 272, Number 1, pp. 42-7, January 1995.

Revision: 3/6/96