Section 2 below argues that data should be viewed as modeling reality; it follows from this that data quality can be defined as a measure of the suitability of data for its intended purpose (or range of purposes). Section 3 discusses the first key need for metadata--to evaluate and improve data quality. Section 4 outlines a strategy for assessing and improving data quality by means of two equally important, parallel approaches: performing explicit evaluation of data and establishing organizational control over the processes that generate and modify data. The core of this strategy is the definition and maintenance of appropriate data quality metadata. Section 5 proposes a framework for such metadata and discusses metadata issues that arise in connection with data quality.
A second key need for metadata is to improve the effective longevity of digital data by ensuring their future accessibility and readability. As discussed in Section 6, the longevity of digital information is constantly threatened by the combined assault of limited media life and the inexorably rapid evolution (and consequent obsolescence) of the software and supporting hardware systems needed to access and interpret digital data. These factors conspire to limit the effective lifetime of digital records despite the fact that they can be copied perfectly; this has prompted my ironic contention that digital information lasts forever--or five years, whichever comes first [11].
Media longevity issues can be addressed by copying (or "migrating") digital data periodically to new, fresh media; but digital records depend on obsolete software to interpret obsolete formats, codes and file structures, which presents a harder problem. Sections 7 and 8 propose a solution based on the use of encapsulation along with metadata to make the encapsulated information intelligible.
Section 9 summarizes the issues surrounding the use of metadata for both of these key purposes.

Any such description of reality is always an abstraction, always partial, always just one of many possible "views" of reality. (This discussion will remain independent of the specific representations chosen for these abstractions.)
Any given real-world entity, process, or phenomenon can be modeled by many different data views, depending on one's purpose. For example, the speed of a ship might be represented by one of several symbolic values (e.g., slow, medium, or fast), by a single numerical value (e.g., 20), by a table of numerical values representing speeds under different sea and load conditions, by a set of parameters to an arbitrarily complex function of such conditions, etc. In each case, a single aspect of reality (the speed of the real ship) is represented by one or more data values. The choice of what kind of data to use to model reality is made prior to generating the data, just as the choice of how to build any kind of model is necessarily made before beginning to build such a model. The appropriateness of this choice (e.g., which of the above ways to represent the speed of a ship) is a crucial aspect of the quality of the resulting data, regardless of what data values are ultimately recorded. Note that this modeling phase includes formal "data modeling" as well as any prior conceptual modeling, in which abstract choices are made about how to view (i.e., model) the world.
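As a minimal sketch of this point, the Python fragment below records one aspect of reality (the speed of a single ship) in each of the four forms mentioned above. All names, the unit (knots), the table entries, the thresholds, and the linear form of the function are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# View 1: a symbolic value.
symbolic_speed = "medium"

# View 2: a single numerical value (the unit, knots, is an assumption).
nominal_speed_knots = 20.0

# View 3: a table of speeds under different (sea, load) conditions.
speed_table: Dict[Tuple[str, str], float] = {
    ("calm", "light"): 24.0,
    ("calm", "heavy"): 20.0,
    ("rough", "light"): 17.0,
    ("rough", "heavy"): 13.0,
}


# View 4: parameters to a function of the conditions (a linear form is assumed).
@dataclass
class SpeedModel:
    base_knots: float
    sea_penalty: float    # knots lost per unit of sea state
    load_penalty: float   # knots lost per unit of load fraction

    def speed(self, sea_state: float, load_fraction: float) -> float:
        loss = self.sea_penalty * sea_state + self.load_penalty * load_fraction
        return max(0.0, self.base_knots - loss)


def to_symbolic(knots: float) -> str:
    """Collapse a numerical view into the symbolic one (thresholds are assumed)."""
    if knots < 10:
        return "slow"
    if knots < 22:
        return "medium"
    return "fast"
```

Which of these views is appropriate depends on the intended purpose: the symbolic view may suffice for a briefing, whereas a simulation of voyages under varying conditions would need the table or the parameterized function.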
Data quality has two distinct aspects, one involving the objective "correctness" of data (such as accuracy and consistency), the other involving the appropriateness of data for some intended purpose [5,7,10]. Data producers and users--as well as Quality Assurance ("QA") and Total Quality Management ("TQM") proponents--generally assume that the purpose of data quality assurance is to provide the best data possible [3,4]. But this obscures the need to evaluate data: The implication is that if a data set is the best available and is as good as it can be made, then there is no other option but to use it, in which case there is no point in worrying about just how good it is. The flaw in this is that merely saying that a data set is as good as it can be made does not tell us how good it is or whether it is any good at all: without explicit evaluation, there may well be an attractive alternative to using it, namely, to avoid using it. Furthermore, most databases are fairly general purpose, which means that assuring their quality in advance of using them can only address the objective aspects of their quality. The appropriateness of using a database for some purpose cannot even be defined--let alone evaluated--until that purpose is specified. For these reasons, it is important to focus on the evaluation and assessment of data quality, in addition to its improvement.
The process of evaluating and assessing data is often referred to as "Verification, Validation and Certification" (VV&C). In brief, Data Verification, Validation, and Certification are those activities that are done to ensure that data are correct and appropriate for their intended use. Verification is generally intended to ensure that data values are consistent with each other and accord with any specifications and requirements applicable to some general or specific intended use. Validation is intended to ensure that data values accord with what they represent and are appropriate for some specific or general intended use. Certification establishes that some appropriate authority has determined that data are correct and appropriate within margins that carry a risk acceptable to the certifying authority for some intended use.
It is also crucial to recognize that data users do not entirely control the quality of the data they use. When a data set is obtained from a data source or intermediate data center, the objective aspects of its quality (accuracy, consistency, etc.) may already have been determined. Though a user may perform verification checks on a database (e.g., to verify referential integrity), users do not often have the resources to perform comprehensive validation checks, which require comparing data values with the real world or with other, independently-derived values that are known to be correct or are considered to be "best-estimate" values. Users must therefore rely on data producers to perform such checks and to document their results as measures of the quality of their data. Moreover, if users are to have confidence in the stated quality of a database, producers must document more than just the results of their quality checks: they must also provide sufficient documentation about how they performed these checks to allow users to understand what they have done and evaluate their conclusions.
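The difference between the two kinds of check can be sketched as follows. This is an illustration under stated assumptions, not a prescribed procedure: the row layout, the field names, the tolerance, and the "best-estimate" reference values are all hypothetical.

```python
def check_referential_integrity(child_rows, parent_rows, foreign_key, primary_key):
    """Verification: return child rows whose foreign key matches no parent row."""
    parent_keys = {row[primary_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_keys]


def check_against_reference(rows, key, field, reference, tolerance):
    """Validation: return keys whose value differs from an independently
    derived reference value by more than the given tolerance."""
    return [
        row[key]
        for row in rows
        if row[key] in reference and abs(row[field] - reference[row[key]]) > tolerance
    ]


# Hypothetical data: a table of ships and a table of voyages that refers to it.
ships = [{"ship_id": 1, "speed": 20.0}, {"ship_id": 2, "speed": 31.0}]
voyages = [{"voyage_id": 10, "ship_id": 1}, {"voyage_id": 11, "ship_id": 3}]

# Verification needs only the database itself.
orphans = check_referential_integrity(voyages, ships, "ship_id", "ship_id")

# Validation needs values from outside the database (assumed best estimates).
best_estimate = {1: 20.5, 2: 25.0}
suspect = check_against_reference(ships, "ship_id", "speed", best_estimate, 1.0)
```

The first check can be run by any user who holds the database; the second requires the independent reference values, which is why it usually falls to the producer, who should then document both the result and the way it was obtained.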
Furthermore, since data producers cannot evaluate the appropriateness of a database for an unknown user's purpose, they must provide sufficient documentation about their data to enable the user to perform this evaluation. This documentation must make explicit any implicit assumptions in the data or in the way the database was generated. Although such documentation may not by itself constitute part of a data quality evaluation, it is essential for performing such evaluations. In addition, whenever a database is evaluated--whether by its producer, by an intermediate data center, or by a user--the results of this evaluation must be documented with sufficient context to allow future users to benefit from this evaluation (to the extent that it is relevant to their future purpose) and to allow producers to improve their databases by drawing on detailed evaluations and analyses of their data performed by intermediaries or data users.
Finally, it is vital to consider the ways that data may be transformed between initial production and final use. Various intermediaries (such as data centers) and users themselves may combine, aggregate, filter, edit and modify data from different sources in order to prepare a data set for a specific use. These transformation processes--along with the processes that generate data initially--all affect the quality of the resultant data. Therefore, in addition to recording information about data values themselves and evaluating the quality of these values, it is important to record information about the processes that affect data, both to ensure that data quality is not corrupted and to allow these processes to be improved. The organizations that own and perform processes that affect data must be funded and motivated to establish control over their processes to improve them, thereby improving the quality of the data they produce.
The next two sections present an overview of a comprehensive approach to evaluating, assessing, maintaining and improving data quality, based on the use of metadata that should be associated with every database.
Figure 2 is not intended as a flowchart of the activity of an individual data producer or user; rather, it attempts to show the parallel processes that should be followed by a community of producers and users in order to improve the quality of their data. For example, the top left activity ("Identify intended/potential users & purposes") should be interpreted as an action to be undertaken by an entire data community (or, alternatively, by a group of users) prior to defining appropriate metadata requirements for use in data VV&C. Similarly, the top right activity ("Identify sources/owners of data & processes") is intended as an action to be undertaken by a data community (producers and users) to identify which organizations perform data-affecting processes on a required database. As indicated in the figure, the definition and maintenance of appropriate metadata is the key to both recording the specifications needed to perform V&V and recording a full history of the VV&C activities that have been performed on a database. Similarly, it is necessary to define and maintain appropriate metadata to support and record the results of process control.

The actual VV&C that must be performed on data should be divided into producer and user VV&C as discussed elsewhere [8,9,10]. Essentially, producers must warrant the objective accuracy of a database, whereas users must determine its appropriateness for their intended use. Producers and modifiers of databases must be given the resources and mandate to perform VV&C along with their other responsibilities. Users must be encouraged, motivated, and funded to perform VV&C as an integral part of their work. The processes that generate, modify, transform, and propagate data should be examined, controlled, and improved whenever possible, in order to improve the quality of the resulting data [9,10].
Implementing these parallel activities requires: (i) augmenting databases with metadata to record information needed to assess data quality, record the results of these assessments, and support process control of processes affecting data; (ii) performing explicit VV&C on data, using metadata to both direct this activity and record its results; and (iii) establishing control over processes that affect data to improve the quality of data transformations, again using metadata to both support this activity and record its results.
Although the potential amount of metadata required to support data quality evaluation and improvement may appear overwhelming, several factors should mitigate the effort required to generate and maintain quality metadata. First, many metadata values for related items in a data set will be identical, since the sources, histories and times of entry of many data values in a given data set will be shared. Data entry and maintenance tools can take advantage of this commonality, allowing data values to "inherit" many of their quality metadata values from those of related data items. In addition, it should be feasible to develop automated tools for data entry, database management, data transformation and data VV&C that create and capture much of the relevant data quality metadata automatically, as data values are generated, transformed or checked.
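The inheritance idea can be sketched with a layered lookup, in which a data value stores only the metadata that differs from what its data set shares. The sketch below uses Python's standard collections.ChainMap; the metadata names and values are hypothetical.

```python
from collections import ChainMap

# Quality metadata shared by every value in the data set (hypothetical items).
shared_metadata = {
    "source": "harbor log",
    "entered_by": "clerk 7",
    "entry_date": "1995-01-15",
}

# Per-value metadata: only the differences are stored.
value_overrides = {
    "ship A": {},
    "ship B": {"entry_date": "1995-03-02"},
}


def effective_metadata(value_key):
    """Look up a value's own metadata first, then fall back to the shared values."""
    return ChainMap(value_overrides[value_key], shared_metadata)
```

Here two data values carry three metadata items each, yet only one item is stored at the value level; the rest is inherited from the data set.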
Tables 1 through 3 list categories of metadata required to support data quality evaluation and improvement. These are shown at three distinct levels: the database level (Table 1), the data-element (or data dictionary) level (Table 2), and the data value (or instance data) level (Table 3). All of the categories shown are necessary for either evaluating or recording data quality; those used to report data quality results per se (and to provide a minimal context for understanding those results) constitute a data quality profile, and are marked with "*" to show where the quality profile fits in the universe of quality metadata. The boundary between the profile and the larger universe of quality metadata is not a rigid one: the quality profile is simply a view (in the database sense) into this universe, which can be extended as required.
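The notion of the quality profile as a view can be sketched as a simple filter over the full set of quality metadata. The item names below are hypothetical placeholders, not the categories of Tables 1 through 3, and the choice of which items are starred is likewise invented for illustration.

```python
from typing import List, NamedTuple, Optional


class MetadataItem(NamedTuple):
    level: str        # "database", "data element", or "data value"
    name: str         # hypothetical category name
    in_profile: bool  # True if the item would be marked with "*"


# Hypothetical universe of quality metadata.
quality_metadata = [
    MetadataItem("database", "intended purpose", True),
    MetadataItem("database", "process history", False),
    MetadataItem("data element", "units", True),
    MetadataItem("data value", "accuracy estimate", True),
    MetadataItem("data value", "audit trail", False),
]


def quality_profile(items: List[MetadataItem], level: Optional[str] = None) -> List[MetadataItem]:
    """Return the starred items, optionally restricted to one level."""
    return [m for m in items if m.in_profile and (level is None or m.level == level)]
```

Extending the profile then amounts to changing which items are starred; the underlying metadata is untouched.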
Many of the quality metadata items described here are qualitative in nature (e.g., textual): it should not be surprising that the evaluation of quality is necessarily somewhat qualitative. In addition, many of the quality attributes shown here apply to metadata as well as data: that is, there may need to be metadata describing the quality of the metadata in a database.
It would clearly be advantageous to propagate and transform metadata as necessary when deriving new databases from existing ones, to allow "leveraging" metadata for multiple uses; this possibility deserves further research.
Although the cost of developing and enforcing data quality procedures and the collection and maintenance of data quality metadata may be substantial, this cost must be borne if we are to reduce the risk of using data of unknown quality.
It may be tempting to imagine that future software will be able to read old records and recreate the ways they were intended to be seen, but this is unrealistic. Software paradigms change every few years, and record-keeping paradigms themselves are evolving in response (see Michelson and Rothenberg [6]). Nor can we expect to reproduce the precise behavior of obsolete software in the future by interpreting a saved description of that software--the only description of software that can capture its functional richness is the software itself. In general, viewing obsolete records as they were intended to be viewed requires running the actual software that created them.
Standards may seem to alleviate this problem, but they necessarily lag behind current practice, and (as the saying goes) the best thing about standards is that there are so many different ones to choose from. Furthermore, it is naive to think we can standardize data representation or presentation formats when our paradigms are still evolving so fast. Although data can be translated into new forms as standards change, this introduces the risk of corruption as each translation is performed; since it is not always possible to provide backward translation without loss, the original version of a record is often impossible to reconstruct once it is translated. The only general, long-term solution to this problem would appear to be a strategy that makes records self-contained and self-explanatory. The next section proposes such a strategy, based on the use of encapsulation along with metadata to make the encapsulated information intelligible.
The motivation for encapsulating records in this way is two-fold. First, it avoids corruption by marking the entire suite of saved information as inviolable. An encapsulated record must not, for example, be subjected to lossy compression or modified in any way: it must be recognized as a saved bit-stream whose bits must be copied verbatim whenever they migrate to new media. In addition, encapsulation ensures that all of the necessary components (software, hardware environment description, and data) of a record remain contiguous: so long as the encapsulation is not violated, future readers can then be assured of having all they need to access and understand saved records.

Encapsulation introduces a problem, however: a prospective reader needs to know how to open the encapsulation and read the record inside it. Furthermore, it is unreasonable to expect data administrators to open and read each encapsulated record every time they need to decide where to store it, how to index it, who should be allowed access to it, etc. The solution to these problems is to attach annotation metadata to the "surface" of each encapsulation, both to explain how to decode the obsolete records contained inside the encapsulation and to provide whatever contextual information is desired about those records.
Unfortunately, this introduces a new, recursive problem, since the form and encoding of these metadata annotations themselves can quickly become obsolete. One solution to this problem is to use encapsulation recursively, along with metadata to make each level of encapsulated information intelligible at the next level. That is, whenever the form of the metadata used to annotate a given encapsulated record becomes obsolete itself, it can be wrapped in a new layer of encapsulation, using new metadata (in a new form) to explain how to read the old metadata. Yet this recursive process must terminate somehow: there must be some annotation at the outermost level of encapsulation that can be read without further interpretation--otherwise the original problem recurs at the outermost level, making it impossible to open the outermost encapsulation. Furthermore, it is overkill to encapsulate most annotations, since they consist largely of linear, descriptive text rather than hypermedia records whose structure may be as important as their textual content. The solution is therefore to devise a transparent annotation format that can annotate encapsulated records in a way that does not require significant interpretation effort.
Since most annotations can be restricted to linear text, a simple textual encoding can serve as a transparent annotation format. Such a format can be used to explain how to proceed further: it can consist of a simple, linear stream of fixed-length characters or some similar encoding that would be readily decipherable by anyone armed with a computer and the expectation that the format represents linear text. At any given stage of evolution during the information age, the choice of such an encoding is likely to be obvious. Choosing a particular transparent annotation format has the effect of adopting a "bootstrap standard" that solves the problem of interpreting recursive encapsulations: this bootstrap enables a future reader to decipher some initial portion of the metadata annotating an encapsulated record. This initial explanatory annotation can in turn explain how to read the encapsulated record itself.
Yet a given bootstrap standard--however natural it may be when first chosen--will not last forever. ASCII may have essentially replaced EBCDIC, Baudot, and various other text encodings, but it may soon be replaced by Unicode or some other scheme. It is erroneous to assume that any such standard will reign for long. Fortunately, it is not necessary to make this assumption if we are willing to restrict the transparent annotation format to linear text: doing so ensures (with high probability) that any future bootstrap standard will be able to represent whatever we encode in our current bootstrap standard. If the expressive power of any new bootstrap standard is always a proper superset of that of the older bootstrap standard that it replaces, then, although it may not be possible to translate an arbitrary expression from the new standard backward into the old one, it should be possible to translate back anything that was originally expressed in the older standard. We can therefore guarantee that we can translate our current standard into a new one when necessary, while retaining the ability to translate back again without loss. This partially reversible translation capability is the key to unlocking encapsulated records, as elaborated in the next section.
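The partially reversible property can be illustrated with two text encodings that stand in for an older and a newer bootstrap standard. The sketch below uses Python's built-in "ascii" and "utf-16-be" codecs; treating them as successive bootstrap standards is an assumption made for the example, as are the function name and the sample strings.

```python
def translate(data: bytes, old_standard: str, new_standard: str) -> bytes:
    """Re-encode linear text from one encoding into another."""
    return data.decode(old_standard).encode(new_standard)


# An annotation originally expressed in the older standard.
original = "HOW TO OPEN THIS RECORD".encode("ascii")

# Forward translation into the newer standard, then back again.
forward = translate(original, "ascii", "utf-16-be")
restored = translate(forward, "utf-16-be", "ascii")

# Text first expressed in the newer standard may have no backward translation.
newer_only = "na\u00efve".encode("utf-16-be")
try:
    translate(newer_only, "utf-16-be", "ascii")
    backward_possible = True
except UnicodeEncodeError:
    backward_possible = False
```

Anything that started life in the older encoding survives the round trip unchanged, whereas text that uses the additional expressive power of the newer encoding cannot be pushed backward; this asymmetry is all that the scheme requires.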
Nevertheless, two additional categories of metadata should also be made visible at the surface of the encapsulation, if only for purposes of efficiency. The first of these consists of reference and dependency metadata, specifying any objects referenced or required by the encapsulated object: this allows keeping related objects together when they are copied to new media or repackaged, without having to open the encapsulation to look for such dependencies. If it could be guaranteed that encapsulated objects would never be split apart, this might be unnecessary, but in general this cannot be assumed. Large objects may have to be split to make efficient use of fixed-size storage media, and complex distributed objects (such as hypertext documents that are linked to each other) may be impossible to store contiguously. For these and other reasons, dependencies should be represented explicitly, rather than being wished away.
The final (third) category of surface metadata is indexing information that might include keywords, search terms, provenance, and other context for where an encapsulated record belongs, where it comes from, what it contains, and how it should be accessed. Making this visible at the surface of the encapsulation facilitates data management and allows searching for relevant data without having to delve into encapsulated information every time a query is processed.
The resulting encapsulation scheme is illustrated in Figure 4.
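A rough sketch of how surface metadata lets a data manager work without opening any encapsulation is given below. The record layout, field names, identifiers, and catalog contents are hypothetical; the payload is treated as an opaque bit-stream that the functions never inspect.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class EncapsulatedRecord:
    payload: bytes                      # saved bit-stream; never opened below
    explanation: str                    # how to decode the payload
    dependencies: Tuple[str, ...] = ()  # objects referenced or required
    index_terms: Tuple[str, ...] = ()   # keywords, provenance, other context


def with_dependencies(catalog: Dict[str, EncapsulatedRecord], record_id: str) -> Set[str]:
    """Return a record plus everything it transitively depends on,
    reading only the dependency metadata on each surface."""
    needed: Set[str] = set()
    pending = [record_id]
    while pending:
        current = pending.pop()
        if current not in needed:
            needed.add(current)
            pending.extend(catalog[current].dependencies)
    return needed


def find_by_term(catalog: Dict[str, EncapsulatedRecord], term: str) -> List[str]:
    """Answer a query from the indexing metadata alone."""
    wanted = term.lower()
    return sorted(
        record_id
        for record_id, record in catalog.items()
        if wanted in (t.lower() for t in record.index_terms)
    )


# Hypothetical catalog of encapsulated records.
catalog = {
    "report": EncapsulatedRecord(b"\x01\x02", "view with the saved word processor",
                                 ("chart",), ("ship speed", "1995")),
    "chart": EncapsulatedRecord(b"\x03\x04", "view with the saved plotting program",
                                ("table",), ("ship speed",)),
    "table": EncapsulatedRecord(b"\x05", "view with the saved spreadsheet",
                                (), ("sea state",)),
    "memo": EncapsulatedRecord(b"\x06", "view with the saved word processor",
                               (), ("budget",)),
}
```

When the "report" record is copied to new media, with_dependencies identifies the other records that should travel with it, and find_by_term answers a search without decoding any payload.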

The surface metadata for an encapsulated record must remain readable with minimal effort. This can be ensured by encoding this metadata in the current bootstrap standard for annotation. Since these standards must be expected to evolve, such annotations must be translated into a new bootstrap standard when necessary. This can be done routinely as part of the "refresh cycle" in which records are copied to new media. It is crucial, of course, that encapsulated records not be translated: the whole point of encapsulation is to prevent such translation. Only the surface annotations should be translated, and then only when the current bootstrap standard differs from the one in which those annotations are encoded. In addition, translations between successive bootstrap standards can be saved in a known, central repository, or they can themselves be encapsulated along with each record, allowing the original annotations to be reconstructed, if necessary, using the partially reversible translation property of successive bootstrap standards discussed above.
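The refresh step described above might look like the following sketch, in which Python codec names again stand in for bootstrap standards. The class, its fields, and the sample annotation are hypothetical; the essential behavior is that the payload is carried over bit for bit while only the surface annotation is re-encoded, and only when the standard has changed.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Capsule:
    payload: bytes     # encapsulated record: copied verbatim, never translated
    annotation: bytes  # surface annotation, encoded in `standard`
    standard: str      # codec name standing in for a bootstrap standard


def refresh(capsule: Capsule, current_standard: str) -> Capsule:
    """Translate the surface annotation if, and only if, the bootstrap
    standard has changed; leave the payload untouched."""
    if capsule.standard == current_standard:
        return capsule
    text = capsule.annotation.decode(capsule.standard)
    return replace(
        capsule,
        annotation=text.encode(current_standard),
        standard=current_standard,
    )


old = Capsule(
    payload=b"\x00\x01\xff",
    annotation="Run the saved viewer on the emulated system".encode("ascii"),
    standard="ascii",
)
new = refresh(old, "utf-16-be")
```

Translating the refreshed annotation back into the earlier standard reproduces the original bytes, which is the reconstruction that the partially reversible property makes possible.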
To ensure the longevity of digital data, records should be encapsulated along with explanatory metadata sufficient to allow accessing and deciphering the encapsulated information in the future. Records should be encapsulated along with whatever application and system software is required to view them, as well as metadata describing the hardware environment needed to run the required software, to allow future emulation of the required system. Explanatory metadata must be attached to this collection to explain how to use the emulated system and encapsulated software to view the encapsulated record, while indexing and other descriptive metadata should be included to enable the record to be found by search queries and managed intelligently. The surface of an encapsulated record must be annotated in a transparent form, using textual "bootstrap standards" to ensure that future readers will be able to interpret the encapsulated record with a minimum of effort.
[2] DoD, M&S Master Plan, 5000.59-Paa, Draft January 1995.
[3] Kon, Henry B., Lee, Jacob, and Wang, Richard Y., A Process View of Data Quality, TDQM Research Program, MIT, March 1993.
[4] Lee, Jacob, and Wang, Richard Y., On Validation Approaches in Data Production, TDQM Research Program, MIT, August 1993.
[5] Liepins, Gunar E., and V. R. R. Uppuluri (eds.), Data quality control: theory and pragmatics, Marcel Dekker, Inc., New York, 1990.
[6] Michelson, A., and J. Rothenberg, Scholarly Communication and Information Technology: Exploring the Impact of Changes in the Research Process on Archives, The American Archivist, 55:2, 1992, pp. 236-315. ISSN 0360-9081. Reprinted by RAND as RP-187, April 1993.
[7] Redman, Thomas C., Data Quality Management and Technology, Bantam Books (ISBN 0-553-09149-2), 1992.
[8] Rothenberg, J., Verification, Validation and Certification of DIS Exercise Data, Proceedings of the 14th Workshop on Standards for the Interoperability of Distributed Simulation (DIS-14), (paper ID: 96-14-120).
[9] Rothenberg, J., A Discussion of Data Quality for Verification, Validation, and Certification (VV&C) of Data to be Used in Modeling, RAND Draft DRR-1025-DMSO (forthcoming).
[10] Rothenberg, J., and I. Kameny, Data Verification, Validation, and Certification to Improve the Quality of Data Used in Modeling, Proceedings of the 1994 Summer Computer Simulation Conference (SCSC'94), (La Jolla, CA, July 18-20, 1994), pp. 639-44, Society for Computer Simulation (SCS) (ISBN 1-56555-029-3), 1994.
[11] Rothenberg, J., Ensuring the Longevity of Digital Documents, Scientific American, Vol. 272, Number 1, pp. 42-7, January 1995.