
DBMS - March 1998 DBMS Online: Data Warehouse Architect By Ralph Kimball

Meta Meta Data Data

Making a List of Data About Metadata and Exploring Information Cataloging Tools.

Metadata is an amazing topic in the data warehouse world. Considering that we don’t know exactly what it is, or where it is, we spend more time talking about it, worrying about it, and feeling guilty we aren’t doing anything about it than any other topic. Several years ago we decided that metadata is any data about data. This wasn’t very helpful because it didn’t paint a clear picture in our minds. This fuzzy view gradually cleared up, and recently we have been talking more confidently about the "back-room metadata" that guides the extraction, cleaning, and loading processes, as well as the "front-room metadata" that makes our query tools and report writers function smoothly.

The back-room metadata presumably helps the DBA bring the data into the warehouse and is probably also of interest to business users when they ask where the data came from. The front-room metadata is mostly for the benefit of the end user, and its definition has been expanded to include not only the oil that makes our tools function smoothly, but also a kind of dictionary of business content represented by all the data elements.

Even these definitions, as helpful as they are, fail to give the data warehouse manager much of a feeling for what it is he or she is supposed to do. It sounds like whatever this metadata stuff is, it’s important, and we had better:

Make a nice annotated list of all of it.
Decide just how important each part is.
Take responsibility for it.
Decide what constitutes a consistent and working set of it.
Decide whether to make it or buy it.
Store it somewhere for backup and recovery.
Make it available to the people who need it.
Assure its quality and make it complete and up to date.
Control it from one place.
Document all of these responsibilities well enough to hand this job off (soon).

Now there is a good, solid IT set of responsibilities. So far, so good. The only trouble is, we haven’t really said what it is yet. We do notice that the last item in the above list really isn’t metadata, but rather, data about metadata. With a sinking feeling, we realize we probably need meta meta data data.

To get this under control, let’s try to make a complete list of all possible types of metadata. We surely won’t succeed in this first try, but we will learn a lot. First, let’s go to the source systems, which could be mainframes, separate nonmainframe servers, users’ desktops, third-party data providers, or even online sources. We will assume that all we do here is read the source data and extract it to a data staging area that could be on the mainframe or could be on a downstream machine. Taking a big swig of coffee, we start the list:

Repository specifications
Source schemas
Copy-book specifications
Proprietary or third-party source specifications
Print spool file source specifications
Old format specifications for archived mainframe data
Relational, spreadsheet, and Lotus Notes source specifications
Presentation graphics source specifications (for example, PowerPoint)
URL source specifications
Ownership descriptions of each source
Business descriptions of each source
Update frequencies of original sources
Legal limitations on the use of each source
Mainframe or source system job schedules
Access methods, access rights, privileges, and passwords for source access
The Cobol/JCL, C, or Basic code that implements extraction
The automated extract tool settings, if we use such a tool
Results of specific extract jobs including exact times, content, and completeness.
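
To make the last item on this list concrete, here is a minimal sketch of what a record of one extract job's results might look like. The class name, field names, and the sample source name are assumptions made for illustration; they do not come from any particular extract tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ExtractJobResult:
    """Hypothetical record of one extract run: exact times, content, completeness."""
    source_name: str
    started_at: datetime
    finished_at: datetime
    rows_expected: int    # row count reported by the source system
    rows_extracted: int   # row count that actually reached the staging area

    @property
    def duration_seconds(self) -> float:
        """Elapsed wall-clock time of the extract job."""
        return (self.finished_at - self.started_at).total_seconds()

    @property
    def completeness(self) -> float:
        """Fraction of expected rows that were extracted (1.0 if none were expected)."""
        if self.rows_expected == 0:
            return 1.0
        return self.rows_extracted / self.rows_expected


# Illustrative values only.
result = ExtractJobResult(
    source_name="orders_master",
    started_at=datetime(1998, 3, 1, 2, 0, 0),
    finished_at=datetime(1998, 3, 1, 2, 15, 0),
    rows_expected=2000,
    rows_extracted=1900,
)
```

A record like this is itself metadata: it says nothing about any single order, only about how well one night's extract went.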

Now let’s list all the metadata needed to get the data into a data staging area and prepare it for loading into one or more data marts. We may do this on the mainframe with hand-coded Cobol, or by using an automated extract tool. Or we may bring the flat file extracts more or less untouched into a separate data staging area on a different machine. In any case, we have to be concerned about metadata describing:

Data transmission scheduling and results of specific transmissions
File usage in the data staging area including duration, volatility, and ownership
Definitions of conformed dimensions and conformed facts
Job specifications for joining sources, stripping out fields, and looking up attributes
Slowly changing dimension policies for each incoming descriptive attribute (for example, overwrite, create new record, or create new field)
Current surrogate key assignments for each production key, including a fast lookup table to perform this mapping in memory
Yesterday’s copy of a production dimension to use as the basis for Diff Compare
Data cleaning specifications
Data enhancement and mapping transformations (for example, expanding abbreviations and providing more detail)
Transformations required for data mining (for example, interpreting nulls and scaling numerics)
Target schema designs, source to target data flows, target data ownership, and DBMS load scripts
Aggregate definitions
Aggregate usage statistics, base table usage statistics, potential aggregates
Aggregate modification logs
Data lineage and audit records (where exactly did this record come from and when)
Data transform run-time logs, success summaries, and time stamps
Data transform software version numbers
Business descriptions of extract processing
Security settings for extract files, software, and metadata
Security settings for data transmission (that is, passwords, certificates, and so on)
Data staging area archive logs and recovery procedures
Data staging area archive security settings.
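
Three items on this list work together: the slowly changing dimension policies, the surrogate key lookup table, and yesterday's copy of the dimension used for Diff Compare. The following is only a rough sketch of how they might interact, assuming a dimension small enough to hold in memory as a dictionary keyed by production key. All names here (SurrogateKeyMap, diff_compare, apply_changes, and the policy labels "overwrite" and "new_record") are hypothetical, and the "create new field" policy is left out for brevity.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class SurrogateKeyMap:
    """In-memory lookup from production key to its current surrogate key."""
    current: dict = field(default_factory=dict)
    _next: count = field(default_factory=lambda: count(1))

    def assign(self, production_key):
        """Issue a fresh surrogate key and make it the current one."""
        key = next(self._next)
        self.current[production_key] = key
        return key


def diff_compare(yesterday, today):
    """Return (new, changed) production keys by comparing two dimension copies."""
    new = [pk for pk in today if pk not in yesterday]
    changed = [pk for pk in today if pk in yesterday and today[pk] != yesterday[pk]]
    return new, changed


def apply_changes(yesterday, today, policies, key_map):
    """Choose an action per row from the per-attribute policies.

    policies maps attribute name -> "overwrite" or "new_record".
    A row gets a new record (and a new surrogate key) if any changed
    attribute carries the "new_record" policy; otherwise it is overwritten.
    """
    new, changed = diff_compare(yesterday, today)
    actions = [("insert", pk, key_map.assign(pk)) for pk in new]
    for pk in changed:
        changed_attrs = [a for a in today[pk] if today[pk][a] != yesterday[pk].get(a)]
        if any(policies.get(a) == "new_record" for a in changed_attrs):
            actions.append(("insert", pk, key_map.assign(pk)))
        else:
            actions.append(("overwrite", pk, key_map.current[pk]))
    return actions
```

The point of the sketch is that the policies, the key map, and yesterday's copy are all metadata that the nightly load depends on, even though none of them appears in the final data mart as business data.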

Once we have finally transferred the data to the data mart DBMS, then we must have metadata, including:

DBMS system tables
Partition settings
Indexes
Disk striping specifications
Processing hints
DBMS-level security privileges and grants
View definitions
Stored procedures and SQL administrative scripts
DBMS backup status, procedures, and security.

In the front room, we have metadata extending to the horizon, including:

Precanned query and report definitions
Join specification tool settings
Pretty print tool specifications (for relabeling fields in readable ways)
End-user documentation and training aids, both vendor supplied and IT supplied
Network security user privilege profiles, authentication certificates, and usage statistics, including logon attempts, access attempts, and user ID by location reports
Individual user profiles, with links to human resources to track promotions, transfers, and resignations that affect access rights
Links to contractor and partner tracking where access rights are affected
Usage and access maps for data elements, tables, views, and reports
Resource charge back statistics
Favorite Web sites (as a paradigm for all data warehouse access).

Now we can see why we didn’t know what this metadata was all about. It is everything! Except for the data itself. Suddenly, the data seems like the simplest part.

With this perspective, do we really need to keep track of all this? We do, in my opinion. This list of metadata is the essential framework of your data warehouse. Just listing it as we have done seems quite helpful. It’s a long list, but we can go down through it, find each kind of metadata, and identify what it is used for and where it is stored.
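
Going down the list in this way can be sketched as a tiny catalog. This is only an illustration under assumed names: MetadataEntry, its fields, and the sample locations are hypothetical and are not taken from any of the tools discussed later.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataEntry:
    """One line of the annotated list: what it is, what it is for, where it lives."""
    name: str
    area: str        # for example "source", "staging", "dbms", "front room"
    used_for: str
    stored_at: str   # empty string means we have not located it yet


def by_area(entries):
    """Group entry names by the part of the warehouse they belong to."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.area].append(entry.name)
    return dict(grouped)


def unlocated(entries):
    """Names of entries whose storage location is still unknown."""
    return [entry.name for entry in entries if not entry.stored_at.strip()]


# Illustrative entries only; the locations are made up.
catalog = [
    MetadataEntry("Copy-book specifications", "source",
                  "describe mainframe record layouts", "mainframe source library"),
    MetadataEntry("Slowly changing dimension policies", "staging",
                  "decide overwrite vs. new record", ""),
    MetadataEntry("View definitions", "dbms",
                  "present tables to query tools", "DBMS system tables"),
]
```

Even a list this crude answers two of the questions posed earlier: what each kind of metadata is used for, and which kinds we have not yet managed to find.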

There are some sobering realizations, however. Much of this metadata needs to reside on the machines close to where the work occurs. Programs, settings, and specifications that drive processes have to be in certain destination locations and in very specific formats. That isn’t likely to change soon.

At the very least, we are going to need a tool for cataloging metadata and keeping track of it. The tool probably can’t read and write all the metadata directly, but it should at least help us manage the metadata that is stored in so many locations. Fortunately, there is a category of tools dedicated to this very purpose. Go and look at the Information Catalog Tools section of Larry Greenfield’s Web site at pwp.starnetinc.com/larryg/catalog.html. As of this writing, Larry lists no fewer than 14 tools that aim squarely at the metadata problem. Of these, six in particular caught my eye as claiming to take a perspective very close to the one I have described in this article. These six are deliveryManager from Virtual Integration Technology (www.vit.com), InfoCat from Enterprise Solutions Inc. (www.infocat.com), Logic Works Universal Directory from Logic Works (www.logicworks.com), Marlow from One Meaning (www.onemeaning.com), Metadata Control Center from Intellidex Systems (www.intellidex.com), and Prism Warehouse Directory from Prism Solutions (www.prismsolutions.com). I recommend that you take a look at these products to see if they meet your metadata needs.

Once we have taken the first step of getting our metadata corralled and under control, can we hope for tools that will pull all the metadata together in one place and be able to read and write it as well? With such a tool, not only would we have a uniform user interface for all this disparate metadata, but on a consistent basis we would be able to snapshot all the metadata at once, back it up, secure it, and restore it if we ever lost it.

Don’t hold your breath. As you can appreciate, this is a very hard problem, and encompassing all forms of metadata will require a kind of systems integration that we don’t have today. I believe the Metadata Coalition (a group of vendors trying seriously to solve the metadata problem) will make some reasonable progress in defining common syntax and semantics for metadata, but it has been two years and counting since they started this effort. Unfortunately, Oracle, the biggest DBMS player, has chosen to sit out this effort and has promised to release its own proprietary metadata standard. Other vendors are making serious efforts to extend their product suites to encompass many of the activities listed in this article and simultaneously to publish their own frameworks for metadata. These vendors include Microsoft, which is working with the Metadata Coalition to extend the Microsoft Repository, as well as a pack of aggressive, smaller players proposing comprehensive metadata frameworks, including Sagent, Informatica, VMark, and D2K. In any case, these vendors will have to offer significant business advantages in order to compel other vendors to write to their specifications. You can read the Metadata Coalition’s position papers and progress reports on www.he.net/~metadata. Meanwhile, take a look at the information catalog tools I mentioned, and get started entering your meta meta data data.

Ralph Kimball was coinventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems and is the founder and former CEO of Red Brick Systems. He now works as an independent consultant designing large data warehouses. His book The Data Warehouse Toolkit: How to Design Dimensional Data Warehouses (Wiley, 1996) is now available. You can reach Ralph through his Web page at www.rkimball.com.

This is a copy of an article published @ http://www.dbmsmag.com/

