K. I. Gordon - Index to articles

Welcome

This index page links to a number of articles related to data issues, Data Warehouse implementations, standards, Year 2000 problems, Technical Architectures, applications development using high level languages, tool creation and other topics.

The why of data standards - Do you really know your data? Summary: Much of the data "mess" is hidden from the business users by the filtering which occurs when a report is created. Even many senior IS managers have little idea of the real content of their files. The article provides results from the analysis of millions of values of actual data, illustrates what standards can be applied to correct the problems, and shows why any Data Warehouse implementation must convert the source data to standards.

Data Warehouse Implementation Plan Summary: An objective for a Data Warehouse is to make all critical business data available in standard form for rapid inquiry, analysis and reporting. To achieve this, it is necessary to move data out of its non-standard form in existing applications by steps such as those outlined in the article. These steps were followed in a successful implementation which resulted in useful data being available within 3 months and a full capability within 6. For most organizations, creating a Data Warehouse should not be a major undertaking.

Data Warehouse Implementation Summary: Much effort in IS is currently going into creating "Data Warehouses". These are stores of data periodically extracted from older legacy applications, converted to common standards and made accessible for user analysis. The warehouse acts as WORM (Write Once Read Many times) storage. Where the extract and transfer are performed nightly, the warehouse provides access to what is termed 'Near Operational' data and can be used to replace much of the existing reporting. In other cases it is used to store mostly historical data for analysis of trends, market impact, financial status and so on. While often implemented with a variety of different clean-up tools, languages, database products and query tools, this article describes an implementation done almost entirely with APL. It includes a query capability termed "Query by Mail" which enables anyone with access to E-Mail to send queries to the Warehouse and receive responses or extracts of data by return mail. The "query" includes customized analysis of field content to allow identification of fields and records containing invalid data. Built upon a proprietary inverted file system, it provides rapid response to user queries while placing little load on the server system.
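
The article describes an APL implementation; purely as a rough illustration of the "Query by Mail" idea (Python rather than APL, with an invented query syntax and invented sample data), a handler might take the text body of an incoming message, run it against columnar data, and return a reply body:

    # Illustrative sketch only, not the article's code: a "query by mail" style
    # handler that parses the text body of a message, runs a simple query against
    # in-memory warehouse columns, and returns a reply body for the return mail.
    columns = {                       # columnar storage: one list per field
        "region": ["east", "west", "east", "north"],
        "sales":  [1200, 950, 1730, 640],
    }

    def handle_mail_query(body: str) -> str:
        # expected body, one line: COUNT <field>=<value>   e.g. "COUNT region=east"
        try:
            verb, condition = body.strip().split(maxsplit=1)
            field, value = condition.split("=", 1)
            matches = [v for v in columns[field] if str(v) == value]
            if verb.upper() == "COUNT":
                return f"{len(matches)} records match {field}={value}"
            return "Unknown verb; supported: COUNT"
        except (ValueError, KeyError) as err:
            return f"Could not process query: {err}"

    print(handle_mail_query("COUNT region=east"))   # -> "2 records match region=east"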

Neutral Form Data Standards Summary: Simple standards are presented for the various data types encountered in commercial data files. These "Neutral Standards" are applicable to the interchange of data between applications, to the transfer of data to a Data Warehouse, and, it can be argued, to the storage of original transaction data in new applications where data is input, viewed, or printed many more times than it is used in calculations.
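
The article defines the actual neutral standards; purely as a hedged illustration of the idea (the forms chosen below are assumptions, not the article's), legacy values might be converted to a single text representation such as YYYYMMDD dates and plain decimal numbers:

    # Hedged illustration only: the neutral forms here (YYYYMMDD dates, plain
    # decimals with no separators) are assumed for the example; the article
    # specifies its own standards.
    from datetime import datetime

    def neutral_date(legacy: str) -> str:
        # accept a few common legacy layouts and emit YYYYMMDD text
        for fmt in ("%m/%d/%y", "%d-%b-%Y", "%Y%m%d"):
            try:
                return datetime.strptime(legacy, fmt).strftime("%Y%m%d")
            except ValueError:
                continue
        raise ValueError(f"unrecognized date: {legacy!r}")

    def neutral_number(legacy: str) -> str:
        # strip currency signs and thousands separators, keep sign and decimal point
        cleaned = legacy.replace("$", "").replace(",", "").strip()
        return str(float(cleaned)) if "." in cleaned else str(int(cleaned))

    print(neutral_date("12/31/99"), neutral_number("$1,234.50"))  # 19991231 1234.5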

Indexing Revisited Summary: Legacy applications tend to have identifiers or reference numbers with embedded attributes or sub-coding. The rationale for using such identifiers no longer exists, and modern data analysis prescribes the use of meaningless unique numbers as item identifiers. This allows a very simple index design which also confers fast storage and retrieval. In effect, most indexed record storage techniques are more complicated than needed, particularly when allied with a data warehouse as the main analysis capability.
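
A minimal sketch of why a meaningless identifier simplifies the index (an illustration in Python, assuming identifiers are assigned as a dense sequence starting at 1): the identifier maps directly to an array position, so storage and retrieval become direct offsets rather than key searches.

    # Sketch: meaning-free sequential ids make the "index" trivial.
    records = []                       # record store; position = id - 1

    def add_record(data: dict) -> int:
        records.append(data)           # the next id is simply the next position
        return len(records)            # the new, meaning-free identifier

    def get_record(item_id: int) -> dict:
        return records[item_id - 1]    # no key parsing, no search

    cust = add_record({"name": "Acme Ltd", "region": "west"})
    print(cust, get_record(cust))      # 1 {'name': 'Acme Ltd', 'region': 'west'}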

Array processing of commercial data - to drastically cut elapsed times Summary: Commercial data processing has traditionally used scalar processing of fields in records. Taking advantage of today's high speed processors, large main memory, and peak disk transfer rates requires a paradigm shift to processing data as arrays. The article illustrates how typical commercial processes can be re-designed for array processing to provide up to 2 orders of magnitude reduction in elapsed times. While optimally requiring new data structures, as well as new processing techniques, the methods are generic enough to apply to many applications from payroll to billings.
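
A minimal sketch of the paradigm shift (Python/NumPy used purely for illustration; the article's own work uses other tools) contrasts record-at-a-time arithmetic with a single operation over whole columns:

    # Sketch: the same extended-price calculation done record-at-a-time and then
    # as whole-column arrays streamed through the processor.
    import numpy as np

    quantities = np.random.randint(1, 100, size=1_000_000)
    prices     = np.random.uniform(0.5, 50.0, size=1_000_000)

    # traditional scalar style: one record at a time
    totals_scalar = [q * p for q, p in zip(quantities, prices)]

    # array style: one operation over entire columns
    totals_array = quantities * prices

    print(len(totals_scalar), float(totals_array.sum()))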

Compressed Data Structures for Data Warehouses Summary: The increasing importance of the Data Warehouse concept implies a need for an alternative data structure optimized for read-only access. This objective also allows more emphasis on data compression techniques, both to reduce the total volume and, even more importantly, to reduce the disk transfer for faster processing on access.
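
One compression technique suited to read-only columns (offered only as an illustration; the article's actual structures may differ) is dictionary encoding, where each distinct value is stored once and the column itself becomes an array of small integer codes:

    # Sketch: dictionary-encode a repetitive text column for a WORM store.
    def dictionary_encode(column):
        codes, lookup = [], {}
        for value in column:
            if value not in lookup:
                lookup[value] = len(lookup)      # assign the next code to a new value
            codes.append(lookup[value])
        decode = [None] * len(lookup)
        for value, code in lookup.items():
            decode[code] = value
        return codes, decode                     # codes + table replace the raw column

    region = ["east", "west", "east", "east", "north", "west"]
    codes, decode = dictionary_encode(region)
    print(codes)                                  # [0, 1, 0, 0, 2, 1]
    print([decode[c] for c in codes] == region)   # True: the encoding is lossless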

Queries on Packed data Summary: Relatively simple code can be used to compress or pack data for a WORM Data Warehouse, and even simpler code to unpack it. This can speed up data transfer rates from the disk by an average of 3 to 5 times (depending on data characteristics) with little increase in CPU time. In addition, many queries can be performed directly against the compressed data, reducing the CPU cycles relative to the uncompressed data and also reducing the amount of data movement in memory. The latter is a major component of cycle time when dealing with very large arrays of data and can exceed the CPU time for the comparison operations.
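
As a hedged sketch of querying packed data directly (an illustration, not the article's APL inverted-file code), an equality query against a dictionary-encoded column translates the target value once and then compares integer codes without unpacking any stored values:

    # Sketch: run an equality query against the packed (encoded) column itself.
    decode = ["east", "west", "north"]            # code -> original value
    codes  = [0, 1, 0, 0, 2, 1]                   # the packed column

    def count_equal(target: str) -> int:
        try:
            code = decode.index(target)           # translate the query value once
        except ValueError:
            return 0                              # value never occurs in the column
        return sum(1 for c in codes if c == code) # compare small ints, no unpacking

    print(count_equal("east"))                    # 3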

Data Warehouse oriented Data Analysis Tools Summary: Data Warehouses implemented from legacy applications data often encounter difficulty mapping data to standards because of the poor quality of the sources and the disparities between documentation and actuality. The tools described here address the issue of analyzing the actual content of the data fields in a flat file so as to provide the information for correct mapping of the source data to standards. The tools are generic enough that they can also support the initial warehouse data access capabilities and look-up of data by users. The tools are GUI based to allow interactive use during the analysis phase, and results can be written to text files for transfer to other tools or simply for documentation.

Basic Data Analysis Tools - with code Summary: As described above, Data Warehouses implemented from legacy applications data often encounter difficulty mapping data to standards because of the poor quality of the sources and the disparities between documentation and actuality. The basic tools described here address the issue of analyzing the actual content of the data fields in a flat file so as to provide the information for correct mapping of the source data to standards. These tools represent a simpler basic subset of the Windows-oriented tools, but still support the initial analysis of content using a command line interface.
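
A rough sketch of this kind of content analysis (a generic illustration, not the article's code; the file name in the usage line is hypothetical) profiles each field of a delimited flat file, reporting distinct values, blanks and numeric-looking entries as raw material for mapping a source field to a standard:

    # Sketch: command-line style profiling of field content in a flat file.
    import csv
    from collections import Counter

    def profile(path, delimiter=","):
        with open(path, newline="") as fh:
            reader = csv.DictReader(fh, delimiter=delimiter)
            counters = {name: Counter() for name in reader.fieldnames}
            for row in reader:
                for name, value in row.items():
                    counters[name][(value or "").strip()] += 1
        for name, counts in counters.items():
            blanks = counts[""]
            numerics = sum(n for v, n in counts.items() if v.replace(".", "", 1).isdigit())
            print(f"{name}: {len(counts)} distinct, {blanks} blank, {numerics} numeric-looking")
            print("  top values:", counts.most_common(3))

    # profile("customers.dat", delimiter="|")     # hypothetical file name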

Data independent code - The key to reuse and IS productivity Summary: Embedded data definitions are what customize code for a particular file, form or screen. Making all Meta Data external to the code allows one set of code to be used in place of many programs. The article describes how this can be achieved, what Meta Data is required, and shows examples related to a Data Warehouse implementation where a single set of code standardizes and normalizes the data from tens of files. Further examples are from display and capture screens.
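
A minimal sketch of the idea (the field names and layout here are invented for illustration): the record layout lives in external Meta Data, so one generic routine can unpack records from any fixed-length file it is handed a layout for.

    # Sketch: one generic unpack routine driven entirely by external Meta Data.
    layout = [                                     # Meta Data, held outside the code
        {"name": "cust_id", "start": 0,  "length": 6,  "type": "int"},
        {"name": "name",    "start": 6,  "length": 20, "type": "text"},
        {"name": "balance", "start": 26, "length": 10, "type": "decimal"},
    ]

    def unpack(record: str, layout) -> dict:
        out = {}
        for field in layout:
            raw = record[field["start"]: field["start"] + field["length"]].strip()
            if field["type"] == "int":
                out[field["name"]] = int(raw)
            elif field["type"] == "decimal":
                out[field["name"]] = float(raw)
            else:
                out[field["name"]] = raw
        return out

    sample = "000042" + "Acme Ltd".ljust(20) + "1234.50".rjust(10)
    print(unpack(sample, layout))   # {'cust_id': 42, 'name': 'Acme Ltd', 'balance': 1234.5}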

A management perspective of the "J" programming language Summary: The J language is a powerful interpretive language with a concise syntax. Its very power and conciseness make it difficult for a novice to master, but also make it of interest to a manager because little code and effort is required to achieve results. This article discusses the features of J which make it applicable to typical commercial processing tasks. It provides simple code examples to illustrate for a manager how little code may be required to implement powerful data manipulation capabilities. Hopefully the article will be of value to those evaluating high level languages suitable for commercial processing in a client/server or mainframe environment.

Buy or Build? Summary: Application Packages appear to offer the lowest cost solution for most organizations, yet there are many practical reasons why they end up costing much more than anticipated. On the other hand, custom development of applications to meet the business needs appears to be the most costly solution, yet this conclusion can be based on obsolete approaches. This article provides arguments for the custom approach based on a Technical Architecture and the use of high level languages for the creation of tools in which the custom applications are developed. Such approaches have been shown to have a major impact on costs and time to implement.

A Persistent Cache for Distributed Applications Summary: A persistent cache (P-Cache), and the mechanisms for managing it, solves many of the problems associated with distributed applications. It manages the currency of local data and software on a PC or server relative to a master file, which in turn supports local access to reference and other data needed in applications. Given that the cost of a network of PCs is largely the cost of managing the software and data in a distributed environment, the persistent cache approach can provide a low cost solution, at least for the business applications. However, it does require that file content and indexes be specifically designed to support such a capability, and it is not likely that it can be easily applied to existing legacy applications.
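
A minimal sketch of the currency check (under assumed names and a simplified version stamp, not the article's design): each locally cached item carries a version, which is compared with the master's current version before use and refreshed only when stale.

    # Sketch: keep a local, persistent copy and refresh it only when stale.
    import json, os

    CACHE_FILE = "pcache.json"                     # hypothetical local cache file

    def load_cache():
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as fh:
                return json.load(fh)
        return {}

    def get(key, master_version, fetch_from_master):
        cache = load_cache()
        entry = cache.get(key)
        if entry is None or entry["version"] < master_version:   # stale or missing
            entry = {"version": master_version, "data": fetch_from_master(key)}
            cache[key] = entry
            with open(CACHE_FILE, "w") as fh:                     # persist across runs
                json.dump(cache, fh)
        return entry["data"]

    # usage: only items whose master version has advanced are fetched again
    price_list = get("price_list", master_version=7,
                     fetch_from_master=lambda k: {"widget": 9.95})
    print(price_list)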

Y2000 - A Continuing Problem Summary: The partial solutions being implemented for the Year 2000 problem will result in continuing costs for most organizations.

Computer Form Factors - USB and 1394 Impact Summary: An opinion piece about the potential impact of the USB and 1394 standards on the form factor of future computers, along with some questions about the market and business impacts of small, modular, easily connected components.

Author's Bio

Last updated 1999 05 10.

Copyright © 1996-1999 K. I. Gordon. All rights reserved.
Permission to reproduce in full is freely given by contacting
gordon@island.net