THE PRINCIPLE OF ORTHOGONAL DESIGN PART 1
THE PRINCIPLE OF ORTHOGONAL DESIGN PART 1
by Chris Date and David McGoveran

 

 

 

Update Notice

 

The following is the draft of the abstract of a planned paper on this subject.

 


Database design is more art than science, that is, it is mainly of an informal nature. Whatever formality is there, it has been limited to what we currently call the Principle of Full Normalization (POFN), expounded in The Costly Illusion: Normalization, Integrity and Performance (paper #3 in this series). In 1993, however, McGoveran and Date presented in a two-part paper an additional formal design principle, The Principle of Orthogonal Design (POOD). Jointly, the two principles are intended to prevent redundancy, update anomalies,  harder to understand databases, and other practical complications in database design.

 

The POOD formally defined in the original paper was informally described as prohibition of tables with overlapping meanings. But there is now disagreement between the two authors on the original definition of POOD, which has been questioned by at least one (see On Non-Loss Decomposition); Date has since revised his formal definition in Data Redundancy and Database Design, Further Thoughts Part 1, but it is quite complex, it does not have a very easy informal description, and we are not sure about some of the assumptions behind it.

 

The purpose of this paper is to further clarify POOD, for a better understanding that would enable adherence with relative ease by the average practitioner.


Posted 1/13/06

 

 

 

The work described in this two-part article was done jointly by David McGoveran and myself during 1993. 

 

 

It was the view update question that was the overall spur to our work.  I had been developing the outlines of a new approach to view updating, one that enjoyed several nice properties and overall looked quite promising, but I kept running into the problem that the approach I was proposing could occasionally lead to strange results. Then David pointed out that those strange results occurred only if the database design was "strange" too in a certain sense, and proposed a new design principle that said, in effect "Don't design databases in this strange way."  And that observation led directly to the production of the original versions of the view update articles.

 

Despite the foregoing, I must make it clear that avoiding view update anomalies is far from being the sole benefit that accrues from the new design principle.  Indeed, I would claim that, like the principles of further normalization, the new principle injects a little rigor into a field (database design) that as I have written elsewhere is still very much of an art, not a science.

 

 

1. Introduction

 

The purpose of this paper is to present and describe a new and, we believe, very fundamental database design principle.  Like the well-known principles of normalization, our principle can be seen in part as a systematic means of avoiding redundancy, and thereby avoiding certain update anomalies that might otherwise occur.  Although the principle is very simple, we have not previously seen it articulated in published form; in fact, the experience of one of us (McGoveran) suggests that the principle is quite often violated in practice.  We briefly explore some of the consequences of such violations.

 

 

2. The Loves-Hates Example

 

We begin with a simple example.  Consider the following database:

 

   CREATE DOMAIN PERSONS ... ;

 

   CREATE BASE TABLE LOVES

        ( X DOMAIN ( PERSONS ),

          Y DOMAIN ( PERSONS ) ... ) ;

 

   CREATE BASE TABLE HATES

        ( X DOMAIN ( PERSONS ),

          Y DOMAIN ( PERSONS ) ... ) ;

 

The intended semantics are, of course, that the row <x,y> appears in LOVES only if "x loves y" is true, and the row <x,y> appears in HATES only if "x hates y" is true (Throughout this paper we use a modified form of conventional SQL syntax, for reasons of simplicity and explicitness.)

 

Now suppose the row <Romeo,Juliet> is to be inserted into this database.  The user responsible for the INSERT will presumably insert the row into base table LOVES.  Note carefully, however, at he or she could just as easily have inserted the row into base table HATES instead.  It is only because the user had some additional information namely, the knowledge that "Romeo loves Juliet" is true that he or she decided to insert the row into LOVES and not into HATES.  Since that additional information is not known to the DBMS, the "decision procedure" for deciding whether a given row should be inserted into LOVES or HATES is likewise not known to the DBMS.  In other words, part of the meaning of the database is concealed from the system.

 

Note, moreover, that it is not just the DBMS that is affected by this concealment.  Exactly the same information is being concealed from other users as well.  That is, given the same row <Romeo,Juliet>, another user will be just as unable to decide which of LOVES and HATES the row is to go into if he or she does not have the necessary extra information (viz., that "Romeo loves Juliet" is true).  To put the point another way:  Suppose we are told that for a certain pair of persons x and y either "x loves y" is true or "x hates y" is true, but we are not told which.  In general, then, we will only be able to tell which of the two possibilities is in fact the case by looking to see which of the two base tables the row <x,y> appears in.

Note further that even when we have found which table the row <x,y> appears in, we still don't really know which of the two possibilities is the casewe know only that the row appears in LOVES or HATES, as the case may be.  And lest the reader object that we surely do now know which possibility is the case, since the necessary information is represented by the names of the two base tables involved (LOVES and HATES), let us now rename those tables ABC and XYZ, respectively.  Now can you tell which of "x loves y" and "x hates y" is true?  The answer, of course, is that you can tell only if you know that ABC "means" loves and XYZ "means" hates.  In fact, of course, there is nothing to stop us renaming LOVES as HATES and HATES as LOVES ... Now can you tell?

 

Before attempting to draw any conclusions from the foregoing, let us move on to examine another example.

 

 

3. The Employees Example

 

Suppose we have a database concerning employees, in which every employee has a (unique) employee number EMP#, a name ENAME, a department number DEPT#, and a salary SAL.  Further, suppose that we decide (for some reason the precise reason is not important for the moment) to represent employees by two base tables, EMPA and EMPB. See Fig. 1 for some sample values where

 

·    EMPA contains rows for employees in department D1;

·    EMPB contains rows for employees who are either not in department D1 or have a salary in excess of 33K.

 

 EMPA                           EMPB

+----------------------------+ +----------------------------+

¦ EMP# ¦ ENAME ¦ DEPT# ¦ SAL ¦ ¦ EMP# ¦ ENAME ¦ DEPT# ¦ SAL ¦

+------+-------+-------+-----¦ +------+-------+-------+-----¦

¦ E1   ¦ Lopez ¦ D1    ¦ 25K ¦ ¦ E2   ¦ Cheng ¦ D1    ¦ 42K ¦

¦ E2   ¦ Cheng ¦ D1    ¦ 42K ¦ ¦ E3   ¦ Finzi ¦ D2    ¦ 30K ¦

+----------------------------+ ¦ E4   ¦ Saito ¦ D2    ¦ 45K ¦

                               +----------------------------+

 

Fig. 1: Base tables EMPA and EMPB (first version): sample values

 

The reader will surely agree that this is a bad design.  But why exactly is it bad?  Some insight into this question can be gained by considering the following scenario.  Suppose first of all that we start off with an empty database (i.e., base tables EMPA and EMPB both contain no rows at all).  Suppose next that we are asked to insert information regarding employee E2 (name Cheng, department D1, salary 42K) into this database.  We construct the row

 

   <E2,Cheng,D1,42K>

 

But which base table do we put it in?  The answer, obviously, has to be both (as suggested by Fig. 1).  It has to be both, because the new row satisfies both (a) the criterion for membership in EMPA (the department number is D1), and (b) the criterion for membership in EMPB (the salary is greater than 33K).  After all, if the row were to be inserted into just one of the two base tables, the question is which one?  There are no grounds, except arbitrary ones, for choosing either table over the other.

 

(In fact, if we put the row into just one of the tables, say, EMPA and not EMPB, we could be accused of a contradiction.  For the appearance of the row in EMPA would mean that Cheng works in department D1 and earns 42K, while the simultaneous nonappearance of the row in EMPB would mean that Cheng either does not work in department D1 or does not earn more than 33K.)

 

Thus we see that one reason the design is bad is that it leads to redundancy:  The very same information is represented twice, in two distinct base tables.

 

Of course, it is fairly easy to see what causes the redundancy in this particular example.  To be specific, there is a certain lack of independence between the two base tables EMPA and EMPB, inasmuch as "their meanings overlap" (it is possible for the same row to satisfy the membership criterion for both).  Lack of independence between objects is generally to be avoided if possible, because it implies that changes to one will require changes to the other as well.  For instance, a DELETE on one table might require a DELETE on another (as is indeed the case in our EMPA-EMPB example, if we wish to delete the information for employee E2).

 

Thus it looks as if a good design principle might be "Don't have tables whose meanings overlap" and indeed, so it is.  Before we can make this principle more precise, however, we need to examine the question of the meaning of a table in greater depth.  Note in particular that the principle as just-- very loosely! --Articulated is not sufficient in itself to explain what is wrong with the LOVES-HATES example discussed earlier.

 

 

4. Integrity Constraints

 

In order to discuss what tables mean, we must first digress for a few moments to consider the general issue of integrity constraints.  We classify such constraints into four kinds, namely domain constraints, column constraints, table constraints, and database constraints (see Constraints and Predicates Parts 1, 2, 3).

 

Ø       A domain constraint is just the definition of the set of values that go to make up the domain in question; in other words, it effectively just enumerates the values in the domain (We mention domain constraints merely for completeness they play no part in the database design principle that is the primary focus of this paper.

 

Ø       A column constraint states that the values appearing in a specific column must be drawn from some specific domain.  For example, consider base table EMPB from the previous section.  The columns of that table are subject to the following column constraints:

 

·  e.EMP# IN EMP#_DOM

·  e.ENAME IN NAME_DOM

·  e.DEPT# IN DEPT#_DOM

·  e.SAL IN US_CURRENCY_DOM

 

Here e represents an arbitrary row of the table and EMP#_DOM, NAME_DOM, etc., are the names of the relevant domains.

 

Ø       A table constraint states that a specific table must satisfy some specific condition, where the condition in question refers solely to the table under consideration i.e., it does not refer to any other table, nor to any domain.  For example, here are two table constraints for the base table EMPB:

 

1.     e.DEPT# =/ 'D1' OR e.SAL > 33K

 

2.     IF e.EMP# = f.EMP# THEN e.ENAME = f.ENAME

AND e.DEPT# = f.DEPT#

AND e.SAL = f.SAL

 

The first of these is self-explanatory.  The second says that if two rows e and f have the same EMP# value, then they also have the same ENAME value, the same DEPT# value, and the same SAL value in other words, they are the same row.  (Of course, this is just a longwinded way of saying that EMP# is a candidate key.  Naturally we assume that all tables do have at least one candidate key! i.e., duplicate rows are not permitted.)

 

Note: Observe that we talk of table constraints in general, not just base table constraints.  The point is, all tables, base or otherwise, are subject to table constraints, as the authors have discussed in detail elsewhere [5].  For present purposes, however, it is indeed base table constraints in particular that are of primary interest; for the remainder of this paper, therefore, we will take the unqualified term "table constraint" to mean a base table constraint specifically, barring explicit statements to the contrary.

 

Ø       A database constraint states that the overall database must satisfy some specific condition, where the condition in question can refer to as many tables as desired.  For example, suppose the database of Fig. 1 was extended to include a departments table DEPT.  Then the referential constraints from EMPA and EMPB to that table DEPT would both be database constraints (they would both in fact refer to exactly two tables).

 

 

5. The Question Of Meaning

 

Now we can get back to our discussion of what tables (and indeed databases) mean.  The first point is that every table, be it a base table, a view, a query result, or whatever certainly does have an associated meaning.  And, of course, users must be aware of those meanings if they are to use the database correctly and effectively.  For example, the meaning of base table EMPB is something like the following:

 

"The employee with the specified employee number (EMP#) has the specified name (ENAME), works in the specified department (DEPT#), and earns the specified salary (SAL).  Furthermore, either the department number is not D1 or the salary is greater than 33K (or both).  Also, no two employees have the same employee number."

 

Formally, the foregoing "meaning" is an example of what is called a predicate, or truth-valued function, a function of four arguments, in this particular case.  Substituting values for the arguments is equivalent to invoking the function (or "instantiating" the predicate), thereby yielding an expression that evaluates to either true or false.  For example, the substitution

 

   EMP# = 'E3' ENAME = 'Finzi' DEPT# = 'D2' SAL = 30K

 

yields the value true.  By contrast, the substitution

 

   EMP# = 'E3' ENAME = 'Clark' DEPT# = 'D2' SAL = 25K

 

yields the value false.  And at any given time, of course, the table contains exactly those rows that make the predicate evaluate to true at that time. 

 

It follows from the foregoing that if (for example) a row is presented as a candidate for insertion into some table, the DBMS should accept that row only if it does not cause the corresponding predicate to be violated.  More generally, the predicate for a given table represents the criterion for update acceptability for that table, that is, it constitutes the criterion for deciding whether or not some proposed update is in fact valid (or at least plausible) for the given table.  In other words, such a predicate corresponds to what we earlier referred to as the membership criterion for the table in question.

 

In order for it to be able to decide whether or not a proposed update is acceptable for a given table, therefore, the DBMS needs to be aware of the predicate for that table.  Now, it is of course not possible for the DBMS to know exactly what the predicate is for a given table.  In the case of base table EMPB, for example, the DBMS has no way of knowing a priori that the predicate is such that the row <E3,Finzi,D2,30K> makes it true and the row <E3,Clark,D2,25K> does not; it also has no way of knowing exactly what certain terms appearing in that predicate (such as "works in" or "earns") mean.  However, the DBMS certainly does know a reasonably close approximation to that predicate.  To be specific, it knows that, if a given row is to be deemed acceptable, all of the following must be true:

 

·         The EMP# value must be a value from the domain of employee numbers

·         The ENAME value must be a value from the domain of names

·         The DEPT# value must be a value from the domain of department numbers

·         The SAL value must be a value from the domain of US currency

·         Either the DEPT# value is not D1 or the SAL value is greater than 33K (or both)

·         The EMP# value is unique with respect to all such values in the table

 

In other words, for a base table such as EMPB, the DBMS does at least know all the integrity constraints (column constraints and table constraints) that have been declared for that base table.  Formally, therefore, we can define the (DBMS-understood) "meaning" of a given base table to be the logical AND of all column constraints and table constraints that apply to that base table (and it is this meaning that the DBMS will check whenever an update is attempted on the base table in question).  For example, the formal meaning of base table EMPB is:

 

e.EMP#  IN EMP#_DOM AND

e.ENAME IN NAME_DOM AND

e.DEPT# IN DEPT#_DOM AND

e.SAL   IN US_CURRENCY_DOM AND

(e.DEPT# =/'D1' OR e.SAL > 33K) AND

(IF e.EMP# = f.EMP# THEN e. ENAME = f.ENAME AND

    e.DEPT# = f.DEPT# AND e.SAL = f.SAL) 

 

We will refer to this expression; let us call it PE as the table predicate for base table EMPB.

 

Incidentally, note how the foregoing remarks serve to point up once again the fundamental importance of the relational domain concept.  Relational vendors should be doing all within their power to incorporate proper domain support into their DBMS products.  It is perhaps worth pointing out too that "proper domain support" here does not mean support for the very strange construct called "domains" in the SQL/92 standard. 

 

To return to the main thread of our discussion:  As indicated above, in order for the DBMS to be able to decide whether or not a given update is acceptable on a given table, the DBMS needs to be aware of the table predicate that applies to the table in question.  Now, the DBMS certainly is aware of the relevant predicate in the case of a base table, as we have just seen.  But what about derived tables e.g., what about views?  What is the table predicate for a derived table?  Clearly, what we need is a set of rules such that if the DBMS knows the table predicate(s) for the input(s) to any relational operation, it can deduce the table predicate for the output from that operation.  Given such a set of rules, the DBMS will then know the table predicate for all possible tables. 

 

It is in fact very easy to state such a set of rules they follow immediately from the definitions of the relational operators.  For example, if A and B are any two type-compatible tables tType-compatibility is usually referred to as union-compatibility in the literature.  We prefer our term for reasons that are beyond the scope of the present discussion.) and their respective table predicates are PA and PB, then the table predicate PC for table C, where C is defined as A INTERSECT B, is obviously (PA) AND (PB); that is, a row r will appear in C if and only if it appears in both A and Bi.e., if and only if PA(r) and PB(r) are both true.  So if, for example, we define C as a view and try to insert r into that view, r must satisfy both the table predicate for A and the table predicate for B, or the INSERT will fail.

 

Here is another example:  The table predicate for the table that results from the restriction operation

 

T WHERE condition

 

is (PT) AND (condition), where PT is the table predicate for T.  For example, the table predicate for EMPB WHERE SAL < 40K is

 

(PE) AND (SAL < 40K)

 

where PE is the table predicate for EMPB as defined earlier.

 

Stating the table predicates corresponding to the other relational operators is left as an exercise for the reader.

 

We conclude this section by remarking that although it is somewhat irrelevant to the main theme of this paper the overall database has a formal meaning too, just as the individual base tables do.  The meaning of the database the database predicate for that database is essentially the logical AND of all individual table predicates for the (named) tables in that database, together with all database constraints that apply to that database.

 

Originally published in Database Programming & Design 7, No. 6 (June 1994) and published as a two-part article in RELATIONAL DATABASE WRITINGS1991-94. It is republished here by permission of David McGoveran, Miller Freeman Inc. and Pearson Education,  Inc. Research has shown that certain detail level corrections might be needed, which we may undertake in the future. However, we still believe strongly that the overall approach is sound.

 

(Continued in Part 2)