Entity-Relationship DiagramA good model and a proper database design form the foundation of an information system. Building the data layer is often the first critical step towards implementing a new system, and getting it right requires attention to detail and a whole lot of careful planning. A database, like any computer system, is a model of a small piece of the real world. And, like any model, it’s a narrow representation that disregards much of the complexity of the real thing.

Modern database systems rely on the relational model to store and retrieve data. The name comes from the relationship between columns in a table (not because you can relate tables to one another). In other words, relational means that several values that belong to the same row in a table are related.

The primary key is an attribute (or a combination of attributes) that uniquely identifies a row. Though not strictly required by relational mathematics, primary keys make it reasonably easy to deal with relational data programmatically. They make mapping relational data to an object-oriented model feasible, and allow applications to uniquely identify and manipulate each entity (row) in the database.

Natural Keys

The concept of a unique identifier is familiar in the real world — you use account numbers to identify credit cards, addresses to identify buildings or houses, etc. These are examples of natural keys, real-world identifiers that are used to uniquely identify real-world objects.

In general, if the data you are modeling has a decent natural identifier you should use it as the primary key. That said, not all natural keys make good primary keys. The goal of the primary key is to uniquely identify an entity in your database. It does not have to describe the entity. The fact that a particular identifier can be used to describe an object in the real world doesn’t mean it’s a good primary key.

There are a number of desirable (not necessarily required) primary key characteristics that natural identifiers sometimes lack:

  • Unique values: The primary key must uniquely identify each row in a table.
  • Non-intelligent: The primary key should not have embedded semantic meaning. In other words, it should not describe characteristics of the entity. A customer ID of 398237 is typically preferred over Michael J. Malone.
  • No change over time: The value of a primary key should never change. Changing a primary key value means you’re changing the identity of an entity, which doesn’t make sense. Non-intelligent keys are preferred because they are less likely to change.
  • Single-attribute: A primary key should have the minimum number of attributes possible. Single-attribute primary keys are desirable, because they’re easier for applications to work with, and they simplify the implementation of foreign keys.
  • Numeric: It’s easier to manage unique values when they are numeric. Most database systems have internal routines that facilitate auto-incrementing primary key attributes. While these facilities are useful, be careful not to use them as a crutch.

For each of these rules, there are exceptions. For example, composite primary keys are particularly useful as identifiers in join tables, modeling a many-to-many relationship. And an otherwise suitable single-value natural key should not be disregarded simply because it’s not numeric.

Surrogate Keys

When a natural key doesn’t exist, or when the natural key isn’t up to snuff, it’s time to consider using a surrogate key (also called a synthetic key) to uniquely identify entities. A surrogate primary key is typically a numeric, single-attribute key and is often auto-generated by the database system. While some DBAs continue to debate their use, surrogate primary keys are pretty much accepted practice these days.

A surrogate key is meaningless by itself. Thus, it has no embedded semantic meaning. The sole purpose of a surrogate key is to uniquely identify entities, and to facilitate relational operations like joins and filters. It’s a single, unique value that never has to change (because its only job is to identify the entity). Thus, it’s an ideal primary key.

Because surrogate keys always consist of a single attribute, they can simplify business logic. If the column name for the primary key of a table can be derived from the table name, for example, a code generator can be used to built a primitive database abstraction layer. These keys, in combination with the table name, act as a globally unique identifier at runtime. This identifier can be used to build sophisticated caching mechanisms, facilitate lazy-loading, and simplify serialization.

Conclusion

As the foundation of many information systems, database design should be carefully planned and properly implemented. Choosing the proper primary key is a critical task in modeling relational data. If possible, entities should have a unique identifier that has meaning rather than some obscure sequential integer. But the natural identifier need not be the primary key — it’s perfectly acceptable to use a synthetic, or surrogate key as a tables primary key. That said, don’t use auto-generated primary keys in order to avoid identifying and properly handling the natural keys present in your data.