Database Design: Choosing a Primary Key

Database, Tutorials, Web Development August 14th, 2007 - 18,655 views

Entity-Relationship Diagram A good model and a proper database design form the foundation of an information system. Building the data layer is often the first critical step towards implementing a new system, and getting it right requires attention to detail and a whole lot of careful planning. A database, like any computer system, is a model of a small piece of the real world. And, like any model, it’s a narrow representation that disregards much of the complexity of the real thing.

Modern database systems rely on the relational model to store and retrieve data. The name comes from the relationship between columns in a table (not because you can relate tables to one another). In other words, relational means that several values that belong to the same row in a table are related.

The primary key is an attribute (or a combination of attributes) that uniquely identifies a row. Though not strictly required by relational mathematics, primary keys make it reasonably easy to deal with relational data programmatically. They make mapping relational data to an object-oriented model feasible, and allow applications to uniquely identify and manipulate each entity (row) in the database.

Natural Keys

The concept of a unique identifier is familiar in the real world — you use account numbers to identify credit cards, addresses to identify buildings or houses, etc. These are examples of natural keys, real-world identifiers that are used to uniquely identify real-world objects.

In general, if the data you are modeling has a decent natural identifier you should use it as the primary key. That said, not all natural keys make good primary keys. The goal of the primary key is to uniquely identify an entity in your database. It does not have to describe the entity. The fact that a particular identifier can be used to describe an object in the real world doesn’t mean it’s a good primary key.

There are a number of desirable (not necessarily required) primary key characteristics that natural identifiers sometimes lack:

Unique values: The primary key must uniquely identify each row in a table.
Non-intelligent: The primary key should not have embedded semantic meaning. In other words, it should not describe characteristics of the entity. A customer ID of 398237 is typically preferred over Michael J. Malone.
No change over time: The value of a primary key should never change. Changing a primary key value means you’re changing the identity of an entity, which doesn’t make sense. Non-intelligent keys are preferred because they are less likely to change.
Single-attribute: A primary key should have the minimum number of attributes possible. Single-attribute primary keys are desirable, because they’re easier for applications to work with, and they simplify the implementation of foreign keys.
Numeric: It’s easier to manage unique values when they are numeric. Most database systems have internal routines that facilitate auto-incrementing primary key attributes. While these facilities are useful, be careful not to use them as a crutch.

For each of these rules, there are exceptions. For example, composite primary keys are particularly useful as identifiers in join tables, modeling a many-to-many relationship. And an otherwise suitable single-value natural key should not be disregarded simply because it’s not numeric.

Surrogate Keys

When a natural key doesn’t exist, or when the natural key isn’t up to snuff, it’s time to consider using a surrogate key (also called a synthetic key) to uniquely identify entities. A surrogate primary key is typically a numeric, single-attribute key and is often auto-generated by the database system. While some DBAs continue to debate their use, surrogate primary keys are pretty much accepted practice these days.

A surrogate key is meaningless by itself. Thus, it has no embedded semantic meaning. The sole purpose of a surrogate key is to uniquely identify entities, and to facilitate relational operations like joins and filters. It’s a single, unique value that never has to change (because its only job is to identify the entity). Thus, it’s an ideal primary key.

Because surrogate keys always consist of a single attribute, they can simplify business logic. If the column name for the primary key of a table can be derived from the table name, for example, a code generator can be used to built a primitive database abstraction layer. These keys, in combination with the table name, act as a globally unique identifier at runtime. This identifier can be used to build sophisticated caching mechanisms, facilitate lazy-loading, and simplify serialization.

Conclusion

As the foundation of many information systems, database design should be carefully planned and properly implemented. Choosing the proper primary key is a critical task in modeling relational data. If possible, entities should have a unique identifier that has meaning rather than some obscure sequential integer. But the natural identifier need not be the primary key — it’s perfectly acceptable to use a synthetic, or surrogate key as a tables primary key. That said, don’t use auto-generated primary keys in order to avoid identifying and properly handling the natural keys present in your data.

19 Responses to “Database Design: Choosing a Primary Key”

Mike Wilson Says:
August 15th, 2007 at 11:05 am
I would say from my experience that you PK should *always* be surrogate. Long integer based for local tables or GUID if the table will be replicated. No need for further discussion, if you follow these simple rules for all of your tables (no matter how small) you will never have a problem related to PK’s and you can focus on other issues.
mike Says:
August 15th, 2007 at 11:45 am
I completely agree. But I still think it’s important to identify the natural keys in your data and place unique key constraints on them. Otherwise the database engine can’t maintain the integrity of your data.

Basically, my point is that you can identify and constrain the natural keys appropriately, then add a surrogate key to simplify the database design / application logic.
FlySwat Says:
August 15th, 2007 at 11:45 am
I prefer to use GUID/UUID on my tables as the primary keys, over an autoincrementing integer, but that is just my opinion.

Surrogate keys make life simple.
webdude Says:
August 15th, 2007 at 12:56 pm
what software do you use to design your database?
Mark Says:
August 15th, 2007 at 1:27 pm
>what software do you use to design your database?
My brain.
mike Says:
August 15th, 2007 at 1:40 pm
Yea, for simple databases I don’t typically use any software (I might draw an ERD by hand). For a larger project I sometimes use software to create UML diagrams. Honestly, what keeps me from doing this more often is the lack of good open source UML software. Poseidon for UML by Gentleware is ok, but it’s not great. If you have a budget, Rational Rose is pretty good. And if you’re just looking for a simple tool for entity-relationship diagrams you can use Visio (not my favorite software, but it’s an option).
planetmcd Says:
August 15th, 2007 at 2:21 pm
I often use surrogate keys for my tables, solely because they simplify joins. If a tables natural key is >3 fields, it can be a huge pain PITA to add all the fields to every query.

I do see a bit of a problem, particularly in the less experienced members of the Rails community, of not understanding that a surrogate key is just that, a surrogate for a natural key.

There’s nothing wrong with a surrogate key so long as you understand what it represents.
mhanney Says:
August 15th, 2007 at 3:06 pm
I want to raise the ORM consideration - since moving to an Object Relational Mapper (HNibernate) I never use natural keys as PKs even though they clearly exist in our models…

If your application is object oriented and you are using an ORM technology in your data access layer then you would almost certainly be using surrogate keys in the way you describe. All persistent objects would have the same type of primary key identifier, usually a long int or GUID/UUID, and be derived from a PersistentObject base class.

The PK for each object could be generated by a single sequence generator to ensure PKs are unique system wide - very convenient if you have to serialize or cache an object by a unique key in a global pool of objects of mixed types. Better still use GUIDs to ensure PKs are unique throughout the Universe.

I also have natural keys with unique constraints as described because it is not always desirable to fetch by the primary key. For example, a product entity may have a unique UPC bar code or SKU that is always used in our fetch plan.
links for 2007-08-16 « D e j a m e S e r Says:
August 16th, 2007 at 10:21 am
[...] Database Design: Choosing a Primary Key - I’m Mike (tags: dba) [...]
Ron Says:
August 16th, 2007 at 2:06 pm
I have run quite a few big database design efforts over the last 20 years, and there are always the same problems, and very few people that understand the issues and stable solution paths/patterns.

One thing I would throw in is that here recently, (last 5 years) we have been moving from a traditional 2nd/3rd normal relational design centered on noun/verb data modeling, to early render strategies for the web. And this most often significantly changes the services / data modeling required in the DB design. Now, with the early render model, we do not even keep a fully relational model of the data space at all. Although it would not hold up to rigorous definition, I think we have started to use the DB with more of a second tier object service store.
thanks
phloid domino Says:
August 19th, 2007 at 1:39 am
natural keys are correct
surrogate keys are implementation

these days many practitioner get this confuse
they think that what everybody does is right thing
but
they are wrong

should always model with natural key,
even if multi column

as article says, primary key is not part
of the relational theoritical foundation

it is optimisation at implementation time

thus
logical model => natural key
physical model => surrogate key ok
vikesh Says:
August 21st, 2007 at 11:59 pm
wich one have better performance unique key + not null

or the primary key and why
MarownIOM Says:
February 15th, 2008 at 6:42 pm
I agree with Ron. For a web application with a limited number of tables a surrogate key may seem easier and faster. Who wants to type out all those join conditions anyway?

I consider Surrogate keys as part of the denormalization process which comes after the database has been designed. A Surrogate can speed up an index where needed; but, will make it more difficult for a human to relate to. There should be a compelling reason for denormalizing data. If you resort to Surrogates on each table then you have dispensed with 1NF (in my opinion). This is the crux of First Normal Form (1NF): each attribute consists of one piece of meaningful data which has its own unique purpose. An arbitrary primary key is a redundant (and meaningless) piece of data. Relying upon Alternate Keys to handle your data uniqueness increases the number of indexes your database must maintain thus slowing down inserts and causing more fragmentation when you delete.

If you forget to add an AK then redundant records will (according to Murphy) pile up and the problem may not be found for a long time. When the problem is found which record holds the correct data? This could become a support headache.
mike Says:
February 15th, 2008 at 8:40 pm
@MarownIOM,

Interesting argument. I’d disagree w/ you re: surrogate keys making the database harder to understand, but that’s an opinion. You don’t consider code simplification and performance gains a compelling enough reason to use surrogate keys?

Re: first normal form, I definitely disagree. First normal form is a fairly rigorously defined concept, so opinion shouldn’t have anything to do with it… “you should depend on the key, the whole key, and nothing but the key.” That’s first normal form. And surrogate keys don’t violate that condition.

Second normal form is related to full functional dependency of non-prime attributes, which isn’t an issue if you don’t have any composite primary keys. So if you’re in 1NF and you’re using surrogate keys, you’re in 2NF too. Third normal form has to do with transitive dependency on the primary key, and again is not related to surrogate keys.

I’ll skip the rest, you get the point.
Azeem Says:
February 25th, 2008 at 4:10 am
I would just like to ask which do you think is the better one of the following two;
Using surrogate keys or natural keys?
I an’t decide which one is more better for a more advanced user of Microsoft Access.
Thankyou, and how you can help me decide.
MarownIOM Says:
February 26th, 2008 at 11:02 am
The definition of 1NF is debated in colleges as well as in the work place (and on Wikipedia before someone deletes it). Database design almost becomes a branch of theology or at least ideology with frequent burning at the stakes (or complaining to the boss.)

Here’s an example from a system I currently support. I’ve picked up support for this application after 3 different people have worked on it over the last year. Only one person seemed to understand how to implement surrogates in a design.

The naming convention was to have one field in the PK named Id for every table and there are 75 tables. FK field names change from “Id” to “TablenameId”. One table ended up looking like this:

Employee
Id (PK) Identity
EmployeeId (AK)
SupervisorId (FK to Id)
LastName
FirstName
…etc…

There is an unfortunate problem now with naming conventions. The previous developers would occasionally reference the wrong field in queries “EmployeeID” vs “Employee.Id”. Why do that when you already have a unique EmployeeId field?

—

Arguement 1: Some argue that part of the description of 1NF is that each attribute is not only unique but uniquely named throughout the database. By this definition the use of “Id” by itself breaks 1NF. Proponents of this concept would say the “Id” field should be qualified either by including the table name or an abbreviation for the table on each field. So the “Employee” table “Id” field could have any of these names: “EmployeeId”, “empl_Id”, “empl_nId” (which also may include a limited or full Hungarian notation which is so old school.)

Argument 2: each attribute consists of one piece of meaningful data. When a surrogate replaces a natural key for uniqueness the surrogate is not meaningful. Saying a surrogate is the PK does not endow it with meaning. The rest of the attributes in the table do not really depend upon the surrogate (PK) but the natural key (AK) (which now breaks 2NF).

Argument 3: each attribute consists of one piece of meaningful data which has its own unique purpose. The intent of “unique purpose” is within the domain of reality you are modeling not solely used to relate tables. “The key, the whole key, and nothing but some arbitrary value with the sole purpose of making joins easier”

—

My primary concern for any database is the validity of the data. There must be a balance between speed and validity but I tend to err on the side of validity. The more fields in your table the more code you must write to maintain it. More code — more bugs.

Simplicity is the essence of successful databases.
“Things should be made as simple as possible – but no simpler.”
-Dr. Einstein
mike Says:
February 26th, 2008 at 12:40 pm
@Azeem, in my opinion you should have a unique natural key for each column if at all possible. A surrogate key can be added to simplify queries, improve performance, etc. Again, this is my opinion.

@MarownIOM, I’m not going to disagree with you :). I think you’ve made a lot of good points. Hopefully anyone who reads through this post will make it this far in the comments.

You clearly know enough about DB design to make reasonable decisions. But, as you said, reasonable minds do differ on some of these points. The important thing is to understand the trade offs of whatever architectural decision you make. In my opinion, for the type of applications I typically work with (scalable, db intensive, web apps for the most part) surrogate keys simplify the code base and improve db performance.
Mikey Says:
April 21st, 2008 at 6:31 pm
So much of what I’ve seen in this debate hinges on composite PKs being replaced by single column numeric surrogate keys. I don’t know if our shop is different, but we have debated this ourselves for some time; the difference is that, while we certainly subscribe to the idea that a surrogate key is better than a 5-column composite key, what about those tables that contain a few rows of business-driven codes, where the code is a single column and a key candidate itself, yet some still insist on taking on a surrogate key onto it. Some have even claimed that this results in faster queries.

Example: There might be a Language table, and a Country table. Each one has an arbitrary surrogate key, and a meaningful column which serves as an alternate key. Those tables are referenced in several other places in the database.

Let’s say, on one of those tables, you have 1M rows. You want to get back all of those rows that have a language code of “ENG” and a country code of “US”. But because of these arbitrary keys, the table instead has meaningless numeric identifiers, and you have to add joins to both tables to correlate the numeric value to the actual meaningful code you’re looking for.

So, instead of:

select * from table_1 where language_cde = ‘ENG’ and country_cde = ‘US’

You have:

select * from table_1 t1, language lang, country ctry
where t1.language_id = lang.language_id
and t1.country_id = ctry.country_id
and lang.language_cde = ‘ENG’
and ctry.country_cde = ‘US’

I keep hearing the argument that surrogate keys reduce the coding involved. And, in the case of large composite keys, they absolutely do. But what about in cases like this? Here, obviously surrogate keys greatly increases the coding necessary to retrieve the data.

I also don’t see how this can’t be a performance hit. It depends obviously on how the particular DBMS interprets the join and optimizes the query plan, but it’s extra work nonetheless.

Even fully optimized… the actual FK value is looked up once and stored… what if you’re not searching on, but retreiving, several pieces of information from the master table, and for each row you have to perform a lookup on another table to determine what that numeric value means so I can display the code?

A simple test on a 50,000 row table, shows five times the I/O cost when pulling a business code from a lookup table via 1 simple join, as opposed to pulling the code directly off the parent table itself.

What’s the consensus here? Obviously there’s something to be said for consistency - the theory that if you implement surrogate keys, you should apply them consistently to every table in your model. There’s also the argument that this kind of ‘codec’ data shouldn’t even be in the database, but confined to the applications. Then one side argues that doing this eliminates the referential integrity you gain by having FKs, but then the other side argues that you have to be able to trust your application to supply valid values, implement a check constraint if you’re really concerned about invalid values, etc etc.

I get the feeling that you’re going to get a different answer to this from every DBA you ask, agreement on *composite* keys aside.

Thanks in advance for comments.
Mark Johansen Says:
August 20th, 2008 at 9:46 pm
In my humble opinion, advocates of surrogate keys get excessively dogmatic. I absolutely agree that there are many cases where a surrogate key is the best answer, but I don’t believe it’s a simple case of surrogate=good, natural=bad. Indeed, in my book — shameless plug coming — “A Sane Approach to Database Design”, available at “www.electrictactics.com”, I devote a sizable chunk of the chapter on keys to discussing the pros and cons of natural vs surrogate.

Yes, if your best natural key requires five fields, it’s going to be way awkward, and you should use a surrogate key. But there are lots of reasons to use a natural key. (a) Mikey makes one of the same points I make in my book: Use of a natural key can eliminate joins, thus improving performance. Do we really need to have a state_id and look it up in a state table to find the two-letter state abbreviation? What is gained by this? (b) We usually need to include natural keys in the system anyway, because users need them. Like, just because we create a state_id doesn’t mean that we don’t need to store the state postal abbreviation and name. So the surrogate key is one more field to lug around, increasing the size and complexity of our database. (c) If we display the surrogate key for the user when there is an obvious natural key, it is confusing. Why do we have both a part number and a part id? If we don’t show it, then curious anomolies become possible, like we can have the same part number for two different parts, or data corruption can result in the part number disappearing while the data about the part is still there.

I generally use surrogate keys in only two cases:

1. There is no good natural key. Some programmers try to make a natural key by cramming together several fields that they think probably should mostly sort of make a unique identifier. Like, I once saw a “natural key” of customer last name plus birth date plus city of residence. I can just imagine the programmer saying to himself, What are the chances that two people would have those three things all the same? But surely any such thrown-together combination could be non-unique by coincidence. In this case, even by non-coincidence: what about twin brothers who both still live in the city where they were born?

2. The natural key requires more than two or three fields. At that point it’s just getting too awkward.

I’m Mike