What should I choose for my primary key?
We all know there are very few scenarios where a table in your database should not have a primary key. Such a key is an essential tool for normalization of your data, and ensuring that entities are not erroneously duplicated.
It is clear from my observations that far too many people throw a database together and *then* realize that they should have had a primary key to begin with. A quick google on the subject demonstrates why you should choose your primary key(s) at the outset, rather than in retrospect; without a primary key or other unique constraint, you will certainly be going back at some point to clean up your data.
Natural vs. Surrogate
There are two fundamental types of primary keys: natural and surrogate / artificial. A natural key is a piece of data (one or more columns of an entity) that uniquely identify that entity, and distinguish it from any other row in the table. A natural key is controlled from outside the environment, so an example might be the VIN number for your car, when your vehicle is first added to your dealership's maintenance tracking system. Some will argue that a VIN number is really a surrogate, because it is arbitrarily assigned by the vehicle manufacturer. However, it is one piece of data associated with a car that is unique (unlike make, model and year, since your car is very unlikely to be one-of-a-kind), and will never change (unlike its owner's SSN, or its owner's name, or its license plate, or its registered address, or even its color).
A surrogate key is one which is not a part of the entity; usually assigned by the database or the application that first entered the row into the database, this is most often an IDENTITY or GUID() / UNIQUEIDENTIFIER column. Surrogate keys are useful because they are arbitrary. (One of the difficulties in choosing a natural key is that just about any natural key you can think of has the potential to change, and since the value of a surrogate key usually has no meaning to users, there is unlikely to be any demand for a change in the values represented.)
Ideally, a primary key is implemented using a natural key; however, this is rarely practical. Natural keys tend to be rather wide; for example, in a small company you might consider using Social Security Number (SSN) as a primary key. This data is usually stored as a CHAR(9) or CHAR(11), and therefore it is very inefficient to index, difficult to use in joins, increases storage space, and cumbersome to replicate across multiple tables as foreign keys. In addition, even though CASCADE can propogate primary key changes throughout your database, it is usually not wise to repeat data (e.g. use as a primary key in one table, and a foreign key in others) that is likely to change. Usually, it is better to store this data once, then updates only have to affect one place. A rule I have learned: if the data is meaningful, or even visible, to end users, or if it is in any way outside of the control of your database, then it will change at some point. So, from a stability standpoint, natural keys very rarely make a good choice.
Never mind that SSN is a poor choice for a primary key anyway. One problem is that, in many cases, people are not required to reveal their SSN — so they could opt out of that field, and leave your primary key NULL. For employment this is probably a non-issue, but you can imagine other systems where people are logged in a database but SSN is not a legal requirement. SSNs can be re-issued after death, so unless your system is constantly cleaning out data for people who are deceased, the chance of duplicates is there. And finally, many of your customers or users might not have an SSN, because they are from another country that does not have an equivalent; or, worse yet, they are from another country that *does* have an equivalent system, and their number is already represented in your system.
Still, SSN seems to be about the most reasonable choice for uniquely identifying an employee within a company, since names are not unique and can change, birthdates are not unique, and even the combination of name and birthdate is not necessarily unique (though this is unlikely in a small company). The next step would be to use name + birthdate + street address, and such a multiple-column primary key would be ridiculously large (and is still not impossible to logically duplicate; imagine the case of George Foreman's sons all working for the same company). You will likely want to avoid multiple-column primary keys if possible, because they make foreign key constraints unnecessarily complex, and as an added bonus, will not work in ASP.NET datagrids (which require a single-column primary key). And would you like to be responsible for designing and maintaining the child tables, where you must set up foreign key constraints and perhaps triggers and cascades for data integrity between the tables? Or how about programming the queries that JOIN the parent table to the child table based on a 12-column key relationship?
Enter the surrogate key. When an employee is added to the system, the system can automatically assign an EmployeeID value (either through an IDENTITY column or a GUID()). Now, there should still be some element of the above choices implemented as a unique constraint (e.g. forcing SSN to be unique), because the surrogate key alone should not be what makes a row unique. A surrogate key implemented as an IDENTITY or GUID() will make joins more efficient, will make the CASCADE issue irrelevant (since users neither know nor care about the IDENTITY value associated with Bob's Bait and Tackle, so long as the company name is forced to remain unique), and will keep the primary and foreign keys restricted to one column.
IDENTITY vs. GUID()
There have been plenty of pissing contests between proponents of each of these surrogate key choices. Like natural vs. artificial, don't let anybody tell you which is "right" — make an informed decision.
Personally, I like IDENTITY. It is a small column, is useful for data warehousing (unlike natural keys), guarantees uniqueness (only if you're careful!), and is completely automated by the system. The primary argument against IDENTITY is that it is proprietary and non-portable; however, we must make decisions sometimes where we weigh current performance and manageability against the likelihood that you will be porting to a different platform (this usually has more to do with the amount of hardware and licensing fees already invested, than anything else). And even still, most products have an equivalent, such as SEQUENCE/NEXTVAL in Oracle, and AUTOINCREMENT in Access.
Of course, these issues might not be paramount in your environment. So, here is a list of pros and cons of IDENTITY and GUID(), so you can weigh the issues effectively.
In the end, nobody can really tell you whether you should use an IDENTITY or a GUID(), or whether you should use a natural or artificial key; they can guide you, certainly, but nobody knows your data and intent better than you. Hopefully the examples and narrative above will help you make a more informed choice; on the other hand, you could always read Celko's advice.
Related ArticlesHow do I build a query with optional parameters?
How do I calculate the median in a table?
How do I create a store locator feature?
How do I deal with MEMO, TEXT, HYPERLINK, and CURRENCY columns?
How do I deal with multiple resultsets from a stored procedure?
How do I debug my SQL statements?
How do I determine if a column exists in a given table?
How do I enable or disable connection pooling?
How do I enumerate through the DSNs on a machine?
How do I find a stored procedure containing <text>?
How do I get a list of Access tables and their row counts?
How do I get the latest version of the JET OLEDB drivers?
How do I handle alphabetic paging?
How do I handle BIT / BOOLEAN columns?
How do I handle error checking in a stored procedure?
How do I ignore common words in a search?
How do I page through a recordset?
How do I present one-to-many relationships in my ASP page?
How do I prevent duplicates in a table?
How do I prevent my ASP pages from waiting for backend activity?
How do I prevent NULLs in my database from mucking up my HTML?
How do I protect my Access database (MDB file)?
How do I protect my stored procedure code?
How do I protect myself against the W32.Slammer worm?
How do I remove duplicates from a table?
How do I rename a column?
How do I retrieve a random record?
How do I return row numbers with my query?
How do I send a database query to a text file?
How do I simulate an array inside a stored procedure?
How do I solve 'Could not find installable ISAM' errors?
How do I solve 'Operation must use an updateable query' errors?
How do I temporarily disable a trigger?
How do I use a SELECT list alias in the WHERE or GROUP BY clause?
How do I use a variable in an ORDER BY clause?
Should I index my database table(s), and if so, how?
Should I store images in the database or the filesystem?
Should I use a #temp table or a @table variable?
Should I use a view, a stored procedure, or a user-defined function?
Should I use recordset iteration, or GetRows(), or GetString()?
What are all these dt_ stored procedures, and can I remove them?
What are the limitations of MS Access?
What are the limitations of MSDE?
What are the valid styles for converting datetime to string?
What datatype should I use for my character-based database columns?
What datatype should I use for numeric columns?
What does "ambiguous column name" mean?
What is this 'Multiple-step OLE DB' error?
What is wrong with 'SELECT *'?
What naming convention should I use in my database?
What should my connection string look like?
When should I use CreateObject to create my recordset objects?
Where can I get this 'Books Online' documentation?
Where do I get MSDE?
Which database platform should I use for my ASP application?
Which tool should I use: Enterprise Manager or Query Analyzer?
Why are there gaps in my IDENTITY / AUTOINCREMENT column?
Why can I not 'open a database created with a previous version...'?
Why can't I access a database or text file on another server?
Why can't I use the TOP keyword?
Why do I get 'Argument data type text is invalid for argument [...]'?
Why do I get 'Not enough space on temporary disk' errors?
Why does ASP give me ActiveX errors when connecting to a database?
Should I use COALESCE() or ISNULL()?
Where can I get basic info about using stored procedures?