Databases: Normalization or Denormalization. Which is the better technique?

This has long been a debate: which approach is more performance oriented, normalized databases or denormalized databases? This article is a step on my part to figure out the right strategy, because neither of these approaches can be rejected outright. I will start off by discussing the pros and cons of both approaches.

Pros and cons of a normalized database design.

Normalized databases fare very well under conditions where the application is write-intensive and the write load is heavier than the read load. This is because of the following reasons:

  • Normalized tables are usually smaller and have a smaller footprint because the data is divided vertically among many tables. This allows them to perform better, as they are small enough to fit into the buffer.
  • Updates are very fast because the data to be updated is located in a single place and there are no duplicates.
  • Similarly, inserts are very fast because the data has to be inserted in a single place and does not have to be duplicated.
  • Selects are fast in cases where data has to be fetched from a single table, because normally normalized tables are small enough to fit into the buffer.
  • Because the data is not duplicated, there is less need for heavy-duty GROUP BY or DISTINCT queries.

Although there seems to be much in favor of normalized tables, with all the pros outlined above, the main cause of concern with fully normalized tables is that normalized data means joins between tables. And joining means that read operations suffer, because indexing strategies do not go well with table joins.

Now let's have a look at the pros and cons of a denormalized database design.

Pros and cons of a denormalized database design.

Denormalized databases fare well under heavy read load, when the application is read-intensive. This is because of the following reasons:

  • The data is present in the same table, so there is no need for any joins; hence the selects are very fast.
  • A single table with all the required data allows much more efficient index usage. If the columns are indexed properly, results can be filtered and sorted using the same index. In a normalized schema, by contrast, the data would be spread out across different tables, so this would not be possible.

Although selects can be very fast on denormalized tables for the reasons mentioned above, because the data is duplicated, the updates and inserts become complex and costly.

Having said that, neither approach can be neglected entirely, because a real-world application is going to have both read loads and write loads. Hence the correct way is to utilize both the normalized and denormalized approaches, depending on the situation.

Using normalized and denormalized approaches together.

The most common way of mixing the denormalized and normalized approaches is to duplicate related columns from one table into another table. Let me show you with an example.

Suppose you have a products table and an orders table.
The normalized approach would be to have only the product_id in the orders table, and all the other product-related information in the products table.
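
Here is a minimal sketch of that normalized schema; the column types and the price column are assumptions made purely for illustration:

CREATE TABLE products (
    product_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    product_name VARCHAR(255) NOT NULL,
    price        DECIMAL(10, 2) NOT NULL  -- assumed extra product attribute
);

CREATE TABLE orders (
    order_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    product_id INT UNSIGNED NOT NULL,
    order_date DATETIME NOT NULL,
    FOREIGN KEY (product_id) REFERENCES products (product_id)
);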

But that would make a query that filters by product_name and sorts by order_date inefficient, because the two columns are stored in different tables.

In a fully normalized schema, such a query would be performed in the following manner:

SELECT product_name, order_date
FROM orders INNER JOIN products USING (product_id)
WHERE product_name LIKE 'A%'
ORDER BY order_date DESC;

As you can see, MySQL will have to scan the order_date index on the orders table and then look up the corresponding product_name in the products table to see if the name starts with 'A'.

The above query can be drastically improved by denormalizing the schema a little bit, so that the orders table now includes the product_name column as well.
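
The schema change itself is a one-liner; the VARCHAR(255) type here simply mirrors the assumed column definition in the products table:

ALTER TABLE orders ADD COLUMN product_name VARCHAR(255) NOT NULL;

The query then becomes: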

SELECT product_name, order_date
FROM orders
WHERE product_name LIKE 'A%'
ORDER BY order_date DESC;

See how the query has become much simpler: there is no join now, and a single composite index on the columns (product_name, order_date) can be used to do the filtering as well as the sorting.
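
Such an index could be created like this (the index name is just for illustration):

ALTER TABLE orders ADD INDEX idx_product_name_order_date (product_name, order_date);

Since both columns the query touches are part of this index, MySQL can answer the query from the index alone, without reading the full table rows.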

So can both techniques be used together? Yes they can, because real-world applications have a mix of read and write loads.

Final words.

Although a denormalized schema can greatly improve performance under extreme read loads, the updates and inserts become complex, as the data is duplicated and hence has to be updated or inserted in more than one place.

One clean way to go about solving this problem is through the use of triggers. For example, in our case, where the orders table has the product_name column as well, an update to the value of product_name can simply be handled in the following way:

  • Set up a trigger on the products table that propagates any change to product_name into the orders table (a sketch follows this list).
  • Execute the update query on the products table. The data is automatically updated in the orders table because of the trigger.
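
Here is a minimal sketch of such a trigger, using the table and column names from the example above:

DELIMITER //

CREATE TRIGGER products_after_update
AFTER UPDATE ON products
FOR EACH ROW
BEGIN
    -- Propagate the new name to the duplicated column in the orders table
    IF NEW.product_name <> OLD.product_name THEN
        UPDATE orders
        SET product_name = NEW.product_name
        WHERE product_id = NEW.product_id;
    END IF;
END//

DELIMITER ;

With this in place, a single UPDATE on the products table keeps every duplicated copy of product_name in the orders table consistent.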

However, when denormalizing the schema, do take into consideration the number of times you will be updating records compared to the number of times you will be executing selects. When mixing normalization and denormalization, focus on denormalizing tables that are read-intensive, and keep tables that are write-intensive normalized.

Ovais is a storage architect with keen focus on reliability, efficiency and performance of OLTP databases, more specifically MySQL. He currently works at Uber on ensuring storage reliability and scalability. Previously, he helped different web properties scale their MySQL infrastructure. He also worked at Percona as a Senior MySQL consultant and at a few other startups in various capacities.

  • Samnan

    Hi Ovais,
    One of the most useful techniques to solve the same issue is to use replication. One copy for read operations and the other for write operations not only ensures a data backup, but also optimal performance for both read and write operations, if the replication is configured properly and the tables are indexed the right way.

  • Replication can improve performance, but the point that I am trying to make here is that there should be a mix of normalization and denormalization. Even if selects are made on replicated data, there are still going to be table joins in the example that I have given, and joins don't really go well with indexes. So it's best to mix and match.

  • Greg

    It all sounds good in theory, but how do you actually implement these approaches behind an ORM library like Hibernate or Active Record?

  • If we take a look at the definition of Active Record according to Martin Fowler, it is as follows:
    "An object that wraps a row in a database table or view, encapsulates the database access, and adds domain logic on that data."

    So, similarly, there can be an Active Record class that maps directly to the denormalized table.

    In the example of the orders table I have given in the post, an Active Record object would map directly to a row of the orders table, with one exception: the duplicated column "product_name" would be read-only and shouldn't be updated from the Active Record class of the orders table.

    Then there would be another Active Record object that would map directly to a row of the products table.

    The duplicate column updates would be handled by a trigger on the products table, as mentioned in my post.

  • Great site. A lot of useful information here. I’m sending it to some friends!

  • M. Sohaib Shaikh

As we know, there are several normal forms, so normalize the database up to a level which fits both read and write operations. Replication of data is not the best solution for a heavy database.

  • M. Sohaib Shaikh

    Another thing I would like to point out for newcomers and readers is that the developer should not write overly complex queries. These increase the load on the database server, which reduces its performance. So try to do the work in the front-end programming language (if you are using one), i.e. try to move that load to the client side.

  • Thanks for the information.

  • Great article, man. You covered the topic in detail, and the sample queries made my understanding easy and clear. I agree that nothing is an absolute gem; we need to strike a balance between the two.

    Javin

    10 basic mysql commands to remember

  • @openid-55717:disqus thanks mate.

  • Guest

    Spelling correction: Note that “…databases fare well…” (not “fair well”).

  • Carol Southern

Excellent guidelines. I always wondered about that: which to use. I never thought of using both, let alone how/when to do it.

  • James Attard

In practice I use normalized tables in OLTP environments (complex joins should not be a performance concern if you choose the correct indexes), and non-normalized tables in data warehouse environments. (www.jamesattard.com)

  • Guest

    Great post!

  • Kelvin Jones

    A fantastic article. I’m left with no questions about how it all fits together and which route suits my needs. Many thanks.

  • Steve RD

    I found this article helpful, thank you. Therefore I made a reference in my blog to this article.

    http://soa-java.blogspot.nl/2013/01/database-guidelines-rdbmssql.html

  • Very interesting info! Perfect, just what I was searching for!

  • Great site. A lot of useful information here. Thanks!!

  • ravi shanker

    Automatic Normalization DATABASE?

  • Thanks mate.

  • wadesworld

One facet you fail to point out is the data anomalies that can arise from denormalization. While denormalization sometimes cannot be avoided for performance reasons, it's important for young developers to realize that there can be significant detrimental effects on their data as a result. They should not optimize prematurely, and should only denormalize if they've done the testing to prove their performance concern is valid.

  • That’s a good point. As with every solution you choose, there is a trade-off, and it really depends on the use case. Should you use denormalization in everyday projects? I don't think so. But then again, when you hit it big and have lots of queries that need to order data, you are probably better off maintaining denormalized tables, either by having the actual table denormalized or by maintaining materialized views. Sadly, MySQL does not have materialized views, so it comes down to having separate denormalized tables that duplicate the data from other tables.

  • ECM ECM

The first thing that came to my mind when reading Ovais' post is the point you raise here. However, the critical question should be how or why writing complex queries creates data-load issues. In other words, how can we correctly/scientifically measure the rate of loss of efficiency in data load using these queries in normalized and un-normalized databases?

    We should not confuse what is clear and simple versus long and complex for human understanding (when writing those queries) with how efficient those queries are when the computer reads and loads data using them.

    I myself strongly support a highly normalized database, one of the reasons being that the actual normalization resembles and helps better control the business rules and application logic that can be created for it (that way you get a less complicated application overall).

  • Duplicating data into denormalized tables is a general strategy used by read-heavy and large-scale web applications, because it's costly to join tables when you are joining thousands upon thousands of rows. If you add sorting into the mix, that's another reason to denormalize, because MySQL can only choose an index for sorting from the first table in the join. That's not an issue if you denormalize, because then you only fetch data from a single table.

  • Every technology has limitations and strong points, and the same is the case with MySQL. Joining many tables is a complex, time-consuming operation because you are traversing many trees, and sometimes it's not feasible, for example when you have requirements such as "95% of the queries should finish in 5 milliseconds", which is not uncommon for some of the web shops that I work with. In such cases you have to be creative and employ other techniques, and denormalization is one of them. A materialized view is a sort of denormalization. Unfortunately, MySQL does not have a notion of materialized views, hence you create denormalized tables alongside the normalized tables, so that you can use them for queries that would otherwise involve querying multiple tables and ordering data.

  • Zahid hussain

This is a great source to learn about DBMS.

  • planecrash

    I agree completely. Start with the normal forms until you run into a reason not to.