Data Modeling 104: De-normalization
This article continues the series on basic data modeling techniques. I will be referencing data modeling often in the future, as it applies to various types of DBMS engines and specific approaches for scaling your database. If it’s not obvious already, I consider data modeling a fundamental for many aspects of making your database perform and scale the way you need it to; therefore consider these articles a reference foundation for many articles yet to come.
In the previous column, I covered database normalization techniques and examples, illustrating how you can improve the integrity and simplicity with this skill. Now that you have spent time learning about (and hopefully applying) this approach, let’s go through the first detailed examples on database de-normalization.
Why do you need to de-normalize your data model?
After all the stress I put on proper normalization of your model, why would you want to de-normalize it?
The answer is simple. Using database normalization you have eliminated redundancy and inappropriate relationships in your model. This is vital to ensure that your model design is correct , easy to understand, and as flexible as possible in meeting the needs of your application.
However, a fully normalized model does not always perform as fast as you would like, and sometimes is inconvenient to use given the number of joins that may be required for common functions. Therefore, as I touched upon in the last article, you de-normalize your model when required for performance and convenience. This will give you an even better data model, one that will perform the way you expect and be as natural as possible for your developers.
In general, database de-normalization means adding some redundant data elements or structures in order to match specific requirements of your application. There is no hard and fast rule you can apply as to when to de-normalize your data model, but I will cover some examples with the Angry Shards game data model to give you a feel for when and how to use this technique.
The Angry Shards Game Database
In the previous article, I extended the _Angry Shards _example database , resulting in Figure 1.
Figure 1: The Angry Shards database as we left it last issue.
The model so far supports the ability to have a player play a game, resulting in a player_game for each instance of a game that the player plays. In addition I added the player_game_round to support one or more rounds for a single game, enriching the game to support multiple round_level values to make the game more challenging and interesting.
Now I’ll extend the model in some new ways, features that can benefit from database de-normalization.
A Game Leaderboard
A Leader Board is an extremely common, yet challenging feature to add to a database in an efficient manner. The purpose of a leaderboard is to provide the top scoring player list so any player can see how they compare to others. Usually the list is limited to a specific number of player entries, but in some more advanced game databases, it is desirable to have a Leader Board allow any player to see where they rank.
For this first example, our leaderboard will be limited to the top 10 player scores in the database.
A typical approach to accomplishing a leaderboard is to utilize aggregation functions on the database. For example, in SQL we can do it with a query that looks like this:
SELECT player_id, MAX(player_score) AS TOP_SCORE FROM player_game GROUP BY player_id ORDER BY TOP_SCORE LIMiT 10
This certainly will do the trick, giving you a list of the Top 10 player_id key values and the related player_score. The output will look something like this:
player_id TOP_SCORE --------- --------- 101 4795 203 3850 557 3297 …
This is somewhat useful, but I doubt that users will know who is who when given just a player_id in the result. Let’s modify the query like this to provide the player screen_name:
SELECT pg.player_id, p.screen_name, MAX(player_score) AS TOP_SCORE FROM player_game pg, player p WHERE pg.player_id = p.player_id GROUP BY player_id, p.screen_name ORDER BY TOP_SCORE LIMiT 10
Now our output will look like this:
player_id screen_name TOP_SCORE --------- ----------- 101 joe 4795 203 mary 3850 557 sam 3297 … player_id screen_name TOP_SCORE --------- ----------- 101 joe 4795 203 mary 3850 557 sam 3297 …
Now this is more useful, we can easily display the screen_name with the results, far more meaningful.
So what is wrong with this approach, and how could de-normalization help?
As the player_game table grows, this query will get slower and slower. In an earlier article, Why is my Database Slowing Down?, I covered the main enemies of database performance. The most important enemy was the table scan. In a table scan the table must be searched from top to bottom – every single row – in order to satisfy the query. This query would certainly cause a table scan, and further it must compare every player_score value and sort the values in order to return the result. This is required to identify the MAX(player_score) value for each player_id. While an index on player_score might help, by itself it won’t lower the cost of the search much, since every row must be compared. These are extremely intensive operations for the DBMS engine to perform, and over time will certainly degrade in performance. Further, a query like this, if frequently run, could degrade the entire application.
There is a simple answer in this case, one that is already allowed for in our data model. We will use the high_score column on the player table, a simple de-normalization to solve the problem. Here is the logic that is required:
- Whenever a player completes a player_game_round, increment the player_score in the player_game table. This can be done with a simple UPDATE statement in a relational database.
- At the same time, increment the high_score on the player table, again with an UPDATE for that purpose.
Listing 1 shows what the SQL statements look like (assuming the player got a score of 1857 in the current player_game_round, and a total player_game score of 3277), but you can use the language of any DBMS engine to do something similar:
-- Insert the player_game_round row INSERT INTO player_game_round (status_code,... round_score,..., player_game_id) VALUES (10,..., 1857,..., 6093) ; -- Update player_game, incrementing the player_score UPDATE player_game SET player_score = player_score + 1857 WHERE player_game_id = 6093 ; -- Update player, incrementing the high_score, only if this -- score is higher than the last high_score recorded UPDATE player SET high_score = 3277 WHERE player_id = 7699 AND high_score < 3277 ; **End**
Using this simple de-normalization, now we can add an index on _highscore, and make a very efficient query to generate the leaderboard result:
SELECT screen_name, high_score FROM player ORDER BY high_score LIMIT 10 ;
This query, as long as the high_score index is in place, will avoid the table scan, and can easily retrieve the leaderboard with very little processing work. Further, it avoids the join and aggregate function used in the normalize example above. The main cost of this approach is with the extra UPDATE statements, but since they only are done for one player and player_game at a time, these operations are isolated and should not cause excessive locking on other concurrent player activity. This should show how de-normalization helps with both performance and convenience, making the job easier all the way around.
Other Reasons to De-Normalize
There are other reasons to de-normalize, and involve more advanced techniques that I will cover in future articles. Here are some of the many problems that can be solved with judicious use of database de-normalization:
- Resolving so-called self-joins, where a data table has an implicit parent-child relationship. This is very common, especially in social network type applications (which of course includes game applications, making it a natural follow-on to this series of articles).
- Resolving complex relationships, especially across many table structures or lists.
- How and why to integrate multiple types of DBMS engines into your application. With the plethora of great DBMS engines today, there are often major advantages to utilizing the right combination of tools. At first it may not appear that using multiple DBMS engines is de-normalization, but in fact this solution fits well into this category.
- Modeling structures for NoSQL/NewSQL DBMS engines, where de-normalization is often far more prevalent as a requirement.
- Scaling your database efficiently, using de-normalization to minimize distributed operations. Definitely the motto of database scalability is a _shard-nothing _environment, with a minimum of distributed reads and writes. There are specific de-normalization techniques developed for accomplishing this.
Wrapping it up
As you can see, database de-normalization is an important topic, with wide-reaching consequences both for performance and accessibility (i.e., convenience) of data access. This article covered one aspect of this technique, there are many more to come in future entries to the Scaling Big Data column.
Cory Isaacson is CEO/CTO of CodeFutures Corporation. Cory has authored numerous articles in a variety of publications including SOA Magazine, Database Trends and Applications, and recently authored the book Software Pipelines and SOA. Cory has more than twenty years experience with advanced software architectures, and has worked with many of the world’s brightest innovators in the field of high-performance computing. Cory has spoken at hundreds of public events and seminars, and assisting numerous organizations address the real-world challenges of application performance and scalability. In his prior position as president of Rogue Wave Software, he actively led the company back to a position of profitable growth, culminating in a successful acquisition by a leading private equity firm. Cory can be reached at: email@example.com