What is database denormalization?
Database denormalization is a process for designing a database to enable faster data reads. One way to do this is by introducing redundant data where necessary.
To explain the concept of denormalization further, consider a database table that stores a school’s student records. Each student belongs to at most one department and takes a specific number of courses from that department. To design a denormalized students table, you can add a field for department and a field for total_courses. For a student not assigned to a department, the values for department and total_courses can be empty, or null. For students assigned to a department, the correct department and total_courses are stored in those fields.
The following illustrates the students table:
Notice how the department value for Samuel is empty because he belongs to no department. As a result, his total_courses is 0.
A major advantage of this database design is that you can retrieve the department for a student and the total number of courses they offer with a single query.
In contrast with the above design, a normalized design might store the department and course information in a separate table, with each row linked back to a student. Then, to retrieve the name of a student along with their department, you would need to write a more complex query using a JOIN. This leads to slower reads.
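The contrast between the two designs can be sketched with SQLite. This is a minimal, hypothetical version of the students table from the example above; the column names (department, total_courses) and student data are illustrative assumptions:

```python
import sqlite3

# Hypothetical denormalized students table, modeled on the article's example.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        department TEXT,                  -- NULL when the student has no department
        total_courses INTEGER DEFAULT 0   -- duplicated here to avoid a JOIN
    )
""")
conn.executemany(
    "INSERT INTO students (name, department, total_courses) VALUES (?, ?, ?)",
    [("Ada", "Physics", 4), ("Samuel", None, 0)],
)

# Denormalized read: one query against one table, no JOIN required.
row = conn.execute(
    "SELECT department, total_courses FROM students WHERE name = ?", ("Ada",)
).fetchone()
print(row)  # ('Physics', 4)
```

A normalized design would instead keep department data in its own table and reach it through a JOIN; the single-table read above is what denormalization buys.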
Difference between denormalization and normalization
Denormalization is the opposite of normalization. The following table compares these database design processes side-by-side.
| Denormalization | Normalization |
| --- | --- |
| In a SQL database, denormalization involves adding more columns to a table, which may lead to data duplication. | Normalization promotes creating more tables to reduce data redundancy and inconsistency. |
| Writing and updating are harder because duplicate data may exist, and failing to update every instance of a value can lead to inconsistency. | Write and update operations are easier in a normalized database design. |
| Denormalization simplifies querying and speeds up read operations. | A normalized database design often requires multiple JOIN queries to read data, which can lead to a drop in performance. |
| Denormalization is a common practice in NoSQL database systems, most of which store data in documents instead of relational tables. | Most relational database management systems (SQL databases) promote normalization by default. Data is usually stored in multiple tables that can be joined using primary and foreign keys. |
Why denormalize data?
A primary reason for denormalizing a database is to increase read speed in a large database. In addition, denormalization decreases the complexity of read queries by reducing the need to join multiple tables.
What data should be denormalized?
Good data to denormalize are values from one table that are frequently required in order to work with another table. You can denormalize this type of data by duplicating it as a field in the table where it is needed.
Another kind of data to denormalize is the result of a calculation done on data in a large table, such as the total number of items in an inventory. Storing a duplicate of this result as a field in a table speeds up read operations, compared to querying the items in the table and counting all rows for a specific item each time that result is needed.
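The precalculated-result pattern can be sketched as follows. The table names (items, item_totals) are hypothetical, but the idea matches the inventory example above: store the count once on write, then read it back cheaply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per physical item, plus a table that stores
# the precalculated count per item name.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE item_totals (name TEXT PRIMARY KEY, total INTEGER)")

conn.executemany("INSERT INTO items (name) VALUES (?)",
                 [("bolt",), ("bolt",), ("bolt",), ("nut",)])

# Write path: refresh the stored totals whenever inventory changes.
conn.execute("""
    INSERT OR REPLACE INTO item_totals (name, total)
    SELECT name, COUNT(*) FROM items GROUP BY name
""")

# Read path: a cheap primary-key lookup instead of a COUNT(*) scan of items.
total = conn.execute(
    "SELECT total FROM item_totals WHERE name = ?", ("bolt",)
).fetchone()[0]
print(total)  # 3
```

The trade-off is visible in the write path: every change to items now requires refreshing item_totals as well.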
What data should not be denormalized?
Although denormalization can improve the performance of read operations, in certain cases it adds unnecessary complexity to a database design.
For example, in a database that stores a list of customers and the groceries they order, you wouldn’t want to denormalize each customer’s orders by adding extra columns to the customers table for every order. Denormalization doesn’t work well here because a customer may place any number of orders, while a table has a fixed set of columns, so you would have to alter the table each time that limit is exceeded. The following table shows an attempt to denormalize the customers table:
If Tom places a fifth order, you’d need to alter the entire customers table by adding a new column to store it. The value of the new column would also be empty, or null, for every other customer without a fifth order. In addition to altering the table, you may also have to update your application code to recognize the new column. This is clearly more expensive than having separate tables: one for customers and one for their orders.
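The normalized alternative keeps orders in their own table, so any number of orders fits without a schema change. This is a minimal sketch with hypothetical table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical normalized schema: orders live in their own table, so a
# customer can place any number of orders without altering either table.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        item TEXT
    )
""")
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Tom')")
conn.executemany(
    "INSERT INTO orders (customer_id, item) VALUES (?, ?)",
    # The fifth order is just another row -- no ALTER TABLE needed.
    [(1, "milk"), (1, "eggs"), (1, "bread"), (1, "rice"), (1, "tea")],
)

order_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id = 1"
).fetchone()[0]
print(order_count)  # 5
```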
Strategies for denormalization
There are many ways to design a denormalized table to improve performance. Here are three strategies.
1. Introduce duplicate or redundant data
This involves adding a field that stores a value already stored in another table to reduce the use of JOIN operations and make reads faster.
Take, for example, a social media application that has a users table and a profile_pictures table. Displaying a user’s details and profile picture on a page would require a JOIN operation across both tables. Denormalization can eliminate this complexity: adding a profile_picture field to the users table to hold the path to the profile picture file allows a simpler query that should perform faster than a JOIN.
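Both reads can be compared side by side. The schema below is a hypothetical sketch of the users and profile_pictures tables named above; the duplicated profile_picture column is the denormalized field:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        username TEXT,
        profile_picture TEXT   -- denormalized copy of the file path
    )
""")
conn.execute("""
    CREATE TABLE profile_pictures (
        user_id INTEGER REFERENCES users(id),
        path TEXT
    )
""")
conn.execute("INSERT INTO users VALUES (1, 'ada', '/img/ada.png')")
conn.execute("INSERT INTO profile_pictures VALUES (1, '/img/ada.png')")

# Normalized read: requires a JOIN across two tables.
joined = conn.execute("""
    SELECT u.username, p.path FROM users u
    JOIN profile_pictures p ON p.user_id = u.id WHERE u.id = 1
""").fetchone()

# Denormalized read: one table, no JOIN.
direct = conn.execute(
    "SELECT username, profile_picture FROM users WHERE id = 1"
).fetchone()
print(joined == direct)  # True
```

Both queries return the same data; the denormalized one simply avoids the JOIN.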
2. Store precalculated data in a new field
It is possible to improve the performance of a database by storing the result of a calculation in a field. This eliminates the need to perform the calculation every time the data is queried. For example, in a normalized database that has a table for students and another table for courses, one way to calculate the total number of students taking a course would be to query the students table for students taking the course using a WHERE clause.
But denormalization simplifies the task by storing the total number of students taking a course as a column in the courses table. As a result, a single query can find the total number of students taking a course and also return other details about the course.
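A sketch of this pattern, with a hypothetical total_students column on the courses table and an enroll helper that keeps it in sync:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema with a denormalized total_students column on courses.
conn.execute("""
    CREATE TABLE courses (
        id INTEGER PRIMARY KEY,
        title TEXT,
        total_students INTEGER DEFAULT 0
    )
""")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, course_id INTEGER)"
)
conn.execute("INSERT INTO courses (id, title) VALUES (1, 'Databases')")

def enroll(name, course_id):
    # The write path does double duty: insert the student AND keep the
    # precalculated count in sync, inside a single transaction.
    with conn:
        conn.execute(
            "INSERT INTO students (name, course_id) VALUES (?, ?)",
            (name, course_id),
        )
        conn.execute(
            "UPDATE courses SET total_students = total_students + 1 WHERE id = ?",
            (course_id,),
        )

enroll("Ada", 1)
enroll("Samuel", 1)

# Read path: course details and the headcount come from a single row.
row = conn.execute(
    "SELECT title, total_students FROM courses WHERE id = 1"
).fetchone()
print(row)  # ('Databases', 2)
```

Wrapping both statements in one transaction is what guards against the inconsistency risk that denormalized counts introduce.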
3. Create report generation tables
In systems that generate reports from large databases, you can gain a performance boost by creating extra tables that store summaries of the data. Reading a precomputed summary is faster than recalculating it from the raw data each time you need a report.
A good example is a table that stores monthly sales reports based on data from a sales table with thousands of records. This data can be denormalized by creating another table named monthly_sales that stores a summary of each month’s sales as a row. A row might represent a month, such as January, with columns storing the total sales made that month.
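The monthly_sales idea can be sketched as a periodic rebuild job. The sales schema and date format below are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical raw sales table plus a report table summarizing sales per month.
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, sold_on TEXT, amount REAL)"
)
conn.execute("CREATE TABLE monthly_sales (month TEXT PRIMARY KEY, total REAL)")

conn.executemany(
    "INSERT INTO sales (sold_on, amount) VALUES (?, ?)",
    [("2024-01-03", 100.0), ("2024-01-19", 50.0), ("2024-02-07", 75.0)],
)

# Periodic job: rebuild the summary so reports never scan the raw rows.
conn.execute("""
    INSERT OR REPLACE INTO monthly_sales (month, total)
    SELECT strftime('%Y-%m', sold_on), SUM(amount) FROM sales
    GROUP BY strftime('%Y-%m', sold_on)
""")

# Report query: a single-row lookup instead of a scan over thousands of sales.
january = conn.execute(
    "SELECT total FROM monthly_sales WHERE month = '2024-01'"
).fetchone()[0]
print(january)  # 150.0
```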
1. Why denormalize data?
Data denormalization speeds up read operations and reduces complex JOIN operations in queries.
2. What’s an example of denormalization?
A good example of denormalization is introducing an extra field to a table to duplicate data available in another table.
3. When should data be denormalized?
Denormalization can benefit a database design whenever read speed is more important than avoiding data redundancy. In other words, denormalize whenever duplicate data doesn’t break anything and makes queries more efficient by eliminating an extra JOIN, reducing the need to read data from multiple tables.
To learn about time series databases, check out InfluxDB.