7 Data Modeling Best Practices
By Charles Mahler / Aug 04, 2023 / Developer, Community
This article was originally published in The New Stack and is reposted here with permission.
While you can often get away without much planning on smaller projects, when building complex software it becomes essential for success. And probably the most important thing you want to get right is the data model for your application.
The best-selling book Designing Data-Intensive Applications gives a solid rationale for why data models are so critical: “Data models have such a profound effect not only on how the software is written but also on how we think about the problem that we are solving.”
This quote gets at the root of why we even make software in the first place: to solve problems. And doing that properly requires communication between engineering and business people who know the users and problems that are being solved. From there software engineers can take that knowledge and transform it into a data model for the software that will help solve the problem.
Why is data modeling important?
Proper data modeling takes time, so a big question might be why it’s important and worth all the effort? Let’s look at some of the direct and indirect benefits of data modeling.
The most obvious benefits of data modeling is that by planning how you are going to organize your data you have the ability to improve performance, reduce errors and reduce the chances of designing things wrong and having to rewrite code. All of these things combined means you should save time and money in the long run.
Some secondary benefits of data modeling are that the end result should make data analysis easier for everybody in your organization, which will allow for better insights that will help your business. Another benefit of knowing your data model upfront is that you can choose the best type of database for your workload, which will make life easier for you in the future because you won’t have to worry about scaling or performance issues.
Data modeling best practices
1. Keep business objectives in mind
The most important thing when data modeling is to remember how it relates to the overall business. As a software engineer it can be easy to forget the bigger picture, so make sure you are working with business analysts and any other relevant stakeholders to make sure the data model reflects reality.
2. Properly document your data model
Having the best data model in the world is pointless if nobody knows how to use it. Having solid documentation in place will allow others to know what data is available in the system and make it easier to utilize that data to generate valuable insights. Proper documentation of your data model will also help with onboarding new team members faster and make maintaining or modifying the data model easier in the future because you have an up-to-date single source of truth for the current state of the data model via your documentation.
3. Design your data model to be adjustable over time
You should make sure to design your data model so that it can be updated and modified with minimal difficulty. The odds of nailing things on the first attempt are slim and even if you do, there are going to be changes in the future that require new types of data to be added over time. So taking this into consideration from the start will save you pain in the future.
4. Have different levels of abstraction for different people
The standard process for data modeling breaks down into 3 types: conceptual, logical and physical data modeling. These tiers all have different levels of abstraction that can be helpful for allowing even non-technical users to understand the data model. It also allows for technology decisions to be isolated from the higher-level conceptual data models, where that complexity isn’t needed. Here’s a bit more detail on the 3 different types of data modeling:
Conceptual - At the conceptual level objects are defined in relation to the business itself and how they interact or any relevant constraints and business rules.
Logical - At the logical level more detail is added to the objects that were determined at the conceptual stage of data modeling. Logical data models usually include what specific data types and attributes each object should have.
Physical - The physical data model defines the actual technical details for storing the data. This will include database specific implementation details. Using a relational database as an example, the physical model would include the SQL to create tables and define data types for each column in the tables representing the objects defined during the conceptual design stage.
5. Choose the right data modeling technique
There are many different data modeling techniques. Which one you choose should be based on what type of data you are working with and the type of workload. The simplest example would be whether you are creating a data model for an end user application or for data analysts who will be doing mostly analytics queries.
For the first situation you might want to go with relational data modeling, a common choice. For the second it might make more sense to use dimensional modeling. There are tradeoffs to every decision, the key is to figure out which data model provides the most benefits with downsides that are mitigated by the nature of your use case.
6. Think about data governance and security
In recent years data privacy and security have become major topics, with lots of regulation being put into place as well. As part of your data modeling process it makes sense to take the time to also consider how data will be protected, who will have access to it and your overall data governance strategy as a whole.
7. Avoid premature optimization
While this advice might seem to go against a lot of what was written above, the idea here is to not overthink things and slow down your development process. Diminishing returns are a thing, what’s important is getting some of the major decisions right and then iterating. Another thing to keep in mind is to not over-engineer your system, keep things as simple as possible and worry about scaling your system when it becomes an issue.
Data modeling tutorial
Now let’s look at how data modeling can be done with a practical example. To make it interesting we will use time series data because that adds another piece of complexity that needs to be taken into account when designing the data model. For this example, we will use a smart home that collects temperature data to automatically turn the A/C on and off to be more energy efficient.
First let’s define the entities and relationships at the conceptual level. The entities would be the following:
- Energy Consumption
The basic relationships between these entities would be the following:
- A Home contains multiple Rooms.
- Each Room contains one or more Sensors.
- A Sensor generates multiple Temperature readings.
- Energy Consumption is associated with each Home.
Now the extra dimension that comes with this being time series data is how to define the relationship between the entities with data that change over time. In this case it would be Temperature and Energy Consumption. Because the room temperature and energy consumption are obviously going to be related, this is another type of relationship that could be defined in the data model.
With time series data you may want to break down your entities into static entities that don’t change (the name of a room) and time-varying entities that do change (temperature). You then need to think about how to model the different types of relationships that can result from this:
Relationships between two static entities (room and home)
Relationships between time-varying entities and static entities (temperature and room)
Relationships between two time-varying entities (temperature and energy consumption)
How your data is related and what data your end users will be querying has a significant impact on how you will index your data at the physical level of data modeling, so taking this into consideration early is critical for a successful project. If you don’t get your indexes right because you don’t understand the data access patterns of your users, performance will suffer and users won’t be happy.
For more details on data modeling check out these articles:
- Data Modeling: Part 1 — Goals and Methodology
- Data Modeling: Part 2 — Method for Time Series Databases
Data modeling is a valuable tool for ensuring you get the most value possible out of your data. Following the best practices covered in this article should set you on the right path for getting started with data modeling on your own projects.
One of the most valuable aspects of data modeling is understanding how your data will be accessed. Once you know that you can make the right choice for the physical model in terms of which database makes sense for storing your data to optimize for cost and performance. For example, If you are working with time series data like application metrics, IoT sensor data, or real time analytics, a specialized time series database like InfluxDB might make sense. If you are working with highly connected data a graph database would be ideal. For situations where you need guarantees on data integrity a relational database with ACID transaction support would be a good choice.
The key takeaway here is that to make educated decisions you need to first put in the work to understand the data generated by your business or application, and data modeling is a crucial part of that process.