Database Normalisation: When to Follow the Rules and When to Break Them
Database normalisation is the process of structuring tables to reduce redundancy and improve data integrity. Edgar Codd formalised the normal forms in the 1970s; by 1998 they were part of the standard curriculum for anyone working with relational databases.
At Motorola the NMS database started fully normalised. By the end of the first year we had deliberately denormalised two tables for performance reasons. Understanding why normal forms exist — what problem they solve — is what lets you know when breaking them is the right call.
The Problem Normalisation Solves
Without normalisation, the same data appears in multiple places. When it changes, every copy must be updated. Miss one and the database contains contradictory facts.
Consider a table that stores device poll results with the device name repeated in every row:
ip | name | location | status | polled_at
192.168.1.1 | core-sw-01 | Server Room A | UP | 1998-02-08 14:00
192.168.1.1 | core-sw-01 | Server Room A | UP | 1998-02-08 14:30
192.168.1.1 | core-sw-01 | Server Room A | DOWN | 1998-02-08 15:00
If the device moves from Server Room A to Server Room B, you must update every row for that device. Miss some rows and the location is inconsistent.
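In SQL terms, the move is a multi-row update against the polling table (called poll_log here for illustration; the table above is unnamed):
-- rewrites every historical row for the device, not just one
UPDATE poll_log
SET location = 'Server Room B'
WHERE ip = '192.168.1.1';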
First Normal Form (1NF)
Every column holds a single atomic value. No repeating groups.
Violation:
device_id | ip_addresses
1 | 192.168.1.1, 192.168.1.2
1NF:
device_id | ip_address
1 | 192.168.1.1
1 | 192.168.1.2
Storing comma-separated values in a column is a 1NF violation. It makes individual values awkward to query, impossible to index, and impossible to constrain with foreign keys.
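To see the difference, here is how you would find the device that owns 192.168.1.2 in each shape (the table names device_ips_flat and device_ips are illustrative):
-- comma-separated column: substring match, cannot use an index,
-- and also matches longer addresses such as 192.168.1.20
SELECT device_id
FROM device_ips_flat
WHERE ip_addresses LIKE '%192.168.1.2%';

-- 1NF: an indexed equality lookup
SELECT device_id
FROM device_ips
WHERE ip_address = '192.168.1.2';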
Second Normal Form (2NF)
Every non-key column depends on the whole primary key, not just part of it. Only applies when the primary key is composite.
Violation (key is device_id, location_id):
device_id | location_id | location_name | last_seen
1 | 5 | Server Room A | 1998-02-08
location_name depends only on location_id, not on the composite key. If the location is renamed, every row with that location must be updated.
2NF — split the location into its own table:
CREATE TABLE locations (
id INT PRIMARY KEY,
name VARCHAR(64) NOT NULL
);
CREATE TABLE device_locations (
device_id INT REFERENCES devices(id),
location_id INT REFERENCES locations(id),
last_seen DATE NOT NULL,
PRIMARY KEY (device_id, location_id)
);
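With the split, renaming the location touches exactly one row (using the id 5 from the example above):
UPDATE locations SET name = 'Server Room A - East' WHERE id = 5;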
Third Normal Form (3NF)
No non-key column depends on another non-key column (no transitive dependencies).
Violation:
device_id | building_id | building_name
1 | 3 | HQ East
building_name depends on building_id, not on device_id. Same fix: separate table for buildings.
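A sketch of that split (names mirror the examples above):
CREATE TABLE buildings (
    id INT PRIMARY KEY,
    name VARCHAR(64) NOT NULL
);
-- the original table keeps only building_id, which references buildings(id)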
The NMS Schema in Practice
Our fully normalised schema had five tables:
CREATE TABLE devices (
id INT PRIMARY KEY AUTO_INCREMENT,
ip VARCHAR(15) UNIQUE NOT NULL,
name VARCHAR(64) NOT NULL,
type_id INT REFERENCES device_types(id)
);
CREATE TABLE device_types (
id INT PRIMARY KEY,
name VARCHAR(32) NOT NULL
);
CREATE TABLE locations (
id INT PRIMARY KEY,
name VARCHAR(64) NOT NULL,
site_id INT REFERENCES sites(id)
);
CREATE TABLE sites (
id INT PRIMARY KEY,
name VARCHAR(64) NOT NULL
);
CREATE TABLE poll_results (
id INT PRIMARY KEY AUTO_INCREMENT,
device_id INT REFERENCES devices(id),
status VARCHAR(10) NOT NULL,
polled_at BIGINT NOT NULL,
INDEX (device_id, polled_at)
);
This was clean and correct. Adding a device type or renaming a location required updating one row.
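For example, introducing a new device type is a single insert (the values are illustrative):
INSERT INTO device_types (id, name) VALUES (7, 'Load Balancer');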
When We Denormalised
The poll_results table had a problem: displaying the current status of 500 devices on the dashboard required joining poll_results to devices and finding the most recent row per device. On MySQL 3.x this query was slow with 500k rows.
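The query was roughly of this shape (a reconstruction, not the original; MySQL 3.x had no subqueries, so "latest row per device" had to be written as a self-join):
-- keep a poll row only if no later poll exists for the same device
SELECT d.name, p.status, p.polled_at
FROM devices d
JOIN poll_results p ON p.device_id = d.id
LEFT JOIN poll_results later
       ON later.device_id = p.device_id
      AND later.polled_at > p.polled_at
WHERE later.id IS NULL;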
We added a current_status column directly to devices:
ALTER TABLE devices ADD COLUMN current_status VARCHAR(10) NOT NULL DEFAULT 'UNKNOWN';
Now the dashboard query was a single table scan. The trade-off: every poll had to update two places (poll_results insert + devices update), and a failed update could leave them inconsistent.
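Concretely, each poll now touched two tables while the dashboard read just one (a sketch; the device id and timestamp are illustrative):
-- per poll: append to history, then refresh the cached status
INSERT INTO poll_results (device_id, status, polled_at)
VALUES (42, 'UP', 886946400);
UPDATE devices SET current_status = 'UP' WHERE id = 42;

-- dashboard: no join, no per-device aggregation
SELECT name, current_status FROM devices;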
We accepted this trade-off explicitly: the dashboard was read far more often than it was written, and the inconsistency window was seconds at most. The key is making the decision consciously with the trade-offs understood, not by accident.
The Rule for Breaking Normal Forms
Denormalise when:
- A specific query is too slow and cannot be fixed with an index.
- The denormalised data can be kept consistent by code you control.
- The inconsistency window is acceptable for your use case.
Never denormalise just to simplify queries or avoid joins. Joins on indexed foreign keys are cheap; repairing inconsistent data is not.
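One caveat from that era: on MySQL 3.x the inline REFERENCES clauses above were parsed but not enforced and did not create indexes, so the index behind a fast join had to be created explicitly, e.g.:
-- index the foreign key column used by the join
CREATE INDEX idx_devices_type ON devices (type_id);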