On this page
Normalization
Database Normalization
Normalization is the process of organizing a relational database schema to reduce data redundancy and improve data integrity. It was formalized by Edgar F. Codd alongside the relational model itself, and the normal forms (1NF through BCNF and beyond) provide a rigorous framework for designing schemas that are free of update, insertion, and deletion anomalies.
Understanding normalization helps you avoid the most common database design mistakes and build schemas that are consistent, maintainable, and correct.
The Problem with Unnormalized Data
Consider an order management system stored in a single flat table:
| order_id | customer_name | customer_email | product_name | category | unit_price | qty |
|---|---|---|---|---|---|---|
| 1 | Alice | [email protected] | Laptop | Electronics | 1299.99 | 1 |
| 2 | Alice | [email protected] | Mouse | Electronics | 29.99 | 2 |
| 3 | Bob | [email protected] | Laptop | Electronics | 1299.99 | 1 |
This structure has three classes of data anomalies:
Update anomaly: if Alice changes her email address, you must update every row with her data. If you miss even one row, the database is now inconsistent — different rows claim Alice has different email addresses.
Insertion anomaly: you cannot record a product in the catalog until someone orders it, because the only place to store product data is alongside order data.
Deletion anomaly: if you delete order #3 (the only order for Bob), you lose all information about Bob as a customer entirely.
Normalization eliminates these anomalies by ensuring each fact is stored in exactly one place.
First Normal Form (1NF)
A table is in First Normal Form if:
- All values in a column are of the same data type
- Each column contains atomic (indivisible) values — no arrays or multi-valued fields
- Each row is unique (has a primary key)
Violating 1NF
-- NOT 1NF: the phones column contains multiple values
CREATE TABLE contacts_bad (
contact_id INTEGER PRIMARY KEY,
name VARCHAR(200),
phones TEXT -- stores "555-1234, 555-5678" — multiple values in one cell!
);Fixing to 1NF
-- 1NF: separate table for the multi-valued attribute
CREATE TABLE contacts (
contact_id INTEGER PRIMARY KEY,
name VARCHAR(200)
);
CREATE TABLE contact_phones (
contact_id INTEGER REFERENCES contacts(contact_id),
phone_number VARCHAR(20),
phone_type VARCHAR(20), -- 'mobile', 'home', 'work'
PRIMARY KEY (contact_id, phone_number)
);PostgreSQL's JSONB and array types technically allow 1NF violations. These are legitimate choices for semi-structured data, but for structured data with a predictable schema, always normalize.
Second Normal Form (2NF)
A table is in Second Normal Form if:
- It is in 1NF
- Every non-key column is fully functionally dependent on the entire primary key — not just part of it
2NF is only relevant for tables with composite primary keys. If your primary key is a single column, the table is automatically in 2NF if it is in 1NF.
Violating 2NF
-- NOT 2NF: primary key is (order_id, product_id)
-- But product_name and category depend only on product_id (partial dependency)
CREATE TABLE order_items_bad (
order_id INTEGER,
product_id INTEGER,
product_name VARCHAR(200), -- depends only on product_id!
category VARCHAR(100), -- depends only on product_id!
unit_price NUMERIC(10,2),
quantity INTEGER,
PRIMARY KEY (order_id, product_id)
);Fixing to 2NF
Move the partially dependent columns to a separate table:
-- Products table: product_name and category depend on product_id
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL,
category VARCHAR(100) NOT NULL
);
-- order_items: only columns that depend on BOTH order_id AND product_id
CREATE TABLE order_items (
order_id INTEGER REFERENCES orders(order_id),
product_id INTEGER REFERENCES products(product_id),
unit_price NUMERIC(10,2) NOT NULL,
quantity INTEGER NOT NULL,
PRIMARY KEY (order_id, product_id)
);Third Normal Form (3NF)
A table is in Third Normal Form if:
- It is in 2NF
- Every non-key column is directly dependent on the primary key — not transitively through another non-key column
Violating 3NF
-- NOT 3NF: zip_code determines city and state (transitive dependency)
CREATE TABLE employees_bad (
employee_id INTEGER PRIMARY KEY,
full_name VARCHAR(200),
zip_code VARCHAR(10),
city VARCHAR(100), -- depends on zip_code, not directly on employee_id!
state VARCHAR(50) -- depends on zip_code, not directly on employee_id!
);Fixing to 3NF
-- Separate the transitive dependency
CREATE TABLE zip_codes (
zip_code VARCHAR(10) PRIMARY KEY,
city VARCHAR(100) NOT NULL,
state VARCHAR(50) NOT NULL
);
CREATE TABLE employees (
employee_id SERIAL PRIMARY KEY,
full_name VARCHAR(200) NOT NULL,
zip_code VARCHAR(10) REFERENCES zip_codes(zip_code)
);Boyce-Codd Normal Form (BCNF)
BCNF is a slightly stronger version of 3NF. A table is in BCNF if, for every non-trivial functional dependency X → Y, X is a superkey (a key of the table).
Most 3NF tables are also in BCNF. Violations only occur when a table has multiple overlapping candidate keys. BCNF is rarely a practical concern for typical application schemas.
Practical Normalization Checklist
Use this checklist when designing or reviewing a schema:
- Does every table have a primary key? Yes → 1NF candidate
- Do any non-key columns depend on only part of a composite key? If yes, move them → fix 2NF
- Do any non-key columns depend on another non-key column? If yes, extract → fix 3NF
- Is the same data stored in multiple tables? If yes, consolidate → reduce redundancy
- Can you update one fact by changing exactly one row? If yes → well-normalized
When to Denormalize
Strict normalization is the right starting point, but there are legitimate reasons to denormalize selectively:
Historical snapshots: order_items.unit_price should be the price at purchase time, not the current product price. Storing it separately is correct denormalization.
Performance: heavily-read reporting tables can benefit from pre-computed aggregates or pre-joined data, avoiding expensive queries. Use materialized views as a first step.
Search optimization: full-text search columns often combine data from multiple columns into a single searchable field.
The rule: normalize first, denormalize only when you have measured a performance problem and confirmed denormalization solves it. Premature denormalization is a significant source of data quality bugs.
A well-normalized schema to 3NF typically results in more tables but far fewer data consistency bugs. Combined with the constraints you will learn in the next lesson, it forms the foundation of a trustworthy database.
-- BEFORE normalization: a flat, denormalized table
-- Problem: repeats customer and product data on every order row
CREATE TABLE orders_flat (
order_id INTEGER,
order_date DATE,
customer_id INTEGER,
customer_name VARCHAR(200), -- duplicated!
customer_email VARCHAR(320), -- duplicated!
customer_city VARCHAR(100), -- duplicated!
product_id INTEGER,
product_name VARCHAR(200), -- duplicated!
product_category VARCHAR(100), -- duplicated!
unit_price NUMERIC(10,2),
quantity INTEGER
);
-- AFTER normalization: 3NF schema
-- Each fact lives in exactly one place
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
full_name VARCHAR(200) NOT NULL,
email VARCHAR(320) NOT NULL UNIQUE,
city VARCHAR(100)
);
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL,
category_id INTEGER NOT NULL REFERENCES categories(category_id),
base_price NUMERIC(10,2) NOT NULL
);
CREATE TABLE categories (
category_id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL UNIQUE
);
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
ordered_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE order_items (
order_id INTEGER NOT NULL REFERENCES orders(order_id),
product_id INTEGER NOT NULL REFERENCES products(product_id),
quantity INTEGER NOT NULL CHECK (quantity > 0),
unit_price NUMERIC(10,2) NOT NULL, -- snapshot of price at time of sale
PRIMARY KEY (order_id, product_id)
);
Sign in to track your progress