On this page

Normalization

14 min read TextCh. 4 — Design and Modeling

Database Normalization

Normalization is the process of organizing a relational database schema to reduce data redundancy and improve data integrity. It was formalized by Edgar F. Codd alongside the relational model itself, and the normal forms (1NF through BCNF and beyond) provide a rigorous framework for designing schemas that are free of update, insertion, and deletion anomalies.

Understanding normalization helps you avoid the most common database design mistakes and build schemas that are consistent, maintainable, and correct.

The Problem with Unnormalized Data

Consider an order management system stored in a single flat table:

order_id customer_name customer_email product_name category unit_price qty
1 Alice [email protected] Laptop Electronics 1299.99 1
2 Alice [email protected] Mouse Electronics 29.99 2
3 Bob [email protected] Laptop Electronics 1299.99 1

This structure has three classes of data anomalies:

Update anomaly: if Alice changes her email address, you must update every row with her data. If you miss even one row, the database is now inconsistent — different rows claim Alice has different email addresses.

Insertion anomaly: you cannot record a product in the catalog until someone orders it, because the only place to store product data is alongside order data.

Deletion anomaly: if you delete order #3 (the only order for Bob), you lose all information about Bob as a customer entirely.

Normalization eliminates these anomalies by ensuring each fact is stored in exactly one place.

First Normal Form (1NF)

A table is in First Normal Form if:

  1. All values in a column are of the same data type
  2. Each column contains atomic (indivisible) values — no arrays or multi-valued fields
  3. Each row is unique (has a primary key)

Violating 1NF

-- NOT 1NF: the phones column contains multiple values
CREATE TABLE contacts_bad (
    contact_id INTEGER PRIMARY KEY,
    name       VARCHAR(200),
    phones     TEXT  -- stores "555-1234, 555-5678" — multiple values in one cell!
);

Fixing to 1NF

-- 1NF: separate table for the multi-valued attribute
CREATE TABLE contacts (
    contact_id INTEGER PRIMARY KEY,
    name       VARCHAR(200)
);

CREATE TABLE contact_phones (
    contact_id   INTEGER REFERENCES contacts(contact_id),
    phone_number VARCHAR(20),
    phone_type   VARCHAR(20),  -- 'mobile', 'home', 'work'
    PRIMARY KEY (contact_id, phone_number)
);

PostgreSQL's JSONB and array types technically allow 1NF violations. These are legitimate choices for semi-structured data, but for structured data with a predictable schema, always normalize.

Second Normal Form (2NF)

A table is in Second Normal Form if:

  1. It is in 1NF
  2. Every non-key column is fully functionally dependent on the entire primary key — not just part of it

2NF is only relevant for tables with composite primary keys. If your primary key is a single column, the table is automatically in 2NF if it is in 1NF.

Violating 2NF

-- NOT 2NF: primary key is (order_id, product_id)
-- But product_name and category depend only on product_id (partial dependency)
CREATE TABLE order_items_bad (
    order_id      INTEGER,
    product_id    INTEGER,
    product_name  VARCHAR(200),  -- depends only on product_id!
    category      VARCHAR(100),  -- depends only on product_id!
    unit_price    NUMERIC(10,2),
    quantity      INTEGER,
    PRIMARY KEY (order_id, product_id)
);

Fixing to 2NF

Move the partially dependent columns to a separate table:

-- Products table: product_name and category depend on product_id
CREATE TABLE products (
    product_id  SERIAL PRIMARY KEY,
    name        VARCHAR(200) NOT NULL,
    category    VARCHAR(100) NOT NULL
);

-- order_items: only columns that depend on BOTH order_id AND product_id
CREATE TABLE order_items (
    order_id    INTEGER REFERENCES orders(order_id),
    product_id  INTEGER REFERENCES products(product_id),
    unit_price  NUMERIC(10,2) NOT NULL,
    quantity    INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);

Third Normal Form (3NF)

A table is in Third Normal Form if:

  1. It is in 2NF
  2. Every non-key column is directly dependent on the primary key — not transitively through another non-key column

Violating 3NF

-- NOT 3NF: zip_code determines city and state (transitive dependency)
CREATE TABLE employees_bad (
    employee_id  INTEGER PRIMARY KEY,
    full_name    VARCHAR(200),
    zip_code     VARCHAR(10),
    city         VARCHAR(100),  -- depends on zip_code, not directly on employee_id!
    state        VARCHAR(50)    -- depends on zip_code, not directly on employee_id!
);

Fixing to 3NF

-- Separate the transitive dependency
CREATE TABLE zip_codes (
    zip_code  VARCHAR(10) PRIMARY KEY,
    city      VARCHAR(100) NOT NULL,
    state     VARCHAR(50) NOT NULL
);

CREATE TABLE employees (
    employee_id  SERIAL PRIMARY KEY,
    full_name    VARCHAR(200) NOT NULL,
    zip_code     VARCHAR(10) REFERENCES zip_codes(zip_code)
);

Boyce-Codd Normal Form (BCNF)

BCNF is a slightly stronger version of 3NF. A table is in BCNF if, for every non-trivial functional dependency X → Y, X is a superkey (a key of the table).

Most 3NF tables are also in BCNF. Violations only occur when a table has multiple overlapping candidate keys. BCNF is rarely a practical concern for typical application schemas.

Practical Normalization Checklist

Use this checklist when designing or reviewing a schema:

  1. Does every table have a primary key? Yes → 1NF candidate
  2. Do any non-key columns depend on only part of a composite key? If yes, move them → fix 2NF
  3. Do any non-key columns depend on another non-key column? If yes, extract → fix 3NF
  4. Is the same data stored in multiple tables? If yes, consolidate → reduce redundancy
  5. Can you update one fact by changing exactly one row? If yes → well-normalized

When to Denormalize

Strict normalization is the right starting point, but there are legitimate reasons to denormalize selectively:

Historical snapshots: order_items.unit_price should be the price at purchase time, not the current product price. Storing it separately is correct denormalization.

Performance: heavily-read reporting tables can benefit from pre-computed aggregates or pre-joined data, avoiding expensive queries. Use materialized views as a first step.

Search optimization: full-text search columns often combine data from multiple columns into a single searchable field.

The rule: normalize first, denormalize only when you have measured a performance problem and confirmed denormalization solves it. Premature denormalization is a significant source of data quality bugs.

A well-normalized schema to 3NF typically results in more tables but far fewer data consistency bugs. Combined with the constraints you will learn in the next lesson, it forms the foundation of a trustworthy database.

Why Store unit_price in order_items?
Even though there is already a base_price in the products table, order_items stores unit_price separately. This is intentional — it is a historical snapshot of the price at the time the order was placed. If you change the product price next month, you do not want last month's orders to show the new price. This is a deliberate, justified denormalization.
Normalization Is a Tool, Not a Religion
Normalize your schema to eliminate data anomalies, then selectively denormalize for performance when you have measured evidence that it is needed. Never denormalize prematurely. The most common mistake is building a flat denormalized schema from the start and paying the price in data inconsistency later.
BCNF vs 3NF in Practice
Most production databases target 3NF. BCNF is theoretically stricter but can force you to split tables in ways that complicate queries or lose meaningful functional dependencies. If your schema is in 3NF and you have no anomalies, you do not need to push further to BCNF in most applications.
sql
-- BEFORE normalization: a flat, denormalized table
-- Problem: repeats customer and product data on every order row
CREATE TABLE orders_flat (
    order_id        INTEGER,
    order_date      DATE,
    customer_id     INTEGER,
    customer_name   VARCHAR(200),  -- duplicated!
    customer_email  VARCHAR(320),  -- duplicated!
    customer_city   VARCHAR(100),  -- duplicated!
    product_id      INTEGER,
    product_name    VARCHAR(200),  -- duplicated!
    product_category VARCHAR(100), -- duplicated!
    unit_price      NUMERIC(10,2),
    quantity        INTEGER
);

-- AFTER normalization: 3NF schema
-- Each fact lives in exactly one place

CREATE TABLE customers (
    customer_id  SERIAL PRIMARY KEY,
    full_name    VARCHAR(200) NOT NULL,
    email        VARCHAR(320) NOT NULL UNIQUE,
    city         VARCHAR(100)
);

CREATE TABLE products (
    product_id   SERIAL PRIMARY KEY,
    name         VARCHAR(200) NOT NULL,
    category_id  INTEGER      NOT NULL REFERENCES categories(category_id),
    base_price   NUMERIC(10,2) NOT NULL
);

CREATE TABLE categories (
    category_id  SERIAL PRIMARY KEY,
    name         VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE orders (
    order_id     SERIAL PRIMARY KEY,
    customer_id  INTEGER     NOT NULL REFERENCES customers(customer_id),
    ordered_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE order_items (
    order_id    INTEGER     NOT NULL REFERENCES orders(order_id),
    product_id  INTEGER     NOT NULL REFERENCES products(product_id),
    quantity    INTEGER     NOT NULL CHECK (quantity > 0),
    unit_price  NUMERIC(10,2) NOT NULL,  -- snapshot of price at time of sale
    PRIMARY KEY (order_id, product_id)
);