Normalization (1NF-BCNF)

intermediate normalization 1nf 2nf 3nf bcnf functional-dependency

Normalization is the process of organizing a database to reduce redundancy and prevent update anomalies. We break one messy table into multiple clean tables, each representing a single concept.

Think of it like organizing a messy closet. Everything is shoved in one pile (the flat table), and we’re sorting it into separate drawers (normalized tables) so things don’t get lost or duplicated.

Why Normalize?

Without normalization, we get three kinds of anomalies:

  • Insert anomaly — we can’t add data without unrelated data. (Can’t add a new department without assigning an employee to it.)
  • Update anomaly — changing one fact requires updating multiple rows. (Department name changes? Update every employee row.)
  • Delete anomaly — deleting data accidentally removes unrelated data. (Delete the last employee in a department? The department itself disappears.)

The Starting Point: A Flat Table

Let’s normalize this mess step by step:

-- The "everything in one table" approach
-- student_id | student_name | course_id | course_name | instructor | instructor_phone
-- 1          | Alice        | CS101     | Databases   | Dr. Smith  | 555-1234
-- 1          | Alice        | CS102     | Networks    | Dr. Jones  | 555-5678
-- 2          | Bob          | CS101     | Databases   | Dr. Smith  | 555-1234
-- 2          | Bob          | CS103     | OS          | Dr. Smith  | 555-1234
-- 3          | Charlie      | CS102     | Networks    | Dr. Jones  | 555-5678

Problems everywhere: Alice’s name is stored twice. Dr. Smith’s phone number is stored three times. Change one, forget the other — data becomes inconsistent.

The Normalization Progression

UNF
Unnormalized
1NF
Atomic values
2NF
No partial deps
3NF
No transitive deps
BCNF
Stricter 3NF
Each level builds on the previous. Most real databases aim for 3NF.

1NF — First Normal Form

Rule: Every cell must contain a single, atomic value. No lists, no repeating groups, no arrays stuffed into one column.

-- VIOLATES 1NF (courses is a comma-separated list)
-- student_id | student_name | courses
-- 1          | Alice        | CS101, CS102

-- SATISFIES 1NF (one value per cell, one row per enrollment)
-- student_id | student_name | course_id
-- 1          | Alice        | CS101
-- 1          | Alice        | CS102

Also: every table must have a primary key. In our example, the primary key is the combination (student_id, course_id).

2NF — Second Normal Form

Rule: Must be in 1NF AND no partial dependencies. Every non-key column must depend on the entire primary key, not just part of it.

This only matters when we have a composite primary key (two or more columns). If our primary key is a single column, we’re automatically in 2NF.

Look at our table with composite key (student_id, course_id):

-- student_id | student_name | course_id | course_name | instructor
-- PK: (student_id, course_id)

-- student_name depends on student_id ALONE (partial dependency!)
-- course_name depends on course_id ALONE (partial dependency!)
-- Both violate 2NF because they don't need the FULL composite key

To fix: move partially dependent columns to their own tables.

-- Students table (student_name depends only on student_id)
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    student_name VARCHAR(100)
);

-- Courses table (course_name, instructor depend only on course_id)
CREATE TABLE courses (
    course_id VARCHAR(10) PRIMARY KEY,
    course_name VARCHAR(100),
    instructor VARCHAR(100)
);

-- Enrollments table (the relationship)
CREATE TABLE enrollments (
    student_id INT REFERENCES students(student_id),
    course_id VARCHAR(10) REFERENCES courses(course_id),
    PRIMARY KEY (student_id, course_id)
);

Now each non-key column depends on the full primary key of its table.

3NF — Third Normal Form

Rule: Must be in 2NF AND no transitive dependencies. Non-key columns must depend directly on the primary key, not through another non-key column.

In simple language: if A determines B, and B determines C, then C shouldn’t be in the same table as A. Move C to B’s table.

-- Our courses table still has a problem:
-- course_id → instructor → instructor_phone
-- instructor_phone depends on instructor, NOT on course_id directly
-- That's a transitive dependency!

-- course_id | course_name | instructor | instructor_phone
-- CS101     | Databases   | Dr. Smith  | 555-1234
-- CS103     | OS          | Dr. Smith  | 555-1234   ← phone duplicated!

To fix: move instructor details to their own table.

-- Instructors table
CREATE TABLE instructors (
    instructor_id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    phone VARCHAR(20)
);

-- Courses table (references instructor by FK)
CREATE TABLE courses (
    course_id VARCHAR(10) PRIMARY KEY,
    course_name VARCHAR(100),
    instructor_id INT REFERENCES instructors(instructor_id)
);

Now instructor_phone depends directly on instructor_id in its own table. No transitive dependency.

BCNF — Boyce-Codd Normal Form

Rule: Must be in 3NF AND every determinant must be a candidate key. This is a stricter version of 3NF that handles some edge cases.

In simple language: for every functional dependency X -> Y, X must be a superkey (capable of being a primary key).

BCNF and 3NF are the same in most practical cases. The difference only shows up with overlapping composite candidate keys — which is pretty rare.

-- A rare BCNF violation example:
-- A student can only have one advisor per subject
-- An advisor teaches only one subject

-- student | subject | advisor
-- Alice   | DB      | Dr. Smith     (advisor → subject)
-- Bob     | DB      | Dr. Smith
-- Alice   | Net     | Dr. Jones

-- Candidate keys: (student, subject) and (student, advisor)
-- advisor → subject is a dependency where advisor is NOT a superkey
-- This violates BCNF!

-- Fix: split into two tables
-- student_advisor: student | advisor
-- advisor_subject:  advisor | subject

The Full Normalized Schema

After all that work, our original flat table is now four clean tables:

CREATE TABLE students (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE instructors (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    phone VARCHAR(20)
);

CREATE TABLE courses (
    id VARCHAR(10) PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    instructor_id INT REFERENCES instructors(id)
);

CREATE TABLE enrollments (
    student_id INT REFERENCES students(id),
    course_id VARCHAR(10) REFERENCES courses(id),
    enrolled_at TIMESTAMP DEFAULT NOW(),
    PRIMARY KEY (student_id, course_id)
);

No redundancy. Change an instructor’s phone in one place, it’s updated everywhere. Delete a student, courses still exist. Add a new course without needing a student. All anomalies gone.

How Far Should We Normalize?

Most real-world databases aim for 3NF. Going beyond that (BCNF, 4NF, 5NF) is rarely needed and can make queries slower due to excessive JOINs.

The practical rule: normalize until the redundancy is gone, then stop. If performance suffers from too many JOINs, we can strategically denormalize (which we cover in the next note).

Interview Tip

Normalization is a classic DBMS interview topic. Walk through the example step by step: start with a messy flat table, explain the anomalies, then normalize through 1NF, 2NF, and 3NF. The key phrases to remember: “atomic values” (1NF), “no partial dependencies” (2NF), “no transitive dependencies” (3NF). If asked about BCNF, mention it’s a stricter 3NF where every determinant must be a candidate key.