DBMS

All 47 notes on one page

Database Fundamentals

What Is a DBMS?

beginner dbms rdbms nosql fundamentals

A Database Management System (DBMS) is software that sits between our application and the actual data stored on disk. It handles all the messy details — storing, retrieving, updating, and deleting data — so we don’t have to manage raw files ourselves.

Think of it like a librarian. We don’t walk into a library and randomly dig through shelves. We ask the librarian (DBMS), and they know exactly where everything is, who borrowed what, and how to keep things organized.

Why Not Just Use Files?

Before databases existed, people stored data in flat files — CSV files, text files, whatever. And it worked… until it didn’t. Here’s what goes wrong with flat files:

No concurrent access — two users editing the same file at once? Chaos.
No data integrity — nothing stops us from putting “banana” in an age field.
No relationships — linking data across files is a nightmare of manual bookkeeping.
No security — the whole file is either accessible or it isn’t. No fine-grained control.
No crash recovery — if the system crashes mid-write, our data could be half-written garbage.

A DBMS solves all of these problems. It gives us structured storage, transactions, access control, crash recovery, and a query language to talk to our data.

How a DBMS Sits in the Architecture

Application

Node, Python, Java

→

DBMS

PostgreSQL, MongoDB

→

Disk Storage

Files, Pages, Blocks

Our app never touches disk directly — it always goes through the DBMS

Our application sends queries (SQL or API calls) to the DBMS. The DBMS figures out the most efficient way to read/write the data on disk, handles locking, caching, and returns the results. We never deal with file I/O directly.

Types of DBMS

Relational (RDBMS)

This is the classic. Data lives in tables with rows and columns. We use SQL to query it. Every row has a fixed schema — same columns, same data types.

Best for: structured data with clear relationships. Banking, e-commerce, anything where data integrity matters.

Popular ones: PostgreSQL, MySQL, SQLite, Oracle, SQL Server

NoSQL

A broad category for “anything that’s not a traditional relational table.” These databases trade some of the strictness of RDBMS for flexibility and scale.

Document stores (MongoDB, CouchDB) — data as JSON-like documents
Key-Value stores (Redis, DynamoDB) — simple key-to-value lookups, blazing fast
Wide-Column stores (Cassandra, HBase) — rows can have different columns
Graph databases (Neo4j, ArangoDB) — nodes and edges for relationship-heavy data

NewSQL

The best of both worlds — SQL interface and ACID guarantees, but with the horizontal scalability of NoSQL. These are newer and designed for modern distributed systems.

Popular ones: CockroachDB, TiDB, Google Spanner, YugabyteDB

The Big Players at a Glance

-- PostgreSQL: The "Swiss Army Knife" of databases
-- Open source, feature-rich, great for almost anything
-- Supports JSON, full-text search, extensions

-- MySQL: The most widely deployed open-source RDBMS
-- Powers WordPress, Facebook (with modifications), many web apps
-- Simple to set up, huge community

-- MongoDB: The go-to document database
-- Schema-flexible, stores BSON (binary JSON)
-- Great for rapid prototyping and semi-structured data

-- Redis: In-memory key-value store
-- Sub-millisecond reads, used as cache or session store
-- Also supports pub/sub, streams, and more

-- SQLite: Serverless, file-based database
-- No separate server process — the DB is just a file
-- Perfect for mobile apps, embedded systems, local dev

What a DBMS Actually Does Under the Hood

When we fire a query, here’s what happens inside:

Parser — checks if our SQL is syntactically correct
Query Optimizer — figures out the best execution plan (which indexes to use, join order, etc.)
Execution Engine — actually runs the plan
Storage Engine — reads/writes data from/to disk
Transaction Manager — ensures ACID properties
Buffer Manager — caches frequently accessed data in memory

We don’t need to know all these internals for most interviews, but knowing the DBMS isn’t just “a black box” shows deeper understanding.

Interview Tip

When asked “What is a DBMS?”, don’t just give the textbook definition. Mention the problems it solves (concurrency, integrity, recovery, security) and name a few types. That shows we understand the why, not just the what.

References

Types of Databases

beginner database relational nosql graph columnar time-series

Not all databases are built the same. The type of database we pick depends entirely on the shape of our data and the access patterns we expect. Picking the wrong type is like using a screwdriver to hammer a nail — it technically works, but it’s painful.

Let’s walk through every major type.

The Seven Main Types

Relational

Tables, rows, SQL

PostgreSQL, MySQL, SQLite

Document

JSON docs, flexible schema

MongoDB, CouchDB, Firestore

Key-Value

Simple key → value pairs

Redis, DynamoDB, Memcached

Wide-Column

Row key + column families

Cassandra, HBase, ScyllaDB

Graph

Nodes + edges + properties

Neo4j, ArangoDB, Amazon Neptune

Columnar

Column-oriented storage

ClickHouse, BigQuery, Redshift

Time-Series

Timestamped data points

InfluxDB, TimescaleDB, Prometheus

1. Relational Databases (RDBMS)

The most widely used type. Data is organized into tables (relations) with a fixed schema. Every row follows the same structure. We query using SQL.

The big strength? Relationships. We can link tables together with foreign keys and use JOINs to pull related data in a single query.

-- Classic relational structure
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE
);

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INT REFERENCES users(id),  -- foreign key relationship
    total DECIMAL(10,2),
    created_at TIMESTAMP DEFAULT NOW()
);

Best for: Banking, e-commerce, any app with structured data and complex relationships.

2. Document Databases

Instead of tables, data lives as JSON-like documents. Each document can have a completely different structure — no rigid schema required.

Think of it like a filing cabinet. Each folder (document) can contain whatever we want. One folder might have 5 pages, another might have 20 with different fields.

-- MongoDB-style document (not SQL, but shown for clarity)
-- A user document might look like:
-- {
--   "_id": "abc123",
--   "name": "Manish",
--   "email": "manish@example.com",
--   "addresses": [
--     { "city": "Mumbai", "zip": "400001" },
--     { "city": "Delhi", "zip": "110001" }
--   ]
-- }
-- No need to create a separate "addresses" table!

Best for: Content management, user profiles, catalogs — anything with semi-structured or frequently changing schemas.

3. Key-Value Stores

The simplest model. Every piece of data is stored as a key-value pair. Give it a key, get back a value. That’s it. No complex queries, no joins, no relationships.

Because it’s so simple, it’s incredibly fast. Most key-value stores live entirely in memory.

Best for: Caching, session storage, feature flags, rate limiting, leaderboards.

4. Wide-Column Stores

Think of it like a relational table, but each row can have different columns. Data is grouped into column families, and we query by row key.

These are designed for massive scale — billions of rows, thousands of columns, distributed across many nodes.

Best for: IoT data, event logging, recommendation engines, anything with huge write throughput.

5. Graph Databases

Data is stored as nodes (entities) and edges (relationships between them). Each node and edge can have properties.

When our queries are all about “how is X connected to Y?”, graph databases destroy relational databases in performance. A query that needs 10 JOINs in SQL might be a simple traversal in a graph.

Best for: Social networks, fraud detection, recommendation engines, knowledge graphs.

6. Columnar Databases

In a normal row-based database, data for a single row is stored together. In a columnar database, data for a single column is stored together.

Why does this matter? When we run analytics queries like “what’s the average order total?”, we only need the total column. A columnar store reads just that column — not every field of every row.

Best for: Analytics, data warehousing, business intelligence, reporting.

7. Time-Series Databases

Optimized for data that comes in as timestamped data points — metrics, sensor readings, stock prices. They handle massive write throughput and have built-in functions for time-based aggregation (averages per hour, rolling windows, etc.).

Best for: Monitoring/observability, IoT sensor data, financial data, log analytics.

How to Pick the Right Type

There’s no “best” database. It depends on what we’re building:

Question	If Yes…
Is our data structured with clear relationships?	Relational
Do we need flexible schema that changes often?	Document
Do we need blazing fast lookups by key?	Key-Value
Are we dealing with massive write volumes?	Wide-Column
Is the query all about connections between entities?	Graph
Are we doing heavy analytics on large datasets?	Columnar
Is the data timestamped metrics/events?	Time-Series

The Real World: Polyglot Persistence

Most production systems don’t use just one database. They use multiple types together:

PostgreSQL as the primary data store
Redis for caching and sessions
Elasticsearch for full-text search
ClickHouse for analytics

This is called polyglot persistence — picking the right tool for each job instead of forcing everything into one database.

Interview Tip

If an interviewer asks “which database would you pick?”, never answer with just a name. Always explain the why — what access patterns, what scale, what consistency requirements. That’s what separates a good answer from a great one.

References

SQL vs NoSQL

beginner sql nosql scaling schema acid base

This is probably the most common database interview question. “When would you pick SQL vs NoSQL?” Let’s break it down properly.

In simple language, SQL databases store data in structured tables with a predefined schema. NoSQL databases store data in flexible formats — documents, key-value pairs, graphs, or wide columns.

The Core Differences

SQL (Relational)

NoSQL

Fixed schema (columns defined upfront)

Dynamic schema (fields can vary)

Tables with rows and columns

Documents, key-value, graph, wide-column

Vertical scaling (bigger machine)

Horizontal scaling (more machines)

ACID transactions

BASE (eventually consistent)

JOINs across tables

Denormalized, embedded data

PostgreSQL, MySQL, SQLite

MongoDB, Redis, Cassandra, Neo4j

Schema: Rigid vs Flexible

With SQL, we define the schema upfront. Every row in a table must follow it. Want to add a new field? We need to run an ALTER TABLE migration.

-- SQL: Schema is defined upfront
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE,
    age INT
);

-- Every user MUST have these columns. No exceptions.
-- Adding a "phone" field later requires a migration:
ALTER TABLE users ADD COLUMN phone VARCHAR(20);

With NoSQL (say MongoDB), documents in the same collection can have completely different fields. One user document might have a phone field, another might not. No migration needed.

-- MongoDB: No predefined schema
-- Document 1:
-- { "name": "Manish", "email": "manish@example.com", "age": 26 }
--
-- Document 2 (same collection, different fields):
-- { "name": "Rahul", "email": "rahul@example.com", "phone": "+91..." }
--
-- Both live happily in the same collection

Scaling: Vertical vs Horizontal

Vertical scaling means getting a bigger, faster machine. More RAM, more CPU, faster disk. This is how most SQL databases scale. There’s a ceiling though — machines can only get so big.

Horizontal scaling means adding more machines to share the load. NoSQL databases are designed for this from the ground up. Data gets distributed (sharded) across multiple nodes.

Now, this isn’t a hard rule anymore. PostgreSQL can scale horizontally with tools like Citus. MongoDB added ACID transactions. The lines are blurring. But the default design philosophy still differs.

ACID vs BASE

SQL databases follow ACID — every transaction is Atomic, Consistent, Isolated, and Durable. When we transfer money between accounts, either both updates happen or neither does. No partial state.

NoSQL databases typically follow BASE — Basically Available, Soft state, Eventually consistent. The data might be slightly stale for a moment, but the system stays available and eventually catches up.

In simple language: ACID says “everything is perfectly correct, always.” BASE says “things will be correct eventually, but we prioritize staying online.”

We cover both in detail in their own notes.

When to Pick SQL

Our data has clear, well-defined relationships (users → orders → products)
We need strong consistency and ACID transactions (banking, payments)
We know our schema and it won’t change dramatically
We need complex queries with JOINs, aggregations, subqueries
Our data fits on one (admittedly large) machine

When to Pick NoSQL

Our schema evolves rapidly (startup MVP, experimental features)
We need to scale to massive read/write throughput across many servers
Our data is semi-structured or varies by record (user profiles, product catalogs)
We need low-latency reads and can tolerate eventual consistency
We’re building real-time features like chat, feeds, or gaming leaderboards

The Dirty Secret: Most Apps Don’t Need NoSQL

Here’s the thing most people won’t tell us in an interview — for 90% of applications, PostgreSQL is more than enough. It handles JSON, full-text search, and can scale pretty far vertically.

NoSQL makes sense at massive scale (think Twitter, Netflix, Uber) or when the data genuinely doesn’t fit a relational model. For a typical web app with a few million rows? PostgreSQL all day.

Hybrid Approaches

Modern databases are blurring the lines:

PostgreSQL has jsonb columns for document-style storage inside a relational database
MongoDB added multi-document ACID transactions (since v4.0)
CockroachDB and YugabyteDB give us SQL + horizontal scaling (NewSQL)
FaunaDB combines document flexibility with relational guarantees

The real skill isn’t knowing “SQL good, NoSQL bad” (or vice versa). It’s knowing when each approach makes sense and being able to justify the decision.

Interview Tip

Never say “NoSQL is better because it scales.” That’s a red flag. Instead, talk about the specific trade-offs for the specific use case. Interviewers love hearing “for this use case, I’d pick X because…” followed by concrete reasoning.

References

ACID Properties

beginner acid transactions atomicity consistency isolation durability

ACID is the set of four guarantees that make database transactions reliable. Every time we hear “transaction” in the database world, ACID is what makes it trustworthy.

Think of it like a contract. The database promises: “If you wrap your operations in a transaction, I guarantee these four things will hold true.”

The Classic Bank Transfer Example

Let’s say Manish wants to transfer Rs 1000 from Account A to Account B. This involves two operations:

Subtract 1000 from Account A
Add 1000 to Account B

What could go wrong? The system crashes after step 1 but before step 2. Now Rs 1000 has vanished into thin air. ACID prevents this.

A — Atomicity

All or nothing. Either the entire transaction succeeds, or the entire transaction is rolled back. No partial updates.

In our bank transfer: either BOTH the debit and credit happen, or NEITHER happens. If the system crashes mid-transaction, the database rolls back everything to the state before the transaction started.

Think of it like an “undo” button that triggers automatically on failure.

-- Atomicity in action
BEGIN;

UPDATE accounts SET balance = balance - 1000 WHERE id = 'A';
-- What if the system crashes RIGHT HERE?
-- With atomicity, the first UPDATE gets rolled back automatically.
UPDATE accounts SET balance = balance + 1000 WHERE id = 'B';

COMMIT;
-- Only at COMMIT do both changes become permanent

C — Consistency

The database must move from one valid state to another valid state. All rules, constraints, and triggers must be satisfied before and after the transaction.

If we have a rule that says “balance cannot be negative,” and Manish tries to transfer Rs 5000 from an account with only Rs 1000 — the entire transaction is rejected. The database stays consistent.

-- Consistency: constraints are always enforced
ALTER TABLE accounts ADD CONSTRAINT positive_balance CHECK (balance >= 0);

BEGIN;
UPDATE accounts SET balance = balance - 5000 WHERE id = 'A';
-- If Account A only has 1000, this violates the CHECK constraint
-- The transaction FAILS, nothing changes
COMMIT;
-- ERROR: new row violates check constraint "positive_balance"

I — Isolation

Concurrent transactions don’t interfere with each other. Each transaction runs as if it’s the only one running on the database, even if thousands are running simultaneously.

Without isolation, two people transferring money at the same time could see each other’s half-finished work, leading to incorrect balances.

-- Without isolation (the problem):
-- Transaction 1: reads balance of A = 5000
-- Transaction 2: reads balance of A = 5000
-- Transaction 1: sets A = 5000 - 1000 = 4000
-- Transaction 2: sets A = 5000 - 2000 = 3000
-- Result: A = 3000, but it should be 2000! (one debit was "lost")

-- With isolation, Transaction 2 waits until Transaction 1 finishes
-- or the database detects the conflict and handles it properly

The level of isolation can be tuned (we’ll cover isolation levels in a separate note). More isolation = safer but slower.

D — Durability

Once a transaction is committed, it’s permanent. Even if the server crashes, loses power, or catches fire — the committed data survives.

The database achieves this by writing to a Write-Ahead Log (WAL) before confirming the commit. Even if the main data files are corrupted, the WAL can replay the changes.

In simple language: once the database says “COMMIT successful,” we can trust that the data is safely stored. No take-backsies.

All Four Together

A — Atomicity

All or nothing. No partial updates.

Crash mid-transaction? Everything rolls back.

C — Consistency

Valid state to valid state.

All constraints and rules are respected.

I — Isolation

Transactions don't see each other's work.

Each runs as if it's the only one.

D — Durability

Committed = permanent.

Survives crashes via Write-Ahead Log.

A Full Transaction Example

-- Transfer Rs 1000 from Account A to Account B
BEGIN;  -- Start the transaction (Atomicity boundary)

-- Check balance first (Consistency)
DO $$
BEGIN
    IF (SELECT balance FROM accounts WHERE id = 'A') < 1000 THEN
        RAISE EXCEPTION 'Insufficient funds';
    END IF;
END $$;

-- Perform the transfer
UPDATE accounts SET balance = balance - 1000 WHERE id = 'A';
UPDATE accounts SET balance = balance + 1000 WHERE id = 'B';

COMMIT;  -- Make it permanent (Durability)
-- If anything failed above, ROLLBACK undoes everything (Atomicity)

Which Databases Support ACID?

Full ACID: PostgreSQL, MySQL (InnoDB), SQLite, Oracle, SQL Server
Document-level ACID: MongoDB (multi-document since v4.0)
Typically not ACID: Cassandra, DynamoDB, Redis (they follow BASE instead)
NewSQL (ACID + distributed): CockroachDB, TiDB, Google Spanner

Interview Tip

ACID is one of the most asked DBMS topics. Don’t just list the four words — explain each with the bank transfer example. Interviewers want to see that we understand why each property matters, not just what the acronym stands for.

References

BASE Properties

intermediate base eventual-consistency nosql distributed-systems

If ACID is the strict, rule-following accountant, BASE is the laid-back startup founder who says “it’ll all work out eventually.” And honestly? For many systems, that’s perfectly fine.

BASE is the consistency model used by most NoSQL and distributed databases. It prioritizes availability and partition tolerance over strict consistency.

The acronym stands for:

Basically Available
Soft state
Eventually consistent

Basically Available

The system guarantees a response to every request — even if the data might be slightly stale. It won’t just refuse to answer because one node is down.

In simple language: “We’ll always give you something, even if it’s not the absolute latest version.”

Compare this to ACID where a transaction might block or fail if it can’t guarantee perfect consistency.

Real-world example: When we scroll Instagram and the like count shows 1,042 instead of 1,043 (because the latest like hasn’t propagated yet), that’s basic availability in action. The app didn’t crash or show a loading spinner — it gave us a slightly stale number.

Soft State

The state of the system can change over time, even without new input. Background processes (replication, anti-entropy repairs) are constantly syncing data between nodes.

This is the opposite of ACID where the state only changes when we explicitly run a transaction.

Think of it like this: we write data to Node A, and for a brief moment, Node B has stale data. But behind the scenes, the system is syncing. The state is “soft” — it’s in flux until everything converges.

Eventually Consistent

Given enough time with no new updates, all replicas will converge to the same value. We just can’t guarantee it happens immediately.

This is the key trade-off. ACID says “after my transaction commits, everyone sees the new data instantly.” BASE says “everyone will see it… eventually.”

How long is “eventually”? In most modern systems, we’re talking milliseconds to a few seconds. It’s not like we wait days.

ACID vs BASE Comparison

ACID

BASE

Strong consistency

Eventual consistency

Pessimistic (lock first, ask later)

Optimistic (resolve conflicts later)

Availability can suffer under load

Prioritizes availability

Vertical scaling typical

Horizontal scaling native

PostgreSQL, MySQL, Oracle

Cassandra, DynamoDB, CouchDB

When BASE Makes Sense

Not everything needs bank-level consistency. Here are some cases where eventual consistency is totally fine:

Social media feeds — if a new post shows up 2 seconds late, nobody notices
Like/view counters — exact real-time accuracy doesn’t matter
Product catalog browsing — a price update can take a moment to propagate
DNS — the entire internet runs on eventual consistency (TTL-based caching)
Shopping cart — items might be slightly out of sync across devices for a moment

When BASE Does NOT Make Sense

Financial transactions — we absolutely need ACID here
Inventory/booking systems — double-selling a concert ticket is unacceptable
User authentication — a password change must be immediately consistent
Anything where “slightly wrong” causes real harm

How Eventual Consistency Works in Practice

-- Imagine a 3-node Cassandra cluster
-- We write: INSERT INTO posts (id, content) VALUES (1, 'Hello World');

-- Step 1: Write hits Node A (coordinator), saved locally
-- Step 2: Node A sends the write to Node B and Node C (async replication)
-- Step 3: Node A confirms success to the client

-- For a brief moment:
-- Node A: has the data ✓
-- Node B: syncing... (stale)
-- Node C: syncing... (stale)

-- After a few milliseconds, all three nodes have the data
-- That gap is "eventual consistency"

Tunable Consistency

Many NoSQL databases let us tune the consistency level per query. In Cassandra, for example:

ONE — read/write to one node (fastest, least consistent)
QUORUM — majority of nodes must agree (balanced)
ALL — every node must respond (slowest, most consistent)

So BASE isn’t always “fully eventually consistent.” We can dial it up when we need stronger guarantees for specific operations.

Interview Tip

When talking about BASE, always contrast it with ACID. Show that we understand it’s not “worse” — it’s a deliberate trade-off for distributed systems. The key insight: we sacrifice strict consistency to gain availability and partition tolerance, and for many use cases that’s the right call.

References

CAP Theorem

intermediate cap-theorem distributed-systems consistency availability partition-tolerance

The CAP theorem is one of the most famous (and most misunderstood) concepts in distributed systems. It was proposed by Eric Brewer in 2000 and formally proved in 2002.

In simple language: in a distributed system, when a network partition happens, we have to choose between consistency and availability. We can’t have both.

The Three Letters

C — Consistency: Every read receives the most recent write. All nodes see the same data at the same time. If we write “balance = 500” and immediately read, we get 500 — not some stale value.

A — Availability: Every request receives a response (even if it’s not the latest data). The system never refuses to answer.

P — Partition Tolerance: The system continues to work even when network messages between nodes are lost or delayed. In a distributed system, network partitions will happen — it’s not a matter of “if” but “when.”

The Triangle

Consistency

Availability

Partition Tolerance

Pick two. During a partition, we must choose C or A.

Why “Pick 2 out of 3”?

Here’s the key insight: partition tolerance isn’t really optional in a distributed system. Networks fail. Cables get cut. Data centers lose connectivity. It happens.

So the real choice is: when a partition happens, do we sacrifice Consistency or Availability?

CP (Consistency + Partition Tolerance): During a partition, the system refuses to respond rather than return stale data. We get correct data or no data.
AP (Availability + Partition Tolerance): During a partition, the system always responds, even if the data might be stale.
CA (Consistency + Availability): This only works on a single node (no partitions possible). The moment we go distributed, we lose this option.

Where Do Popular Databases Fall?

CP — Consistency + Partition Tolerance

These databases will reject requests or become read-only during a partition rather than serve stale data.

MongoDB (with majority write concern)
HBase
Redis Cluster (when a partition is detected, minority partition stops serving)
etcd / ZooKeeper (consensus-based)

AP — Availability + Partition Tolerance

These databases always respond, even if the data might be temporarily inconsistent.

Cassandra (tunable, but AP by default)
DynamoDB
CouchDB
Riak

CA — Consistency + Availability (Single Node Only)

Not truly distributed. Works only when there’s no possibility of a network partition.

Single-node PostgreSQL
Single-node MySQL
SQLite

A Practical Example

Imagine we have a distributed database with two nodes, Node A and Node B. A network cable between them gets cut (partition!).

A user writes balance = 500 to Node A. Now another user reads from Node B.

CP choice: Node B says “Sorry, I can’t serve this read because I haven’t synced with Node A. I might give you wrong data.” The system is consistent but unavailable.

AP choice: Node B says “Here’s the last value I have: balance = 300 (stale).” The system is available but inconsistent.

Neither is “wrong.” It depends on what we’re building.

-- CP behavior (MongoDB with majority read concern):
-- If the primary is unreachable, reads FAIL
-- db.accounts.find({id: "A"}).readConcern("majority")
-- Result during partition: Error - primary unavailable

-- AP behavior (Cassandra with consistency level ONE):
-- Reads succeed from any available node, even with stale data
-- SELECT balance FROM accounts WHERE id = 'A';  -- consistency ONE
-- Result during partition: Returns stale value (but doesn't fail!)

Common Misconceptions

”We have to give up one of the three permanently”

Nope. The trade-off only kicks in during a network partition. When everything is running smoothly (no partition), we can have all three. CAP is about what happens when things go wrong.

”CA databases exist in distributed systems”

Not really. Any multi-node system can experience a network partition. CA only works for single-node setups.

”CAP means NoSQL is better”

CAP doesn’t say one is better. It says distributed systems have trade-offs. Plenty of CP systems use SQL (CockroachDB, Google Spanner).

PACELC: The Extended Version

CAP only tells us what happens during a partition. But what about normal operation? The PACELC theorem extends CAP:

Partition: choose A (availability) or C (consistency)
Else (no partition): choose L (latency) or C (consistency)

For example, DynamoDB is PA/EL (available during partition, low latency normally). MongoDB is PC/EC (consistent during partition, consistent normally but with higher latency).

Interview Tip

CAP comes up in almost every system design interview. Don’t just recite the definition — give a concrete example like the two-node partition scenario above. Bonus points for mentioning that the trade-off only applies during a partition, and that PACELC gives us a more complete picture.

References

SQL & Querying

DDL, DML & DCL

beginner sql ddl dml dcl tcl create insert grant

SQL commands aren’t all the same kind. They fall into four categories based on what they do. Knowing these categories helps us understand what a SQL statement is actually doing at a higher level.

The Four Categories

DDL

Data Definition Language

Defines structure

CREATE, ALTER, DROP, TRUNCATE

DML

Data Manipulation Language

Works with data

SELECT, INSERT, UPDATE, DELETE

DCL

Data Control Language

Manages permissions

GRANT, REVOKE

TCL

Transaction Control Language

Controls transactions

BEGIN, COMMIT, ROLLBACK, SAVEPOINT

DDL — Data Definition Language

DDL commands define or modify the structure of our database — tables, columns, indexes, schemas. They don’t touch the actual data. Think of DDL as building the shelves, not putting books on them.

DDL commands are auto-committed. Once we run them, there’s no rolling back (in most databases).

CREATE

-- Create a new table
CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE,
    department_id INT,
    salary DECIMAL(10,2) DEFAULT 0,
    hired_at TIMESTAMP DEFAULT NOW()
);

-- Create an index
CREATE INDEX idx_employees_department ON employees(department_id);

-- Create a schema (namespace)
CREATE SCHEMA hr;

ALTER

-- Add a column
ALTER TABLE employees ADD COLUMN phone VARCHAR(20);

-- Change a column's data type
ALTER TABLE employees ALTER COLUMN salary TYPE NUMERIC(12,2);

-- Rename a column
ALTER TABLE employees RENAME COLUMN phone TO mobile;

-- Add a constraint
ALTER TABLE employees ADD CONSTRAINT fk_department
    FOREIGN KEY (department_id) REFERENCES departments(id);

-- Drop a column
ALTER TABLE employees DROP COLUMN mobile;

DROP

-- Drop a table entirely (structure + data, gone forever)
DROP TABLE employees;

-- Drop only if it exists (avoids error)
DROP TABLE IF EXISTS employees;

-- Drop a table and everything that depends on it
DROP TABLE departments CASCADE;

TRUNCATE

-- Remove ALL rows but keep the table structure
TRUNCATE TABLE employees;

-- Faster than DELETE because it doesn't log individual row deletions
-- Resets auto-increment counters
-- Cannot be rolled back in most databases (it's DDL, not DML)

The key difference: DROP removes the table itself. TRUNCATE empties it but keeps the structure. DELETE removes specific rows (and is DML, not DDL).

DML — Data Manipulation Language

DML commands work with the actual data — the rows inside our tables. These are the commands we use 95% of the time.

DML commands are transactional. We can wrap them in BEGIN/COMMIT and roll them back if something goes wrong.

SELECT

-- Basic query
SELECT name, email FROM employees WHERE department_id = 5;

-- With sorting and limiting
SELECT name, salary
FROM employees
ORDER BY salary DESC
LIMIT 10;

-- With joins (more on this in the Joins note)
SELECT e.name, d.name AS department
FROM employees e
JOIN departments d ON e.department_id = d.id;

INSERT

-- Insert a single row
INSERT INTO employees (name, email, department_id, salary)
VALUES ('Manish', 'manish@example.com', 3, 75000);

-- Insert multiple rows
INSERT INTO employees (name, email, department_id, salary)
VALUES
    ('Alice', 'alice@example.com', 1, 80000),
    ('Bob', 'bob@example.com', 2, 70000),
    ('Charlie', 'charlie@example.com', 1, 90000);

-- Insert from a query
INSERT INTO archived_employees
SELECT * FROM employees WHERE hired_at < '2020-01-01';

UPDATE

-- Update specific rows
UPDATE employees SET salary = 85000 WHERE id = 1;

-- Update with a calculation
UPDATE employees SET salary = salary * 1.10 WHERE department_id = 3;
-- 10% raise for everyone in department 3

-- Update with a JOIN (PostgreSQL syntax)
UPDATE employees e
SET department_id = d.id
FROM departments d
WHERE d.name = 'Engineering' AND e.name = 'Manish';

DELETE

-- Delete specific rows
DELETE FROM employees WHERE id = 5;

-- Delete with a condition
DELETE FROM employees WHERE hired_at < '2015-01-01';

-- Delete all rows (but table structure remains — unlike TRUNCATE, this is logged)
DELETE FROM employees;

DCL — Data Control Language

DCL manages who can do what in the database. It’s about permissions and access control.

-- Grant SELECT permission on a table
GRANT SELECT ON employees TO readonly_user;

-- Grant all permissions
GRANT ALL PRIVILEGES ON employees TO admin_user;

-- Grant permission on a schema
GRANT USAGE ON SCHEMA hr TO analyst_role;

-- Revoke a permission
REVOKE DELETE ON employees FROM intern_user;

-- Revoke all
REVOKE ALL PRIVILEGES ON employees FROM ex_employee;

In real projects, we usually manage permissions through roles, not individual users:

-- Create a role and assign permissions
CREATE ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO analyst;

-- Assign the role to a user
GRANT analyst TO manish;

TCL — Transaction Control Language

TCL commands manage transactions — grouping multiple DML operations into an all-or-nothing unit.

-- Start a transaction
BEGIN;

-- Do some work
UPDATE accounts SET balance = balance - 1000 WHERE id = 1;
UPDATE accounts SET balance = balance + 1000 WHERE id = 2;

-- Save a checkpoint (we can rollback to here instead of the beginning)
SAVEPOINT after_transfer;

-- Do more work that might fail
UPDATE audit_log SET last_transfer = NOW() WHERE account_id = 1;

-- Oops, something went wrong? Rollback to the savepoint
ROLLBACK TO SAVEPOINT after_transfer;

-- Or commit everything
COMMIT;

DROP vs TRUNCATE vs DELETE

This comparison comes up a lot in interviews:

	DROP	TRUNCATE	DELETE
Type	DDL	DDL	DML
Removes	Table + data	All rows	Specific rows
Rollback?	No	No (usually)	Yes
WHERE clause?	No	No	Yes
Speed	Fast	Very fast	Slow (row-by-row)
Triggers fire?	No	No	Yes
Resets auto-increment?	N/A	Yes	No

Interview Tip

Interviewers love asking “What’s the difference between DELETE and TRUNCATE?” or “Is TRUNCATE DDL or DML?” The answer is that TRUNCATE is DDL (it’s a structural operation that deallocates pages), DELETE is DML (it logs each row deletion and fires triggers). Knowing this distinction shows we understand what’s happening under the hood.

References

Joins

beginner sql joins inner-join left-join right-join cross-join

Joins are how we combine data from two or more tables based on a related column. This is one of the most fundamental operations in SQL, and easily the most asked topic in database interviews.

Let’s set up our example tables first:

-- Employees table
-- id | name    | dept_id
-- 1  | Alice   | 1
-- 2  | Bob     | 2
-- 3  | Charlie | NULL    (no department assigned)
-- 4  | Diana   | 1

-- Departments table
-- id | name
-- 1  | Engineering
-- 2  | Marketing
-- 3  | HR          (no employees in this dept)

Visual Overview

INNER JOIN

Only matching rows from both

LEFT JOIN

All left rows + matching right

RIGHT JOIN

All right rows + matching left

FULL OUTER JOIN

All rows from both tables

INNER JOIN

Returns only the rows where there’s a match in both tables. If an employee has no department, or a department has no employees — they’re excluded.

SELECT e.name AS employee, d.name AS department
FROM employees e
INNER JOIN departments d ON e.dept_id = d.id;

-- Result:
-- employee | department
-- Alice    | Engineering
-- Bob      | Marketing
-- Diana    | Engineering

-- Charlie is excluded (dept_id is NULL, no match)
-- HR is excluded (no employees have dept_id = 3)

This is the most common join. When people just say “JOIN” without a prefix, they usually mean INNER JOIN.

LEFT JOIN (LEFT OUTER JOIN)

Returns all rows from the left table, plus matching rows from the right table. If there’s no match, the right side gets NULLs.

Think of it like: “Give me all employees, and their department if they have one.”

SELECT e.name AS employee, d.name AS department
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id;

-- Result:
-- employee | department
-- Alice    | Engineering
-- Bob      | Marketing
-- Charlie  | NULL          ← included! (no matching dept)
-- Diana    | Engineering

LEFT JOIN is incredibly useful for finding “missing” data:

-- Find employees with no department
SELECT e.name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.id
WHERE d.id IS NULL;
-- Returns: Charlie

RIGHT JOIN (RIGHT OUTER JOIN)

The mirror of LEFT JOIN. Returns all rows from the right table, plus matching rows from the left. If there’s no match, the left side gets NULLs.

SELECT e.name AS employee, d.name AS department
FROM employees e
RIGHT JOIN departments d ON e.dept_id = d.id;

-- Result:
-- employee | department
-- Alice    | Engineering
-- Diana    | Engineering
-- Bob      | Marketing
-- NULL     | HR            ← included! (no employees in HR)

In practice, most developers just use LEFT JOIN and flip the table order instead of using RIGHT JOIN. It reads more naturally.

FULL OUTER JOIN

Returns all rows from both tables. Matching rows are combined, non-matching rows get NULLs on the missing side.

SELECT e.name AS employee, d.name AS department
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.id;

-- Result:
-- employee | department
-- Alice    | Engineering
-- Bob      | Marketing
-- Charlie  | NULL          ← no department
-- Diana    | Engineering
-- NULL     | HR            ← no employees

Note: MySQL doesn’t support FULL OUTER JOIN directly. We’d simulate it with a UNION of LEFT and RIGHT JOIN.

CROSS JOIN

Returns the cartesian product — every row from the left table paired with every row from the right table. No ON condition needed.

If table A has 4 rows and table B has 3 rows, the result has 4 x 3 = 12 rows.

SELECT e.name, d.name
FROM employees e
CROSS JOIN departments d;

-- Every employee paired with every department (12 rows)
-- Alice    | Engineering
-- Alice    | Marketing
-- Alice    | HR
-- Bob      | Engineering
-- Bob      | Marketing
-- Bob      | HR
-- ... and so on

Cross joins are rarely used in practice, but they’re useful for generating combinations (like all dates crossed with all products for a report).

SELF JOIN

A table joined with itself. We use aliases to treat the same table as two different tables. Super useful for hierarchical data.

-- Find employees and their managers
-- (assuming employees table has a manager_id column)
SELECT
    e.name AS employee,
    m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

-- Result:
-- employee | manager
-- Alice    | Bob       (Alice reports to Bob)
-- Bob      | NULL      (Bob is the top boss)
-- Charlie  | Alice     (Charlie reports to Alice)

Multiple Joins

We often chain joins together to pull data from 3+ tables:

-- Get order details with customer and product info
SELECT
    c.name AS customer,
    p.name AS product,
    o.quantity,
    o.total_price
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
INNER JOIN products p ON o.product_id = p.id
WHERE o.created_at > '2024-01-01'
ORDER BY o.created_at DESC;

JOIN Performance Tips

Always join on indexed columns — joining on non-indexed columns causes full table scans.
Use INNER JOIN when possible — it’s typically faster than OUTER JOINs because the optimizer has more flexibility.
Filter early — put WHERE conditions to reduce rows before the join, not after.
Avoid joining too many tables — each join multiplies complexity. If we’re joining 7+ tables, we might want to rethink the query or use CTEs.

Interview Tip

The classic interview question is “explain the different types of joins.” Draw the Venn diagram on a whiteboard, use a concrete example like employees/departments, and show the actual result sets. Knowing what each join returns (and what it excludes) is more impressive than reciting definitions.

References

Subqueries & CTEs

intermediate sql subquery cte exists correlated-subquery recursive-cte

A subquery is simply a query inside another query. Instead of hardcoding a value, we compute it on the fly. And a CTE (Common Table Expression) is a way to write subqueries that are actually readable.

Types of Subqueries

Scalar Subquery — Returns a Single Value

This goes wherever a single value is expected — in a SELECT, WHERE, or HAVING clause.

-- Find employees who earn more than the average salary
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

-- The subquery returns one number (e.g., 75000)
-- The outer query then filters: WHERE salary > 75000

Row Subquery — Returns a Single Row

Returns one row with multiple columns. Useful for comparing against a set of values.

-- Find the employee with the highest salary in each department
SELECT name, salary, department_id
FROM employees
WHERE (department_id, salary) IN (
    SELECT department_id, MAX(salary)
    FROM employees
    GROUP BY department_id
);

Table Subquery — Returns Multiple Rows

Returns a full result set that we can use as a virtual table.

-- Use a subquery as a temporary table in FROM
SELECT dept_name, avg_salary
FROM (
    SELECT d.name AS dept_name, AVG(e.salary) AS avg_salary
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    GROUP BY d.name
) AS dept_stats
WHERE avg_salary > 80000;

Correlated vs Non-Correlated

This distinction is crucial and comes up in interviews a lot.

Non-Correlated Subquery

The inner query runs once, independently of the outer query. It doesn’t reference any columns from the outer query.

-- Non-correlated: inner query runs ONCE
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

-- The database computes the average once, then uses that value for every row

Correlated Subquery

The inner query runs once for each row of the outer query. It references a column from the outer query. This makes it slower but more powerful.

-- Correlated: inner query runs ONCE PER ROW
-- Find employees who earn more than their department's average
SELECT e.name, e.salary, e.department_id
FROM employees e
WHERE e.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.department_id = e.department_id  -- references outer query!
);

-- For each employee, the database computes their department's average
-- Then checks if that employee's salary is above it

In simple language: if the subquery can run on its own (paste it in a SQL editor and it works), it’s non-correlated. If it references the outer query and can’t run alone, it’s correlated.

EXISTS vs IN

Both check if something exists, but they work differently.

IN — Checks Against a List

-- Find employees in departments that are in New York
SELECT name FROM employees
WHERE department_id IN (
    SELECT id FROM departments WHERE location = 'New York'
);

-- The subquery returns a list of IDs: (1, 3, 7)
-- Then it checks: WHERE department_id IN (1, 3, 7)

EXISTS — Checks for Existence

-- Same logic, different approach
SELECT e.name FROM employees e
WHERE EXISTS (
    SELECT 1 FROM departments d
    WHERE d.id = e.department_id AND d.location = 'New York'
);

-- For each employee, it asks: "Does a matching department exist?"
-- Returns TRUE/FALSE, doesn't actually return the rows

When to Use Which?

Use IN when the subquery returns a small result set. It materializes the whole list first.
Use EXISTS when the subquery might return a large result set. It stops as soon as it finds one match (short-circuits).
EXISTS is usually faster for correlated subqueries with large tables.
IN can have issues with NULL values (NULL IN (…) returns NULL, not FALSE).

-- Gotcha with IN and NULLs
SELECT name FROM employees
WHERE department_id NOT IN (SELECT id FROM departments);
-- If departments has a NULL id, this returns NOTHING!
-- Because NOT IN with NULLs evaluates to UNKNOWN

-- EXISTS doesn't have this problem
SELECT e.name FROM employees e
WHERE NOT EXISTS (
    SELECT 1 FROM departments d WHERE d.id = e.department_id
);
-- Works correctly even with NULLs

CTEs — Common Table Expressions

CTEs let us name a subquery and reference it by name. They use the WITH keyword and make complex queries way more readable.

Think of it like: “Let me define a temporary result, give it a name, and use it below.”

-- Without CTE (messy nested subqueries)
SELECT dept_name, avg_salary
FROM (
    SELECT d.name AS dept_name, AVG(e.salary) AS avg_salary
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    GROUP BY d.name
) AS dept_stats
WHERE avg_salary > (SELECT AVG(salary) FROM employees);

-- With CTE (clean and readable)
WITH dept_stats AS (
    SELECT d.name AS dept_name, AVG(e.salary) AS avg_salary
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    GROUP BY d.name
),
company_avg AS (
    SELECT AVG(salary) AS avg_salary FROM employees
)
SELECT ds.dept_name, ds.avg_salary
FROM dept_stats ds, company_avg ca
WHERE ds.avg_salary > ca.avg_salary;

We can chain multiple CTEs and each one can reference the previous ones.

Recursive CTEs

Recursive CTEs are powerful for hierarchical data — org charts, folder structures, category trees.

They have two parts:

Base case — the starting rows
Recursive case — how to find the next level

-- Find all subordinates of a manager (org chart traversal)
WITH RECURSIVE subordinates AS (
    -- Base case: start with the manager
    SELECT id, name, manager_id, 1 AS level
    FROM employees
    WHERE id = 1  -- Start from employee #1

    UNION ALL

    -- Recursive case: find people who report to current level
    SELECT e.id, e.name, e.manager_id, s.level + 1
    FROM employees e
    INNER JOIN subordinates s ON e.manager_id = s.id
)
SELECT name, level FROM subordinates ORDER BY level;

-- Result:
-- name    | level
-- Alice   | 1       (the manager)
-- Bob     | 2       (reports to Alice)
-- Charlie | 2       (reports to Alice)
-- Diana   | 3       (reports to Bob)

Another classic example — generating a series of dates:

-- Generate all dates in January 2024
WITH RECURSIVE dates AS (
    SELECT DATE '2024-01-01' AS dt
    UNION ALL
    SELECT dt + INTERVAL '1 day'
    FROM dates
    WHERE dt < '2024-01-31'
)
SELECT dt FROM dates;

CTE vs Subquery — When to Use What

	Subquery	CTE
Readability	Can get messy when nested	Clean and named
Reusability	Must duplicate if used twice	Define once, use many times
Recursion	Not possible	Supported with RECURSIVE
Performance	Sometimes faster (inlined)	Sometimes materialized (check EXPLAIN)

In general: use CTEs for readability and when we need to reference the same derived table more than once. Use subqueries for simple, one-off filters.

Interview Tip

If asked to write a complex SQL query in an interview, use CTEs liberally. It shows clean thinking and makes the query much easier for the interviewer to follow. Also, knowing the difference between correlated and non-correlated subqueries is a common “gotcha” question — always mention the performance implication (correlated = once per row).

References

Aggregations & GROUP BY

beginner sql aggregation group-by having count sum avg

Aggregate functions take a bunch of rows and squash them down into a single value. Think of it like summarizing — instead of looking at 1,000 individual salaries, we want one number: the average.

The Five Core Aggregates

-- Sample data: employees table
-- name    | department | salary
-- Alice   | Eng        | 90000
-- Bob     | Eng        | 80000
-- Charlie | Marketing  | 70000
-- Diana   | Marketing  | 75000
-- Eve     | Eng        | 85000

SELECT
    COUNT(*) AS total_employees,     -- 5
    SUM(salary) AS total_payroll,    -- 400000
    AVG(salary) AS avg_salary,       -- 80000
    MIN(salary) AS lowest_salary,    -- 70000
    MAX(salary) AS highest_salary    -- 90000
FROM employees;

Quick Notes on COUNT

-- COUNT(*) counts ALL rows (including NULLs)
SELECT COUNT(*) FROM employees;  -- 5

-- COUNT(column) counts non-NULL values in that column
SELECT COUNT(manager_id) FROM employees;  -- might be 4 if one is NULL

-- COUNT(DISTINCT column) counts unique non-NULL values
SELECT COUNT(DISTINCT department) FROM employees;  -- 2 (Eng, Marketing)

GROUP BY — Splitting Into Buckets

Without GROUP BY, aggregates work on the entire table. With GROUP BY, we split rows into groups and aggregate each group separately.

Think of it like sorting physical papers into piles by department, then counting each pile.

-- Average salary PER department
SELECT
    department,
    COUNT(*) AS headcount,
    AVG(salary) AS avg_salary,
    MAX(salary) AS top_salary
FROM employees
GROUP BY department;

-- Result:
-- department | headcount | avg_salary | top_salary
-- Eng        | 3         | 85000      | 90000
-- Marketing  | 2         | 72500      | 75000

The Golden Rule of GROUP BY

Every column in SELECT must either be:

In the GROUP BY clause, OR
Inside an aggregate function

This is the rule that trips up beginners:

-- This FAILS (name is not grouped or aggregated)
SELECT name, department, AVG(salary)
FROM employees
GROUP BY department;
-- ERROR: column "name" must appear in GROUP BY or be in an aggregate

-- This WORKS
SELECT department, AVG(salary)
FROM employees
GROUP BY department;

Why? Because if we’re grouping by department, there are multiple names per group. The database doesn’t know which name to show.

HAVING vs WHERE

This is probably the most asked question about GROUP BY. Here’s the simple answer:

WHERE filters rows before grouping
HAVING filters groups after grouping

All Rows

→

WHERE

Filter rows

→

GROUP BY

Make groups

→

HAVING

Filter groups

→

Result

-- WHERE: filter individual rows BEFORE grouping
-- "Only look at employees hired after 2020"
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE hired_at > '2020-01-01'    -- filters ROWS first
GROUP BY department;

-- HAVING: filter groups AFTER grouping
-- "Only show departments where the average salary is above 80K"
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 80000;      -- filters GROUPS

-- Both together:
-- "Among employees hired after 2020, show departments with avg salary > 80K"
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE hired_at > '2020-01-01'
GROUP BY department
HAVING AVG(salary) > 80000;

The key thing: WHERE cannot use aggregate functions because it runs before grouping. HAVING can.

-- This FAILS
SELECT department, COUNT(*) FROM employees
WHERE COUNT(*) > 2 GROUP BY department;
-- ERROR: aggregate functions not allowed in WHERE

-- This WORKS
SELECT department, COUNT(*) FROM employees
GROUP BY department
HAVING COUNT(*) > 2;

SQL Execution Order

Understanding the order SQL actually executes helps a lot:

FROM — pick the table(s)
WHERE — filter individual rows
GROUP BY — form groups
HAVING — filter groups
SELECT — pick columns and compute aggregates
ORDER BY — sort the results
LIMIT — cap the output

This is why we can’t use a column alias from SELECT in a WHERE clause — SELECT hasn’t run yet!

-- This FAILS in most databases
SELECT department, AVG(salary) AS avg_sal
FROM employees
GROUP BY department
HAVING avg_sal > 80000;  -- Can't use alias in HAVING (in strict SQL)

-- This WORKS
SELECT department, AVG(salary) AS avg_sal
FROM employees
GROUP BY department
HAVING AVG(salary) > 80000;  -- Repeat the expression
-- (PostgreSQL and MySQL are more lenient here, but standard SQL requires this)

Practical Examples

-- Top 3 departments by headcount
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department
ORDER BY headcount DESC
LIMIT 3;

-- Departments where everyone earns at least 60K
SELECT department, MIN(salary) AS min_salary
FROM employees
GROUP BY department
HAVING MIN(salary) >= 60000;

-- Monthly revenue breakdown
SELECT
    DATE_TRUNC('month', order_date) AS month,
    COUNT(*) AS total_orders,
    SUM(amount) AS revenue,
    AVG(amount) AS avg_order_value
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;

Grouping by Multiple Columns

We can group by more than one column. Each unique combination of values becomes a group.

-- Average salary by department AND job level
SELECT department, job_level, AVG(salary) AS avg_salary
FROM employees
GROUP BY department, job_level
ORDER BY department, job_level;

-- Result:
-- department | job_level | avg_salary
-- Eng        | Junior    | 70000
-- Eng        | Senior    | 95000
-- Marketing  | Junior    | 60000
-- Marketing  | Senior    | 85000

Interview Tip

The WHERE vs HAVING question is guaranteed in DBMS interviews. The one-line answer: “WHERE filters rows before grouping, HAVING filters groups after grouping.” But go further — explain that WHERE can’t use aggregates because it runs before GROUP BY in the execution order. That shows deep understanding.

References

Window Functions

intermediate sql window-functions rank row-number partition-by lag lead

Window functions are one of the most powerful features in SQL, and once we learn them, we wonder how we ever lived without them. They let us perform calculations across a set of rows without collapsing them into a single row like GROUP BY does.

In simple language: GROUP BY squashes rows together. Window functions compute values across rows but keep every row in the result.

The Syntax

Every window function uses the OVER() clause:

function_name() OVER (
    PARTITION BY column    -- divide rows into groups (optional)
    ORDER BY column        -- order within each group (optional)
)

PARTITION BY vs GROUP BY

This is the fundamental difference to understand:

-- GROUP BY: collapses rows (one row per department)
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
-- Result: 2 rows (one per department)

-- PARTITION BY: keeps all rows, adds the aggregate as a new column
SELECT
    name,
    department,
    salary,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employees;
-- Result: all 5 rows, each with their department's average salary
-- name    | department | salary | dept_avg
-- Alice   | Eng        | 90000  | 85000
-- Bob     | Eng        | 80000  | 85000
-- Eve     | Eng        | 85000  | 85000
-- Charlie | Marketing  | 70000  | 72500
-- Diana   | Marketing  | 75000  | 72500

See how every row is preserved? That’s the magic.

Ranking Functions

ROW_NUMBER()

Assigns a unique sequential number to each row within a partition. No ties — if two rows have the same value, they still get different numbers.

-- Number employees by salary within each department
SELECT
    name, department, salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM employees;

-- Result:
-- name    | department | salary | rn
-- Alice   | Eng        | 90000  | 1
-- Eve     | Eng        | 85000  | 2
-- Bob     | Eng        | 80000  | 3
-- Diana   | Marketing  | 75000  | 1
-- Charlie | Marketing  | 70000  | 2

Super useful for “top N per group” queries:

-- Top 2 highest paid employees per department
WITH ranked AS (
    SELECT
        name, department, salary,
        ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
    FROM employees
)
SELECT name, department, salary
FROM ranked
WHERE rn <= 2;

RANK()

Like ROW_NUMBER, but ties get the same rank. After a tie, it skips numbers.

-- If Alice and Eve both earn 90000:
-- RANK: 1, 1, 3 (skips 2)
SELECT
    name, salary,
    RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employees;
-- name    | salary | rnk
-- Alice   | 90000  | 1
-- Eve     | 90000  | 1   ← same rank (tie)
-- Bob     | 80000  | 3   ← skips 2!

DENSE_RANK()

Same as RANK, but doesn’t skip numbers after ties.

-- DENSE_RANK: 1, 1, 2 (no gaps)
SELECT
    name, salary,
    DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk
FROM employees;
-- name    | salary | drnk
-- Alice   | 90000  | 1
-- Eve     | 90000  | 1   ← same rank (tie)
-- Bob     | 80000  | 2   ← no gap!

Salary ROW_NUMBER RANK DENSE_RANK

90000111

90000211

8000033 ← skip2 ← no skip

70000443

LAG and LEAD — Looking at Neighboring Rows

LAG looks at the previous row. LEAD looks at the next row. These are incredibly useful for comparisons over time.

-- Compare each month's revenue with the previous month
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month,
    revenue - LAG(revenue) OVER (ORDER BY month) AS change
FROM monthly_revenue;

-- Result:
-- month   | revenue | prev_month | change
-- Jan     | 10000   | NULL       | NULL
-- Feb     | 12000   | 10000      | 2000
-- Mar     | 11000   | 12000      | -1000
-- Apr     | 15000   | 11000      | 4000

-- LAG with a default value (avoid NULLs for the first row)
LAG(revenue, 1, 0) OVER (ORDER BY month)
--            ^  ^
--            |  └── default value if no previous row
--            └── how many rows back to look

Running Totals with SUM() OVER()

This is a classic use case — compute a cumulative sum as we move through rows.

-- Running total of daily sales
SELECT
    order_date,
    amount,
    SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders;

-- Result:
-- order_date | amount | running_total
-- Jan 1      | 100    | 100
-- Jan 2      | 250    | 350
-- Jan 3      | 175    | 525
-- Jan 4      | 300    | 825

We can also do running totals per group:

-- Running total per department
SELECT
    department,
    order_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY department
        ORDER BY order_date
    ) AS dept_running_total
FROM orders;

Moving Averages with Frame Clauses

We can define a “window frame” — a sliding window of rows:

-- 3-day moving average
SELECT
    order_date,
    amount,
    AVG(amount) OVER (
        ORDER BY order_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  -- current + 2 previous
    ) AS moving_avg_3day
FROM orders;

Common Real-World Patterns

-- Percentage of total
SELECT
    department,
    salary,
    salary * 100.0 / SUM(salary) OVER () AS pct_of_total
FROM employees;

-- Find duplicates (keep one, mark the rest)
WITH numbered AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM users
)
DELETE FROM users WHERE id IN (
    SELECT id FROM numbered WHERE rn > 1
);

Interview Tip

Window functions are a favorite topic in SQL interviews because they separate people who know basic SQL from those who know it well. Practice the “top N per group” pattern (ROW_NUMBER + CTE + WHERE rn <= N) — it comes up constantly. Also, be ready to explain the difference between ROW_NUMBER, RANK, and DENSE_RANK with a tie scenario.

References

Views & Materialized Views

intermediate sql views materialized-views performance caching

A view is a saved SQL query that we can use like a table. It doesn’t store any data — it just runs the query every time we access it. A materialized view actually stores the result on disk, like a cached snapshot.

Think of it like this: a view is a saved recipe (we cook fresh every time), and a materialized view is a meal-prepped container (already cooked, just reheat).

Regular Views

Creating a View

-- Create a view that shows employee details with department names
CREATE VIEW employee_details AS
SELECT
    e.id,
    e.name,
    e.email,
    e.salary,
    d.name AS department
FROM employees e
LEFT JOIN departments d ON e.department_id = d.id;

-- Now we can query it like a regular table
SELECT * FROM employee_details WHERE department = 'Engineering';

-- Under the hood, the database runs the full JOIN query every time

Why Use Views?

Simplify complex queries — write the complex JOIN once, use the simple view name everywhere
Security — expose only certain columns to certain users (hide salary, SSN, etc.)
Abstraction — if the underlying table structure changes, we update the view definition and all dependent queries keep working
Consistency — everyone uses the same business logic (e.g., “active users” always means the same thing)

-- Security: create a view that hides sensitive data
CREATE VIEW public_employees AS
SELECT id, name, department_id
FROM employees;
-- No salary, no email, no SSN

GRANT SELECT ON public_employees TO analyst_role;
-- Analysts can query this view but not the underlying table

Updating Through Views

Simple views can sometimes be updated (INSERT, UPDATE, DELETE through the view), but it only works if the view maps directly to a single underlying table without aggregations or JOINs.

-- This view is updatable (simple, single table)
CREATE VIEW active_users AS
SELECT * FROM users WHERE is_active = true;

-- We can insert through it
INSERT INTO active_users (name, email, is_active)
VALUES ('Manish', 'manish@example.com', true);

-- WITH CHECK OPTION prevents inserting rows that don't match the view's filter
CREATE VIEW active_users AS
SELECT * FROM users WHERE is_active = true
WITH CHECK OPTION;

-- This would FAIL (violates the WHERE condition)
INSERT INTO active_users (name, email, is_active)
VALUES ('Ghost', 'ghost@example.com', false);

Dropping and Modifying Views

-- Replace an existing view (or create if it doesn't exist)
CREATE OR REPLACE VIEW employee_details AS
SELECT e.id, e.name, e.salary, d.name AS department
FROM employees e
JOIN departments d ON e.department_id = d.id;

-- Drop a view
DROP VIEW IF EXISTS employee_details;

Materialized Views

A materialized view stores the query result physically on disk. When we query it, the database reads from the stored result — no recomputation needed.

This is huge for performance when we have expensive queries (complex joins, aggregations over millions of rows) that don’t need to be real-time.

Creating a Materialized View

-- Create a materialized view for a slow analytics query
CREATE MATERIALIZED VIEW monthly_revenue AS
SELECT
    DATE_TRUNC('month', order_date) AS month,
    COUNT(*) AS total_orders,
    SUM(amount) AS revenue,
    AVG(amount) AS avg_order_value
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;

-- This query runs ONCE and stores the result
-- Subsequent reads are instant — just reading from stored data
SELECT * FROM monthly_revenue WHERE month >= '2024-01-01';

Refreshing Materialized Views

The data in a materialized view gets stale. We need to refresh it periodically.

-- Full refresh: recomputes the entire view
REFRESH MATERIALIZED VIEW monthly_revenue;

-- Concurrent refresh: allows reads during refresh (needs a UNIQUE index)
CREATE UNIQUE INDEX idx_monthly_revenue_month ON monthly_revenue(month);
REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_revenue;

Refresh Strategies

There are several ways to keep materialized views up to date:

Manual refresh — trigger it from our app or a cron job after a data load
Scheduled refresh — use pg_cron or a similar scheduler to refresh every N minutes/hours
On-demand refresh — refresh when a user requests the dashboard (with a staleness check)
Trigger-based — use a database trigger to refresh after inserts (careful with performance)

-- Example: refresh every hour using pg_cron (PostgreSQL extension)
-- SELECT cron.schedule('refresh-monthly-revenue', '0 * * * *',
--     'REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_revenue');

View vs Materialized View

View

Materialized View

Storage

No data stored

Result stored on disk

Freshness

Always current

Stale until refreshed

Read Speed

Slow (runs query)

Fast (reads stored data)

Use Case

Abstraction, security

Expensive queries, dashboards

When to Use Each

Use a regular view when:

We want to simplify a complex query for reuse
We need real-time, always-current data
We want to restrict column access for security
The underlying query is fast enough

Use a materialized view when:

The query is expensive (complex joins, heavy aggregations)
We can tolerate slightly stale data
It’s for dashboards, reports, or analytics
The underlying data doesn’t change every second

Interview Tip

Know the difference between views and materialized views cold — it’s a common question. The key distinction: views are virtual (re-run every time), materialized views are physical (stored, need refreshing). Bonus: mention REFRESH MATERIALIZED VIEW CONCURRENTLY as a way to avoid blocking reads during refresh.

References

Stored Procedures & Triggers

intermediate sql stored-procedures triggers plpgsql orm

Stored procedures and triggers let us run logic inside the database rather than in our application code. They’ve been around for decades, and while modern apps don’t use them as heavily as before, they’re still important to understand.

Stored Procedures

A stored procedure is a block of SQL code that we save in the database and call by name. Think of it like a function — it takes parameters, does some work, and optionally returns results.

Why Use Them?

Reduce network roundtrips — instead of sending 10 queries from our app, send one procedure call
Enforce business logic at the database level — the logic runs no matter which app connects
Security — grant users permission to execute a procedure without giving them direct table access
Reusability — write once, call from anywhere

Basic Syntax (PostgreSQL)

-- Create a stored procedure to transfer funds
CREATE OR REPLACE PROCEDURE transfer_funds(
    sender_id INT,
    receiver_id INT,
    amount DECIMAL
)
LANGUAGE plpgsql
AS $$
BEGIN
    -- Debit the sender
    UPDATE accounts SET balance = balance - amount
    WHERE id = sender_id;

    -- Check if sender had enough balance
    IF NOT FOUND OR (SELECT balance FROM accounts WHERE id = sender_id) < 0 THEN
        RAISE EXCEPTION 'Insufficient funds';
    END IF;

    -- Credit the receiver
    UPDATE accounts SET balance = balance + amount
    WHERE id = receiver_id;

    -- Log the transaction
    INSERT INTO transaction_log (from_id, to_id, amount, created_at)
    VALUES (sender_id, receiver_id, amount, NOW());
END;
$$;

-- Call the procedure
CALL transfer_funds(1, 2, 500.00);

Functions vs Procedures

In PostgreSQL, there’s a distinction:

-- FUNCTION: returns a value, can be used in SELECT
CREATE OR REPLACE FUNCTION get_employee_count(dept_id INT)
RETURNS INT
LANGUAGE plpgsql
AS $$
DECLARE
    emp_count INT;
BEGIN
    SELECT COUNT(*) INTO emp_count
    FROM employees WHERE department_id = dept_id;
    RETURN emp_count;
END;
$$;

-- Use in a query
SELECT department_id, get_employee_count(department_id) FROM departments;

-- PROCEDURE: doesn't return a value, called with CALL
-- Can manage transactions (COMMIT/ROLLBACK inside)
CALL transfer_funds(1, 2, 500.00);

Triggers

A trigger is a stored procedure that fires automatically when a specific event happens on a table — INSERT, UPDATE, or DELETE. We don’t call it manually; it runs on its own.

BEFORE vs AFTER Triggers

BEFORE triggers run before the operation. We can modify the data or cancel the operation entirely.
AFTER triggers run after the operation. Good for logging, auditing, or cascading updates.

-- Trigger function: automatically set updated_at on every UPDATE
CREATE OR REPLACE FUNCTION update_timestamp()
RETURNS TRIGGER
LANGUAGE plpgsql
AS $$
BEGIN
    NEW.updated_at = NOW();  -- NEW refers to the row being inserted/updated
    RETURN NEW;
END;
$$;

-- Attach the trigger to the employees table
CREATE TRIGGER set_updated_at
    BEFORE UPDATE ON employees        -- fires BEFORE every UPDATE
    FOR EACH ROW                      -- runs once per row
    EXECUTE FUNCTION update_timestamp();

-- Now any UPDATE on employees automatically sets updated_at
UPDATE employees SET salary = 90000 WHERE id = 1;
-- updated_at is automatically set to NOW() — we didn't have to include it

INSERT, UPDATE, DELETE Triggers

-- Audit log: track who deleted what
CREATE OR REPLACE FUNCTION log_deletion()
RETURNS TRIGGER
LANGUAGE plpgsql
AS $$
BEGIN
    INSERT INTO audit_log (table_name, record_id, action, old_data, deleted_at)
    VALUES (
        TG_TABLE_NAME,           -- name of the table that fired the trigger
        OLD.id,                  -- OLD refers to the row being deleted/updated
        'DELETE',
        row_to_json(OLD),        -- save the entire deleted row as JSON
        NOW()
    );
    RETURN OLD;
END;
$$;

-- Fire AFTER delete on any important table
CREATE TRIGGER audit_employee_delete
    AFTER DELETE ON employees
    FOR EACH ROW
    EXECUTE FUNCTION log_deletion();

Conditional Triggers

We can add a WHEN clause to only fire the trigger under certain conditions:

-- Only fire when salary actually changes (not on every update)
CREATE TRIGGER salary_change_alert
    AFTER UPDATE ON employees
    FOR EACH ROW
    WHEN (OLD.salary IS DISTINCT FROM NEW.salary)
    EXECUTE FUNCTION notify_salary_change();

Common Trigger Use Cases

Auto-timestamps — set created_at on INSERT, updated_at on UPDATE
Audit logging — track all changes to sensitive tables
Validation — reject invalid data before it’s written
Cascading updates — update a denormalized counter when related data changes
Notifications — send a NOTIFY event when something changes (PostgreSQL)

The Downsides

Stored procedures and triggers have fallen out of favor in modern application development. Here’s why:

Hidden logic — business rules live in two places (app + database). Debugging is harder.
Hard to test — unit testing a trigger is much harder than testing application code.
Version control — database code is harder to track in git compared to app code.
Portability — PL/pgSQL doesn’t work in MySQL. T-SQL doesn’t work in PostgreSQL. Stored procedure languages are vendor-specific.
Scaling — putting logic in the database means the database does more work. It’s harder to scale a database than stateless app servers.
ORM friction — ORMs (Prisma, SQLAlchemy, Sequelize) manage schema and queries from the application layer, making stored procedures redundant for most use cases.

Why Modern Apps Use ORMs Instead

ORMs like Prisma, SQLAlchemy, and Sequelize handle most of what stored procedures used to do — all from application code:

-- Instead of a stored procedure for creating an order,
-- the application code handles it:
--
-- async function createOrder(userId, items) {
--   const tx = await prisma.$transaction([
--     prisma.order.create({ data: { userId, total } }),
--     prisma.inventory.updateMany({ ... }),
--     prisma.auditLog.create({ data: { ... } }),
--   ]);
--   return tx;
-- }
--
-- Benefits: testable, in git, uses the same language as the rest of the app

When Stored Procedures/Triggers Still Make Sense

Despite the trend away from them, there are valid use cases:

Multi-application databases — when multiple apps share one database, enforcing rules at the database level ensures consistency
Performance-critical operations — eliminating network roundtrips matters for bulk operations
Audit requirements — regulatory compliance may require database-level audit triggers that can’t be bypassed by application code
Auto-timestamps — even ORM-heavy apps often use triggers for updated_at because it catches manual SQL updates too

Interview Tip

Interviewers often ask “What are triggers? Give an example.” The updated_at auto-timestamp trigger is the perfect example — it’s simple, practical, and universally understood. If they ask about pros and cons, mentioning the testability and portability issues shows we’ve thought about this in a real-world context, not just theoretically.

References

Schema Design & Normalization

ER Diagrams

beginner er-diagram schema-design cardinality relationships entities

Before we write a single line of SQL, we need a blueprint. That blueprint is the Entity-Relationship (ER) Diagram. It’s a visual representation of our data model — what entities exist, what attributes they have, and how they relate to each other.

Think of it like an architect’s floor plan. We don’t start building walls before we know where the rooms, doors, and hallways go.

The Three Building Blocks

1. Entities

An entity is a “thing” we want to store data about. Users, products, orders, departments — these are all entities. Each entity becomes a table in our database.

2. Attributes

Attributes are the properties of an entity. A User entity might have: id, name, email, created_at. Each attribute becomes a column in the table.

Primary Key — uniquely identifies each row (usually id)
Required vs Optional — NOT NULL vs nullable
Derived — computed from other attributes (like age from birth_date)

3. Relationships

How entities connect to each other. A User places Orders. An Order contains Products. These connections are the relationships.

A Simple ER Model

Users

PK id

name

created_at

——————

places

——————

Orders

PK id

FK user_id

total

status

ordered_at

——————

contains

——————

Products

PK id

name

price

Cardinality — How Many on Each Side?

Cardinality describes how many instances of one entity relate to another. There are three main types:

One-to-One (1:1)

Each row in Table A relates to exactly one row in Table B, and vice versa.

Example: Each user has exactly one profile. Each profile belongs to exactly one user.

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE user_profiles (
    id SERIAL PRIMARY KEY,
    user_id INT UNIQUE REFERENCES users(id),  -- UNIQUE makes it 1:1
    bio TEXT,
    avatar_url VARCHAR(500)
);

One-to-one relationships are rare. They’re usually used to split a wide table into two (for performance or security reasons).

One-to-Many (1:N)

One row in Table A can relate to many rows in Table B, but each row in Table B relates to only one row in Table A.

Example: One department has many employees. Each employee belongs to one department.

CREATE TABLE departments (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    department_id INT REFERENCES departments(id)  -- FK on the "many" side
);

This is the most common relationship type. The foreign key always goes on the “many” side.

Many-to-Many (M:N)

Rows in Table A can relate to multiple rows in Table B, and vice versa.

Example: One order can contain many products. One product can appear in many orders.

We can’t directly represent this with a foreign key. We need a junction table (also called a bridge table or join table).

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INT REFERENCES users(id),
    total DECIMAL(10,2)
);

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    price DECIMAL(10,2)
);

-- Junction table: connects orders and products
CREATE TABLE order_items (
    id SERIAL PRIMARY KEY,
    order_id INT REFERENCES orders(id),
    product_id INT REFERENCES products(id),
    quantity INT DEFAULT 1,
    unit_price DECIMAL(10,2)  -- price at time of purchase (denormalized)
);

Reading an ER Diagram

When we see an ER diagram, here’s how to translate it:

Each box becomes a table
Each attribute becomes a column
Lines between boxes become foreign keys
The “1” side doesn’t need anything special
The “N” side gets the foreign key column
“M:N” relationships need a junction table

Common ER Diagram Notations

There are several notation styles. The two most common are:

Chen notation — uses diamonds for relationships, ovals for attributes (more academic)
Crow’s foot notation — uses fork symbols to show cardinality (more practical, industry standard)

In crow’s foot notation:

A single line means “one”
A fork (crow’s foot) means “many”
A circle means “optional” (zero)
A dash means “mandatory” (one)

Practical Tips for Drawing ER Diagrams

Start with entities — identify the nouns (User, Product, Order, Payment)
Add relationships — identify the verbs (places, contains, pays for)
Determine cardinality — ask “can one X have many Y?” for both directions
Add attributes — what do we need to know about each entity?
Identify keys — every entity needs a primary key
Normalize — check for redundancy and split tables if needed

Tools for Drawing ER Diagrams

dbdiagram.io — write schema in code, get a visual diagram (free, great for quick work)
Lucidchart — drag and drop, good for presentations
draw.io (diagrams.net) — free, versatile, works offline
pgAdmin — can generate ER diagrams from existing PostgreSQL databases

Interview Tip

If an interviewer asks us to design a schema, always start with an ER diagram (even a rough one on the whiteboard). It shows structured thinking. Identify entities first, then relationships, then attributes. Don’t jump straight into writing CREATE TABLE statements.

References

Normalization (1NF-BCNF)

intermediate normalization 1nf 2nf 3nf bcnf functional-dependency

Normalization is the process of organizing a database to reduce redundancy and prevent update anomalies. We break one messy table into multiple clean tables, each representing a single concept.

Think of it like organizing a messy closet. Everything is shoved in one pile (the flat table), and we’re sorting it into separate drawers (normalized tables) so things don’t get lost or duplicated.

Why Normalize?

Without normalization, we get three kinds of anomalies:

Insert anomaly — we can’t add data without unrelated data. (Can’t add a new department without assigning an employee to it.)
Update anomaly — changing one fact requires updating multiple rows. (Department name changes? Update every employee row.)
Delete anomaly — deleting data accidentally removes unrelated data. (Delete the last employee in a department? The department itself disappears.)

The Starting Point: A Flat Table

Let’s normalize this mess step by step:

-- The "everything in one table" approach
-- student_id | student_name | course_id | course_name | instructor | instructor_phone
-- 1          | Alice        | CS101     | Databases   | Dr. Smith  | 555-1234
-- 1          | Alice        | CS102     | Networks    | Dr. Jones  | 555-5678
-- 2          | Bob          | CS101     | Databases   | Dr. Smith  | 555-1234
-- 2          | Bob          | CS103     | OS          | Dr. Smith  | 555-1234
-- 3          | Charlie      | CS102     | Networks    | Dr. Jones  | 555-5678

Problems everywhere: Alice’s name is stored twice. Dr. Smith’s phone number is stored three times. Change one, forget the other — data becomes inconsistent.

The Normalization Progression

UNF

Unnormalized

→

1NF

Atomic values

→

2NF

No partial deps

→

3NF

No transitive deps

→

BCNF

Stricter 3NF

Each level builds on the previous. Most real databases aim for 3NF.

1NF — First Normal Form

Rule: Every cell must contain a single, atomic value. No lists, no repeating groups, no arrays stuffed into one column.

-- VIOLATES 1NF (courses is a comma-separated list)
-- student_id | student_name | courses
-- 1          | Alice        | CS101, CS102

-- SATISFIES 1NF (one value per cell, one row per enrollment)
-- student_id | student_name | course_id
-- 1          | Alice        | CS101
-- 1          | Alice        | CS102

Also: every table must have a primary key. In our example, the primary key is the combination (student_id, course_id).

2NF — Second Normal Form

Rule: Must be in 1NF AND no partial dependencies. Every non-key column must depend on the entire primary key, not just part of it.

This only matters when we have a composite primary key (two or more columns). If our primary key is a single column, we’re automatically in 2NF.

Look at our table with composite key (student_id, course_id):

-- student_id | student_name | course_id | course_name | instructor
-- PK: (student_id, course_id)

-- student_name depends on student_id ALONE (partial dependency!)
-- course_name depends on course_id ALONE (partial dependency!)
-- Both violate 2NF because they don't need the FULL composite key

To fix: move partially dependent columns to their own tables.

-- Students table (student_name depends only on student_id)
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    student_name VARCHAR(100)
);

-- Courses table (course_name, instructor depend only on course_id)
CREATE TABLE courses (
    course_id VARCHAR(10) PRIMARY KEY,
    course_name VARCHAR(100),
    instructor VARCHAR(100)
);

-- Enrollments table (the relationship)
CREATE TABLE enrollments (
    student_id INT REFERENCES students(student_id),
    course_id VARCHAR(10) REFERENCES courses(course_id),
    PRIMARY KEY (student_id, course_id)
);

Now each non-key column depends on the full primary key of its table.

3NF — Third Normal Form

Rule: Must be in 2NF AND no transitive dependencies. Non-key columns must depend directly on the primary key, not through another non-key column.

In simple language: if A determines B, and B determines C, then C shouldn’t be in the same table as A. Move C to B’s table.

-- Our courses table still has a problem:
-- course_id → instructor → instructor_phone
-- instructor_phone depends on instructor, NOT on course_id directly
-- That's a transitive dependency!

-- course_id | course_name | instructor | instructor_phone
-- CS101     | Databases   | Dr. Smith  | 555-1234
-- CS103     | OS          | Dr. Smith  | 555-1234   ← phone duplicated!

To fix: move instructor details to their own table.

-- Instructors table
CREATE TABLE instructors (
    instructor_id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    phone VARCHAR(20)
);

-- Courses table (references instructor by FK)
CREATE TABLE courses (
    course_id VARCHAR(10) PRIMARY KEY,
    course_name VARCHAR(100),
    instructor_id INT REFERENCES instructors(instructor_id)
);

Now instructor_phone depends directly on instructor_id in its own table. No transitive dependency.

BCNF — Boyce-Codd Normal Form

Rule: Must be in 3NF AND every determinant must be a candidate key. This is a stricter version of 3NF that handles some edge cases.

In simple language: for every functional dependency X -> Y, X must be a superkey (capable of being a primary key).

BCNF and 3NF are the same in most practical cases. The difference only shows up with overlapping composite candidate keys — which is pretty rare.

-- A rare BCNF violation example:
-- A student can only have one advisor per subject
-- An advisor teaches only one subject

-- student | subject | advisor
-- Alice   | DB      | Dr. Smith     (advisor → subject)
-- Bob     | DB      | Dr. Smith
-- Alice   | Net     | Dr. Jones

-- Candidate keys: (student, subject) and (student, advisor)
-- advisor → subject is a dependency where advisor is NOT a superkey
-- This violates BCNF!

-- Fix: split into two tables
-- student_advisor: student | advisor
-- advisor_subject:  advisor | subject

The Full Normalized Schema

After all that work, our original flat table is now four clean tables:

CREATE TABLE students (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE instructors (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    phone VARCHAR(20)
);

CREATE TABLE courses (
    id VARCHAR(10) PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    instructor_id INT REFERENCES instructors(id)
);

CREATE TABLE enrollments (
    student_id INT REFERENCES students(id),
    course_id VARCHAR(10) REFERENCES courses(id),
    enrolled_at TIMESTAMP DEFAULT NOW(),
    PRIMARY KEY (student_id, course_id)
);

No redundancy. Change an instructor’s phone in one place, it’s updated everywhere. Delete a student, courses still exist. Add a new course without needing a student. All anomalies gone.

How Far Should We Normalize?

Most real-world databases aim for 3NF. Going beyond that (BCNF, 4NF, 5NF) is rarely needed and can make queries slower due to excessive JOINs.

The practical rule: normalize until the redundancy is gone, then stop. If performance suffers from too many JOINs, we can strategically denormalize (which we cover in the next note).

Interview Tip

Normalization is a classic DBMS interview topic. Walk through the example step by step: start with a messy flat table, explain the anomalies, then normalize through 1NF, 2NF, and 3NF. The key phrases to remember: “atomic values” (1NF), “no partial dependencies” (2NF), “no transitive dependencies” (3NF). If asked about BCNF, mention it’s a stricter 3NF where every determinant must be a candidate key.

References

Denormalization

intermediate denormalization performance schema-design redundancy analytics

We just spent an entire note learning how to normalize — remove redundancy, split tables, eliminate anomalies. And now we’re going to talk about intentionally adding redundancy back in. Welcome to denormalization.

In simple language: denormalization is the trade-off of faster reads at the cost of slower writes and more storage. We duplicate data so we don’t have to JOIN five tables for a simple query.

Why Denormalize?

Normalization is great for data integrity. But it creates a lot of tables, which means a lot of JOINs. JOINs are expensive. When our app needs to serve a dashboard that joins 8 tables with millions of rows, normalization can kill performance.

Denormalization helps when:

Read performance matters more than write performance (most web apps)
Query complexity is getting out of hand (too many JOINs)
Analytics/reporting needs fast aggregations
The data doesn’t change often (or we can tolerate slightly stale data)

The key insight: normalize first, then denormalize strategically where needed. Never start denormalized — we’ll end up with a mess.

Common Denormalization Patterns

1. Duplicated Fields

Store a frequently accessed value from a related table directly in the current table.

-- Normalized: need to JOIN to get the department name
SELECT o.id, o.total, c.name AS customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

-- Denormalized: customer name stored directly on the order
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);

-- Now we can query without a JOIN
SELECT id, total, customer_name FROM orders;

-- Trade-off: if the customer changes their name, we need to
-- update it in both the customers table AND every order

2. Precomputed/Cached Counters

Instead of running COUNT(*) every time, maintain a counter column.

-- Normalized: count posts every time (slow on large tables)
SELECT u.name, COUNT(p.id) AS post_count
FROM users u
LEFT JOIN posts p ON p.user_id = u.id
GROUP BY u.name;

-- Denormalized: maintain a counter
ALTER TABLE users ADD COLUMN post_count INT DEFAULT 0;

-- Increment on new post
UPDATE users SET post_count = post_count + 1 WHERE id = :user_id;

-- Decrement on deleted post
UPDATE users SET post_count = post_count - 1 WHERE id = :user_id;

-- Now the count is instant — no JOIN, no aggregation
SELECT name, post_count FROM users WHERE id = 1;

This is extremely common. Think of follower counts on social media — nobody runs COUNT(*) on a 100M row followers table for every profile view.

3. Summary Tables

Pre-aggregate data into a separate table for reporting.

-- Instead of computing daily revenue from millions of orders each time:
CREATE TABLE daily_revenue (
    date DATE PRIMARY KEY,
    total_orders INT,
    total_revenue DECIMAL(12,2),
    avg_order_value DECIMAL(10,2)
);

-- Populate it nightly (or in real-time with triggers)
INSERT INTO daily_revenue (date, total_orders, total_revenue, avg_order_value)
SELECT
    DATE(ordered_at),
    COUNT(*),
    SUM(total),
    AVG(total)
FROM orders
WHERE DATE(ordered_at) = CURRENT_DATE - 1
GROUP BY DATE(ordered_at);

4. Computed Columns

Store a value that could be derived, to avoid computing it every time.

-- Instead of computing full_name every time:
ALTER TABLE employees ADD COLUMN full_name VARCHAR(200);

-- Or instead of computing age from birth_date:
-- (This is a bad example actually — age changes daily,
-- so we'd just compute it. But for things like total_price
-- on an order line item, it makes sense.)

ALTER TABLE order_items ADD COLUMN total_price DECIMAL(10,2);
-- total_price = quantity * unit_price
-- Computed once at insert time, read millions of times

5. Embedding/Flattening Relationships

Instead of a junction table, store related data directly (common in NoSQL but applicable to SQL too).

-- Instead of: orders → order_items → products (3 tables, 2 JOINs)
-- Store order items as JSON:
ALTER TABLE orders ADD COLUMN items JSONB;

-- Insert:
-- UPDATE orders SET items = '[
--   {"product_id": 1, "name": "Widget", "qty": 2, "price": 9.99},
--   {"product_id": 3, "name": "Gadget", "qty": 1, "price": 24.99}
-- ]' WHERE id = 1;

-- Query without JOINs:
SELECT id, total, items FROM orders WHERE id = 1;

The Trade-offs

Benefits

Costs

Faster reads (fewer JOINs)

Slower writes (update multiple places)

Simpler queries

Data inconsistency risk

Great for dashboards/reports

More storage used

Reduces database load

More complex app logic to keep in sync

When Denormalization Makes Sense

Read-heavy workloads — 99% reads, 1% writes (most web apps)
Analytics and reporting — dashboards that aggregate millions of rows
Caching layers — materialized views and summary tables act as “database-level caches”
NoSQL databases — document stores are denormalized by design (embed related data)
Search/listing pages — product listings, search results, feeds

When It Doesn’t Make Sense

Write-heavy workloads — every write needs to update multiple places
Frequently changing data — keeping denormalized copies in sync becomes a nightmare
Small datasets — JOINs on small tables are fast. Don’t optimize prematurely.
Data integrity is critical — banking, healthcare, anything where inconsistency is unacceptable

How to Keep Denormalized Data in Sync

Triggers — automatically update denormalized fields when source data changes
Application logic — update both the source and the copy in the same transaction
Background jobs — periodically refresh summary tables (materialized views)
Event-driven — publish change events, subscribers update denormalized copies

Interview Tip

Interviewers love asking “when would you denormalize?” The answer is always: “when read performance is more important than write simplicity, and we’ve already normalized properly.” Give a concrete example like cached counters (follower count, like count) or summary tables for dashboards. Show that we understand it’s a deliberate trade-off, not laziness.

References

Relationships & Keys

beginner primary-key foreign-key composite-key uuid referential-integrity cascade

Keys are the backbone of relational databases. They uniquely identify rows, link tables together, and enforce data integrity. Without keys, a relational database is just a bunch of spreadsheets with no connections.

Primary Keys

A primary key uniquely identifies each row in a table. No two rows can have the same primary key, and it can never be NULL.

Natural Keys vs Surrogate Keys

A natural key uses a real-world value that’s already unique — like an email address or a Social Security Number.

A surrogate key is an artificial value we generate — like an auto-incrementing integer or a UUID.

-- Natural key: using email (it's already unique)
CREATE TABLE users (
    email VARCHAR(255) PRIMARY KEY,
    name VARCHAR(100)
);

-- Surrogate key: auto-incrementing integer
CREATE TABLE users (
    id SERIAL PRIMARY KEY,  -- 1, 2, 3, 4...
    email VARCHAR(255) UNIQUE,
    name VARCHAR(100)
);

-- Surrogate key: UUID
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE,
    name VARCHAR(100)
);

Auto-increment vs UUID — When to Use Which?

Auto-increment (SERIAL)

UUID

Small (4 or 8 bytes)

Large (16 bytes)

Sequential (great for B-tree indexes)

Random (can fragment indexes)

Predictable (security concern in URLs)

Unpredictable (safe in URLs)

Single DB only (conflicts in distributed)

Works in distributed systems

Rule of thumb: Use auto-increment for internal IDs. Use UUIDs when IDs are exposed in URLs or when working with distributed databases.

Foreign Keys

A foreign key is a column that references the primary key of another table. It’s how we create relationships between tables and enforce referential integrity.

Referential integrity means: if a row in the orders table says user_id = 5, there MUST be a user with id = 5 in the users table. The database enforces this automatically.

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INT NOT NULL,
    total DECIMAL(10,2),
    -- Foreign key constraint
    CONSTRAINT fk_user FOREIGN KEY (user_id) REFERENCES users(id)
);

-- This WORKS (user 1 exists)
INSERT INTO users (id, name) VALUES (1, 'Manish');
INSERT INTO orders (user_id, total) VALUES (1, 99.99);

-- This FAILS (user 999 doesn't exist)
INSERT INTO orders (user_id, total) VALUES (999, 49.99);
-- ERROR: insert or update violates foreign key constraint "fk_user"

ON DELETE Behavior

What happens when we delete a parent row that has child rows referencing it? We have several options:

-- RESTRICT (default): prevent deletion if references exist
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE RESTRICT
-- "Can't delete this user — they have orders!"

-- CASCADE: delete child rows too
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
-- Delete the user → all their orders are automatically deleted too

-- SET NULL: set the FK to NULL
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE SET NULL
-- Delete the user → orders remain but user_id becomes NULL

-- SET DEFAULT: set the FK to its default value
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE SET DEFAULT
-- Delete the user → user_id is set to whatever DEFAULT is defined

Similarly, ON UPDATE CASCADE is useful when the referenced key might change:

-- If user's id changes, update all orders that reference it
FOREIGN KEY (user_id) REFERENCES users(id)
    ON DELETE CASCADE
    ON UPDATE CASCADE;

Which ON DELETE to Use?

CASCADE — when child data is meaningless without the parent (order items when an order is deleted)
RESTRICT — when deletion should be blocked (can’t delete a department if employees are assigned)
SET NULL — when the relationship is optional (the comment still exists, but the author is gone)

Composite Keys

A composite key is a primary key made of two or more columns together. Neither column is unique on its own, but the combination is.

-- A student can enroll in many courses
-- A course can have many students
-- The combination (student_id, course_id) is unique

CREATE TABLE enrollments (
    student_id INT REFERENCES students(id),
    course_id INT REFERENCES courses(id),
    enrolled_at TIMESTAMP DEFAULT NOW(),
    grade CHAR(2),
    PRIMARY KEY (student_id, course_id)  -- composite key
);

-- Student 1 can enroll in Course 101 and Course 102
-- But Student 1 can't enroll in Course 101 twice

Junction Tables for Many-to-Many

We can’t directly represent a many-to-many relationship with a foreign key. We need a junction table (also called a bridge table, join table, or linking table).

-- Students ←→ Courses (many-to-many)
-- A student takes many courses. A course has many students.

CREATE TABLE students (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE courses (
    id SERIAL PRIMARY KEY,
    title VARCHAR(200)
);

-- Junction table: the "glue" between them
CREATE TABLE student_courses (
    student_id INT REFERENCES students(id) ON DELETE CASCADE,
    course_id INT REFERENCES courses(id) ON DELETE CASCADE,
    enrolled_at TIMESTAMP DEFAULT NOW(),
    PRIMARY KEY (student_id, course_id)
);

-- Enroll student 1 in courses 1 and 2
INSERT INTO student_courses (student_id, course_id) VALUES (1, 1);
INSERT INTO student_courses (student_id, course_id) VALUES (1, 2);

-- Find all courses for a student
SELECT c.title
FROM courses c
JOIN student_courses sc ON c.id = sc.course_id
WHERE sc.student_id = 1;

Junction tables can also carry extra data about the relationship itself — like when the enrollment happened, what grade was received, etc.

Unique Constraints

Besides the primary key, we often need other columns to be unique:

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE,              -- single column unique
    username VARCHAR(50) UNIQUE,
    UNIQUE (first_name, last_name)          -- composite unique (rare)
);

The difference between a UNIQUE constraint and a PRIMARY KEY:

A table can have many UNIQUE constraints but only one PRIMARY KEY
UNIQUE columns can be NULL (and multiple NULLs are allowed in most databases)
PRIMARY KEY = UNIQUE + NOT NULL

Check Constraints

While we’re talking about constraints, CHECK constraints are worth mentioning:

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    price DECIMAL(10,2) CHECK (price > 0),           -- must be positive
    discount DECIMAL(3,2) CHECK (discount BETWEEN 0 AND 1),  -- 0-100%
    status VARCHAR(20) CHECK (status IN ('active', 'archived', 'draft'))
);

Interview Tip

When designing a schema in an interview, always start by identifying the relationships and their cardinalities. For 1:N, put the FK on the “many” side. For M:N, create a junction table. Know the ON DELETE options (CASCADE, RESTRICT, SET NULL) and when to use each. And be ready to explain natural vs surrogate keys — most interviewers have a preference and like to debate it.

References

Schema Design Patterns

intermediate schema-design star-schema snowflake-schema eav soft-delete polymorphic

Schema design isn’t just about normalization. There are well-known patterns that solve specific problems — analytics, flexibility, deletion, and polymorphism. Let’s walk through the most important ones.

Star Schema

The star schema is the bread and butter of data warehousing and analytics. It has two types of tables:

Fact table — stores the events/transactions (sales, clicks, orders). Contains foreign keys to dimension tables and numeric measures (amount, quantity, duration).
Dimension tables — stores the context (who, what, where, when). Contains descriptive attributes.

The fact table sits in the center, surrounded by dimension tables — like a star.

dim_date

year, month, quarter, day_of_week

dim_customer

name, city, segment

fact_sales

FK: date_id, customer_id,
product_id, store_id

quantity, amount, discount

dim_product

name, category, brand

dim_store

name, city, region, country

-- Fact table: one row per sale
CREATE TABLE fact_sales (
    id SERIAL PRIMARY KEY,
    date_id INT REFERENCES dim_date(id),
    customer_id INT REFERENCES dim_customer(id),
    product_id INT REFERENCES dim_product(id),
    store_id INT REFERENCES dim_store(id),
    quantity INT,
    amount DECIMAL(10,2),
    discount DECIMAL(5,2)
);

-- Dimension table: date details (precomputed for fast queries)
CREATE TABLE dim_date (
    id INT PRIMARY KEY,           -- e.g., 20240315 for March 15, 2024
    full_date DATE,
    year INT,
    quarter INT,
    month INT,
    month_name VARCHAR(20),
    day_of_week VARCHAR(10),
    is_weekend BOOLEAN
);

-- Queries are fast — just one JOIN per dimension
SELECT
    d.year, d.month_name,
    p.category,
    SUM(f.amount) AS total_revenue
FROM fact_sales f
JOIN dim_date d ON f.date_id = d.id
JOIN dim_product p ON f.product_id = p.id
GROUP BY d.year, d.month_name, p.category;

When to use: Data warehouses, BI dashboards, analytics systems. Denormalized by design for query speed.

Snowflake Schema

A snowflake schema is like a star schema, but the dimension tables are normalized — they reference additional lookup tables.

Instead of dim_product having a category column, it has a category_id that references a separate dim_category table.

-- Star: product has category as a string
-- dim_product: id, name, category, brand

-- Snowflake: product references a category table
CREATE TABLE dim_category (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    department VARCHAR(100)
);

CREATE TABLE dim_product (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    category_id INT REFERENCES dim_category(id),
    brand VARCHAR(100)
);

Star vs Snowflake: Star is simpler and faster (fewer JOINs). Snowflake saves storage and avoids redundancy. Most data warehouses prefer star for its simplicity.

EAV — Entity-Attribute-Value

The EAV pattern stores data as rows of (entity, attribute, value) triples instead of fixed columns. It’s the most flexible schema possible — we can add any attribute to any entity without changing the table structure.

-- Traditional approach: every product type needs different columns
-- Electronics: screen_size, battery, processor
-- Clothing: size, color, material
-- Books: author, isbn, pages
-- Maintaining one table per type (or one wide table) gets messy fast

-- EAV approach: three columns handle everything
CREATE TABLE product_attributes (
    id SERIAL PRIMARY KEY,
    product_id INT REFERENCES products(id),
    attribute_name VARCHAR(100),   -- "screen_size", "color", "author"
    attribute_value TEXT            -- everything stored as text
);

-- Insert attributes for a laptop
INSERT INTO product_attributes (product_id, attribute_name, attribute_value) VALUES
(1, 'screen_size', '15.6'),
(1, 'processor', 'M2'),
(1, 'ram', '16GB');

-- Insert attributes for a t-shirt
INSERT INTO product_attributes (product_id, attribute_name, attribute_value) VALUES
(2, 'size', 'L'),
(2, 'color', 'Blue'),
(2, 'material', 'Cotton');

The downsides of EAV are significant:

No type safety (everything is a string)
Can’t add constraints (can’t enforce “screen_size must be a number”)
Queries become ugly (need to PIVOT to get a row-per-entity view)
Performance suffers at scale

When to use: Truly dynamic attributes where the schema is unknowable upfront. Product catalogs with wildly different product types. User-defined custom fields. Consider JSONB columns in PostgreSQL as a better modern alternative.

Polymorphic Associations

When multiple tables need to relate to the same table. For example: both posts and comments can have “likes.” Instead of a separate likes table for each, we use one table with a type column.

-- Polymorphic likes table
CREATE TABLE likes (
    id SERIAL PRIMARY KEY,
    user_id INT REFERENCES users(id),
    likeable_type VARCHAR(50) NOT NULL,  -- 'post' or 'comment'
    likeable_id INT NOT NULL,            -- id in the respective table
    created_at TIMESTAMP DEFAULT NOW()
);

-- Like a post
INSERT INTO likes (user_id, likeable_type, likeable_id) VALUES (1, 'post', 42);

-- Like a comment
INSERT INTO likes (user_id, likeable_type, likeable_id) VALUES (1, 'comment', 99);

-- Get all likes on a post
SELECT * FROM likes WHERE likeable_type = 'post' AND likeable_id = 42;

The problem: We can’t use a foreign key constraint because likeable_id could reference different tables. Referential integrity relies on application code.

Alternatives:

Separate tables: post_likes and comment_likes (cleaner, FK constraints work)
Shared parent: make both posts and comments inherit from a content table

Soft Deletes

Instead of actually deleting a row with DELETE, we mark it as deleted with a timestamp column. The data stays in the table but is hidden from normal queries.

-- Add a soft delete column
ALTER TABLE users ADD COLUMN deleted_at TIMESTAMP DEFAULT NULL;

-- "Delete" a user (soft)
UPDATE users SET deleted_at = NOW() WHERE id = 5;

-- All queries must filter out deleted rows
SELECT * FROM users WHERE deleted_at IS NULL;

-- Create a view for convenience
CREATE VIEW active_users AS
SELECT * FROM users WHERE deleted_at IS NULL;

-- To "undelete" — just clear the timestamp
UPDATE users SET deleted_at = NULL WHERE id = 5;

Benefits:

Data is recoverable (undo delete)
Audit trail (when was it deleted?)
Referential integrity preserved (foreign keys don’t break)
Legal/compliance requirements (retain data for N years)

Downsides:

Every query needs the WHERE deleted_at IS NULL filter (easy to forget)
Table grows forever (need periodic cleanup)
Unique constraints get tricky (two “deleted” users can have the same email)

-- Fix the unique constraint issue with a partial unique index
CREATE UNIQUE INDEX idx_users_email_active
    ON users (email)
    WHERE deleted_at IS NULL;
-- Only enforces uniqueness among non-deleted rows

When to Use Each Pattern

Pattern	Best For
Star Schema	Data warehouses, analytics, BI dashboards
Snowflake Schema	When storage matters, highly normalized analytics
EAV	User-defined fields, wildly varying attributes
Polymorphic	Multiple entity types sharing a feature (likes, comments, tags)
Soft Deletes	Audit requirements, undo capability, data retention compliance

Interview Tip

Schema design patterns come up in system design interviews. When designing a data model, mention the pattern by name — “I’d use a star schema for the analytics layer” or “I’d implement soft deletes here for audit compliance.” It shows we’ve seen real production systems and know the trade-offs of each approach.

References

Database Migrations

intermediate migrations schema-changes flyway prisma zero-downtime expand-contract

A database migration is a versioned, trackable change to our database schema. Instead of manually running ALTER TABLE on production (please don’t), we write migration files that describe exactly what changed and in what order.

Think of it like git for our database schema. Each migration is a commit — we can see the history, roll forward, and roll back.

Why We Need Migrations

Without migrations, schema changes become chaos:

“Did you run the ALTER TABLE on production?”
“Which version of the schema does this server have?”
“Someone added a column but didn’t tell the team.”
“The staging database has a different schema than production.”

Migrations solve all of this by making schema changes code. They live in git, get reviewed in PRs, and run in order.

Up and Down Migrations

Every migration has two parts:

Up — applies the change (add column, create table, create index)
Down — reverses the change (drop column, drop table, drop index)

-- Migration: 001_create_users_table.up.sql
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Migration: 001_create_users_table.down.sql
DROP TABLE IF EXISTS users;

-- Migration: 002_add_phone_to_users.up.sql
ALTER TABLE users ADD COLUMN phone VARCHAR(20);

-- Migration: 002_add_phone_to_users.down.sql
ALTER TABLE users DROP COLUMN phone;

The migration tool tracks which migrations have been applied (usually in a schema_migrations table) and only runs the ones that haven’t been applied yet.

How Migration Tools Work

001

create_users

applied

→

002

add_phone

applied

→

003

add_orders

pending

→

004

future

not yet written

The tool runs 003 (pending) and skips 001, 002 (already applied)

The migration tool:

Checks the schema_migrations table to see what’s been run
Finds new migration files that haven’t been applied
Runs them in order
Records each successful migration

Popular Migration Tools

Tool	Language/Framework	Notes
Flyway	Java (works with any DB)	Industry standard, plain SQL files
Liquibase	Java (works with any DB)	XML/YAML/JSON or SQL, rollback support
Alembic	Python (SQLAlchemy)	Auto-generates migrations from model changes
Prisma Migrate	Node.js (Prisma ORM)	Schema-first, generates SQL from `.prisma` file
Knex.js	Node.js	JavaScript migration files, up/down functions
golang-migrate	Go	CLI tool, supports many databases
Rails Migrations	Ruby on Rails	The original — popularized the pattern

-- Example: Knex.js migration file
-- exports.up = function(knex) {
--   return knex.schema.createTable('users', (table) => {
--     table.increments('id');
--     table.string('name').notNullable();
--     table.string('email').unique().notNullable();
--     table.timestamps(true, true);
--   });
-- };
--
-- exports.down = function(knex) {
--   return knex.schema.dropTable('users');
-- };

Zero-Downtime Migrations

The hardest part of migrations isn’t writing them — it’s running them on a production database with live traffic and zero downtime.

The Expand-Contract Pattern

This is the gold standard for zero-downtime schema changes. It works in three phases:

Phase 1: Expand — add the new structure without removing the old one

-- We want to rename "name" to "full_name"
-- DON'T: ALTER TABLE users RENAME COLUMN name TO full_name;
-- (This breaks all queries using "name" instantly)

-- DO: Add the new column alongside the old one
ALTER TABLE users ADD COLUMN full_name VARCHAR(100);

-- Backfill existing data
UPDATE users SET full_name = name WHERE full_name IS NULL;

Phase 2: Migrate — update application code to use the new column. Deploy the app.

Phase 3: Contract — once all code uses the new column, remove the old one.

-- Only after the app is fully migrated:
ALTER TABLE users DROP COLUMN name;

Adding a Column

Adding a nullable column is usually safe. Adding a NOT NULL column or a column with a DEFAULT can lock the table in some databases.

-- Safe: add a nullable column (instant in PostgreSQL 11+)
ALTER TABLE users ADD COLUMN bio TEXT;

-- Dangerous in older databases: adding with DEFAULT
-- This rewrites every row in the table!
ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';

-- Safe alternative: add nullable first, then backfill
ALTER TABLE users ADD COLUMN status VARCHAR(20);
UPDATE users SET status = 'active' WHERE status IS NULL;
ALTER TABLE users ALTER COLUMN status SET DEFAULT 'active';
ALTER TABLE users ALTER COLUMN status SET NOT NULL;

Adding an Index

Creating an index on a large table can lock it for minutes or hours. Use CONCURRENTLY:

-- Blocks reads/writes during creation (bad!)
CREATE INDEX idx_users_email ON users(email);

-- Doesn't block other operations (good! But slower to build)
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

Common Pitfalls

1. Dropping a Column That’s Still Being Read

If the old code is still running when we drop a column, queries break. Always deploy the code change first, then drop the column in a later migration.

2. Renaming a Table or Column

A rename is effectively a drop + create. The old name stops working instantly. Use the expand-contract pattern instead.

3. Not Testing Migrations on a Copy of Production

A migration that works on a dev database with 100 rows might lock a production table with 50 million rows for 30 minutes. Always test with production-like data volumes.

4. Making Migrations Non-Idempotent

A migration should be safe to run twice. Use IF NOT EXISTS and IF EXISTS:

-- Idempotent: safe to run multiple times
CREATE TABLE IF NOT EXISTS users (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100)
);

ALTER TABLE users ADD COLUMN IF NOT EXISTS phone VARCHAR(20);

DROP INDEX IF EXISTS idx_users_email;

5. Mixing Schema Changes and Data Changes

Keep schema migrations and data migrations separate. Schema changes are usually fast and reversible. Data backfills can be slow and should run as background jobs.

Migration Best Practices

One change per migration — don’t create 5 tables in one file
Always write a down migration — even if we think we’ll never need it
Never modify an already-applied migration — create a new one instead
Test on staging first — with production-like data volumes
Use transactions — wrap the migration in BEGIN/COMMIT so it’s atomic (where supported)
Review migrations in PRs — schema changes deserve as much scrutiny as code changes

Interview Tip

Migrations come up in system design and backend interviews. The key thing to demonstrate is awareness of the zero-downtime problem — that you can’t just ALTER TABLE on a live database with millions of users. Mention the expand-contract pattern and CREATE INDEX CONCURRENTLY. That shows real production experience.

References

Indexing & Query Optimization

How Indexes Work

intermediate indexes performance clustered non-clustered

An index in a database is like the index at the back of a textbook. Instead of flipping through every single page to find “B-Tree”, we look it up in the index, get the page number, and jump straight there.

Without an index, the database has to scan every row in the table to find what we’re looking for. That’s called a full table scan (or sequential scan). For a table with millions of rows, that’s painfully slow.

With an index, the database maintains a separate data structure (usually a B+ Tree) that maps column values to the physical location of the rows. So instead of checking every row, it narrows down to just the matching ones.

The Trade-off: Reads vs Writes

Here’s the thing — indexes aren’t free. Every time we INSERT, UPDATE, or DELETE a row, the database also has to update every index on that table.

Think of it like maintaining a book index. If we add a new chapter, we have to update the index too. More indexes = more maintenance work.

✓ With Index

SELECT → Fast (index lookup)

INSERT → Slower (update index too)

UPDATE → Slower (update index too)

DELETE → Slower (update index too)

✗ Without Index

SELECT → Slow (full table scan)

INSERT → Faster (no index to update)

UPDATE → Faster (no index to update)

DELETE → Faster (no index to update)

Clustered vs Non-Clustered Indexes

A clustered index determines the physical order of data on disk. Think of it like a phone book — the entries are physically sorted by last name. A table can have only one clustered index because the data can only be physically sorted one way.

In most databases, the primary key automatically creates a clustered index. In PostgreSQL, the table data isn’t automatically kept in order, but we can use CLUSTER to physically reorder it.

A non-clustered index is a separate structure that points back to the actual rows. Think of it like the index at the back of a book — it’s separate from the content, and just tells us where to look.

-- Primary key creates a clustered index automatically (in MySQL InnoDB)
CREATE TABLE users (
    id SERIAL PRIMARY KEY,  -- clustered index on id
    email VARCHAR(255),
    name VARCHAR(100)
);

-- Non-clustered index on email
CREATE INDEX idx_users_email ON users(email);

-- Now this query uses the index instead of scanning every row
SELECT * FROM users WHERE email = 'manish@example.com';

When to Create Indexes

Good candidates for indexing:

Columns used in WHERE clauses frequently
Columns used in JOIN conditions
Columns used in ORDER BY or GROUP BY
Foreign key columns (these often don’t get indexed automatically)

When NOT to create indexes:

Small tables (sequential scan is already fast)
Columns with very low cardinality (e.g., a gender column with only 2-3 values)
Tables that are write-heavy with few reads
Columns that are rarely queried

-- Good: frequently searched column with high cardinality
CREATE INDEX idx_orders_customer_id ON orders(customer_id);

-- Good: composite index for a common query pattern
CREATE INDEX idx_orders_status_date ON orders(status, created_at);

-- Bad: low cardinality column
-- CREATE INDEX idx_users_is_active ON users(is_active);  -- only true/false

-- Drop an index we don't need anymore
DROP INDEX idx_orders_status_date;

Checking If an Index Is Being Used

We can use EXPLAIN to see if our query actually uses the index:

-- Check the query plan
EXPLAIN SELECT * FROM users WHERE email = 'manish@example.com';

-- If we see "Index Scan" or "Index Only Scan" — the index is working
-- If we see "Seq Scan" — the database is ignoring our index

In simple language, indexes are like shortcuts. They make finding data much faster, but we pay a small price on every write. The key is to index the columns we actually query, and not go overboard with indexes on every column.

References

B-Tree and B+ Tree

intermediate b-tree b-plus-tree data-structures indexes

When a database creates an index, it doesn’t just make a sorted list. It builds a tree-like structure that allows it to find any value in logarithmic time. The most common structures are B-Trees and B+ Trees.

B-Tree

A B-Tree is a self-balancing tree where each node can have multiple keys and multiple children. Unlike a binary tree (which has at most 2 children per node), a B-Tree node can have hundreds of children.

Why does that matter? Because every level of the tree means one disk read. With hundreds of children per node, we only need 3-4 levels to index millions of rows. Fewer levels = fewer disk reads = faster lookups.

Key properties of a B-Tree:

All leaf nodes are at the same depth (perfectly balanced)
Each node stores multiple keys in sorted order
Each node can have up to m children (where m is the tree’s “order”)
Data pointers can live in both internal nodes AND leaf nodes

B+ Tree

A B+ Tree is a variation of the B-Tree, and it’s what most databases actually use (PostgreSQL, MySQL InnoDB, SQLite, etc.).

The key differences from a B-Tree:

All actual data lives only in leaf nodes — internal nodes just store keys for navigation
Leaf nodes are linked together — they form a linked list, making range scans super efficient

Think of it like a library catalog system. The internal nodes are like signs pointing us to the right aisle. The leaf nodes are the actual shelves where the books (data) sit. And the shelves are linked together so we can walk from one to the next.

B+ Tree Structure

Root (navigation only) [30 | 60]

↙ ↓ ↘

Internal [10 | 20]

Internal [40 | 50]

Internal [70 | 80]

↓ ↓ ↓ ↓ ↓ ↓

5,8 →

10,15 →

20,25 →

30,35 →

40,45 →

...

↑ Leaf nodes contain actual data + pointers to next leaf (linked list)

Why B+ Tree Wins for Databases

Range queries are fast. If we run SELECT * FROM users WHERE age BETWEEN 20 AND 30, the database finds the leaf node for 20, then just follows the linked list pointers until it hits 30. No need to go back up and down the tree.

Internal nodes are smaller. Since internal nodes only store keys (no data), more keys fit in a single node, which means more children per node, which means a shorter tree, which means fewer disk reads.

Consistent performance. Every lookup takes the same number of steps (root → leaf), because all leaf nodes are at the same depth. No surprises.

How a Lookup Works

Let’s say we’re searching for the value 45 in our B+ Tree:

Step 1: Start at root [30 | 60]
        45 > 30 and 45 < 60 → go to middle child

Step 2: Arrive at internal node [40 | 50]
        45 > 40 and 45 < 50 → go to middle child

Step 3: Arrive at leaf node [40, 42, 45, 48]
        Found 45! Return the row pointer.

That’s just 3 disk reads to find one row among potentially millions. Compare that to scanning every single row in a table.

B-Tree vs B+ Tree Summary

Feature	B-Tree	B+ Tree
Data stored in	All nodes	Leaf nodes only
Leaf nodes linked	No	Yes (linked list)
Range queries	Slower (tree traversal)	Fast (follow leaf pointers)
Internal node size	Larger (has data)	Smaller (keys only)
Used by	Some older systems	PostgreSQL, MySQL, SQLite

In simple language, B+ Trees are the backbone of database indexes. They keep data sorted, allow lookups in just a few disk reads, and make range queries efficient by linking leaf nodes together. When someone says “database index”, they almost always mean a B+ Tree under the hood.

References

Types of Indexes

intermediate indexes hash-index composite-index covering-index partial-index

Not all indexes are the same. Different query patterns need different index types. Let’s walk through each one and when it makes sense.

Hash Indexes

A hash index uses a hash function to map values to buckets. It’s incredibly fast for exact equality lookups (=), but completely useless for range queries (>, <, BETWEEN).

Think of it like a dictionary where we hash the word to find its page. Great for “find this exact word”, terrible for “find all words between A and D”.

-- PostgreSQL supports hash indexes
CREATE INDEX idx_users_email_hash ON users USING hash (email);

-- This uses the hash index (equality)
SELECT * FROM users WHERE email = 'manish@example.com';

-- This CANNOT use the hash index (range query)
SELECT * FROM users WHERE email > 'a' AND email < 'm';

In practice, B-Tree indexes handle equality lookups almost as fast AND support range queries too. So hash indexes are rarely used in PostgreSQL. But they’re the default in Redis and other key-value stores.

Composite (Compound) Indexes

A composite index is an index on multiple columns. The column order matters a lot because of the leftmost prefix rule.

The leftmost prefix rule says: a composite index on (a, b, c) can be used for queries filtering on:

a alone
a and b together
a, b, and c together

But NOT for:

b alone
c alone
b and c together

Leftmost Prefix Rule: INDEX(a, b, c)

WHERE a = 1

✓ Uses

WHERE a = 1 AND b = 2

✓ Uses

WHERE a = 1 AND b = 2 AND c = 3

✓ Uses

WHERE b = 2

✗ Skips

WHERE c = 3

✗ Skips

WHERE b = 2 AND c = 3

✗ Skips

-- Composite index on status and created_at
CREATE INDEX idx_orders_status_date ON orders(status, created_at);

-- Uses the index (leftmost prefix)
SELECT * FROM orders WHERE status = 'shipped';
SELECT * FROM orders WHERE status = 'shipped' AND created_at > '2024-01-01';

-- Does NOT use the index (skips leftmost column)
SELECT * FROM orders WHERE created_at > '2024-01-01';

Pro tip: Put the most selective column (the one that filters out the most rows) first.

Covering Indexes

A covering index includes all the columns that a query needs. The database can answer the query entirely from the index without going to the actual table. This is called an index-only scan and it’s the fastest type of query.

-- If we frequently run this query:
SELECT email, name FROM users WHERE email = 'manish@example.com';

-- We can create a covering index using INCLUDE
CREATE INDEX idx_users_email_cover ON users(email) INCLUDE (name);

-- Now the database doesn't need to read the table at all
-- Everything it needs is in the index

Partial Indexes

A partial index only indexes a subset of rows based on a condition. Smaller index = faster lookups + less storage.

-- Only index active users (skip the millions of deactivated accounts)
CREATE INDEX idx_users_active ON users(email) WHERE is_active = true;

-- Only index unprocessed orders
CREATE INDEX idx_orders_pending ON orders(created_at) WHERE status = 'pending';

This is perfect when we have a large table but only query a small subset of rows frequently.

Unique Indexes

A unique index ensures no duplicate values exist in the indexed column(s). Primary keys automatically create a unique index.

-- Ensure no duplicate emails
CREATE UNIQUE INDEX idx_users_unique_email ON users(email);

-- Composite unique index (combination must be unique)
CREATE UNIQUE INDEX idx_enrollment ON enrollments(student_id, course_id);

-- Attempting a duplicate insert will fail
INSERT INTO users (email) VALUES ('manish@example.com');  -- OK
INSERT INTO users (email) VALUES ('manish@example.com');  -- ERROR: duplicate

Full-Text Indexes (GIN / GiST)

Regular B-Tree indexes can’t search inside text efficiently. For full-text search, we need specialized indexes.

GIN (Generalized Inverted Index) — best for full-text search, JSONB, and array columns. It builds an inverted index (word → list of documents containing it).

GiST (Generalized Search Tree) — best for geometric data, range types, and fuzzy text search.

-- PostgreSQL: GIN index for full-text search
CREATE INDEX idx_articles_search ON articles
  USING gin(to_tsvector('english', title || ' ' || body));

-- PostgreSQL: GIN index on a JSONB column
CREATE INDEX idx_products_metadata ON products USING gin(metadata);

-- MySQL: FULLTEXT index
CREATE FULLTEXT INDEX idx_articles_ft ON articles(title, body);

Quick Reference

Index Type	Best For	Limitations
B-Tree	General purpose, range queries	Default, no special limitations
Hash	Exact equality only	No range queries, no sorting
Composite	Multi-column filters	Column order matters (leftmost prefix)
Covering	Avoiding table lookups	Larger index size
Partial	Querying subsets of data	Only helps queries matching the condition
Unique	Enforcing uniqueness	Slightly slower writes
GIN	Full-text, JSONB, arrays	Slower writes, larger index

In simple language, choosing the right index type for our query pattern can make a massive difference. Don’t just slap a B-Tree on everything — think about what queries we’re running and pick the right tool.

References

EXPLAIN and Query Plans

intermediate explain query-plan performance optimization

Writing a query is one thing. Knowing how the database actually executes it is another. That’s where EXPLAIN comes in — it shows us the database’s game plan for running our query.

Think of it like asking Google Maps for directions. We give it the destination (the query), and it shows us the route it’s going to take (the execution plan). If the route looks bad, we can reroute.

EXPLAIN vs EXPLAIN ANALYZE

EXPLAIN shows the estimated plan without actually running the query. It’s safe and fast.

EXPLAIN ANALYZE actually runs the query and shows the real execution time. Use this when we need accurate numbers, but be careful with INSERT/UPDATE/DELETE — it will actually execute them.

-- Just show the plan (doesn't execute)
EXPLAIN SELECT * FROM users WHERE email = 'manish@example.com';

-- Show the plan AND actually run it (shows real timing)
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'manish@example.com';

-- Wrap destructive queries in a transaction to avoid side effects
BEGIN;
EXPLAIN ANALYZE DELETE FROM users WHERE created_at < '2020-01-01';
ROLLBACK;  -- undo the delete

Reading a Query Plan

Here’s what a typical PostgreSQL query plan looks like:

EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'manish@example.com';

-- Output:
-- Index Scan using idx_users_email on users  (cost=0.42..8.44 rows=1 width=72)
--   (actual time=0.023..0.024 rows=1 loops=1)
--   Index Cond: (email = 'manish@example.com')
-- Planning Time: 0.087 ms
-- Execution Time: 0.043 ms

Let’s break that down:

Index Scan — it’s using an index (good!)
cost=0.42..8.44 — estimated startup cost and total cost (in arbitrary units)
rows=1 — estimated number of rows returned
actual time=0.023..0.024 — real time in milliseconds
loops=1 — how many times this step ran

Common Scan Types

Seq Scan

Reads every row in the table. Slow on large tables. Usually means we need an index.

Index Scan

Uses the index to find rows, then fetches the actual data from the table. Fast.

Index Only Scan

Everything needed is in the index. Doesn't touch the table at all. Fastest.

Bitmap Scan

Builds a bitmap of matching pages from the index, then reads those pages. Good for medium-selectivity queries.

Seq Scan isn’t always bad. If we’re reading a large portion of the table (say 50%+ of rows), a sequential scan can actually be faster than an index scan because sequential I/O is faster than random I/O.

Common Join Types

When queries involve multiple tables, we’ll see different join strategies:

Nested Loop    — For each row in table A, scan table B
                 Fast when inner table is small or has an index

Hash Join      — Build a hash table from one table, probe with the other
                 Good for large tables with equality joins

Merge Join     — Sort both tables, then merge them
                 Great when both inputs are already sorted

Spotting Slow Queries

Here are the red flags to look for:

1. Sequential scan on a large table

-- Bad: scanning 10 million rows
-- Seq Scan on orders (cost=0.00..1850000.00 rows=10000000)
--   Filter: (status = 'pending')

-- Fix: add an index
CREATE INDEX idx_orders_status ON orders(status);

2. Nested loop with no index on the inner table

-- Bad: for each user, scanning all orders
-- Nested Loop (cost=0.00..25000000.00)
--   -> Seq Scan on users
--   -> Seq Scan on orders  <-- no index!

-- Fix: add an index on the join column
CREATE INDEX idx_orders_user_id ON orders(user_id);

3. Large row estimate mismatch

-- Plan says rows=1, actual rows=500000
-- This means the statistics are stale
ANALYZE users;  -- update table statistics

MySQL EXPLAIN

MySQL’s EXPLAIN output looks a bit different but tells us the same story:

EXPLAIN SELECT * FROM users WHERE email = 'manish@example.com';

-- +----+------+---------------+------+---------+------+------+-------+
-- | id | type | possible_keys | key  | key_len | ref  | rows | Extra |
-- +----+------+---------------+------+---------+------+------+-------+
-- |  1 | ref  | idx_email     | idx_email | 767 | const | 1   |       |
-- +----+------+---------------+------+---------+------+------+-------+

Key columns to check:

type: ALL (full table scan) is bad, ref/eq_ref/const are good
key: which index is being used (NULL = no index)
rows: estimated rows to examine
Extra: Using index (covering index), Using filesort (needs sorting), Using temporary (temp table)

Practical Workflow

-- 1. Write the query
SELECT u.name, COUNT(o.id) as order_count
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.created_at > '2024-01-01'
GROUP BY u.name
ORDER BY order_count DESC;

-- 2. Check the plan
EXPLAIN ANALYZE <the query above>;

-- 3. If we see Seq Scans, add indexes
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);

-- 4. Check the plan again — it should now show Index Scan

In simple language, EXPLAIN is our X-ray for queries. Before optimizing blindly, always check the query plan first. It tells us exactly where the database is spending time and what we need to fix.

References

Query Optimization

advanced optimization performance n+1 pagination query-rewriting

Writing correct SQL is step one. Writing fast SQL is step two. Let’s go through the most common optimization techniques that come up in interviews and real-world projects.

1. Avoid SELECT *

SELECT * fetches every column from the table, even the ones we don’t need. This wastes I/O, memory, and network bandwidth.

-- Bad: fetches all 20 columns including a large TEXT bio column
SELECT * FROM users WHERE id = 42;

-- Good: only fetch what we need
SELECT id, name, email FROM users WHERE id = 42;

This also prevents us from using covering indexes. If our index covers (id, name, email), the second query can use an index-only scan. The first query can’t.

2. The N+1 Query Problem

This is one of the most common performance killers, especially in ORMs. It happens when we run 1 query to fetch a list, then N more queries to fetch related data for each item.

-- The N+1 problem:
-- Query 1: Get all orders
SELECT * FROM orders;  -- returns 100 orders

-- Then for EACH order, fetch the customer (100 more queries!)
SELECT * FROM customers WHERE id = 1;
SELECT * FROM customers WHERE id = 2;
SELECT * FROM customers WHERE id = 3;
-- ... 97 more queries

That’s 101 queries when we could do it in 1 or 2:

-- Fix with a JOIN (1 query)
SELECT o.*, c.name as customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

-- Or fix with IN (2 queries)
SELECT * FROM orders;  -- get all orders
SELECT * FROM customers WHERE id IN (1, 2, 3, ...);  -- get all needed customers at once

N+1 Problem

SELECT * FROM orders;
-- for each order:
SELECT * FROM customers WHERE id = ?;
SELECT * FROM customers WHERE id = ?;
SELECT * FROM customers WHERE id = ?;
... × 100

101 queries total

JOIN Fix

SELECT o.*, c.name
FROM orders o
JOIN customers c
ON o.customer_id = c.id;

1 query total

3. Pagination: OFFSET vs Cursor-Based

OFFSET pagination is simple but gets slower as we go deeper. The database still has to scan through all the skipped rows.

-- Page 1 (fast)
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 0;

-- Page 500 (slow — database scans 10,000 rows, throws away 9,980)
SELECT * FROM products ORDER BY id LIMIT 20 OFFSET 9980;

Cursor-based pagination (also called keyset pagination) is faster because it uses a WHERE clause to skip rows:

-- Page 1
SELECT * FROM products ORDER BY id LIMIT 20;
-- Last id on this page was 20

-- Next page (fast — uses the index!)
SELECT * FROM products WHERE id > 20 ORDER BY id LIMIT 20;
-- Last id on this page was 40

-- Next page
SELECT * FROM products WHERE id > 40 ORDER BY id LIMIT 20;

Cursor-based pagination is always fast regardless of how deep we go, because we’re using an indexed WHERE clause instead of scanning and skipping rows.

4. Use EXISTS Instead of IN for Large Subqueries

When we have a large subquery, EXISTS can be faster than IN because it stops as soon as it finds a match.

-- Slower: IN evaluates the entire subquery first
SELECT * FROM users
WHERE id IN (SELECT user_id FROM orders WHERE total > 1000);

-- Faster: EXISTS stops at the first match for each user
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.user_id = u.id AND o.total > 1000
);

This difference is most noticeable when the subquery returns a lot of rows.

5. Join Ordering Matters

The database optimizer usually picks the best join order, but sometimes it gets it wrong. As a general rule: filter early, join late. Start with the smallest result set.

-- Let the database filter first, then join
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.country = 'India'          -- filter users first (reduces rows)
  AND o.created_at > '2024-01-01'; -- then filter orders

If the optimizer isn’t choosing the right order, we can sometimes help by restructuring the query or using CTEs:

-- Use a CTE to make our intent clear
WITH indian_users AS (
    SELECT id, name FROM users WHERE country = 'India'
)
SELECT iu.name, o.total
FROM indian_users iu
JOIN orders o ON iu.id = o.user_id
WHERE o.created_at > '2024-01-01';

6. Avoid Functions on Indexed Columns

If we wrap an indexed column in a function, the database can’t use the index.

-- Bad: function on indexed column — can't use index on created_at
SELECT * FROM orders WHERE YEAR(created_at) = 2024;

-- Good: rewrite as a range — uses the index
SELECT * FROM orders
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';

-- Bad: LOWER() prevents index usage
SELECT * FROM users WHERE LOWER(email) = 'manish@example.com';

-- Good: create a functional index instead (PostgreSQL)
CREATE INDEX idx_users_email_lower ON users(LOWER(email));

7. Batch Operations

Instead of inserting or updating rows one by one, batch them:

-- Bad: 1000 individual inserts
INSERT INTO logs (message) VALUES ('event 1');
INSERT INTO logs (message) VALUES ('event 2');
-- ... 998 more

-- Good: batch insert
INSERT INTO logs (message) VALUES
    ('event 1'),
    ('event 2'),
    ('event 3'),
    -- ... up to 1000 in one statement
    ('event 1000');

8. Use EXPLAIN to Verify

After any optimization, always check the query plan:

-- Before optimization
EXPLAIN ANALYZE SELECT * FROM orders WHERE YEAR(created_at) = 2024;
-- Seq Scan (slow)

-- After optimization
EXPLAIN ANALYZE SELECT * FROM orders
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';
-- Index Scan (fast)

In simple language, query optimization is about doing less work. Fetch fewer columns, make fewer round trips, skip fewer rows, and always let the database use its indexes. Check EXPLAIN ANALYZE before and after — the numbers don’t lie.

References

PostgreSQL Performance Tips

Full-Text Search

advanced full-text-search tsvector gin-index elasticsearch

Searching for text inside a database sounds simple until we try to do it at scale. Let’s start with the obvious approach and see why it falls apart.

Why LIKE Doesn’t Scale

The LIKE operator is fine for small tables, but it has serious problems:

-- Prefix match — can use a B-Tree index
SELECT * FROM articles WHERE title LIKE 'postgres%';

-- Contains match — CANNOT use an index, full table scan
SELECT * FROM articles WHERE title LIKE '%postgres%';

-- Case-insensitive — even worse
SELECT * FROM articles WHERE LOWER(title) LIKE '%postgres%';

The moment we put % at the beginning, the database has to scan every single row. On a table with millions of articles, that’s unusable. Plus, LIKE doesn’t understand language — it can’t match “running” when we search for “run”.

PostgreSQL Full-Text Search

PostgreSQL has a built-in full-text search engine. The core concepts are:

tsvector — a processed version of text, broken into lexemes (normalized words)
tsquery — a search query with operators like & (AND), | (OR), ! (NOT)
GIN index — the index type that makes full-text search fast

-- Convert text to tsvector
SELECT to_tsvector('english', 'The quick brown foxes jumped over lazy dogs');
-- Result: 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2

-- Notice: "foxes" → "fox", "jumped" → "jump", "lazy" → "lazi"
-- Stop words like "the", "over" are removed

The text processing normalizes words (stemming), removes stop words (“the”, “a”, “is”), and records the position of each word. This is what makes it smarter than LIKE.

Setting It Up

-- Add a tsvector column (or compute it on the fly)
ALTER TABLE articles ADD COLUMN search_vector tsvector;

-- Populate it from title and body
UPDATE articles SET search_vector =
    to_tsvector('english', title || ' ' || body);

-- Create a GIN index on it
CREATE INDEX idx_articles_search ON articles USING gin(search_vector);

-- Keep it updated with a trigger
CREATE TRIGGER articles_search_update
BEFORE INSERT OR UPDATE ON articles
FOR EACH ROW EXECUTE FUNCTION
    tsvector_update_trigger(search_vector, 'pg_catalog.english', title, body);

Querying

-- Basic search
SELECT title FROM articles
WHERE search_vector @@ to_tsquery('english', 'database & optimization');

-- Search with ranking (most relevant first)
SELECT title, ts_rank(search_vector, query) as rank
FROM articles, to_tsquery('english', 'database & optimization') query
WHERE search_vector @@ query
ORDER BY rank DESC;

-- Phrase search (words must be adjacent)
SELECT title FROM articles
WHERE search_vector @@ phraseto_tsquery('english', 'query optimization');

-- Prefix matching (autocomplete)
SELECT title FROM articles
WHERE search_vector @@ to_tsquery('english', 'post:*');
-- Matches "postgres", "postgresql", "posting", etc.

Full-Text Search Pipeline

Raw Text
"The foxes jumped"

→

Tokenize
["The","foxes","jumped"]

→

Remove Stop Words
["foxes","jumped"]

→

Stem
["fox","jump"]

→

tsvector
'fox':1 'jump':2

MySQL Full-Text Search

MySQL’s approach is different but simpler to set up:

-- Create a FULLTEXT index
CREATE FULLTEXT INDEX idx_articles_ft ON articles(title, body);

-- Natural language mode (default — ranked by relevance)
SELECT title, MATCH(title, body) AGAINST('database optimization') as score
FROM articles
WHERE MATCH(title, body) AGAINST('database optimization');

-- Boolean mode (AND, OR, NOT operators)
SELECT title FROM articles
WHERE MATCH(title, body) AGAINST('+database -mysql +optimization' IN BOOLEAN MODE);
-- + means must include, - means must exclude

-- With query expansion (finds related terms)
SELECT title FROM articles
WHERE MATCH(title, body) AGAINST('database' WITH QUERY EXPANSION);

When to Use a Dedicated Search Engine

Built-in full-text search works great for many cases. But we should reach for a dedicated search engine like Elasticsearch or Meilisearch when:

We need typo tolerance (“databse” should match “database”)
We need faceted search (filter by category, price range, etc. with counts)
We need instant search / autocomplete with sub-10ms latency
Our search spans multiple tables or even multiple databases
We need custom ranking algorithms (boost newer content, popular items, etc.)
We’re dealing with very high search volume (thousands of queries per second)

For a blog or documentation site? PostgreSQL’s full-text search is more than enough. For an e-commerce product search or a search-heavy app? Consider Elasticsearch or Meilisearch.

Quick Comparison

Feature	LIKE	PostgreSQL FTS	MySQL FULLTEXT	Elasticsearch
Stemming	No	Yes	Yes	Yes
Ranking	No	Yes	Yes	Yes (advanced)
Typo tolerance	No	No	No	Yes
Faceted search	No	No	No	Yes
Index support	Prefix only	GIN	FULLTEXT	Inverted index
Setup complexity	None	Low	Low	High

In simple language, if we need to search text in a database, step up from LIKE to the database’s built-in full-text search. It understands language, it’s indexable, and it ranks results. Only bring in Elasticsearch when we need features the database can’t provide.

References

Transactions & Concurrency

Transactions Deep Dive

intermediate transactions acid commit rollback savepoint

A transaction is a group of SQL operations that are treated as a single unit of work. Either ALL of them succeed, or NONE of them do. There’s no in-between.

Think of it like a bank transfer. If we’re moving $500 from account A to account B, we need two operations: subtract from A and add to B. If the system crashes after subtracting but before adding, the money just vanishes. Transactions prevent this by wrapping both operations together.

The Basics: BEGIN, COMMIT, ROLLBACK

-- Start a transaction
BEGIN;

-- Deduct from sender
UPDATE accounts SET balance = balance - 500 WHERE id = 1;

-- Add to receiver
UPDATE accounts SET balance = balance + 500 WHERE id = 2;

-- Everything looks good? Make it permanent
COMMIT;

If something goes wrong at any point, we can undo everything:

BEGIN;

UPDATE accounts SET balance = balance - 500 WHERE id = 1;

-- Oops, the receiver account doesn't exist!
-- Undo everything
ROLLBACK;
-- The sender's balance is restored to what it was before

Auto-Commit Mode

By default, most databases run in auto-commit mode. This means every single SQL statement is automatically wrapped in its own transaction and committed immediately.

-- These two statements are independent transactions
UPDATE accounts SET balance = balance - 500 WHERE id = 1;
-- ^ auto-committed immediately

UPDATE accounts SET balance = balance + 500 WHERE id = 2;
-- ^ auto-committed immediately

-- If the second one fails, the first is already committed!
-- That's why we need explicit transactions for multi-step operations.

Savepoints

Sometimes we want to undo part of a transaction without rolling back the whole thing. That’s what savepoints are for.

Think of it like saving a game. If we mess up, we can reload from the last save point instead of starting over.

BEGIN;

INSERT INTO orders (user_id, total) VALUES (1, 100.00);

SAVEPOINT before_items;

INSERT INTO order_items (order_id, product_id, qty) VALUES (1, 42, 2);
-- Oops, product 42 is out of stock

ROLLBACK TO before_items;
-- The order is still there, but the item insert is undone

-- Try a different product
INSERT INTO order_items (order_id, product_id, qty) VALUES (1, 43, 2);

COMMIT;  -- commits the order + the second item

Transaction Lifecycle

BEGIN

→

SQL operations

→

SAVEPOINT (optional)

→

COMMIT ✓

ROLLBACK ✗

Common Mistakes

1. Forgetting to COMMIT

If we start a transaction and forget to commit, the locks are held until we either commit, rollback, or the connection times out. Other queries waiting for those rows will be blocked.

BEGIN;
UPDATE products SET price = 29.99 WHERE id = 1;
-- Developer walks away for lunch...
-- This row is now locked for everyone else!

-- Always commit or rollback when done
COMMIT;

2. Long-Running Transactions

Keeping a transaction open for a long time is a recipe for trouble:

Locks are held the entire time, blocking other users
In PostgreSQL, dead tuples can’t be cleaned up by VACUUM
Memory usage increases because the database keeps undo information

-- Bad: doing slow processing inside a transaction
BEGIN;
SELECT * FROM orders WHERE status = 'pending';
-- ... application processes orders for 5 minutes ...
UPDATE orders SET status = 'processed' WHERE id IN (...);
COMMIT;

-- Better: read first, process outside the transaction, then update
SELECT * FROM orders WHERE status = 'pending';
-- ... application processes orders ...
BEGIN;
UPDATE orders SET status = 'processed' WHERE id IN (...);
COMMIT;  -- transaction lasts milliseconds, not minutes

3. Nested Transactions

Most databases don’t support true nested transactions. In PostgreSQL, if we call BEGIN inside an existing transaction, it just ignores it (with a warning). Use savepoints instead:

BEGIN;

-- This does NOT create a nested transaction
-- BEGIN;  -- WARNING: there is already a transaction in progress

-- Use savepoints for nested behavior
SAVEPOINT sp1;
-- do stuff
RELEASE SAVEPOINT sp1;  -- like a mini-commit

COMMIT;

Transaction in Application Code

Here’s how transactions typically look in application code:

-- PostgreSQL with a connection pool (pseudocode)
-- const client = await pool.connect();
-- try {
--     await client.query('BEGIN');
--     await client.query('UPDATE accounts SET balance = balance - $1 WHERE id = $2', [500, 1]);
--     await client.query('UPDATE accounts SET balance = balance + $1 WHERE id = $2', [500, 2]);
--     await client.query('COMMIT');
-- } catch (e) {
--     await client.query('ROLLBACK');
--     throw e;
-- } finally {
--     client.release();
-- }

In simple language, transactions are our safety net for multi-step operations. Wrap related operations in BEGIN/COMMIT, use savepoints for partial rollbacks, and keep transactions as short as possible. A forgotten COMMIT can bring an entire application to its knees.

References

PostgreSQL Transactions

Isolation Levels

intermediate isolation transactions dirty-reads phantom-reads serializable

When multiple transactions run at the same time, they can interfere with each other. The isolation level controls how much one transaction can see of another transaction’s uncommitted or freshly committed changes.

Think of it like working in an office. Full isolation means everyone works in their own soundproof room. No isolation means everyone’s shouting over each other. SQL gives us four levels in between.

The Three Anomalies

Before we look at the levels, let’s understand what can go wrong:

Dirty Read

Transaction A reads data that Transaction B has written but NOT yet committed. If B rolls back, A just read data that never existed.

Non-Repeatable Read

Transaction A reads a row, Transaction B updates that row and commits, then A reads the same row again and gets a different value. Same query, different result.

Phantom Read

Transaction A reads a set of rows matching a condition, Transaction B inserts a new row that matches that condition and commits, then A runs the same query and gets an extra row it didn’t see before.

The Four Isolation Levels

Isolation Level	Dirty Read	Non-Repeatable Read	Phantom Read
Read Uncommitted	Possible	Possible	Possible
Read Committed	Prevented	Possible	Possible
Repeatable Read	Prevented	Prevented	Possible
Serializable	Prevented	Prevented	Prevented

Read Uncommitted

The loosest level. We can see other transactions’ uncommitted changes. Almost nobody uses this because dirty reads are dangerous.

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Transaction A:
BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- sees $1000

-- Transaction B (not committed yet):
BEGIN;
UPDATE accounts SET balance = 500 WHERE id = 1;

-- Transaction A reads again:
SELECT balance FROM accounts WHERE id = 1;  -- sees $500 (dirty read!)

-- Transaction B rolls back:
ROLLBACK;

-- Transaction A just made a decision based on data that never existed.

Note: PostgreSQL doesn’t actually support Read Uncommitted. Even if we set it, PostgreSQL uses Read Committed instead. It’s that dangerous.

Read Committed (PostgreSQL default)

Each statement sees only data that was committed before it started. No dirty reads. But if we run the same SELECT twice, we might get different results if another transaction committed in between.

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Transaction A:
BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- sees $1000

-- Transaction B commits:
UPDATE accounts SET balance = 500 WHERE id = 1;
COMMIT;

-- Transaction A reads again:
SELECT balance FROM accounts WHERE id = 1;  -- sees $500 (non-repeatable read)
COMMIT;

This is the default in PostgreSQL and is fine for most applications.

Repeatable Read (MySQL InnoDB default)

Once we read a row, we’ll get the same value for the rest of our transaction, even if another transaction commits changes to it.

SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Transaction A:
BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- sees $1000

-- Transaction B commits:
UPDATE accounts SET balance = 500 WHERE id = 1;
COMMIT;

-- Transaction A reads again:
SELECT balance FROM accounts WHERE id = 1;  -- still sees $1000!
COMMIT;

But phantom reads are still possible — new rows matching our WHERE condition can appear.

Note: In PostgreSQL, Repeatable Read actually also prevents phantom reads (it uses snapshot isolation under the hood). In MySQL InnoDB, phantom reads are prevented by gap locks.

Serializable

The strictest level. Transactions behave as if they ran one after another, not concurrently. No anomalies of any kind. But it’s the slowest because the database has to detect and prevent conflicts.

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- If two transactions conflict, one will be rolled back
-- with a serialization failure error.
-- Our application code needs to catch this and retry.

Setting Isolation Levels

-- PostgreSQL: set for the current transaction
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- ... queries ...
COMMIT;

-- PostgreSQL: set for the whole session
SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- MySQL: set for the next transaction
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- MySQL: set for the whole session
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- Check current level (PostgreSQL)
SHOW transaction_isolation;

-- Check current level (MySQL)
SELECT @@transaction_isolation;

Which Level Should We Use?

Read Committed — good default for most web applications. Each statement sees the latest committed data.
Repeatable Read — good when we need a consistent snapshot within a transaction (reports, balance calculations).
Serializable — when correctness matters more than performance (financial systems, inventory management). Must handle serialization failures in application code.
Read Uncommitted — almost never. Just pretend it doesn’t exist.

In simple language, isolation levels are a trade-off between correctness and performance. Higher isolation = fewer surprises but more overhead. For most web apps, Read Committed is perfectly fine. Step up to Repeatable Read or Serializable only when we need the guarantees.

References

Locking Mechanisms

advanced locks shared-lock exclusive-lock for-update concurrency

Locks are how databases prevent two transactions from stepping on each other’s toes. When Transaction A is modifying a row, locks ensure Transaction B either waits or works with a consistent version.

Think of it like a bathroom door lock. When someone’s inside, the lock prevents anyone else from barging in. Database locks work the same way — they control who can read or write data at the same time.

Shared Locks vs Exclusive Locks

Shared lock (read lock): Multiple transactions can hold a shared lock on the same row simultaneously. Used for reading. Think of it as “I’m reading this — feel free to read it too, but nobody change it.”

Exclusive lock (write lock): Only one transaction can hold an exclusive lock. No other transaction can read or write the locked row. Think of it as “I’m changing this — everyone else stay away.”

Lock Compatibility

	Shared	Exclusive
Shared	Compatible	Conflict
Exclusive	Conflict	Conflict

Multiple shared locks can coexist. Exclusive locks block everything.

Row-Level vs Table-Level Locks

Row-level locks lock individual rows. This is the default for most operations and gives the best concurrency — other transactions can still access different rows in the same table.

Table-level locks lock the entire table. This is needed for operations like ALTER TABLE, TRUNCATE, or explicit LOCK TABLE commands.

-- Row-level lock: only row id=1 is locked
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Other transactions can still update rows 2, 3, 4, etc.
COMMIT;

-- Table-level lock (PostgreSQL)
BEGIN;
LOCK TABLE accounts IN ACCESS EXCLUSIVE MODE;
-- Nobody can read or write this table until we commit
COMMIT;

When we want to read a row and then update it later in the same transaction, a plain SELECT doesn’t lock the row. Another transaction could modify it between our SELECT and our UPDATE.

-- Problem: no lock between read and write
BEGIN;
SELECT balance FROM accounts WHERE id = 1;  -- reads $1000
-- Another transaction could change this right now!
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;

-- Solution: lock the row when we read it
BEGIN;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
-- Row is now exclusively locked — nobody else can touch it
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;

FOR SHARE is the shared-lock version — it prevents other transactions from updating or deleting the row, but allows them to also SELECT ... FOR SHARE:

-- Lock for reading (shared) — prevents writes, allows other reads
SELECT * FROM products WHERE id = 42 FOR SHARE;

-- Lock for writing (exclusive) — prevents everything
SELECT * FROM products WHERE id = 42 FOR UPDATE;

-- Don't wait if the row is locked — return an error immediately
SELECT * FROM products WHERE id = 42 FOR UPDATE NOWAIT;

-- Skip locked rows instead of waiting
SELECT * FROM products WHERE id = 42 FOR UPDATE SKIP LOCKED;

SKIP LOCKED is incredibly useful for job queues — multiple workers can each grab the next unlocked row without waiting.

Intent Locks

Intent locks are a signaling mechanism used by some databases (like MySQL InnoDB). Before acquiring a row-level lock, the transaction places an intent lock on the table.

This tells other transactions: “Hey, I’m about to lock some rows in this table.” It prevents conflicts between row-level and table-level operations without having to check every individual row.

Intent Shared (IS): “I intend to read-lock some rows”
Intent Exclusive (IX): “I intend to write-lock some rows”

We don’t set these manually — the database handles them automatically.

Advisory Locks

Advisory locks are application-level locks that the database manages for us. The database doesn’t enforce them on any particular table or row — they’re just named locks we can use for coordination.

-- PostgreSQL advisory locks

-- Acquire a lock (blocks if already held by another session)
SELECT pg_advisory_lock(12345);

-- Do our critical work
-- ... only one session can hold this lock at a time ...

-- Release the lock
SELECT pg_advisory_unlock(12345);

-- Try to acquire without blocking (returns true/false)
SELECT pg_try_advisory_lock(12345);

Use cases for advisory locks:

Preventing duplicate cron jobs from running
Ensuring only one instance processes a specific resource
Application-level mutexes

Lock Escalation

Some databases (notably SQL Server) will escalate row-level locks to table-level locks if a transaction holds too many row locks. This is a performance optimization — tracking thousands of individual row locks is expensive.

PostgreSQL does NOT do lock escalation. Each row lock is maintained individually, no matter how many there are.

Performance Impact

Locks mean contention. Contention means waiting. Waiting means slower queries.

Tips to minimize lock contention:

Keep transactions short — acquire locks late, release early
Access rows in a consistent order across all transactions (prevents deadlocks)
Use the least restrictive lock type that works
Avoid locking rows we don’t need to modify
Consider optimistic locking for low-contention scenarios

-- Bad: locks rows for a long time
BEGIN;
SELECT * FROM orders WHERE status = 'pending' FOR UPDATE;
-- ... slow application processing ...
UPDATE orders SET status = 'done' WHERE id IN (...);
COMMIT;

-- Better: process first, lock briefly
-- Application fetches and processes orders outside a transaction
-- Then:
BEGIN;
UPDATE orders SET status = 'done' WHERE id IN (1, 2, 3);
COMMIT;  -- lock held for milliseconds

In simple language, locks are the traffic lights of a database. They prevent collisions but can cause traffic jams if we’re not careful. Use FOR UPDATE when we need to read-then-write, keep transactions short, and always think about what other transactions might be waiting on us.

References

Deadlocks

advanced deadlock concurrency locks wait-for-graph

A deadlock happens when two (or more) transactions are each waiting for a lock that the other one holds. Neither can proceed. They’re stuck forever — unless the database intervenes.

Think of it like two cars meeting on a narrow one-lane bridge from opposite sides. Neither can go forward, and neither wants to reverse. Someone has to back up.

The Classic Deadlock Example

-- Transaction A:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- locks row 1
-- Now waiting to lock row 2...
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- BLOCKED!

-- Transaction B (at the same time):
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 2;   -- locks row 2
-- Now waiting to lock row 1...
UPDATE accounts SET balance = balance + 50 WHERE id = 1;   -- BLOCKED!

Transaction A holds the lock on row 1 and wants row 2. Transaction B holds the lock on row 2 and wants row 1. Neither can finish. Deadlock.

Deadlock Cycle

Txn A

Txn B

Row 1

Held by A Wanted by B

Row 2

Held by B Wanted by A

DEADLOCK — circular wait!

How Databases Detect Deadlocks

Databases use a wait-for graph. Every time a transaction waits for a lock, the database adds an edge to this graph (Transaction A → waits for → Transaction B). If the graph has a cycle, there’s a deadlock.

Most databases check for deadlocks either:

Periodically (MySQL checks every few seconds)
Immediately when a wait begins (PostgreSQL checks instantly)

What Happens When a Deadlock Is Detected

The database picks one transaction as the victim and rolls it back. The other transaction can then proceed.

How does it pick the victim?

PostgreSQL rolls back the transaction that caused the deadlock detection
MySQL InnoDB rolls back the transaction that has done the least amount of work (fewest rows modified)

-- When our transaction is the victim, we get an error:
-- PostgreSQL: ERROR: deadlock detected
-- MySQL: ERROR 1213: Deadlock found when trying to get lock

-- Our application MUST catch this and retry
-- try {
--     await executeTransfer();
-- } catch (error) {
--     if (error.code === '40P01') {  // PostgreSQL deadlock code
--         await executeTransfer();   // retry
--     }
-- }

Prevention Strategies

1. Always Access Rows in the Same Order

The simplest and most effective strategy. If every transaction locks rows in the same order (e.g., always by ascending ID), circular waits can’t happen.

-- Bad: Transaction A locks 1 then 2, Transaction B locks 2 then 1
-- Transaction A:
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

-- Transaction B:
UPDATE accounts SET balance = balance - 50 WHERE id = 2;  -- opposite order!
UPDATE accounts SET balance = balance + 50 WHERE id = 1;

-- Good: both transactions lock in the same order (ascending ID)
-- Transaction A:
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

-- Transaction B:
UPDATE accounts SET balance = balance + 50 WHERE id = 1;   -- same order!
UPDATE accounts SET balance = balance - 50 WHERE id = 2;

2. Keep Transactions Short

The shorter the transaction, the less time locks are held, the less chance of deadlocks. Get in, do the work, get out.

-- Bad: long transaction holding locks
BEGIN;
SELECT * FROM products WHERE id = 1 FOR UPDATE;
-- ... 2 seconds of application processing ...
UPDATE products SET stock = stock - 1 WHERE id = 1;
COMMIT;

-- Better: do processing outside the transaction
-- ... process in application code first ...
BEGIN;
UPDATE products SET stock = stock - 1 WHERE id = 1 AND stock > 0;
COMMIT;

3. Use Lock Timeouts

Set a maximum time to wait for a lock. If the lock isn’t available within that time, the statement fails instead of waiting forever.

-- PostgreSQL: timeout after 5 seconds
SET lock_timeout = '5s';

-- MySQL: timeout after 5 seconds
SET innodb_lock_wait_timeout = 5;

4. Use NOWAIT or SKIP LOCKED

-- Don't wait at all — fail immediately if locked
SELECT * FROM products WHERE id = 1 FOR UPDATE NOWAIT;

-- Skip rows that are locked (great for job queues)
SELECT * FROM jobs WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;

5. Reduce Lock Scope

Lock only what we need. Don’t use FOR UPDATE on rows we’re only reading.

-- Bad: locking rows we don't need to update
SELECT * FROM orders
JOIN customers ON orders.customer_id = customers.id
WHERE orders.status = 'pending'
FOR UPDATE;  -- locks both orders AND customers rows!

-- Better: only lock the rows we're updating
SELECT * FROM orders
JOIN customers ON orders.customer_id = customers.id
WHERE orders.status = 'pending'
FOR UPDATE OF orders;  -- only locks orders rows

Monitoring Deadlocks

-- PostgreSQL: check for current locks
SELECT pid, relation::regclass, mode, granted
FROM pg_locks
WHERE NOT granted;

-- PostgreSQL: check deadlock count (since server start)
SELECT deadlocks FROM pg_stat_database WHERE datname = current_database();

-- MySQL: show current lock waits
SHOW ENGINE INNODB STATUS;  -- look for "LATEST DETECTED DEADLOCK" section

In simple language, deadlocks are an inevitable side effect of concurrent access. We can’t eliminate them entirely, but we can minimize them by locking in a consistent order, keeping transactions short, and always having retry logic in our application code. The database will detect and resolve deadlocks — our job is to handle the error gracefully.

References

MVCC (Multi-Version Concurrency Control)

advanced mvcc concurrency vacuum snapshot-isolation undo-log

MVCC stands for Multi-Version Concurrency Control. It’s the reason modern databases are fast — readers don’t block writers, and writers don’t block readers.

Without MVCC, if we’re reading a row while someone else is updating it, one of us has to wait. With MVCC, the database keeps multiple versions of each row, so both operations can happen at the same time.

Think of it like Google Docs version history. While someone is editing the document, we can still read the previous version. Nobody waits.

How MVCC Works (High Level)

Instead of modifying data in place, the database creates a new version of the row. Old versions stick around so that in-flight transactions can still see the data as it was when they started.

Each transaction gets a snapshot — a frozen-in-time view of the database. Changes made by other transactions after our snapshot was taken are invisible to us.

MVCC: Multiple Versions of the Same Row

Version 1

name = "Manish" (deleted by Txn 102)

Version 2

name = "Manish P" (deleted by Txn 105)

Version 3

name = "Manish Prajapati" (current)

Txn 100 sees Version 1 | Txn 104 sees Version 2 | New Txns see Version 3

PostgreSQL’s MVCC Implementation

PostgreSQL stores multiple versions of each row directly in the table (called the heap). Each row version (tuple) has two hidden columns:

xmin — the transaction ID that created this version
xmax — the transaction ID that deleted/updated this version (0 if still alive)

When we UPDATE a row, PostgreSQL doesn’t modify the existing tuple. Instead, it:

Marks the old tuple as dead (sets its xmax)
Inserts a brand new tuple with the updated values (new xmin)

-- Let's say our transaction ID is 100
INSERT INTO users (name) VALUES ('Manish');
-- Creates tuple: xmin=100, xmax=0, name="Manish"

-- Transaction 105 updates the row
UPDATE users SET name = 'Manish P' WHERE name = 'Manish';
-- Old tuple: xmin=100, xmax=105 (marked dead)
-- New tuple: xmin=105, xmax=0, name="Manish P"

When a transaction reads data, it checks the xmin and xmax of each tuple against its own snapshot to decide which version to see.

MySQL/InnoDB’s MVCC Implementation

MySQL takes a different approach. Instead of keeping old versions in the table itself, it uses an undo log.

When a row is updated:

The current version is modified in place
The old version is saved in the undo log
A pointer from the row links to the previous version in the undo log

When a transaction needs to see an older version, it follows these pointers back through the undo log until it finds the version that was current at its snapshot time.

MySQL also uses read views — a snapshot of which transactions are active. This lets each transaction know which row versions are visible to it.

Snapshot Isolation

Snapshot isolation is the natural result of MVCC. Each transaction sees a consistent snapshot of the database as of the moment it started (or as of each statement, depending on the isolation level).

-- Read Committed: snapshot per STATEMENT
BEGIN;
SELECT count(*) FROM orders;  -- snapshot taken NOW, sees 100
-- Another transaction inserts 5 orders and commits
SELECT count(*) FROM orders;  -- NEW snapshot, sees 105
COMMIT;

-- Repeatable Read: snapshot per TRANSACTION
BEGIN;
SELECT count(*) FROM orders;  -- snapshot taken NOW, sees 100
-- Another transaction inserts 5 orders and commits
SELECT count(*) FROM orders;  -- SAME snapshot, still sees 100
COMMIT;

VACUUM in PostgreSQL

Remember those old tuple versions? They pile up. PostgreSQL needs a way to clean them up once no running transaction can see them anymore. That’s what VACUUM does.

VACUUM scans tables and marks dead tuples as reusable space. Without VACUUM, tables grow endlessly (called “table bloat”).

-- Manual vacuum (rarely needed — autovacuum handles this)
VACUUM users;

-- Vacuum with analysis (updates statistics too)
VACUUM ANALYZE users;

-- Full vacuum (rewrites the entire table, requires exclusive lock)
-- Only use this for severe bloat
VACUUM FULL users;

-- Check autovacuum status
SELECT relname, last_vacuum, last_autovacuum, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

Autovacuum runs automatically in the background. In most cases, we should just let it do its thing. But long-running transactions can prevent VACUUM from cleaning up tuples (because those transactions might still need the old versions), which is another reason to keep transactions short.

MySQL’s Purge Thread

MySQL’s equivalent of VACUUM is the purge thread. It runs in the background and removes old versions from the undo log once no active transaction needs them. This is automatic and requires no manual intervention.

Why MVCC Matters

Before MVCC, databases used strict two-phase locking (2PL) — readers blocked writers and writers blocked readers. This was correct but slow.

MVCC changed the game:

Reads never wait for writes — readers just see an older version
Writes rarely wait for reads — writers create new versions without blocking readers
Only write-write conflicts need locks — two transactions updating the same row still need coordination

This is why PostgreSQL and MySQL can handle thousands of concurrent connections efficiently. Without MVCC, most web applications would be painfully slow.

In simple language, MVCC keeps the database fast by maintaining multiple versions of data. Instead of making everyone wait in line, each transaction gets its own consistent snapshot. The trade-off is extra storage for old versions and the need for cleanup (VACUUM in PostgreSQL, purge thread in MySQL). But that’s a small price to pay for the massive concurrency win.

References

Optimistic vs Pessimistic Locking

intermediate optimistic-locking pessimistic-locking concurrency version-column

When two users try to update the same data at the same time, we need a strategy. There are two main approaches: pessimistic locking (lock first, then work) and optimistic locking (work first, check for conflicts at the end).

Pessimistic Locking

With pessimistic locking, we assume conflicts are likely and lock the data before we even start working on it. “I’m going to grab this row and hold onto it until I’m done.”

-- Pessimistic: lock the row before reading
BEGIN;

SELECT * FROM products WHERE id = 42 FOR UPDATE;
-- Row is now locked — nobody else can modify it

-- Check stock, calculate price, etc.
-- ... application logic ...

UPDATE products SET stock = stock - 1 WHERE id = 42;
COMMIT;
-- Lock released

If another transaction tries to SELECT ... FOR UPDATE on the same row, it will wait (block) until our transaction finishes.

Optimistic Locking

With optimistic locking, we assume conflicts are rare. We read the data, do our work, and only check for conflicts when we try to save. No locks are held during the work.

The most common pattern is using a version column:

-- Step 1: Read the row (no lock)
SELECT id, name, stock, version FROM products WHERE id = 42;
-- Returns: id=42, name="Widget", stock=10, version=5

-- Step 2: Do application processing (no lock held)
-- ... calculate new values ...

-- Step 3: Update only if version hasn't changed
UPDATE products
SET stock = 9, version = version + 1
WHERE id = 42 AND version = 5;
-- If rows affected = 1 → success
-- If rows affected = 0 → someone else updated it, retry!

If another transaction modified the row (bumping the version to 6), our WHERE clause won’t match, the update affects 0 rows, and we know there was a conflict.

Pessimistic Locking

1. BEGIN transaction

2. LOCK the row (FOR UPDATE)

3. Read data

4. Process in application

5. Write changes

6. COMMIT (release lock)

Lock held during steps 2-6

Optimistic Locking

1. Read data + version number

2. Process in application

3. BEGIN transaction

4. UPDATE WHERE version = N

5. COMMIT

6. If 0 rows affected → retry

No lock held during processing

When to Use Which

Optimistic locking works best when:

Conflicts are rare (most users edit different data)
We want high throughput with many concurrent readers
The cost of retrying on conflict is acceptable
Examples: editing a user profile, updating preferences, CMS content

Pessimistic locking works best when:

Conflicts are frequent (many users competing for the same resource)
The cost of a failed operation is high (can’t just retry easily)
We need to guarantee the operation succeeds on the first try
Examples: booking a seat, purchasing limited inventory, bank transfers

Implementation: Optimistic with Version Column

-- Table setup
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    stock INT NOT NULL DEFAULT 0,
    version INT NOT NULL DEFAULT 1  -- version column
);

-- Application pseudocode:
-- 1. Read
-- product = SELECT * FROM products WHERE id = 42;
-- currentVersion = product.version;

-- 2. Process (no database locks)
-- newStock = product.stock - orderQuantity;

-- 3. Conditional update
-- result = UPDATE products
--          SET stock = newStock, version = version + 1
--          WHERE id = 42 AND version = currentVersion;

-- 4. Check result
-- if (result.rowsAffected === 0) {
--     // Conflict! Someone else updated the row. Retry from step 1.
-- }

Some ORMs have built-in optimistic locking. For example, Hibernate uses @Version annotation, Sequelize has optimisticLocking option, and Rails has lock_version column.

Implementation: Pessimistic with FOR UPDATE

-- Booking a concert seat (can't afford a conflict)
BEGIN;

-- Lock the seat row
SELECT * FROM seats WHERE event_id = 1 AND seat_number = 'A15' FOR UPDATE;

-- Check if it's available
-- if (seat.status !== 'available') throw new Error('Already booked');

-- Book it
UPDATE seats SET status = 'booked', booked_by = 42
WHERE event_id = 1 AND seat_number = 'A15';

COMMIT;

Alternative: updated_at Instead of Version

Instead of a version number, we can use a timestamp:

-- Read the row
SELECT * FROM products WHERE id = 42;
-- Returns: updated_at = '2024-03-15 10:30:00'

-- Update only if timestamp matches
UPDATE products
SET stock = 9, updated_at = NOW()
WHERE id = 42 AND updated_at = '2024-03-15 10:30:00';

The version number approach is generally safer because timestamps can have precision issues (two updates in the same millisecond).

Quick Comparison

Aspect	Pessimistic	Optimistic
Lock timing	Before work	No lock (check at commit)
Concurrency	Lower (blocking)	Higher (non-blocking)
Conflict handling	Prevention (wait)	Detection (retry)
Best for	High contention	Low contention
Deadlock risk	Higher	None
Implementation	`SELECT FOR UPDATE`	Version column + conditional UPDATE

In simple language, pessimistic locking is like reserving a table at a restaurant before we go — guaranteed spot but others can’t use it while we’re on the way. Optimistic locking is like showing up and hoping there’s a table — usually works, but sometimes we have to try again. Pick the strategy that matches how often our users compete for the same data.

References

PostgreSQL Advisory Locks

NoSQL Databases

Document Stores

intermediate nosql mongodb document-store denormalization

A document store is a database that stores data as documents — think JSON objects. Instead of rigid rows and columns, each document can have its own structure. No two documents in the same collection need to look alike.

MongoDB is the most popular document store, so we’ll use it as our primary example. Others include CouchDB, Amazon DocumentDB, and Firebase Firestore.

Terminology: SQL vs Document Store

SQL (PostgreSQL)	Document (MongoDB)
Database	Database
Table	Collection
Row	Document
Column	Field
JOIN	$lookup (or embedding)
Schema (enforced)	Schema (optional, flexible)

What a Document Looks Like

Documents are stored as BSON (Binary JSON) in MongoDB. They can contain nested objects, arrays, and any level of complexity:

{
  "_id": "ObjectId('65f1a2b3c4d5e6f7a8b9c0d1')",
  "name": "Manish Prajapati",
  "email": "manish@example.com",
  "age": 25,
  "address": {
    "city": "Mumbai",
    "state": "Maharashtra",
    "pin": "400001"
  },
  "skills": ["JavaScript", "Python", "PostgreSQL"],
  "experience": [
    {
      "company": "Acme Corp",
      "role": "Backend Developer",
      "years": 2
    },
    {
      "company": "Startup Inc",
      "role": "Full Stack",
      "years": 1
    }
  ]
}

Notice how the address and experience are embedded directly inside the user document. In a relational database, those would be separate tables with foreign keys.

Embedding vs Referencing

This is the biggest design decision in document databases.

Embedding (denormalized) — store related data directly inside the document:

// Order with embedded items — one read gets everything
{
  "_id": "order_123",
  "customer": "Manish",
  "items": [
    { "product": "Keyboard", "price": 2500, "qty": 1 },
    { "product": "Mouse", "price": 800, "qty": 2 }
  ],
  "total": 4100
}

Referencing (normalized) — store just the ID and look up the related data separately:

// Order with references — need a second query to get items
{
  "_id": "order_123",
  "customer_id": "user_456",
  "item_ids": ["item_789", "item_790"],
  "total": 4100
}

When to embed:

Data is always accessed together (order + items)
The embedded data doesn’t change often
The embedded data won’t grow unboundedly
One-to-few relationships

When to reference:

Data is shared across documents (a product referenced by many orders)
The related data changes frequently
The related data could grow very large
Many-to-many relationships

Basic MongoDB Queries

// Insert a document
db.users.insertOne({
  name: "Manish",
  email: "manish@example.com",
  age: 25,
  skills: ["JavaScript", "Python"]
})

// Find documents
db.users.find({ age: { $gte: 25 } })              // age >= 25
db.users.find({ skills: "Python" })                 // array contains "Python"
db.users.find({ "address.city": "Mumbai" })         // nested field

// Update
db.users.updateOne(
  { email: "manish@example.com" },                  // filter
  { $set: { age: 26 }, $push: { skills: "Go" } }   // update
)

// Delete
db.users.deleteOne({ email: "manish@example.com" })

Aggregation Pipeline

MongoDB’s aggregation pipeline is like SQL’s GROUP BY on steroids. We chain stages together to transform data:

// Find the average order total per customer, sorted by highest
db.orders.aggregate([
  { $match: { status: "completed" } },           // WHERE status = 'completed'
  { $group: {
      _id: "$customer_id",                        // GROUP BY customer_id
      avgTotal: { $avg: "$total" },               // AVG(total)
      orderCount: { $sum: 1 }                     // COUNT(*)
  }},
  { $sort: { avgTotal: -1 } },                   // ORDER BY avgTotal DESC
  { $limit: 10 }                                  // LIMIT 10
])

When Documents Beat Tables

Document stores shine when:

Schema varies across records — e-commerce products where a laptop has different fields than a t-shirt
Data is hierarchical — nested objects and arrays are natural
Rapid prototyping — no migrations needed, just add fields
Read-heavy with co-located data — one read gets everything (no JOINs)
Horizontal scaling — sharding is built into MongoDB

Document stores struggle when:

We need complex JOINs across many collections
We need strict schema enforcement (though MongoDB supports schema validation now)
We need multi-document ACID transactions everywhere (MongoDB supports them since v4.0, but they’re slower than single-document operations)
Data is highly relational with many-to-many relationships

In simple language, document stores let us store data the way our application actually uses it. Instead of splitting everything into flat tables and JOINing them back together, we store complete objects. The trade-off is that some queries become harder (cross-collection joins), but the queries we run most often become much simpler and faster.

References

MongoDB Documentation

Key-Value Stores

beginner nosql redis dynamodb key-value caching

A key-value store is the simplest type of database. We give it a key, and it gives us back a value. That’s it. No columns, no schemas, no joins.

Think of it like a giant dictionary (or a HashMap/Object in code). We look up a word (key) and get the definition (value).

key: "user:42:session"  →  value: "eyJhbGciOiJIUzI1NiJ9..."
key: "product:99:price"  →  value: "2999"
key: "config:theme"      →  value: "dark"

Popular Key-Value Stores

Redis — in-memory, blazing fast (sub-millisecond reads), supports data structures like lists, sets, sorted sets, and hashes. The most popular choice for caching and real-time data.

Amazon DynamoDB — managed, serverless, massively scalable. Supports both key-value and document models. Pay per request or provisioned capacity.

etcd — distributed key-value store used by Kubernetes for storing cluster state. Focuses on consistency and reliability.

Memcached — pure in-memory cache. Simpler than Redis but faster for basic key-value caching because it’s multi-threaded.

Common Use Cases

Caching — store the result of an expensive query or computation. Next time, check the cache first.

// Pseudocode
cache.set("user:42", JSON.stringify(userData), { ttl: 3600 })  // cache for 1 hour
cached = cache.get("user:42")
if (cached) return JSON.parse(cached)  // cache hit!
// otherwise, query the database

Session Storage — store user sessions with an expiration time. Way faster than querying a database on every request.

Rate Limiting — count requests per user per time window.

Feature Flags — toggle features on/off without deploying code.

Leaderboards — Redis sorted sets are perfect for ranked lists.

Pub/Sub Messaging — Redis supports publish/subscribe for real-time communication.

Redis in Action

Redis isn’t just a simple key-value store. It supports rich data types:

# Basic key-value
SET user:42:name "Manish"
GET user:42:name                    # "Manish"

# With expiration (TTL)
SET session:abc123 "user_data" EX 3600    # expires in 1 hour
TTL session:abc123                         # seconds remaining

# Hash (like a mini-object)
HSET user:42 name "Manish" email "manish@example.com" age 25
HGET user:42 name                  # "Manish"
HGETALL user:42                    # all fields

# List (queue/stack)
LPUSH notifications:42 "New message!"
LPUSH notifications:42 "Order shipped!"
LRANGE notifications:42 0 -1       # get all

# Set (unique values)
SADD user:42:skills "JavaScript" "Python" "Go"
SISMEMBER user:42:skills "Python"   # true
SMEMBERS user:42:skills             # all skills

# Sorted Set (leaderboard)
ZADD leaderboard 1500 "player:1"
ZADD leaderboard 2200 "player:2"
ZADD leaderboard 1800 "player:3"
ZREVRANGE leaderboard 0 2 WITHSCORES  # top 3 players

Limitations

Key-value stores trade features for speed:

No complex queries — we can’t say “find all users where age > 25”. We can only look up by key.
No relationships — no joins, no foreign keys, no referential integrity.
No schema — the value is opaque to the database. It can’t index or search inside values (Redis hashes are a partial exception).
Memory constraints — in-memory stores like Redis are limited by available RAM. Data larger than memory needs disk-backed solutions.

When to Pick Key-Value

✓

We always look up data by a known key

✓

We need sub-millisecond response times

✓

Data is simple (strings, counters, small objects)

✓

We need TTL (time-to-live) for auto-expiration

✗

We need to search by value or run complex queries

✗

We need relationships between records

✗

We need ACID transactions across multiple keys

In most real applications, key-value stores are used alongside a primary database, not as a replacement. PostgreSQL for the main data, Redis for caching and sessions. They complement each other.

In simple language, key-value stores are the fastest, simplest databases out there. Give it a key, get a value. No frills, no overhead. Use them for caching, sessions, and any scenario where we know exactly which key we want.

References

Wide-Column Stores

advanced nosql cassandra hbase wide-column time-series

Wide-column stores look like tables on the surface, but they work very differently from relational databases. Each row can have a different set of columns, and the data is organized for fast writes and horizontal scaling.

The name “wide-column” comes from the fact that rows can have a huge number of columns — potentially millions. But we shouldn’t confuse them with traditional relational columns. Think of each row as a sorted map of column names to values.

The Big Names

Apache Cassandra — massively scalable, eventually consistent, peer-to-peer architecture (no single point of failure). Used by Netflix, Discord, Apple.

Apache HBase — built on top of Hadoop HDFS. Strongly consistent, modeled after Google’s Bigtable paper. Used for random, real-time read/write access to big data.

Google Bigtable — the original, managed service on Google Cloud. Powers Google Search, Gmail, Maps, etc.

Core Concepts

Column Families

A column family is like a table, but each row can have different columns. Columns within a family are stored together on disk, which makes reading a family’s data efficient.

// Column Family: user_profile
Row Key     | name           | email                  | age  | city
------------|----------------|------------------------|------|--------
user:001    | "Manish"       | "manish@example.com"   | 25   | "Mumbai"
user:002    | "Priya"        | "priya@example.com"    |      | "Delhi"
user:003    | "Rahul"        |                        | 30   |

Notice how user:002 has no age and user:003 has no email or city. That’s fine — there’s no schema enforcing that every row must have every column.

Partition Key + Clustering Key

The partition key determines which node stores the data. All rows with the same partition key live on the same node.

The clustering key determines the sort order within a partition.

This is the most important design decision in Cassandra. We design our keys based on how we query the data.

-- Cassandra CQL (looks like SQL but isn't)
CREATE TABLE sensor_readings (
    sensor_id TEXT,           -- partition key
    timestamp TIMESTAMP,      -- clustering key (sorted)
    temperature DOUBLE,
    humidity DOUBLE,
    PRIMARY KEY (sensor_id, timestamp)
);

-- This is fast: reads all data for one sensor, sorted by time
SELECT * FROM sensor_readings
WHERE sensor_id = 'sensor-42'
AND timestamp > '2024-01-01';

-- This is SLOW or impossible: scanning across all sensors
-- SELECT * FROM sensor_readings WHERE temperature > 30;
-- Cassandra can't efficiently do this without scanning every partition

Partition Key → Node Assignment

Node 1

sensor-01 → [readings...]
sensor-05 → [readings...]

Node 2

sensor-02 → [readings...]
sensor-04 → [readings...]

Node 3

sensor-03 → [readings...]
sensor-06 → [readings...]

Partition key hash determines which node owns the data

Write-Optimized

Wide-column stores are designed for insane write throughput. Here’s how:

Every write is an append. Cassandra doesn’t update data in place. It writes to an in-memory table (memtable), then flushes to disk as immutable sorted files (SSTables). Old versions are cleaned up later during compaction.

This means writes are always fast — there’s no “find the row and update it” step. We just append.

-- Inserting data (CQL)
INSERT INTO sensor_readings (sensor_id, timestamp, temperature, humidity)
VALUES ('sensor-42', '2024-03-15 10:30:00', 25.5, 65.0);

-- "Updating" is just inserting with the same key — newer timestamp wins
INSERT INTO sensor_readings (sensor_id, timestamp, temperature, humidity)
VALUES ('sensor-42', '2024-03-15 10:30:00', 26.0, 64.0);

Eventual Consistency

By default, Cassandra is eventually consistent. When we write data, it goes to one node first and then replicates to other nodes. For a brief moment, different nodes might have different versions.

We can tune this with consistency levels:

-- Write to a quorum of nodes (majority must acknowledge)
CONSISTENCY QUORUM;
INSERT INTO sensor_readings (...) VALUES (...);

-- ONE: fastest, least consistent
-- QUORUM: good balance
-- ALL: slowest, most consistent (all replicas must respond)

Use Cases

Wide-column stores excel at:

Time-series data — sensor readings, metrics, stock prices (partition by device/symbol, cluster by time)
IoT data — millions of devices writing data continuously
Logging at scale — billions of log entries that need fast writes
User activity feeds — partition by user, cluster by timestamp
Messaging systems — Discord uses Cassandra to store billions of messages

They struggle with:

Ad-hoc queries (we must know our access patterns upfront)
Multi-partition queries (aggregating across all sensors)
Strong consistency requirements (possible but costs performance)
Small-scale applications (the overhead isn’t worth it for small data)

In simple language, wide-column stores are built for massive write throughput and horizontal scaling. We design the data model around our query patterns, not around the data relationships. If we’re writing millions of events per second and reading them back by a known key, this is the right tool. For anything that needs flexible querying or strong consistency, stick with a relational database.

References

Graph Databases

advanced nosql graph-database neo4j cypher relationships

In a relational database, relationships are an afterthought — we represent them with foreign keys and JOINs. In a graph database, relationships are first-class citizens. They’re stored directly, traversed instantly, and queried naturally.

Think of it like a social network. We don’t think “User #42 has a row in the friends table pointing to User #99.” We think “Manish is friends with Priya.” Graph databases store data the way we naturally think about connections.

Core Concepts

A graph database has three building blocks:

Nodes — the entities (people, products, locations). Like rows in a table.

Edges (Relationships) — the connections between nodes. They have a type and direction. Like foreign keys, but stored as actual links.

Properties — key-value pairs on both nodes and edges. Like columns.

Graph Structure

Manish :Person

FRIENDS_WITH ────────→ since: 2020

Priya :Person

↓ WORKS_AT

Acme :Company

TechCo :Company

Neo4j and Cypher

Neo4j is the most popular graph database. It uses Cypher as its query language, which uses ASCII art patterns to describe graph traversals.

// Create nodes
CREATE (m:Person {name: "Manish", age: 25})
CREATE (p:Person {name: "Priya", age: 24})
CREATE (a:Company {name: "Acme Corp"})

// Create relationships
CREATE (m)-[:FRIENDS_WITH {since: 2020}]->(p)
CREATE (m)-[:WORKS_AT {role: "Backend Dev"}]->(a)

// Find a person
MATCH (p:Person {name: "Manish"})
RETURN p

// Find Manish's friends
MATCH (m:Person {name: "Manish"})-[:FRIENDS_WITH]->(friend)
RETURN friend.name

// Find friends of friends (2 hops)
MATCH (m:Person {name: "Manish"})-[:FRIENDS_WITH*2]->(fof)
RETURN DISTINCT fof.name

// Find the shortest path between two people
MATCH path = shortestPath(
  (a:Person {name: "Manish"})-[*]-(b:Person {name: "Rahul"})
)
RETURN path

The beauty of Cypher is that it reads almost like a sentence. (m)-[:FRIENDS_WITH]->(friend) literally means “m is friends with friend.”

Why Graphs Beat JOINs

In a relational database, finding “friends of friends of friends” requires multiple self-JOINs:

-- SQL: friends of friends of friends (3 levels deep)
SELECT DISTINCT f3.name
FROM friendships f1
JOIN friendships f2 ON f1.friend_id = f2.user_id
JOIN friendships f3 ON f2.friend_id = f3.user_id
WHERE f1.user_id = 42;
-- Gets exponentially slower with each level

In a graph database, the same query is constant time per hop:

// Cypher: friends of friends of friends
MATCH (m:Person {id: 42})-[:FRIENDS_WITH*3]->(fof)
RETURN DISTINCT fof.name
// Traverses the graph directly — no JOINs needed

The key difference: relational databases compute relationships at query time (JOINs). Graph databases store relationships as physical pointers. Traversing a relationship is just following a pointer — O(1) regardless of the total data size.

Use Cases

Social networks — who knows whom, mutual friends, connection suggestions. LinkedIn’s “People You May Know” is a graph problem.

Recommendation engines — “People who bought X also bought Y.” We traverse the graph: User → bought → Product ← bought ← Other Users → bought → Other Products.

Fraud detection — find circular money transfers, identify suspicious patterns in transaction networks.

Knowledge graphs — Google’s Knowledge Graph, Wikipedia’s data structure. “What is the capital of the country where the Eiffel Tower is located?”

Access control — “Does user X have permission to resource Y through group Z?” This is a graph traversal.

Network topology — mapping computer networks, dependencies between microservices, infrastructure relationships.

When NOT to Use Graph Databases

Simple CRUD — if our data is mostly flat tables with simple relationships, a relational database is simpler and more battle-tested
Heavy aggregations — summing up sales by month, calculating averages. SQL is better at this
High write throughput — graph databases are generally optimized for read-heavy, traversal-heavy workloads
Large-scale analytics — for data warehousing and analytics, columnar databases are better

Other Graph Databases

Amazon Neptune — managed graph database supporting both property graphs (Gremlin) and RDF (SPARQL)
ArangoDB — multi-model database that supports graphs, documents, and key-value
Dgraph — distributed graph database built for production scale
JanusGraph — open-source, distributed, built on top of existing storage backends (Cassandra, HBase)

In simple language, graph databases are the right tool when the connections between our data matter as much as the data itself. If we’re constantly asking “who is connected to whom” or “how do we get from A to B”, a graph database will be orders of magnitude faster than joining relational tables. But for everything else, relational databases are still the better choice.

References

NoSQL Data Modeling

intermediate nosql data-modeling denormalization access-patterns single-table-design

NoSQL data modeling is backwards compared to relational modeling. In SQL, we design the schema first and figure out the queries later. In NoSQL, we start with the queries (access patterns) and design the data to serve them.

This is the single most important mindset shift. If we design a NoSQL schema the same way we’d design a SQL schema, we’ll end up with something slow and awkward.

Rule #1: Think Access Patterns First

Before we create any collection or table, we write down every question our application will ask:

1. Get a user by ID
2. Get all orders for a user, sorted by date
3. Get the top 10 products by sales this month
4. Get all comments on a post
5. Check if a user has already liked a post

Then we design the data to answer these questions as efficiently as possible — ideally with a single read operation each.

Rule #2: Denormalization Is the Norm

In SQL, we normalize data to avoid duplication. In NoSQL, we intentionally duplicate data to avoid joins (which most NoSQL databases can’t do efficiently).

// SQL approach (normalized): 3 tables, need JOINs
// users table:     { id: 1, name: "Manish" }
// orders table:    { id: 101, user_id: 1, total: 5000 }
// products table:  { id: 42, name: "Keyboard" }

// NoSQL approach (denormalized): everything in one document
{
  "_id": "order:101",
  "user": {
    "id": 1,
    "name": "Manish"           // duplicated from users
  },
  "items": [
    {
      "product_id": 42,
      "name": "Keyboard",      // duplicated from products
      "price": 2500,
      "qty": 2
    }
  ],
  "total": 5000,
  "created_at": "2024-03-15"
}

The trade-off: when “Manish” changes his name, we have to update it everywhere it’s duplicated. But reads are now lightning fast — one query gets everything.

Denormalize data that’s read together. If we always show the user’s name alongside their orders, put the name in the order document.

Embedding vs Referencing

This is the most common design decision in document databases like MongoDB.

Embed When:

Data is accessed together

One-to-few relationship

Child data doesn't change often

Child data belongs to one parent

Document stays under 16MB

Reference When:

Data is accessed independently

One-to-many or many-to-many

Referenced data changes frequently

Data is shared across documents

Unbounded growth (thousands of items)

// Embed: blog post with comments (accessed together, bounded)
{
  "_id": "post:1",
  "title": "NoSQL Data Modeling",
  "comments": [
    { "user": "Priya", "text": "Great post!", "date": "2024-03-15" },
    { "user": "Rahul", "text": "Very helpful", "date": "2024-03-16" }
  ]
}

// Reference: user with orders (orders grow unboundedly)
{
  "_id": "user:1",
  "name": "Manish",
  "order_ids": ["order:101", "order:102"]  // just IDs
}

Single Table Design (DynamoDB)

DynamoDB takes this to the extreme with single table design. Instead of multiple tables, we store everything in one table with carefully designed partition keys and sort keys.

// All entities in ONE table
PK              | SK              | Data
----------------|-----------------|---------------------------
USER#manish     | PROFILE         | { name: "Manish", email: "..." }
USER#manish     | ORDER#2024-001  | { total: 5000, status: "shipped" }
USER#manish     | ORDER#2024-002  | { total: 3000, status: "pending" }
USER#priya      | PROFILE         | { name: "Priya", email: "..." }
USER#priya      | ORDER#2024-003  | { total: 1500, status: "shipped" }
PRODUCT#42      | INFO            | { name: "Keyboard", price: 2500 }
PRODUCT#42      | REVIEW#manish   | { rating: 5, text: "Love it" }

Now we can answer multiple access patterns with one table:

Get user profile: PK = "USER#manish", SK = "PROFILE"
Get all orders for a user: PK = "USER#manish", SK begins_with "ORDER#"
Get all reviews for a product: PK = "PRODUCT#42", SK begins_with "REVIEW#"

Common Anti-Patterns

1. Treating NoSQL Like SQL

// BAD: Normalized MongoDB (just like SQL tables)
// users: { _id: 1, name: "Manish" }
// addresses: { _id: 1, user_id: 1, city: "Mumbai" }
// Now every read needs two queries + a "join"

// GOOD: Embed the address
// users: { _id: 1, name: "Manish", address: { city: "Mumbai" } }

2. Unbounded Arrays

// BAD: A product with millions of reviews embedded
{
  "_id": "product:42",
  "reviews": [
    // ... 500,000 reviews ...
    // Document exceeds 16MB limit!
  ]
}

// GOOD: Store reviews in a separate collection with a reference
// products: { _id: "product:42", name: "Keyboard" }
// reviews: { product_id: "product:42", user: "Manish", text: "..." }

3. Not Thinking About Updates

If we embed data that changes frequently, every update has to find and modify all the copies. Either accept the update cost, or reference instead.

4. Forgetting About Query Patterns

// We designed for "get orders by user"
// But now we need "get all orders by status"
// Our partition key is user_id — we can't efficiently query by status!

// Solution: create a Global Secondary Index (GSI) on status
// Or design a separate "view" of the data optimized for status queries

Design Process

List all access patterns — what will the application ask?
Identify the primary entity — what’s the main thing we’re querying?
Choose embed vs reference for each relationship
Design the keys — partition key for lookups, sort key for ordering
Add secondary indexes for alternate query patterns
Accept duplication — it’s the price we pay for read performance

In simple language, NoSQL data modeling is about optimizing for reads. We design the data structure to match exactly how the application reads it. Duplication is expected. Normalization is the enemy. If we catch ourselves thinking “I need a JOIN,” we probably need to restructure our data.

References

Choosing the Right Database

intermediate database-selection sql-vs-nosql architecture decision-framework

One of the most common interview questions (and real-world decisions) is: “Which database should we use?” The answer is always “it depends.” But we can break it down into a practical framework.

There’s no single best database. Each type was built to solve a specific problem. Our job is to match the problem to the tool.

The Decision Framework

Ask these four questions:

1. What Does Our Data Look Like?

Structured/Tabular

Fixed schema, rows and columns → Relational (PostgreSQL, MySQL)

Semi-structured

Nested objects, variable fields → Document (MongoDB)

Simple lookups

Key → value pairs, caches → Key-Value (Redis)

Highly connected

Relationships matter most → Graph (Neo4j)

Time-series/Logs

Append-heavy, sorted by time → Wide-Column (Cassandra) or TimescaleDB

2. What Are Our Access Patterns?

OLTP (lots of small reads/writes) — relational or document databases
OLAP (complex analytics, aggregations) — columnar databases (ClickHouse, BigQuery, Redshift)
Full-text search — Elasticsearch, Meilisearch, or PostgreSQL with GIN indexes
Graph traversal — graph databases
Simple lookups by key — key-value stores

3. What Consistency Do We Need?

Strong consistency (every read sees the latest write) — PostgreSQL, MySQL, MongoDB (with write concern)
Eventual consistency (reads might be slightly stale) — Cassandra, DynamoDB (default), Redis replicas

If we’re building a banking system, we need strong consistency. If we’re building a social media feed, eventual consistency is fine — nobody notices if a like count is off by 1 for a second.

4. How Big Will We Get?

Vertical scaling (bigger server) — PostgreSQL, MySQL. Works well up to several TB and thousands of connections.
Horizontal scaling (more servers) — Cassandra, DynamoDB, MongoDB (with sharding). Needed for truly massive scale.

Most applications never need horizontal scaling. PostgreSQL on a decent server can handle millions of users. Don’t over-engineer.

Real-World Examples

E-commerce Platform

Primary database: PostgreSQL
  - Products, orders, users, payments
  - Need ACID transactions for orders and payments
  - Complex queries (search by category, price range, filters)

Cache layer: Redis
  - Product catalog cache
  - Session storage
  - Cart data
  - Rate limiting

Search: Elasticsearch
  - Product search with typo tolerance
  - Faceted filtering (brand, price, rating)

Primary database: PostgreSQL or MySQL
  - User profiles, posts, follows
  - Need strong consistency for auth and profiles

Activity feed: Redis (sorted sets)
  - Pre-computed feed per user
  - Fast reads for infinite scroll

Media metadata: MongoDB
  - Variable metadata per media type (photo vs video vs story)

Friend suggestions: Neo4j or graph queries
  - "People you may know" (friends of friends)

IoT / Monitoring System

Time-series data: Cassandra or TimescaleDB
  - Millions of sensor readings per second
  - Need massive write throughput

Recent data / alerts: Redis
  - Current sensor values
  - Alert thresholds and triggers

Configuration: PostgreSQL
  - Device management
  - User accounts
  - Alert rules

Real-time Chat (like Discord)

Messages: Cassandra
  - Billions of messages
  - Partition by channel, cluster by timestamp
  - Fast writes, fast range reads

Presence (online/offline): Redis
  - Sub-millisecond lookups
  - Pub/sub for real-time updates

User/server data: PostgreSQL
  - User accounts, server settings, permissions
  - Need ACID for role changes and permissions

The “Just Use PostgreSQL” Rule

Here’s a pragmatic take: start with PostgreSQL for almost everything. It handles JSON (like MongoDB), has full-text search (like Elasticsearch for simpler cases), supports pub/sub, and can handle more scale than most apps will ever need.

Add specialized databases only when PostgreSQL demonstrably can’t handle a specific requirement:

Need sub-millisecond caching? Add Redis.
Need to search millions of documents with typo tolerance? Add Elasticsearch.
Need to write millions of events per second? Add Cassandra or Kafka.
Need complex graph traversals at scale? Add Neo4j.

Don’t start with 5 databases on day one. Start with one and add others when the need is proven.

Quick Comparison

Database	Type	Best For	Consistency	Scale
PostgreSQL	Relational	General purpose, OLTP	Strong	Vertical
MySQL	Relational	Web apps, read-heavy	Strong	Vertical
MongoDB	Document	Variable schemas, rapid dev	Tunable	Horizontal
Redis	Key-Value	Caching, sessions, real-time	Eventual*	Horizontal
Cassandra	Wide-Column	Time-series, high write throughput	Tunable	Horizontal
Neo4j	Graph	Relationship-heavy queries	Strong	Vertical
Elasticsearch	Search engine	Full-text search, analytics	Eventual	Horizontal
DynamoDB	Key-Value/Doc	Serverless, auto-scaling	Tunable	Horizontal
ClickHouse	Columnar	Analytics, OLAP	Eventual	Horizontal

*Redis can be configured for strong consistency with Redis Sentinel/Cluster

In simple language, the right database depends on what we’re building. Start with PostgreSQL (it can do 80% of things well), add Redis for caching, and only bring in specialized databases when we have a proven need. The worst mistake is using 5 different databases on day one because they sound cool. Keep it simple until complexity is required.

References

DB-Engines Ranking

Redis Essentials

Redis Data Types and Commands

intermediate redis data-types in-memory cache commands

A lot of people think Redis is “just a cache.” That’s selling it short. Redis is an in-memory data structure store — it gives us strings, hashes, lists, sets, sorted sets, streams, and more. Each data type has its own set of commands, and picking the right one can make a huge difference in how we solve a problem.

Let’s walk through each one.

Strings

The simplest type. A string in Redis can hold text, numbers, or even binary data (up to 512 MB). But we mostly use it for simple key-value caching and counters.

# Basic set and get
SET user:name "Manish"
GET user:name              # "Manish"

# Set with expiry (seconds)
SETEX session:abc123 3600 "user_42"   # expires in 1 hour

# Set only if key doesn't exist (useful for locks)
SETNX lock:order:99 "worker_1"       # returns 1 if set, 0 if already exists

# Atomic counter — no race conditions
SET page:views 0
INCR page:views            # 1
INCR page:views            # 2
INCRBY page:views 10       # 12

Use cases: simple cache, session tokens, counters, rate limiting.

Hashes

Think of a hash as a mini object or dictionary stored under one key. Instead of serializing a whole JSON object into a string, we can store each field separately — and update individual fields without rewriting the whole thing.

# Store a user profile
HSET user:42 name "Manish" age 28 city "Pune"

# Get a single field
HGET user:42 name          # "Manish"

# Get all fields
HGETALL user:42            # name, Manish, age, 28, city, Pune

# Update just one field
HSET user:42 age 29

# Increment a numeric field
HINCRBY user:42 age 1      # 30

Use cases: user profiles, product details, any object with multiple fields that we update independently.

Lists

Redis lists are linked lists — fast push/pop at both ends, but slow random access in the middle. They’re perfect for queues and “recent items” feeds.

# Build a queue (push left, pop right = FIFO)
LPUSH queue:emails "email_1"
LPUSH queue:emails "email_2"
RPOP queue:emails           # "email_1" (first in, first out)

# Blocking pop — waits up to 5 seconds for a new item
BLPOP queue:emails 5

# Recent activity feed (keep last 100 items)
LPUSH feed:user:42 "liked a post"
LTRIM feed:user:42 0 99    # trim to 100 items
LRANGE feed:user:42 0 9    # get the 10 most recent

Use cases: message queues, recent activity, job queues, chat history.

Sets

Unordered collections of unique values. Sets shine when we need to check membership or compute intersections/unions.

# Track tags on a post
SADD post:1:tags "redis" "database" "caching"
SADD post:2:tags "redis" "performance" "caching"

# Check if a tag exists
SISMEMBER post:1:tags "redis"    # 1 (true)

# All tags on post 1
SMEMBERS post:1:tags

# Tags common to both posts
SINTER post:1:tags post:2:tags   # "redis", "caching"

# All unique tags across both posts
SUNION post:1:tags post:2:tags

Use cases: tags, unique visitors, mutual friends, “already processed” tracking.

Sorted Sets

Like sets, but every member has a score. Redis keeps them sorted by score automatically. This is what makes leaderboards and priority queues trivial.

# Leaderboard
ZADD leaderboard 1500 "alice"
ZADD leaderboard 2300 "bob"
ZADD leaderboard 1800 "charlie"

# Top 3 players (highest score first)
ZREVRANGE leaderboard 0 2 WITHSCORES
# bob 2300, charlie 1800, alice 1500

# What's alice's rank? (0-based, highest first)
ZREVRANK leaderboard "alice"   # 2

# Increment score
ZINCRBY leaderboard 1000 "alice"   # alice now at 2500

Use cases: leaderboards, priority queues, rate limiting (sliding window), scheduling.

Streams

Streams are Redis’s answer to event logs and message queues. Unlike Pub/Sub, streams persist messages and support consumer groups — so we can have multiple consumers processing messages reliably.

# Add events to a stream
XADD orders * product "laptop" qty 1
XADD orders * product "mouse" qty 3

# Read all entries
XREAD COUNT 10 STREAMS orders 0

# Consumer groups — multiple workers processing orders
XGROUP CREATE orders workers 0
XREADGROUP GROUP workers worker-1 COUNT 1 STREAMS orders >

Use cases: event sourcing, audit logs, reliable message queues, activity streams.

HyperLogLog

This one is clever. HyperLogLog counts approximate unique items using very little memory — about 12 KB regardless of how many items we add. The trade-off is it’s ~0.81% off from the exact count.

# Count unique visitors
PFADD visitors:today "user_1" "user_2" "user_3"
PFADD visitors:today "user_1"    # duplicate, not counted again
PFCOUNT visitors:today            # 3

# Merge multiple days
PFMERGE visitors:week visitors:mon visitors:tue visitors:wed

Use cases: unique visitor counting, unique search queries, cardinality estimation.

Bitmaps

Bitmaps let us set individual bits on a string. Each bit is either 0 or 1. Super memory-efficient for tracking boolean states across millions of users.

# Track daily active users (user ID = bit offset)
SETBIT active:2026-03-31 42 1    # user 42 was active today
SETBIT active:2026-03-31 99 1    # user 99 was active today

# Was user 42 active?
GETBIT active:2026-03-31 42      # 1

# How many users were active today?
BITCOUNT active:2026-03-31       # 2

Use cases: feature flags, daily active users, bloom filter implementation.

Data Type Comparison

Type	Best For	Key Commands
String	Cache, counters, locks	SET, GET, INCR, SETNX
Hash	Objects, user profiles	HSET, HGET, HGETALL
List	Queues, recent items	LPUSH, RPOP, BLPOP
Set	Unique items, tags	SADD, SINTER, SUNION
Sorted Set	Leaderboards, rankings	ZADD, ZRANGE, ZREVRANK
Stream	Event logs, message queues	XADD, XREAD, XREADGROUP
HyperLogLog	Unique counts (~0.81% error)	PFADD, PFCOUNT, PFMERGE
Bitmap	Feature flags, DAU tracking	SETBIT, GETBIT, BITCOUNT

Internal Encodings (Brief)

Redis doesn’t always use the “obvious” data structure under the hood. For small data, it uses compact encodings to save memory:

Small hashes/lists/sets use listpack (previously ziplist) — a flat, compact byte array. Much less memory than a real hash table or linked list.
Large sorted sets use a skiplist — a probabilistic data structure that gives us O(log n) lookups, similar to a balanced tree.
Large hashes switch to a real hashtable once they exceed a threshold.

We usually don’t need to worry about this — Redis picks the best encoding automatically. But it’s good to know for interviews when someone asks “how does Redis store a sorted set internally?”

References

Redis Persistence

intermediate redis persistence rdb aof durability

Here’s a question that comes up in almost every interview about Redis: “If Redis stores everything in memory, what happens when the server crashes? Is all the data gone?”

The answer is no — Redis CAN persist data to disk. It gives us three approaches: RDB snapshots, AOF (Append-Only File), and hybrid persistence. Each has different trade-offs between performance, durability, and disk usage.

RDB — Point-in-Time Snapshots

RDB is the simplest approach. At configured intervals, Redis takes a snapshot of all the data in memory and writes it to a binary file on disk (usually dump.rdb).

How It Works

Redis forks the main process (creates a child process)
The child process writes all the data to a temporary file
Once done, it replaces the old RDB file with the new one
The parent process keeps serving requests the whole time

The fork uses copy-on-write — the child shares memory with the parent. Only pages that get modified after the fork need to be copied. So the memory overhead is usually small.

# Manual snapshot (blocks the server — don't use in production)
SAVE

# Background snapshot (non-blocking — use this one)
BGSAVE

# Check when last save happened
LASTSAVE

Configuration

# In redis.conf — save after N changes in M seconds
save 900 1        # snapshot if at least 1 key changed in 900 seconds
save 300 10       # snapshot if at least 10 keys changed in 300 seconds
save 60 10000     # snapshot if at least 10000 keys changed in 60 seconds

# RDB filename and directory
dbfilename dump.rdb
dir /var/lib/redis

Pros and Cons of RDB

Compact file — single binary file, easy to back up and transfer
Fast restarts — loading an RDB file is much faster than replaying logs
Low runtime overhead — snapshots happen in a background fork
Data loss between snapshots — if Redis crashes 5 minutes after the last snapshot, those 5 minutes of writes are gone
Fork can be slow — on large datasets, the fork itself takes time and can cause a brief pause

AOF — Append-Only File

AOF takes a completely different approach. Instead of periodic snapshots, it logs every single write operation to a file as it happens. To recover, Redis replays the entire log from start to finish.

fsync Policies

The big question with AOF is: how often do we flush the log to disk?

# In redis.conf
appendonly yes
appendfilename "appendonly.aof"

# fsync policy — this is the critical setting
appendfsync always      # fsync after every write — safest, slowest
appendfsync everysec    # fsync once per second — good compromise (default)
appendfsync no          # let the OS decide — fastest, least safe

always — we lose zero writes on crash, but every write pays the disk I/O cost. Very slow.
everysec — we lose at most 1 second of writes. This is what most people use.
no — the OS flushes whenever it wants (usually every 30 seconds). Fast but risky.

AOF Rewrite

The AOF file keeps growing with every write. If we SET the same key a million times, the file has a million entries for it — but only the last one matters.

AOF rewrite compacts the file by replacing all those operations with the minimal set of commands needed to recreate the current state.

# Trigger a manual rewrite
BGREWRITEAOF

# Auto-rewrite when AOF file grows by this percentage
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

Pros and Cons of AOF

Minimal data loss — at most 1 second with everysec
Human-readable — it’s just a log of Redis commands, we can inspect it
Append-only — no seek operations, good for spinning disks
Larger files — AOF files are much bigger than RDB snapshots
Slower restarts — replaying thousands of commands takes longer than loading a binary

Hybrid Persistence (Redis 4.0+)

Why choose when we can have both? Hybrid persistence combines the best of RDB and AOF:

During an AOF rewrite, Redis writes an RDB snapshot as the base of the new AOF file
Then appends only the new AOF entries that happened during the rewrite

On restart, Redis loads the RDB portion first (fast), then replays the small AOF tail (catches up the rest).

# Enable hybrid persistence
aof-use-rdb-preamble yes   # enabled by default since Redis 5.0

This gives us fast restarts (RDB) AND minimal data loss (AOF). It’s the recommended approach for most production setups.

RDB vs AOF vs Hybrid

Feature	RDB	AOF	Hybrid
Data loss risk	Minutes	~1 second	~1 second
File size	Small	Large	Medium
Restart speed	Fast	Slow	Fast
Write overhead	Low	Medium	Medium
Recommended?	Backups only	Legacy setups	Yes

What About No Persistence?

We can disable both RDB and AOF entirely. This makes Redis a pure in-memory cache — if it crashes, everything is gone.

# Disable RDB
save ""

# Disable AOF
appendonly no

This is perfectly fine if we’re using Redis purely as a cache and the source of truth lives in a database. We can always rebuild the cache from the DB.

Interview Tip

When someone asks “Is Redis durable?”, the answer is: it depends on our configuration. With appendfsync always, we get durability comparable to a traditional database (at the cost of performance). With appendfsync everysec (the default), we trade up to 1 second of data loss for much better throughput. And with RDB-only, we accept larger windows of potential data loss for minimal overhead.

The right choice depends on what we’re storing. Session tokens? AOF everysec is fine. Financial transactions? We probably shouldn’t be using Redis as the primary store anyway.

References

Caching Patterns

intermediate caching redis cache-aside write-through cache-stampede cache-penetration

Caching sounds simple on the surface — store stuff in memory so we don’t hit the database every time. But the devil is in the details. How we cache matters a lot. Do we populate the cache on reads or writes? What happens when the cache and database go out of sync? What if a popular key expires and a thousand requests hit the database at once?

Let’s look at the main caching patterns and the problems we’ll inevitably run into.

Cache-Aside (Lazy Loading)

This is the most common pattern. The application manages the cache itself:

Read: Check the cache first. If found (cache hit), return it. If not (cache miss), read from the database, put it in the cache, then return it.
Write: Write directly to the database. Optionally invalidate (delete) the cache entry.

App checks cache for key

Cache miss! Key not found

App reads from database

App writes result to cache (with TTL)

App returns data to client

Cache-aside read flow (on a cache miss)

# Pseudocode for cache-aside
def get_user(user_id):
    # Step 1: Check cache
    cached = redis.GET(f"user:{user_id}")
    if cached:
        return deserialize(cached)    # cache hit!

    # Step 2: Cache miss — read from DB
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)

    # Step 3: Populate cache with TTL
    redis.SETEX(f"user:{user_id}", 3600, serialize(user))

    return user

Pros: Only caches data that’s actually requested. Simple to implement. Cons: First request is always slow (cache miss). Data can become stale if the DB changes and we don’t invalidate.

Write-Through

With write-through, every write goes to both the cache and the database at the same time. The cache is always up to date.

# Pseudocode for write-through
def update_user(user_id, data):
    # Write to DB
    db.query("UPDATE users SET ... WHERE id = %s", user_id, data)

    # Write to cache at the same time
    redis.SETEX(f"user:{user_id}", 3600, serialize(data))

Pros: Cache is always consistent with the DB. No stale reads. Cons: Higher write latency (two writes per operation). We cache data that might never be read.

Write-Behind (Write-Back)

This is the performance-optimized version. We write to the cache immediately and asynchronously flush to the database later (in batches or after a delay).

# Pseudocode for write-behind
def update_user(user_id, data):
    # Write to cache immediately — fast!
    redis.SETEX(f"user:{user_id}", 3600, serialize(data))

    # Queue the write for async processing
    queue.push({"action": "update_user", "id": user_id, "data": data})

# Background worker picks up the queue and writes to DB

Pros: Super fast writes (only cache). Batching reduces DB load. Cons: If the cache crashes before flushing, we lose data. More complex to implement. Debugging is harder.

Read-Through

Similar to cache-aside, but the cache itself is responsible for fetching from the DB on a miss. The application doesn’t talk to the DB directly — it always goes through the cache layer.

In simple language, the only difference from cache-aside is who does the fetching. In cache-aside, our app code fetches from DB. In read-through, the cache library handles it.

Cache Invalidation Strategies

Phil Karlton famously said there are only two hard things in computer science: cache invalidation and naming things. He wasn’t wrong.

TTL-based — set an expiration time. Simple but the data can be stale until the TTL expires.
Event-driven — when the DB changes, publish an event to invalidate the cache. More accurate but more complex.
Version-based — include a version number in the cache key. When data changes, bump the version. Old entries naturally become orphans.

Cache Stampede (Thundering Herd)

This is a nasty problem. Imagine a popular cache key (say, the homepage data) expires. Suddenly, thousands of requests all see a cache miss at the same time and all hit the database simultaneously. The DB gets crushed.

Solutions

1. Distributed lock (SETNX) Only one request fetches from the DB. Others wait for the cache to be repopulated.

# Only one process gets the lock and rebuilds the cache
def get_with_lock(key):
    value = redis.GET(key)
    if value:
        return value

    # Try to acquire a lock
    if redis.SET(f"lock:{key}", "1", NX=True, EX=10):
        # We got the lock — rebuild the cache
        value = db.query(...)
        redis.SETEX(key, 3600, value)
        redis.DEL(f"lock:{key}")
        return value
    else:
        # Someone else is rebuilding — wait and retry
        sleep(0.1)
        return get_with_lock(key)

2. Probabilistic early expiration Each request has a small random chance of refreshing the cache before it actually expires. The hotter the key, the more likely someone refreshes it early.

3. Background refresh A background job refreshes popular keys before they expire. The cache never actually becomes empty.

Cache Penetration

This happens when requests keep asking for keys that don’t exist — not in the cache AND not in the database. Every request is a cache miss that hits the DB, and the DB always returns nothing.

Think of it like someone querying user:99999999 over and over. The cache never helps because there’s nothing to cache.

Solutions

1. Cache null values If the DB returns nothing, cache a null/empty marker with a short TTL.

# Cache the "not found" result too
value = db.query("SELECT * FROM users WHERE id = %s", user_id)
if value is None:
    redis.SETEX(f"user:{user_id}", 300, "NULL")   # cache for 5 min

2. Bloom filter A bloom filter is a space-efficient data structure that tells us “this key DEFINITELY doesn’t exist” or “this key MIGHT exist.” We check the bloom filter before hitting the cache or DB.

# Before hitting cache or DB, check bloom filter
if not bloom_filter.might_contain(user_id):
    return None    # definitely doesn't exist, skip everything

# Otherwise, proceed with normal cache-aside logic

Cache Avalanche

Similar to a stampede, but worse. This happens when a large number of keys expire at the same time — maybe because they were all set with the same TTL during a cache warm-up.

Solution: Add a random jitter to TTL values.

# Instead of a fixed TTL
redis.SETEX(key, 3600, value)

# Add random jitter (3600 ± 300 seconds)
import random
ttl = 3600 + random.randint(-300, 300)
redis.SETEX(key, ttl, value)

Quick Reference

Cache-Aside

App manages cache. Most common. Simple but stale data risk.

Write-Through

Write to both. Always consistent. Higher write latency.

Write-Behind

Write cache, async DB. Fastest writes. Data loss risk.

Read-Through

Cache fetches from DB on miss. Cleaner app code.

References

Redis Eviction and Expiry

intermediate redis eviction lru lfu ttl memory

Redis stores everything in memory. Memory is finite. So what happens when Redis runs out of it?

This is controlled by two things: the maxmemory setting (how much memory Redis is allowed to use) and the eviction policy (what Redis does when it hits that limit). Getting these right is critical for production.

Setting the Memory Limit

# In redis.conf
maxmemory 2gb

# Or set at runtime
CONFIG SET maxmemory 2gb

# Check current memory usage
INFO memory

If we don’t set maxmemory, Redis on a 64-bit system will use as much memory as it wants — which can be dangerous. On 32-bit systems, there’s an implicit 3 GB limit.

The 8 Eviction Policies

When Redis hits the memory limit and we try to write new data, it looks at the eviction policy to decide what to do.

noeviction

Redis refuses to accept new writes and returns an error. Reads still work fine. This is the default.

Use it when: we’d rather fail loudly than silently lose cached data.

allkeys-lru

Evict the least recently used key across ALL keys in the database.

Use it when: we’re using Redis as a general-purpose cache and want old/cold data to make room for new data.

volatile-lru

Same as allkeys-lru, but only considers keys that have a TTL set. Keys without a TTL are never evicted.

Use it when: we have a mix of permanent data (no TTL) and cache data (with TTL), and we only want to evict the cache.

allkeys-lfu

Evict the least frequently used key across all keys. LFU tracks how often a key is accessed, not just when.

Use it when: we want to keep “popular” keys around even if they haven’t been accessed in the last few seconds.

volatile-lfu

LFU, but only among keys with a TTL.

allkeys-random

Pick a random key and evict it. No tracking, no overhead.

Use it when: all keys are equally important and we just need to make space.

volatile-random

Random eviction, but only among keys with a TTL.

volatile-ttl

Evict the key with the shortest remaining TTL — the one closest to expiring anyway.

Use it when: keys with shorter TTLs are less important.

Choosing the Right Policy

Decision Guide

Pure cache (all data is expendable)
→ Use allkeys-lru or allkeys-lfu

Mix of cache + permanent data
→ Use volatile-lru (set TTL only on cache keys)

Some keys are way more popular than others
→ Use allkeys-lfu (keeps hot keys around)

Data loss is unacceptable
→ Use noeviction (and handle the errors in our app)

# Set eviction policy
CONFIG SET maxmemory-policy allkeys-lru

# In redis.conf
maxmemory-policy allkeys-lru

LRU in Redis Is an Approximation

Here’s an important detail: Redis does NOT implement true LRU. A real LRU would require tracking the access time of every single key and maintaining a linked list — that’s expensive in terms of memory.

Instead, Redis uses sampled LRU. When it needs to evict, it picks a random sample of keys (default: 5) and evicts the least recently used one from that sample. It’s not perfect, but it’s surprisingly close to true LRU with much less overhead.

# Increase the sample size for better accuracy (at the cost of CPU)
maxmemory-samples 10   # default is 5

With a sample size of 10, Redis’s approximation is nearly indistinguishable from true LRU. At 5, it’s still quite good for most workloads.

LFU works similarly — it’s also sampled, not exact.

TTL and Key Expiration

Besides eviction (which happens when memory is full), Redis also supports TTL-based expiration — we can set a timer on any key, and Redis will automatically delete it when the timer runs out.

Setting TTL

# Set TTL in seconds
SET session:abc "user_42"
EXPIRE session:abc 3600          # expires in 1 hour

# Set TTL in milliseconds
PEXPIRE session:abc 3600000      # same thing, in ms

# Set TTL at a specific Unix timestamp
EXPIREAT session:abc 1743465600  # expires at this exact time

# Set key with TTL in one command
SETEX session:abc 3600 "user_42"

# Check remaining TTL
TTL session:abc      # returns seconds remaining (-1 = no TTL, -2 = key doesn't exist)
PTTL session:abc     # returns milliseconds remaining

# Remove TTL (make key permanent again)
PERSIST session:abc

How Expiration Actually Works

Redis uses two strategies to clean up expired keys:

1. Lazy expiration (passive) When we try to access a key, Redis checks if it’s expired. If it is, Redis deletes it right there and returns nothing. This means expired keys can sit in memory until someone asks for them.

2. Active expiration (periodic scan) Redis runs a background task ~10 times per second that:

Randomly samples 20 keys from the set of keys with TTL
Deletes any that are expired
If more than 25% of the sampled keys were expired, repeat immediately

This combination ensures that expired keys don’t pile up too much, while keeping the overhead low.

In simple language, Redis is lazy about expiration — it doesn’t watch a clock for every key. Instead, it checks when we access a key (lazy) and periodically sweeps through a sample of keys (active). This is why at any given moment, there might be expired keys still sitting in memory that haven’t been cleaned up yet.

Eviction vs Expiration

These are two different things and it’s easy to mix them up:

Eviction

Triggered by: memory limit reached

What's removed: depends on policy

Purpose: make room for new data

Expiration

Triggered by: TTL timer runs out

What's removed: only that specific key

Purpose: auto-cleanup of stale data

A key can be evicted even if its TTL hasn’t expired (if memory is full). And a key can expire even if there’s plenty of memory left.

Practical Configuration

Here’s a solid starting point for a production Redis cache:

# redis.conf
maxmemory 4gb                     # set based on available RAM
maxmemory-policy allkeys-lru      # good default for a cache
maxmemory-samples 5               # default, increase to 10 for better accuracy

# Lazy-free eviction (non-blocking deletion for large keys)
lazyfree-lazy-eviction yes         # delete evicted keys in background thread
lazyfree-lazy-expire yes           # delete expired keys in background thread

The lazyfree settings are important for large keys. Without them, deleting a key with millions of elements blocks the main Redis thread. With lazyfree enabled, the deletion happens in a background thread.

References

Redis Pub/Sub, Distributed Locks, and Cluster

advanced redis pub-sub distributed-locks redlock cluster sentinel

Redis isn’t just about storing and retrieving data. It also gives us tools for real-time messaging (Pub/Sub), coordination between services (distributed locks), and horizontal scaling (Redis Cluster). These are the features that turn Redis from a simple cache into a system-level building block.

Pub/Sub — Fire and Forget Messaging

Redis Pub/Sub lets us broadcast messages to multiple subscribers in real-time. Publishers send messages to a channel, and any client subscribed to that channel receives them instantly.

Think of it like a radio broadcast. The station (publisher) doesn’t know or care who’s listening. Listeners (subscribers) tune into a channel and hear whatever comes through.

# Terminal 1 — Subscribe to a channel
SUBSCRIBE notifications

# Terminal 2 — Publish a message
PUBLISH notifications "New order received!"
# Returns the number of subscribers who received it

# Pattern-based subscription (wildcards)
PSUBSCRIBE order:*       # matches order:created, order:shipped, etc.

The Big Caveat: No Persistence

Here’s the thing we need to remember — Pub/Sub is fire-and-forget. If a subscriber isn’t connected when a message is published, that message is gone forever. There’s no queue, no replay, no message history.

This means:

If our subscriber crashes and reconnects, it misses everything that happened while it was down
If nobody is listening on a channel, published messages just vanish
There’s no acknowledgment — we don’t know if anyone received our message

When to Use Pub/Sub vs Streams

If we need guaranteed delivery, message replay, or consumer groups — use Redis Streams instead. Pub/Sub is great for real-time notifications where losing a message occasionally is acceptable:

Cache invalidation signals across multiple app servers
Live dashboard updates
Chat messages (if we’re okay with no history)
Real-time notifications

Distributed Locks

In a distributed system, sometimes we need to make sure only one process does something at a time — processing a payment, sending an email, updating a shared resource. We need a distributed lock.

The Simple Approach

Redis gives us the building blocks with a single command:

# Acquire lock: set a key only if it doesn't exist, with a timeout
SET lock:order:123 "worker_1" NX EX 30
# NX = only if Not eXists
# EX 30 = expires in 30 seconds (so the lock auto-releases if we crash)

# Do our work...

# Release lock: delete it (but only if WE hold it)
# Use a Lua script to make this atomic
EVAL "if redis.call('GET', KEYS[1]) == ARGV[1] then return redis.call('DEL', KEYS[1]) else return 0 end" 1 lock:order:123 worker_1

Why the Lua script for releasing? Because we need to check-and-delete atomically. Without it, we could accidentally delete someone else’s lock:

Worker A’s lock expires (timeout)
Worker B acquires the lock
Worker A finishes its work and deletes the key — but that’s now Worker B’s lock!

The Lua script prevents this by checking the value before deleting.

The Problem with Single-Instance Locks

What if our single Redis server goes down? The lock disappears, and now multiple processes think they have the lock. That’s where Redlock comes in.

Redlock Algorithm

Redlock uses multiple independent Redis instances (not replicas — completely separate servers). The idea:

Get the current time
Try to acquire the lock on all N Redis instances with the same key and a random value
If we got the lock on at least N/2 + 1 instances (majority), and the total time to acquire was less than the lock timeout, we have the lock
If we failed, release the lock on all instances

Redis 1

LOCKED

Redis 2

LOCKED

Redis 3

LOCKED

Redis 4

FAILED

Redis 5

LOCKED

Redlock with 5 instances — 4/5 locked (majority achieved) — lock acquired

Martin Kleppmann’s Critique

Martin Kleppmann (author of “Designing Data-Intensive Applications”) wrote a famous critique of Redlock. His argument:

Process pauses — even after acquiring the lock, a GC pause or network delay can cause the lock to expire while we think we still have it
Clock drift — Redlock relies on time. If a server’s clock jumps forward, the lock can expire early
Fencing tokens — a safer approach is to use a monotonically increasing token with each lock acquisition. The protected resource rejects operations with old tokens

In simple language, Redlock gives us a “best effort” lock but can’t guarantee correctness in all edge cases. If we truly need a safe distributed lock, Kleppmann recommends using a consensus system like ZooKeeper or etcd.

For most practical use cases (deduplication, rate limiting, preventing duplicate jobs), Redlock works just fine. For critical financial operations, we should think twice.

Redis Cluster

When a single Redis server isn’t enough — either the dataset is too large to fit in one machine’s memory, or we need more throughput — we can use Redis Cluster for horizontal scaling.

How It Works

Redis Cluster divides the key space into 16,384 hash slots. Each key is mapped to a slot using CRC16(key) % 16384. Each node in the cluster owns a subset of these slots.

# Which slot does a key map to?
CLUSTER KEYSLOT "user:42"     # returns a number 0-16383

# Example: 3-node cluster
# Node A: slots 0-5460
# Node B: slots 5461-10922
# Node C: slots 10923-16383

When we send a command to the wrong node, it returns a MOVED redirect telling us which node actually has that slot. Smart Redis clients learn the slot mapping and route commands directly.

Replication Within the Cluster

Each primary node can have replicas. If a primary goes down, its replica gets promoted automatically. So a typical production cluster might have 3 primaries + 3 replicas = 6 nodes.

Limitations

Multi-key operations only work if all keys are on the same slot. We can force this with hash tags: {user:42}:profile and {user:42}:settings both hash to the same slot because Redis only hashes the part inside {}.
No cross-slot transactions — MULTI/EXEC only works within a single slot.

Sentinel vs Cluster

Redis Sentinel

High availability for single-server Redis

Automatic failover (primary → replica)

NO sharding — all data on one server

Use when: data fits in one server

Redis Cluster

Horizontal scaling across multiple servers

Automatic sharding + failover

Data split across 16,384 hash slots

Use when: data/traffic exceeds one server

In simple language: Sentinel gives us failover (if the primary dies, a replica takes over). Cluster gives us failover AND sharding (data split across multiple servers). If our dataset fits in one machine, Sentinel is simpler to operate. If we need to scale beyond a single server, Cluster is the way to go.

References

Scaling & Real-World Patterns

Replication

intermediate replication master-slave primary-replica failover scaling

Replication is one of the most fundamental concepts in database scaling. In simple language, it means keeping copies of the same data on multiple servers. If one server dies, another has the data. If one server is overloaded with reads, we spread the load across copies.

Every production database of any significant size uses replication. Let’s understand how it works.

Why Replicate?

Three main reasons:

High availability — if the primary server crashes, a replica can take over. No downtime.
Read scaling — distribute read queries across multiple replicas. One server handling 10,000 reads/sec becomes 3 servers handling 3,300 each.
Disaster recovery — a replica in a different data center protects against entire facility failures.

Master-Slave (Primary-Replica)

This is the most common replication setup. One server is the primary (master) — it handles all writes. One or more replicas (slaves) receive a copy of every change and serve read queries.

Application

writes

↓

Primary

reads + writes

reads

↓

Replica 1

read-only

Replica 2

read-only

Primary ──replication──→ Replicas (async or sync)

How It Works

The primary writes changes to a write-ahead log (WAL in PostgreSQL, binlog in MySQL)
Replicas connect to the primary and stream these changes
Replicas apply the changes to their own copy of the data
Clients can read from any replica

-- PostgreSQL: create a replica (simplified)
-- On the primary: configure pg_hba.conf to allow replication connections
-- On the replica:
pg_basebackup -h primary-host -D /var/lib/postgresql/data -U replicator -P

-- Check replication status on the primary
SELECT * FROM pg_stat_replication;

Replication Lag

Here’s the catch. Replicas don’t get changes instantly — there’s always some delay between a write hitting the primary and showing up on replicas. This is called replication lag.

Most of the time, lag is measured in milliseconds. But under heavy load, it can spike to seconds or even minutes. This means if a user writes something and immediately reads it from a replica, they might not see their own change.

Common solution: read-your-own-writes — route a user’s reads to the primary for a short window after they write something.

Master-Master (Multi-Primary)

In multi-primary replication, both (or all) nodes accept writes. Each node replicates changes to the others.

This sounds great in theory — no single point of failure for writes! But in practice, it introduces a nasty problem: write conflicts.

The Conflict Problem

What happens if User A updates a row on Node 1, and User B updates the same row on Node 2 at the same time? Both writes succeed locally, but when they replicate to each other — which one wins?

Conflict resolution strategies:

Last-write-wins (LWW) — use timestamps. Whichever write has the later timestamp wins. Simple but can lose data.
Application-level resolution — the app decides how to merge conflicting writes (like Git merge conflicts)
CRDTs (Conflict-free Replicated Data Types) — special data structures designed so conflicts are impossible

When to Use Multi-Primary

Multi-region writes — users in Europe write to the European primary, users in Asia write to the Asian one. Avoids cross-continent latency for writes.
High write availability — if one primary goes down, the other still accepts writes.

But honestly, most applications are better off with a single primary. Multi-primary adds a lot of complexity.

Synchronous vs Asynchronous Replication

This is a critical design decision.

Asynchronous (Default for Most Systems)

The primary writes to its local storage and immediately confirms the write to the client. Replication happens in the background.

Fast writes — no waiting for replicas
Data loss risk — if the primary crashes before replicating, those writes are gone
Stale reads — replicas can lag behind

Synchronous

The primary waits for at least one replica to confirm it received the write before confirming to the client.

No data loss on failover — at least one replica is always up to date
Slower writes — every write waits for a network round-trip to the replica
Availability risk — if the synchronous replica is down, writes block

Semi-Synchronous (Practical Compromise)

One replica is synchronous (guaranteed up-to-date), the rest are async. If the sync replica goes down, another async replica gets promoted to sync.

-- PostgreSQL: configure synchronous replication
-- In postgresql.conf on the primary
synchronous_commit = on
synchronous_standby_names = 'replica1'

-- Check sync status
SELECT * FROM pg_stat_replication;

Failover

When the primary goes down, we need to promote a replica to become the new primary. This can be manual or automatic.

Manual Failover

A DBA notices the primary is down, picks the most up-to-date replica, promotes it, and reconfigures the application to point to the new primary. This can take minutes.

Automatic Failover

A monitoring system detects the failure and automatically promotes a replica. Tools for this:

PostgreSQL: pg_auto_failover, Patroni (with etcd/ZooKeeper)
MySQL: MySQL Group Replication, Orchestrator, ProxySQL
Redis: Redis Sentinel

The tricky part is avoiding split-brain — a situation where the old primary comes back online and both it and the new primary think they’re in charge. This is why automatic failover tools use consensus protocols (like Raft) to agree on who the primary is.

Replication Topologies

Beyond simple primary-replica, we can get creative:

Chain replication — Primary → Replica A → Replica B. Reduces load on the primary (it only replicates to one node), but adds more lag for downstream replicas.
Star topology — Primary → multiple replicas directly. More load on the primary, but less lag.
Cascading — a replica can be the source for other replicas. Useful for cross-region setups.

Key Takeaways for Interviews

Replication gives us availability and read scaling, not write scaling
Async replication is faster but risks data loss; sync is safer but slower
Replication lag is inevitable — design our app to handle it
Multi-primary sounds appealing but conflict resolution is genuinely hard
Automatic failover needs consensus to avoid split-brain

References

Sharding

advanced sharding partitioning consistent-hashing horizontal-scaling

Replication helps us scale reads. But what about writes? What if our single primary server can’t handle the write load, or our data simply doesn’t fit on one machine anymore?

That’s where sharding comes in. In simple language, sharding means splitting our data across multiple database servers (called shards). Each shard holds a subset of the total data. Together, they hold everything.

Think of it like a library that’s gotten too big for one building. Instead of one massive library, we split the books across three buildings — A-H in Building 1, I-P in Building 2, Q-Z in Building 3. Each building is smaller and faster to search.

When Do We Actually Need Sharding?

Sharding adds significant complexity. We should only do it when simpler solutions have been exhausted:

Vertical scaling — get a bigger server (more RAM, faster CPU, better disks). This is often cheaper and simpler than sharding.
Read replicas — if the problem is too many reads, add replicas.
Caching — reduce database load with Redis or Memcached.
Query optimization — maybe the problem is bad queries, not insufficient hardware.

If none of these are enough, then yes, it’s time to shard.

Sharding Strategies

The critical question is: how do we decide which shard holds which data?

Range-Based Sharding

Split data based on ranges of a key value.

-- Shard 1: user IDs 1 to 1,000,000
-- Shard 2: user IDs 1,000,001 to 2,000,000
-- Shard 3: user IDs 2,000,001 to 3,000,000

-- To find user 1,500,000:
-- 1,500,000 falls in range 1M-2M → go to Shard 2
SELECT * FROM users WHERE id = 1500000;  -- routes to Shard 2

Pros: Simple to understand and implement. Range queries are efficient (all data in a range lives on one shard).

Cons: Hot spots. If most new users have high IDs, Shard 3 gets all the writes while Shard 1 sits idle. The distribution is often uneven in practice.

Hash-Based Sharding

Apply a hash function to the shard key, then use modulo to pick a shard.

-- shard_number = hash(user_id) % num_shards

-- With 3 shards:
-- hash(42) % 3 = 1   → Shard 1
-- hash(43) % 3 = 0   → Shard 0
-- hash(44) % 3 = 2   → Shard 2

Pros: Even distribution. No hot spots (assuming a good hash function).

Cons: Range queries are impossible — data that was “next to each other” gets scattered across shards. And resharding is painful — if we go from 3 shards to 4, almost every key maps to a different shard. We’d have to move most of the data.

Directory-Based Sharding

Maintain a lookup table that maps each key (or key range) to its shard.

-- Lookup table (stored in a separate database or service)
-- key_range    | shard
-- user:1-500K  | shard_1
-- user:500K-1M | shard_2
-- user:1M-2M   | shard_3

-- To find a user: look up the mapping first, then query the right shard

Pros: Maximum flexibility. We can move data between shards without changing the sharding logic — just update the lookup table.

Cons: The lookup service is a single point of failure. Every query has an extra hop. The directory itself needs to be fast and highly available.

Sharding Strategy Comparison

Strategy	Distribution	Range Queries	Resharding
Range	Uneven (hot spots)	Efficient	Easy (split ranges)
Hash	Even	Impossible	Painful (mass migration)
Directory	Flexible	Depends	Easy (update mapping)

Consistent Hashing

Consistent hashing solves the biggest problem with regular hash-based sharding: adding or removing a shard doesn’t require moving all the data.

The Problem with Simple Hashing

With hash(key) % N, changing N (the number of shards) changes where almost every key maps. Going from 3 shards to 4 means ~75% of keys need to move. That’s a nightmare.

How Consistent Hashing Works

Imagine the hash space as a ring (circle from 0 to 2^32). Both shards and keys are placed on this ring using a hash function.

To find which shard owns a key: start at the key’s position on the ring and walk clockwise until we hit a shard. That shard owns the key.

When we add a shard: the new shard takes over some keys from its clockwise neighbor. Only those keys move — everything else stays put.

When we remove a shard: its keys move to the next shard clockwise. Again, minimal disruption.

Shard A

Shard B

Shard C

key1 ↗

key2 ↘

key3 ↙

Keys walk clockwise to find their shard

In practice, each physical shard gets multiple virtual nodes on the ring for better distribution. Instead of Shard A at one point, we put Shard A at 100 points around the ring. This smooths out imbalances.

The Headaches of Sharding

Cross-Shard Queries

This is the biggest pain point. If we shard users by user_id, and we need to find “all orders in the last 24 hours” — that data is spread across all shards. We have to query every shard and merge the results. That’s slow and complex.

-- This query needs to hit ALL shards and merge results
SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '24 hours'
ORDER BY created_at DESC LIMIT 20;
-- Each shard returns its top 20, then we merge and re-sort

Cross-Shard JOINs

Even worse than queries — JOINs across shards are extremely expensive. If users are on Shard 1 and their orders are on Shard 3, a JOIN users ON orders.user_id = users.id requires fetching data from both shards over the network.

The common solution: denormalize. Store enough data in each shard so we don’t need cross-shard JOINs.

Choosing the Shard Key

The shard key is the single most important decision in sharding. A bad choice can make our lives miserable:

Too few unique values — uneven distribution (imagine sharding by country)
Doesn’t match query patterns — forces cross-shard queries for common operations
Can’t be changed later — resharding is extremely painful

Good shard keys have high cardinality, even distribution, and match our most common query patterns.

Resharding

When we need to add or remove shards, we have to move data. Even with consistent hashing, this means:

Setting up the new shard
Copying data from existing shards to the new one
Keeping both in sync during migration
Switching traffic to the new topology
Cleaning up old data

Some databases handle this automatically (like CockroachDB, Vitess for MySQL). With others, it’s a manual and risky operation.

Real-World Sharding

Instagram — sharded PostgreSQL by user_id (each shard is a PostgreSQL schema)
Discord — sharded by guild_id (server ID) using Cassandra
Notion — sharded PostgreSQL, eventually moved to a custom solution
Vitess — YouTube’s MySQL sharding middleware (now open source)

Interview Tip

The key thing interviewers want to hear: sharding is a last resort, not a first instinct. Always explain the simpler alternatives first (vertical scaling, caching, read replicas). Then if sharding is needed, explain the strategy, the shard key choice, and the trade-offs — especially around cross-shard queries and resharding.

References

Connection Pooling

intermediate connection-pooling pgbouncer hikaricp performance scaling

Opening a database connection is one of those things that sounds trivial but is actually surprisingly expensive. Every time our application opens a connection to PostgreSQL or MySQL, a whole chain of events happens:

TCP handshake — three network round-trips just to establish the connection
TLS handshake — if encrypted, another 1-2 round-trips for the SSL negotiation
Authentication — the database verifies our credentials
Process/thread creation — PostgreSQL forks a new process for each connection, MySQL creates a new thread
Memory allocation — each connection consumes ~5-10 MB on the server side

All of this can take 20-50 milliseconds. For a single query, that might be fine. But in a web application handling 1,000 requests per second, opening and closing a connection for every request means we’re spending more time connecting than actually querying.

Enter Connection Pooling

Connection pooling solves this by reusing existing connections instead of creating new ones. We maintain a pool of open, ready-to-use connections. When our app needs a database connection, it borrows one from the pool. When it’s done, it returns it to the pool (not closes it). The connection stays open and ready for the next request.

Application

1000 requests/sec

→ borrow →

← return ←

Connection Pool

● idle ● in use

→

Database

Only 5 connections
instead of 1000

In simple language, instead of 1,000 requests each opening their own connection (which would overwhelm the database), they all share a pool of, say, 20 connections. The database only sees 20 connections, not 1,000.

Pool Sizing — The Goldilocks Problem

Setting the pool size wrong causes real problems:

Too small — requests wait in line for a free connection. Latency spikes. Timeouts.
Too large — too many connections overwhelm the database. Context switching, memory pressure, worse performance for everyone.

The Rule of Thumb

A surprisingly common formula from the HikariCP docs:

connections = (core_count * 2) + effective_spindle_count

For a 4-core server with an SSD (spindle count = 0):

connections = (4 * 2) + 0 = 8 connections

That’s not a typo. Most applications need far fewer connections than people think. A pool of 10-20 connections can handle thousands of requests per second because each connection is only “in use” for the milliseconds it takes to run the query.

Common Poolers by Language

Node.js — pg Pool

const { Pool } = require('pg');

const pool = new Pool({
  host: 'localhost',
  database: 'myapp',
  user: 'postgres',
  password: 'secret',
  max: 20,                    // max connections in pool
  idleTimeoutMillis: 30000,   // close idle connections after 30s
  connectionTimeoutMillis: 2000, // fail if can't get connection in 2s
});

// Borrow a connection, use it, return it
const result = await pool.query('SELECT * FROM users WHERE id = $1', [42]);

// Or for transactions — manual checkout
const client = await pool.connect();
try {
  await client.query('BEGIN');
  await client.query('INSERT INTO orders ...');
  await client.query('COMMIT');
} catch (e) {
  await client.query('ROLLBACK');
  throw e;
} finally {
  client.release();  // return to pool — don't forget this!
}

Java — HikariCP

HikariCP is the fastest and most reliable connection pool for Java. It’s the default in Spring Boot.

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://localhost/myapp");
config.setUsername("postgres");
config.setMaximumPoolSize(10);           // max connections
config.setMinimumIdle(5);                // keep at least 5 idle
config.setIdleTimeout(300000);           // close idle after 5 min
config.setMaxLifetime(1800000);          // recycle connections every 30 min
config.setConnectionTimeout(30000);      // wait max 30s for a connection

HikariDataSource ds = new HikariDataSource(config);

Python — SQLAlchemy

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://postgres:secret@localhost/myapp",
    pool_size=10,          # max connections to keep open
    max_overflow=5,        # allow 5 extra connections under load
    pool_timeout=30,       # wait max 30s for a connection
    pool_recycle=1800,     # recycle connections after 30 min
)

External Poolers — PgBouncer

Sometimes in-app pooling isn’t enough. If we have 10 application servers, each with a pool of 20 connections, that’s 200 connections to PostgreSQL. With serverless functions (AWS Lambda), it can be even worse — each invocation might open its own connection.

PgBouncer sits between our app and PostgreSQL as a lightweight proxy. All our app servers connect to PgBouncer, and PgBouncer maintains a much smaller pool of actual database connections.

PgBouncer Pool Modes

This is a common interview question.

Transaction mode (most common): A connection is assigned to a client for the duration of a single transaction. Between transactions, the connection goes back to the pool. Very efficient.

Session mode: A connection is assigned to a client for the entire session (until they disconnect). Less efficient but supports all PostgreSQL features (prepared statements, temp tables, etc.).

Statement mode: A connection is assigned per statement. Ultra-efficient but doesn’t support multi-statement transactions.

; pgbouncer.ini
[databases]
myapp = host=localhost port=5432 dbname=myapp

[pgbouncer]
listen_port = 6432
pool_mode = transaction        ; most common choice
max_client_conn = 1000         ; accept up to 1000 client connections
default_pool_size = 20         ; but only keep 20 real DB connections
min_pool_size = 5
reserve_pool_size = 5          ; extra connections for spikes

Configuration Parameters to Know

max_connections — Maximum connections in the pool. Start low (10-20), increase only if needed.

min_idle — Minimum idle connections to keep open. Avoids cold-start delay.

idle_timeout — Close connections idle for this long. Frees resources on the DB side.

max_lifetime — Recycle connections after this time, even if active. Prevents stale connections.

connection_timeout — How long to wait for a free connection before throwing an error.

Common Mistakes

Not returning connections to the pool — forgetting to call release() or close() on a borrowed connection. The pool gradually runs out and everything hangs.
Setting the pool too large — 200 connections doesn’t mean 200x performance. Past a point, more connections means more contention and worse performance.
No connection validation — if a connection goes stale (DB restarted, network blip), the pool hands out a dead connection. Most pools have a test-on-borrow or validation-query option to check.
Ignoring max_lifetime — connections held open for days can accumulate server-side memory leaks, DNS changes, or firewall timeouts.

References

CQRS and Event Sourcing

advanced cqrs event-sourcing architecture eventual-consistency projections

Most applications use the same database model for both reading and writing data. We design one table schema, write to it, and read from it. Simple. And for most apps, that’s perfectly fine.

But sometimes reads and writes have very different requirements. Writes need strong consistency, validation, and normalized data. Reads need fast queries, denormalized data, and specialized views. Trying to optimize for both in a single model leads to compromises everywhere.

That’s where CQRS comes in. And if we want to take it further, Event Sourcing changes how we think about data entirely.

CQRS — Command Query Responsibility Segregation

The name sounds intimidating, but the idea is straightforward. In simple language, CQRS means: use different models for reading and writing data.

Command side (writes): handles creates, updates, deletes. Optimized for consistency and business rules. Uses a normalized schema.
Query side (reads): handles queries and reports. Optimized for speed. Uses denormalized views, materialized views, or even a completely different database.

Command Side (Write)

● Validates business rules

● Normalized schema

● Strong consistency

● Relational DB (PostgreSQL)

events / sync

→→→

Query Side (Read)

● Optimized for queries

● Denormalized views

● Eventual consistency (usually)

● Could be Elasticsearch, Redis, etc.

A Practical Example

Say we’re building an e-commerce platform. The write side has normalized tables: orders, order_items, products, customers. Great for consistency.

But the product listing page needs to show product name, average rating, inventory count, seller info, and recent reviews — all in one query. With a normalized schema, that’s a multi-table JOIN.

With CQRS, the read side has a pre-built product_listing_view table with all of that data already denormalized. One simple SELECT, no JOINs.

-- Write model: normalized, proper constraints
INSERT INTO orders (customer_id, total, status) VALUES (42, 99.99, 'pending');
INSERT INTO order_items (order_id, product_id, qty) VALUES (1001, 55, 2);

-- Read model: denormalized, fast queries
-- This table gets updated asynchronously when the write side changes
SELECT product_name, avg_rating, stock_count, seller_name, thumbnail_url
FROM product_listing_view
WHERE category = 'electronics'
ORDER BY popularity DESC
LIMIT 20;

When Is CQRS Worth It?

CQRS adds complexity — we have two models to maintain, and keeping them in sync introduces eventual consistency. It’s not worth it for simple CRUD apps.

It IS worth it when:

Read and write patterns are drastically different
We need different scaling for reads vs writes
We need specialized read stores (search engine, analytics DB)
The domain is complex enough that separating concerns helps clarity

Event Sourcing — Store Events, Not State

This is where things get interesting. In a traditional database, we store the current state of an entity. If a user changes their email, we update the email column. The old email is gone.

With event sourcing, we don’t store current state. We store every event that ever happened:

UserCreated { id: 42, name: "Manish", email: "old@example.com" }
EmailChanged { id: 42, email: "new@example.com" }
NameChanged { id: 42, name: "Manish P" }
AccountDeactivated { id: 42 }
AccountReactivated { id: 42 }

To get the current state of user 42, we replay all events from the beginning. The event log is the source of truth — everything else is derived from it.

The Event Store

-- A simplified event store table
CREATE TABLE events (
    event_id BIGSERIAL PRIMARY KEY,
    aggregate_id UUID NOT NULL,         -- which entity this event belongs to
    aggregate_type VARCHAR(100),         -- "User", "Order", etc.
    event_type VARCHAR(100) NOT NULL,    -- "OrderPlaced", "OrderShipped"
    event_data JSONB NOT NULL,           -- the event payload
    created_at TIMESTAMPTZ DEFAULT NOW(),
    version INT NOT NULL                 -- for optimistic concurrency
);

-- Index for replaying events for a specific entity
CREATE INDEX idx_events_aggregate ON events (aggregate_type, aggregate_id, version);

Projections

Nobody wants to replay thousands of events every time they query something. That’s where projections come in.

A projection is a materialized view built from events. It’s a read-optimized representation of the current state. When a new event arrives, the projection updates itself.

Think of it like a bank statement. The event log is every transaction that ever happened on our account. The projection (current balance) is just the sum of all those transactions. We don’t recalculate from scratch every time — we update the running total.

Event: OrderPlaced { id: 1001, customer: 42, total: 99.99 }
  → Projection updates: customer_orders[42].count += 1
  → Projection updates: customer_orders[42].total_spent += 99.99
  → Projection updates: daily_revenue[today] += 99.99

Benefits of Event Sourcing

Complete audit trail — every change ever made is recorded. Perfect for finance, healthcare, compliance.
Temporal queries — “What did this order look like at 3pm yesterday?” Just replay events up to that timestamp.
Easy debugging — when something goes wrong, we can see exactly what happened and in what order.
Rebuild views — if a projection has a bug, fix it and rebuild from the event log. The events are the truth.
New features from old data — want to add a new analytics dashboard? Build a new projection from existing events.

Challenges

Eventual consistency — projections lag behind the event log. If we just placed an order and immediately check our order list, it might not show up yet.
Schema evolution — events are immutable. If we need to change the structure of an event, we have to handle both old and new formats.
Storage growth — events accumulate forever. Snapshotting helps (periodically save current state so we don’t replay from the beginning).
Complexity — it’s a fundamentally different way of thinking about data. The learning curve is steep.

CQRS + Event Sourcing Together

These two patterns work great together, but they don’t have to be used together:

CQRS without Event Sourcing: separate read/write models, but both store current state. Perfectly valid.
Event Sourcing without CQRS: store events, but use the same model for reads. Works for simple cases.
Both together: events are the write model, projections are the read model. The full power combo.

Command

PlaceOrder, CancelOrder

↓ validated & persisted as

Event Store

OrderPlaced, OrderCancelled (immutable log)

↓ events consumed by

Projection A

Order list view

Projection B

Revenue dashboard

Projection C

Search index

Interview Tip

Interviewers love asking “when would you use CQRS or event sourcing?” The safe answer: for complex domains with different read/write patterns, audit requirements, or event-driven architectures. Always acknowledge the trade-offs — eventual consistency, increased complexity, and the learning curve. And make it clear that for most CRUD applications, a simple shared model with good indexing is the right call.

References

Database per Service, Outbox Pattern, and CDC

advanced microservices cdc debezium outbox-pattern saga distributed-systems

One of the first things we learn about microservices is that each service should own its data. No shared databases. Sounds clean in theory. In practice, it creates a whole set of new problems: how do services share data? How do we maintain consistency across services? How do we avoid dual writes?

Let’s break down the patterns that solve these problems.

Database per Service

In a monolith, all the code talks to one big database. Every module can JOIN any table, every query can reach any data. It works, but it means everything is tightly coupled at the data level.

In microservices, we flip this around. Each service gets its own database — and no other service is allowed to touch it directly.

Order Service   → orders_db (PostgreSQL)
User Service    → users_db (PostgreSQL)
Product Service → products_db (MongoDB)
Search Service  → search_index (Elasticsearch)

Why Separate Databases?

Loose coupling — we can change the Order Service’s schema without breaking the User Service
Independent deployment — each service can be deployed, scaled, and migrated independently
Right tool for the job — the Product Service can use MongoDB while the Order Service uses PostgreSQL
Blast radius — if one database goes down, only that service is affected

The Hard Part: Querying Across Services

With a shared database, getting a user’s order history is a simple JOIN. With separate databases, the Order Service doesn’t have access to the Users table. So how do we answer “show me all orders with customer names”?

Option 1: API calls — the Order Service calls the User Service API to get customer names. Simple but adds latency and creates runtime dependencies.

Option 2: Data replication — the Order Service keeps a local copy of the customer names it needs. More complex but faster queries and no runtime dependency.

Option 3: Events — when a user’s name changes, the User Service publishes an event. The Order Service listens and updates its local copy. Decoupled but eventually consistent.

The Dual Write Problem

Here’s a common mistake. Say we need to save an order AND send a message to Kafka:

1. Save order to database   ← succeeds
2. Publish event to Kafka   ← FAILS (network error)

Now we have an order in the database but no event was published. Other services never find out about the order. This is called the dual write problem — writing to two systems without a transaction across them.

We can’t just wrap both in a database transaction because Kafka isn’t part of the database’s transaction. That’s where the Outbox Pattern comes in.

The Outbox Pattern

The outbox pattern guarantees that if the business data is saved, the event will be published — eventually.

How It Works

In the same database transaction, we write the business data AND an event to a local outbox table
A separate process (the “relay”) reads the outbox table and publishes events to the message broker
Once published, the relay marks the outbox entry as processed

Since both writes happen in the same transaction, they either both succeed or both fail. No inconsistency possible.

-- Step 1: In the same transaction
BEGIN;

-- Save the business data
INSERT INTO orders (id, customer_id, total, status)
VALUES (1001, 42, 99.99, 'placed');

-- Write the event to the outbox table
INSERT INTO outbox (
    aggregate_type, aggregate_id, event_type, payload
) VALUES (
    'Order', '1001', 'OrderPlaced',
    '{"orderId": 1001, "customerId": 42, "total": 99.99}'
);

COMMIT;

-- Step 2: A background relay process picks up outbox entries
-- and publishes them to Kafka/RabbitMQ
-- Step 3: After successful publish, marks them as processed

The outbox table looks something like this:

CREATE TABLE outbox (
    id BIGSERIAL PRIMARY KEY,
    aggregate_type VARCHAR(100) NOT NULL,
    aggregate_id VARCHAR(100) NOT NULL,
    event_type VARCHAR(100) NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    processed_at TIMESTAMPTZ          -- NULL until published
);

Guarantees

The outbox pattern gives us at-least-once delivery. The relay might crash after publishing but before marking as processed, causing a re-publish. So consumers need to be idempotent — handling the same event twice should be harmless.

Change Data Capture (CDC)

CDC is another way to solve the “how do we get events from database changes” problem. But instead of writing to an outbox table, we tap directly into the database’s internal change log.

Every database already keeps a log of changes:

PostgreSQL has the WAL (Write-Ahead Log)
MySQL has the binlog
MongoDB has the oplog

CDC tools read these logs and turn them into event streams.

Debezium

Debezium is the most popular open-source CDC tool. It reads the database’s change log and publishes events to Kafka.

PostgreSQL

WAL (change log)

→

Debezium

reads WAL, emits events

→

Kafka

event stream

→

Search Service

Analytics

Cache Sync

Debezium captures every INSERT, UPDATE, DELETE and streams it to Kafka

CDC Use Cases

Syncing data across services — when the User Service updates a profile, CDC captures the change and other services receive it
Populating search indexes — every database change automatically updates Elasticsearch
Analytics pipelines — stream all changes to a data warehouse in real-time
Cache invalidation — when a row changes, invalidate the corresponding cache entry

CDC vs Outbox

Both solve similar problems but differently:

Outbox: application explicitly writes events. Full control over event format and content. Extra write per transaction.
CDC: captures changes automatically from the DB log. Zero application changes needed. But event format is tied to table structure.

A nice combo: use the outbox pattern with CDC. The application writes to an outbox table, and Debezium reads the outbox table’s changes from the WAL. This way we get explicit event control AND automatic publishing without a custom relay process. This is actually how Debezium’s recommended “outbox” connector works.

The Saga Pattern (Brief)

When an operation spans multiple services — like “place an order” which involves the Order Service, Payment Service, and Inventory Service — we can’t use a regular database transaction (because each service has its own database).

The Saga pattern breaks a distributed transaction into a sequence of local transactions, each in its own service. If one step fails, we run compensating transactions to undo the previous steps.

Two flavors:

Choreography — each service publishes events and listens for events. No central coordinator. Simple but hard to track and debug as the chain grows.

Order Service: OrderPlaced →
  Payment Service: PaymentProcessed →
    Inventory Service: StockReserved →
      Order Service: OrderConfirmed

If Payment fails:
  Payment Service: PaymentFailed →
    Order Service: OrderCancelled (compensating)

Orchestration — a central saga orchestrator tells each service what to do and handles failures. Easier to understand and debug, but the orchestrator is a single point of logic.

Orchestrator:
  1. Tell Order Service: create order → OK
  2. Tell Payment Service: charge payment → OK
  3. Tell Inventory Service: reserve stock → FAILED
  4. Tell Payment Service: refund payment (compensate)
  5. Tell Order Service: cancel order (compensate)

Important Caveat

Sagas give us eventual consistency, not strong consistency. Between steps, the system is in a partially-completed state. Our application needs to handle this — showing “order processing” instead of “order confirmed” until all steps complete.

Interview Tip

When discussing microservice data patterns, show that we understand the trade-offs. Database per service gives us independence but makes cross-service queries hard. The outbox pattern solves dual writes but adds a table and a relay process. CDC is elegant but ties events to database schema. Sagas handle distributed transactions but only provide eventual consistency. There’s no free lunch — every pattern solves one problem and introduces another.