Subqueries & CTEs - DBMS

A subquery is simply a query inside another query. Instead of hardcoding a value, we compute it on the fly. And a CTE (Common Table Expression) is a way to write subqueries that are actually readable.

Types of Subqueries

Scalar Subquery — Returns a Single Value

This goes wherever a single value is expected — in a SELECT, WHERE, or HAVING clause.

-- Find employees who earn more than the average salary
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

-- The subquery returns one number (e.g., 75000)
-- The outer query then filters: WHERE salary > 75000

Row Subquery — Returns a Single Row

Returns one row with multiple columns. Useful for comparing against a set of values.

-- Find the employee with the highest salary in each department
SELECT name, salary, department_id
FROM employees
WHERE (department_id, salary) IN (
    SELECT department_id, MAX(salary)
    FROM employees
    GROUP BY department_id
);

Table Subquery — Returns Multiple Rows

Returns a full result set that we can use as a virtual table.

-- Use a subquery as a temporary table in FROM
SELECT dept_name, avg_salary
FROM (
    SELECT d.name AS dept_name, AVG(e.salary) AS avg_salary
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    GROUP BY d.name
) AS dept_stats
WHERE avg_salary > 80000;

Correlated vs Non-Correlated

This distinction is crucial and comes up in interviews a lot.

Non-Correlated Subquery

The inner query runs once, independently of the outer query. It doesn’t reference any columns from the outer query.

-- Non-correlated: inner query runs ONCE
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

-- The database computes the average once, then uses that value for every row

Correlated Subquery

The inner query runs once for each row of the outer query. It references a column from the outer query. This makes it slower but more powerful.

-- Correlated: inner query runs ONCE PER ROW
-- Find employees who earn more than their department's average
SELECT e.name, e.salary, e.department_id
FROM employees e
WHERE e.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.department_id = e.department_id  -- references outer query!
);

-- For each employee, the database computes their department's average
-- Then checks if that employee's salary is above it

In simple language: if the subquery can run on its own (paste it in a SQL editor and it works), it’s non-correlated. If it references the outer query and can’t run alone, it’s correlated.

EXISTS vs IN

Both check if something exists, but they work differently.

IN — Checks Against a List

-- Find employees in departments that are in New York
SELECT name FROM employees
WHERE department_id IN (
    SELECT id FROM departments WHERE location = 'New York'
);

-- The subquery returns a list of IDs: (1, 3, 7)
-- Then it checks: WHERE department_id IN (1, 3, 7)

EXISTS — Checks for Existence

-- Same logic, different approach
SELECT e.name FROM employees e
WHERE EXISTS (
    SELECT 1 FROM departments d
    WHERE d.id = e.department_id AND d.location = 'New York'
);

-- For each employee, it asks: "Does a matching department exist?"
-- Returns TRUE/FALSE, doesn't actually return the rows

When to Use Which?

Use IN when the subquery returns a small result set. It materializes the whole list first.
Use EXISTS when the subquery might return a large result set. It stops as soon as it finds one match (short-circuits).
EXISTS is usually faster for correlated subqueries with large tables.
IN can have issues with NULL values (NULL IN (…) returns NULL, not FALSE).

-- Gotcha with IN and NULLs
SELECT name FROM employees
WHERE department_id NOT IN (SELECT id FROM departments);
-- If departments has a NULL id, this returns NOTHING!
-- Because NOT IN with NULLs evaluates to UNKNOWN

-- EXISTS doesn't have this problem
SELECT e.name FROM employees e
WHERE NOT EXISTS (
    SELECT 1 FROM departments d WHERE d.id = e.department_id
);
-- Works correctly even with NULLs

CTEs — Common Table Expressions

CTEs let us name a subquery and reference it by name. They use the WITH keyword and make complex queries way more readable.

Think of it like: “Let me define a temporary result, give it a name, and use it below.”

-- Without CTE (messy nested subqueries)
SELECT dept_name, avg_salary
FROM (
    SELECT d.name AS dept_name, AVG(e.salary) AS avg_salary
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    GROUP BY d.name
) AS dept_stats
WHERE avg_salary > (SELECT AVG(salary) FROM employees);

-- With CTE (clean and readable)
WITH dept_stats AS (
    SELECT d.name AS dept_name, AVG(e.salary) AS avg_salary
    FROM employees e
    JOIN departments d ON e.department_id = d.id
    GROUP BY d.name
),
company_avg AS (
    SELECT AVG(salary) AS avg_salary FROM employees
)
SELECT ds.dept_name, ds.avg_salary
FROM dept_stats ds, company_avg ca
WHERE ds.avg_salary > ca.avg_salary;

We can chain multiple CTEs and each one can reference the previous ones.

Recursive CTEs

Recursive CTEs are powerful for hierarchical data — org charts, folder structures, category trees.

They have two parts:

Base case — the starting rows
Recursive case — how to find the next level

-- Find all subordinates of a manager (org chart traversal)
WITH RECURSIVE subordinates AS (
    -- Base case: start with the manager
    SELECT id, name, manager_id, 1 AS level
    FROM employees
    WHERE id = 1  -- Start from employee #1

    UNION ALL

    -- Recursive case: find people who report to current level
    SELECT e.id, e.name, e.manager_id, s.level + 1
    FROM employees e
    INNER JOIN subordinates s ON e.manager_id = s.id
)
SELECT name, level FROM subordinates ORDER BY level;

-- Result:
-- name    | level
-- Alice   | 1       (the manager)
-- Bob     | 2       (reports to Alice)
-- Charlie | 2       (reports to Alice)
-- Diana   | 3       (reports to Bob)

Another classic example — generating a series of dates:

-- Generate all dates in January 2024
WITH RECURSIVE dates AS (
    SELECT DATE '2024-01-01' AS dt
    UNION ALL
    SELECT dt + INTERVAL '1 day'
    FROM dates
    WHERE dt < '2024-01-31'
)
SELECT dt FROM dates;

CTE vs Subquery — When to Use What

	Subquery	CTE
Readability	Can get messy when nested	Clean and named
Reusability	Must duplicate if used twice	Define once, use many times
Recursion	Not possible	Supported with RECURSIVE
Performance	Sometimes faster (inlined)	Sometimes materialized (check EXPLAIN)

In general: use CTEs for readability and when we need to reference the same derived table more than once. Use subqueries for simple, one-off filters.

Interview Tip

If asked to write a complex SQL query in an interview, use CTEs liberally. It shows clean thinking and makes the query much easier for the interviewer to follow. Also, knowing the difference between correlated and non-correlated subqueries is a common “gotcha” question — always mention the performance implication (correlated = once per row).