Database Sharding

There comes a point when a single database server just can’t handle it all. The data is too big to fit on one disk, or the queries are too many for one CPU. That’s when we need sharding.

What Is Sharding?

Sharding is splitting our data across multiple database servers. Each server holds a subset of the data (called a shard or partition). Together, all shards hold the complete dataset.

Think of it like splitting a phone book into volumes: A-F goes to Server 1, G-M goes to Server 2, N-Z goes to Server 3. Each server only deals with its portion.

Horizontal vs Vertical Partitioning

Horizontal sharding (row-based): We split rows across servers. All servers have the same columns, but different rows. This is what people usually mean when they say “sharding.”

Vertical partitioning (column-based): We split columns across servers. One server holds user_id, name, email and another holds user_id, profile_pic, bio. This is more like splitting a big table into smaller, focused tables.

Most of the time, we’re talking about horizontal sharding.

Figure: Horizontal sharding of a users table. The application layer routes each row with shard = user_id % 3, where user_id is the shard key.

Shard 0: user_id 3, 6, 9, ... (~33% of users)
Shard 1: user_id 1, 4, 7, ... (~33% of users)
Shard 2: user_id 2, 5, 8, ... (~33% of users)
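
To make the two kinds of split concrete, here is a minimal Python sketch. The sample rows and the helper names split_horizontally and split_vertically are made up for illustration; real shards would be separate database servers, not in-memory lists.

```python
# Six sample users with the columns mentioned above (illustrative data only).
users = [
    {"user_id": i, "name": f"user{i}", "email": f"user{i}@example.com",
     "profile_pic": f"pic{i}.png", "bio": f"bio {i}"}
    for i in range(1, 7)
]

def split_horizontally(rows, num_shards):
    """Row-based split: every shard keeps all columns but only some rows."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[row["user_id"] % num_shards].append(row)
    return shards

def split_vertically(rows, column_groups):
    """Column-based split: every partition keeps all rows but only some columns."""
    return [[{col: row[col] for col in cols} for row in rows]
            for cols in column_groups]

horizontal = split_horizontally(users, 3)
vertical = split_vertically(
    users,
    [["user_id", "name", "email"], ["user_id", "profile_pic", "bio"]],
)
```

Note that user_id appears in both vertical partitions, since it is what lets us stitch a full row back together.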

Sharding Strategies

Hash-Based Sharding

We run the shard key through a hash function: shard = hash(user_id) % number_of_shards. With a good hash function, this spreads keys roughly evenly across shards.

Pros: Even distribution, simple. Cons: Range queries become expensive (we have to query all shards). Adding a new shard changes the modulus, so most keys map to a different shard and have to be moved. This is why consistent hashing exists (next topic).
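
A minimal sketch of hash-based routing in Python follows. The shard count of 3 and the choice of SHA-256 are assumptions for illustration; any stable hash would do. Python's built-in hash() is avoided because it is randomized per process for strings, which would break routing across restarts.

```python
import hashlib

NUM_SHARDS = 3  # assumed shard count for this sketch

def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Return the shard index for a user_id using a stable hash."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then take the modulus.
    return int.from_bytes(digest[:8], "big") % num_shards

# Count how many of 30,000 sequential IDs land on each shard.
counts = [0] * NUM_SHARDS
for uid in range(1, 30001):
    counts[shard_for(uid)] += 1
print(counts)  # each shard should hold roughly a third of the keys
```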

Range-Based Sharding

We assign ranges to each shard. Users with IDs 1 to 1,000,000 go to Shard 1, IDs 1,000,001 to 2,000,000 go to Shard 2, and so on. Or we could shard by date: January data on one shard, February on another.

Pros: Range queries are efficient, because a contiguous range lives on one shard or on a few adjacent ones. Cons: Can create hotspots. With date-based ranges, every new write lands on the “current month” shard, so it gets hammered while older shards sit idle.
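
Range-based routing can be sketched with a sorted list of boundaries and a binary search. The three boundaries below are assumed values, treated as inclusive upper bounds so that every ID belongs to exactly one shard; the function names are hypothetical.

```python
import bisect

# Inclusive upper bound of each shard's ID range (assumed boundaries).
UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def shard_for_id(user_id: int) -> int:
    """Return the 1-based shard number whose range contains user_id."""
    idx = bisect.bisect_left(UPPER_BOUNDS, user_id)
    if user_id < 1 or idx == len(UPPER_BOUNDS):
        raise ValueError(f"user_id {user_id} is outside every shard range")
    return idx + 1

def shards_for_range(lo: int, hi: int) -> list:
    """Return every shard a query for IDs lo..hi (inclusive) must touch."""
    return list(range(shard_for_id(lo), shard_for_id(hi) + 1))
```

A query for IDs 5 to 10 touches only Shard 1, while a query that straddles a boundary, such as 900,000 to 1,100,000, touches Shards 1 and 2.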

Directory-Based Sharding

We maintain a lookup table that maps each key to its shard. A separate service says “user 42 lives on Shard 3.”

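Pros: Flexible. We can move data between shards by updating the lookup table, without changing the sharding logic. Cons: Every query needs an extra lookup, and the lookup service becomes a single point of failure and a bottleneck unless it is replicated and cached.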
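
As a rough sketch, the directory is just a key-to-shard map with a way to repoint a key. The ShardDirectory class below is a hypothetical in-memory stand-in; a real lookup service would be a replicated store reached over the network.

```python
class ShardDirectory:
    """In-memory stand-in for the lookup service (hypothetical)."""

    def __init__(self):
        self._key_to_shard = {}

    def assign(self, key, shard):
        """Record which shard a key lives on."""
        self._key_to_shard[key] = shard

    def lookup(self, key):
        """Return the shard for a key, or raise KeyError if it is unknown."""
        if key not in self._key_to_shard:
            raise KeyError(f"no shard assigned for key {key!r}")
        return self._key_to_shard[key]

    def move(self, key, new_shard):
        """Repoint a key once its rows have been copied to new_shard."""
        self.lookup(key)  # raises KeyError for unknown keys
        self._key_to_shard[key] = new_shard

directory = ShardDirectory()
directory.assign(42, 3)   # "user 42 lives on Shard 3"
directory.move(42, 1)     # later, user 42 is migrated to Shard 1
```

Moving a key only changes one directory entry; the routing code that calls lookup stays the same.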

Problems with Sharding

Sharding introduces real complexity. Here’s what we have to deal with:

Cross-Shard Joins

If a user is on Shard 1 and their orders are on Shard 2, no single database can run the JOIN for us. We either have to query both shards and merge the results in the application, or denormalize our data so related data lives on the same shard.
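
The "query both shards and merge in the application" option might look like the sketch below. The two shards are toy in-memory stand-ins with made-up rows, and orders_with_user_names is a hypothetical helper, not a library function.

```python
# Toy stand-ins for two shards; in practice each would be a separate database.
users_shard = {
    1: {"user_id": 1, "name": "Ada"},
    2: {"user_id": 2, "name": "Grace"},
}
orders_shard = [
    {"order_id": 100, "user_id": 1, "total": 25.0},
    {"order_id": 101, "user_id": 2, "total": 40.0},
    {"order_id": 102, "user_id": 1, "total": 15.5},
]

def orders_with_user_names(user_ids):
    """Application-side join: one query per shard, then merge in memory."""
    # Query 1: fetch the requested users from the users shard.
    users = {uid: users_shard[uid] for uid in user_ids if uid in users_shard}
    # Query 2: fetch the matching orders from the orders shard.
    orders = [o for o in orders_shard if o["user_id"] in users]
    # Merge: attach each user's name to their order rows.
    return [{**o, "name": users[o["user_id"]]["name"]} for o in orders]

rows = orders_with_user_names([1])
```

The application now does work the database used to do, and it pays for two round trips instead of one.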

Resharding

When we add or remove shards, we have to redistribute data. With hash-based sharding, hash(key) % 3 and hash(key) % 4 agree for only about a quarter of keys, so adding one shard means moving roughly three quarters of our data. This is why consistent hashing was invented: it minimizes data movement.
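
We can measure the cost of changing the modulus with a short experiment. This sketch assumes SHA-256 as the stable hash and 100,000 integer keys; both are arbitrary choices for illustration.

```python
import hashlib

def stable_hash(key) -> int:
    """Hash a key to an integer that is the same in every process."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def shard_of(key, num_shards: int) -> int:
    return stable_hash(key) % num_shards

def fraction_moved(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(1 for k in keys if shard_of(k, old_n) != shard_of(k, new_n))
    return moved / len(keys)

keys = range(100_000)
moved = fraction_moved(keys, 3, 4)
print(f"{moved:.1%} of keys change shard going from 3 to 4 shards")
```

A key stays put only when its hash gives the same remainder modulo 3 and modulo 4, which holds for 3 of every 12 consecutive hash values, so about 75% of keys move.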

Hotspots

Even with good distribution, some keys get way more traffic. A celebrity’s profile on a social network gets millions of reads while a regular user gets a few. That one shard handling the celebrity becomes a hotspot.

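Mitigation: Add a random suffix to hot keys so that several copies are spread across shards (reads pick one copy at random, and writes have to update every copy), or handle hot keys specially in the application, for example by caching them.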
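
The suffix trick can be sketched as follows for a read-heavy hot key. The shard count, the number of copies, the "#" separator, and the function names are all assumptions made for this example.

```python
import hashlib
import random

NUM_SHARDS = 4   # assumed shard count
NUM_COPIES = 8   # assumed number of suffixed copies per hot key

def shard_for(key: str) -> int:
    """Route a key to a shard with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def salted_keys(hot_key: str) -> list:
    """All suffixed copies of a hot key; a write must update each of them."""
    return [f"{hot_key}#{i}" for i in range(NUM_COPIES)]

def pick_read_key(hot_key: str, rng=random) -> str:
    """Read path: pick one copy at random so reads spread across shards."""
    return f"{hot_key}#{rng.randrange(NUM_COPIES)}"

copies = salted_keys("user:celebrity")
placement = {key: shard_for(key) for key in copies}
```

The trade-off is write amplification: each update to the hot record now costs NUM_COPIES writes, so this fits keys that are read far more often than they are written.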

Referential Integrity

Foreign key constraints don’t work across shards. We lose the database’s help in enforcing relationships and have to handle it in application code.
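
Handling it in application code usually means checking the parent row ourselves before writing the child row. Below is a minimal sketch with toy in-memory shards; MissingParentError and insert_order are hypothetical names.

```python
class MissingParentError(Exception):
    """Raised when an order references a user that does not exist."""

users_shard = {1: {"name": "Ada"}}  # stand-in for the shard holding users
orders_shard = []                   # stand-in for the shard holding orders

def insert_order(order_id: int, user_id: int, total: float) -> None:
    """Insert an order only if the referenced user exists on the users shard."""
    # The check and the insert hit two different shards, so they are not
    # atomic: a user deleted in between would still leave an orphaned order.
    if user_id not in users_shard:
        raise MissingParentError(f"user {user_id} does not exist")
    orders_shard.append(
        {"order_id": order_id, "user_id": user_id, "total": total}
    )
```

Unlike a real foreign key, this check is only best effort, so systems that rely on it often also run a periodic job to find and clean up orphaned rows.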

When Do We Actually Need Sharding?

Sharding is a last resort, not a first step. Before sharding, we should try:

  1. Vertical scaling — get a bigger machine (more RAM, faster SSD)
  2. Read replicas — offload reads to follower databases
  3. Caching — put Redis in front of the database
  4. Query optimization — add indexes, rewrite slow queries

If all of that isn’t enough and our database is still struggling, then it’s time to shard.

In simple language, sharding is like splitting a library into multiple buildings. Each building has a portion of the books. It lets us serve way more readers, but finding a book now requires knowing which building it’s in. And moving books between buildings is a headache.