We’re designing a real-time chat system like WhatsApp or Messenger. This is an advanced question because it touches on real-time communication, message ordering, delivery guarantees, group chats, and online presence — all at massive scale.
The core challenge: how do we deliver a message from one person to another in real time, reliably, and at the scale of billions of messages per day?
Step 1: Requirements
Functional Requirements
- One-on-one messaging between users
- Group chats (up to 500 members)
- Message delivery status: sent, delivered, read
- Online/offline presence indicators
- Push notifications for offline users
- Media sharing (images, videos, documents)
- Message history and persistence
Non-Functional Requirements
- Real-time — messages should arrive in under 500ms
- Reliable delivery — no messages should be lost, even if the recipient is offline
- Message ordering — messages in a conversation should appear in the correct order
- High availability — chat is a core feature, downtime is unacceptable
- Low latency — the system should feel instant
- Scale — support 2B users, 60B messages/day
Step 2: Estimation
Assumptions:
- 2B total users, 500M daily active users
- Each active user sends 40 messages/day on average
- Average message size: 100 bytes (text), 500 KB (media — 10% of messages include media)
QPS:
Messages/day = 500M × 40 = 20B messages/day
Write QPS = 20B / 86,400 ≈ ~230,000 messages/sec
Peak QPS = ~500,000 messages/sec
Storage:
Text messages: 20B × 100 bytes = 2 TB/day
Media messages: 20B × 0.10 × 500 KB = 1 PB/day (media stored in object storage)
Text per year: ~730 TB
Connections:
500M DAU = 500M concurrent WebSocket connections (at peak)
Each connection ≈ 10 KB memory on the server
500M × 10 KB = 5 TB of memory across all chat servers
That’s a lot of connections. We’ll need thousands of chat servers.
Step 3: High-Level Design
Why WebSocket?
HTTP is request-response. The client has to ask the server “got any new messages?” over and over. That’s called polling, and it’s wasteful.
WebSocket gives us a persistent, two-way connection. The server can push a message to the client the instant it arrives. No polling, no wasted requests. This is how every modern chat app works.
The flow for a 1:1 message:
- User A sends a message through their WebSocket connection to Chat Server 1
- Chat Server 1 looks up which chat server User B is connected to (using a connection registry in Redis)
- If User B is online → route the message to that chat server → push to B’s WebSocket
- If User B is offline → send a push notification via APNs/FCM
- Either way, the message gets persisted to the database
Step 4: API Design
For WebSocket messages, we use a JSON-based protocol:
// Send a message
{
"type": "send_message",
"conversation_id": "conv_abc123",
"content": "Hey, how are you?",
"content_type": "text",
"client_msg_id": "uuid-from-client"
}
// Server acknowledgment
{
"type": "message_ack",
"client_msg_id": "uuid-from-client",
"server_msg_id": "msg_789",
"status": "sent",
"timestamp": 1735689600
}
// Receive a message (pushed by server)
{
"type": "new_message",
"conversation_id": "conv_abc123",
"server_msg_id": "msg_789",
"sender_id": "user_42",
"content": "Hey, how are you?",
"content_type": "text",
"timestamp": 1735689600
}
// Delivery receipt
{
"type": "message_status",
"server_msg_id": "msg_789",
"status": "delivered" // or "read"
}
For REST endpoints (non-real-time operations):
GET /api/v1/conversations — list user's conversations
GET /api/v1/conversations/{id}/messages — message history (paginated)
POST /api/v1/conversations — create a group chat
POST /api/v1/media/upload — get pre-signed URL for media upload
Step 5: Data Model
We need a database that handles massive write throughput and scales horizontally. Cassandra is a great fit for the messages table — it’s designed for write-heavy workloads and scales linearly.
-- Users table (PostgreSQL — relational, small)
CREATE TABLE users (
user_id BIGINT PRIMARY KEY,
username VARCHAR(50) UNIQUE,
display_name VARCHAR(100),
avatar_url TEXT,
last_seen TIMESTAMP,
created_at TIMESTAMP
);
-- Conversations table (PostgreSQL)
CREATE TABLE conversations (
conversation_id BIGINT PRIMARY KEY,
type VARCHAR(10), -- 'direct' or 'group'
name VARCHAR(100), -- group name (null for direct)
created_at TIMESTAMP
);
-- Conversation participants (PostgreSQL)
CREATE TABLE participants (
conversation_id BIGINT,
user_id BIGINT,
role VARCHAR(10) DEFAULT 'member', -- 'admin', 'member'
joined_at TIMESTAMP,
PRIMARY KEY (conversation_id, user_id)
);
-- Messages table (Cassandra — write-heavy, time-series)
-- Partition key: conversation_id
-- Clustering key: message_id (TimeUUID for ordering)
CREATE TABLE messages (
conversation_id BIGINT,
message_id TIMEUUID,
sender_id BIGINT,
content TEXT,
content_type VARCHAR(10), -- 'text', 'image', 'video'
media_url TEXT,
created_at TIMESTAMP,
PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id ASC);
Why Cassandra for messages? Each conversation is a partition. Reading a conversation’s messages is a single partition read — super fast. Writes are append-only. And Cassandra scales horizontally by adding more nodes.
Connection registry (Redis):
# Which chat server is each user connected to?
user_connection:{user_id} → "chat-server-42:ws-connection-id"
TTL: 300 seconds (refreshed by heartbeat)
Step 6: Deep Dives
Deep Dive 1: Message Delivery and Ordering
Delivering messages reliably is harder than it sounds. What if User B’s phone is offline? What if they switch networks mid-conversation? What if two messages arrive out of order?
Message delivery flow:
Sender Server Receiver
│ │ │
│── Send message ────────>│ │
│<── ACK (sent) ──────────│ │
│ │── Push message ─────────>│
│ │<── ACK (delivered) ──────│
│<── Status: delivered ───│ │
│ │ (user opens chat) │
│<── Status: read ────────│<── Read receipt ─────────│
Three statuses: sent (server received it), delivered (recipient’s device got it), read (recipient opened the conversation).
Ordering: Messages within a single conversation must appear in order. We use a per-conversation sequence number for this. Every message in a conversation gets an incrementing sequence number assigned by the server.
Conversation conv_abc123:
msg_1: seq=1, "Hey" (from A, 1:00:00)
msg_2: seq=2, "What's up?" (from A, 1:00:01)
msg_3: seq=3, "Not much, you?" (from B, 1:00:05)
The client uses these sequence numbers to display messages in order. If a message arrives with seq=5 but the client only has up to seq=3, it knows seq=4 is missing and requests it from the server.
Handling offline users: When User B is offline, the server stores the message in the database and sends a push notification. When User B comes back online and reconnects via WebSocket, the server sends all undelivered messages (everything since B’s last received sequence number).
Deep Dive 2: Group Chat Fan-Out
In a 1:1 chat, we deliver the message to one person. In a group chat with 200 people, we have to deliver it to 199 people. How?
Option A: Fan-out on write (push model)
When someone sends a message to a group, the server immediately pushes it to every online member’s WebSocket connection and queues push notifications for offline members.
Alice sends "Hello everyone" to group (200 members)
→ Server looks up all 200 participants
→ For each online member: push via WebSocket
→ For each offline member: send push notification
→ Write message once to the messages table
Pros: Receivers get the message instantly. Reading is simple — just fetch from the connection. Cons: Sending a message is expensive for large groups. If 200 people are online, that’s 200 WebSocket pushes per message.
Option B: Fan-out on read (pull model)
The server just stores the message. When each member opens the group chat, they pull new messages.
Pros: Sending is fast — just one write. Cons: Opening a group chat requires a DB read. For active groups, this adds latency.
Our approach: Push for small/medium groups (up to ~500 members). WhatsApp caps groups at 1024 members for exactly this reason. For very large groups (like Slack channels with 10K+ people), we’d switch to a pull model with smart caching.
The message queue (Kafka) helps here. When a message is sent to a group, it’s published to a topic partitioned by conversation. Consumer workers read from the topic and handle the fan-out — looking up members, pushing to WebSockets, and sending push notifications.
Deep Dive 3: Online Presence and Typing Indicators
How does the app show that someone is “online” or “typing…”?
Online/offline status:
Every connected client sends a heartbeat to the server every 30 seconds. The server updates the user’s status in Redis:
presence:{user_id} → { status: "online", last_seen: 1735689600 }
TTL: 60 seconds (auto-expires if no heartbeat)
When the TTL expires without a heartbeat, the user is considered offline. We update last_seen and publish an “offline” event to their contacts.
But we don’t want to notify every contact every time someone goes online/offline. That’s too expensive at scale. Instead:
- We only send presence updates to people who have the chat open (actively viewing the conversation list or the specific chat)
- Presence is best-effort — it’s okay if it’s a few seconds stale
Typing indicators:
When a user starts typing, the client sends a typing_start event. The server forwards it to the other participants in the conversation. After 3 seconds of no keystrokes, the client sends typing_stop.
{ "type": "typing", "conversation_id": "conv_abc", "user_id": "user_42", "status": "typing" }
Typing indicators are fire-and-forget. If one gets lost, nothing bad happens — the “typing…” label just disappears sooner. No persistence needed.
Step 7: Scaling
Chat server scaling:
- Each chat server handles ~100K concurrent WebSocket connections
- 500M concurrent users → ~5,000 chat servers
- Use consistent hashing to assign users to chat servers
- The connection registry in Redis lets any chat server route a message to the right destination
Database scaling:
- Messages in Cassandra — partition by conversation_id for locality
- User data in PostgreSQL — read replicas for lookups
- Hot conversations (active group chats) can be cached in Redis
Message queue (Kafka) scaling:
- Partition by conversation_id — all messages for a conversation go to the same partition (preserves ordering)
- Add more partitions and consumers as throughput grows
Media handling:
- Media files go to object storage (S3) via pre-signed URLs
- Thumbnails generated asynchronously by a media processing service
- CDN in front of object storage for fast delivery
Multi-region deployment:
- Chat servers in multiple regions for low latency
- Messages replicated across regions asynchronously
- User connects to the nearest region (GeoDNS)
- Cross-region messages: Server in Region A publishes to Kafka → replicated to Region B → delivered locally
Push notifications at scale:
- Batch push notifications to APNs/FCM (they support batching)
- Separate push notification workers so they don’t block message delivery
- Rate limit notifications per user (nobody wants 100 notifications from an active group)
In simple language, a chat system is a network of WebSocket connections with a message queue in the middle. The client connects to a chat server. The server knows where every other user is connected (Redis registry). Messages flow through Kafka for reliability and ordering. Offline messages get queued and delivered on reconnect. The hardest parts are keeping messages in order, handling group fan-out efficiently, and managing millions of persistent connections. But the core pattern is surprisingly straightforward: accept message, find recipient, deliver.