Document Stores

A document store is a database that stores data as documents — think JSON objects. Instead of rigid rows and columns, each document can have its own structure. No two documents in the same collection need to look alike.

MongoDB is the most popular document store, so we’ll use it as our primary example. Others include CouchDB, Amazon DocumentDB, and Firebase Firestore.

Terminology: SQL vs Document Store

SQL (PostgreSQL)     Document (MongoDB)
Database             Database
Table                Collection
Row                  Document
Column               Field
JOIN                 $lookup (or embedding)
Schema (enforced)    Schema (optional, flexible)

What a Document Looks Like

Documents are stored as BSON (Binary JSON) in MongoDB. They can contain nested objects and arrays, up to 100 levels deep and 16 MB per document. The _id below is an ObjectId, written in Extended JSON form:

{
  "_id": { "$oid": "65f1a2b3c4d5e6f7a8b9c0d1" },
  "name": "Manish Prajapati",
  "email": "manish@example.com",
  "age": 25,
  "address": {
    "city": "Mumbai",
    "state": "Maharashtra",
    "pin": "400001"
  },
  "skills": ["JavaScript", "Python", "PostgreSQL"],
  "experience": [
    {
      "company": "Acme Corp",
      "role": "Backend Developer",
      "years": 2
    },
    {
      "company": "Startup Inc",
      "role": "Full Stack",
      "years": 1
    }
  ]
}

Notice how the address and experience are embedded directly inside the user document. In a relational database, those would be separate tables with foreign keys.
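
To make the contrast concrete, here is a minimal sketch in plain JavaScript (no database required) of what the relational layout implies: the same user spread over three tables and stitched back together by foreign key. The table arrays, the user_id column, and the assembleUser and stripKey helpers are all hypothetical names chosen for illustration.

```javascript
// Hypothetical relational layout: three "tables" modeled as arrays of rows.
const users = [{ id: 1, name: "Manish Prajapati", email: "manish@example.com" }];
const addresses = [{ user_id: 1, city: "Mumbai", state: "Maharashtra", pin: "400001" }];
const experience = [
  { user_id: 1, company: "Acme Corp", role: "Backend Developer", years: 2 },
  { user_id: 1, company: "Startup Inc", role: "Full Stack", years: 1 },
];

// Drop the foreign-key column once a row has been attached to its parent.
function stripKey({ user_id, ...rest }) {
  return rest;
}

// Rebuild the nested shape by following foreign keys (the work a JOIN does in SQL).
function assembleUser(userId) {
  const user = users.find((u) => u.id === userId);
  if (!user) return null;
  const address = addresses.find((a) => a.user_id === userId);
  return {
    ...user,
    address: address ? stripKey(address) : null,
    experience: experience.filter((e) => e.user_id === userId).map(stripKey),
  };
}

const doc = assembleUser(1);
console.log(doc.address.city);      // nested field, read directly
console.log(doc.experience.length); // embedded array
```

A document store skips the assembly step: the object returned by assembleUser is, in effect, what is stored and read back in one piece.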

Embedding vs Referencing

This is the biggest design decision in document databases.

Embedding (denormalized) — store related data directly inside the document:

// Order with embedded items — one read gets everything
{
  "_id": "order_123",
  "customer": "Manish",
  "items": [
    { "product": "Keyboard", "price": 2500, "qty": 1 },
    { "product": "Mouse", "price": 800, "qty": 2 }
  ],
  "total": 4100
}

Referencing (normalized) — store just the ID and look up the related data separately:

// Order with references — need a second query to get items
{
  "_id": "order_123",
  "customer_id": "user_456",
  "item_ids": ["item_789", "item_790"],
  "total": 4100
}

When to embed:

  • Data is always accessed together (order + items)
  • The embedded data doesn’t change often
  • The embedded data won’t grow unboundedly
  • One-to-few relationships

When to reference:

  • Data is shared across documents (a product referenced by many orders)
  • The related data changes frequently
  • The related data could grow very large
  • Many-to-many relationships
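
The trade-off behind these two lists can be sketched in plain JavaScript with in-memory data. This is an illustration only, not MongoDB code: the products map, the two order arrays, and the resolveItems helper are hypothetical. It shows that a referenced read needs an extra lookup, while changing shared data in an embedded design means rewriting every copy.

```javascript
// Shared product catalog used by the referenced design.
const products = new Map([
  ["item_789", { name: "Keyboard", price: 2500 }],
  ["item_790", { name: "Mouse", price: 800 }],
]);

// Referenced orders store only product IDs.
const referencedOrders = [
  { _id: "order_123", item_ids: ["item_789", "item_790"] },
  { _id: "order_124", item_ids: ["item_789"] },
];

// Embedded orders each carry their own copy of the product data.
const embeddedOrders = [
  { _id: "order_123", items: [{ product: "Keyboard", price: 2500 }, { product: "Mouse", price: 800 }] },
  { _id: "order_124", items: [{ product: "Keyboard", price: 2500 }] },
];

// Reading a referenced order costs one extra lookup per item.
function resolveItems(order) {
  return order.item_ids.map((id) => products.get(id));
}

// Renaming a product is a single write when it is referenced...
products.get("item_789").name = "Mechanical Keyboard";

// ...but every embedded copy has to be found and rewritten.
let copiesRewritten = 0;
for (const order of embeddedOrders) {
  for (const item of order.items) {
    if (item.product === "Keyboard") {
      item.product = "Mechanical Keyboard";
      copiesRewritten += 1;
    }
  }
}

console.log(resolveItems(referencedOrders[1])[0].name);
console.log(copiesRewritten);
```

For order history, a frozen copy of the product as it was at purchase time is often exactly what is wanted, which is one reason order items are a common case for embedding.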

Basic MongoDB Queries

// Insert a document
db.users.insertOne({
  name: "Manish",
  email: "manish@example.com",
  age: 25,
  skills: ["JavaScript", "Python"]
})

// Find documents
db.users.find({ age: { $gte: 25 } })              // age >= 25
db.users.find({ skills: "Python" })                 // array contains "Python"
db.users.find({ "address.city": "Mumbai" })         // nested field

// Update
db.users.updateOne(
  { email: "manish@example.com" },                  // filter
  { $set: { age: 26 }, $push: { skills: "Go" } }   // update
)

// Delete
db.users.deleteOne({ email: "manish@example.com" })
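
To see what the three find() filters above are asking for, here is a deliberately simplified model of filter matching in plain JavaScript. It is a sketch under stated assumptions, not how MongoDB implements queries: it handles only equality, array membership, dotted paths, and $gte, and the getPath, matches, and find helpers and the sample data are hypothetical.

```javascript
// Read a possibly dotted path such as "address.city" from a document.
function getPath(doc, path) {
  return path
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), doc);
}

// Simplified filter matching: equality, array membership, and $gte only.
function matches(doc, filter) {
  return Object.entries(filter).every(([path, condition]) => {
    const value = getPath(doc, path);
    if (condition !== null && typeof condition === "object" && "$gte" in condition) {
      return value >= condition.$gte;
    }
    // A scalar condition matches an array field if any element equals it.
    return Array.isArray(value) ? value.includes(condition) : value === condition;
  });
}

// Hypothetical sample data standing in for the users collection.
const sampleUsers = [
  { name: "Manish", age: 25, skills: ["JavaScript", "Python"], address: { city: "Mumbai" } },
  { name: "Asha", age: 22, skills: ["Go"], address: { city: "Pune" } },
];

// Return the names of all documents that satisfy the filter.
const find = (filter) =>
  sampleUsers.filter((doc) => matches(doc, filter)).map((doc) => doc.name);

console.log(find({ age: { $gte: 25 } }));
console.log(find({ skills: "Python" }));
console.log(find({ "address.city": "Mumbai" }));
```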

Aggregation Pipeline

MongoDB’s aggregation pipeline is like SQL’s GROUP BY on steroids. We chain stages together, and each stage receives the documents produced by the stage before it, so the order of stages matters:

// Find the average order total per customer, sorted by highest
db.orders.aggregate([
  { $match: { status: "completed" } },           // WHERE status = 'completed'
  { $group: {
      _id: "$customer_id",                        // GROUP BY customer_id
      avgTotal: { $avg: "$total" },               // AVG(total)
      orderCount: { $sum: 1 }                     // COUNT(*)
  }},
  { $sort: { avgTotal: -1 } },                   // ORDER BY avgTotal DESC
  { $limit: 10 }                                  // LIMIT 10
])
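
As a rough mental model, the same pipeline can be written with ordinary array operations in plain JavaScript. This sketch assumes a small hypothetical orders array and only mirrors the logic of the stages; it says nothing about how MongoDB executes them.

```javascript
// Hypothetical sample data standing in for the orders collection.
const orders = [
  { customer_id: "user_1", status: "completed", total: 4100 },
  { customer_id: "user_1", status: "completed", total: 900 },
  { customer_id: "user_2", status: "completed", total: 3000 },
  { customer_id: "user_2", status: "pending", total: 9999 },
];

// $match: keep completed orders only.
const completed = orders.filter((o) => o.status === "completed");

// $group: accumulate a running sum and count per customer_id.
const groups = new Map();
for (const { customer_id, total } of completed) {
  const g = groups.get(customer_id) ?? { sum: 0, orderCount: 0 };
  g.sum += total;
  g.orderCount += 1;
  groups.set(customer_id, g);
}

// Finish $avg, then apply $sort (descending) and $limit (10).
const result = [...groups]
  .map(([_id, g]) => ({ _id, avgTotal: g.sum / g.orderCount, orderCount: g.orderCount }))
  .sort((a, b) => b.avgTotal - a.avgTotal)
  .slice(0, 10);

console.log(result);
```

Each step consumes the output of the previous one, which is the same shape the pipeline stages have.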

When Documents Beat Tables

Document stores shine when:

  • Schema varies across records — e-commerce products where a laptop has different fields than a t-shirt
  • Data is hierarchical — nested objects and arrays are natural
  • Rapid prototyping — no migrations needed, just add fields
  • Read-heavy with co-located data — one read gets everything (no JOINs)
  • Horizontal scaling — sharding is built into MongoDB
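
The first point, a schema that varies across records, can be sketched with two hypothetical product documents in plain JavaScript. The field names (sku, ramGb, size, and so on) are invented for illustration.

```javascript
// Two products in one hypothetical collection, each with its own fields.
const catalog = [
  { sku: "LAP-1", type: "laptop", price: 65000, ramGb: 16, cpu: "8-core" },
  { sku: "TEE-1", type: "tshirt", price: 499, size: "M", color: "navy" },
];

// Fields shared by every product can be queried uniformly...
const affordable = catalog.filter((p) => p.price < 1000).map((p) => p.sku);

// ...while type-specific fields are simply absent on other products.
const withRam = catalog.filter((p) => p.ramGb !== undefined).map((p) => p.sku);

console.log(affordable, withRam);
```

In a relational table the same catalog would need either many nullable columns or one table per product type.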

Document stores struggle when:

  • We need complex JOINs across many collections
  • We need strict schema enforcement (though MongoDB has supported $jsonSchema validation rules on collections since v3.6)
  • We need multi-document ACID transactions everywhere (MongoDB has supported them on replica sets since v4.0 and on sharded clusters since v4.2, but they’re slower than single-document operations, which are always atomic)
  • Data is highly relational with many-to-many relationships
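
The schema validation mentioned in the list above might look like the following sketch. The $jsonSchema operator and the validator option of db.createCollection are real MongoDB features; the specific rules for a users collection are assumptions made up for this example.

```javascript
// Validator document built on MongoDB's $jsonSchema operator.
// The required fields and constraints below are illustrative assumptions.
const userValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["name", "email"],
    properties: {
      name: { bsonType: "string" },
      email: { bsonType: "string", pattern: "^.+@.+$" },
      // "number" accepts int, long, double, and decimal values.
      age: { bsonType: "number", minimum: 0 },
    },
  },
};

// In mongosh, where `db` is defined, it would be attached like this:
// db.createCollection("users", { validator: userValidator })
```

Fields not listed under properties are still allowed here, so the collection stays flexible while the listed fields are checked on insert and update.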

Put simply, document stores let us store data the way our application actually uses it. Instead of splitting everything into flat tables and JOINing them back together, we store complete objects. The trade-off is that queries spanning several collections become harder, while the queries we run most often, which read one complete object, become simpler and usually faster.