Should I embed or reference in MongoDB?

Embed when: the data is always accessed together, the embedded data is small and bounded, and the relationship is 'owned by' (user → address). Reference when: data is large, accessed independently, or shared across many documents.

Can MongoDB do joins?

Yes, with $lookup in the aggregation pipeline. But unlike SQL joins, $lookup requires a round-trip to a second collection — it's slower. Good schema design minimizes the need for $lookup.

What is the 16MB document limit and does it matter?

MongoDB documents are capped at 16MB. In practice, you hit access pattern problems before the size limit. If a document approaches that size, you need the Outlier or Bucket pattern.

How does MongoDB indexing differ from SQL?

MongoDB indexing concepts are similar — B-tree indexes, compound indexes, covered queries. The difference is that you can index fields inside embedded arrays and nested documents, which SQL can't do.

MongoDB Schema Design Guide | Embed vs Reference, Patterns, and Best Practices

2026년 4월 7일 · 24분 읽기 · 수정 2026년 4월 7일 intermediate tutorial

이 글의 핵심

MongoDB schema design is fundamentally different from SQL — you model around your queries, not around normalization rules. This guide covers the embed-vs-reference decision, relationship patterns, and real-world schema patterns.

Why Schema Design Is Different in MongoDB

In SQL, you normalize to eliminate redundancy. In MongoDB, you denormalize for query performance — store related data together so a single document read answers the query.

The core question for every relationship: How will this data be accessed?

Real-world insight: A team migrated a blog from SQL to MongoDB and naively mapped each SQL table to a collection. Performance was worse than SQL. After redesigning around read patterns (embed comments with posts), queries went from 3 joins to 1 document read.

1. Embedding vs. Referencing

Embed when:

Data is always read together
Embedded data is small and has a bounded size
Relationship is “owned by” (user owns their addresses)
Data isn’t shared across documents

// EMBEDDED: user with addresses
{
  _id: ObjectId("..."),
  name: "Alice",
  email: "[email protected]",
  addresses: [
    { type: "home", street: "123 Main St", city: "Seoul", zip: "04524" },
    { type: "work", street: "456 Tower Ave", city: "Seoul", zip: "06236" }
  ]
}

Single db.users.findOne({ _id: userId }) returns everything.

Reference when:

Data is large or grows without bound
Data is accessed independently
Data is shared across many documents
You need to update one piece of data in many places

// REFERENCED: post with author
// posts collection
{
  _id: ObjectId("post-1"),
  title: "MongoDB Schema Design",
  authorId: ObjectId("user-1"),  // reference
  tags: ["mongodb", "database"]
}

// users collection
{
  _id: ObjectId("user-1"),
  name: "Alice",
  email: "[email protected]"
}

To get post + author: $lookup or two separate queries.

2. Relationship Patterns

One-to-One

Always embed unless the embedded document is large or rarely accessed together.

// Embed profile inside user (always needed together)
{
  _id: ObjectId("..."),
  email: "[email protected]",
  profile: {
    bio: "Software engineer",
    avatar: "https://...",
    location: "Seoul"
  }
}

// Reference if profile is large or separately managed
{
  _id: ObjectId("user-1"),
  email: "[email protected]",
  profileId: ObjectId("profile-1")  // separate collection
}

One-to-Few (embed)

When the “many” side is small and bounded (e.g., ≤ 10 items), embed.

// Blog post with tags (bounded, small)
{
  _id: ObjectId("..."),
  title: "My Post",
  tags: ["tech", "mongodb", "backend"]
}

// Order with line items (bounded, always needed together)
{
  _id: ObjectId("..."),
  userId: ObjectId("..."),
  total: 129.99,
  lineItems: [
    { productId: ObjectId("..."), name: "Widget", qty: 2, price: 49.99 },
    { productId: ObjectId("..."), name: "Gadget", qty: 1, price: 30.01 }
  ]
}

One-to-Many (reference on the “many” side)

When the “many” side can grow large (comments, orders, logs), put the reference on the child.

// posts collection
{ _id: ObjectId("post-1"), title: "My Post" }

// comments collection — reference points to parent
{ _id: ObjectId("..."), postId: ObjectId("post-1"), text: "Great post!", userId: ObjectId("...") }
{ _id: ObjectId("..."), postId: ObjectId("post-1"), text: "Thanks!", userId: ObjectId("...") }

// Query: get all comments for a post
db.comments.find({ postId: ObjectId("post-1") }).sort({ createdAt: -1 })

One-to-Many with partial embed (hybrid)

Store a subset of frequently accessed fields in the parent to avoid joins for common queries:

// Store 3 most recent comments in the post (for display)
// but comments still have their own collection for full access
{
  _id: ObjectId("post-1"),
  title: "My Post",
  commentCount: 247,
  recentComments: [
    { userId: ObjectId("..."), username: "Bob", text: "Great!", createdAt: ISODate("...") },
    { userId: ObjectId("..."), username: "Carol", text: "Thanks!", createdAt: ISODate("...") }
  ]
}

Many-to-Many

Use arrays of references on one or both sides depending on access patterns.

// Students enrolled in courses
// students collection
{
  _id: ObjectId("student-1"),
  name: "Alice",
  enrolledCourseIds: [ObjectId("course-1"), ObjectId("course-2")]
}

// courses collection
{
  _id: ObjectId("course-1"),
  name: "MongoDB Fundamentals",
  // Don't duplicate student array here unless needed
}

// Query students in a course
db.students.find({ enrolledCourseIds: ObjectId("course-1") })

// Query courses for a student
db.courses.find({ _id: { $in: student.enrolledCourseIds } })

3. Schema Patterns

Bucket Pattern

Group time-series data into buckets to reduce document count and enable efficient range queries.

Problem: Storing one document per sensor reading creates millions of tiny documents.

// NAIVE: one doc per reading (bad)
{ sensorId: "s1", timestamp: ISODate("2026-04-16T10:00:00"), value: 22.5 }
{ sensorId: "s1", timestamp: ISODate("2026-04-16T10:01:00"), value: 22.7 }
// ... 1440 documents per sensor per day

// BUCKET: one doc per hour
{
  sensorId: "s1",
  date: ISODate("2026-04-16"),
  hour: 10,
  readings: [
    { minute: 0, value: 22.5 },
    { minute: 1, value: 22.7 },
    // ... up to 60 readings
  ],
  count: 60,
  avgValue: 22.6,
  minValue: 22.1,
  maxValue: 23.0
}

Benefits: 60× fewer documents, pre-computed aggregates, efficient range scans.

Outlier Pattern

Handle the rare case where a document would grow unbounded by splitting overflow to a separate collection.

// Product with reviews — most products have <50 reviews
{
  _id: ObjectId("product-1"),
  name: "Popular Gadget",
  reviews: [ /* first 50 reviews embedded */ ],
  reviewCount: 15000,
  hasExtraReviews: true   // flag signals overflow
}

// Overflow collection for products with many reviews
// product_reviews_overflow collection
{
  productId: ObjectId("product-1"),
  page: 2,
  reviews: [ /* reviews 51-150 */ ]
}

Application checks hasExtraReviews and queries overflow collection only when needed.

Computed Pattern

Pre-compute expensive aggregations and store the result to avoid recalculating on every read.

// Instead of aggregating every request...
db.orders.aggregate([
  { $match: { customerId: ObjectId("...") } },
  { $group: { _id: null, total: { $sum: "$amount" }, count: { $sum: 1 } } }
])

// Store computed stats on the customer document
{
  _id: ObjectId("customer-1"),
  name: "Alice",
  stats: {
    totalOrders: 47,
    totalSpent: 3847.50,
    avgOrderValue: 81.86,
    lastOrderDate: ISODate("2026-04-10")
  }
}

// Update stats when a new order is placed
db.customers.updateOne(
  { _id: customerId },
  {
    $inc: { "stats.totalOrders": 1, "stats.totalSpent": orderAmount },
    $set: { "stats.lastOrderDate": new Date() }
  }
)

Schema Versioning Pattern

Add a schemaVersion field to handle schema evolution without downtime migrations.

// Version 1 (old documents)
{ _id: ..., schemaVersion: 1, name: "Alice Smith" }

// Version 2 (new documents)
{ _id: ..., schemaVersion: 2, firstName: "Alice", lastName: "Smith" }

// Application handles both
function getUserName(user) {
  if (user.schemaVersion >= 2) {
    return `${user.firstName} ${user.lastName}`;
  }
  return user.name;  // fallback for v1
}

// Lazy migration: upgrade on read
async function getUser(id) {
  const user = await db.users.findOne({ _id: id });
  if (user.schemaVersion < 2) {
    const [firstName, ...rest] = user.name.split(' ');
    const updated = { ...user, schemaVersion: 2, firstName, lastName: rest.join(' ') };
    await db.users.replaceOne({ _id: id }, updated);
    return updated;
  }
  return user;
}

4. Indexing for Schema Patterns

Compound indexes for embedded fields

// Schema
{
  userId: ObjectId("..."),
  "address.city": "Seoul",
  "address.zip": "04524"
}

// Index on embedded fields
db.users.createIndex({ "address.city": 1, "address.zip": 1 })

Multikey indexes for arrays

// Automatically handles arrays — one index entry per array element
db.posts.createIndex({ tags: 1 })

// Query any element in array
db.posts.find({ tags: "mongodb" })

Text indexes for search

db.products.createIndex({ name: "text", description: 'text. MongoDB Schema Design Guide에 대한 완전한 가이드입니다. 실전 예제와 함께 핵심 개념부터 고급 활용까지 다룹니다.' })

db.products.find({ $text: { $search: "wireless headphones" } })

Partial indexes — only index relevant documents

// Only index active users (saves space and write overhead)
db.users.createIndex(
  { email: 1 },
  { partialFilterExpression: { isActive: { $eq: true } } }
)

// Only index unpaid orders
db.orders.createIndex(
  { createdAt: 1 },
  { partialFilterExpression: { status: { $in: ["pending", "processing"] } } }
)

TTL index — auto-expire documents

// Sessions expire after 24 hours
db.sessions.createIndex({ createdAt: 1 }, { expireAfterSeconds: 86400 })

// Logs expire at a specific field value
db.logs.createIndex({ expireAt: 1 }, { expireAfterSeconds: 0 })
// Set expireAt: new Date(Date.now() + 7 * 24 * 60 * 60 * 1000) on insert

5. Aggregation Pipeline

// Sales report: total revenue by category, last 30 days
db.orders.aggregate([
  // Stage 1: filter recent orders
  { $match: {
    status: "completed",
    createdAt: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) }
  }},

  // Stage 2: unwind line items (array → multiple docs)
  { $unwind: "$lineItems" },

  // Stage 3: group by category
  { $group: {
    _id: "$lineItems.category",
    revenue: { $sum: { $multiply: ["$lineItems.price", "$lineItems.qty"] } },
    orderCount: { $sum: 1 }
  }},

  // Stage 4: sort and limit
  { $sort: { revenue: -1 } },
  { $limit: 10 },

  // Stage 5: rename _id
  { $project: { category: "$_id", revenue: 1, orderCount: 1, _id: 0 } }
])

$lookup (join)

db.posts.aggregate([
  { $match: { status: "published" } },
  { $lookup: {
    from: "users",
    localField: "authorId",
    foreignField: "_id",
    as: "author"
  }},
  { $unwind: "$author" },
  { $project: {
    title: 1,
    "author.name": 1,
    "author.avatar": 1
  }}
])

Use $lookup sparingly — it’s slower than embedded data. If you use it for every read, reconsider your schema.

6. Common Mistakes

Mistake 1: Embedding unbounded arrays

// AVOID: users array grows indefinitely
{
  _id: ObjectId("post-1"),
  title: "Popular Post",
  likes: [
    ObjectId("user-1"),
    ObjectId("user-2"),
    // ... could be millions
  ]
}

// FIX: separate likes collection or store count only
{
  _id: ObjectId("post-1"),
  title: "Popular Post",
  likeCount: 15420
}
// Separate: db.likes.insertOne({ postId, userId, createdAt })

Mistake 2: Treating MongoDB like SQL (too many collections, no embedding)

// OVER-NORMALIZED (bad for MongoDB)
// users collection, profiles collection, addresses collection...
// Requires $lookup for every user read

// BETTER: embed what's always needed together
{
  _id: ObjectId("..."),
  email: "[email protected]",
  profile: { bio: "...", avatar: "..." },
  addresses: [{ type: "home", ... }]
}

Mistake 3: Missing indexes on foreign key fields

// If you store parentId references, always index them
db.comments.createIndex({ postId: 1, createdAt: -1 })
db.orders.createIndex({ userId: 1 })

Decision Guide

Is the relationship "owns" (user → address)?
  → Embed

Will the embedded array grow without bound?
  → Reference (or Bucket/Outlier pattern)

Is the embedded data accessed independently without the parent?
  → Reference

Is the same data referenced from many places?
  → Reference (avoids update anomalies)

Is this time-series data with high write volume?
  → Bucket pattern

Do you compute expensive aggregations on every read?
  → Computed pattern

Are you evolving the schema without a full migration?
  → Schema Versioning pattern

Key Takeaways

Concept	Rule
Embed	Data accessed together, bounded size, “owned by”
Reference	Data accessed independently, unbounded growth, shared
Hybrid	Embed subset (e.g., 3 recent items) + reference full list
Bucket	Time-series or high-volume writes → group into buckets
Computed	Pre-aggregate totals/stats to avoid per-request aggregation
Indexes	Index every field in `find()`, `sort()`, and `$lookup`
$lookup	Use sparingly — frequent joins signal a schema design issue

MongoDB rewards thinking about queries first and schema second. Ask “what queries will this app run 1000 times per second?” — then design the schema to answer those queries with a single document read.