MongoDB Schema Design Guide | Embed vs Reference, Patterns, and Best Practices
이 글의 핵심
MongoDB schema design is fundamentally different from SQL — you model around your queries, not around normalization rules. This guide covers the embed-vs-reference decision, relationship patterns, and real-world schema patterns.
Why Schema Design Is Different in MongoDB
In SQL, you normalize to eliminate redundancy. In MongoDB, you denormalize for query performance — store related data together so a single document read answers the query.
The core question for every relationship: How will this data be accessed?
Real-world insight: A team migrated a blog from SQL to MongoDB and naively mapped each SQL table to a collection. Performance was worse than SQL. After redesigning around read patterns (embed comments with posts), queries went from 3 joins to 1 document read.
1. Embedding vs. Referencing
Embed when:
- Data is always read together
- Embedded data is small and has a bounded size
- Relationship is “owned by” (user owns their addresses)
- Data isn’t shared across documents
// EMBEDDED: user with addresses
{
_id: ObjectId("..."),
name: "Alice",
email: "[email protected]",
addresses: [
{ type: "home", street: "123 Main St", city: "Seoul", zip: "04524" },
{ type: "work", street: "456 Tower Ave", city: "Seoul", zip: "06236" }
]
}
Single db.users.findOne({ _id: userId }) returns everything.
Reference when:
- Data is large or grows without bound
- Data is accessed independently
- Data is shared across many documents
- You need to update one piece of data in many places
// REFERENCED: post with author
// posts collection
{
_id: ObjectId("post-1"),
title: "MongoDB Schema Design",
authorId: ObjectId("user-1"), // reference
tags: ["mongodb", "database"]
}
// users collection
{
_id: ObjectId("user-1"),
name: "Alice",
email: "[email protected]"
}
To get post + author: $lookup or two separate queries.
2. Relationship Patterns
One-to-One
Always embed unless the embedded document is large or rarely accessed together.
// Embed profile inside user (always needed together)
{
_id: ObjectId("..."),
email: "[email protected]",
profile: {
bio: "Software engineer",
avatar: "https://...",
location: "Seoul"
}
}
// Reference if profile is large or separately managed
{
_id: ObjectId("user-1"),
email: "[email protected]",
profileId: ObjectId("profile-1") // separate collection
}
One-to-Few (embed)
When the “many” side is small and bounded (e.g., ≤ 10 items), embed.
// Blog post with tags (bounded, small)
{
_id: ObjectId("..."),
title: "My Post",
tags: ["tech", "mongodb", "backend"]
}
// Order with line items (bounded, always needed together)
{
_id: ObjectId("..."),
userId: ObjectId("..."),
total: 129.99,
lineItems: [
{ productId: ObjectId("..."), name: "Widget", qty: 2, price: 49.99 },
{ productId: ObjectId("..."), name: "Gadget", qty: 1, price: 30.01 }
]
}
One-to-Many (reference on the “many” side)
When the “many” side can grow large (comments, orders, logs), put the reference on the child.
// posts collection
{ _id: ObjectId("post-1"), title: "My Post" }
// comments collection — reference points to parent
{ _id: ObjectId("..."), postId: ObjectId("post-1"), text: "Great post!", userId: ObjectId("...") }
{ _id: ObjectId("..."), postId: ObjectId("post-1"), text: "Thanks!", userId: ObjectId("...") }
// Query: get all comments for a post
db.comments.find({ postId: ObjectId("post-1") }).sort({ createdAt: -1 })
One-to-Many with partial embed (hybrid)
Store a subset of frequently accessed fields in the parent to avoid joins for common queries:
// Store 3 most recent comments in the post (for display)
// but comments still have their own collection for full access
{
_id: ObjectId("post-1"),
title: "My Post",
commentCount: 247,
recentComments: [
{ userId: ObjectId("..."), username: "Bob", text: "Great!", createdAt: ISODate("...") },
{ userId: ObjectId("..."), username: "Carol", text: "Thanks!", createdAt: ISODate("...") }
]
}
Many-to-Many
Use arrays of references on one or both sides depending on access patterns.
// Students enrolled in courses
// students collection
{
_id: ObjectId("student-1"),
name: "Alice",
enrolledCourseIds: [ObjectId("course-1"), ObjectId("course-2")]
}
// courses collection
{
_id: ObjectId("course-1"),
name: "MongoDB Fundamentals",
// Don't duplicate student array here unless needed
}
// Query students in a course
db.students.find({ enrolledCourseIds: ObjectId("course-1") })
// Query courses for a student
db.courses.find({ _id: { $in: student.enrolledCourseIds } })
3. Schema Patterns
Bucket Pattern
Group time-series data into buckets to reduce document count and enable efficient range queries.
Problem: Storing one document per sensor reading creates millions of tiny documents.
// NAIVE: one doc per reading (bad)
{ sensorId: "s1", timestamp: ISODate("2026-04-16T10:00:00"), value: 22.5 }
{ sensorId: "s1", timestamp: ISODate("2026-04-16T10:01:00"), value: 22.7 }
// ... 1440 documents per sensor per day
// BUCKET: one doc per hour
{
sensorId: "s1",
date: ISODate("2026-04-16"),
hour: 10,
readings: [
{ minute: 0, value: 22.5 },
{ minute: 1, value: 22.7 },
// ... up to 60 readings
],
count: 60,
avgValue: 22.6,
minValue: 22.1,
maxValue: 23.0
}
Benefits: 60× fewer documents, pre-computed aggregates, efficient range scans.
Outlier Pattern
Handle the rare case where a document would grow unbounded by splitting overflow to a separate collection.
// Product with reviews — most products have <50 reviews
{
_id: ObjectId("product-1"),
name: "Popular Gadget",
reviews: [ /* first 50 reviews embedded */ ],
reviewCount: 15000,
hasExtraReviews: true // flag signals overflow
}
// Overflow collection for products with many reviews
// product_reviews_overflow collection
{
productId: ObjectId("product-1"),
page: 2,
reviews: [ /* reviews 51-150 */ ]
}
Application checks hasExtraReviews and queries overflow collection only when needed.
Computed Pattern
Pre-compute expensive aggregations and store the result to avoid recalculating on every read.
// Instead of aggregating every request...
db.orders.aggregate([
{ $match: { customerId: ObjectId("...") } },
{ $group: { _id: null, total: { $sum: "$amount" }, count: { $sum: 1 } } }
])
// Store computed stats on the customer document
{
_id: ObjectId("customer-1"),
name: "Alice",
stats: {
totalOrders: 47,
totalSpent: 3847.50,
avgOrderValue: 81.86,
lastOrderDate: ISODate("2026-04-10")
}
}
// Update stats when a new order is placed
db.customers.updateOne(
{ _id: customerId },
{
$inc: { "stats.totalOrders": 1, "stats.totalSpent": orderAmount },
$set: { "stats.lastOrderDate": new Date() }
}
)
Schema Versioning Pattern
Add a schemaVersion field to handle schema evolution without downtime migrations.
// Version 1 (old documents)
{ _id: ..., schemaVersion: 1, name: "Alice Smith" }
// Version 2 (new documents)
{ _id: ..., schemaVersion: 2, firstName: "Alice", lastName: "Smith" }
// Application handles both
function getUserName(user) {
if (user.schemaVersion >= 2) {
return `${user.firstName} ${user.lastName}`;
}
return user.name; // fallback for v1
}
// Lazy migration: upgrade on read
async function getUser(id) {
const user = await db.users.findOne({ _id: id });
if (user.schemaVersion < 2) {
const [firstName, ...rest] = user.name.split(' ');
const updated = { ...user, schemaVersion: 2, firstName, lastName: rest.join(' ') };
await db.users.replaceOne({ _id: id }, updated);
return updated;
}
return user;
}
4. Indexing for Schema Patterns
Compound indexes for embedded fields
// Schema
{
userId: ObjectId("..."),
"address.city": "Seoul",
"address.zip": "04524"
}
// Index on embedded fields
db.users.createIndex({ "address.city": 1, "address.zip": 1 })
Multikey indexes for arrays
// Automatically handles arrays — one index entry per array element
db.posts.createIndex({ tags: 1 })
// Query any element in array
db.posts.find({ tags: "mongodb" })
Text indexes for search
db.products.createIndex({ name: "text", description: 'text. MongoDB Schema Design Guide에 대한 완전한 가이드입니다. 실전 예제와 함께 핵심 개념부터 고급 활용까지 다룹니다.' })
db.products.find({ $text: { $search: "wireless headphones" } })
Partial indexes — only index relevant documents
// Only index active users (saves space and write overhead)
db.users.createIndex(
{ email: 1 },
{ partialFilterExpression: { isActive: { $eq: true } } }
)
// Only index unpaid orders
db.orders.createIndex(
{ createdAt: 1 },
{ partialFilterExpression: { status: { $in: ["pending", "processing"] } } }
)
TTL index — auto-expire documents
// Sessions expire after 24 hours
db.sessions.createIndex({ createdAt: 1 }, { expireAfterSeconds: 86400 })
// Logs expire at a specific field value
db.logs.createIndex({ expireAt: 1 }, { expireAfterSeconds: 0 })
// Set expireAt: new Date(Date.now() + 7 * 24 * 60 * 60 * 1000) on insert
5. Aggregation Pipeline
// Sales report: total revenue by category, last 30 days
db.orders.aggregate([
// Stage 1: filter recent orders
{ $match: {
status: "completed",
createdAt: { $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000) }
}},
// Stage 2: unwind line items (array → multiple docs)
{ $unwind: "$lineItems" },
// Stage 3: group by category
{ $group: {
_id: "$lineItems.category",
revenue: { $sum: { $multiply: ["$lineItems.price", "$lineItems.qty"] } },
orderCount: { $sum: 1 }
}},
// Stage 4: sort and limit
{ $sort: { revenue: -1 } },
{ $limit: 10 },
// Stage 5: rename _id
{ $project: { category: "$_id", revenue: 1, orderCount: 1, _id: 0 } }
])
$lookup (join)
db.posts.aggregate([
{ $match: { status: "published" } },
{ $lookup: {
from: "users",
localField: "authorId",
foreignField: "_id",
as: "author"
}},
{ $unwind: "$author" },
{ $project: {
title: 1,
"author.name": 1,
"author.avatar": 1
}}
])
Use $lookup sparingly — it’s slower than embedded data. If you use it for every read, reconsider your schema.
6. Common Mistakes
Mistake 1: Embedding unbounded arrays
// AVOID: users array grows indefinitely
{
_id: ObjectId("post-1"),
title: "Popular Post",
likes: [
ObjectId("user-1"),
ObjectId("user-2"),
// ... could be millions
]
}
// FIX: separate likes collection or store count only
{
_id: ObjectId("post-1"),
title: "Popular Post",
likeCount: 15420
}
// Separate: db.likes.insertOne({ postId, userId, createdAt })
Mistake 2: Treating MongoDB like SQL (too many collections, no embedding)
// OVER-NORMALIZED (bad for MongoDB)
// users collection, profiles collection, addresses collection...
// Requires $lookup for every user read
// BETTER: embed what's always needed together
{
_id: ObjectId("..."),
email: "[email protected]",
profile: { bio: "...", avatar: "..." },
addresses: [{ type: "home", ... }]
}
Mistake 3: Missing indexes on foreign key fields
// If you store parentId references, always index them
db.comments.createIndex({ postId: 1, createdAt: -1 })
db.orders.createIndex({ userId: 1 })
Decision Guide
Is the relationship "owns" (user → address)?
→ Embed
Will the embedded array grow without bound?
→ Reference (or Bucket/Outlier pattern)
Is the embedded data accessed independently without the parent?
→ Reference
Is the same data referenced from many places?
→ Reference (avoids update anomalies)
Is this time-series data with high write volume?
→ Bucket pattern
Do you compute expensive aggregations on every read?
→ Computed pattern
Are you evolving the schema without a full migration?
→ Schema Versioning pattern
Key Takeaways
| Concept | Rule |
|---|---|
| Embed | Data accessed together, bounded size, “owned by” |
| Reference | Data accessed independently, unbounded growth, shared |
| Hybrid | Embed subset (e.g., 3 recent items) + reference full list |
| Bucket | Time-series or high-volume writes → group into buckets |
| Computed | Pre-aggregate totals/stats to avoid per-request aggregation |
| Indexes | Index every field in find(), sort(), and $lookup |
| $lookup | Use sparingly — frequent joins signal a schema design issue |
MongoDB rewards thinking about queries first and schema second. Ask “what queries will this app run 1000 times per second?” — then design the schema to answer those queries with a single document read.