RSS is flat but we still hit OOM. Could the cause be something other than a leak?

Yes: fragmentation, temporary spikes, third-party caches, and more. Combine heap profilers with system metrics.

Should I use Valgrind or ASan first?

In dev builds, ASan catches many bug classes quickly. Valgrind is slower but useful for some heap patterns. Follow your team’s CI policy.

What questions does Heaptrack answer?

It shows which call paths allocate how much—great for excessive allocation or cache blowups, not only “classic” leaks.

Memory still creeps up after the fix—why?

Review design: container reserve, pooling, cache caps. One leak fix often is not the whole story.

C++ Memory Leak Debugging Case Study | Fixing a Production Server Memory Spike

March 30, 2026 · 22 min read · Updated March 30, 2026 · Advanced

Key takeaway

Real-world C++ production memory leak debugging with Valgrind, AddressSanitizer, and Heaptrack.

Introduction

In production, memory leaks are bugs that slowly kill a server. This article walks through a real leak we hit: from first symptoms through root cause, fix, and prevention.

What you will learn

How to spot memory leak symptoms early
How to use Valgrind, ASan, and Heaptrack in practice
Strategies for tracing leaks in a large codebase
Coding patterns that help prevent leaks

Symptom: server memory keeps growing
Initial analysis: monitoring data
Tool choice: Valgrind vs ASan vs Heaptrack
First pass with Valgrind
Fast reproduction with ASan
Allocation patterns with Heaptrack
Root cause: accumulating event listeners
Fix: RAII and smart pointers
Verification: comparing memory profiles
Prevention: ASan in CI
Lessons and best practices
Closing thoughts

1. Symptom: server memory keeps growing

What we saw

We ran a chat server. Starting three days after deploy, memory use grew roughly linearly.

# Right after deploy
$ ps aux | grep chat_server
user  12345  0.5  2.1  524288  ...  ./chat_server

# Three days later
$ ps aux | grep chat_server
user  12345  0.5  8.7  2162688  ...  ./chat_server

# Seven days later (killed by OOM killer)
[  123.456] Out of memory: Killed process 12345 (chat_server)

Early hypotheses

Are connection objects not freed properly?
Is a log buffer growing without bound?
Is a cache growing forever?

2. Initial analysis: monitoring data

Prometheus metrics

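// Metrics collection added to the server
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

class MemoryMetrics {
public:
    // Returns the resident set size in bytes (Linux only),
    // or 0 if /proc/self/status cannot be read or parsed.
    static size_t getCurrentRSS() {
        std::ifstream status("/proc/self/status");
        std::string line;
        while (std::getline(status, line)) {
            if (line.rfind("VmRSS:", 0) == 0) { // line starts with "VmRSS:"
                std::istringstream iss(line);
                std::string key;
                unsigned long long kib = 0;
                if (iss >> key >> kib) {
                    return static_cast<size_t>(kib) * 1024; // kB to bytes
                }
            }
        }
        return 0;
    }
};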

// Send metrics periodically
void reportMetrics() {
    auto rss = MemoryMetrics::getCurrentRSS();
    prometheus_gauge_set(memory_rss_bytes, rss);
}

Pattern

From Grafana:

Memory growth rate: ~50 MB per hour
Connection count: stable (100–200)
Throughput: unchanged

Conclusion: it is not “memory per connection” but something that accumulates over time.

3. Tool choice: Valgrind vs ASan vs Heaptrack

Comparison

Tool	Strengths	Weaknesses	Best for
Valgrind	Accurate leak detection	Very slow (10–50×)	Dev, small repro cases
ASan	Fast (~2×), many bug classes	Needs recompile	CI, integration tests
Heaptrack	Allocation visualization	Not ideal for “leak only”	Memory profiling

Strategy

Try ASan for quick reproduction
If it does not repro, use Valgrind for deeper analysis
Use Heaptrack for allocation hotspots

4. First pass with Valgrind

Build and run

# Debug symbols, no optimization
$ g++ -g -O0 -std=c++17 *.cpp -o chat_server

# Run under Valgrind
$ valgrind --leak-check=full --show-leak-kinds=all \
           --track-origins=yes --log-file=valgrind.log \
           ./chat_server

Problem

The server became too slow to reproduce real load. After 10 minutes, memory growth was tiny.

==12345== HEAP SUMMARY:
==12345==     in use at exit: 1,234,567 bytes in 1,234 blocks
==12345==   total heap usage: 12,345 allocs, 11,111 frees, 123,456,789 bytes allocated

Conclusion: Valgrind is too slow to replay production-like load.

5. Fast reproduction with ASan

ASan build

# Recompile with ASan
$ g++ -g -O1 -fsanitize=address -fno-omit-frame-pointer \
      -std=c++17 *.cpp -o chat_server_asan

$ export ASAN_OPTIONS=detect_leaks=1:log_path=asan.log

Load test

# Simulate real traffic
$ ./load_test.sh --connections=200 --duration=600s

Result

In 10 minutes the leak reproduced; ASan reported:

=================================================================
==23456==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 48000 byte(s) in 1000 object(s) allocated from:
    #0 0x7f123456 in operator new(unsigned long)
    #1 0x7f234567 in EventManager::subscribe(std::string const&, EventCallback)
    #2 0x7f345678 in ChatRoom::addUser(User*)
    #3 0x7f456789 in Server::handleJoin(Connection*)
    ...

SUMMARY: AddressSanitizer: 48000 byte(s) leaked in 1000 allocation(s).

Finding: leak originates in EventManager::subscribe.

6. Allocation patterns with Heaptrack

Running Heaptrack

$ heaptrack ./chat_server

$ heaptrack_gui heaptrack.chat_server.12345.gz

Findings

From the Heaptrack flame graph:

EventManager::subscribe accounts for 35% of allocations
Allocations keep growing; almost no corresponding frees
Call stack: ChatRoom::addUser → subscribe

7. Root cause: accumulating event listeners

Buggy code

class EventManager {
    std::unordered_map<std::string, std::vector<EventCallback*>> listeners_;

public:
    void subscribe(const std::string& event, EventCallback callback) {
        // Bug: allocated with new but never freed
        auto* cb = new EventCallback(std::move(callback));
        listeners_[event].push_back(cb);
    }
    
    void publish(const std::string& event, const EventData& data) {
        if (auto it = listeners_.find(event); it != listeners_.end()) {
            for (auto* cb : it->second) {
                (*cb)(data);
            }
        }
    }
    
    // Destructor does not free listeners!
    ~EventManager() = default;
};

class ChatRoom {
    EventManager& eventMgr_;
    
public:
    void addUser(User* user) {
        // Register a listener on every join
        eventMgr_.subscribe("message", [user](const EventData& data) {
            user->sendMessage(data);
        });
        
        // User leaves but listeners remain!
    }
};

Why it leaked

Every addUser did new EventCallback
After a user left, pointers stayed in listeners_
The destructor did not free them
1000 joins → 1000 allocations → 0 frees ≈ 48 KB leak (scaled up in production)
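The accumulation can be reproduced in isolation. The following is a minimal sketch, assuming a simplified callback type (`std::function<void(int)>`) and hypothetical names (`LeakyEventManager`, `listenersAfterChurn`); it only counts listeners that remain after users "leave".

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using EventCallback = std::function<void(int)>; // simplified stand-in

class LeakyEventManager {
    std::unordered_map<std::string, std::vector<EventCallback*>> listeners_;

public:
    // Same shape as the buggy subscribe: allocate, store, never remove.
    void subscribe(const std::string& event, EventCallback cb) {
        listeners_[event].push_back(new EventCallback(std::move(cb)));
    }

    std::size_t listenerCount(const std::string& event) const {
        auto it = listeners_.find(event);
        return it == listeners_.end() ? 0 : it->second.size();
    }

    // Demo-only cleanup so this sketch itself does not leak at exit.
    // The original bug had no such cleanup and no unsubscribe path.
    ~LeakyEventManager() {
        for (auto& entry : listeners_) {
            for (EventCallback* cb : entry.second) delete cb;
        }
    }
};

// Simulates `joins` users joining and leaving; returns listeners left behind.
std::size_t listenersAfterChurn(int joins) {
    LeakyEventManager mgr;
    for (int i = 0; i < joins; ++i) {
        mgr.subscribe("message", [i](int) { (void)i; });
        // The user "leaves" here, but nothing removes the listener.
    }
    return mgr.listenerCount("message");
}
```

With no unsubscribe path, the listener count tracks total joins rather than current users, which matches the growth pattern seen in monitoring.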

8. Fix: RAII and smart pointers

Option 1: smart pointers

class EventManager {
    using CallbackPtr = std::shared_ptr<EventCallback>;
    std::unordered_map<std::string, std::vector<CallbackPtr>> listeners_;

public:
    // Returns subscription id for later unsubscribe
    size_t subscribe(const std::string& event, EventCallback callback) {
        auto cb = std::make_shared<EventCallback>(std::move(callback));
        listeners_[event].push_back(cb);
        return reinterpret_cast<size_t>(cb.get());
    }
    
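    void unsubscribe(const std::string& event, size_t id) {
        // find() avoids creating an empty entry for unknown events
        auto it = listeners_.find(event);
        if (it == listeners_.end()) return;
        auto& cbs = it->second;
        cbs.erase(
            std::remove_if(cbs.begin(), cbs.end(),
                [id](const CallbackPtr& cb) {
                    return reinterpret_cast<size_t>(cb.get()) == id;
                }),
            cbs.end()
        );
    }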
    
    ~EventManager() = default; // shared_ptr cleans up
};
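Using the callback's address as the subscription id works, but an address can be reused after a callback is freed, so a stale id could match a newer listener. One alternative, sketched here under assumptions, is a monotonically increasing counter. `CountingEventManager` is a hypothetical name and `EventData` is reduced to `std::string` for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using EventData = std::string; // stand-in for the article's EventData
using EventCallback = std::function<void(const EventData&)>;

class CountingEventManager {
    struct Entry {
        std::uint64_t id;
        EventCallback callback;
    };
    std::unordered_map<std::string, std::vector<Entry>> listeners_;
    std::uint64_t nextId_ = 1;

public:
    // Ids are never reused, so a stale id cannot match a newer listener.
    std::uint64_t subscribe(const std::string& event, EventCallback callback) {
        const std::uint64_t id = nextId_++;
        listeners_[event].push_back(Entry{id, std::move(callback)});
        return id;
    }

    void unsubscribe(const std::string& event, std::uint64_t id) {
        auto it = listeners_.find(event);
        if (it == listeners_.end()) return;
        auto& entries = it->second;
        entries.erase(std::remove_if(entries.begin(), entries.end(),
                                     [id](const Entry& e) { return e.id == id; }),
                      entries.end());
        if (entries.empty()) listeners_.erase(it); // drop empty buckets too
    }

    // Note: callbacks must not subscribe/unsubscribe during publish here.
    void publish(const std::string& event, const EventData& data) const {
        auto it = listeners_.find(event);
        if (it == listeners_.end()) return;
        for (const auto& e : it->second) e.callback(data);
    }

    std::size_t listenerCount(const std::string& event) const {
        auto it = listeners_.find(event);
        return it == listeners_.end() ? 0 : it->second.size();
    }
};
```

Storing the callbacks by value also removes the need for `shared_ptr` in this variant; whether that fits depends on how callbacks are shared elsewhere in the codebase.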

Option 2: RAII wrapper

class Subscription {
    EventManager* mgr_;
    std::string event_;
    size_t id_;

public:
    Subscription(EventManager* mgr, std::string event, size_t id)
        : mgr_(mgr), event_(std::move(event)), id_(id) {}
    
    ~Subscription() {
        if (mgr_) {
            mgr_->unsubscribe(event_, id_);
        }
    }
    
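    Subscription(Subscription&& other) noexcept
        : mgr_(other.mgr_), event_(std::move(other.event_)), id_(other.id_) {
        other.mgr_ = nullptr;
    }
    
    // Move assignment is needed to erase elements from std::vector<Subscription>
    Subscription& operator=(Subscription&& other) noexcept {
        if (this != &other) {
            if (mgr_) mgr_->unsubscribe(event_, id_);
            mgr_ = other.mgr_;
            event_ = std::move(other.event_);
            id_ = other.id_;
            other.mgr_ = nullptr;
        }
        return *this;
    }
    
    Subscription(const Subscription&) = delete;
    Subscription& operator=(const Subscription&) = delete;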
};

class ChatRoom {
    EventManager& eventMgr_;
    std::vector<Subscription> subscriptions_;

public:
    void addUser(User* user) {
        auto id = eventMgr_.subscribe("message", [user](const EventData& data) {
            user->sendMessage(data);
        });
        
        subscriptions_.emplace_back(&eventMgr_, "message", id);
    }
    
    void removeUser(User* user) {
        // Removing from subscriptions_ triggers unsubscribe
        // (in practice, map users to subscriptions)
    }
};
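The "map users to subscriptions" idea mentioned in `removeUser` could look roughly like the sketch below. It is self-contained rather than a drop-in for the classes above: `Bus` is a hypothetical single-event stand-in for `EventManager`, users are identified by an `int` id, and `Room` is a hypothetical simplification of `ChatRoom`.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>

// Minimal stand-in for EventManager (one event type only).
class Bus {
    std::map<std::size_t, std::function<void(const std::string&)>> listeners_;
    std::size_t nextId_ = 1;

public:
    std::size_t subscribe(std::function<void(const std::string&)> cb) {
        const std::size_t id = nextId_++;
        listeners_.emplace(id, std::move(cb));
        return id;
    }
    void unsubscribe(std::size_t id) { listeners_.erase(id); }
    std::size_t listenerCount() const { return listeners_.size(); }
};

// RAII handle: unsubscribes on destruction; movable, not copyable.
class Subscription {
    Bus* bus_ = nullptr;
    std::size_t id_ = 0;

    void reset() {
        if (bus_) bus_->unsubscribe(id_);
        bus_ = nullptr;
    }

public:
    Subscription(Bus& bus, std::size_t id) : bus_(&bus), id_(id) {}
    ~Subscription() { reset(); }

    Subscription(Subscription&& other) noexcept
        : bus_(std::exchange(other.bus_, nullptr)), id_(other.id_) {}
    Subscription& operator=(Subscription&& other) noexcept {
        if (this != &other) {
            reset();
            bus_ = std::exchange(other.bus_, nullptr);
            id_ = other.id_;
        }
        return *this;
    }
    Subscription(const Subscription&) = delete;
    Subscription& operator=(const Subscription&) = delete;
};

// Each user id owns exactly one Subscription; erasing the map entry
// runs ~Subscription, which unsubscribes from the bus.
class Room {
    Bus& bus_;
    std::unordered_map<int, Subscription> subscriptions_;

public:
    explicit Room(Bus& bus) : bus_(bus) {}

    void addUser(int userId) {
        subscriptions_.erase(userId); // drop any earlier subscription first
        const std::size_t id =
            bus_.subscribe([userId](const std::string&) { (void)userId; });
        subscriptions_.emplace(userId, Subscription(bus_, id));
    }

    void removeUser(int userId) { subscriptions_.erase(userId); }
};
```

The `Bus` must outlive every `Room` that refers to it, because each `Subscription` holds a raw pointer back to the bus.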

9. Verification: comparing memory profiles

Before

$ heaptrack ./chat_server_before
# After 10 minutes
Peak heap memory: 2.1 GB
Total allocations: 1,234,567
Total deallocations: 234,567
Leaked: 1,000,000 allocations

After

$ heaptrack ./chat_server_after
# After 10 minutes
Peak heap memory: 156 MB
Total allocations: 1,234,567
Total deallocations: 1,234,565
Leaked: 2 allocations (static objects)

ASan final check

$ ./chat_server_asan
# After a 10-minute load test, exit
# LeakSanitizer prints a report only when it finds leaks:
# no report was printed and the process exited with status 0

Success: the leak is gone.

10. Prevention: ASan in CI

GitHub Actions

# .github/workflows/sanitizers.yml
name: Memory Sanitizers

on: [push, pull_request]

jobs:
  asan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Build with ASan
        run: |
          cmake -DCMAKE_BUILD_TYPE=Debug \
                -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer" \
                -B build
          cmake --build build
      
      - name: Run tests with ASan
        run: |
          export ASAN_OPTIONS=detect_leaks=1:halt_on_error=1
          cd build && ctest --output-on-failure

Code review checklist

If new is used, is there a matching delete (or smart pointer)?
Can this be a smart pointer?
Is RAII used for resource acquisition?
For callbacks/listeners, is there an unsubscribe path?

11. Lessons and best practices

Takeaways

Detect early: wire up memory monitoring from day one of deploys
Combine tools: Valgrind, ASan, Heaptrack for different situations
RAII: acquisition is initialization; release in destructors
Automate: sanitizers in CI to catch regressions

Patterns that help avoid leaks

// Bad: manual memory
class BadCache {
    std::map<std::string, Data*> cache_;
public:
    void add(const std::string& key, Data* data) {
        cache_[key] = data; // who deletes?
    }
};

// Good: smart pointers
class GoodCache {
    std::map<std::string, std::unique_ptr<Data>> cache_;
public:
    void add(const std::string& key, std::unique_ptr<Data> data) {
        cache_[key] = std::move(data);
    }
};

// Better: value semantics
class BestCache {
    std::map<std::string, Data> cache_;
public:
    void add(const std::string& key, Data data) {
        cache_[key] = std::move(data);
    }
};

Closing thoughts

What we learned:

Leaks often show up slowly—monitoring is non-optional
Picking the right tool cuts debugging time sharply
RAII and smart pointers are the baseline for memory safety
CI sanitizers catch regressions early

If you are fighting memory issues in production, use this workflow systematically.

FAQ

Q1. Can we run ASan in production?

Usually not for all traffic. Roughly 2× overhead is common, so route a small fraction of traffic to an ASan build, or replay production traffic in staging.

Q2. Valgrind says “still reachable”—is that a leak?

“Still reachable” means the memory was still referenced by a pointer at exit. That is fine for static singletons; if the amount grows over time, treat it as a leak.

Q3. Don’t smart pointers cause leaks via cycles?

Break shared_ptr cycles with weak_ptr; prefer unique_ptr when ownership is clear.
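A small sketch of the weak_ptr approach, using hypothetical `Parent` and `Child` types: the parent owns the child through `shared_ptr`, and the child refers back through `weak_ptr`, so the back-reference does not keep the parent alive.

```cpp
#include <memory>

struct Child;

struct Parent {
    std::shared_ptr<Child> child; // Parent owns Child
};

struct Child {
    std::weak_ptr<Parent> parent; // non-owning back-reference
};

// Returns true if the Parent was destroyed once its last owner went away.
bool parentReleasedWithWeakBackRef() {
    std::weak_ptr<Parent> observer;
    {
        auto parent = std::make_shared<Parent>();
        parent->child = std::make_shared<Child>();
        parent->child->parent = parent; // weak: no ownership cycle
        observer = parent;
    } // last shared_ptr<Parent> goes out of scope here
    return observer.expired();
}
```

If `Child::parent` were a `shared_ptr<Parent>` instead, the two objects would keep each other alive and neither destructor would run.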

Checklists

Memory leak debugging

Memory-safe coding

new/delete pairing or smart pointers
RAII for resources
Unsubscribe path for callbacks/listeners
Check for cycles (weak_ptr)
Exception safety (still freed on throw?)
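For the exception-safety item, a minimal sketch with hypothetical names (`Tracked`, `mayThrow`, `smartOwner`, `liveAfterThrow`) shows how a `unique_ptr` still releases its object when a later step throws. With a raw `new` and a trailing `delete`, the `delete` would be skipped on the throwing path.

```cpp
#include <memory>
#include <stdexcept>

// Counts live instances so the effect of unwinding is observable.
struct Tracked {
    static int live;
    Tracked() { ++live; }
    ~Tracked() { --live; }
    Tracked(const Tracked&) = delete;
    Tracked& operator=(const Tracked&) = delete;
};
int Tracked::live = 0;

void mayThrow(bool fail) {
    if (fail) throw std::runtime_error("simulated failure");
}

// The unique_ptr destructor runs during stack unwinding,
// so the Tracked object is freed even when mayThrow throws.
void smartOwner(bool fail) {
    auto t = std::make_unique<Tracked>();
    mayThrow(fail);
}

// Calls fn(true), swallows the simulated failure, and returns
// how many Tracked objects were left alive by the call.
int liveAfterThrow(void (*fn)(bool)) {
    const int before = Tracked::live;
    try {
        fn(true);
    } catch (const std::runtime_error&) {
        // expected in this sketch
    }
    return Tracked::live - before;
}
```

A counter like `Tracked::live` is only a teaching aid; in a real test suite, a sanitizer build gives the same signal without instrumenting types.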

Keywords

C++, memory leak, debugging, Valgrind, ASan, AddressSanitizer, Heaptrack, production, case study, RAII, smart pointers, profiling, CI/CD