C++ Memory Leak Debugging Case Study | Fixing a Production Server Memory Spike

Key points of this article

Real-world C++ production memory leak debugging with Valgrind, AddressSanitizer, and Heaptrack.

Introduction

In production, memory leaks are bugs that slowly kill a server. This article walks through a real leak we hit: from first symptoms through root cause, fix, and prevention.

What you will learn

  • How to spot memory leak symptoms early
  • How to use Valgrind, ASan, and Heaptrack in practice
  • Strategies for tracing leaks in a large codebase
  • Coding patterns that help prevent leaks

Table of contents

  1. Symptom: server memory keeps growing
  2. Initial analysis: monitoring data
  3. Tool choice: Valgrind vs ASan vs Heaptrack
  4. First pass with Valgrind
  5. Fast reproduction with ASan
  6. Allocation patterns with Heaptrack
  7. Root cause: accumulating event listeners
  8. Fix: RAII and smart pointers
  9. Verification: comparing memory profiles
  10. Prevention: ASan in CI
  11. Lessons and best practices

1. Symptom: server memory keeps growing

What we saw

We ran a chat server. Starting three days after deploy, memory use grew roughly linearly.

# Right after deploy
$ ps aux | grep chat_server
user  12345  0.5  2.1  524288  ...  ./chat_server

# Three days later
$ ps aux | grep chat_server
user  12345  0.5  8.7  2162688  ...  ./chat_server

# Seven days later (killed by OOM killer)
[  123.456] Out of memory: Killed process 12345 (chat_server)

Early hypotheses

  1. Are connection objects not freed properly?
  2. Is a log buffer growing without bound?
  3. Is a cache growing forever?

2. Initial analysis: monitoring data

Prometheus metrics

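// Metrics collection added to the server
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

class MemoryMetrics {
public:
    // Resident set size of this process in bytes, or 0 if it cannot be read (Linux only)
    static std::size_t getCurrentRSS() {
        std::ifstream status("/proc/self/status");
        std::string line;
        while (std::getline(status, line)) {
            if (line.rfind("VmRSS:", 0) == 0) { // line starts with "VmRSS:"
                std::istringstream iss(line);
                std::string key, unit;
                std::size_t kilobytes = 0;
                if (iss >> key >> kilobytes >> unit) {
                    return kilobytes * 1024; // /proc reports kB (units of 1024 bytes)
                }
            }
        }
        return 0;
    }
};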

// Send metrics periodically
void reportMetrics() {
    auto rss = MemoryMetrics::getCurrentRSS();
    prometheus_gauge_set(memory_rss_bytes, rss);
}

Pattern

From Grafana:

  • Memory growth rate: ~50 MB per hour
  • Connection count: stable (100–200)
  • Throughput: unchanged

Conclusion: memory kept growing while the connection count stayed flat, so the cost is not “memory per connection” but something that accumulates over time.
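The reasoning above depends on a slope: RSS rises steadily while the connection count does not. Below is a minimal sketch of estimating that slope from sampled RSS values with an ordinary least-squares fit. The names RssSample and growthRateMbPerHour are hypothetical and not part of the server code; the samples would come from whatever the monitoring stack exports.

```cpp
#include <cstddef>
#include <vector>

// One monitoring sample (hypothetical type for this sketch).
struct RssSample {
    double hours;      // time since the first sample, in hours
    double megabytes;  // resident set size at that time, in MB
};

// Least-squares slope of RSS over time, in MB per hour.
// Returns 0 when there are fewer than two samples or all samples
// share the same timestamp (the slope is undefined in that case).
double growthRateMbPerHour(const std::vector<RssSample>& samples) {
    if (samples.size() < 2) {
        return 0.0;
    }
    const double n = static_cast<double>(samples.size());
    double sumT = 0.0, sumM = 0.0, sumTT = 0.0, sumTM = 0.0;
    for (const RssSample& s : samples) {
        sumT += s.hours;
        sumM += s.megabytes;
        sumTT += s.hours * s.hours;
        sumTM += s.hours * s.megabytes;
    }
    const double denom = n * sumTT - sumT * sumT;
    if (denom == 0.0) {
        return 0.0;
    }
    return (n * sumTM - sumT * sumM) / denom;
}
```

A slope that stays positive across several hours while connection count and throughput are flat points to accumulation rather than load.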


3. Tool choice: Valgrind vs ASan vs Heaptrack

Comparison

Tool      | Strengths                    | Weaknesses                | Best for
Valgrind  | Accurate leak detection      | Very slow (10–50×)        | Dev, small repro cases
ASan      | Fast (~2×), many bug classes | Needs recompile           | CI, integration tests
Heaptrack | Allocation visualization     | Not ideal for “leak only” | Memory profiling

Strategy

  1. Try ASan for quick reproduction
  2. If it does not repro, use Valgrind for deeper analysis
  3. Use Heaptrack for allocation hotspots

4. First pass with Valgrind

Build and run

# Debug symbols, no optimization
$ g++ -g -O0 -std=c++17 *.cpp -o chat_server

# Run under Valgrind
$ valgrind --leak-check=full --show-leak-kinds=all \
           --track-origins=yes --log-file=valgrind.log \
           ./chat_server

Problem

The server became too slow to reproduce real load. After 10 minutes, memory growth was tiny.

==12345== HEAP SUMMARY:
==12345==     in use at exit: 1,234,567 bytes in 1,234 blocks
==12345==   total heap usage: 12,345 allocs, 11,111 frees, 123,456,789 bytes allocated

Conclusion: Valgrind is too slow to replay production-like load.


5. Fast reproduction with ASan

ASan build

# Recompile with ASan
$ g++ -g -O1 -fsanitize=address -fno-omit-frame-pointer \
      -std=c++17 *.cpp -o chat_server_asan

$ export ASAN_OPTIONS=detect_leaks=1:log_path=asan.log

Load test

# Simulate real traffic
$ ./load_test.sh --connections=200 --duration=600s

Result

In 10 minutes the leak reproduced; ASan reported:

=================================================================
==23456==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 48000 byte(s) in 1000 object(s) allocated from:
    #0 0x7f123456 in operator new(unsigned long)
    #1 0x7f234567 in EventManager::subscribe(std::string const&, EventCallback)
    #2 0x7f345678 in ChatRoom::addUser(User*)
    #3 0x7f456789 in Server::handleJoin(Connection*)
    ...

SUMMARY: AddressSanitizer: 48000 byte(s) leaked in 1000 allocation(s).

Finding: leak originates in EventManager::subscribe.
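The report above is produced when the process exits, which is awkward for a server that normally runs for days. The sanitizer runtime also exposes __lsan_do_recoverable_leak_check() in <sanitizer/lsan_interface.h>, which runs a leak check on demand without terminating the process. A minimal sketch, assuming an ASan build on a platform where LeakSanitizer is supported; the helper name leaksDetectedNow and the CHAT_LEAK_CHECK_AVAILABLE macro are hypothetical.

```cpp
// Detect an ASan build: GCC defines __SANITIZE_ADDRESS__,
// Clang exposes __has_feature(address_sanitizer).
#if defined(__SANITIZE_ADDRESS__)
#  define CHAT_LEAK_CHECK_AVAILABLE 1
#elif defined(__has_feature)
#  if __has_feature(address_sanitizer)
#    define CHAT_LEAK_CHECK_AVAILABLE 1
#  endif
#endif

#ifdef CHAT_LEAK_CHECK_AVAILABLE
#  include <sanitizer/lsan_interface.h>
#endif

// Hypothetical helper: returns true if LeakSanitizer finds leaks right now.
// In builds without the sanitizer it always returns false.
bool leaksDetectedNow() {
#ifdef CHAT_LEAK_CHECK_AVAILABLE
    // Prints a leak report if leaks are found, but keeps the process running.
    return __lsan_do_recoverable_leak_check() != 0;
#else
    return false;
#endif
}
```

One possible use is to call the helper from an admin command or a signal-triggered hook at the end of a load test, so the server does not have to shut down cleanly to produce a report.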


6. Allocation patterns with Heaptrack

Running Heaptrack

$ heaptrack ./chat_server

$ heaptrack_gui heaptrack.chat_server.12345.gz

Findings

From the Heaptrack flame graph:

  1. EventManager::subscribe accounts for 35% of allocations
  2. Allocations keep growing; almost no corresponding frees
  3. Call stack: ChatRoom::addUser → EventManager::subscribe

7. Root cause: accumulating event listeners

Buggy code

class EventManager {
    std::unordered_map<std::string, std::vector<EventCallback*>> listeners_;

public:
    void subscribe(const std::string& event, EventCallback callback) {
        // Bug: allocated with new but never freed
        auto* cb = new EventCallback(std::move(callback));
        listeners_[event].push_back(cb);
    }
    
    void publish(const std::string& event, const EventData& data) {
        if (auto it = listeners_.find(event); it != listeners_.end()) {
            for (auto* cb : it->second) {
                (*cb)(data);
            }
        }
    }
    
    // Destructor does not free listeners!
    ~EventManager() = default;
};

class ChatRoom {
    EventManager& eventMgr_;
    
public:
    void addUser(User* user) {
        // Register a listener on every join
        eventMgr_.subscribe("message", [user](const EventData& data) {
            user->sendMessage(data);
        });
        
        // User leaves but listeners remain!
    }
};

Why it leaked

  1. Every addUser did new EventCallback
  2. After a user left, pointers stayed in listeners_
  3. The destructor did not free them
  4. 1000 joins → 1000 allocations → 0 frees ≈ 48 KB leak (scaled up in production)
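The accumulation described in these steps can be reproduced in isolation. The sketch below uses a hypothetical LeakyRegistry that mirrors the raw-pointer pattern of the buggy subscribe; it is a simplified stand-in, not the server's EventManager. It adds a demo-only releaseAll() so the example itself frees what it allocates, which the buggy code never did.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Callback = std::function<void()>;  // simplified stand-in for EventCallback

// Hypothetical model of the buggy pattern: every subscribe allocates,
// and nothing removes the entry when a user leaves.
class LeakyRegistry {
    std::vector<Callback*> listeners_;

public:
    void subscribe(Callback cb) {
        listeners_.push_back(new Callback(std::move(cb)));
    }

    std::size_t size() const { return listeners_.size(); }

    // Demo-only cleanup so this sketch does not leak; the buggy code had
    // no equivalent of this function.
    void releaseAll() {
        for (Callback* cb : listeners_) {
            delete cb;
        }
        listeners_.clear();
    }
};

// Simulates `joins` users joining and then leaving.
// Returns how many callbacks are still registered afterwards.
std::size_t listenersAfterChurn(std::size_t joins) {
    LeakyRegistry registry;
    for (std::size_t i = 0; i < joins; ++i) {
        registry.subscribe([] {});  // user joins
        // user leaves: no code path removes the listener
    }
    const std::size_t remaining = registry.size();
    registry.releaseAll();
    return remaining;
}
```

The number of registered callbacks tracks the total number of joins ever seen, not the number of users currently connected, which matches the monitoring data: flat connections, rising memory.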

8. Fix: RAII and smart pointers

Option 1: smart pointers

class EventManager {
    using CallbackPtr = std::shared_ptr<EventCallback>;
    std::unordered_map<std::string, std::vector<CallbackPtr>> listeners_;

public:
    // Returns a subscription id for later unsubscribe.
    // The id is derived from the callback's address, so it is unique only
    // while the subscription is alive; do not reuse it after unsubscribe.
    size_t subscribe(const std::string& event, EventCallback callback) {
        auto cb = std::make_shared<EventCallback>(std::move(callback));
        listeners_[event].push_back(cb);
        return reinterpret_cast<size_t>(cb.get());
    }
    
    void unsubscribe(const std::string& event, size_t id) {
        // find() avoids creating an empty entry for an unknown event
        auto it = listeners_.find(event);
        if (it == listeners_.end()) {
            return;
        }
        auto& cbs = it->second;
        cbs.erase(
            std::remove_if(cbs.begin(), cbs.end(),
                [id](const CallbackPtr& cb) {
                    return reinterpret_cast<size_t>(cb.get()) == id;
                }),
            cbs.end()
        );
    }
    
    ~EventManager() = default; // shared_ptr cleans up
};
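An id derived from a pointer address can be handed out again once the original callback is freed, so a stale id could remove someone else's listener. A possible variant, sketched below under the assumption that callbacks can be stored by value, uses a monotonically increasing counter instead. CountingEventManager, listenerCount, and the EventData layout are hypothetical names for this sketch, not the server's actual types.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct EventData {            // hypothetical payload for this sketch
    std::string text;
};
using EventCallback = std::function<void(const EventData&)>;

class CountingEventManager {
    struct Entry {
        std::size_t id;
        EventCallback callback;  // stored by value: no new/delete involved
    };
    std::unordered_map<std::string, std::vector<Entry>> listeners_;
    std::size_t nextId_ = 1;     // ids are never reused

public:
    std::size_t subscribe(const std::string& event, EventCallback callback) {
        const std::size_t id = nextId_++;
        listeners_[event].push_back(Entry{id, std::move(callback)});
        return id;
    }

    // Returns true if a listener with this id was found and removed.
    bool unsubscribe(const std::string& event, std::size_t id) {
        auto it = listeners_.find(event);
        if (it == listeners_.end()) {
            return false;
        }
        auto& entries = it->second;
        auto pos = std::find_if(entries.begin(), entries.end(),
                                [id](const Entry& e) { return e.id == id; });
        if (pos == entries.end()) {
            return false;
        }
        entries.erase(pos);
        if (entries.empty()) {
            listeners_.erase(it);  // do not keep empty vectors around
        }
        return true;
    }

    void publish(const std::string& event, const EventData& data) const {
        auto it = listeners_.find(event);
        if (it == listeners_.end()) {
            return;
        }
        // Iterate over a copy so the stored list is not touched during dispatch.
        const std::vector<Entry> snapshot = it->second;
        for (const Entry& e : snapshot) {
            e.callback(data);
        }
    }

    std::size_t listenerCount(const std::string& event) const {
        auto it = listeners_.find(event);
        return it == listeners_.end() ? 0 : it->second.size();
    }
};
```

Returning a bool from unsubscribe makes double-unsubscribe bugs visible in tests, and listenerCount gives monitoring and unit tests a direct way to check that listeners do not pile up.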

Option 2: RAII wrapper

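class Subscription {
    EventManager* mgr_;
    std::string event_;
    size_t id_;

    void release() {
        if (mgr_) {
            mgr_->unsubscribe(event_, id_);
            mgr_ = nullptr;
        }
    }

public:
    Subscription(EventManager* mgr, std::string event, size_t id)
        : mgr_(mgr), event_(std::move(event)), id_(id) {}
    
    ~Subscription() { release(); }
    
    Subscription(Subscription&& other) noexcept
        : mgr_(other.mgr_), event_(std::move(other.event_)), id_(other.id_) {
        other.mgr_ = nullptr;
    }
    
    // Move assignment drops the current subscription before taking over the other one
    Subscription& operator=(Subscription&& other) noexcept {
        if (this != &other) {
            release();
            mgr_ = other.mgr_;
            event_ = std::move(other.event_);
            id_ = other.id_;
            other.mgr_ = nullptr;
        }
        return *this;
    }
    
    Subscription(const Subscription&) = delete;
    Subscription& operator=(const Subscription&) = delete;
};

class ChatRoom {
    EventManager& eventMgr_;
    // One subscription per user, so removeUser can drop exactly that listener
    std::unordered_map<User*, Subscription> subscriptions_;

public:
    void addUser(User* user) {
        if (subscriptions_.count(user)) {
            return; // already subscribed; avoid registering a second listener
        }
        auto id = eventMgr_.subscribe("message", [user](const EventData& data) {
            user->sendMessage(data);
        });
        
        subscriptions_.try_emplace(user, &eventMgr_, "message", id);
    }
    
    void removeUser(User* user) {
        // Erasing the entry destroys the Subscription, which calls unsubscribe
        subscriptions_.erase(user);
    }
};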

9. Verification: comparing memory profiles

Before

$ heaptrack ./chat_server_before
# After 10 minutes
Peak heap memory: 2.1 GB
Total allocations: 1,234,567
Total deallocations: 234,567
Leaked: 1,000,000 allocations

After

$ heaptrack ./chat_server_after
# After 10 minutes
Peak heap memory: 156 MB
Total allocations: 1,234,567
Total deallocations: 1,234,565
Leaked: 2 allocations (static objects)

ASan final check

$ ./chat_server_asan
# After 10 min load test, exit
# LeakSanitizer prints a report only when it finds leaks; this run produced none

Success: the leak is gone.


10. Prevention: ASan in CI

GitHub Actions

# .github/workflows/sanitizers.yml
name: Memory Sanitizers

on: [push, pull_request]

jobs:
  asan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Build with ASan
        run: |
          cmake -DCMAKE_BUILD_TYPE=Debug \
                -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer" \
                -B build
          cmake --build build
      
      - name: Run tests with ASan
        run: |
          export ASAN_OPTIONS=detect_leaks=1:halt_on_error=1
          cd build && ctest --output-on-failure
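The CI job is only useful if some test actually exercises the join/leave path. Below is a sketch of the kind of regression check that ctest could run under ASan. ListenerRegistry and its Token are hypothetical, simplified stand-ins for EventManager and Subscription, and the two helper functions are illustrative; a real test would drive ChatRoom::addUser and removeUser instead.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for EventManager: subscribe() returns a move-only
// token whose destructor removes the listener.
class ListenerRegistry {
    std::map<std::size_t, std::function<void()>> listeners_;
    std::size_t nextId_ = 0;

public:
    class Token {
        ListenerRegistry* owner_ = nullptr;
        std::size_t id_ = 0;

    public:
        Token(ListenerRegistry* owner, std::size_t id) : owner_(owner), id_(id) {}
        Token(Token&& other) noexcept
            : owner_(std::exchange(other.owner_, nullptr)), id_(other.id_) {}
        Token& operator=(Token&& other) noexcept {
            if (this != &other) {
                reset();
                owner_ = std::exchange(other.owner_, nullptr);
                id_ = other.id_;
            }
            return *this;
        }
        Token(const Token&) = delete;
        Token& operator=(const Token&) = delete;
        ~Token() { reset(); }

        // Removes the listener once; later calls do nothing.
        void reset() {
            if (owner_) {
                owner_->listeners_.erase(id_);
                owner_ = nullptr;
            }
        }
    };

    Token subscribe(std::function<void()> fn) {
        const std::size_t id = nextId_++;
        listeners_.emplace(id, std::move(fn));
        return Token(this, id);
    }

    std::size_t size() const { return listeners_.size(); }
};

// Simulates `cycles` users joining and leaving.
// Returns the number of listeners still registered afterwards.
std::size_t listenersLeftAfterChurn(std::size_t cycles) {
    ListenerRegistry registry;
    for (std::size_t i = 0; i < cycles; ++i) {
        ListenerRegistry::Token token = registry.subscribe([] {});
        // token is destroyed at the end of each iteration: the "user" leaves
    }
    return registry.size();
}

// Control case: tokens are kept alive, so every listener stays registered.
std::size_t listenersWhileConnected(std::size_t users) {
    ListenerRegistry registry;
    std::vector<ListenerRegistry::Token> tokens;
    for (std::size_t i = 0; i < users; ++i) {
        tokens.push_back(registry.subscribe([] {}));
    }
    return registry.size();
}
```

Asserting that the count returns to its baseline catches this class of bug even when nothing is technically leaked, for example when listeners are owned by smart pointers but never unsubscribed.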

Code review checklist

  • If new is used, is there a matching delete (or smart pointer)?
  • Can this be a smart pointer?
  • Is RAII used for resource acquisition?
  • For callbacks/listeners, is there an unsubscribe path?

11. Lessons and best practices

Takeaways

  1. Detect early: wire up memory monitoring from day one of deploys
  2. Combine tools: Valgrind, ASan, Heaptrack for different situations
  3. RAII: acquire resources in constructors and release them in destructors
  4. Automate: sanitizers in CI to catch regressions

Patterns that help avoid leaks

// Bad: manual memory
class BadCache {
    std::map<std::string, Data*> cache_;
public:
    void add(const std::string& key, Data* data) {
        cache_[key] = data; // who deletes?
    }
};

// Good: smart pointers
class GoodCache {
    std::map<std::string, std::unique_ptr<Data>> cache_;
public:
    void add(const std::string& key, std::unique_ptr<Data> data) {
        cache_[key] = std::move(data);
    }
};

// Better: value semantics
class BestCache {
    std::map<std::string, Data> cache_;
public:
    void add(const std::string& key, Data data) {
        cache_[key] = std::move(data);
    }
};

Closing thoughts

What we learned:

  • Leaks often show up slowly, so monitoring is not optional
  • Picking the right tool cuts debugging time sharply
  • RAII and smart pointers are the baseline for memory safety
  • CI sanitizers catch regressions early

If you are fighting memory issues in production, use this workflow systematically.


FAQ

Q1. Can we run ASan in production?

It is possible, but expect roughly 2× overhead. Common approaches are to route a fraction of traffic to an ASan build or to replay production traffic in staging.

Q2. Valgrind says “still reachable”. Is that a leak?

“Still reachable” means the memory was never freed, but a pointer to it still existed at exit. That is fine for static singletons; if the amount grows over time, treat it as a leak.

Q3. Don’t smart pointers cause leaks via cycles?

Break shared_ptr cycles with weak_ptr; prefer unique_ptr when ownership is clear.
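A minimal sketch of the weak_ptr approach: the parent owns the child through shared_ptr, and the child refers back through weak_ptr, so no ownership cycle forms. Parent, Child, and destroyedAfterScope are hypothetical names for this illustration; the counter only records how many destructors ran.

```cpp
#include <memory>

struct Child;

struct Parent {
    std::shared_ptr<Child> child;  // owning link: parent keeps child alive
    int& destroyed;
    explicit Parent(int& counter) : destroyed(counter) {}
    ~Parent() { ++destroyed; }
};

struct Child {
    std::weak_ptr<Parent> parent;  // non-owning back-reference breaks the cycle
    int& destroyed;
    explicit Child(int& counter) : destroyed(counter) {}
    ~Child() { ++destroyed; }
};

// Builds a parent/child pair that refer to each other, lets both handles go
// out of scope, and returns how many of the two objects were destroyed.
int destroyedAfterScope() {
    int destroyed = 0;
    {
        auto parent = std::make_shared<Parent>(destroyed);
        auto child = std::make_shared<Child>(destroyed);
        parent->child = child;
        child->parent = parent;  // does not increase the parent's use_count
    }
    return destroyed;
}
```

If Child::parent were a shared_ptr instead, each object would keep the other's reference count above zero and neither destructor would run when the handles go out of scope. Code that needs the parent calls parent.lock() and checks the result for null.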



Checklists

Memory leak debugging

  • Memory monitoring (Prometheus, Grafana)
  • Pattern analysis (linear, stepped, periodic)
  • Tool choice (Valgrind, ASan, Heaptrack)
  • Repro environment (load tests)
  • Call stack analysis (where allocations happen)
  • Root cause (why no free?)
  • Fix (RAII, smart pointers)
  • Verify (profile before/after)
  • Sanitizers in CI
  • Update review guidelines

Memory-safe coding

  • new/delete pairing or smart pointers
  • RAII for resources
  • Unsubscribe path for callbacks/listeners
  • Check for cycles (weak_ptr)
  • Exception safety (still freed on throw?)
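The exception-safety item can be checked with a small experiment: allocate through unique_ptr, throw before the function returns, and confirm the object was still destroyed. A minimal sketch; Buffer, processOrThrow, and liveBuffersAfterFailure are hypothetical names used only for this illustration.

```cpp
#include <memory>
#include <stdexcept>

// Tracks how many Buffer objects are currently alive via an external counter.
struct Buffer {
    int& live;
    explicit Buffer(int& counter) : live(counter) { ++live; }
    ~Buffer() { --live; }
};

// Stand-in for work that may fail midway.
void processOrThrow(bool fail) {
    if (fail) {
        throw std::runtime_error("processing failed");
    }
}

// Allocates a Buffer, then fails. Returns the number of Buffers still alive
// after the exception has been handled.
int liveBuffersAfterFailure() {
    int live = 0;
    try {
        auto buffer = std::make_unique<Buffer>(live);
        processOrThrow(true);
        // With a raw `new Buffer(...)` and a `delete` placed here, the throw
        // above would skip the delete and the Buffer would leak.
    } catch (const std::runtime_error&) {
        // Stack unwinding has already destroyed `buffer` at this point.
    }
    return live;
}
```

The same reasoning applies to early returns: any exit path from the scope runs the unique_ptr destructor, so no path needs its own cleanup code.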

Keywords

C++, memory leak, debugging, Valgrind, ASan, AddressSanitizer, Heaptrack, production, case study, RAII, smart pointers, profiling, CI/CD