C++ Crash Debugging Case Study | Fixing an Intermittent Segmentation Fault

Key points

Debugging an intermittent segfault with core dumps, gdb, rr, and TSan: a real production case.

Introduction

“Sometimes the server dies” is among the hardest bug reports to act on. This post walks through a hard-to-reproduce intermittent crash that was tracked down with core dumps, gdb, and rr, then confirmed with TSan.

What you will learn

  • How to configure and use core dumps
  • gdb techniques at the crash site
  • Using rr when local repro fails
  • Debugging data races in multithreaded code

Table of contents

  1. Symptom: intermittent SIGSEGV
  2. Core dump setup
  3. gdb: analyzing the crash
  4. Hypothesis: dangling pointer?
  5. Reproduction fails locally
  6. Recording with rr
  7. Reverse debugging
  8. Root cause: data race
  9. Fix: synchronization
  10. TSan validation
  11. After the fix
  12. Lessons
  13. More techniques
  14. Closing thoughts

1. Symptom: intermittent SIGSEGV

Production

Roughly 1–2 segfaults per day:

$ dmesg | tail
[12345.678] chat_server[23456]: segfault at 0 ip 00007f1234567890 sp 00007fff12345678 error 4 in chat_server

Characteristics

  • No reliable repro in dev
  • Intermittent, seemingly random
  • More frequent under load

2. Core dump setup

System

$ ulimit -c unlimited

$ sudo sysctl -w kernel.core_pattern=/var/coredumps/core.%e.%p.%t

$ sudo mkdir -p /var/coredumps
$ sudo chmod 1777 /var/coredumps

Server

$ ulimit -c
unlimited

$ ./chat_server
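Because ulimit -c only applies to processes started from that shell, a server launched by a service manager may still run with a core limit of 0. One option is to raise the limit from inside the process at startup. The following is a minimal sketch, assuming a POSIX system; enableCoreDumps is a hypothetical helper name, not part of the original server.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_CORE (POSIX)

// Hypothetical helper: raise the soft core-file size limit to the hard limit.
// Returns true if both system calls succeeded.
bool enableCoreDumps() {
    struct rlimit lim{};
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;
    }
    // An unprivileged process may raise its soft limit up to the hard limit.
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

Calling this once near the top of main() would be enough. If the hard limit itself is 0, the call succeeds but no core is written, so the service configuration still has to allow cores.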

Next morning

$ ls -lh /var/coredumps/
-rw------- 1 user user 1.2G Mar 30 03:42 core.chat_server.23456.1711756920

3. gdb: analyzing the crash

Load core

$ gdb ./chat_server /var/coredumps/core.chat_server.23456.1711756920

(gdb) bt
#0  0x00007f1234567890 in std::vector<Message>::operator[] (this=0x0, __n=5)
    at /usr/include/c++/11/bits/stl_vector.h:1046
#1  0x00007f2345678901 in ChatRoom::broadcast (this=0x7f3456789012, msg=...)
    at src/chat_room.cpp:145
...

Finding

  • Crash directly beneath ChatRoom::broadcast (src/chat_room.cpp:145): frame #0 runs with this=0x0, a null dereference that matches “segfault at 0” in dmesg
  • Thread: Asio worker

4. Hypothesis: dangling pointer?

Suspect code

class Connection {
    ChatRoom* room_; // raw pointer
    
public:
    void handleMessage(const std::string& msg) {
        if (room_) {
            room_->broadcast(msg);
        }
    }
    
    void leaveRoom() {
        room_ = nullptr;
    }
};

class ChatRoom {
    std::vector<Connection*> connections_;
    
public:
    void broadcast(const std::string& msg) {
        for (auto* conn : connections_) {
            conn->send(msg);
        }
    }
};

Hypotheses

  1. Connection still points at a destroyed ChatRoom
  2. Multithreaded race clearing room_ while in use

5. Reproduction fails locally

$ ./load_test.sh --users=1000 --duration=600
# No crash

Why

  • Timing-dependent races
  • Different CPU and load vs production
  • Rare interleaving

Conclusion: with no reliable local repro, record the failure with rr and debug the recording.


6. Recording with rr

Setup

$ sudo apt install rr

$ echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid

Record

$ rr record ./chat_server

rr: Saving execution to trace directory `/root/.local/share/rr/chat_server-0'.

7. Reverse debugging

$ rr replay /root/.local/share/rr/chat_server-0

(rr) c
Program received signal SIGSEGV, Segmentation fault.
0x00007f1234567890 in ChatRoom::broadcast (this=0x0, msg=...)

Watch room_ from the Connection::handleMessage frame

(rr) up
(rr) watch -l room_

(rr) reverse-continue
Old value = (ChatRoom*) 0x7f3456789012
New value = (ChatRoom*) 0x0

(rr) bt
#0  Connection::leaveRoom() at src/connection.cpp:67
#1  ChatRoom::removeConnection() at src/chat_room.cpp:89
#2  Server::handleDisconnect() at src/server.cpp:123

Insight

Thread A: inside handleMessage (using room_)
Thread B: leaveRoom sets room_ = nullptr
Data race


8. Root cause: data race

Interleaving

// Thread A
void Connection::handleMessage(const std::string& msg) {
    if (room_) {              // room_ looks valid
        room_->broadcast(msg); // B may clear room_ here → crash
    }
}

// Thread B
void Connection::leaveRoom() {
    room_ = nullptr;
}

Timeline

Time  | Thread A                    | Thread B
------|-----------------------------|-----------------------
t0    | if (room_) { // true        |
t1    |                             | room_ = nullptr;
t2    | room_->broadcast(msg);      |
      | SIGSEGV                     |
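The timeline shows that room_ is read twice: once for the check and once for the call. A minimal sketch of closing that window on the pointer itself is to read it once into a local, here through std::atomic. This is an illustration, not the fix the post settles on. The stub ChatRoom, joinRoom, and the bool return value are assumptions added for the example. It does not keep the ChatRoom alive, and it does not serialize broadcast.

```cpp
#include <atomic>
#include <string>

// Stand-in for the real ChatRoom; it only counts broadcasts.
struct ChatRoom {
    int delivered = 0;
    void broadcast(const std::string&) { ++delivered; }
};

class Connection {
    std::atomic<ChatRoom*> room_{nullptr};

public:
    void joinRoom(ChatRoom* room) { room_.store(room); }
    void leaveRoom() { room_.store(nullptr); }

    // Returns true if the message was handed to a room.
    bool handleMessage(const std::string& msg) {
        // Single read: the check and the call below see the same value,
        // even if another thread calls leaveRoom() in between.
        ChatRoom* room = room_.load();
        if (room == nullptr) {
            return false;
        }
        room->broadcast(msg);
        return true;
    }
};
```

With this shape, a concurrent leaveRoom() can no longer turn a passed check into a null call. If the room can be destroyed while a handler is still running, the local pointer dangles, which is why lifetime still has to be handled separately.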

9. Fix: synchronization

Mutex

class Connection {
    mutable std::mutex roomMutex_;
    ChatRoom* room_ = nullptr; // guarded by roomMutex_
    
public:
    void handleMessage(const std::string& msg) {
        std::lock_guard<std::mutex> lock(roomMutex_);
        if (room_) {
            room_->broadcast(msg);
        }
    }
    
    void leaveRoom() {
        std::lock_guard<std::mutex> lock(roomMutex_);
        room_ = nullptr;
    }
};

shared_ptr + weak_ptr

class Connection {
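    // Lifetime-safe, but lock() racing with reset() or assignment on this same
    // weak_ptr object is still a data race: pair it with a mutex or a strand.
    std::weak_ptr<ChatRoom> room_;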
    
public:
    void handleMessage(const std::string& msg) {
        if (auto room = room_.lock()) {
            room->broadcast(msg);
        }
    }
    
    void setRoom(std::shared_ptr<ChatRoom> room) {
        room_ = room;
    }
    
    void leaveRoom() {
        room_.reset();
    }
};
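To show what the weak_ptr variant buys in terms of lifetime, here is a small single-threaded sketch. The stub ChatRoom, the bool return value, and demoRoomDestroyedFirst are assumptions made for the example. It shows only that lock() refuses to hand out a destroyed room, and says nothing about concurrent access to the weak_ptr itself.

```cpp
#include <memory>
#include <string>

// Stand-in for the real ChatRoom; it only counts broadcasts.
struct ChatRoom {
    int delivered = 0;
    void broadcast(const std::string&) { ++delivered; }
};

class Connection {
    std::weak_ptr<ChatRoom> room_;

public:
    void setRoom(const std::shared_ptr<ChatRoom>& room) { room_ = room; }

    // Returns true if the message reached a live room.
    bool handleMessage(const std::string& msg) {
        if (auto room = room_.lock()) {  // keeps the room alive within this scope
            room->broadcast(msg);
            return true;
        }
        return false;  // room already destroyed, or never set
    }
};

// Sends one message while the room exists and one after it is destroyed.
// Returns how many of the two were delivered.
int demoRoomDestroyedFirst() {
    Connection conn;
    int delivered = 0;
    {
        auto room = std::make_shared<ChatRoom>();
        conn.setRoom(room);
        delivered += conn.handleMessage("before") ? 1 : 0;
    }  // last shared_ptr released here; the ChatRoom is destroyed
    delivered += conn.handleMessage("after") ? 1 : 0;
    return delivered;
}
```

Only the first message is delivered. The second call finds an expired weak_ptr and returns false instead of touching freed memory.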

Asio strand (serialize handlers)

class Connection {
    boost::asio::strand<boost::asio::io_context::executor_type> strand_;
    ChatRoom* room_ = nullptr; // only touched from handlers posted to strand_
    
public:
    void handleMessage(const std::string& msg) {
        boost::asio::post(strand_, [this, msg]() {
            if (room_) {
                room_->broadcast(msg);
            }
        });
    }
    
    void leaveRoom() {
        boost::asio::post(strand_, [this]() {
            room_ = nullptr;
        });
    }
};
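The guarantee the strand version relies on is that handlers posted to the same strand never run concurrently and run in the order they were posted. The following sketch imitates that guarantee with the standard library only, so it can be compiled without Boost. SerialQueue and demoOrder are hypothetical names, and this is not how Boost.Asio implements strands. A real strand runs handlers on io_context threads rather than in an explicit drain() call.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a strand: tasks may be posted from any thread,
// but they only run inside drain(), one at a time, in FIFO order.
class SerialQueue {
    std::mutex mutex_;
    std::deque<std::function<void()>> tasks_;

public:
    void post(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(task));
    }

    // Runs queued tasks until the queue is empty. Call from a single thread.
    void drain() {
        for (;;) {
            std::function<void()> task;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            task();  // run outside the lock so a task may post more work
        }
    }
};

// The pointer is only touched from queued tasks, so the check-and-use in one
// task can never interleave with the clear in another.
std::vector<int> demoOrder() {
    SerialQueue queue;
    int room = 42;         // stands in for a valid ChatRoom
    int* roomPtr = &room;  // stands in for room_
    std::vector<int> seen;

    queue.post([&] { if (roomPtr) seen.push_back(*roomPtr); });  // handleMessage
    queue.post([&] { roomPtr = nullptr; });                      // leaveRoom
    queue.post([&] { if (roomPtr) seen.push_back(*roomPtr); });  // handleMessage after leave
    queue.drain();
    return seen;
}
```

The first task sees the room, the third does not, and no task ever observes a half-finished check. That ordering is what the strand provides without an explicit lock around room_.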

10. TSan validation

$ g++ -g -O1 -fsanitize=thread -std=c++17 *.cpp -o chat_server_tsan
$ ./chat_server_tsan

WARNING: ThreadSanitizer: data race (pid=12345)
  Write of size 8 at 0x7f1234567890 by thread T2:
    #0 Connection::leaveRoom() src/connection.cpp:67
    
  Previous read of size 8 at 0x7f1234567890 by thread T1:
    #0 Connection::handleMessage() src/connection.cpp:45
...
SUMMARY: ThreadSanitizer: data race src/connection.cpp:67 in Connection::leaveRoom()
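TSan only reports races on code paths that actually execute, so it helps to have a small test that forces the two operations to overlap. Below is a sketch of such a stress test against the mutex variant. The stub ChatRoom, joinRoom, and stressConnection are assumptions for the example. Built with -fsanitize=thread, it should stay quiet. Removing the lock_guard lines would be expected to produce a report like the one above.

```cpp
#include <atomic>
#include <mutex>
#include <string>
#include <thread>

// Stand-in for the real ChatRoom; the counter is atomic so the stub itself is race-free.
struct ChatRoom {
    std::atomic<int> delivered{0};
    void broadcast(const std::string&) { delivered.fetch_add(1); }
};

class Connection {
    std::mutex roomMutex_;
    ChatRoom* room_ = nullptr;  // guarded by roomMutex_

public:
    void joinRoom(ChatRoom* room) {
        std::lock_guard<std::mutex> lock(roomMutex_);
        room_ = room;
    }
    void leaveRoom() {
        std::lock_guard<std::mutex> lock(roomMutex_);
        room_ = nullptr;
    }
    void handleMessage(const std::string& msg) {
        std::lock_guard<std::mutex> lock(roomMutex_);
        if (room_) room_->broadcast(msg);
    }
};

// One thread sends messages while another repeatedly leaves and rejoins.
// Returns the number of delivered messages (between 0 and iterations).
int stressConnection(int iterations) {
    ChatRoom room;
    Connection conn;
    conn.joinRoom(&room);

    std::thread sender([&] {
        for (int i = 0; i < iterations; ++i) conn.handleMessage("hi");
    });
    std::thread toggler([&] {
        for (int i = 0; i < iterations; ++i) {
            conn.leaveRoom();
            conn.joinRoom(&room);
        }
    });
    sender.join();
    toggler.join();
    return room.delivered.load();
}
```

The exact count varies from run to run because it depends on scheduling. The useful signal is that the run finishes without a crash and without a TSan warning.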

11. After the fix

Load / soak

$ ./chat_server_tsan
# 24h run → 0 races reported

# Production: 1 week → 0 crashes (example)

Overhead (illustrative)

Approach | Overhead | Safety
---------|----------|-------------
Mutex    | ~5%      | High
weak_ptr | ~10%     | Very high
Strand   | ~2%      | High (Asio)

We chose strand (already on Asio).


12. Lessons

Takeaways

  1. Enable core dumps in production where policy allows
  2. rr is powerful for “can’t repro” bugs
  3. TSan in CI catches races early
  4. Strand / locks for shared mutable state

Intermittent crash workflow

graph TD
    A[Crash] --> B{Core dump?}
    B -->|Yes| C[gdb backtrace]
    B -->|No| D[Fix core settings, wait]
    C --> E{Repro locally?}
    E -->|Yes| F[gdb]
    E -->|No| G[rr record]
    G --> H[rr replay / reverse]
    H --> I[Root cause]
    I --> J[Fix]
    J --> K[TSan / ASan]

Patterns

// Bad: unsynchronized shared state
class BadConnection {
    ChatRoom* room_;
    void handleMessage(const std::string& msg) {
        if (room_) room_->broadcast(msg); // race
    }
};

// Good: mutex
class GoodConnection {
    std::mutex mutex_;
    ChatRoom* room_;
    void handleMessage(const std::string& msg) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (room_) room_->broadcast(msg);
    }
};

13. More techniques

(gdb) watch room_
(gdb) continue

(gdb) break Connection::handleMessage if room_ == 0

(gdb) info threads
(gdb) thread 2
(gdb) bt

Closing thoughts

  1. Core dumps pinpointed the faulting instruction
  2. gdb showed the stack
  3. rr made nondeterminism debuggable
  4. TSan confirmed the race
  5. Strand serialized access safely

You can fix “unreproducible” crashes with the right tools.


FAQ

Q1. rr in production?

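Possible, but expect real overhead: rr runs all of a process's threads on a single core, so multithreaded throughput drops. Some teams run a subset of hosts under rr for nasty bugs.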

Q2. Cores are huge

Pipe core_pattern to a compressor, or cap size with ulimit -c.

Q3. TSan + ASan together?

No. TSan and ASan cannot be combined in one build, so run them as separate CI jobs.


Checklists

Crash debugging

  • Core dump policy
  • Collect crash logs
  • gdb backtrace
  • Try repro
  • If not, rr
  • TSan if race suspected
  • Fix and soak-test
  • Sanitizers in CI

Thread safety

  • Identify shared mutable state
  • Protect with mutex / strand / atomics
  • Prefer smart pointers for lifetime
  • TSan on tests
  • Review deadlock risk
