C++ Crash Debugging Case Study | Fixing an Intermittent Segmentation Fault
Key takeaway
Intermittent segfault debugging with core dumps, gdb, rr, and TSan—real production case.
Introduction
“Sometimes the server dies” is among the hardest reports. This post covers a hard-to-reproduce intermittent crash solved with core dumps, gdb, and rr.
What you will learn
- How to configure and use core dumps
- gdb techniques at the crash site
- Using rr when local repro fails
- Debugging data races in multithreaded code
Table of contents
- Symptom: intermittent SIGSEGV
- Core dump setup
- gdb: analyzing the crash
- Hypothesis: dangling pointer?
- Reproduction fails locally
- Recording with rr
- Reverse debugging
- Root cause: data race
- Fix: synchronization
- TSan validation
- After the fix
- Lessons
- More techniques
- Closing thoughts
- FAQ
1. Symptom: intermittent SIGSEGV
Production
Roughly 1–2 segfaults per day:
$ dmesg | tail
[12345.678] chat_server[23456]: segfault at 0 ip 00007f1234567890 sp 00007fff12345678 error 4 in chat_server
Characteristics
- No reliable repro in dev
- Intermittent, seemingly random
- More frequent under load
2. Core dump setup
System
$ ulimit -c unlimited
$ sudo sysctl -w kernel.core_pattern=/var/coredumps/core.%e.%p.%t
$ sudo mkdir -p /var/coredumps
$ sudo chmod 1777 /var/coredumps
Server
$ ulimit -c
unlimited
$ ./chat_server
Next morning
$ ls -lh /var/coredumps/
-rw------- 1 user user 1.2G Mar 30 03:42 core.chat_server.23456.1711756920
3. gdb: analyzing the crash
Load core
$ gdb ./chat_server /var/coredumps/core.chat_server.23456.1711756920
(gdb) bt
#0 0x00007f1234567890 in std::vector<Message>::operator[] (this=0x0, __n=5)
at /usr/include/c++/11/bits/stl_vector.h:1046
#1 0x00007f2345678901 in ChatRoom::broadcast (this=0x0, msg=...)
at src/chat_room.cpp:145
...
Finding
- Crash in ChatRoom::broadcast with this=0x0 (null dereference)
- Thread: Asio worker
4. Hypothesis: dangling pointer?
Suspect code
class Connection {
ChatRoom* room_; // raw pointer
public:
void handleMessage(const std::string& msg) {
if (room_) {
room_->broadcast(msg);
}
}
void leaveRoom() {
room_ = nullptr;
}
};
class ChatRoom {
std::vector<Connection*> connections_;
public:
void broadcast(const std::string& msg) {
for (auto* conn : connections_) {
conn->send(msg);
}
}
};
Hypotheses
- Connection still points at a destroyed ChatRoom
- Multithreaded race clears room_ while it is in use
5. Reproduction fails locally
$ ./load_test.sh --users=1000 --duration=600
# No crash
Why
- Timing-dependent races
- Different CPU and load vs production
- Rare interleaving
Conclusion: debug without easy repro → try rr
6. Recording with rr
Setup
$ sudo apt install rr
$ echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
Record
$ rr record ./chat_server
rr: Saving execution to trace directory `/root/.local/share/rr/chat_server-0'.
7. Reverse debugging
$ rr replay /root/.local/share/rr/chat_server-0
(rr) c
Program received signal SIGSEGV, Segmentation fault.
0x00007f1234567890 in ChatRoom::broadcast (this=0x0, msg=...)
Watch room_
Select the frame where room_ is in scope, set a location watchpoint, and run backwards to the write that nulled it:
(rr) up
(rr) watch -l room_
(rr) reverse-continue
Old value = (ChatRoom *) 0x0
New value = (ChatRoom *) 0x7f3456789012
(Under reverse execution "Old" is the later value in program time, so the transition reads backwards: the write changed the pointer from 0x7f3456789012 to 0x0.)
(rr) bt
#0 Connection::leaveRoom() at src/connection.cpp:67
#1 ChatRoom::removeConnection() at src/chat_room.cpp:89
#2 Server::handleDisconnect() at src/server.cpp:123
Insight
Thread A: inside handleMessage (using room_)
Thread B: leaveRoom sets room_ = nullptr
→ Data race
8. Root cause: data race
Interleaving
// Thread A
void Connection::handleMessage(const std::string& msg) {
if (room_) { // room_ looks valid
room_->broadcast(msg); // B may clear room_ here → crash
}
}
// Thread B
void Connection::leaveRoom() {
room_ = nullptr;
}
Timeline
Time | Thread A | Thread B
------|-----------------------------|-----------------------
t0 | if (room_) { // true |
t1 | | room_ = nullptr;
t2 | room_->broadcast(msg); |
| SIGSEGV |
9. Fix: synchronization
Mutex
class Connection {
mutable std::mutex roomMutex_;
ChatRoom* room_;
public:
void handleMessage(const std::string& msg) {
std::lock_guard<std::mutex> lock(roomMutex_);
if (room_) {
room_->broadcast(msg);
}
}
void leaveRoom() {
std::lock_guard<std::mutex> lock(roomMutex_);
room_ = nullptr;
}
};
shared_ptr + weak_ptr
class Connection {
std::weak_ptr<ChatRoom> room_;
public:
void handleMessage(const std::string& msg) {
if (auto room = room_.lock()) {
room->broadcast(msg);
}
}
void setRoom(std::shared_ptr<ChatRoom> room) {
room_ = room;
}
void leaveRoom() {
room_.reset();
}
};
Asio strand (serialize handlers)
class Connection {
boost::asio::strand<boost::asio::io_context::executor_type> strand_;
ChatRoom* room_;
public:
void handleMessage(const std::string& msg) {
    // note: real code should capture shared_from_this() so *this outlives the post
    boost::asio::post(strand_, [this, msg]() {
if (room_) {
room_->broadcast(msg);
}
});
}
void leaveRoom() {
boost::asio::post(strand_, [this]() {
room_ = nullptr;
});
}
};
10. TSan validation
$ g++ -g -O1 -fsanitize=thread -std=c++17 *.cpp -o chat_server_tsan
$ ./chat_server_tsan
WARNING: ThreadSanitizer: data race (pid=12345)
Write of size 8 at 0x7f1234567890 by thread T2:
#0 Connection::leaveRoom() src/connection.cpp:67
Previous read of size 8 at 0x7f1234567890 by thread T1:
#0 Connection::handleMessage() src/connection.cpp:45
...
SUMMARY: ThreadSanitizer: data race src/connection.cpp:67 in Connection::leaveRoom()
11. After the fix
Load / soak
$ ./chat_server_tsan
# 24h run → 0 races reported
# Production: 1 week → 0 crashes (example)
Overhead (illustrative)
| Approach | Overhead | Safety |
|---|---|---|
| Mutex | ~5% | High |
| weak_ptr | ~10% | Very high |
| Strand | ~2% | High (Asio) |
We chose the strand approach, since the server was already built on Asio.
12. Lessons
Takeaways
- Enable core dumps in production where policy allows
- rr is powerful for “can’t repro” bugs
- TSan in CI catches races early
- Strand / locks for shared mutable state
Intermittent crash workflow
graph TD
A[Crash] --> B{Core dump?}
B -->|Yes| C[gdb backtrace]
B -->|No| D[Fix core settings, wait]
C --> E{Repro locally?}
E -->|Yes| F[gdb]
E -->|No| G[rr record]
G --> H[rr replay / reverse]
H --> I[Root cause]
I --> J[Fix]
J --> K[TSan / ASan]
Patterns
// Bad: unsynchronized shared state
class BadConnection {
ChatRoom* room_;
void handleMessage(const std::string& msg) {
if (room_) room_->broadcast(msg); // race
}
};
// Good: mutex
class GoodConnection {
std::mutex mutex_;
ChatRoom* room_;
void handleMessage(const std::string& msg) {
std::lock_guard<std::mutex> lock(mutex_);
if (room_) room_->broadcast(msg);
}
};
13. More techniques
Watchpoint on the member (stops whenever room_ changes):
(gdb) watch room_
(gdb) continue
Conditional breakpoint (fires only when room_ is already null):
(gdb) break Connection::handleMessage if room_ == 0
Inspect the other threads at the stop:
(gdb) info threads
(gdb) thread 2
(gdb) bt
Closing thoughts
- Core dumps pinpointed the faulting instruction
- gdb showed the stack
- rr made nondeterminism debuggable
- TSan confirmed the race
- Strand serialized access safely
You can fix “unreproducible” crashes with the right tools.
FAQ
Q1. rr in production?
Possible with overhead; some teams run a subset of hosts under rr for nasty bugs.
Q2. Cores are huge
Pipe core_pattern to a compressor, or cap size with ulimit -c.
Q3. TSan + ASan together?
No—one sanitizer per process. Run separate CI jobs.
Related posts
- C++ debugging tips
- C++ thread safety
- C++ Asio strand
- C++ smart pointers
Checklists
Crash debugging
- Core dump policy
- Collect crash logs
- gdb backtrace
- Try repro
- If not, rr
- TSan if race suspected
- Fix and soak-test
- Sanitizers in CI
Thread safety
- Identify shared mutable state
- Protect with mutex / strand / atomics
- Prefer smart pointers for lifetime
- TSan on tests
- Review deadlock risk
Keywords
C++, crash, segmentation fault, core dump, gdb, rr, reverse debugging, data race, ThreadSanitizer, TSan, multithreading, case study