Why are intermittent crashes hard to reproduce?

Scheduling, thread interleaving, and memory layout vary from run to run, so the symptoms vary too. Core dumps, backtraces, and logs of the last known-good state are the evidence you have to work with.

When should I use rr (record and replay)?

When you need deterministic replay of nondeterministic failures—if gdb alone is not enough and your environment supports rr.

No core dump is written—what now?

Check ulimit, container settings, and systemd core patterns. Production often disables dumps; reproduce in staging too.

What to suspect first in multithreaded segfaults?

Races, use-after-free, bad lock ordering. ThreadSanitizer and minimal repros narrow scope.

C++ Crash Debugging Case Study | Fixing an Intermittent Segmentation Fault

March 30, 2026 · 24 min read · Updated March 30, 2026 · Advanced

Key takeaway

Intermittent segfault debugging with core dumps, gdb, rr, and TSan—real production case.

Introduction

“Sometimes the server dies” is among the hardest reports. This post covers a hard-to-reproduce intermittent crash solved with core dumps, gdb, and rr.

What you will learn

How to configure and use core dumps
gdb techniques at the crash site
Using rr when local repro fails
Debugging data races in multithreaded code

Symptom: intermittent SIGSEGV
Core dump setup
gdb: analyzing the crash
Hypothesis: dangling pointer?
Reproduction fails locally
Recording with rr
Reverse debugging
Root cause: data race
Fix: synchronization
Validation with TSan
Closing thoughts

1. Symptom: intermittent SIGSEGV

Production

Roughly 1–2 segfaults per day:

$ dmesg | tail
[12345.678] chat_server[23456]: segfault at 0 ip 00007f1234567890 sp 00007fff12345678 error 4 in chat_server

Characteristics

No reliable repro in dev
Intermittent, seemingly random
More frequent under load

2. Core dump setup

System

$ ulimit -c unlimited

$ sudo sysctl -w kernel.core_pattern=/var/coredumps/core.%e.%p.%t

$ sudo mkdir -p /var/coredumps
$ sudo chmod 1777 /var/coredumps
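
The shell commands above only affect processes started from that shell. A server launched by a supervisor can raise its own limit at startup instead. The following is a minimal sketch, assuming a POSIX system; `enableCoreDumps` is a hypothetical helper name, not something from the chat server's code.

```cpp
#include <sys/resource.h>

// Hypothetical helper: raise the soft core-size limit to the hard limit.
// Returns false if either system call fails.
bool enableCoreDumps() {
    struct rlimit limit{};
    if (getrlimit(RLIMIT_CORE, &limit) != 0) {
        return false;
    }
    // An unprivileged process may raise the soft limit up to the hard limit, not past it.
    limit.rlim_cur = limit.rlim_max;
    return setrlimit(RLIMIT_CORE, &limit) == 0;
}
```

Calling this early in `main` has the same effect as `ulimit -c` for that process only. If the hard limit is 0 (common in containers), the call succeeds but no core is written, so the outer configuration still has to allow dumps.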

Server

$ ulimit -c
unlimited

$ ./chat_server

Next morning

$ ls -lh /var/coredumps/
-rw------- 1 user user 1.2G Mar 30 03:42 core.chat_server.23456.1711756920

3. gdb: analyzing the crash

Load core

$ gdb ./chat_server /var/coredumps/core.chat_server.23456.1711756920

(gdb) bt
#0  0x00007f1234567890 in std::vector<Message>::operator[] (this=0x0, __n=5)
    at /usr/include/c++/11/bits/stl_vector.h:1046
#1  0x00007f2345678901 in ChatRoom::broadcast (this=0x7f3456789012, msg=...)
    at src/chat_room.cpp:145
...

Finding

Crash reached through ChatRoom::broadcast (frame #1); the faulting frame #0 has this=0x0, a null dereference that matches "segfault at 0" in dmesg
Thread: Asio worker

4. Hypothesis: dangling pointer?

Suspect code

class Connection {
    ChatRoom* room_; // raw pointer
    
public:
    void handleMessage(const std::string& msg) {
        if (room_) {
            room_->broadcast(msg);
        }
    }
    
    void leaveRoom() {
        room_ = nullptr;
    }
};

class ChatRoom {
    std::vector<Connection*> connections_;
    
public:
    void broadcast(const std::string& msg) {
        for (auto* conn : connections_) {
            conn->send(msg);
        }
    }
};

Hypotheses

Connection still points at a destroyed ChatRoom
Multithreaded race clearing room_ while in use

5. Reproduction fails locally

$ ./load_test.sh --users=1000 --duration=600
# No crash

Why

Timing-dependent races
Different CPU and load vs production
Rare interleaving

Conclusion: debug without easy repro → try rr
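
Before giving up on a local repro, it can help to drive the two suspect operations against each other directly instead of relying on a generic load test. The sketch below is one way to do that with the standard library only. `hammer` is a hypothetical helper, and it raises the odds of a bad interleaving without guaranteeing one.

```cpp
#include <atomic>
#include <thread>

// Hypothetical harness: run `first` and `second` on two threads, `rounds` times each,
// releasing both threads at the same moment to widen the race window.
template <class F, class G>
void hammer(F first, G second, int rounds) {
    std::atomic<int> ready{0};
    auto runner = [&ready, rounds](auto& op) {
        ready.fetch_add(1);
        while (ready.load() < 2) {
            std::this_thread::yield();  // spin until both threads have arrived
        }
        for (int i = 0; i < rounds; ++i) {
            op();
        }
    };
    std::thread a([&] { runner(first); });
    std::thread b([&] { runner(second); });
    a.join();
    b.join();
}
```

In this case study the two callables would wrap `handleMessage` and `leaveRoom` on the same Connection. Run under a ThreadSanitizer build, a harness like this tends to surface a report even when the crash itself never fires.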

6. Recording with rr

Setup

$ sudo apt install rr

$ echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid

Record

$ rr record ./chat_server

rr: Saving execution to trace directory `/root/.local/share/rr/chat_server-0'.

7. Reverse debugging

$ rr replay /root/.local/share/rr/chat_server-0

(rr) c
Program received signal SIGSEGV, Segmentation fault.
0x00007f1234567890 in ChatRoom::broadcast (this=0x0, msg=...)

Watch `room_`

(rr) up
(rr) watch -l room_

(rr) reverse-continue
Old value = (ChatRoom*) 0x7f3456789012
New value = (ChatRoom*) 0x0

(rr) bt
#0  Connection::leaveRoom() at src/connection.cpp:67
#1  ChatRoom::removeConnection() at src/chat_room.cpp:89
#2  Server::handleDisconnect() at src/server.cpp:123

Insight

Thread A: inside handleMessage (using room_)
Thread B: leaveRoom sets room_ = nullptr
→ Data race

8. Root cause: data race

Interleaving

// Thread A
void Connection::handleMessage(const std::string& msg) {
    if (room_) {              // room_ looks valid
        room_->broadcast(msg); // B may clear room_ here → crash
    }
}

// Thread B
void Connection::leaveRoom() {
    room_ = nullptr;
}

Timeline

Time  | Thread A                    | Thread B
------|-----------------------------|-----------------------
t0    | if (room_) { // true        |
t1    |                             | room_ = nullptr;
t2    | room_->broadcast(msg);      |
      | SIGSEGV                     |

9. Fix: synchronization

Mutex

class Connection {
    mutable std::mutex roomMutex_;
    ChatRoom* room_;
    
public:
    void handleMessage(const std::string& msg) {
        std::lock_guard<std::mutex> lock(roomMutex_);
        if (room_) {
            room_->broadcast(msg);
        }
    }
    
    void leaveRoom() {
        std::lock_guard<std::mutex> lock(roomMutex_);
        room_ = nullptr;
    }
};

shared_ptr + weak_ptr

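class Connection {
    std::mutex roomMutex_;             // guards room_ itself, not the ChatRoom
    std::weak_ptr<ChatRoom> room_;
    
public:
    void handleMessage(const std::string& msg) {
        std::shared_ptr<ChatRoom> room;
        {
            // lock() racing with reset() on the same weak_ptr object is still a data race
            std::lock_guard<std::mutex> lock(roomMutex_);
            room = room_.lock();
        }
        if (room) {
            room->broadcast(msg);      // room keeps the ChatRoom alive for this call
        }
    }
    
    void setRoom(std::shared_ptr<ChatRoom> room) {
        std::lock_guard<std::mutex> lock(roomMutex_);
        room_ = room;
    }
    
    void leaveRoom() {
        std::lock_guard<std::mutex> lock(roomMutex_);
        room_.reset();
    }
};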
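
What weak_ptr buys here is protection against the first hypothesis, a Connection that outlives its ChatRoom. The single-threaded sketch below shows only that lifetime behaviour. `ChatRoom` and `Connection` are stand-in stubs, and the `bool` return value is added purely so the effect is observable.

```cpp
#include <memory>
#include <string>

// Stub room for illustration: counts how many messages it received.
struct ChatRoom {
    int delivered = 0;
    void broadcast(const std::string&) { ++delivered; }
};

// Stub connection: only the lifetime behaviour is modelled, not threading.
class Connection {
    std::weak_ptr<ChatRoom> room_;

public:
    void setRoom(const std::shared_ptr<ChatRoom>& room) { room_ = room; }

    // Returns true if the message reached a live room.
    bool handleMessage(const std::string& msg) {
        if (auto room = room_.lock()) {  // empty shared_ptr once the room is gone
            room->broadcast(msg);
            return true;
        }
        return false;
    }
};
```

Once the last `shared_ptr<ChatRoom>` is released, `handleMessage` becomes a no-op instead of touching freed memory. Concurrent writes to the same `weak_ptr` object are a separate problem and still need synchronization.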

Asio strand (serialize handlers)

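class Connection {
    boost::asio::strand<boost::asio::io_context::executor_type> strand_;
    ChatRoom* room_ = nullptr;
    
public:
    // A strand has no default constructor; build it from the server's io_context
    explicit Connection(boost::asio::io_context& io)
        : strand_(boost::asio::make_strand(io)) {}
    
    void handleMessage(const std::string& msg) {
        // Capturing this assumes the Connection outlives every queued handler
        boost::asio::post(strand_, [this, msg]() {
            if (room_) {
                room_->broadcast(msg);
            }
        });
    }
    
    void leaveRoom() {
        boost::asio::post(strand_, [this]() {
            room_ = nullptr;
        });
    }
};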
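
The property a strand provides is that handlers posted through it never run concurrently with each other. To make that guarantee concrete without pulling in Boost, here is a standard-library stand-in: a queue drained by a single worker thread. `SerialQueue` is a hypothetical name and this is not how Asio implements strands. It only illustrates why serialized handlers need no lock around `room_`.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for a strand: tasks run one at a time, in post order.
class SerialQueue {
public:
    SerialQueue() : worker_([this] { run(); }) {}

    ~SerialQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();  // the worker drains remaining tasks before exiting
    }

    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) {
                    return;  // stopping and nothing left to run
                }
                task = std::move(tasks_.front());
                tasks_.pop_front();
            }
            task();  // executed outside the lock, never concurrently with another task
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the other members exist before it starts
};
```

With both the read and the write of `room_` posted to the same queue, the check and the use in `handleMessage` can no longer be split by `leaveRoom`. A real strand gives the same ordering while still sharing the io_context's thread pool.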

10. TSan validation

$ g++ -g -O1 -fsanitize=thread -std=c++17 *.cpp -o chat_server_tsan

$ ./chat_server_tsan

WARNING: ThreadSanitizer: data race (pid=12345)
  Write of size 8 at 0x7f1234567890 by thread T2:
    #0 Connection::leaveRoom() src/connection.cpp:67
    
  Previous read of size 8 at 0x7f1234567890 by thread T1:
    #0 Connection::handleMessage() src/connection.cpp:45
...
SUMMARY: ThreadSanitizer: data race src/connection.cpp:67 in Connection::leaveRoom()

11. After the fix

Load / soak

$ ./chat_server_tsan
# 24h run → 0 races reported

# Production: 1 week → 0 crashes (example)

Overhead (illustrative)

Approach | Overhead | Safety
---------|----------|------------
Mutex    | ~5%      | High
weak_ptr | ~10%     | Very high
Strand   | ~2%      | High (Asio)

We chose the strand because the server already runs on Asio.

12. Lessons

Takeaways

Enable core dumps in production where policy allows
rr is powerful for “can’t repro” bugs
TSan in CI catches races early
Strand / locks for shared mutable state

Intermittent crash workflow

graph TD
    A[Crash] --> B{Core dump?}
    B -->|Yes| C[gdb backtrace]
    B -->|No| D[Fix core settings, wait]
    C --> E{Repro locally?}
    E -->|Yes| F[gdb]
    E -->|No| G[rr record]
    G --> H[rr replay / reverse]
    H --> I[Root cause]
    I --> J[Fix]
    J --> K[TSan / ASan]

Patterns

// Bad: unsynchronized shared state
class BadConnection {
    ChatRoom* room_;
    void handleMessage(const std::string& msg) {
        if (room_) room_->broadcast(msg); // race
    }
};

// Good: mutex
class GoodConnection {
    std::mutex mutex_;
    ChatRoom* room_;
    void handleMessage(const std::string& msg) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (room_) room_->broadcast(msg);
    }
};

13. More techniques

(gdb) watch room_
(gdb) continue

(gdb) break Connection::handleMessage if room_ == 0

(gdb) info threads
(gdb) thread 2
(gdb) bt
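
Where core dumps are disabled, as production often requires, a last-resort signal handler can at least log raw stack frames before the process dies. The sketch below assumes Linux with glibc (`<execinfo.h>`). `installCrashHandler` and `crashHandler` are hypothetical names, and `backtrace` is not formally async-signal-safe, so treat this as a diagnostic aid and not a guarantee.

```cpp
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc)
#include <signal.h>
#include <unistd.h>

// Hypothetical handler: write raw frames to stderr, then let the default action run.
static void crashHandler(int sig) {
    void* frames[64];
    const int count = backtrace(frames, 64);
    backtrace_symbols_fd(frames, count, STDERR_FILENO);  // writes straight to the fd
    // SA_RESETHAND already restored SIG_DFL, so the re-raised signal
    // terminates the process (and dumps core if allowed) once we return.
    raise(sig);
}

// Returns true if the handler was installed for SIGSEGV.
bool installCrashHandler() {
    void* warmup[1];
    backtrace(warmup, 1);  // first call may load libgcc; do that outside the handler

    struct sigaction action{};
    action.sa_handler = crashHandler;
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_RESETHAND;  // one-shot: default action is restored on entry
    return sigaction(SIGSEGV, &action, nullptr) == 0;
}
```

The addresses printed can be resolved offline against the unstripped binary. This complements a core dump where one is available and does not replace it.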

Closing thoughts

Core dumps pinpointed the faulting instruction
gdb showed the stack
rr made nondeterminism debuggable
TSan confirmed the race
Strand serialized access safely

You can fix “unreproducible” crashes with the right tools.

FAQ

Q1. rr in production?

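Possible, but expect real overhead: rr schedules all of a process's threads onto a single core, so a multithreaded server slows down noticeably. Some teams run a small subset of hosts under rr to catch nasty bugs.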

Q2. Cores are huge

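Start core_pattern with | to pipe each dump into a helper that compresses it, or cap the size with ulimit -c. A capped core is truncated, so gdb may not be able to read all of it.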

Q3. TSan + ASan together?

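No. TSan cannot be combined with ASan in the same binary; the compiler rejects -fsanitize=thread together with -fsanitize=address. Build and run them as separate CI jobs.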


Keywords

C++, crash, segmentation fault, core dump, gdb, rr, reverse debugging, data race, ThreadSanitizer, TSan, multithreading, case study