C++ vs Go | Performance, Concurrency, and Selection Guide [#47-1]

Key takeaways

C++ vs Go from a production angle: failure scenarios from bad stack choices, concurrency models (threads/Asio vs goroutines), memory and switching costs, benchmarks, mistakes to avoid, selection guide, and patterns.

Introduction: the “C++ or Go?” moment

Why compare them?

In backend and server work, C++ and Go both advertise high performance and concurrency. C++ keeps control in the developer’s hands with threads and event loops (Asio); Go’s goroutines (lightweight tasks scheduled by the Go runtime) and channels (communication between goroutines) let the runtime schedule tens of thousands of tasks with an M:N model. This article compares the two from a practical angle: real failure scenarios, a full comparison table, common mistakes, a selection guide, and production patterns.

What this article covers:

  • Problem scenarios: what goes wrong when the stack choice mismatches the requirements
  • Concurrency models: C++ threads/Asio vs Go goroutines and M:N scheduling
  • Full comparison: performance, memory, types, ecosystem, builds
  • Common mistakes: patterns to avoid in each language
  • Selection guide: when to pick which
  • Production patterns: designs that show up in real systems

Related posts: C++ in practice #7 — threads, Understanding Go through a C++ developer’s mental model.

A mental model

Treat the topic as a system of interlocking parts. Choices in one layer (storage, networking, observability) affect others, so the article grounds trade-offs in numbers and patterns.


Production note: this draws on large-scale C++ experience — pitfalls and debugging angles that textbooks often skip.

Table of contents

  1. Problem scenarios: when stack choice fails
  2. Concurrency model comparison
  3. Context switching and memory cost
  4. Performance and trade-offs
  5. Full C++ vs Go comparison
  6. Common mistakes
  7. Selection guide: what to choose when
  8. Production patterns
  9. Summary and checklist

1. Problem scenarios: when stack choice fails

Wrong technology choices tend to produce problems like these:

flowchart LR
    subgraph Mismatch[Stack mismatch]
        A[Requirements] -->|choose| B[Language / stack]
        B -->|expect| C[Behavior]
        A -.->|mismatch| C
        C --> D[Slowdowns / delays / ops pain]
    end

Scenario 1: “We built a web API in C++ and velocity collapsed”

Situation: A startup chose C++ for a REST API because “performance matters.” They used one thread per connection; at ~10k concurrent connections memory passed 80GB, and shipping slipped three months while they hand-rolled JSON, HTTP, and data access.

Why: For web/API servers, delivery speed and operational simplicity usually dominate. You may need C++’s peak performance, but most CRUD and microservices are fine in Go or Node.js. C++ often means assembling networking, parsing, and persistence yourself — high upfront cost.

What to do: Split “performance matters” into nanosecond-scale latency vs throughput. If you are in the millisecond range, Go or Rust may fit better. Reserve C++ for tight control: games, HFT, embedded, and similar domains.

Scenario 2: “We built HFT routing in Go and GC pauses broke latency”

Situation: A firm implemented order routing in Go because “Go is great at concurrency.” Under load, GC pauses and assist latency reached milliseconds, blowing a sub-microsecond latency budget.

Why: Go’s stop-the-world pauses are short in modern releases (typically well under a millisecond), but they are nondeterministic, and GC assists under allocation pressure can push tail latency into milliseconds. HFT and ultra-low-latency trading usually need GC-free C++ or Rust. Go fits web/API/microservices, not extreme tail latency.

What to do: If latency must stay below microseconds, prefer C++ or Rust. Go fits millisecond-class APIs and batch-style work.

Scenario 3: “One thread per connection for 100k users — OOM”

Situation: A chat server used one std::thread per connection. Around 10k connections, thread stacks alone exceeded 80GB and the process OOM’d.

Why: OS threads default to ~1–8MB stack each. 100k threads implies hundreds of GB. One thread per connection does not scale.

What to do: In C++, use an Asio-style event loop with a small thread pool and non-blocking I/O for many sockets. In Go, one goroutine per connection is idiomatic; stacks start around kilobytes so 100k is feasible.

Scenario 4: “We spawned goroutines for CPU work and got no speedup”

Situation: An image-resizing service launched one goroutine per request for CPU-heavy work. Only GOMAXPROCS goroutines (GOMAXPROCS defaults to the CPU count) ran in parallel; the rest waited — throughput was capped by core count.

Why: Goroutines use M:N scheduling; OS threads stay near core count. Extra goroutines do not create extra parallel CPU execution beyond cores. They shine for I/O-bound work.

What to do: Use a worker pool with about one goroutine per core for CPU-bound stages, or move hot paths to C++ and call via cgo (mind cgo overhead).

Scenario 5: “Asio callbacks nested five levels deep — maintenance hell”

Situation: An Asio HTTP server chained async_read → async_write → async_read five levels deep; error handling and timeouts were scattered across callbacks.

Why: Classic Asio is callback-heavy; complex async flows hurt readability. Go’s go func() plus channels often read like synchronous code.

What to do: Consider C++20 coroutines or a coroutine library; or split I/O-heavy services into Go microservices.

Scenario 6: “Our Go CLI binary is 50MB+”

Situation: A Go CLI produced a single static binary — great — but 15–30MB+ after linking the runtime and standard library, painful for embedded or small Lambda bundles.

Why: Go statically links dependencies by default. Even with -ldflags="-s -w", runtime, GC, and scheduler stay in the binary.

What to do: If size is critical, C++ or Rust static binaries can be much smaller. In Go, minimize with -trimpath, -ldflags="-s -w", and optionally UPX (with care).

Scenario 7: “C++ rebuilds take 30 minutes and CI times out”

Situation: Touching one header triggered 20–30 minute full rebuilds; CI hit job limits on every commit.

Why: C++ header dependencies are heavy. Popular headers (<iostream>, Boost, …) inflate compile units; templates live in headers and instantiate widely.

What to do: PCH, C++20 modules, ccache, incremental build tuning, or split services so hot code builds in smaller units. Go’s package-level incremental builds are usually fast.


2. Concurrency model comparison

Analogy: Concurrency is like one cook switching between pots; parallelism is multiple cooks on different dishes.

Model comparison

C++: OS threads + event loop

  • std::thread: 1:1 with OS threads. Creation/teardown is costly; each thread has a large stack (often 1–8MB). A one-thread-per-connection design collapses at tens of thousands of connections.
  • Asio: One (or few) threads run an event loop handling many sockets with non-blocking I/O. Completion handlers may fan out to a pool — thread count stays small vs connection count.
  • Control: You design scheduling, memory, and locking — more complexity, finer latency/throughput tuning.
flowchart TB
    subgraph Cpp[C++ model]
        T1[OS thread 1]
        T2[OS thread 2]
        T3[OS thread N]
        E[Event loop]
        S[~10k sockets]
        E --> S
        T1 --> E
        T2 --> E
        T3 --> E
    end

Runnable example (minimal C++ thread):

// Paste and run: g++ -std=c++17 -pthread -o cpp_concurrent cpp_concurrent.cpp && ./cpp_concurrent
#include <iostream>
#include <thread>
int main() {
    std::thread t([]{ std::cout << "C++ OS thread\n"; });
    std::cout << "main\n";
    t.join();
    return 0;
}

Go: goroutines + M:N scheduling

  • Goroutine: lightweight coroutine; KB-scale stack that grows as needed. Tens or hundreds of thousands of goroutines still map to roughly core-count OS threads.
  • M:N: Many goroutines (N) map to M OS threads; context switches often happen in user space and cost less than full thread switches.
  • Channels: idiomatic synchronization and communication — prefer “message passing” over “shared memory + locks” when possible.
flowchart TB
    subgraph Go[Go model]
        M1[OS thread 1]
        M2[OS thread 2]
        G1[Goroutine 1]
        G2[Goroutine 2]
        G3[Goroutine ...]
        GN[Goroutine ~100k]
        M1 --> G1
        M1 --> G2
        M2 --> G3
        M2 --> GN
    end

Runnable example (Go goroutine):

// go run main.go
package main
import (
    "fmt"
    "sync"
)
func main() {
    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        defer wg.Done()
        fmt.Println("Go goroutine")
    }()
    fmt.Println("main")
    wg.Wait()
}
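The minimal example above only shows a WaitGroup; the channel style mentioned in the bullets deserves its own sketch. Here workers consume jobs from one channel and push answers into another — no mutex anywhere (all names are illustrative):

```go
package main

import "fmt"

// sumSquares fans n jobs out to workers over a channel and
// collects the results over another channel — communication
// replaces shared state and locking.
func sumSquares(n, workers int) int {
	jobs := make(chan int)
	results := make(chan int)
	for w := 0; w < workers; w++ {
		go func() {
			for j := range jobs {
				results <- j * j
			}
		}()
	}
	go func() {
		for i := 1; i <= n; i++ {
			jobs <- i
		}
		close(jobs) // tell the workers no more work is coming
	}()
	sum := 0
	for i := 0; i < n; i++ {
		sum += <-results
	}
	return sum
}

func main() {
	fmt.Println(sumSquares(10, 4)) // 1² + 2² + … + 10² = 385
}
```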

Same job, different style

Task: fetch 10 URLs concurrently and collect results.

C++ (std::async):

#include <future>
#include <vector>
#include <string>
std::vector<std::string> fetchAll(const std::vector<std::string>& urls) {
    std::vector<std::future<std::string>> futures;
    for (const auto& url : urls) {
        futures.push_back(std::async(std::launch::async, [url]() {
            return fetchUrl(url);  // HTTP request
        }));
    }
    std::vector<std::string> results;
    for (auto& f : futures) {
        results.push_back(f.get());
    }
    return results;
}

Go (goroutines + wait group):

func fetchAll(urls []string) []string {
    results := make([]string, len(urls))
    var wg sync.WaitGroup
    for i, url := range urls {
        wg.Add(1)
        go func(idx int, u string) {
            defer wg.Done()
            results[idx] = fetchURL(u)
        }(i, url)
    }
    wg.Wait()
    return results
}

Difference: C++ gathers with std::future; Go synchronizes with sync.WaitGroup. Passing i and url as parameters avoids the classic loop-variable capture bug (Go 1.22+ also gives each iteration its own variable, but explicit passing stays clearest).

In one line

  • C++: you design threads and event loops — maximum control, higher complexity.
  • Go: spawn many goroutines; the runtime schedules — easier for I/O-heavy work and many concurrent connections.

3. Context switching and memory cost

C++ threads

  • Stack: often 1–8MB per thread by default. 10k threads can mean tens of GB of stack alone — one thread per connection rarely scales.
  • Context switch: kernel involvement; cache/TLB effects can land in microseconds. More threads → more switching cost.

Go goroutines

  • Stack: starts around KB; grows on demand. 100k goroutines still use far less stack memory than 100k OS threads.
  • Switching: user-space scheduler; often nanoseconds to microseconds — lighter than full thread switches. Blocking on I/O parks a goroutine and runs others — one goroutine per connection is natural.

Stack size (conceptual)

| Item | C++ std::thread | Go goroutine |
|---|---|---|
| Initial stack | 1–8MB (typical default) | ~2KB |
| Growth | fixed (tunable at creation) | grows as needed |
| ~10k instances | ~10–80GB of stacks | ~20MB order of magnitude |
| ~100k instances | impractical | ~200MB order of magnitude |
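These magnitudes are easy to sanity-check yourself. A rough sketch that parks many goroutines and divides the process's OS memory growth by the count — the exact figure varies by Go version and OS, so treat the output as an estimate, not a spec:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// approxBytesPerGoroutine parks n goroutines, then reports how much
// additional memory the process obtained from the OS, divided by n.
func approxBytesPerGoroutine(n int) uint64 {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	stop := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-stop // park until released
		}()
	}
	runtime.ReadMemStats(&after)
	close(stop)
	wg.Wait()
	return (after.Sys - before.Sys) / uint64(n)
}

func main() {
	fmt.Printf("~%d bytes per goroutine\n", approxBytesPerGoroutine(50000))
}
```

On typical setups the result lands in the low kilobytes — orders of magnitude below an OS thread's default stack.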

Takeaway

  • Many concurrent connections, I/O-heavy: goroutines win on memory and switching.
  • CPU-bound or nanosecond latency: tune C++ thread counts, event loops, lock-free structures, etc.

Analogy: think of memory like a building — stacks are fast but small “elevators”; the heap is a large “warehouse.” Pointers are addresses on a slip of paper.

4. Performance and trade-offs

CPU-bound

  • C++: native code, compile-time optimization, direct cache/layout control — usually ahead on raw compute vs Go. Keep thread count near core count to limit switching.
  • Go: GC and runtime overhead; for extreme CPU performance, C++ often wins.

I/O-bound and many connections

  • C++: Asio event loop + few threads saves memory and switching — but callbacks and strand design add implementation cost.
  • Go: one goroutine per connection is idiomatic; blocking-style code is fine — simple and often fast enough for web/API workloads.

Latency

  • Ultra-low (sub-microsecond): C++ with explicit control; no GC pauses matters.
  • Milliseconds: Go can hit targets with tuning.

Benchmarks (illustrative)

| Workload | C++ | Go | Notes |
|---|---|---|---|
| Pure compute (~1e9 ops) | ~100ms | ~150ms | C++ ahead |
| HTTP ~10k QPS | similar | similar | depends on implementation |
| ~100k concurrent echo | Asio typical | goroutines natural | faster to build in Go |
| GC pauses | none | ms possible | C++ for ultra-low tail |

Echo server sketch (~10k connections)

C++ Asio: one io_context + thread pool (e.g. 4–8 threads); register ~10k sockets; chain async_accept → async_read → async_write. Memory: a few threads × MB stacks + socket buffers — tens of MB order.

Go: go handleConn(conn) per connection; ~10k goroutines → ~20MB stack total; use net.Conn in blocking style — simple; runtime schedules work.

Conclusion: both can handle ~10k connections; Go is quicker to write; C++ offers finer control at higher engineering cost.


5. Full C++ vs Go comparison

Overview table

| Topic | C++ | Go |
|---|---|---|
| Memory | Manual / RAII / smart pointers | GC |
| Concurrency | std::thread, Asio | Goroutines, channels |
| Types | Static, strong, templates | Static, strong, interfaces |
| Generics | Templates (compile time) | Generics (Go 1.18+) |
| Errors | try/catch | error returns, panic/recover |
| Build | Slower (headers) | Faster (packages) |
| Binary | static/dynamic link | single static binary |
| Deploy | dependency complexity | go build |
| Learning curve | steep | gentler |
| Ecosystem | Boost, Qt, huge library space | strong stdlib, go get |
| Performance | extreme control | “enough” for many workloads |
| Latency | no GC, nanosecond control | GC pauses possible |

Types and memory

C++:

  • Explicit memory control: new/delete or smart pointers.
  • RAII for resource lifetimes.
  • Compile-time polymorphism with templates.

Go:

  • GC reclaims memory; focus on allocation patterns.
  • defer for cleanup.
  • Runtime polymorphism with interfaces.
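The interface point is worth one concrete sketch: any type with the right method set satisfies an interface implicitly — there is no “implements” declaration, unlike C++ virtual base classes (all type names here are illustrative):

```go
package main

import "fmt"

// Shape is satisfied by any type that has an Area() method —
// satisfaction is structural, not declared.
type Shape interface {
	Area() float64
}

type Rect struct{ W, H float64 }
type Circle struct{ R float64 }

func (r Rect) Area() float64   { return r.W * r.H }
func (c Circle) Area() float64 { return 3.14159 * c.R * c.R }

// totalArea dispatches through the interface at runtime,
// the rough analogue of a C++ virtual call.
func totalArea(shapes []Shape) float64 {
	sum := 0.0
	for _, s := range shapes {
		sum += s.Area()
	}
	return sum
}

func main() {
	fmt.Println(totalArea([]Shape{Rect{W: 3, H: 4}, Circle{R: 1}}))
}
```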

Ecosystem

C++: Boost.Asio, nlohmann/json, spdlog, gRPC, Protobuf, etc. — usually CMake + vcpkg/Conan.

Go: rich standard library — net/http, encoding/json, log, context. Dependencies via modules.
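To make “rich standard library” concrete: a JSON endpoint needs nothing outside the stdlib. A minimal sketch — the route, port, and payload are illustrative, and the body builder is split out only to keep it easy to test:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// helloPayload builds the JSON body; kept separate from the handler
// so the encoding logic can be exercised on its own.
func helloPayload() ([]byte, error) {
	return json.Marshal(map[string]string{"msg": "hello"})
}

// helloHandler serves the payload using only standard-library types.
func helloHandler(w http.ResponseWriter, r *http.Request) {
	body, err := helloPayload()
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(body)
}

func main() {
	http.HandleFunc("/hello", helloHandler) // illustrative route
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The C++ equivalent typically means pulling in and wiring up an HTTP library and a JSON library before writing the first handler.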

Syntax cheat sheet

| Feature | C++ | Go |
|---|---|---|
| Concurrent execution | std::thread t(f); t.join(); | go f() |
| Synchronization | std::mutex, std::atomic | sync.Mutex, channels |
| Async results | std::future, std::async | channels, errgroup |
| Cleanup | RAII, destructors | defer |
| Errors | try/catch, optional | error returns |
| Modules | #include, namespaces | import, packages |

Error handling

C++: exceptions or std::expected (C++23), error codes — avoid exception cost with result types.

// C++: exception caught internally, error-style result returned
#include <optional>
#include <string>

std::optional<int> parse(const std::string& s) {
    try {
        return std::stoi(s);
    } catch (...) {
        return std::nullopt;
    }
}

Go: error returns — if err != nil is idiomatic.

// Go: error return
func parse(s string) (int, error) {
    n, err := strconv.Atoi(s)
    if err != nil {
        return 0, err
    }
    return n, nil
}

Build and deploy

| Topic | C++ | Go |
|---|---|---|
| Build time | slow (headers, templates) | fast (incremental packages) |
| Dependencies | vcpkg, Conan, system packages | go.mod, go.sum |
| Cross-compile | per-toolchain setup | GOOS, GOARCH |
| Deploy | watch dynamic libs | single binary |
| Docker | often larger base + toolchain | scratch + binary works |

6. Common mistakes

C++ mistakes

Mistake 1: one thread per connection

// Bad: 10k connections => 10k threads
void handle_client(int fd) {
    std::thread([fd]() {
        // ~1–8MB stack each => 10k threads => 10–80GB stacks
        process_request(fd);
    }).detach();
}

Fix: use an Asio event loop with a small thread pool.

// Better: event loop + small thread pool
boost::asio::io_context ioc;
for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i) {
    std::thread([&ioc]() { ioc.run(); }).detach();
}
// register many sockets on ioc

Mistake 2: shared_ptr everywhere

// Bad: shared_ptr on everything
void process(std::shared_ptr<Request> req) {
    auto resp = std::make_shared<Response>();  // atomic refcounts
    // ...
}

Fix: prefer unique_ptr for single ownership; shared_ptr only when sharing is required.

Mistake 3: shared data without synchronization

// Bad: data race
int counter = 0;
std::thread t1([&]() { ++counter; });
std::thread t2([&]() { ++counter; });

Fix: std::mutex or std::atomic.

Go mistakes

Mistake 1: unbounded goroutines for CPU work

// Bad: 10k goroutines doing CPU work => scheduling overhead
for i := 0; i < 10000; i++ {
    go cpuHeavyTask()
}

Fix: worker pool sized to cores.

// Better: worker pool
jobs := make(chan int, 100)
for w := 0; w < runtime.NumCPU(); w++ {
    go func() {
        for j := range jobs {
            cpuHeavyTask(j)
        }
    }()
}

Mistake 2: goroutine leak — channel never closed

// Bad: range waits forever if ch never closed
ch := make(chan int)
go func() {
    for v := range ch {
        process(v)
    }
}()

Fix: sender calls close(ch) when work is done.
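A sketch of the fixed shape: the sender owns the channel and closes it, so the receiver’s range loop can terminate (names are illustrative):

```go
package main

import "fmt"

// consume sums values until the channel is closed by the sender;
// range exits only when the channel is closed.
func consume(ch <-chan int) int {
	total := 0
	for v := range ch {
		total += v
	}
	return total
}

func main() {
	ch := make(chan int)
	go func() {
		for i := 1; i <= 5; i++ {
			ch <- i
		}
		close(ch) // sender closes: the ranging side can finish, no leak
	}()
	fmt.Println(consume(ch)) // 1+2+3+4+5 = 15
}
```

The ownership rule keeps this safe: only the sender closes, and only ever once — closing a channel twice or sending on a closed channel panics.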

Mistake 3: loop variable captured in goroutine

// Bad (pre-Go 1.22): loop variable shared across iterations — wrong index / data race
for i := 0; i < 10; i++ {
    go func() {
        process(items[i])
    }()
}

Fix: copy or pass by parameter. (Go 1.22+ gives each iteration its own variable, but passing explicitly remains the clearest.)

// Better
for i := 0; i < 10; i++ {
    go func(idx int) {
        process(items[idx])
    }(i)
}

Mistake 4: send/receive on nil channel

// Bad: nil channel blocks forever
var ch chan int
ch <- 1
<-ch

Fix: ch := make(chan int) before use.

Mistake 5: unbuffered channel deadlock

// Bad: no receiver => sender blocks forever
ch := make(chan int)
ch <- 1

Fix: buffer, or start receiver before send.

Shared pitfalls

Mistake 6: deadlock

C++: lock A then B in one thread and B then A in another.

Go: two channels waiting on each other.

Fix: consistent lock order; unidirectional channel patterns; select with timeouts.


7. Selection guide: what to choose when

Decision flow

flowchart TD
    A[Analyze requirements] --> B{Latency target?}
    B -->|Sub-microsecond| C[C++]
    B -->|Milliseconds| D{Concurrent connections?}
    D -->|Very high| E{Velocity priority?}
    E -->|Yes| F[Go]
    E -->|No| G[C++ Asio]
    D -->|Thousands| H{Team skills?}
    H -->|Strong C++| I[C++]
    H -->|Mixed| J[Go]

Goals

| Goal | C++ | Go |
|---|---|---|
| Peak CPU / low latency | strong | limited |
| Many connections + speed | needs design | strong |
| Memory / thread control | strong (event loop) | strong (goroutines) |
| Team ramp / maintenance | heavier | lighter |
| Single-binary deploy | possible with static link | strong |
| Existing C/C++ stack | strong | via cgo |

Domains

  • Prefer C++: game servers, HFT, embedded, heavy legacy C++, nanosecond-level control.
  • Prefer Go: web, APIs, microservices, fast shipping, simple operations.

Migration

C++ → Go: cgo works but watch cgo cost and GC pauses blocking native threads. Common pattern: split with gRPC/HTTP — C++ core, Go edge.

Go → C++: rewrite hot modules; keep Go for the rest — typical hybrid.

Hiring

  • C++: harder to hire; steep curve; demand in games, finance, systems.
  • Go: easier onboarding; strong cloud/DevOps demand.

Scenario summary

| Scenario | Likely choice | Why |
|---|---|---|
| REST / microservices | Go | speed of development, stdlib, single binary |
| Game server (100k+ CCU) | C++ | memory/latency control, engine integration |
| HFT / ultra-low latency | C++ | no GC, fine control |
| K8s / Docker tooling | Go | ecosystem |
| Image/video CPU pipeline | C++ or Go | CPU-bound → C++; I/O-heavy → Go |
| Chat / realtime fanout | Go | goroutine-per-conn ergonomics |
| Embedded / IoT edge | C++ | constraints, direct hardware |

8. Production patterns

C++ patterns

Pattern 1: Asio + thread pool

boost::asio::io_context ioc;
boost::asio::signal_set signals(ioc, SIGINT, SIGTERM);
signals.async_wait([&](auto, auto) { ioc.stop(); });
std::vector<std::thread> threads;
for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i) {
    threads.emplace_back([&ioc]() { ioc.run(); });
}
for (auto& t : threads) t.join();

Pattern 2: strand for per-connection ordering

auto strand = boost::asio::make_strand(ioc);
boost::asio::async_read(socket, buffer,
    boost::asio::bind_executor(strand, [](boost::system::error_code ec, std::size_t n) {
        // handlers bound to this strand run sequentially, never concurrently
    }));

Pattern 3: pool allocator for hot paths

template<typename T>
using pool_alloc = boost::pool_allocator<T>;
std::vector<int, pool_alloc<int>> vec;

Go patterns

Pattern 1: context for cancel and timeout

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
req, _ := http.NewRequestWithContext(ctx, "GET", "https://api.example.com", nil)
resp, err := http.DefaultClient.Do(req)
if err != nil {
    if ctx.Err() == context.DeadlineExceeded {
        // handle timeout
    }
    return err // resp is nil on error — don't touch resp.Body here
}
defer resp.Body.Close()

Pattern 2: errgroup for goroutine groups

g, ctx := errgroup.WithContext(ctx)
for _, url := range urls {
    url := url // per-iteration copy (unneeded in Go 1.22+, but harmless)
    g.Go(func() error {
        resp, err := fetch(ctx, url)
        if err != nil {
            return err
        }
        return process(resp)
    })
}
if err := g.Wait(); err != nil {
    return err
}

Pattern 3: worker pool

func worker(id int, jobs <-chan Job, results chan<- Result) {
    for j := range jobs {
        results <- process(j)
    }
}
jobs := make(chan Job, 100)
results := make(chan Result, 100)
for w := 0; w < runtime.NumCPU(); w++ {
    go worker(w, jobs, results)
}

Hybrid architecture

  • C++ core + Go API: HFT engine in C++; REST/gRPC in Go.
  • Go service + C++ lib: cgo — mind overhead and GC blocking native code.

Pattern 4: C++ RAII for connection cleanup

class Connection {
    boost::asio::ip::tcp::socket socket_;
public:
    Connection(boost::asio::io_context& ioc) : socket_(ioc) {}
    ~Connection() {
        boost::system::error_code ec;
        socket_.shutdown(boost::asio::ip::tcp::socket::shutdown_both, ec);
    }
};

Pattern 5: Go defer

func processFile(path string) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()
    return nil
}

Pattern 6: C++ thread_local

thread_local std::mt19937 rng(std::random_device{}());

Pattern 7: Go sync.Once

var once sync.Once
var config *Config
func getConfig() *Config {
    once.Do(func() {
        config = loadConfig()
    })
    return config
}

Pattern 8: channel vs mutex

Channels: passing data between goroutines; “don’t share memory, communicate.”

ch := make(chan int, 10)
go producer(ch)
go consumer(ch)

Mutex: protecting shared mutable state.

var mu sync.Mutex
var cache map[string]string
func get(key string) string {
    mu.Lock()
    defer mu.Unlock()
    return cache[key]
}

Go proverb: “Don’t communicate by sharing memory; share memory by communicating.”

Pattern 9: C++ atomic counter

std::atomic<int> counter{0};
counter.fetch_add(1, std::memory_order_relaxed);

Pattern 10: C++ shared_mutex reader/writer

std::shared_mutex mtx;
void read_path()  { std::shared_lock lock(mtx); /* many readers at once */ }
void write_path() { std::unique_lock lock(mtx); /* exclusive writer */ }
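Go’s counterpart to shared_mutex is sync.RWMutex. A minimal read-mostly cache sketch mirroring the shared_lock / unique_lock split (type and method names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// Cache allows many concurrent readers but exclusive writers,
// the same split as C++'s shared_lock vs unique_lock.
type Cache struct {
	mu sync.RWMutex
	m  map[string]string
}

func NewCache() *Cache { return &Cache{m: make(map[string]string)} }

func (c *Cache) Get(key string) (string, bool) {
	c.mu.RLock() // shared: readers do not block each other
	defer c.mu.RUnlock()
	v, ok := c.m[key]
	return v, ok
}

func (c *Cache) Set(key, val string) {
	c.mu.Lock() // exclusive: blocks both readers and writers
	defer c.mu.Unlock()
	c.m[key] = val
}

func main() {
	c := NewCache()
	c.Set("lang", "go")
	v, ok := c.Get("lang")
	fmt.Println(v, ok)
}
```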

9. Summary and checklist

Highlights

| Topic | C++ | Go |
|---|---|---|
| Concurrency | threads, Asio | goroutines, channels |
| Memory | manual / RAII | GC |
| Latency | ultra-low possible | GC pauses |
| Velocity | slower to ship | faster for many services |
| Deploy | complex deps | single binary |

Principles:

  1. Measure latency, throughput, and team skills first.
  2. Ultra-low latency → C++; fast iteration and ops → consider Go.
  3. Avoid one thread per connection; use Asio in C++, goroutines in Go.
  4. CPU-bound → worker pools; I/O-bound → goroutines / event loops.

Checklist

  • Sub-microsecond latency? → consider C++
  • Tens of thousands of connections? → Go goroutines or C++ Asio
  • No senior C++ engineers? → lean Go
  • Heavy existing C++ stack? → likely stay on C++
  • Single-binary deploy critical? → Go shines
  • GC pauses unacceptable? → C++

Profiling

C++: perf, VTune, gprof; Valgrind (memcheck for leaks, helgrind for races); std::chrono::steady_clock for timing — prefer it over high_resolution_clock, which is often just an alias.

Go: go tool pprof, -race, runtime/debug for GC stats.

import "runtime/debug"
var stats debug.GCStats
debug.ReadGCStats(&stats)
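Beyond GC stats, runtime/pprof exposes in-process profiles without any external tooling. A sketch that reads the built-in goroutine profile and dumps it in human-readable form:

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

// goroutineCount reports how many goroutines the built-in
// "goroutine" profile currently sees.
func goroutineCount() int {
	return pprof.Lookup("goroutine").Count()
}

func main() {
	fmt.Println("goroutines:", goroutineCount())
	// debug=1 writes a human-readable stack dump to stderr —
	// handy for spotting goroutine leaks in a running service.
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)
}
```

For long-running services, importing net/http/pprof instead exposes the same profiles over HTTP for go tool pprof.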

FAQ

When do I use this at work?

A. When choosing server stacks, designing microservices, or comparing concurrency models — use the scenarios, tables, and patterns above as a checklist.

How do I follow the series?

A. Follow previous post links at the bottom of each article, or open the C++ series index for the full sequence.

Where do I go deeper?

A. cppreference, Go documentation, Effective Go, plus Boost.Asio and Go net docs.

Can I use C++ and Go together?

A. Yes — gRPC/HTTP hybrids are common: C++ core (engine/HFT) + Go APIs. cgo is possible; account for call overhead and GC blocking native threads.


Next steps

One-line summary: knowing how C++ and Go differ on performance and concurrency makes stack choices clearer. Next, read Go for C++ developers (#47-2).

Next post: [C++ vs other languages #47-2] Understanding Go through a C++ developer’s mental model


Keywords

C++ vs Go, language comparison, goroutines, concurrency, performance, Asio, channels — useful search terms for this topic.

