C++ Performance Optimization Case Study | 200ms API Latency Cut to 20ms

Key points

A real-world C++ API latency reduction, from 200ms to about 20ms, via profiling, algorithmic fixes, copy elimination, and multithreading.

Introduction

We started from “the API feels slow.” This post shares how we followed measure → analyze → optimize → verify and cut latency from 200ms to about 20ms—roughly 10×.

What you will learn

  • A structured way to find bottlenecks
  • Practical use of perf, gprof, and Valgrind
  • Algorithm, memory, and threading techniques
  • How to quantify improvements

Table of contents

  1. Problem: API too slow
  2. Measurement: benchmark baseline
  3. Profiling: hotspots with perf
  4. Bottleneck 1: O(n²) algorithm
  5. Optimization 1: hash map → O(n) lookups
  6. Bottleneck 2: string copies
  7. Optimization 2: string_view and moves
  8. Bottleneck 3: JSON serialization
  9. Optimization 3: threading and pooling
  10. Final results: 10× faster
  11. Lessons and tooling
  12. More ideas and closing thoughts

1. Problem: API too slow

Situation

A REST endpoint that returns a user list was reported as slow:

# 100 users
$ curl -w "@curl-format.txt" http://api.example.com/users
time_total: 0.203s  # 200ms

Requirements

  • Goal: p50 latency under 50ms
  • Constraint: keep the existing API contract
  • Environment: Ubuntu 22.04, g++ 11, 4 CPU cores

2. Measurement: benchmark baseline

Simple benchmark harness

// benchmark.cpp
#include <algorithm> // std::sort
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

using namespace std::chrono;

class Benchmark {
    std::vector<double> samples_;
    
public:
    template<typename Func>
    void run(const std::string& name, Func&& func, int iterations = 100) {
        samples_.clear();
        
        // Warm-up runs: let caches, allocators, and lazy init settle before timing
        for (int i = 0; i < 10; ++i) {
            func();
        }
        
        for (int i = 0; i < iterations; ++i) {
            auto start = steady_clock::now();
            func();
            auto end = steady_clock::now();
            
            auto duration = duration_cast<microseconds>(end - start).count();
            samples_.push_back(duration / 1000.0); // ms
        }
        
        std::sort(samples_.begin(), samples_.end());
        double p50 = samples_[samples_.size() / 2];
        double p95 = samples_[samples_.size() * 95 / 100];
        double p99 = samples_[samples_.size() * 99 / 100];
        
        std::cout << name << ":\n"
                  << "  p50: " << p50 << "ms\n"
                  << "  p95: " << p95 << "ms\n"
                  << "  p99: " << p99 << "ms\n";
    }
};
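The p50/p95/p99 picks above use simple nearest-rank indexing. Factored out as a standalone helper (illustrative, not part of the original harness), the arithmetic can be sanity-checked directly:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Nearest-rank percentile, matching the samples_[size * p / 100]
// indexing used in the harness above (helper itself is illustrative).
double percentile(std::vector<double> samples, int p) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    std::size_t idx = samples.size() * static_cast<std::size_t>(p) / 100;
    if (idx >= samples.size()) idx = samples.size() - 1; // clamp p = 100
    return samples[idx];
}
```

Nearest-rank is a rough estimator; with only 100 iterations the p99 rests on a single sample, so treat tail numbers with some skepticism.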

Baseline

$ ./benchmark
getUserList (100 users):
  p50: 203.5ms
  p95: 215.2ms
  p99: 223.1ms

3. Profiling: hotspots with perf

perf

$ g++ -O2 -g -std=c++17 *.cpp -o server

$ perf record -g ./server
$ perf report

Samples: 10K of event 'cycles'
  65.23%  server  [.] UserManager::findUsersByRole
  18.45%  server  [.] std::string::string(std::string const&)
  12.34%  server  [.] json::serialize
   2.98%  server  [.] other

Findings

  1. findUsersByRole ~65% of time
  2. String copies ~18%
  3. JSON serialization ~12%

4. Bottleneck 1: O(n²) work

Original code

class UserManager {
    std::vector<User> users_; // 10,000 users

public:
    std::vector<User> findUsersByRole(const std::string& role) {
        std::vector<User> result;
        
        // Costly nested work per user
        for (const auto& user : users_) {
            for (const auto& r : user.roles) {
                if (r == role) {
                    result.push_back(user);
                    break;
                }
            }
        }
        
        return result;
    }
};

Complexity

  • Each query scans all n users and, for each, up to m roles: O(n·m) string comparisons per request
  • Role strings are compared character by character, so comparison cost also grows with name length

5. Optimization 1: hash map index

Improved code

class UserManager {
    std::vector<User> users_;
    std::unordered_map<std::string, std::vector<size_t>> roleIndex_;

public:
    void buildIndex() {
        roleIndex_.clear();
        for (size_t i = 0; i < users_.size(); ++i) {
            for (const auto& role : users_[i].roles) {
                roleIndex_[role].push_back(i);
            }
        }
    }
    
    std::vector<User> findUsersByRole(const std::string& role) {
        std::vector<User> result;
        
        if (auto it = roleIndex_.find(role); it != roleIndex_.end()) {
            result.reserve(it->second.size());
            for (size_t idx : it->second) {
                result.push_back(users_[idx]);
            }
        }
        
        return result;
    }
};
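The same indexing idea in a compact, self-contained form (a simplified `MiniUser` with just `id` and `roles`; all names here are illustrative): one pass over every (user, role) pair builds the index, after which a query costs an average O(1) hash lookup plus O(k) to collect the k matches.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-in for User: only the fields the index needs.
struct MiniUser {
    std::string id;
    std::vector<std::string> roles;
};

using RoleIndex = std::unordered_map<std::string, std::vector<std::size_t>>;

// One pass over every (user, role) pair builds the index that
// replaces the per-query linear scan.
RoleIndex buildRoleIndex(const std::vector<MiniUser>& users) {
    RoleIndex index;
    for (std::size_t i = 0; i < users.size(); ++i)
        for (const auto& role : users[i].roles)
            index[role].push_back(i);
    return index;
}

// Average O(1) hash lookup plus O(k) to collect the k matching ids.
std::vector<std::string> idsByRole(const std::vector<MiniUser>& users,
                                   const RoleIndex& index,
                                   const std::string& role) {
    std::vector<std::string> out;
    if (auto it = index.find(role); it != index.end()) {
        out.reserve(it->second.size());
        for (std::size_t idx : it->second) out.push_back(users[idx].id);
    }
    return out;
}
```

Note the trade-off: the index must be rebuilt (or incrementally updated) whenever users change, which is why `buildIndex()` is a separate step in the class above.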

Result

$ ./benchmark
getUserList (100 users):
  p50: 85.3ms  # 203.5ms → 85.3ms (~2.4×)

Effect: CPU share of the hot path dropped (65% → ~28% in later profiles).


6. Bottleneck 2: string copies

Issue

for (size_t idx : it->second) {
    result.push_back(users_[idx]); // full User copy
}

struct User {
    std::string id;
    std::string name;
    std::string email;
    std::vector<std::string> roles;
};
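To make the hidden cost concrete, here is a sketch (not from the original code) using a hypothetical instrumented `CountedUser`: storing full values deep-copies every string member per element, while storing pointers copies nothing.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical instrumented user type: counts copy-constructor calls
// so the cost of push_back-by-value becomes visible.
struct CountedUser {
    std::string id, name, email;
    std::vector<std::string> roles;
    static inline int copies = 0; // C++17 inline static

    CountedUser(std::string i, std::string n)
        : id(std::move(i)), name(std::move(n)) {}
    CountedUser(const CountedUser& o)
        : id(o.id), name(o.name), email(o.email), roles(o.roles) { ++copies; }
};

// Value semantics: one deep copy (all string members) per element.
int copiesWhenStoringValues(const std::vector<CountedUser>& src) {
    CountedUser::copies = 0;
    std::vector<CountedUser> out;
    out.reserve(src.size());
    for (const auto& u : src) out.push_back(u);
    return CountedUser::copies;
}

// Pointer semantics: an 8-byte pointer per element, zero deep copies.
int copiesWhenStoringPointers(const std::vector<CountedUser>& src) {
    CountedUser::copies = 0;
    std::vector<const CountedUser*> out;
    out.reserve(src.size());
    for (const auto& u : src) out.push_back(&u);
    return CountedUser::copies;
}
```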

7. Optimization 2: string_view and moves

Return references

// Non-owning pointers: valid only while users_ is not modified
std::vector<const User*> findUsersByRole(const std::string& role) const {
    std::vector<const User*> result;
    
    if (auto it = roleIndex_.find(role); it != roleIndex_.end()) {
        result.reserve(it->second.size());
        for (size_t idx : it->second) {
            result.push_back(&users_[idx]);
        }
    }
    
    return result;
}

string_view in serialization

std::string serializeUserView(std::string_view id, std::string_view name) {
    std::ostringstream oss;
    oss << "{\"id\":\"" << id << "\","
        << "\"name\":\"" << name << "\"}";
    return oss.str();
}
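For reference, the same function with its includes so the snippet compiles on its own; because the parameters are views, both `std::string` arguments and string literals bind without copying on the way in:

```cpp
#include <sstream>
#include <string>
#include <string_view>

// Same shape as serializeUserView above. The views alias the
// caller's bytes; no allocation happens at the call boundary.
std::string serializeUserView(std::string_view id, std::string_view name) {
    std::ostringstream oss;
    oss << "{\"id\":\"" << id << "\","
        << "\"name\":\"" << name << "\"}";
    return oss.str();
}
```

The usual caveat applies: a `string_view` does not own its bytes, so it must never outlive the string it points into.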

Result

$ ./benchmark
getUserList (100 users):
  p50: 42.1ms  # 85.3ms → 42.1ms (~2×)

8. Bottleneck 3: JSON serialization

Issue

std::string toJson(const std::vector<const User*>& users) {
    std::string json = "[";
    for (size_t i = 0; i < users.size(); ++i) {
        json += serializeUser(*users[i]); // repeated +=
        if (i < users.size() - 1) {
            json += ",";
        }
    }
    json += "]";
    return json;
}

Without a capacity reserve, repeated += forces the buffer to regrow as it fills; every regrowth reallocates and copies everything accumulated so far, which is pure overhead on large payloads.
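A quick way to see the regrowth is to count how many distinct capacities the string passes through while appending; this sketch (names illustrative) compares building with and without an upfront reserve:

```cpp
#include <cstddef>
#include <set>
#include <string>

// Count how many distinct capacities the string passes through while
// appending `chunks` pieces of `chunkLen` bytes (names illustrative).
// With an upfront reserve of the final size, the buffer never regrows.
int growthSteps(std::size_t chunks, std::size_t chunkLen, bool reserveUpfront) {
    std::string s;
    if (reserveUpfront) s.reserve(chunks * chunkLen);
    std::set<std::size_t> capacities;
    const std::string piece(chunkLen, 'x');
    for (std::size_t i = 0; i < chunks; ++i) {
        s += piece;
        capacities.insert(s.capacity());
    }
    return static_cast<int>(capacities.size());
}
```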


9. Optimization 3: threading and reservation

Reserve capacity

std::string toJson(const std::vector<const User*>& users) {
    std::string json;
    json.reserve(users.size() * 100); // rough per-user size estimate
    
    json += "["; // append rather than assign: assignment may discard the reserved capacity
    for (size_t i = 0; i < users.size(); ++i) {
        json += serializeUser(*users[i]);
        if (i < users.size() - 1) {
            json += ",";
        }
    }
    json += "]";
    return json;
}

Parallel serialization

#include <algorithm> // std::min
#include <future>
#include <string>
#include <thread>
#include <vector>

std::string toJsonParallel(const std::vector<const User*>& users) {
    if (users.size() < 100) {
        return toJson(users);
    }
    
    size_t numThreads = std::thread::hardware_concurrency();
    size_t chunkSize = (users.size() + numThreads - 1) / numThreads;
    
    std::vector<std::future<std::string>> futures;
    
    for (size_t i = 0; i < numThreads; ++i) {
        size_t start = i * chunkSize;
        size_t end = std::min(start + chunkSize, users.size());
        
        if (start >= users.size()) break;
        
        futures.push_back(std::async(std::launch::async, [&, start, end]() {
            std::string chunk;
            chunk.reserve((end - start) * 100);
            
            for (size_t j = start; j < end; ++j) {
                chunk += serializeUser(*users[j]);
                if (j < end - 1) chunk += ",";
            }
            return chunk;
        }));
    }
    
    std::string result = "[";
    for (size_t i = 0; i < futures.size(); ++i) {
        result += futures[i].get();
        if (i < futures.size() - 1) result += ",";
    }
    result += "]";
    
    return result;
}
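The chunking arithmetic above (ceil-divide, clamp with `std::min`, skip empty tail chunks) is easy to get off by one. Pulled out into a hypothetical `splitRanges` helper, it can be checked in isolation:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split [0, n) into at most `parts` contiguous [start, end) ranges,
// mirroring the ceil-divide + std::min + early-break logic above.
// (splitRanges is a hypothetical helper, not in the original code.)
std::vector<std::pair<std::size_t, std::size_t>>
splitRanges(std::size_t n, std::size_t parts) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (n == 0 || parts == 0) return ranges;
    std::size_t chunk = (n + parts - 1) / parts; // ceil(n / parts)
    for (std::size_t i = 0; i < parts; ++i) {
        std::size_t start = i * chunk;
        if (start >= n) break; // skip empty tail chunks
        ranges.emplace_back(start, std::min(start + chunk, n));
    }
    return ranges;
}
```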

Result

$ ./benchmark
getUserList (100 users):
  p50: 20.3ms  # 42.1ms → 20.3ms (~2×)

getUserList (1000 users):
  p50: 45.2ms

10. Final results: ~10× improvement

Stage-by-stage

Stage   Change                        p50       Factor
0       Baseline                      203.5ms   -
1       Role index                    85.3ms    ~2.4×
2       Pointer views, less copying   42.1ms    ~2×
3       reserve + parallel JSON       20.3ms    ~2×
Total   Combined                      20.3ms    ~10×

CPU profile shift

Before: findUsersByRole dominated; string copy + JSON next.
After: JSON + I/O + hash lookup more balanced.


11. Lessons and tooling

Takeaways

  1. Don’t optimize without measurement
  2. Fix algorithmic cost first
  3. Remove needless copies (references, moves, string_view)
  4. Parallelize last, after serial optimizations

Optimization flow

graph TD
    A[Performance issue] --> B[Measure and profile]
    B --> C{Find bottleneck}
    C --> D[Improve algorithmic complexity]
    D --> E[Memory / copies]
    E --> F[Compiler options]
    F --> G[Multithreading]
    G --> H[Verify and ship]

Tooling

Tool                 Use                 Example
perf                 CPU hotspots        perf record -g ./app && perf report
gprof                Per-function time   g++ -pg ... && ./app && gprof app
Valgrind Callgrind   Call graphs         valgrind --tool=callgrind ./app
Heaptrack            Allocations         heaptrack ./app

12. More ideas

Short TTL cache

class UserManager {
    // Cached pointers assume users_ is not mutated during the TTL
    std::unordered_map<std::string, std::vector<const User*>> cache_;
    std::chrono::steady_clock::time_point cacheTime_;
    
public:
    std::vector<const User*> findUsersByRole(const std::string& role) {
        auto now = std::chrono::steady_clock::now();
        
        if (now - cacheTime_ < std::chrono::seconds(5)) {
            if (auto it = cache_.find(role); it != cache_.end()) {
                return it->second; // still within the TTL: serve the cached result
            }
        } else {
            cache_.clear(); // TTL expired: drop all stale entries, not just this role's
        }
        
        auto result = findUsersByRoleImpl(role);
        cache_[role] = result;
        cacheTime_ = now;
        
        return result;
    }
};

DB index

-- PostgreSQL: a GIN index on the array column speeds up role-containment lookups
CREATE INDEX idx_user_roles ON users USING GIN(roles);

Compress responses

// gzip JSON to reduce transfer time
#include <zlib.h>

Closing thoughts

Performance work is measure → analyze → optimize → verify, repeated.

  1. perf located the real hotspots
  2. Indexing gave the largest win
  3. Fewer copies helped again
  4. Parallel JSON finished the job

~10× faster responses materially improved UX.


FAQ

Q1. When should we optimize?

Only when you have measurable pain. “It feels slow” without numbers usually means more complex code for little gain.

Q2. Should we use -O3?

Often -O2 is enough; -O3 can grow code and hurt caches. Measure.

Q3. Thread first?

Optimize the serial path first. Parallel slow code is just “fast slow code.”



Checklists

Performance optimization

  • Clear numeric goals
  • Baseline benchmarks
  • Profile (perf, gprof, Valgrind)
  • Identify top 3 hotspots
  • Analyze complexity
  • Apply changes
  • Re-benchmark
  • Regression tests
  • Ship and monitor

Code review

  • Hidden O(n²)+ loops?
  • Unnecessary copies?
  • Repeated string +=?
  • reserve() where needed?
  • Parallel-safe where parallelized?
  • Caching considered?

Keywords

C++, performance optimization, profiling, perf, gprof, bottleneck, algorithm, complexity, hash map, string_view, move semantics, multithreading, parallelism, benchmark, case study