C++ Performance Optimization Case Study | 200ms API Latency Cut to 20ms
Key takeaways
Real C++ API latency improvement: profiling, algorithms, memory, and multithreading.
Introduction
We started from “the API feels slow.” This post shares how we followed measure → analyze → optimize → verify and cut latency from 200ms to about 20ms—roughly 10×.
What you will learn
- A structured way to find bottlenecks
- Practical use of perf, gprof, and Valgrind
- Algorithm, memory, and threading techniques
- How to quantify improvements
Table of contents
- Problem: API too slow
- Measurement: benchmark baseline
- Profiling: hotspots with perf
- Bottleneck 1: O(n²) algorithm
- Optimization 1: hash map → O(n) lookups
- Bottleneck 2: string copies
- Optimization 2: string_view and moves
- Bottleneck 3: JSON serialization
- Optimization 3: threading and reservation
- Final results: 10× faster
- Closing thoughts
1. Problem: API too slow
Situation
A REST endpoint that returns a user list was reported as slow:
# 100 users
$ curl -w "@curl-format.txt" http://api.example.com/users
time_total: 0.203s # 200ms
Requirements
- Goal: p50 latency under 50ms
- Constraint: keep the existing API contract
- Environment: Ubuntu 22.04, g++ 11, 4 CPU cores
2. Measurement: benchmark baseline
Simple benchmark harness
// benchmark.cpp
#include <algorithm> // std::sort
#include <chrono>
#include <iostream>
#include <string>
#include <vector>
using namespace std::chrono;
class Benchmark {
std::vector<double> samples_;
public:
template<typename Func>
void run(const std::string& name, Func&& func, int iterations = 100) {
samples_.clear();
for (int i = 0; i < 10; ++i) { // warm-up: let caches and the allocator settle
func();
}
for (int i = 0; i < iterations; ++i) {
auto start = steady_clock::now();
func();
auto end = steady_clock::now();
auto duration = duration_cast<microseconds>(end - start).count();
samples_.push_back(duration / 1000.0); // ms
}
std::sort(samples_.begin(), samples_.end());
double p50 = samples_[samples_.size() / 2];
double p95 = samples_[samples_.size() * 95 / 100];
double p99 = samples_[samples_.size() * 99 / 100];
std::cout << name << ":\n"
<< " p50: " << p50 << "ms\n"
<< " p95: " << p95 << "ms\n"
<< " p99: " << p99 << "ms\n";
}
};
Baseline
$ ./benchmark
getUserList (100 users):
p50: 203.5ms
p95: 215.2ms
p99: 223.1ms
3. Profiling: hotspots with perf
perf
$ g++ -O2 -g -std=c++17 *.cpp -o server
$ perf record -g ./server
$ perf report
Samples: 10K of event 'cycles'
65.23% server [.] UserManager::findUsersByRole
18.45% server [.] std::string::string(std::string const&)
12.34% server [.] json::serialize
2.98% server [.] other
Findings
- findUsersByRole: ~65% of time
- String copies: ~18%
- JSON serialization: ~12%
4. Bottleneck 1: O(n²) work
Original code
class UserManager {
std::vector<User> users_; // 10,000 users
public:
std::vector<User> findUsersByRole(const std::string& role) {
std::vector<User> result;
// Costly nested work per user
for (const auto& user : users_) {
for (const auto& r : user.roles) {
if (r == role) {
result.push_back(user);
break;
}
}
}
return result;
}
};
Complexity
- n users (10,000 here), ~m roles per user on average
- Every query walks every user's role list: O(n·m) string comparisons per call, and those comparisons dominate the profile
5. Optimization 1: hash map index
Improved code
class UserManager {
std::vector<User> users_;
std::unordered_map<std::string, std::vector<size_t>> roleIndex_;
public:
void buildIndex() {
roleIndex_.clear();
for (size_t i = 0; i < users_.size(); ++i) {
for (const auto& role : users_[i].roles) {
roleIndex_[role].push_back(i);
}
}
}
std::vector<User> findUsersByRole(const std::string& role) {
std::vector<User> result;
if (auto it = roleIndex_.find(role); it != roleIndex_.end()) {
result.reserve(it->second.size());
for (size_t idx : it->second) {
result.push_back(users_[idx]);
}
}
return result;
}
};
Result
$ ./benchmark
getUserList (100 users):
p50: 85.3ms # 203.5ms → 85.3ms (~2.4×)
Effect: CPU share of the hot path dropped (65% → ~28% in later profiles).
6. Bottleneck 2: string copies
Issue
for (size_t idx : it->second) {
result.push_back(users_[idx]); // full User copy
}
struct User {
std::string id;
std::string name;
std::string email;
std::vector<std::string> roles;
};
7. Optimization 2: string_view and moves
Return references
std::vector<const User*> findUsersByRole(const std::string& role) const {
std::vector<const User*> result;
if (auto it = roleIndex_.find(role); it != roleIndex_.end()) {
result.reserve(it->second.size());
for (size_t idx : it->second) {
result.push_back(&users_[idx]);
}
}
return result;
}
string_view in serialization
std::string serializeUserView(std::string_view id, std::string_view name) {
std::ostringstream oss;
oss << "{\"id\":\"" << id << "\","
<< "\"name\":\"" << name << "\"}";
return oss.str();
}
Result
$ ./benchmark
getUserList (100 users):
p50: 42.1ms # 85.3ms → 42.1ms (~2×)
8. Bottleneck 3: JSON serialization
Issue
std::string toJson(const std::vector<const User*>& users) {
std::string json = "[";
for (size_t i = 0; i < users.size(); ++i) {
json += serializeUser(*users[i]); // repeated +=
if (i < users.size() - 1) {
json += ",";
}
}
json += "]";
return json;
}
Each += that exceeds the string's capacity reallocates and copies the entire buffer, so repeated appends are worst-case O(n²) in the final string length.
9. Optimization 3: threading and reservation
Reserve capacity
std::string toJson(const std::vector<const User*>& users) {
std::string json;
json.reserve(users.size() * 100);
json = "[";
for (size_t i = 0; i < users.size(); ++i) {
json += serializeUser(*users[i]);
if (i < users.size() - 1) {
json += ",";
}
}
json += "]";
return json;
}
Parallel serialization
#include <thread>
#include <future>
std::string toJsonParallel(const std::vector<const User*>& users) {
if (users.size() < 100) {
return toJson(users);
}
size_t numThreads = std::thread::hardware_concurrency();
if (numThreads == 0) numThreads = 2; // hardware_concurrency() may report 0
size_t chunkSize = (users.size() + numThreads - 1) / numThreads;
std::vector<std::future<std::string>> futures;
for (size_t i = 0; i < numThreads; ++i) {
size_t start = i * chunkSize;
size_t end = std::min(start + chunkSize, users.size());
if (start >= users.size()) break;
futures.push_back(std::async(std::launch::async, [&, start, end]() {
std::string chunk;
chunk.reserve((end - start) * 100);
for (size_t j = start; j < end; ++j) {
chunk += serializeUser(*users[j]);
if (j < end - 1) chunk += ",";
}
return chunk;
}));
}
std::string result = "[";
for (size_t i = 0; i < futures.size(); ++i) {
result += futures[i].get();
if (i < futures.size() - 1) result += ",";
}
result += "]";
return result;
}
Result
$ ./benchmark
getUserList (100 users):
p50: 20.3ms # 42.1ms → 20.3ms (~2×)
getUserList (1000 users):
p50: 45.2ms
10. Final results: ~10× improvement
Stage-by-stage
| Stage | Change | p50 | Factor |
|---|---|---|---|
| 0 | Baseline | 203.5ms | — |
| 1 | Role index | 85.3ms | ~2.4× |
| 2 | Pointer views, less copying | 42.1ms | ~2× |
| 3 | reserve + parallel JSON | 20.3ms | ~2× |
| Total | Combined | 20.3ms | ~10× |
CPU profile shift
Before: findUsersByRole dominated; string copy + JSON next.
After: JSON + I/O + hash lookup more balanced.
11. Lessons and more ideas
Takeaways
- Don’t optimize without measurement
- Fix algorithmic cost first
- Remove needless copies (references, moves, string_view)
- Parallelize last, after serial optimizations
Optimization flow
graph TD
A[Performance issue] --> B[Measure and profile]
B --> C{Find bottleneck}
C --> D[Improve algorithmic complexity]
D --> E[Memory / copies]
E --> F[Compiler options]
F --> G[Multithreading]
G --> H[Verify and ship]
Tooling
| Tool | Use | Example |
|---|---|---|
| perf | CPU hotspots | perf record -g ./app && perf report |
| gprof | Per-function time | g++ -pg ... && ./app && gprof app |
| Valgrind Callgrind | Call graphs | valgrind --tool=callgrind ./app |
| Heaptrack | Allocations | heaptrack ./app |
12. More ideas
Short TTL cache
class UserManager {
// Sketch only: one shared timestamp for the whole cache keeps the example
// short; per-key timestamps would avoid coupling every key to one TTL.
std::unordered_map<std::string, std::vector<const User*>> cache_;
std::chrono::steady_clock::time_point cacheTime_;
public:
std::vector<const User*> findUsersByRole(const std::string& role) {
auto now = std::chrono::steady_clock::now();
if (now - cacheTime_ < std::chrono::seconds(5)) {
if (auto it = cache_.find(role); it != cache_.end()) {
return it->second; // cached pointers stay valid only while users_ is unchanged
}
}
cache_.clear(); // TTL expired: drop stale entries before refreshing
auto result = findUsersByRoleImpl(role);
cache_[role] = result;
cacheTime_ = now;
return result;
}
};
DB index
CREATE INDEX idx_user_roles ON users USING GIN(roles);
Compress responses
// gzip JSON to reduce transfer time
#include <zlib.h>
Closing thoughts
Performance work is measure → analyze → optimize → verify, repeated.
- perf located the real hotspots
- Indexing gave the largest win
- Fewer copies helped again
- Parallel JSON finished the job
~10× faster responses materially improved UX.
FAQ
Q1. When should we optimize?
Only when you have measurable pain. “It feels slow” without numbers usually means more complex code for little gain.
Q2. Should we use -O3?
Often -O2 is enough; -O3 can grow code and hurt caches. Measure.
Q3. Thread first?
Optimize the serial path first. Parallel slow code is just “fast slow code.”
Related posts
- C++ profiling guide
- C++ complexity
- C++ multithreading
- C++ string_view
Checklists
Performance optimization
- Clear numeric goals
- Baseline benchmarks
- Profile (perf, gprof, Valgrind)
- Identify top 3 hotspots
- Analyze complexity
- Apply changes
- Re-benchmark
- Regression tests
- Ship and monitor
Code review
- Hidden O(n²)+ loops?
- Unnecessary copies?
- Repeated string +=?
- reserve() where needed?
- Parallel-safe where parallelized?
- Caching considered?
Keywords
C++, performance optimization, profiling, perf, gprof, bottleneck, algorithm, complexity, hash map, string_view, move semantics, multithreading, parallelism, benchmark, case study