C++ Observability: Prometheus and Grafana for Server Monitoring
The gist of this article
Build a pipeline: expose metrics from C++ servers, scrape with Prometheus, and visualize in Grafana.
Introduction: see why it got slow—with data
You need metrics to respond
Parts 43-1 and 43-2 covered RPC and security. In operations, metrics (request counts, latency, CPU usage, etc.) are essential. Prometheus uses a pull model: the Prometheus server scrapes targets over HTTP and stores the results as time series. Grafana visualizes them on dashboards.
To expose Prometheus text format from C++, define Counter, Gauge, and Histogram, and serve them as text on a path such as /metrics. You can use prometheus-cpp or implement a minimal exporter yourself.
This article covers:
- Prometheus metric types: Counter, Gauge, Histogram, labels
- Exposing metrics from C++: library vs manual, thread safety
- Grafana: Prometheus data source and dashboard examples
- Scenarios, common errors, production patterns
Real-world scenarios
Scenario 1: API suddenly slow, root cause unknown
Situation: C++ gRPC server latency spikes to 10s at 2 AM
Problem: Logs alone do not show *where* it blocks
Result: Without Prometheus metrics you cannot see RPS, latency distribution, or error-rate trends
→ Hours to triage, incident response drags
Scenario 2: Memory usage creeps up
Situation: C++ server memory rises for 3 days straight
Problem: No time series for heap, connections, or queue depth
Result: Without Gauge metrics you only suspect leaks, with no evidence
→ Restart as a band-aid, root cause unfixed
Scenario 3: One endpoint has high errors
Situation: Overall error rate 1%, but /api/payment is 30%
Problem: Without per-path metrics you cannot pinpoint the route
Result: Only a global Counter → no fine-grained analysis
→ You optimize the wrong thing or miss the bad path
Scenario 4: Regression after deploy
Situation: Users feel slowness after a new release
Problem: No p99 or RPS to compare before/after
Result: Without Histograms you cannot compare percentiles
→ Roll back blindly or leave the outage running
This article shows how to prevent those issues with a Prometheus + Grafana pipeline and complete examples.
Table of contents
- Prometheus metrics
- Exposing metrics from C++
- Prometheus configuration and scraping
- Grafana integration
- End-to-end Prometheus + Grafana examples
- Common errors and fixes
- Best practices
- Production patterns
- Implementation checklist
- Summary
1. Prometheus metrics
Counter, Gauge, Histogram
- Counter: monotonically increasing (requests, bytes). Use rate() for per-second increase.
- Gauge: goes up and down (connections, queue length, memory).
- Histogram: distributions (latency). Expose buckets plus sum and count; use histogram_quantile in Prometheus for percentiles.
- Labels: attach labels (e.g. method, path, status) for filtering and grouping. Keep cardinality bounded.
Histogram bucket hints (seconds)
| Service type | Suggested buckets (s) | Notes |
|---|---|---|
| Low-latency API | 0.001, 0.005, 0.01, 0.025, 0.05, 0.1 | ms-scale latency |
| Typical API | 0.005, 0.025, 0.1, 0.5, 1.0, 2.5 | REST/gRPC |
| Batch | 1, 5, 10, 30, 60, 120 | Long jobs |
Prometheus text format example
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api"} 1234
http_requests_total{method="POST",path="/api"} 567
# HELP http_request_duration_seconds Request duration in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 50
http_request_duration_seconds_bucket{le="0.1"} 100
http_request_duration_seconds_bucket{le="0.5"} 200
http_request_duration_seconds_bucket{le="1.0"} 250
http_request_duration_seconds_bucket{le="+Inf"} 300
http_request_duration_seconds_sum 45.2
http_request_duration_seconds_count 300
Collection architecture
flowchart LR
subgraph Cpp["C++ server"]
M[/metrics endpoint]
end
subgraph Prom["Prometheus"]
S[Scrape]
TS[Time series DB]
end
subgraph Graf["Grafana"]
D[Dashboards]
A[Alerts]
end
M -->|HTTP GET| S
S --> TS
TS -->|PromQL| D
TS -->|Alert rules| A
Scrape sequence
sequenceDiagram
participant P as Prometheus
participant C as C++ server
loop scrape_interval (e.g. 15s)
P->>C: GET /metrics
C->>C: call export_metrics()
C->>P: 200 OK, text/plain
P->>P: parse and store in TSDB
end
2. Exposing metrics from C++
Library vs manual
- prometheus-cpp: register Counter/Gauge/Histogram and serialize to text. Under multi-threaded access, protect with atomics or locks.
- Manual: std::atomic counters and per-bucket counts; assemble the text in the /metrics handler. Set Content-Type: text/plain; charset=utf-8 as Prometheus expects.
- Placement: put metrics on an admin port or separate path, and use auth and network isolation so they are not public.
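As one way to keep /metrics off the public surface, the handler can require a Basic Auth header before serving the body. A hedged sketch: the function name and the credential (base64 of a made-up prom:secret) are illustrative, and real deployments should prefer network isolation or mTLS on top.

```cpp
#include <string>

// Reject /metrics requests that do not carry the expected Authorization
// header. "cHJvbTpzZWNyZXQ=" is base64("prom:secret") -- a placeholder
// credential; store and rotate real credentials securely.
bool metrics_authorized(const std::string& auth_header) {
    static const std::string expected = "Basic cHJvbTpzZWNyZXQ=";
    return auth_header == expected;
}
```

On the Prometheus side, the matching scrape_config would carry a basic_auth section with the same credentials.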
Minimal manual example
Increment request_count with fetch_add(1, memory_order_relaxed) on each request; export_metrics() returns a Prometheus line (name value\n). The /metrics handler returns that body. memory_order_relaxed is enough for a simple counter; use seq_cst if you need ordering across metrics.
#include <atomic>
#include <string>
// Conceptual single Counter
std::atomic<uint64_t> request_count{0};
void on_request() {
request_count.fetch_add(1, std::memory_order_relaxed);
}
std::string export_metrics() {
return "http_requests_total " + std::to_string(request_count.load()) + "\n";
}
Manual Counter, Gauge, Histogram
#include <atomic>
#include <string>
#include <sstream>
#include <mutex>
#include <chrono>
// Per-label counters (path, method) — simplified global example
struct Metrics {
std::atomic<uint64_t> requests_total{0};
std::atomic<uint64_t> errors_total{0};
std::atomic<uint64_t> active_connections{0};
std::atomic<uint64_t> queue_length{0};
// Histogram buckets: 5ms, 25ms, 100ms, 500ms, 1s, +Inf
static constexpr double buckets[] = {0.005, 0.025, 0.1, 0.5, 1.0, -1}; // -1 = +Inf
std::atomic<uint64_t> duration_bucket_5ms{0};
std::atomic<uint64_t> duration_bucket_25ms{0};
std::atomic<uint64_t> duration_bucket_100ms{0};
std::atomic<uint64_t> duration_bucket_500ms{0};
std::atomic<uint64_t> duration_bucket_1s{0};
std::atomic<uint64_t> duration_bucket_inf{0};
std::atomic<double> duration_sum{0};
std::atomic<uint64_t> duration_count{0};
void record_request(bool error, double duration_sec) {
requests_total.fetch_add(1, std::memory_order_relaxed);
if (error) errors_total.fetch_add(1, std::memory_order_relaxed);
auto add_bucket = [this](std::atomic<uint64_t>& b) {
b.fetch_add(1, std::memory_order_relaxed);
};
// Prometheus buckets are cumulative: increment every bucket whose le >= value
if (duration_sec <= 0.005) add_bucket(duration_bucket_5ms);
if (duration_sec <= 0.025) add_bucket(duration_bucket_25ms);
if (duration_sec <= 0.1) add_bucket(duration_bucket_100ms);
if (duration_sec <= 0.5) add_bucket(duration_bucket_500ms);
if (duration_sec <= 1.0) add_bucket(duration_bucket_1s);
add_bucket(duration_bucket_inf); // +Inf counts every observation
// std::atomic<double> has no fetch_add before C++20, so CAS-loop the sum
double expected = duration_sum.load(std::memory_order_relaxed);
while (!duration_sum.compare_exchange_weak(
expected, expected + duration_sec, std::memory_order_relaxed)) {
// compare_exchange_weak refreshes expected on failure
}
duration_count.fetch_add(1, std::memory_order_relaxed);
}
void connection_opened() {
active_connections.fetch_add(1, std::memory_order_relaxed);
}
void connection_closed() {
active_connections.fetch_sub(1, std::memory_order_relaxed);
}
void queue_inc() { queue_length.fetch_add(1, std::memory_order_relaxed); }
void queue_dec() { queue_length.fetch_sub(1, std::memory_order_relaxed); }
std::string export_prometheus() const {
std::ostringstream out;
out << "# HELP http_requests_total Total HTTP requests\n";
out << "# TYPE http_requests_total counter\n";
out << "http_requests_total " << requests_total.load() << "\n";
out << "# HELP http_errors_total Total HTTP errors\n";
out << "# TYPE http_errors_total counter\n";
out << "http_errors_total " << errors_total.load() << "\n";
out << "# HELP http_active_connections Active connections\n";
out << "# TYPE http_active_connections gauge\n";
out << "http_active_connections " << active_connections.load() << "\n";
out << "# HELP http_queue_length Current queue length\n";
out << "# TYPE http_queue_length gauge\n";
out << "http_queue_length " << queue_length.load() << "\n";
out << "# HELP http_request_duration_seconds Request duration\n";
out << "# TYPE http_request_duration_seconds histogram\n";
out << "http_request_duration_seconds_bucket{le=\"0.005\"} " << duration_bucket_5ms.load() << "\n";
out << "http_request_duration_seconds_bucket{le=\"0.025\"} " << duration_bucket_25ms.load() << "\n";
out << "http_request_duration_seconds_bucket{le=\"0.1\"} " << duration_bucket_100ms.load() << "\n";
out << "http_request_duration_seconds_bucket{le=\"0.5\"} " << duration_bucket_500ms.load() << "\n";
out << "http_request_duration_seconds_bucket{le=\"1\"} " << duration_bucket_1s.load() << "\n";
out << "http_request_duration_seconds_bucket{le=\"+Inf\"} " << duration_bucket_inf.load() << "\n";
out << "http_request_duration_seconds_sum " << duration_sum.load() << "\n";
out << "http_request_duration_seconds_count " << duration_count.load() << "\n";
return out.str();
}
};
prometheus-cpp example
#include <prometheus/counter.h>
#include <prometheus/gauge.h>
#include <prometheus/histogram.h>
#include <prometheus/registry.h>
#include <prometheus/exposer.h>
#include <memory>
int main() {
// Expose /metrics on port 8080
prometheus::Exposer exposer{"127.0.0.1:8080"};
auto registry = std::make_shared<prometheus::Registry>();
// Counter with labels for path and method
auto& request_counter = prometheus::BuildCounter()
.Name("http_requests_total")
.Help("Total HTTP requests")
.Labels({{"service", "cpp-server"}})
.Register(*registry);
auto& get_requests = request_counter.Add({{"method", "GET"}, {"path", "/api"}});
auto& post_requests = request_counter.Add({{"method", "POST"}, {"path", "/api"}});
// Gauge: active connections
auto& conn_gauge = prometheus::BuildGauge()
.Name("http_active_connections")
.Help("Active connections")
.Register(*registry);
// Histogram: latency buckets (5ms, 25ms, 100ms, 500ms, 1s) are passed per
// label set via Add(), not on the builder
auto& duration_hist = prometheus::BuildHistogram()
.Name("http_request_duration_seconds")
.Help("Request duration")
.Register(*registry);
auto& get_duration = duration_hist.Add({{"method", "GET"}}, prometheus::Histogram::BucketBoundaries{0.005, 0.025, 0.1, 0.5, 1.0});
exposer.RegisterCollectable(registry);
// During request handling
get_requests.Increment();
conn_gauge.Increment();
auto start = std::chrono::steady_clock::now();
// ... handle request ...
auto elapsed = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
get_duration.Observe(elapsed);
conn_gauge.Decrement();
return 0;
}
Building prometheus-cpp
# vcpkg (recommended)
vcpkg install prometheus-cpp
# CMakeLists.txt
find_package(prometheus-cpp CONFIG REQUIRED)
target_link_libraries(my_server prometheus::prometheus)
# Or FetchContent without a submodule
include(FetchContent)
FetchContent_Declare(
prometheus-cpp
GIT_REPOSITORY https://github.com/jupp0r/prometheus-cpp.git
GIT_TAG v1.2.2
)
FetchContent_MakeAvailable(prometheus-cpp)
target_link_libraries(my_server prometheus::prometheus)
3. Prometheus configuration and scraping
Basic prometheus.yml
global:
scrape_interval: 15s # default scrape interval
evaluation_interval: 15s # alert rule evaluation
alerting:
alertmanagers:
- static_configs:
- targets: []
rule_files: []
scrape_configs:
- job_name: 'cpp-server'
scrape_interval: 10s # scrape C++ server every 10s
scrape_timeout: 5s
static_configs:
- targets: ['localhost:8080']
labels:
env: 'production'
service: 'cpp-api'
Dynamic targets (service discovery)
# Scrape C++ pods in Kubernetes
scrape_configs:
- job_name: 'cpp-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: ${1}:${2}
target_label: __address__
4. Grafana integration
Data source
- Add Prometheus as a Grafana data source and query with PromQL.
- URL: http://prometheus:9090 (Docker/K8s) or http://localhost:9090
Useful PromQL examples
# Requests per second
rate(http_requests_total[5m])
# p99 latency (seconds)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate (%)
100 * sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m]))
# Active connections (Gauge — no rate)
http_active_connections
# Queue length
http_queue_length
Dashboard panels
- Graphs: RPS, latency percentiles (p50, p95, p99), error rate over time
- Single stat: current connections, queue length
- Table: requests by path, errors by method
- Alerts: e.g. p99 > 1s, error rate > 5% → Slack/email
More PromQL
# p50, p95, p99
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Average latency (sum/count)
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# RPS per instance
sum by (instance) (rate(http_requests_total[5m]))
# Errors in 5 minutes
increase(http_errors_total[5m])
Grafana alert channel (Slack example)
# Configuration → Alerting → Contact points → New contact point
# Type: Slack
# Webhook URL: https://hooks.slack.com/services/xxx/yyy/zzz
# Channel: #alerts-cpp-server
Dashboard variables (filter by instance)
# Dashboard Settings → Variables → New variable
# Name: instance
# Type: Query
# Data source: Prometheus
# Query: label_values(http_requests_total, instance)
# Multi-value: Yes
# In panel queries: {instance=~"$instance"}
5. End-to-end Prometheus + Grafana examples
Full stack with Docker Compose
# docker-compose.yml
version: '3.8'
services:
cpp-server:
build: .
ports:
- "8080:8080"
environment:
- METRICS_PORT=8080
prometheus:
image: prom/prometheus:v2.47.0
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=15d'
grafana:
image: grafana/grafana:10.2.0
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana-data:/var/lib/grafana
depends_on:
- prometheus
volumes:
grafana-data:
C++ server + /metrics (Boost.Beast sketch)
#include <boost/beast/core.hpp>
#include <boost/beast/http.hpp>
#include <boost/asio.hpp>
#include <atomic>
#include <chrono>
#include <string>
#include <thread>
namespace beast = boost::beast;
namespace http = beast::http;
namespace net = boost::asio;
// Global metrics (prefer singleton or DI in production)
std::atomic<uint64_t> g_requests_total{0};
std::atomic<uint64_t> g_errors_total{0};
std::atomic<uint64_t> g_active_connections{0};
void handle_metrics(http::request<http::string_body> const& req,
http::response<http::string_body>& res) {
res.set(http::field::content_type, "text/plain; charset=utf-8");
res.body() = "# HELP http_requests_total Total requests\n"
"# TYPE http_requests_total counter\n"
"http_requests_total " + std::to_string(g_requests_total.load()) + "\n"
"# HELP http_errors_total Total errors\n"
"# TYPE http_errors_total counter\n"
"http_errors_total " + std::to_string(g_errors_total.load()) + "\n"
"# HELP http_active_connections Active connections\n"
"# TYPE http_active_connections gauge\n"
"http_active_connections " + std::to_string(g_active_connections.load()) + "\n";
res.prepare_payload();
}
// For /metrics call handle_metrics; other paths run business logic
Grafana dashboard JSON (core panels)
{
"panels": [
{
"title": "RPS",
"type": "timeseries",
"targets": [{
"expr": "rate(http_requests_total[5m])",
"legendFormat": "{{instance}}"
}]
},
{
"title": "p99 latency (s)",
"type": "timeseries",
"targets": [{
"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "p99"
}]
},
{
"title": "Error rate (%)",
"type": "timeseries",
"targets": [{
"expr": "100 * sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m]))",
"legendFormat": "error_rate"
}]
},
{
"title": "Active connections",
"type": "stat",
"targets": [{
"expr": "http_active_connections",
"legendFormat": "connections"
}]
}
]
}
6. Common errors and fixes
1. Prometheus: “connection refused” or “context deadline exceeded”
Cause: /metrics port closed, firewall, or network isolation.
Fix:
curl -v http://localhost:8080/metrics
docker exec prometheus wget -qO- http://cpp-server:8080/metrics
# Use Docker service names or K8s Service names
scrape_configs:
- job_name: 'cpp-server'
static_configs:
- targets: ['cpp-server:8080']
2. “parse error” or “invalid character”
Cause: Text format does not match Prometheus exposition format.
Fix:
# Bad: commas, spaces, bad escaping
http_requests_total 1234,567
# Good
# HELP http_requests_total Total requests
# TYPE http_requests_total counter
http_requests_total 1234
http_requests_total{path="/api"} 100
- Include HELP and TYPE where appropriate
- Escape " and \ inside label values
- One sample per line: name{labels} value or name value
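The escaping rule above fits in a tiny helper; the function name is illustrative:

```cpp
#include <string>

// Escape a label value for the Prometheus text exposition format:
// backslash -> \\, double quote -> \", newline -> \n.
std::string escape_label_value(const std::string& value) {
    std::string out;
    out.reserve(value.size());
    for (char c : value) {
        switch (c) {
            case '\\': out += "\\\\"; break;
            case '"':  out += "\\\""; break;
            case '\n': out += "\\n";  break;
            default:   out += c;
        }
    }
    return out;
}
```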
3. Grafana: “No data”
Cause: PromQL typo, time range, or metric name mismatch.
Fix:
# First confirm the metric names exist
{__name__=~"http_.*"}
# Then verify the full expressions
rate(http_requests_total[5m])
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
4. Label cardinality explosion
Cause: Using raw paths or user IDs as labels.
Fix:
// Risky: thousands of paths → thousands of series
request_counter.Add({{"path", user_provided_path}});
// Safer: normalize templates
std::string normalize_path(const std::string& path) {
if (path.find("/api/users/") == 0) return "/api/users/:id";
if (path.find("/api/orders/") == 0) return "/api/orders/:id";
return path;
}
5. Histogram races with atomics
Cause: Bucket updates and sum/count not consistent as a group.
Fix: use atomics per bucket and CAS loop for sum, or one mutex around record_request.
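The mutex variant can look like this sketch: one lock makes buckets, sum, and count move as a unit. Class and member names are illustrative.

```cpp
#include <array>
#include <cstdint>
#include <mutex>

// Histogram whose buckets, sum, and count always update together under one
// lock. Simpler than per-field atomics; fine unless the lock is contended.
class LockedHistogram {
public:
    void observe(double v) {
        std::lock_guard<std::mutex> lock(mu_);
        for (std::size_t i = 0; i < kBounds.size(); ++i)
            if (v <= kBounds[i]) ++buckets_[i];  // cumulative buckets
        ++inf_bucket_;
        sum_ += v;
        ++count_;
    }
    std::uint64_t count() const {
        std::lock_guard<std::mutex> lock(mu_);
        return count_;
    }
    std::uint64_t bucket(std::size_t i) const {  // for export and testing
        std::lock_guard<std::mutex> lock(mu_);
        return buckets_[i];
    }
private:
    static constexpr std::array<double, 5> kBounds{0.005, 0.025, 0.1, 0.5, 1.0};
    mutable std::mutex mu_;
    std::array<std::uint64_t, 5> buckets_{};
    std::uint64_t inf_bucket_ = 0;
    double sum_ = 0;
    std::uint64_t count_ = 0;
};
```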
6. Slow /metrics (hundreds of ms)
Cause: Heavy allocation or lock contention during export.
Fix: cache the serialized text and rebuild it at most once per short interval instead of serializing on every scrape.
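One way to implement that caching is a TTL guard around the serializer. A sketch with illustrative names; the 1s default TTL is an assumption, tune it to stay below your scrape interval.

```cpp
#include <chrono>
#include <functional>
#include <mutex>
#include <string>
#include <utility>

// Cache the serialized /metrics body and rebuild it at most once per TTL,
// so frequent scrapes never pay the full serialization cost.
class MetricsCache {
public:
    explicit MetricsCache(std::function<std::string()> build,
                          std::chrono::milliseconds ttl = std::chrono::milliseconds(1000))
        : build_(std::move(build)), ttl_(ttl) {}

    std::string get() {
        auto now = std::chrono::steady_clock::now();
        std::lock_guard<std::mutex> lock(mu_);
        if (!valid_ || now - built_at_ > ttl_) {
            cached_ = build_();  // expensive serialization happens here
            built_at_ = now;
            valid_ = true;
        }
        return cached_;
    }

private:
    std::function<std::string()> build_;
    std::chrono::milliseconds ttl_;
    std::mutex mu_;
    std::string cached_;
    std::chrono::steady_clock::time_point built_at_{};
    bool valid_ = false;
};
```

The lock also serializes concurrent rebuilds, so only one scrape ever pays the build cost per TTL window.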
7. “Out of order” or duplicate samples
Cause: Clock jumps after restart, or duplicate scrape jobs for the same target.
Fix: make sure each target appears in exactly one scrape job; out-of-order errors usually come from duplicate jobs or from emitting explicit timestamps in the exposition output.
7. Best practices
Naming
- Counters end in _total (e.g. http_requests_total)
- Encode units in the name: _seconds, _bytes, etc.
- Use lowercase snake_case throughout
Labels
- Bound cardinality (hundreds of combinations, not millions)
- Static labels: env, service, region
- Avoid high-cardinality dynamic values (user_id, request_id)
Scrape intervals
- App: 10–15s
- Infra: 30s–1m
- Expensive metrics: 1–5m
Securing /metrics
- Internal networks only
- Basic Auth or mTLS
- Separate port from user-facing API (e.g. 8080 API, 9090 metrics)
Performance
| Topic | Recommendation |
|---|---|
| Atomics | memory_order_relaxed when order across metrics does not matter |
| Histogram | Prefer atomic buckets on hot paths over a global mutex |
| Export | Minimize string building on each scrape |
| Labels | Keep label count small; cardinality < 100 for typical setups |
8. Production patterns
Pattern 1: separate metrics port
// API on :8080, metrics on :9090; bind the metrics acceptor to an internal IP
void run_metrics_server(net::io_context& ctx, const std::string& bind_addr, uint16_t port) {
net::ip::tcp::acceptor acceptor(ctx, {net::ip::make_address(bind_addr), port});
// accept loop omitted: serve GET /metrics only (reuse handle_metrics above)
}
Pattern 2: initialize metrics at startup (series exist from the first scrape, so rate() has a baseline)
void init_metrics() {
g_requests_total.store(0);
g_errors_total.store(0);
}
Pattern 3: RAII request scope
struct ScopedRequestMetrics {
Metrics& m;
std::chrono::steady_clock::time_point start;
bool error = false;
ScopedRequestMetrics(Metrics& metrics) : m(metrics), start(std::chrono::steady_clock::now()) {
m.connection_opened();
}
~ScopedRequestMetrics() {
auto dur = std::chrono::duration<double>(
std::chrono::steady_clock::now() - start).count();
m.record_request(error, dur);
m.connection_closed();
}
};
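To show the RAII pattern end to end, here is a self-contained miniature: DemoMetrics is a stand-in that simplifies the Metrics struct from section 2, and the handler records exactly one request on every exit path.

```cpp
#include <chrono>
#include <cstdint>

// Minimal stand-in for the full Metrics struct, just enough to demonstrate
// the RAII scope; real code would plug in the struct from section 2.
struct DemoMetrics {
    std::uint64_t requests = 0, errors = 0;
    void record_request(bool error, double /*duration_sec*/) {
        ++requests;
        if (error) ++errors;
    }
};

struct ScopedDemo {
    DemoMetrics& m;
    std::chrono::steady_clock::time_point start{std::chrono::steady_clock::now()};
    bool error = false;
    explicit ScopedDemo(DemoMetrics& metrics) : m(metrics) {}
    ~ScopedDemo() {
        double dur = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
        m.record_request(error, dur);  // runs on every exit path
    }
};

// Early return or exception: the destructor still records the request.
bool handle(DemoMetrics& m, bool valid) {
    ScopedDemo scope(m);
    if (!valid) { scope.error = true; return false; }
    return true;
}
```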
Pattern 4: Prometheus alert rules
groups:
- name: cpp-server
rules:
- alert: HighErrorRate
expr: 100 * sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 5
for: 2m
labels:
severity: critical
annotations:
summary: "C++ server error rate {{ $value | humanize }}% exceeded"
- alert: HighLatency
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "p99 latency exceeded 1s"
9. Implementation checklist
- Implement /metrics on the C++ server
- Expose Counters (requests, errors)
- Expose Gauges (connections, queue length)
- Expose Histogram (latency) with sensible buckets
- Set Content-Type: text/plain; charset=utf-8
- Add a scrape_config in prometheus.yml
- Add Prometheus data source in Grafana
- Panels for RPS, p99, error rate, connections
- Alert rules (errors, latency)
- Secure metrics port/path
- Review label cardinality
10. Summary
| Topic | Summary |
|---|---|
| Prometheus | Counter/Gauge/Histogram, labels, pull, text exposition |
| C++ | Atomics, library or manual serialization, /metrics handler |
| Grafana | PromQL, dashboards, alerts |
| Production | Split ports, alert rules, bounded labels |
Series 43 covered gRPC/Protobuf → secure coding/OpenSSL → Observability (Prometheus + Grafana) for large distributed systems.
Related posts (internal links)
- Rust vs C++ memory safety #47-3
- C++ network errors #28-3
- Clang-Tidy and Cppcheck #41-1
Practical tips
- Start from compiler warnings and minimal reproducers when debugging.
- Measure before optimizing; define SLOs and metrics first.
- Align code review with team conventions.
Keywords (SEO)
Prometheus, Grafana, C++ monitoring, prometheus-cpp, metrics, Observability
FAQ
When is this useful in production?
A. Whenever you run C++ services in production and need scrapeable metrics, dashboards, and alerts. Use the examples above as templates.
prometheus-cpp vs manual?
A. prometheus-cpp: full feature set and labels when you can take the dependency. Manual: minimal deps, embedded systems, or very small metric sets.
What should I read next?
A. Follow the Previous/Next links at the bottom of each article, or the C++ series index.
Where to go deeper?
A. Prometheus docs, Grafana docs, prometheus-cpp.
One-line summary: Prometheus scrapes your C++ /metrics; Grafana turns them into dashboards and alerts. Next: C++26 preview #44-1.
Previous: Secure coding & OpenSSL #43-2
Next: C++26 preview #44-1
Related posts
- constexpr basics #43-1
- gRPC & Protobuf #43-1
- Advanced constexpr #43-2
- Secure coding & OpenSSL #43-2
- Monitoring dashboard #50-6