C++ Real-Time Monitoring Dashboard | Grafana and Prometheus Integration [#50-6]

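March 27, 2026 · 25 min read · Updated March 27, 2026 · Advanced, hands-on

Key points

A practical guide to building a real-time monitoring dashboard for C++ services. It walks through Grafana and Prometheus integration [#50-6] with examples.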

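Introduction: seeing "why it slowed down and when it broke" in the data

Without monitoring, you cannot respond to incidents

Once a C++ REST API server or game server is in production, you have probably heard "it suddenly got slow" or "there was an error early yesterday morning and we don't know why." Logs alone make it hard to see trends, percentiles, and live state. When Prometheus collects metrics and Grafana visualizes them on a dashboard, you can see at a glance when, where, and how badly something went wrong.

What this article covers:

Exposing Prometheus-format metrics from a C++ server (Counter, Gauge, Histogram)
Designing a Grafana dashboard and building practical panels
Setting up alert rules and notification channels
Common errors and how to fix them
Performance optimization patterns for production

Requirements: C++17 or later, an HTTP server (Boost.Beast or similar)

After reading this article you will be able to:

Build a real-time monitoring dashboard.
Detect incidents and track SLAs.
Reach production-grade observability.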

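Problem scenarios: when you need monitoring

Scenario 1: an overnight outage with no known cause
A customer reports, "Around 3 a.m. yesterday the API did not respond for five minutes." Digging through logs rarely tells you when it started, which endpoint was affected, or how many requests failed. With error-rate and latency graphs on a Grafana dashboard, you can jump straight to that time window.

Scenario 2: gradual performance degradation
The server runs fine at first, then response times creep up after a few days. The cause could be a memory leak, connection pool exhaustion, a disk I/O bottleneck, or something else. Tracking p50, p95, and p99 latency with a Prometheus Histogram shows when the trend started to worsen.

Scenario 3: reacting to a traffic spike
When traffic grows tenfold for Black Friday or a promotion, you miss the moment to scale out unless you can watch RPS, connection count, and CPU/memory in real time. If the Grafana dashboard raises an alert when a threshold is crossed, you can react without checking by hand.

Scenario 4: guaranteeing an SLA
If you have promised customers an SLA such as "99.9% availability" or "p99 latency under 100ms", you must measure uptime, error rate, and percentiles continuously. Collecting and visualizing these with Prometheus + Grafana lets you prove SLA compliance with data.

Scenario 5: comparing multiple instances
With 10 pods running on Kubernetes, a single pod may show high memory use or a high error rate. Collecting metrics with labels (instance, pod) makes the problem instance quick to identify.

An analogy for the concept

The topic of this article is easiest to understand as a system of interlocking parts. A choice in one layer (storage, network, observability) affects the neighboring layers, so the article lays out the trade-offs with numbers and patterns.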

1. System architecture

Overall structure

Data flows through the pipeline C++ application → Prometheus → Grafana.

flowchart LR
    subgraph Cpp["C++ application"]
        M1["/metrics\nendpoint"]
        M2["Counter\nGauge\nHistogram"]
        M1 --> M2
    end

    subgraph Prom["Prometheus"]
        P1["Scrape\n(periodic pull)"]
        P2["TSDB\n(time-series DB)"]
        P1 --> P2
    end

    subgraph Graf["Grafana"]
        G1["Dashboards"]
        G2["Alerts"]
        G1 --> G2
    end

    Cpp -->|"HTTP GET\ntext format"| Prom
    Prom -->|"PromQL\nqueries"| Graf

Key points:

Pull model: Prometheus periodically requests /metrics from the C++ server and ingests the result.
Text format: the server returns plain text in the Prometheus exposition format, one sample per line as name{labels} value.
Labels: labels such as method, path, and status split a metric into series that can be filtered and grouped.
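
To make the name{labels} value shape concrete, here is a minimal sketch that renders a single sample line. The function name format_sample is hypothetical (it is not part of the registry in this article), it only handles integer values, and it assumes label values need no escaping.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Renders one sample line: name{k1="v1",k2="v2"} value
// std::map iterates in key order, so the output is deterministic.
// Assumption: label values contain no characters that need escaping.
std::string format_sample(const std::string& name,
                          const std::map<std::string, std::string>& labels,
                          std::uint64_t value) {
    assert(!name.empty());  // every sample line starts with a metric name
    std::ostringstream oss;
    oss << name;
    if (!labels.empty()) {
        oss << '{';
        bool first = true;
        for (const auto& kv : labels) {
            if (!first) oss << ',';
            oss << kv.first << "=\"" << kv.second << '"';
            first = false;
        }
        oss << '}';
    }
    oss << ' ' << value << '\n';
    return oss.str();
}
```

Called with the name http_requests_total, the labels method=GET and status=200, and the value 42, this yields the line http_requests_total{method="GET",status="200"} 42.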

Sequence diagram

sequenceDiagram
    participant Cpp as C++ server
    participant Prom as Prometheus
    participant Graf as Grafana

    loop scrape_interval (15s)
        Prom->>Cpp: GET /metrics
        Cpp-->>Prom: 200 OK (text metrics)
        Prom->>Prom: Store in TSDB
    end

    User->>Graf: Open dashboard
    Graf->>Prom: PromQL query (rate, histogram_quantile, ...)
    Prom-->>Graf: Time-series data
    Graf-->>User: Render graphs/panels

    alt Threshold exceeded
        Graf->>Graf: Alert rule fires
        Graf->>User: Notification (Slack, email, ...)
    end

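2. Implementing Prometheus metrics

Metric types at a glance

Type	Use	Example
Counter	Monotonically increasing (request count, bytes)	`http_requests_total`
Gauge	Can go up and down (connections, queue length)	`active_connections`
Histogram	Distribution (latency)	`http_request_duration_seconds`

C++ metrics registry implementation

The registry manages Counter, Gauge, and Histogram in a thread-safe way and serializes them to the Prometheus text format.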

// metrics_registry.hpp
#pragma once

#include <atomic>
#include <cmath>
#include <cstdint>
#include <limits>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

namespace monitoring {

// Counter: monotonically increasing (request count, error count).
// Stored as fixed-point (value * 1e9) so an integer atomic can hold fractional increments.
class Counter {
public:
    void inc(double delta = 1.0) {
        value_.fetch_add(
            static_cast<uint64_t>(delta * 1e9),
            std::memory_order_relaxed
        );
    }
    // Returns the whole part of the count; any fractional part is dropped.
    uint64_t get() const {
        return value_.load(std::memory_order_relaxed) / 1000000000ULL;
    }
private:
    std::atomic<uint64_t> value_{0};
};

// Gauge: can go up and down (connections, memory)
class Gauge {
public:
    void set(double v) {
        value_.store(
            static_cast<int64_t>(v * 1e9),
            std::memory_order_relaxed
        );
    }
    void inc(double delta = 1.0) {
        value_.fetch_add(
            static_cast<int64_t>(delta * 1e9),
            std::memory_order_relaxed
        );
    }
    void dec(double delta = 1.0) {
        value_.fetch_sub(
            static_cast<int64_t>(delta * 1e9),
            std::memory_order_relaxed
        );
    }
    double get() const {
        return static_cast<double>(
            value_.load(std::memory_order_relaxed)
        ) / 1e9;
    }
private:
    std::atomic<int64_t> value_{0};
};

// Histogram: latency distribution (bucket-based).
// Buckets are cumulative: the bucket for le=b counts every observation <= b.
class Histogram {
public:
    explicit Histogram(const std::vector<double>& buckets)
        : buckets_(buckets) {
        for (double b : buckets_) {
            counts_[b] = 0;
        }
        counts_[std::numeric_limits<double>::infinity()] = 0;
    }

    // value must be non-negative (the sum is stored as unsigned fixed-point).
    void observe(double value) {
        std::lock_guard<std::mutex> lock(mutex_);
        sum_.fetch_add(static_cast<uint64_t>(value * 1e9),
                       std::memory_order_relaxed);
        count_.fetch_add(1, std::memory_order_relaxed);
        // No break here: increment every bucket whose upper bound covers the value.
        for (double b : buckets_) {
            if (value <= b) {
                ++counts_[b];
            }
        }
        ++counts_[std::numeric_limits<double>::infinity()];
    }

    std::string export_prometheus(const std::string& name,
                                   const std::map<std::string, std::string>& labels) const;

private:
    std::vector<double> buckets_;
    std::map<double, uint64_t> counts_;
    std::atomic<uint64_t> sum_{0};
    std::atomic<uint64_t> count_{0};
    mutable std::mutex mutex_;
};

// Registry: central owner of all metrics
class MetricsRegistry {
public:
    Counter& counter(const std::string& name,
                     const std::map<std::string, std::string>& labels = {});
    Gauge& gauge(const std::string& name,
                 const std::map<std::string, std::string>& labels = {});
    Histogram& histogram(const std::string& name,
                        const std::vector<double>& buckets,
                        const std::map<std::string, std::string>& labels = {});

    std::string export_prometheus() const;

private:
    std::string format_labels(const std::map<std::string, std::string>& labels) const;

    mutable std::mutex mutex_;
    std::map<std::string, std::shared_ptr<Counter>> counters_;
    std::map<std::string, std::shared_ptr<Gauge>> gauges_;
    std::map<std::string, std::shared_ptr<Histogram>> histograms_;
};

} // namespace monitoring
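
Prometheus histogram buckets are cumulative: the series with le="0.5" counts every observation at or below 0.5, not only those between the previous bound and 0.5. The sketch below illustrates that rule in isolation. The function cumulative_counts is a hypothetical helper, separate from the Histogram class, and it assumes the bounds are sorted in ascending order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns one cumulative count per upper bound, plus a final entry for le="+Inf".
// Assumption of this sketch: bounds are sorted ascending.
std::vector<std::uint64_t> cumulative_counts(
        const std::vector<double>& bounds,
        const std::vector<double>& observations) {
    assert(std::is_sorted(bounds.begin(), bounds.end()));
    std::vector<std::uint64_t> counts(bounds.size() + 1, 0);
    for (double value : observations) {
        for (std::size_t i = 0; i < bounds.size(); ++i) {
            // No early exit: every bucket wide enough to hold the value counts it.
            if (value <= bounds[i]) ++counts[i];
        }
        ++counts.back();  // le="+Inf" counts every observation
    }
    return counts;
}
```

For bounds 0.1, 0.5, 1.0 and observations 0.05, 0.3, 0.7, 2.0, the counts are 1, 2, 3 and 4 for +Inf. The +Inf value always equals the total number of observations.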

Metric serialization (Prometheus format)

// metrics_registry.cpp
#include "metrics_registry.hpp"
#include <sstream>

namespace monitoring {

namespace {
// Shared helper: renders {k1="v1",k2="v2"}, or an empty string when there are no labels.
// Label values are written as-is; escape them first if they can contain ", \ or newlines.
std::string format_label_set(const std::map<std::string, std::string>& labels) {
    if (labels.empty()) return "";
    std::ostringstream oss;
    oss << "{";
    bool first = true;
    for (const auto& [k, v] : labels) {
        if (!first) oss << ",";
        oss << k << "=\"" << v << "\"";
        first = false;
    }
    oss << "}";
    return oss.str();
}
} // namespace

// The member function delegates to the shared helper instead of duplicating it.
std::string MetricsRegistry::format_labels(
    const std::map<std::string, std::string>& labels) const {
    return format_label_set(labels);
}

std::string Histogram::export_prometheus(
    const std::string& name,
    const std::map<std::string, std::string>& labels) const {
    std::lock_guard<std::mutex> lock(mutex_);
    std::ostringstream oss;
    // counts_ is ordered by bound, so +Inf is written last.
    for (const auto& [bucket, cnt] : counts_) {
        auto lbl = labels;
        lbl["le"] = std::isinf(bucket) ? "+Inf" : std::to_string(bucket);
        oss << name << "_bucket" << format_label_set(lbl) << " " << cnt << "\n";
    }
    oss << name << "_sum" << format_label_set(labels) << " "
        << (sum_.load() / 1e9) << "\n";
    oss << name << "_count" << format_label_set(labels) << " "
        << count_.load() << "\n";
    return oss.str();
}

std::string MetricsRegistry::export_prometheus() const {
    std::lock_guard<std::mutex> lock(mutex_);
    std::ostringstream oss;
    // Simplification: only one HELP/TYPE header is written, for http_requests_total.
    // A full exporter writes one HELP/TYPE pair per metric family.
    oss << "# HELP http_requests_total Total HTTP requests\n";
    oss << "# TYPE http_requests_total counter\n";
    for (const auto& [key, c] : counters_) {
        oss << key << " " << c->get() << "\n";
    }
    for (const auto& [key, g] : gauges_) {
        oss << key << " " << g->get() << "\n";
    }
    for (const auto& [key, h] : histograms_) {
        // Histogram export is omitted here; it would call
        // Histogram::export_prometheus above for each entry.
    }
    return oss.str();
}

} // namespace monitoring
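
The header declares counter(), gauge(), and histogram(), but the listing does not show how they find or create a series. One plausible approach, sketched here under that assumption, is to key each series by its name plus its rendered label set and create it on first use. SimpleCounter, CounterTable, and series_key are hypothetical names for this illustration, not the article's classes.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>

using Labels = std::map<std::string, std::string>;

// Builds the map key, e.g. http_requests_total{method="GET",status="200"}.
std::string series_key(const std::string& name, const Labels& labels) {
    assert(!name.empty());
    std::string key = name;
    if (labels.empty()) return key;
    key += '{';
    bool first = true;
    for (const auto& kv : labels) {
        if (!first) key += ',';
        key += kv.first + "=\"" + kv.second + "\"";
        first = false;
    }
    key += '}';
    return key;
}

// Hypothetical stand-in for a counter, reduced to integer increments.
struct SimpleCounter {
    std::atomic<std::uint64_t> value{0};
    void inc() { value.fetch_add(1, std::memory_order_relaxed); }
};

class CounterTable {
public:
    // Returns the existing series for (name, labels) or creates it on first use.
    // The mutex guards only the map lookup; increments stay lock-free.
    SimpleCounter& counter(const std::string& name, const Labels& labels = {}) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::shared_ptr<SimpleCounter>& slot = counters_[series_key(name, labels)];
        if (!slot) slot = std::make_shared<SimpleCounter>();
        return *slot;
    }

private:
    std::mutex mutex_;
    std::map<std::string, std::shared_ptr<SimpleCounter>> counters_;
};
```

Because the lookup takes a lock and builds a string on every call, a hot path would probably resolve the reference once and keep it, rather than looking it up per request.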

Exposing metrics from the HTTP handler

// main.cpp - /metrics endpoint
#include "metrics_registry.hpp"
#include <boost/beast.hpp>
#include <iostream>

namespace beast = boost::beast;
namespace http = beast::http;

monitoring::MetricsRegistry g_registry;

void handle_metrics(beast::tcp_stream& stream,
                    http::request<http::string_body> const& req) {
    std::string body = g_registry.export_prometheus();

    http::response<http::string_body> res{http::status::ok, req.version()};
    res.set(http::field::server, "CppServer/1.0");
    res.set(http::field::content_type,
            "text/plain; charset=utf-8; version=0.0.4");
    res.body() = body;
    res.prepare_payload();

    http::write(stream, res);
}

Notes:

Content-Type: text/plain with version=0.0.4 is what Prometheus expects for the text exposition format.
For security, restrict /metrics so it is reachable only from the internal network.

3. Building the Grafana dashboard

Connecting the Prometheus data source

Grafana → Configuration → Data Sources → Add data source
Select Prometheus
URL: http://prometheus:9090 (Docker environment) or http://localhost:9090
Save & Test

Core panels

Panel	PromQL	Description
RPS	`rate(http_requests_total[5m])`	Requests per second
Error rate	`sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100`	Share of 5xx responses (%)
p99 latency	`histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))`	99th percentile latency
Active connections	`active_connections`	Current connection count

PromQL query examples

# RPS (per path)
sum(rate(http_requests_total[5m])) by (path)

# Error rate (5xx)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# p50, p95, p99 latency (per path)
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
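
It helps to know roughly how histogram_quantile arrives at a number: it finds the bucket that contains the requested rank and interpolates linearly inside it. The sketch below is a simplified model of that idea, not Prometheus's actual implementation. The function estimate_quantile is hypothetical, it assumes the lowest bucket starts at 0, and it requires at least one observation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// bounds: finite upper bounds, ascending.
// cumulative: one cumulative count per bound plus a final entry for le="+Inf",
// so cumulative.size() == bounds.size() + 1.
double estimate_quantile(double q,
                         const std::vector<double>& bounds,
                         const std::vector<std::uint64_t>& cumulative) {
    assert(!bounds.empty() && cumulative.size() == bounds.size() + 1);
    assert(cumulative.back() > 0);  // sketch assumes at least one observation
    const double rank = q * static_cast<double>(cumulative.back());
    for (std::size_t i = 0; i < bounds.size(); ++i) {
        const double upper_count = static_cast<double>(cumulative[i]);
        if (upper_count >= rank) {
            // Assumption: the first bucket's lower edge is 0.
            const double lower_bound = (i == 0) ? 0.0 : bounds[i - 1];
            const double lower_count =
                (i == 0) ? 0.0 : static_cast<double>(cumulative[i - 1]);
            if (upper_count == lower_count) return bounds[i];  // empty bucket
            return lower_bound + (bounds[i] - lower_bound) *
                   (rank - lower_count) / (upper_count - lower_count);
        }
    }
    // Rank falls in the +Inf bucket: report the highest finite bound.
    return bounds.back();
}
```

With bounds 0.1, 0.5, 1.0 and cumulative counts 50, 90, 100 (100 in total), the p95 estimate is 0.5 + (1.0 - 0.5) × (95 - 90) / (100 - 90) = 0.75. The result is only as precise as the bucket that contains it, which is why bucket boundaries matter.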

Dashboard JSON example (partial)

{
  "panels": [
    {
      "title": "RPS (requests/sec)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (method, path)",
          "legendFormat": "{{method}} {{path}}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps",
          "min": 0
        }
      }
    },
    {
      "title": "p99 latency (seconds)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))",
          "legendFormat": "{{path}}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "s",
          "min": 0
        }
      }
    }
  ]
}

4. Complete monitoring examples

Example 1: adding metrics to a REST API server

A middleware helper that records request count, latency, and errors.

// metrics_middleware.hpp
#pragma once

#include "metrics_registry.hpp"
#include <chrono>
#include <string>

// Process-wide registry. The /metrics handler must serialize this same instance
// (for example via get_metrics().export_prometheus()), or recorded values never appear.
inline monitoring::MetricsRegistry& get_metrics() {
    static monitoring::MetricsRegistry reg;
    return reg;
}

// Call once per handled HTTP request
inline void record_request(const std::string& method,
                           const std::string& path,
                           int status,
                           double duration_sec) {
    auto& reg = get_metrics();

    // Counter: request count (split by the method, path, and status labels)
    reg.counter("http_requests_total", {
        {"method", method},
        {"path", path},
        {"status", std::to_string(status)}
    }).inc();

    // Histogram: latency (per path)
    reg.histogram("http_request_duration_seconds",
                  {0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0},
                  {{"path", path}}).observe(duration_sec);

    // Gauge: active connections go +1 on connect and -1 on close (handled separately)
}
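
record_request expects the caller to supply duration_sec. One way to measure it, sketched here as an assumption rather than as part of the article's code, is an RAII timer that reports the elapsed time when it goes out of scope. ScopedTimer is a hypothetical class, and the callback should not throw because it runs inside a destructor.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <utility>

// Measures the lifetime of the object and reports it, in seconds, on destruction.
class ScopedTimer {
public:
    explicit ScopedTimer(std::function<void(double)> on_done)
        : on_done_(std::move(on_done)),
          start_(std::chrono::steady_clock::now()) {
        assert(on_done_ != nullptr);  // a callback is required
    }
    ~ScopedTimer() {
        // steady_clock is monotonic, so wall-clock adjustments do not skew the result.
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        on_done_(elapsed.count());
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    std::function<void(double)> on_done_;
    std::chrono::steady_clock::time_point start_;
};
```

In a request handler, the callback could capture the method, path, and a status variable by reference and forward them to record_request together with the measured seconds. The status must be assigned before the timer leaves scope.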

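Example 2: managing the connection-count Gauge

// connection_tracker.hpp
#pragma once

#include "metrics_middleware.hpp"  // get_metrics()
#include <boost/beast.hpp>

namespace beast = boost::beast;

// RAII guard: increments active_connections on construction, decrements on destruction.
class ConnectionTracker {
public:
    ConnectionTracker() {
        get_metrics().gauge("active_connections", {}).inc();
    }
    ~ConnectionTracker() {
        get_metrics().gauge("active_connections", {}).dec();
    }
    // Non-copyable so that one connection is never counted twice.
    ConnectionTracker(const ConnectionTracker&) = delete;
    ConnectionTracker& operator=(const ConnectionTracker&) = delete;
};

// Usage: keep a ConnectionTracker alive for as long as the connection socket is open
inline void handle_connection(beast::tcp_stream& stream) {
    ConnectionTracker tracker;
    // ... handle requests ...
}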

Example 3: memory usage Gauge

// Refresh the memory gauge periodically (separate thread or timer)
#include "metrics_middleware.hpp"  // get_metrics()
#include <fstream>
#include <string>

void update_memory_metric() {
#ifdef __linux__
    // Read VmRSS from /proc/self/status; the value is reported in kB
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.find("VmRSS:") == 0) {
            // Skip the 6-character "VmRSS:" prefix; stoull ignores the leading whitespace
            size_t kb = std::stoull(line.substr(6));
            get_metrics().gauge("process_resident_memory_bytes", {})
                .set(static_cast<double>(kb * 1024));
            break;
        }
    }
#endif
}

Example 4: running the full stack with Docker Compose

# docker-compose.monitoring.yml
version: '3.8'

services:
  cpp-api:
    build: .
    ports:
      - "8080:8080"
    environment:
      - METRICS_PORT=8081
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  prometheus:
    image: prom/prometheus:v2.47.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'

  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus

volumes:
  grafana-data:

Example 5: Prometheus configuration (scrape_configs)

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'cpp-api'
    static_configs:
      - targets: ['cpp-api:8080']
    metrics_path: /metrics
    scrape_interval: 10s
    scrape_timeout: 5s

5. Alerting

Prometheus alert rules

# prometheus-alerts.yml
groups:
  - name: cpp-api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"
          description: "The 5xx error rate has stayed above 5% for 5 minutes."

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 1 second"

      - alert: NoData
        expr: up{job="cpp-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "C++ API server is down"
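
The for: field in each rule means the condition must hold continuously for that long before the alert fires; until then the alert is pending. The following is a simplified model of that behavior for illustration, not Prometheus's implementation. ForClause and AlertState are hypothetical names, and time is passed in explicitly as seconds.

```cpp
#include <cassert>
#include <chrono>

enum class AlertState { Inactive, Pending, Firing };

// Tracks how long a condition has been continuously true.
class ForClause {
public:
    explicit ForClause(std::chrono::seconds hold) : hold_(hold) {
        assert(hold.count() >= 0);
    }

    // Call once per rule evaluation with the current time and the condition result.
    AlertState evaluate(std::chrono::seconds now, bool condition) {
        if (!condition) {
            active_ = false;  // any false evaluation resets the clock
            return AlertState::Inactive;
        }
        if (!active_) {
            active_ = true;
            since_ = now;  // remember when the condition first became true
        }
        return (now - since_ >= hold_) ? AlertState::Firing : AlertState::Pending;
    }

private:
    std::chrono::seconds hold_;
    std::chrono::seconds since_{0};
    bool active_ = false;
};
```

With for: 5m, a spike that clears after two minutes never leaves the pending state, which is the point of the setting: it trades detection delay for fewer false alarms.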

Grafana notification channel (Slack example)

Grafana → Alerting → Contact points → New contact point
Type: Slack
Webhook URL: https://hooks.slack.com/services/xxx/yyy/zzz
Channel: #alerts

Alert evaluation flow

flowchart LR
    P[Prometheus] -->|evaluate rules| A{Condition met?}
    A -->|Yes| B[Alertmanager]
    B --> C[Slack/email/PagerDuty]
    A -->|No| D[OK]

6. Common errors and fixes

Problem 1: "parse error" (Prometheus cannot parse the metrics)

Symptom: the Prometheus log shows parse error or invalid metric format

Cause: the output violates the Prometheus exposition format

Fix:

// ❌ Wrong: the label value is written without escaping
oss << "http_requests_total{path=\"" << path << "\"} ";  // breaks if path contains ", \ or a newline

// ✅ Right: escape special characters in label values
std::string escape_label_value(const std::string& v) {
    std::string result;
    for (char c : v) {
        if (c == '\\') result += "\\\\";
        else if (c == '"') result += "\\\"";
        else if (c == '\n') result += "\\n";
        else result += c;
    }
    return result;
}

Also: metric names must match the pattern [a-zA-Z_:][a-zA-Z0-9_:]*. A hyphen (-) is not allowed.
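
The name pattern [a-zA-Z_:][a-zA-Z0-9_:]* is easy to check in code before a metric is registered. The sketch below shows one way to validate a name and, as an assumed policy, to repair an invalid one by substituting underscores. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if name matches [a-zA-Z_:][a-zA-Z0-9_:]*
bool is_valid_metric_name(const std::string& name) {
    if (name.empty()) return false;
    for (std::size_t i = 0; i < name.size(); ++i) {
        const char c = name[i];
        const bool letter_like = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                                 c == '_' || c == ':';
        const bool digit = (c >= '0' && c <= '9');
        // Digits are allowed everywhere except the first position.
        if (!(letter_like || (digit && i > 0))) return false;
    }
    return true;
}

// Assumed repair policy: replace invalid characters with '_' and
// prefix '_' when the name is empty or starts with a digit.
std::string sanitize_metric_name(const std::string& name) {
    std::string out;
    for (char c : name) {
        const bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                        (c >= '0' && c <= '9') || c == '_' || c == ':';
        out += ok ? c : '_';
    }
    if (out.empty() || (out[0] >= '0' && out[0] <= '9')) out.insert(0, 1, '_');
    assert(is_valid_metric_name(out));  // postcondition of the repair
    return out;
}
```

Silently renaming a metric can hide a typo, so rejecting the name at registration time may be the safer choice in some codebases.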

Problem 2: "No data points" (graphs do not appear in Grafana)

Symptom: the Grafana panel shows "No data"

Causes:

Prometheus cannot reach the scrape target
The PromQL query is wrong
There is no data in the selected time range

Fix:

# 1. Check Prometheus target status
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health, lastError}'

# 2. Fetch the metrics by hand
curl http://cpp-api:8080/metrics

# 3. Test the PromQL directly (Prometheus UI → Graph)
rate(http_requests_total[5m])

Checklist:

The C++ server responds on /metrics
The host name and port under targets in prometheus.yml are correct
The containers can reach each other on the Docker network

Problem 3: "Cardinality explosion" (memory usage spikes)

Symptom: Prometheus memory grows to several GB, then OOM

Cause: the number of label combinations explodes (for example, putting a user ID in path creates one time series per user)

Fix:

// ❌ Bad: unbounded cardinality in path
labels["path"] = req.path;  // /users/12345, /users/67890, ... millions of combinations

// ✅ Good: normalize the path into a template
std::string normalize_path(const std::string& path) {
    // /users/12345 → /users/:id
    if (path.find("/users/") == 0) return "/users/:id";
    if (path.find("/orders/") == 0) return "/orders/:id";
    return path;
}
labels["path"] = normalize_path(req.path);

Recommendation: keep each label to a few dozen or a few hundred distinct values at most
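
Listing every route prefix by hand works for a few endpoints. A more general variant, sketched here as one possible approach, replaces every all-digit path segment with :id. The function normalize_numeric_segments is hypothetical, and the sketch assumes the path starts with / and carries no query string. Identifiers that are not purely numeric, such as UUIDs, would need an extra rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Replaces every all-digit path segment with ":id".
// Assumptions: path starts with '/', and any query string was already removed.
std::string normalize_numeric_segments(const std::string& path) {
    assert(!path.empty() && path[0] == '/');
    std::string out;
    std::size_t pos = 0;  // always points at a '/'
    while (pos < path.size()) {
        std::size_t next = path.find('/', pos + 1);
        if (next == std::string::npos) next = path.size();
        const std::string segment = path.substr(pos + 1, next - pos - 1);
        const bool numeric = !segment.empty() &&
            std::all_of(segment.begin(), segment.end(),
                        [](char c) { return c >= '0' && c <= '9'; });
        out += '/';
        out += numeric ? std::string(":id") : segment;
        pos = next;
    }
    return out;
}
```

For example, /orders/7/items/42 becomes /orders/:id/items/:id, while /health is returned unchanged.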

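Problem 4: misconfigured Histogram buckets

Symptom: histogram_quantile returns implausible values (for example, pinned to a bucket boundary, or NaN when there are no samples)

Cause: the bucket boundaries do not match the range of the observed values

Fix:

// ❌ Bad: buckets far too coarse (every request takes about 1 ms, but the first bucket is 1 s)
Histogram h({1.0, 2.0, 5.0});  // a 10 ms request lands in the first bucket with everything else, so quantiles cannot resolve it

// ✅ Good: match the buckets to the expected latency range
// API server: most requests take 1 ms to 1 s
Histogram h({0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0});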
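
When the expected latency range spans several orders of magnitude, generating the bounds geometrically is less error-prone than typing them by hand. The helper below is a hypothetical sketch, similar in spirit to the exponential bucket helpers that some Prometheus client libraries provide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns `count` upper bounds: start, start*factor, start*factor^2, ...
std::vector<double> exponential_buckets(double start, double factor, int count) {
    assert(start > 0.0 && factor > 1.0 && count > 0);
    std::vector<double> bounds;
    bounds.reserve(static_cast<std::size_t>(count));
    double bound = start;
    for (int i = 0; i < count; ++i) {
        bounds.push_back(bound);
        bound *= factor;  // each bucket is `factor` times wider than the last
    }
    return bounds;
}
```

For example, exponential_buckets(0.001, 2.0, 12) covers roughly 1 ms to 2 s with 12 buckets. Each extra bucket adds one time series per label combination, so the count is worth keeping modest.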

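Problem 5: thread safety (data races)

Symptom: intermittently wrong metric values, crashes

Cause: multiple threads update the same Counter/Gauge without synchronization

Fix:

// ✅ Counter/Gauge: use std::atomic (already done in the implementation above)
// ✅ Histogram: take a mutex inside observe()
// ✅ export_prometheus(): read a consistent snapshot (take the mutex where needed)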

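Problem 6: slow "/metrics" responses

Symptom: /metrics requests take several seconds and Prometheus scrapes time out

Cause: serialization walks every time series and concatenates strings

Fix:

Limit the number of metrics (manage cardinality)
Cache the serialized output periodically (for example, refresh once per second and serve the cached copy)

// Cache sketch (needs <ctime>, <mutex>, <string>)
std::string get_cached_metrics() {
    static std::mutex cache_mutex;
    static std::time_t last_update = 0;
    static std::string cached;

    // Hold the lock for the whole call so no thread reads `cached` while it is being rewritten.
    std::lock_guard<std::mutex> lock(cache_mutex);
    const std::time_t now = std::time(nullptr);
    if (now - last_update >= 1) {
        cached = g_registry.export_prometheus();
        last_update = now;
    }
    return cached;  // copied while the lock is still held
}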

7. Performance tips

1. Minimize collection overhead

// ❌ A lock on every request
void record_request(...) {
    std::lock_guard<std::mutex> lock(global_mutex);  // bottleneck
    counter.inc();
}

// ✅ Atomic operations only (Counter/Gauge)
void record_request(...) {
    counter.inc();  // fetch_add, no lock
}

2. Optimizing Histogram observe

The Histogram takes a lock when it updates its buckets. At tens of thousands of requests per second this can become a bottleneck.

Alternatives:

Use a sliding window or an approximate histogram (for example, t-digest)
Or sample: observe only 1 in 100 requests (the histogram's _count then reflects the sample, not total traffic)

void record_request_sampled(double duration) {
    static std::atomic<uint64_t> counter{0};
    if (counter.fetch_add(1) % 100 == 0) {
        histogram.observe(duration);
    }
}
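
Another option, different from the mutex-based Histogram shown earlier and from the alternatives listed above, is to give each bucket its own atomic counter so that observe() never takes a lock. This is a sketch under that assumption; AtomicHistogram is a hypothetical class. It stores per-bucket (non-cumulative) counts and sums them at export time.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

class AtomicHistogram {
public:
    // bounds must be sorted ascending; one extra slot holds values above the last bound.
    explicit AtomicHistogram(std::vector<double> bounds)
        : bounds_(std::move(bounds)),
          slots_(new std::atomic<std::uint64_t>[bounds_.size() + 1]) {
        assert(std::is_sorted(bounds_.begin(), bounds_.end()));
        for (std::size_t i = 0; i <= bounds_.size(); ++i) slots_[i].store(0);
    }

    void observe(double value) {
        // First bound >= value; index bounds_.size() is the +Inf slot.
        const std::size_t i = static_cast<std::size_t>(
            std::lower_bound(bounds_.begin(), bounds_.end(), value) - bounds_.begin());
        slots_[i].fetch_add(1, std::memory_order_relaxed);
    }

    // Cumulative counts in Prometheus order; the last entry is le="+Inf".
    std::vector<std::uint64_t> cumulative() const {
        std::vector<std::uint64_t> out(bounds_.size() + 1);
        std::uint64_t running = 0;
        for (std::size_t i = 0; i <= bounds_.size(); ++i) {
            running += slots_[i].load(std::memory_order_relaxed);
            out[i] = running;
        }
        return out;
    }

private:
    std::vector<double> bounds_;  // declared first so it is initialized before slots_
    std::unique_ptr<std::atomic<std::uint64_t>[]> slots_;
};
```

The trade-off is that a scrape running concurrently with observe() may see a snapshot that is not perfectly consistent across buckets. For monitoring purposes that small skew is usually tolerable, but it is a design decision rather than a free win.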

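3. Serializing export_prometheus off the request path

If serialization is slow, each /metrics call blocks an HTTP thread. It is better to serialize ahead of time and return the prepared text immediately on request.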
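
One way to structure this, sketched here with hypothetical names, is a small holder that a background task fills with freshly serialized text while the HTTP handler only copies a shared pointer. The background loop itself (thread or timer) is left out of the sketch.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Holds the most recent serialized metrics text.
class MetricsSnapshot {
public:
    MetricsSnapshot() : current_(std::make_shared<std::string>()) {}

    // Called by the background task after it has finished serializing.
    void publish(std::string text) {
        // Build the new snapshot outside the lock; only the pointer swap is guarded.
        std::shared_ptr<const std::string> next =
            std::make_shared<std::string>(std::move(text));
        std::lock_guard<std::mutex> lock(mutex_);
        current_ = std::move(next);
    }

    // Called by the /metrics handler; cost is one pointer copy under the lock.
    std::shared_ptr<const std::string> read() const {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(current_ != nullptr);
        return current_;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<const std::string> current_;
};
```

A reader that obtained a snapshot keeps it alive through the shared pointer even if a newer one is published in the meantime, so the handler can write the response without holding any lock.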

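4. Tuning the Prometheus scrape interval

# prometheus.yml
scrape_configs:
  - job_name: 'cpp-api'
    scrape_interval: 10s   # shorter than the 15s global setting used above → finer-grained data
    scrape_timeout: 5s     # short timeout → give up quickly on failure

With low traffic, raising scrape_interval to 30s reduces the load on Prometheus.

5. Optimizing Grafana queries

A wider range reads more samples per evaluation: rate(...[5m]) touches about five times as many samples as rate(...[1m]), so use the narrowest window that still spans several scrapes
Fewer dashboard panels means fewer queries
Precompute frequently used queries with recording rules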

# prometheus.yml - recording rules
rule_files:
  - "recording_rules.yml"

# recording_rules.yml
groups:
  - name: cpp-api-aggregations
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

8. Production patterns

Pattern 1: separate metrics port

Serve the API on one port (8080) and metrics on another (8081), and expose the metrics endpoint only on the internal network.

# Kubernetes Service
apiVersion: v1
kind: Service
metadata:
  name: cpp-api-metrics
spec:
  selector:
    app: cpp-api
  ports:
    - name: metrics
      port: 8081
      targetPort: 8081
  # ClusterIP only → not reachable from outside the cluster

Pattern 2: labeling multiple instances

To tell pods apart on Kubernetes, use the instance or pod label.

# Prometheus scrape_configs (Kubernetes service discovery)
scrape_configs:
  - job_name: 'cpp-api'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: instance
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

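Pattern 3: SLO/SLA dashboard

Track SLAs such as 99.9% availability or p99 < 100ms on a dedicated dashboard.

PromQL examples:

Availability over 30 days (compare against the SLO to get the remaining error budget): 1 - (sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d])))
p99 latency over 30 days: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[30d])) by (le))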
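
The arithmetic behind an error budget is short enough to write down. The sketch below uses two hypothetical helper functions: one converts an availability SLO into allowed downtime, the other reports how much of the budget is left given request totals such as those the queries above would return.

```cpp
#include <cassert>

// Minutes of allowed unavailability for an availability SLO over a window.
double allowed_downtime_minutes(double slo, double window_days) {
    assert(slo > 0.0 && slo < 1.0 && window_days > 0.0);
    return (1.0 - slo) * window_days * 24.0 * 60.0;
}

// Fraction of the error budget still unspent.
// 1.0 means untouched, 0.0 means exhausted, negative means overspent.
double error_budget_remaining(double slo, double failed, double total) {
    assert(slo > 0.0 && slo < 1.0 && total > 0.0);
    const double allowed_ratio = 1.0 - slo;       // e.g. 0.001 for 99.9%
    const double observed_ratio = failed / total; // measured failure share
    return 1.0 - observed_ratio / allowed_ratio;
}
```

A 99.9% SLO over 30 days allows 0.001 × 30 × 24 × 60 = 43.2 minutes of downtime. If 500 of 1,000,000 requests failed, the observed failure ratio is 0.0005, which is half of the allowed 0.001, so half the budget remains.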

Pattern 4: tiered alerts (escalation)

Warning: p99 latency above 500ms → Slack
Critical: p99 latency above 1s or error rate above 5% → PagerDuty

# alertmanager.yml
route:
  receiver: 'slack-warning'
  group_by: ['alertname']
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack-warning'

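Pattern 5: metric retention policy

# In Prometheus 2.x (the version used in the Docker Compose example), retention is set
# with command-line flags rather than in prometheus.yml.
# Keep 15 days or 50GB, whichever limit is reached first (size the disk accordingly).
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB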

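9. Implementation checklist

Environment

Install Prometheus and configure prometheus.yml
Install Grafana and connect the Prometheus data source
Implement the /metrics endpoint in the C++ server

Metrics

Counter: http_requests_total (method, path, status labels)
Histogram: http_request_duration_seconds (with suitable buckets)
Gauge: active_connections, process_resident_memory_bytes (optional)
Limit label cardinality (normalize path)

Dashboard

RPS panel
Error rate panel
p50/p95/p99 latency panel
Connection count / memory panel (where applicable)

Alerts

HighErrorRate rule
HighLatency rule
NoData (target down) rule
Slack/email integration

Security

Expose /metrics only on the internal network
Change the Grafana admin password
Keep Prometheus --web.enable-admin-api disabled (production)

Performance

Cache serialized metrics (optional)
Sample Histogram observations (under high traffic)
Reduce query load with recording rules

Summary

Item	Description
Architecture	C++ → Prometheus (pull) → Grafana
Metrics	Counter, Gauge, Histogram + labels
Dashboard	RPS, error rate, p99, connection count
Alerts	Error rate, latency, target down
Production	Port separation, cardinality limits, SLO tracking

Core principles:

Prometheus collects metrics with a pull model
Limit label cardinality to prevent memory blow-ups
Use alerts to detect problems early instead of reacting afterwards
Prove SLA/SLO compliance with data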

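Frequently asked questions (FAQ)

Q. When is this used in practice?

A. It is essential for running production systems: service health checks, performance monitoring, incident detection, and SLA tracking. In practice, start from the examples and guidance in this article and adapt them to your service.

Q. What should I read first?

A. Follow the "previous article" link at the bottom of each post to read the series in order. The C++ series index shows the overall flow.

Q. Do I have to use the prometheus-cpp library?

A. No. Simple Counters and Gauges can be implemented directly with std::atomic. Once Histograms and label management get complicated, prometheus-cpp is easier to maintain.

Q. How can I manage Grafana dashboards as code?

A. Grafana can load JSON dashboards through its file-based provisioning or its HTTP API. A dashboards-as-code tool such as Grafonnet is another option.

Summary in one line: integrating Grafana and Prometheus gives a C++ server a real-time monitoring dashboard, so incident detection and SLA tracking can be driven by data.

Next: [C++ Practical Guide #51-1] Mastering Profiling Tools

Previous: [C++ Practical Guide #50-5] Production Deployment Automation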
