Kubernetes Complete Guide — Architecture, Scheduling, Services, etcd, Controllers & Production

Key Takeaways

Beyond everyday YAML, this guide explains how Kubernetes actually works: how Pods are scheduled, how Services are implemented on the node, how etcd stores cluster state under Raft, how controllers reconcile desired state, and which production patterns matter in real clusters.

What This Guide Covers

This article is a complete guide that pairs everyday Kubernetes objects with internals: the Pod scheduling pipeline, Service data plane behavior and kube-proxy modes, etcd consensus and data model, controller reconciliation, and production-grade operational patterns. The depth matches the Korean edition (kubernetes-complete-guide.md).


1. Architecture at a Glance

Kubernetes is built from a declarative API and control loops. You record desired state via kubectl apply; controllers continuously drive the live cluster toward that intent.

Control plane (typical):

  • kube-apiserver: The single API front door. After authn/authz/admission, it persists objects to etcd; other components watch/list through it.
  • etcd: Distributed key-value store with Raft replication (see §4).
  • kube-scheduler: Assigns unscheduled Pods to nodes (see §3).
  • kube-controller-manager: Runs many controllers (Deployments, ReplicaSets, Nodes, etc.) (see §5).
  • cloud-controller-manager (when applicable): Integrates with cloud load balancers, routes, and node lifecycle.

Worker nodes:

  • kubelet: Manages Pod lifecycle on the node via the CRI (e.g., containerd).
  • kube-proxy: Programs node-level forwarding rules for Services (see §6).
  • CNI plugin: Pod networking; CSI: storage.

2. Core Objects (Practical Summary)

Pod

The smallest deployable unit: one or more containers that share a network namespace and can share storage volumes. Prefer Deployments (or StatefulSets/DaemonSets) over naked Pods.

Deployment

Declares replicas and rolling update strategy. ReplicaSet controllers create/maintain Pods; Deployment manages ReplicaSets across revisions.
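A minimal Deployment sketch (the name `web` and image `nginx:1.27` are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # one extra Pod allowed during rollout
      maxUnavailable: 0    # never dip below desired replicas
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web           # must match spec.selector
    spec:
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 80
```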

Service

Provides a stable ClusterIP and DNS name for a set of Pods selected by labels. Endpoints are published via Endpoints / EndpointSlice objects.
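A matching Service sketch (assumes Pods labeled `app: web`, as in a typical Deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: ClusterIP
  selector:
    app: web           # matches Pod labels, not the Deployment's own labels
  ports:
    - port: 80         # port on the ClusterIP
      targetPort: 80   # container port the traffic is forwarded to
```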

Ingress

Routes external HTTP(S) traffic to Services by host/path. Requires an Ingress controller (e.g., NGINX Ingress).
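A host/path routing sketch, assuming an NGINX Ingress controller is installed and a Service named `web` exists (`example.com` is a placeholder):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```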

ConfigMap / Secret

Inject configuration and secrets. Plan for encryption at rest, RBAC, rotation, and sometimes external secret stores for production.
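A minimal ConfigMap consumed as environment variables (names are illustrative; Secrets follow the same pattern with `secretRef`):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: info
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "env && sleep 3600"]
      envFrom:
        - configMapRef:
            name: app-config   # each data key becomes an env var
```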


3. Pod Scheduling Algorithm

Scheduling is performed by kube-scheduler (or a custom scheduler). The scheduling framework is easiest to reason about as filter → score → bind.

3.1 Filtering (Predicates)

Build the set of feasible nodes. If none qualify, the Pod stays Pending with a message in Events.

Common filters include:

  • Resources: Do requests fit allocatable CPU/memory on the node?
  • Host ports: Conflicts with other Pods using the same host port
  • Selectors / affinity: nodeSelector, nodeAffinity, podAffinity / podAntiAffinity
  • Taints and tolerations: Whether the Pod may land on tainted nodes
  • Volume topology: PVC binding to zones/regions, CSI constraints
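The filters above evaluate fields in the Pod spec like these (label, taint, and resource values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: filtered-pod
spec:
  nodeSelector:
    disktype: ssd            # node must carry this label
  tolerations:
    - key: dedicated         # permits nodes tainted dedicated=gpu:NoSchedule
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: app
      image: nginx:1.27
      resources:
        requests:
          cpu: 250m          # must fit within node allocatable
          memory: 256Mi
```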

3.2 Scoring (Priorities)

Rank feasible nodes with weighted scores. Highest score wins (with tie-breaking rules). Examples:

  • Balanced allocation across nodes
  • Affinity weights for soft preferences
  • Topology spread vs. locality trade-offs
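Soft preferences that feed scoring can be expressed in a Pod spec fragment like this (zone value and weight are illustrative; `ScheduleAnyway` makes the spread constraint a scoring input rather than a filter):

```yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80                  # higher weight → larger score contribution
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a"]
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway   # soft: scored, not filtered
      labelSelector:
        matchLabels:
          app: web
```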

3.3 Binding

The scheduler issues a Bind to set spec.nodeName. The kubelet then pulls images and starts containers.
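Setting spec.nodeName directly in a manifest bypasses the scheduler entirely: the kubelet on that node admits the Pod with no filtering or scoring (the node name below is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned
spec:
  nodeName: worker-1   # pre-bound; kube-scheduler never sees this Pod
  containers:
    - name: app
      image: nginx:1.27
```

This is occasionally useful for debugging, but it skips resource checks and taints, so prefer affinity and tolerations for normal workloads.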

3.4 Operations Notes

  • For Pending Pods, read kubectl describe pod: “0/X nodes are available: …” often points to resources, taints, affinity, or volumes.
  • Omitting requests makes scheduling and capacity planning unreliable—set requests/limits and probes for production services.
  • Use Scheduling Profiles and plugins when you need GPU, local SSD, multi-tenancy tiers, or custom scoring.

4. etcd Consensus and Data Model

4.1 Raft

etcd uses Raft for leader election and replicated log commits. Clusters usually run an odd member count (3, 5, …) to preserve quorum under failures: an n-member cluster needs ⌊n/2⌋+1 members to commit writes, so 3 members tolerate 1 failure and 5 tolerate 2. If quorum is lost, writes can stop even if some reads still work—this is why control-plane HA and member placement matter.

4.2 Role in Kubernetes

  • The API server persists Kubernetes API objects in etcd as the source of truth.
  • Watch streams enable controllers and the scheduler to react to changes—foundation of the reconcile pattern.

4.3 Keys and Objects

Objects are stored as values under hierarchical keys; think in terms of API group, resource type, namespace, name. In typical installations a Deployment named web in the default namespace lives under a key like /registry/deployments/default/web. The exact prefix layout can vary by version, but mentally model it as a structured tree of persisted API objects.

4.4 Operations and Security

  • Backup/restore drills are mandatory for disaster recovery—without etcd snapshots, rebuilding the control plane may lose the desired state you cannot reconstruct from nodes alone.
  • Encryption at rest, strict TLS, and network isolation for etcd become non-negotiable at scale.
  • Very large numbers of objects or high churn can stress watch traffic and API latency—mind label cardinality and object counts.

5. Controller Reconciliation Loops

Kubernetes “self-healing” is implemented by controllers reconciling observed state to desired state.

5.1 Typical Pattern

  1. Shared informers maintain a local cache of API objects.
  2. Changes enqueue keys (namespace/name) into a workqueue.
  3. Workers call a reconcile function: create missing child objects, delete unneeded ones, patch fields.
  4. Transient errors retry with backoff.

5.2 Deployment → ReplicaSet → Pod

The Deployment controller manages ReplicaSets per revision; the ReplicaSet controller matches Pod count to replicas. Rolling updates gradually shift traffic by scaling ReplicaSets up/down according to strategy (maxSurge, maxUnavailable).
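Ownership is visible in metadata.ownerReferences. A ReplicaSet created by a Deployment carries a reference like this (the pod-template-hash suffix and UID are illustrative):

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: web-7d4b9c8f6d          # Deployment name + pod-template-hash
  ownerReferences:
    - apiVersion: apps/v1
      kind: Deployment
      name: web
      uid: 1b9d2f3a-0c4e-4f5a-8b6d-7e8f9a0b1c2d   # illustrative UID
      controller: true           # exactly one owner is the managing controller
      blockOwnerDeletion: true
```

When ownership bugs appear, `kubectl get rs -o yaml` and these fields show which controller believes it owns the object.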

5.3 Why It Matters for Debugging

  • “Slow to converge” may indicate controller queue backlog, API latency, or admission webhook timeouts.
  • “Duplicates” or ownership bugs often trace to incorrect labels/selectors or conflicting controllers—inspect owner references and controller logs.

6. Service Networking and kube-proxy Modes

A Service exposes a virtual IP (ClusterIP) that resolves via CoreDNS inside the cluster. EndpointSlices track ready Pod IP:port backends. kube-proxy translates Service VIP traffic to Pod IPs on each node.

6.1 iptables Mode

kube-proxy programs iptables (or nftables-backed chains where applicable). Traffic to the Service VIP is DNAT’d to a chosen backend Pod IP. Simple and ubiquitous; very large rule counts can add traversal cost.

6.2 IPVS Mode

Uses the Linux IPVS dataplane for load balancing. Can scale better in some large environments and offers pluggable schedulers—validate platform/CNI/kernel compatibility first.

6.3 userspace Mode (Legacy)

Early kube-proxy proxied in userspace—not recommended today due to performance limits.

6.4 NodePort and LoadBalancer

  • NodePort exposes a static high port on every node—mind security groups/firewalls.
  • LoadBalancer provisions a cloud L4 LB in many providers; data-path details vary (health checks, source IP preservation, dual-stack).
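A NodePort sketch (assumes Pods labeled `app: web`; the explicit nodePort is optional and otherwise auto-assigned from the range):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-nodeport
spec:
  type: NodePort
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80
      nodePort: 30080   # must fall in the node port range (30000-32767 by default)
```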

6.5 Troubleshooting

  • If traffic fails: verify Service selector ↔ Pod labels, Endpoints/EndpointSlices are non-empty, and readiness probes pass.
  • With NetworkPolicy, remember many clusters default to allow-all until policies exist—once enabled, DNS paths and health checks are frequent gotchas.
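Once any policy selects a Pod, traffic not explicitly allowed is denied, which is why DNS is a frequent casualty. A common companion policy permits DNS egress to kube-system (label assumptions must match your CoreDNS deployment):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
spec:
  podSelector: {}            # applies to every Pod in the namespace
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Note that this policy alone also restricts all other egress for the selected Pods; pair it with additional egress rules for application traffic.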

7. Production Kubernetes Patterns

7.1 Availability and Upgrades

  • Run multiple replicas; define PodDisruptionBudgets so node drains/cluster upgrades do not violate minimum availability.
  • Tune RollingUpdate parameters for speed vs. safety.
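A PodDisruptionBudget sketch for the points above (names are illustrative; voluntary disruptions such as drains honor it, crashes do not):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # alternatively maxUnavailable
  selector:
    matchLabels:
      app: web
```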

7.2 Security and Multi-Tenancy

  • RBAC with least privilege; namespace boundaries
  • Admission policies (OPA/Gatekeeper, Kyverno) to enforce standards
  • NetworkPolicies for east-west segmentation
  • Secrets: external stores + CSI drivers, rotation, minimal blast radius

7.3 Governance

  • ResourceQuota and LimitRange per namespace
  • HPA scales Pods; cluster autoscaler scales nodes—different problems, often used together
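Per-namespace governance from the list above might look like this (namespace and values are illustrative; LimitRange supplies defaults so Pods without explicit requests still count against the quota):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:             # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
```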

7.4 Observability

  • Metrics: kube-state-metrics, cAdvisor, app RED/USE signals
  • Centralized logs and traces
  • Alert on API latency, etcd health, scheduling backlog, OOMKills, CrashLoopBackOff rates

7.5 GitOps

Manage manifests in Git and sync with Argo CD or Flux for auditability and rollbacks—still operate webhooks, drift detection, and secrets carefully.


8. kubectl Quick Reference

kubectl get pods -A -o wide
kubectl describe pod POD -n NAMESPACE
kubectl get endpoints SERVICE -n NAMESPACE
kubectl get networkpolicy -A
kubectl explain pod.spec.affinity

Summary

Kubernetes realizes user intent through etcd-backed declarations, scheduling, kubelet execution, kube-proxy dataplane programming, and controller reconciliation. Production incidents often surface at the intersections—Pending Pods, NetworkPolicies, PDBs, and quotas. Understanding these layers turns noisy symptoms into targeted fixes.