Kubernetes Complete Guide — Architecture, Scheduling, Services, etcd, Controllers & Production

Key Takeaways

Beyond everyday YAML, this guide explains how Kubernetes actually works: how Pods are scheduled, how Services are implemented on the node, how etcd stores cluster state under Raft, how controllers reconcile desired state, and which production patterns matter in real clusters.

What This Guide Covers

This article is a complete guide that pairs everyday Kubernetes objects with internals: the Pod scheduling pipeline, Service data plane behavior and kube-proxy modes, etcd consensus and data model, controller reconciliation, and production-grade operational patterns. The depth matches the Korean edition (kubernetes-complete-guide.md).


1. Architecture at a Glance

Kubernetes is built from a declarative API and control loops. You record desired state via kubectl apply; controllers continuously drive the live cluster toward that intent.

Control plane (typical):

  • kube-apiserver: The single API front door. After authn/authz/admission, it persists objects to etcd; other components watch/list through it.
  • etcd: Distributed key-value store with Raft replication (see §4).
  • kube-scheduler: Assigns unscheduled Pods to nodes (see §3).
  • kube-controller-manager: Runs many controllers (Deployments, ReplicaSets, Nodes, etc.) (see §5).
  • cloud-controller-manager (when applicable): Integrates with cloud load balancers, routes, and node lifecycle.

Worker nodes:

  • kubelet: Manages Pod lifecycle on the node via the CRI (e.g., containerd).
  • kube-proxy: Programs node-level forwarding rules for Services (see §6).
  • CNI plugin: Pod networking; CSI: storage.

2. Core Objects (Practical Summary)

Pod

The smallest deployable unit: one or more containers that share a network namespace and can share storage volumes. Prefer Deployments (or StatefulSets/DaemonSets) over naked Pods.

Deployment

Declares replicas and rolling update strategy. ReplicaSet controllers create/maintain Pods; Deployment manages ReplicaSets across revisions.
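A minimal Deployment sketch (the name `web` and image `nginx:1.27` are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # one extra Pod allowed during rollout
      maxUnavailable: 0    # never dip below desired replicas
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web           # must match spec.selector
    spec:
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 80
```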

Service

Provides a stable ClusterIP and DNS name for a set of Pods selected by labels. Endpoints are published via Endpoints / EndpointSlice objects.
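A matching Service sketch (assumes Pods labeled `app: web`, as in a typical Deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: ClusterIP
  selector:
    app: web           # matches Pod labels, not the Deployment's own labels
  ports:
    - port: 80         # port on the ClusterIP
      targetPort: 80   # container port the traffic is forwarded to
```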

Ingress

Routes external HTTP(S) traffic to Services by host/path. Requires an Ingress controller (e.g., NGINX Ingress).
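A host/path routing sketch, assuming an NGINX Ingress controller is installed and a Service named `web` exists (`example.com` is a placeholder):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```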

ConfigMap / Secret

Inject configuration and secrets. Plan for encryption at rest, RBAC, rotation, and sometimes external secret stores for production.
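A minimal ConfigMap consumed as environment variables (names are illustrative; Secrets follow the same pattern with `secretRef`):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: info
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "env && sleep 3600"]
      envFrom:
        - configMapRef:
            name: app-config   # each data key becomes an env var
```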


3. Pod Scheduling Algorithm

Scheduling is performed by kube-scheduler (or a custom scheduler). The scheduling framework is easiest to reason about as filter → score → bind.

3.1 Filtering (Predicates)

Build the set of feasible nodes. If none qualify, the Pod stays Pending with a message in Events.

Common filters include:

  • Resources: Do requests fit allocatable CPU/memory on the node?
  • Host ports: Conflicts with other Pods using the same host port
  • Selectors / affinity: nodeSelector, nodeAffinity, podAffinity / podAntiAffinity
  • Taints and tolerations: Whether the Pod may land on tainted nodes
  • Volume topology: PVC binding to zones/regions, CSI constraints
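The filters above evaluate fields in the Pod spec like these (label, taint, and resource values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: filtered-pod
spec:
  nodeSelector:
    disktype: ssd            # node must carry this label
  tolerations:
    - key: dedicated         # permits nodes tainted dedicated=gpu:NoSchedule
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: app
      image: nginx:1.27
      resources:
        requests:
          cpu: 250m          # must fit within node allocatable
          memory: 256Mi
```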

3.2 Scoring (Priorities)

Rank feasible nodes with weighted scores. Highest score wins (with tie-breaking rules). Examples:

  • Balanced allocation across nodes
  • Affinity weights for soft preferences
  • Topology spread vs. locality trade-offs
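Soft preferences that feed scoring can be expressed in a Pod spec fragment like this (zone value and weight are illustrative; `ScheduleAnyway` makes the spread constraint a scoring input rather than a filter):

```yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80                  # higher weight → larger score contribution
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a"]
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway   # soft: scored, not filtered
      labelSelector:
        matchLabels:
          app: web
```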

3.3 Binding

The scheduler issues a Bind to set spec.nodeName. The kubelet then pulls images and starts containers.
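Setting spec.nodeName directly in a manifest bypasses the scheduler entirely: the kubelet on that node admits the Pod with no filtering or scoring (the node name below is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned
spec:
  nodeName: worker-1   # pre-bound; kube-scheduler never sees this Pod
  containers:
    - name: app
      image: nginx:1.27
```

This is occasionally useful for debugging, but it skips resource checks and taints, so prefer affinity and tolerations for normal workloads.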

3.4 Operations Notes

  • For Pending Pods, read kubectl describe pod: “0/X nodes are available: …” often points to resources, taints, affinity, or volumes.
  • Omitting requests makes scheduling and capacity planning unreliable—set requests/limits and probes for production services.
  • Use Scheduling Profiles and plugins when you need GPU, local SSD, multi-tenancy tiers, or custom scoring.

4. etcd Consensus and Data Model

4.1 Raft

etcd uses Raft for leader election and replicated log commits. Clusters usually run an odd member count (3, 5, …) to preserve quorum under failures: an n-member cluster needs ⌊n/2⌋+1 members to commit writes, so 3 members tolerate 1 failure and 5 tolerate 2. If quorum is lost, writes can stop even if some reads still work—this is why control-plane HA and member placement matter.

4.2 Role in Kubernetes

  • The API server persists Kubernetes API objects in etcd as the source of truth.
  • Watch streams enable controllers and the scheduler to react to changes—foundation of the reconcile pattern.

4.3 Keys and Objects

Objects are stored as values under hierarchical keys; think in terms of API group, resource type, namespace, name. In typical installations a Deployment named web in the default namespace lives under a key like /registry/deployments/default/web. The exact prefix layout can vary by version, but mentally model it as a structured tree of persisted API objects.

4.4 Operations and Security

  • Backup/restore drills are mandatory for disaster recovery—without etcd snapshots, rebuilding the control plane may lose the desired state you cannot reconstruct from nodes alone.
  • Encryption at rest, strict TLS, and network isolation for etcd become non-negotiable at scale.
  • Very large numbers of objects or high churn can stress watch traffic and API latency—mind label cardinality and object counts.

5. Controller Reconciliation Loops

Kubernetes “self-healing” is implemented by controllers reconciling observed state to desired state.

5.1 Typical Pattern

  1. Shared informers maintain a local cache of API objects.
  2. Changes enqueue keys (namespace/name) into a workqueue.
  3. Workers call a reconcile function: create missing child objects, delete unneeded ones, patch fields.
  4. Transient errors retry with backoff.

5.2 Deployment → ReplicaSet → Pod

The Deployment controller manages ReplicaSets per revision; the ReplicaSet controller matches Pod count to replicas. Rolling updates gradually shift traffic by scaling ReplicaSets up/down according to strategy (maxSurge, maxUnavailable).
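Ownership is visible in metadata.ownerReferences. A ReplicaSet created by a Deployment carries a reference like this (the pod-template-hash suffix and UID are illustrative):

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: web-7d4b9c8f6d          # Deployment name + pod-template-hash
  ownerReferences:
    - apiVersion: apps/v1
      kind: Deployment
      name: web
      uid: 1b9d2f3a-0c4e-4f5a-8b6d-7e8f9a0b1c2d   # illustrative UID
      controller: true           # exactly one owner is the managing controller
      blockOwnerDeletion: true
```

When ownership bugs appear, `kubectl get rs -o yaml` and these fields show which controller believes it owns the object.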

5.3 Why It Matters for Debugging

  • “Slow to converge” may indicate controller queue backlog, API latency, or admission webhook timeouts.
  • “Duplicates” or ownership bugs often trace to incorrect labels/selectors or conflicting controllers—inspect owner references and controller logs.

6. Service Networking and kube-proxy Modes

A Service exposes a virtual IP (ClusterIP) that resolves via CoreDNS inside the cluster. EndpointSlices track ready Pod IP:port backends. kube-proxy translates Service VIP traffic to Pod IPs on each node.

6.1 iptables Mode

kube-proxy programs iptables (or nftables-backed chains where applicable). Traffic to the Service VIP is DNAT’d to a chosen backend Pod IP. Simple and ubiquitous; very large rule counts can add traversal cost.

6.2 IPVS Mode

Uses the Linux IPVS dataplane for load balancing. Can scale better in some large environments and offers pluggable schedulers—validate platform/CNI/kernel compatibility first.

6.3 userspace Mode (Legacy)

Early kube-proxy proxied in userspace—not recommended today due to performance limits.

6.4 NodePort and LoadBalancer

  • NodePort exposes a static high port on every node—mind security groups/firewalls.
  • LoadBalancer provisions a cloud L4 LB in many providers; data-path details vary (health checks, source IP preservation, dual-stack).
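A NodePort sketch (assumes Pods labeled `app: web`; the explicit nodePort is optional and otherwise auto-assigned from the range):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-nodeport
spec:
  type: NodePort
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80
      nodePort: 30080   # must fall in the node port range (30000-32767 by default)
```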

6.5 Troubleshooting

  • If traffic fails: verify Service selector ↔ Pod labels, Endpoints/EndpointSlices are non-empty, and readiness probes pass.
  • With NetworkPolicy, remember many clusters default to allow-all until policies exist—once enabled, DNS paths and health checks are frequent gotchas.
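Once any policy selects a Pod, traffic not explicitly allowed is denied, which is why DNS is a frequent casualty. A common companion policy permits DNS egress to kube-system (label assumptions must match your CoreDNS deployment):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
spec:
  podSelector: {}            # applies to every Pod in the namespace
  policyTypes: ["Egress"]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Note that this policy alone also restricts all other egress for the selected Pods; pair it with additional egress rules for application traffic.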

7. Production Kubernetes Patterns

7.1 Availability and Upgrades

  • Run multiple replicas; define PodDisruptionBudgets so node drains/cluster upgrades do not violate minimum availability.
  • Tune RollingUpdate parameters for speed vs. safety.
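A PodDisruptionBudget sketch for the points above (names are illustrative; voluntary disruptions such as drains honor it, crashes do not):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # alternatively maxUnavailable
  selector:
    matchLabels:
      app: web
```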

7.2 Security and Multi-Tenancy

  • RBAC with least privilege; namespace boundaries
  • Admission policies (OPA/Gatekeeper, Kyverno) to enforce standards
  • NetworkPolicies for east-west segmentation
  • Secrets: external stores + CSI drivers, rotation, minimal blast radius

7.3 Governance

  • ResourceQuota and LimitRange per namespace
  • HPA scales Pods; cluster autoscaler scales nodes—different problems, often used together
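Per-namespace governance from the list above might look like this (namespace and values are illustrative; LimitRange supplies defaults so Pods without explicit requests still count against the quota):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:             # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
```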

7.4 Observability

  • Metrics: kube-state-metrics, cAdvisor, app RED/USE signals
  • Centralized logs and traces
  • Alert on API latency, etcd health, scheduling backlog, OOMKills, CrashLoopBackOff rates

7.5 GitOps

Manage manifests in Git and sync with Argo CD or Flux for auditability and rollbacks—still operate webhooks, drift detection, and secrets carefully.


8. kubectl Quick Reference

kubectl get pods -A -o wide
kubectl describe pod POD -n NAMESPACE
kubectl get endpoints SERVICE -n NAMESPACE
kubectl get networkpolicy -A
kubectl explain pod.spec.affinity

Summary

Kubernetes realizes user intent through etcd-backed declarations, scheduling, kubelet execution, kube-proxy dataplane programming, and controller reconciliation. Production incidents often surface at the intersections—Pending Pods, NetworkPolicies, PDBs, and quotas. Understanding these layers turns noisy symptoms into targeted fixes.