Cadence Complete Guide — Uber’s Workflow Engine


Key takeaways

Cadence is Uber’s open-source durable workflow execution engine: business processes spanning microservices and legacy systems are expressed in code. This guide covers domains and task lists, workflows and activities, signals and queries, retry policies, how it relates to Temporal, and a hands-on order-processing example.

What this article covers

Cadence is Uber’s open-source distributed workflow orchestration platform. To reproduce the same business process after failures, deploys, or process restarts, you need a model that durably records execution state outside the application process and lets workers deterministically re-run workflow code following that record. Cadence provides that via a service-style execution engine and an event history.

This guide covers placement into domains and task lists, separation of workflow vs. activity responsibilities, signals and queries for asynchronous integration with the outside world, retry policies tuned to external systems, a comparison with widely adopted Temporal, and a practical order-processing workflow. Cadence’s Go and Java SDKs are mature, so examples skew Go-style, but the ideas apply across languages.


1. The problem Cadence solves

Implementing flows like “create order → pay → reserve stock → request shipping” directly in a distributed environment scatters state storage, retries, timeouts, idempotency, and manual recovery scripts across the codebase. Cadence folds those cross-cutting concerns into a workflow execution service and a recorded event history. If a worker process dies, the server keeps the history and a new worker replays the same workflow definition to restore progress.

From an operations standpoint, unlike batch scripts, you can look up, signal, or cancel in-flight procedures by identifier (workflow ID). Support, finance, and field ops can treat “which order is stuck at which step” like a product surface.


2. Architecture overview

A Cadence deployment is usually understood as the following components:

  • Frontend Service: The RPC entry point for clients and workers (TChannel/Thrift historically, with gRPC support in newer releases). Accepts workflow starts, signals, and queries.
  • History Service: The core that stores and replays per-workflow event histories. This is what durability rests on.
  • Matching Service: Queues tasks for task lists and delivers them to workers.
  • Worker: A process you operate that polls for workflow and activity tasks. This is the unit of horizontal scaling.

Data flow is a loop: client starts an execution → events append to history → the matching service enqueues tasks on a task list → workers process → results are written back to history. As long as this loop holds, a started procedure is tracked to completion under the defined policies.


3. Core concepts

3.1 Durable execution and determinism

When a Cadence workflow function replays the same event history, it must always reproduce the same branches and the same activity schedule order. This is workflow determinism. It is therefore unsafe to put the following directly in workflow code:

  • Direct calls to the network, databases, or message queues
  • Non-fixed-seed randomness or arbitrary reads of “now” (language-specific safe APIs exist)
  • Branches that depend on mutable global state shared outside the workflow

Side effects belong in activities; the workflow handles orchestration only—scheduling, branching, timers, waiting on signals.
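As a sketch of the replay-safe alternatives in the Go SDK (names per go.uber.org/cadence/workflow; the token format here is illustrative):

```go
// Inside workflow code, replace non-deterministic calls with replay-safe APIs.
func safeSnippets(ctx workflow.Context) error {
    // Instead of time.Now(): a current time recorded in the event history.
    now := workflow.Now(ctx)

    // Instead of time.Sleep(): a durable timer tracked by the server.
    if err := workflow.Sleep(ctx, 10*time.Minute); err != nil {
        return err
    }

    // Instead of ad-hoc randomness: SideEffect records the value once;
    // replay returns the recorded value deterministically.
    var token string
    if err := workflow.SideEffect(ctx, func(ctx workflow.Context) interface{} {
        return fmt.Sprintf("tok-%d", now.UnixNano())
    }).Get(&token); err != nil {
        return err
    }
    _ = token // pass the token to an activity, etc.
    return nil
}
```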

3.2 Workflow execution and run ID

A workflow is a registered function-shaped procedure. When a client starts an execution with a workflow ID (often mapped to a business key), the server distinguishes each attempt with a run ID. Restarts, continue-as-new, and similar flows can chain multiple runs under one workflow ID. For ops logs and support tickets, recording both workflow ID and run ID aids tracing.

3.3 Activity

An activity is a unit of work with side effects: HTTP calls, DB transactions, message publishing. Activities can have their own timeouts, heartbeats, and retry policies, aligned with external API SLAs. The same activity may run more than once, so idempotency keys are effectively mandatory.

3.4 Child workflows and continue-as-new

Long runs or high event volume grow history and replay cost. Continue-as-new closes the current execution and continues under the same workflow ID with a new run, trimming history. Use child workflows when a sub-process should have its own lifecycle.
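A minimal continue-as-new sketch in Go (the batch size and argument list are assumptions):

```go
// After a bounded batch of work, close this run and continue under the
// same workflow ID with fresh history.
func BatchWorkflow(ctx workflow.Context, cursor string) error {
    for i := 0; i < 1000; i++ {
        // ... process one item via activities, advancing cursor ...
    }
    // History is now large; hand off to a new run. The loop continues
    // across runs until a termination condition returns nil instead.
    return workflow.NewContinueAsNewError(ctx, BatchWorkflow, cursor)
}
```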


4. Domains and task lists

4.1 Domain

A domain is Cadence’s top-level isolation boundary for workflow executions. Splitting by team, product, or environment (e.g., production vs. staging) makes it easier to separate configuration (history retention, archival) and operational access. Workflow IDs need only be unique within a domain; they do not collide across domains.

When designing domains, consider:

  • Regulation and data sovereignty: If certain users’ data must be isolated, domain-level separation is a candidate.
  • Failure blast radius: Boundaries so overload in one domain does not destabilize other product lines.
  • Deploys and versions: Document worker binaries alongside domain settings so which domain handles which workflow types is explicit.

4.2 Task list

A task list is the logical name of the queue from which workers poll tasks. Workflow tasks and activity tasks are each scheduled to task lists; a worker subscribes to one or more lists.

In practice, teams often split as follows:

  • Service or team: Align with binary boundaries, e.g. orders-worker, payments-worker.
  • Priority and SLA: Separate real-time orders from batch settlement to reduce mutual backlog.
  • Deploy isolation: During canaries, have only some workers poll a new task list for gradual rollout.

A task list name is “just a string,” but in operations it ties directly to monitoring, alerts, and autoscaling, so a team-wide naming convention pays off.


5. Workflows and activities — contract and execution

5.1 Registering workflows

At startup, a worker process registers workflow and activity functions with names. When a client starts an execution under that name, schedule events accumulate in history and workflow tasks are delivered to a task list.
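As a sketch, named registration on the worker side and a client-side start might look like this (the service-client construction and the "orders-domain"/"orders" names are assumptions):

```go
// Worker side: bind the workflow function to the name clients will use.
workflow.RegisterWithOptions(OrderWorkflow, workflow.RegisterOptions{Name: "OrderWorkflow"})

// Client side: start an execution; the business key doubles as workflow ID.
func startOrder(ctx context.Context, service workflowserviceclient.Interface) error {
    c := client.NewClient(service, "orders-domain", nil)
    exec, err := c.StartWorkflow(ctx, client.StartWorkflowOptions{
        ID:                           "order-2024-0001",
        TaskList:                     "orders",
        ExecutionStartToCloseTimeout: 24 * time.Hour,
    }, "OrderWorkflow", "order-2024-0001", int64(129000))
    if err != nil {
        return err
    }
    // Record both IDs for tracing, as recommended above.
    log.Printf("started workflow=%s run=%s", exec.ID, exec.RunID)
    return nil
}
```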

5.2 Activity options

Activity execution typically includes:

  • ScheduleToStartTimeout: Upper bound from enqueue to assignment to a worker. Surfaces worker shortage and backlog.
  • StartToCloseTimeout: Upper bound for the worker to run the activity to completion. Caps business logic duration.
  • HeartbeatTimeout: Checks that long work is still alive. Used together with heartbeats.

In order domains, “wait for payment gateway response” vs. “integrate with warehouse WMS” often need different timeouts, so per-activity-type options are common.

5.3 Versioning and compatibility

Changing workflow source can conflict with replay. While long-running instances still exist, conditional branches (e.g., build ID, domain constants, explicit version flags) let old and new logic coexist; remove branches after old executions drain. For activity input/output schema changes, backward-compatible field additions or new activity names are safer migration paths.
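A version-gate sketch with the Go SDK's workflow.GetVersion (the change ID and activity name are illustrative):

```go
// Old executions replay with DefaultVersion and keep the old branch;
// executions started after the deploy take the new one.
func versionedStep(ctx workflow.Context, orderID string) error {
    v := workflow.GetVersion(ctx, "add-fraud-check", workflow.DefaultVersion, 1)
    if v == workflow.DefaultVersion {
        return nil // started before the change: no fraud check
    }
    // started after the change: run the new activity first
    return workflow.ExecuteActivity(ctx, "FraudCheckActivity", orderID).Get(ctx, nil)
}
```

Once no execution on the old path remains, the branch can be removed.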


6. Signals and queries

6.1 Signal

A signal is an external event delivered asynchronously to a running workflow instance. Examples: “payment webhook received,” “customer cancellation,” “stock reservation complete.” The workflow receives events on signal channels and can wait on multiple sources (timers and signals) with a Selector pattern (Go).

Signal handlers are replayed too, so they must deterministically update only in-workflow state. Explicitly validate ordering against business rules (e.g., whether cancellation must be processed before payment confirmation).

6.2 Query

A query is a read-only path to inspect the workflow’s current state. Support tools and internal dashboards use it to show “what step is this order in?” Query handlers must not cause side effects and should return a consistent snapshot of state the workflow maintains.
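From the client side, the two paths can be sketched like this (signal/query names and the payload type follow the order example later in this guide; `c` is an assumed client.Client):

```go
// Support tooling delivers a signal and reads state via a query.
func notifyAndInspect(ctx context.Context, c client.Client, workflowID string) (string, error) {
    // Asynchronous delivery; empty run ID targets the current run.
    if err := c.SignalWorkflow(ctx, workflowID, "", "payment_result",
        PaymentResultPayload{Success: true, TxID: "tx-123"}); err != nil {
        return "", err
    }
    // Read-only; must not cause side effects in the workflow.
    val, err := c.QueryWorkflow(ctx, workflowID, "", "order_state")
    if err != nil {
        return "", err
    }
    var state string
    return state, val.Get(&state)
}
```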


7. Retry policies

7.1 Activity retries

External APIs fail due to timeouts, transient errors, and rate limits. Cadence attaches a RetryPolicy to activities to tune:

  • InitialInterval: Wait before the first retry
  • BackoffCoefficient: Multiplier for exponential backoff
  • MaximumInterval: Cap on backoff
  • MaximumAttempts: Max attempts (check docs—0 may mean unlimited depending on version)
  • NonRetriableErrorReasons: Classify errors where retry is pointless (e.g., invalid order ID) as immediate failure

Independently of retries, design for duplicate activity executions. For payments and inventory, use request keys or tokens supported by external systems.
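One common pattern, sketched here, is deriving a stable request key from the business key and step name, so every retry of the same logical call presents the same token to the external system:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// idempotencyKey derives a stable token from the order ID and step name.
// Retried activity executions recompute the same key, so the external
// system (PSP, inventory service) can deduplicate the request.
func idempotencyKey(orderID, step string) string {
	sum := sha256.Sum256([]byte(orderID + ":" + step))
	return hex.EncodeToString(sum[:16]) // 128 bits is plenty for a request key
}

func main() {
	fmt.Println(idempotencyKey("order-2024-0001", "capture-payment"))
}
```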

7.2 Workflow tasks and worker-side failure

If a workflow decision cycle takes too long or replay fails due to a bug, it surfaces as a workflow-task-level error. That differs from activity failure, so keep workflow code to light control logic and push heavy work to activities.


8. Comparison with Temporal

Temporal is a separate project that forked and evolved the Cadence codebase. Core ideas—event history, deterministic workflows, activities, signals and queries, task queues—are the same, but product roadmaps, SDKs, and managed services diverged.

| Aspect | Cadence | Temporal |
| --- | --- | --- |
| Origin | Uber open source | Independent project after forking Cadence |
| Isolation unit | Domain | Namespace (similar concept) |
| Task routing | Task list | Task queue |
| SDK maturity | Go- and Java-centric | Broad: Go, Java, TypeScript, Python, PHP, .NET, … |
| Managed service | Self-operated / community | Temporal Cloud, etc. |

When to read about Cadence: Legacy systems already on Cadence, or when alignment with Uber open-source docs and patterns matters. When choosing a new standard, teams usually compare Temporal and Cadence by language SDKs, cloud ops needs, and enterprise support. Many teams read Cadence for concepts and implement with the Temporal SDK.


9. Hands-on: order-processing workflow (Go style)

Below is a simplified flow: create order → reserve inventory → request payment → wait for payment signal → release inventory on failure. Production needs transaction boundaries, fraud checks, audit logs, and PSP-specific APIs.

9.1 Signals, queries, and state

// Conceptual example: adjust package paths and client construction to your project.
const (
    SignalPaymentResult = "payment_result"
    QueryOrderState     = "order_state"
)

type PaymentResultPayload struct {
    Success bool
    TxID    string
}

type OrderState string

const (
    StatePending    OrderState = "pending"
    StateReserved   OrderState = "inventory_reserved"
    StatePaid       OrderState = "paid"
    StateCancelled  OrderState = "cancelled"
    StateFailed     OrderState = "failed"
)

Exposing OrderState from the workflow lets query handlers pass it straight to support UIs.

9.2 Workflow body

import (
    "fmt"
    "time"

    "go.uber.org/cadence"
    "go.uber.org/cadence/workflow"
)

func OrderWorkflow(ctx workflow.Context, orderID string, amountCents int64) error {
    logger := workflow.GetLogger(ctx)
    state := StatePending

    if err := workflow.SetQueryHandler(ctx, QueryOrderState, func() (OrderState, error) {
        return state, nil
    }); err != nil {
        return err
    }

    paymentCh := workflow.GetSignalChannel(ctx, SignalPaymentResult)
    var pay PaymentResultPayload
    timerFired := false

    ao := workflow.ActivityOptions{
        ScheduleToStartTimeout: time.Minute, // Cadence requires this alongside StartToCloseTimeout
        StartToCloseTimeout:    time.Minute,
        RetryPolicy: &cadence.RetryPolicy{
            InitialInterval:    time.Second,
            BackoffCoefficient: 2,
            MaximumInterval:    30 * time.Second,
            MaximumAttempts:    5,
        },
    }
    actx := workflow.WithActivityOptions(ctx, ao)

    if err := workflow.ExecuteActivity(actx, ReserveInventoryActivity, orderID).Get(actx, nil); err != nil {
        logger.Error("reserve failed", "order", orderID, "err", err)
        state = StateFailed
        return err
    }
    state = StateReserved

    if err := workflow.ExecuteActivity(actx, RequestPaymentActivity, orderID, amountCents).Get(actx, nil); err != nil {
        _ = workflow.ExecuteActivity(actx, ReleaseInventoryActivity, orderID).Get(actx, nil)
        state = StateCancelled
        return err
    }

    sel := workflow.NewSelector(ctx)
    sel.AddReceive(paymentCh, func(c workflow.Channel, more bool) {
        c.Receive(ctx, &pay)
    })
    tf := workflow.NewTimer(ctx, 30*time.Minute)
    sel.AddFuture(tf, func(f workflow.Future) {
        timerFired = true
    })
    sel.Select(ctx)

    if timerFired {
        _ = workflow.ExecuteActivity(actx, ReleaseInventoryActivity, orderID).Get(actx, nil)
        state = StateCancelled
        return fmt.Errorf("payment timeout")
    }
    if !pay.Success {
        _ = workflow.ExecuteActivity(actx, ReleaseInventoryActivity, orderID).Get(actx, nil)
        state = StateCancelled
        return fmt.Errorf("payment declined")
    }

    // Finalize the payment (capture, etc.)
    if err := workflow.ExecuteActivity(actx, CapturePaymentActivity, orderID, pay.TxID).Get(actx, nil); err != nil {
        state = StateFailed
        return err
    }
    state = StatePaid
    return nil
}

Using a Selector to wait on both the signal and a timer lets you encode policies like “if payment is not confirmed within 30 minutes, cancel.” On the timeout branch, the release-inventory activity runs as a compensating transaction.

9.3 Activities (conceptual)

func ReserveInventoryActivity(ctx context.Context, orderID string) error {
    // Call the inventory service over HTTP/gRPC; idempotency key: orderID
    return nil
}

func RequestPaymentActivity(ctx context.Context, orderID string, amountCents int64) error {
    // Create a payment session with the PSP; its webhook later signals this workflow by workflow ID
    return nil
}

func ReleaseInventoryActivity(ctx context.Context, orderID string) error {
    return nil
}

func CapturePaymentActivity(ctx context.Context, orderID, txID string) error {
    return nil
}

The webhook handler maps the order ID to a workflow ID and calls SignalWorkflow on the Cadence client. Webhooks need signature verification and handling for duplicate delivery; consider deduplication by event ID so duplicate signals still leave state consistent.
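A minimal sketch of event-ID deduplication in the webhook handler (in-memory for illustration; production would use a shared store with TTLs):

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper remembers webhook event IDs so duplicate deliveries do not
// produce duplicate signals.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]bool)}
}

// FirstDelivery reports whether this event ID is new, marking it as seen.
// The handler signals the workflow only when this returns true.
func (d *Deduper) FirstDelivery(eventID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen[eventID] {
		return false
	}
	d.seen[eventID] = true
	return true
}

func main() {
	d := NewDeduper()
	fmt.Println(d.FirstDelivery("evt-1")) // true: signal the workflow
	fmt.Println(d.FirstDelivery("evt-1")) // false: drop the duplicate
}
```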

9.4 Workers and task lists

// At worker startup: register OrderWorkflow and subscribe to the "orders" task list.
// Binding with workflow.RegisterOptions{Name: "OrderWorkflow"} is the common pattern.
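That wiring can be sketched concretely as follows (the service-client construction, domain name, and logger are assumptions specific to your deployment):

```go
// Register functions and start polling the "orders" task list.
func startWorker(service workflowserviceclient.Interface, logger *zap.Logger) worker.Worker {
    workflow.RegisterWithOptions(OrderWorkflow, workflow.RegisterOptions{Name: "OrderWorkflow"})
    activity.Register(ReserveInventoryActivity)
    activity.Register(RequestPaymentActivity)
    activity.Register(ReleaseInventoryActivity)
    activity.Register(CapturePaymentActivity)

    w := worker.New(service, "orders-domain", "orders", worker.Options{Logger: logger})
    if err := w.Start(); err != nil {
        logger.Fatal("worker failed to start", zap.Error(err))
    }
    return w
}
```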

In operations, giving payment, inventory, and notification activities their own workers and task lists reduces the risk that a failure in one segment blocks all polling.


10. Best practices and pitfalls

  • Keep workflows thin: Modularize long procedures with child workflows; separate per-step timeouts and compensations.
  • Idempotency first: Do not assume an activity runs only once. Fix order, payment, and inventory keys to external contracts.
  • Signal ordering: Test concurrent signals against business rules.
  • Secrets: Do not pass raw card numbers into workflow arguments—only tokens or references.
  • Observability: Structure logs and metrics with workflow ID, run ID, and activity type.

11. Summary

Cadence combines event history with deterministic workflow code so business procedures keep running in distributed environments. Use domains for isolation boundaries, task lists to split workers and load, workflows and activities to separate orchestration from side effects, and signals and queries to connect the outside world and ops tools. Retry policies must be read together with external API behavior and idempotent design. Understand the relationship with Temporal, then pick the stack that fits your team’s SDK and operational constraints. Treat this article’s order example as a starting point and extend it with real PSP, inventory, and shipping integrations and incident runbooks aligned to your organization.

