Production Governance Playbook

Agent demos are often judged on the final answer alone. Production asks a different set of questions: did the latest change improve the system, can failures be located by layer, is cost controlled, do high-risk actions require approval, and can product and operations teams take over the service?

Every added intelligent capability needs added evidence.

Four Evidence Types

| Evidence | Solves | Agently entrypoint |
| --- | --- | --- |
| Output contract | Whether business systems can consume results | Output Control, Model Response |
| Runtime trace | What happened across request, Action, workflow, environment | Event Center, DevTools |
| State evidence | Whether artifacts, decisions, checkpoints, context are recoverable | Workspace, Long-Running State |
| Scenario regression | Whether prompt/model/tool/flow changes actually improve behavior | DevTools EvaluationBridge, app tests, model judge |

Together they move the system from "this answer looks good" to "this capability can be changed safely".

Eval: Fix Representative Cases First

Minimal eval does not need to cover every online input. Start with cases that represent business risk:

| Case type | Check |
| --- | --- |
| Normal input | Complete fields, correct flow, consumable output |
| Boundary input | Missing fields, vague intent, long text, duplicate submit |
| High-risk input | Refusal, approval, human handoff, least privilege |
| Regression input | Historical online failures, customer feedback, version change points |
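One way to keep the four case types explicit is to store each case as data and report failures grouped by type. The sketch below is a minimal illustration under stated assumptions: `EvalCase`, `run_cases`, and the `system` callable are hypothetical names, not part of Agently, and `system` stands in for whatever entrypoint the application actually exposes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One fixed scenario. All names here are illustrative, not a framework API."""
    case_id: str
    case_type: str  # "normal", "boundary", "high_risk", or "regression"
    payload: dict
    check: Callable[[dict], bool]  # returns True when the output is acceptable


def run_cases(cases, system):
    """Run every case through `system` and collect failed case ids per case type."""
    failures = {}
    for case in cases:
        output = system(case.payload)
        if not case.check(output):
            failures.setdefault(case.case_type, []).append(case.case_id)
    return failures
```

An empty result means every fixed case passed. A non-empty result shows which risk category regressed, which is usually more actionable than a single pass rate.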

Use the right judge for the risk:

  • Fields, enums, required values: structured output and deterministic assertions.
  • Business rules: explicit checks such as amount thresholds, reversibility, approval requirements.
  • Semantic quality: a second Agently model judge returning structured decisions.
  • Human acceptance: retain input, output, judge result, and human conclusion.
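The first two judge types above can be plain functions with no model call. The following is a rough sketch, assuming a hypothetical refund-decision output; the field names, the allowed decision values, and the 500.0 threshold are invented for illustration and are not Agently defaults.

```python
# Hypothetical output contract for a refund decision (illustrative only).
REQUIRED_FIELDS = ("decision", "amount", "reason")
ALLOWED_DECISIONS = {"approve", "reject", "escalate"}
APPROVAL_THRESHOLD = 500.0  # assumed business rule, not a framework default


def check_output(output: dict) -> list[str]:
    """Return a list of violations; an empty list means the output passes."""
    problems = []
    # Deterministic assertions: required fields and enum values.
    for field in REQUIRED_FIELDS:
        if field not in output:
            problems.append(f"missing field: {field}")
    if "decision" in output and output["decision"] not in ALLOWED_DECISIONS:
        problems.append(f"invalid decision: {output['decision']}")
    # Explicit business rule: large amounts must carry an approval flag.
    amount = output.get("amount")
    if isinstance(amount, (int, float)) and amount > APPROVAL_THRESHOLD:
        if not output.get("requires_approval", False):
            problems.append("amount above threshold without approval")
    return problems
```

Returning a list of violations instead of a boolean keeps the judge result usable as retained evidence for human acceptance.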

Trace: Locate the Failing Layer

A production issue should be locatable to at least one layer:

Gateway
  task_id / session_id / trace_id / tenant

Request
  model request / output validation / retry

Action
  selected action / args / result / error / approval

ExecutionEnvironment
  environment declared / approval / ready / failed / released

TriggerFlow
  execution_id / chunk / pause / resume / close snapshot

Workspace
  artifact refs / decisions / checkpoints / ContextPack

Event Center receives framework-level RuntimeEvent records. Production logs, metrics, and audit hooks should correlate on run metadata (for example trace_id or execution_id) rather than parse text messages.
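Correlating by run metadata can be sketched with the Python standard library alone. The example below assumes application logs are emitted through `logging`; `JsonFormatter` and `run_logger` are hypothetical helpers, and the field names mirror the Gateway layer listed above. It does not use or depict the Event Center API.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object so fields can be queried directly."""

    RUN_FIELDS = ("trace_id", "session_id", "task_id", "tenant", "layer")

    def format(self, record):
        entry = {"level": record.levelname, "message": record.getMessage()}
        # Copy only the run metadata that was actually attached to the record.
        for name in self.RUN_FIELDS:
            value = getattr(record, name, None)
            if value is not None:
                entry[name] = value
        return json.dumps(entry, sort_keys=True)


def run_logger(base: logging.Logger, **run_metadata) -> logging.LoggerAdapter:
    """Bind run metadata once; every call through the adapter carries it."""
    return logging.LoggerAdapter(base, run_metadata)
```

With a handler that uses `JsonFormatter`, a call such as `run_logger(logger, trace_id="t-1", layer="Action").info("action finished")` yields a line that can be filtered on `trace_id` without inspecting the message text.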

Do Not Mix Runtime Stream and Observation Events

| Need | Use |
| --- | --- |
| Frontend shows "reviewing risk item 3" | TriggerFlow runtime stream |
| Diagnose whether model request or Action happened | Event Center / RuntimeEvent |
| Local full-run inspection | DevTools ObservationBridge |
| Release candidate scenario runs | DevTools EvaluationBridge or app tests |

UI stream items should be stable business events such as report_section_ready, approval_required, or risk_item_ready.
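A closed set of event names is one way to keep UI stream items stable. The sketch below is an assumption-laden illustration: `UiEvent` and `StreamItem` are hypothetical types, and Server-Sent Events framing is chosen only as an example transport. It does not depict the TriggerFlow runtime stream API.

```python
import json
from dataclasses import dataclass
from enum import Enum


class UiEvent(str, Enum):
    """Closed set of business events the frontend is allowed to depend on."""
    REPORT_SECTION_READY = "report_section_ready"
    APPROVAL_REQUIRED = "approval_required"
    RISK_ITEM_READY = "risk_item_ready"


@dataclass(frozen=True)
class StreamItem:
    event: UiEvent
    execution_id: str
    data: dict

    def to_sse(self) -> str:
        """Serialize as one Server-Sent Events frame."""
        payload = json.dumps({"execution_id": self.execution_id, **self.data}, sort_keys=True)
        return f"event: {self.event.value}\ndata: {payload}\n\n"
```

Because `UiEvent("some_internal_step")` raises `ValueError`, ad hoc internal event names cannot leak into the UI contract by accident.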

Cost and Reliability Need Owners

| Problem | Owner | Design action |
| --- | --- | --- |
| High-frequency request cost | Gateway / runtime settings | Model tier, budget, rate limit, fallback |
| First token or stream stalls | model requester / result facade | Timeout, stream idle, materialization timeout |
| Action slow or failing | Action Runtime / adapter | Timeout, error structure, retry, fallback |
| Non-idempotent action repeats | Business adapter | Idempotency key, external write record, duplicate protection |
| Long workflow interruption | TriggerFlow / Workspace | execution save, checkpoint, resume, artifact refs |
| Prompt or flow change quality unknown | Eval / DevTools | Fixed cases, baseline, pre-release regression |

These controls are not one global switch. They live in gateway, requester, adapters, workflow, and business systems.
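As an example of a control that lives in a business adapter, the idempotency-key row can be sketched as follows. `IdempotentWriter` is a hypothetical wrapper, the in-memory dictionary stands in for a durable external write record, and the sketch is not safe under concurrent calls.

```python
import hashlib
import json


class IdempotentWriter:
    """Wrap an external write so a repeated call returns the recorded result."""

    def __init__(self, write):
        self._write = write   # the real side-effecting call, supplied by the adapter
        self._records = {}    # in-memory stand-in for a durable write record

    @staticmethod
    def key_for(execution_id: str, action: str, args: dict) -> str:
        """Derive a stable key from the run, the action name, and its arguments."""
        canonical = json.dumps([execution_id, action, args], sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def call(self, execution_id: str, action: str, args: dict):
        key = self.key_for(execution_id, action, args)
        if key in self._records:
            # Duplicate protection: replay the stored result, do not write again.
            return self._records[key]
        result = self._write(action, args)
        self._records[key] = result
        return result
```

A retry or a resumed workflow that repeats the same action with the same arguments then reads the recorded result instead of triggering a second external write.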

Safety Starts with Capability Visibility

| Layer | Control |
| --- | --- |
| Visible capabilities | Which Actions, MCP tools, Skills, resources the model sees in this run |
| Execution permission | Read-only vs write vs approval vs fail-closed actions |
| Data boundary | Redaction of inputs, tool results, logs, traces, reports |
| Audit and recovery | High-risk actions, approvals, resume, external writes are traceable |

MCP standardizes how capabilities are exposed. Enterprise governance remains the responsibility of the host, Action Runtime, ExecutionEnvironment, and business adapters.
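The first two layers, visibility and execution permission, might be sketched as below. `Permission`, `ActionSpec`, `visible_actions`, and `authorize` are hypothetical names for illustration; real enforcement would sit in the host or Action Runtime and is not shown here.

```python
from dataclasses import dataclass
from enum import Enum


class Permission(Enum):
    READ_ONLY = "read_only"
    WRITE = "write"
    APPROVAL = "approval"  # may run only after an explicit approval


@dataclass(frozen=True)
class ActionSpec:
    name: str
    permission: Permission


def visible_actions(registry, granted):
    """Expose to the model only the actions whose permission was granted for this run."""
    return [spec for spec in registry if spec.permission in granted]


def authorize(action_name, registry, granted, approved=False):
    """Fail closed: unknown or ungranted actions are denied at execution time."""
    spec = next((s for s in registry if s.name == action_name), None)
    if spec is None or spec.permission not in granted:
        return False
    if spec.permission is Permission.APPROVAL:
        return approved
    return True
```

Checking permission again at execution time, instead of trusting the visibility filter alone, means an action name the model should never have seen is still refused.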

Minimal Production Topology

api-gateway
  auth / tenant / route / rate limit / SSE or WebSocket

agent-service
  Agent definition / request contracts / result projection

workflow-worker
  TriggerFlow / Dynamic Task / pause-resume / stream bridge

capability-adapters
  Actions / MCP clients / internal APIs / sandbox environments

state-store
  Session store / Workspace / checkpoint / business DB refs

observer-eval
  RuntimeEvent hooks / DevTools / eval cases / release evidence

Early versions can run in one process, but code responsibilities should follow this topology.

Pre-Launch Check

| Check | Passing standard |
| --- | --- |
| Output | Business writes come from structured data or snapshot projection |
| Streaming | UI stream is stable business events; final state is persisted separately |
| Tools | Actions/MCP have schema, permission, timeout, error semantics |
| Environment | Sandbox, browser, DB, MCP server lifecycle has an owner |
| State | Large artifacts go to Workspace; execution state keeps compact data/refs |
| Recovery | pause/resume, checkpoint, live resource reinjection are clear |
| Observability | Key layers have RuntimeEvent, trace id, or business logs |
| Evaluation | Representative cases can run repeatedly with acceptance standards |
| Safety | High-risk actions have approval, audit, and fail-closed paths |
| Updates | Release notes, examples, website docs, and Agently-Skills guidance update together |
