Production Governance Playbook
Agent demos are usually judged by the final answer alone. Production asks a different set of questions: did the latest change improve the system, can failures be located by layer, is cost controlled, do high-risk actions require approval, and can product and operations teams take ownership of the service?
Every added intelligent capability needs added evidence.
Four Evidence Types
| Evidence | Solves | Agently entrypoint |
|---|---|---|
| Output contract | Whether business systems can consume results | Output Control, Model Response |
| Runtime trace | What happened across request, Action, workflow, environment | Event Center, DevTools |
| State evidence | Whether artifacts, decisions, checkpoints, context are recoverable | Workspace, Long-Running State |
| Scenario regression | Whether prompt/model/tool/flow changes actually improve behavior | DevTools EvaluationBridge, app tests, model judge |
Together they move the system from "this answer looks good" to "this capability can be changed safely".
Eval: Fix Representative Cases First
A minimal eval suite does not need to cover every production input. Start with cases that represent business risk:
| Case type | Check |
|---|---|
| Normal input | Complete fields, correct flow, consumable output |
| Boundary input | Missing fields, vague intent, long text, duplicate submission |
| High-risk input | Refusal, approval, human handoff, least privilege |
| Regression input | Past production failures, customer feedback, version change points |
Use the right judge for the risk:
- Fields, enums, required values: structured output and deterministic assertions.
- Business rules: explicit checks such as amount thresholds, reversibility, approval requirements.
- Semantic quality: a second Agently model judge returning structured decisions.
- Human acceptance: retain input, output, judge result, and human conclusion.
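The first two judge types can be sketched as plain functions that run against a captured structured output. This is a minimal illustration, not Agently's eval API: the `EvalCase` shape, the field names, the allowed decisions, and the `APPROVAL_THRESHOLD` value are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical output contract for a refund decision; names are illustrative.
ALLOWED_DECISIONS = {"approve", "reject", "handoff"}
REQUIRED_FIELDS = ("decision", "amount", "reason")
APPROVAL_THRESHOLD = 500.0  # assumed business rule, not a framework setting


@dataclass
class EvalCase:
    name: str
    case_type: str  # "normal", "boundary", "high_risk", or "regression"
    output: dict    # structured output captured from the run under test


def check_case(case: EvalCase) -> list[str]:
    """Return failure messages; an empty list means the case passed."""
    failures = []
    out = case.output

    # Deterministic assertions: required fields and enum values.
    for name in REQUIRED_FIELDS:
        if name not in out or out[name] in (None, ""):
            failures.append(f"missing field: {name}")
    if out.get("decision") not in ALLOWED_DECISIONS:
        failures.append(f"invalid decision: {out.get('decision')!r}")

    # Explicit business rule: large amounts must be flagged for approval.
    amount = out.get("amount")
    if isinstance(amount, (int, float)) and amount > APPROVAL_THRESHOLD:
        if not out.get("requires_approval"):
            failures.append("amount over threshold without approval")
    return failures
```

Keeping these checks free of model calls makes them cheap to run on every change; the semantic judge and human review then only need to cover what deterministic checks cannot.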
Trace: Locate the Failing Layer
Every production issue should be traceable to at least one of these layers:
- Gateway: task_id / session_id / trace_id / tenant
- Request: model request / output validation / retry
- Action: selected action / args / result / error / approval
- ExecutionEnvironment: environment declared / approval / ready / failed / released
- TriggerFlow: execution_id / chunk / pause / resume / close snapshot
- Workspace: artifact refs / decisions / checkpoints / ContextPack

Event Center receives framework-level RuntimeEvent. Production logs, metrics, and audit hooks should correlate by run metadata instead of parsing text messages.
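Correlating by run metadata can be sketched with plain dictionaries. The record shape below (`layer`, `status`, `trace_id`, `task_id`, `detail`) is an assumption made for illustration and is not the actual RuntimeEvent schema; the point is that the failing layer is found by filtering on structured keys, never by matching message text.

```python
# Layers in the order a request passes through them (taken from the list above).
LAYER_ORDER = [
    "Gateway",
    "Request",
    "Action",
    "ExecutionEnvironment",
    "TriggerFlow",
    "Workspace",
]


def make_event(layer, status, *, trace_id, task_id, detail=None):
    """Build a hypothetical structured event record carrying run metadata."""
    return {
        "layer": layer,
        "status": status,
        "trace_id": trace_id,
        "task_id": task_id,
        "detail": detail or {},
    }


def first_failing_layer(events, trace_id):
    """Return the earliest layer that reported a failure for this trace, or None."""
    failed = {
        e["layer"]
        for e in events
        if e["trace_id"] == trace_id and e["status"] == "failed"
    }
    for layer in LAYER_ORDER:
        if layer in failed:
            return layer
    return None
```

For example, given events from several concurrent runs, `first_failing_layer(events, "t-1")` looks only at records tagged with that trace id and ignores failures belonging to other runs.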
Do Not Mix Runtime Stream and Observation Events
| Need | Use |
|---|---|
| Frontend shows "reviewing risk item 3" | TriggerFlow runtime stream |
| Diagnose whether model request or Action happened | Event Center / RuntimeEvent |
| Local full-run inspection | DevTools ObservationBridge |
| Release candidate scenario runs | DevTools EvaluationBridge or app tests |
UI stream items should be stable business events such as report_section_ready, approval_required, or risk_item_ready.
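One way to keep UI stream items stable is to declare the allowed event names in a single place and reject anything else before it reaches the frontend. The sketch below uses the three event names from the text; the `StreamEvent` enum and `to_stream_item` helper are hypothetical and not part of TriggerFlow.

```python
from enum import Enum


class StreamEvent(str, Enum):
    """Business events the frontend is allowed to depend on."""
    REPORT_SECTION_READY = "report_section_ready"
    APPROVAL_REQUIRED = "approval_required"
    RISK_ITEM_READY = "risk_item_ready"


def to_stream_item(event_type: str, payload: dict) -> dict:
    """Wrap a payload as a UI stream item, rejecting undeclared event names."""
    event = StreamEvent(event_type)  # raises ValueError for unknown names
    return {"type": event.value, "payload": payload}
```

With this in place, an internal diagnostic name that is not declared in the enum fails loudly in the service instead of silently becoming part of the frontend contract.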
Cost and Reliability Need Owners
| Problem | Owner | Design action |
|---|---|---|
| High-frequency request cost | Gateway / runtime settings | Model tier, budget, rate limit, fallback |
| First token or stream stalls | model requester / result facade | Timeout, stream idle, materialization timeout |
| Action slow or failing | Action Runtime / adapter | Timeout, error structure, retry, fallback |
| Non-idempotent action repeats | Business adapter | Idempotency key, external write record, duplicate protection |
| Long workflow interruption | TriggerFlow / Workspace | execution save, checkpoint, resume, artifact refs |
| Prompt or flow change quality unknown | Eval / DevTools | Fixed cases, baseline, pre-release regression |
These controls are not one global switch. They live in gateway, requester, adapters, workflow, and business systems.
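As an illustration of the "Non-idempotent action repeats" row, a business adapter can derive an idempotency key and consult a write record before performing the side effect. This is a sketch under stated assumptions: `IdempotentWriter` is a hypothetical class, the in-memory dict stands in for a durable external write record, and the key is derived from the action name and arguments.

```python
import hashlib
import json


class IdempotentWriter:
    """Wrap a side-effecting call so a repeated request reuses the first result."""

    def __init__(self, write_fn):
        self._write_fn = write_fn
        self._records = {}  # in-memory stand-in for an external write record

    @staticmethod
    def key_for(action: str, args: dict) -> str:
        # Canonical JSON so that argument order does not change the key.
        canonical = json.dumps({"action": action, "args": args}, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def execute(self, action: str, args: dict):
        key = self.key_for(action, args)
        if key in self._records:
            return self._records[key]  # duplicate: skip the external write
        result = self._write_fn(action, args)
        self._records[key] = result
        return result
```

Deriving the key from the arguments also deduplicates legitimately repeated identical operations, so a real adapter would more likely accept a caller-supplied key tied to the business request.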
Safety Starts with Capability Visibility
| Layer | Control |
|---|---|
| Visible capabilities | Which Actions, MCP tools, Skills, resources the model sees in this run |
| Execution permission | Read-only vs write vs approval vs fail-closed actions |
| Data boundary | Redaction of inputs, tool results, logs, traces, reports |
| Audit and recovery | High-risk actions, approvals, resume, external writes are traceable |
MCP standardizes how capabilities are exposed; it does not govern them. Enterprise governance belongs to the host, Action Runtime, ExecutionEnvironment, and business adapters.
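The first two rows of the table, visibility and execution permission, can be sketched as a policy lookup that fails closed. The action names, the `Permission` enum, and both helper functions are hypothetical and only illustrate the idea; they are not Agently or MCP APIs.

```python
from enum import Enum


class Permission(Enum):
    READ_ONLY = "read_only"
    WRITE = "write"
    APPROVAL = "approval"


# Hypothetical policy registry; in practice this lives with the host or adapter.
ACTION_POLICY = {
    "search_orders": Permission.READ_ONLY,
    "update_address": Permission.WRITE,
    "issue_refund": Permission.APPROVAL,
}


def visible_actions(allowed: set) -> list:
    """Names of the actions the model may see in this run."""
    return sorted(name for name, perm in ACTION_POLICY.items() if perm in allowed)


def authorize(action: str, *, approved: bool = False) -> bool:
    """Decide whether an action may execute; unknown actions are denied."""
    policy = ACTION_POLICY.get(action)
    if policy is None:
        return False  # fail closed: no policy entry means no execution
    if policy is Permission.APPROVAL:
        return approved
    return True
```

Filtering visibility first means the model never proposes an action it cannot run, while the `authorize` check still guards execution in case an undeclared action name arrives anyway.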
Minimal Production Topology
- api-gateway: auth / tenant / route / rate limit / SSE or WebSocket
- agent-service: Agent definition / request contracts / result projection
- workflow-worker: TriggerFlow / Dynamic Task / pause-resume / stream bridge
- capability-adapters: Actions / MCP clients / internal APIs / sandbox environments
- state-store: Session store / Workspace / checkpoint / business DB refs
- observer-eval: RuntimeEvent hooks / DevTools / eval cases / release evidence

Early versions can run in one process, but code responsibilities should follow this topology.
Pre-Launch Check
| Check | Passing standard |
|---|---|
| Output | Business writes come from structured data or snapshot projection |
| Streaming | UI stream carries stable business events; final state is persisted separately |
| Tools | Actions/MCP have schema, permission, timeout, error semantics |
| Environment | Sandbox, browser, DB, MCP server lifecycle has an owner |
| State | Large artifacts go to Workspace; execution state keeps compact data/refs |
| Recovery | Pause/resume, checkpoint, and live-resource reinjection behavior is defined |
| Observability | Key layers have RuntimeEvent, trace id, or business logs |
| Evaluation | Representative cases can run repeatedly with acceptance standards |
| Safety | High-risk actions have approval, audit, and fail-closed paths |
| Updates | Release notes, examples, website docs, and Agently-Skills guidance update together |