Production Governance Playbook


Agent demos are usually judged by their final answer. Production asks different questions: did the latest change actually improve the system? Can a failure be traced to a specific layer? Is cost under control? Do high-risk actions require approval? Can product and operations teams take over and run the service?

Every intelligent capability you add needs matching evidence that it works.

Four Evidence Types

Evidence	Question it answers	Agently entry point
Output contract	Can business systems consume the results?	Output Control, Model Response
Runtime trace	What happened across requests, Actions, workflows, and environments?	Event Center, DevTools
State evidence	Are artifacts, decisions, checkpoints, and context recoverable?	Workspace, Long-Running State
Scenario regression	Do prompt, model, tool, or flow changes actually improve behavior?	DevTools EvaluationBridge, app tests, model judge

Together they move the system from "this answer looks good" to "this capability can be changed safely".

Eval: Fix Representative Cases First

A minimal eval does not need to cover every online input. Start with a fixed set of cases that represent business risk:

Case type	What to check
Normal input	Complete fields, correct flow, consumable output
Boundary input	Missing fields, vague intent, long text, duplicate submissions
High-risk input	Refusal, approval, human handoff, least privilege
Regression input	Past production failures, customer feedback, version change points

Match the judge to the risk:

Fields, enums, required values: structured output plus deterministic assertions.
Business rules: explicit checks such as amount thresholds, reversibility, and approval requirements.
Semantic quality: a second Agently model acting as judge and returning a structured decision.
Human acceptance: keep the input, output, judge result, and human conclusion together as one record.
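The case types and judges above can be sketched as a tiny harness. This is a minimal sketch that assumes an agent is a plain Python callable returning a dict. `EvalCase`, `require_fields`, `require_enum`, `approval_above`, `run_cases`, and `fake_agent` are hypothetical names, not part of Agently or DevTools EvaluationBridge. It covers only the deterministic and business-rule judges. A model judge would be one more check that calls a second model.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# A check returns an error message, or None when the output passes.
Check = Callable[[dict[str, Any]], Optional[str]]


@dataclass
class EvalCase:
    name: str
    case_type: str  # "normal", "boundary", "high_risk", or "regression"
    input: dict[str, Any]
    checks: list[Check] = field(default_factory=list)


def require_fields(*names: str) -> Check:
    """Deterministic assertion: each listed field is present and non-empty."""
    def check(output: dict[str, Any]) -> Optional[str]:
        missing = [n for n in names if output.get(n) in (None, "")]
        return f"missing fields: {missing}" if missing else None
    return check


def require_enum(name: str, allowed: set[str]) -> Check:
    """Deterministic assertion: the field holds one of the allowed values."""
    def check(output: dict[str, Any]) -> Optional[str]:
        value = output.get(name)
        return None if value in allowed else f"{name}={value!r} not in {sorted(allowed)}"
    return check


def approval_above(threshold: float) -> Check:
    """Business rule: amounts above the threshold must be flagged for approval."""
    def check(output: dict[str, Any]) -> Optional[str]:
        amount = output.get("amount", 0)
        if amount > threshold and not output.get("needs_approval"):
            return f"amount {amount} > {threshold} without approval"
        return None
    return check


def run_cases(cases: list[EvalCase], agent: Callable[[dict[str, Any]], dict[str, Any]]) -> dict[str, list[str]]:
    """Run every case through the agent and collect error messages per case."""
    report: dict[str, list[str]] = {}
    for case in cases:
        output = agent(case.input)
        report[case.name] = [msg for check in case.checks if (msg := check(output)) is not None]
    return report


def fake_agent(inp: dict[str, Any]) -> dict[str, Any]:
    # Stand-in for a real agent call; returns structured output.
    amount = inp["amount"]
    return {"decision": "refund", "amount": amount, "needs_approval": amount > 1000}


cases = [
    EvalCase("normal refund", "normal", {"amount": 50},
             [require_fields("decision", "amount"), require_enum("decision", {"refund", "reject"})]),
    EvalCase("large refund", "high_risk", {"amount": 5000}, [approval_above(1000)]),
]
print(run_cases(cases, fake_agent))
# {'normal refund': [], 'large refund': []}
```

Each check is a function that returns an error message or None. That lets the same fixed case list run before every release. The per-case report can then be stored alongside the human conclusion.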

Trace: Locate the Failing Layer

Every production issue should be traceable to at least one of these layers:

```text
Gateway
  task_id / session_id / trace_id / tenant

Request
  model request / output validation / retry

Action
  selected action / args / result / error / approval

ExecutionEnvironment
  environment declared / approval / ready / failed / released

TriggerFlow
  execution_id / chunk / pause / resume / close snapshot

Workspace
  artifact refs / decisions / checkpoints / ContextPack
```

Event Center receives framework-level RuntimeEvent records. Production logs, metrics, and audit hooks should correlate events through run metadata, such as trace_id or execution_id, rather than by parsing text messages.
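Correlating by metadata can be illustrated in a few lines. The `Event` shape below is an assumption for illustration and is not Agently's RuntimeEvent schema. The point is that grouping and diagnosis read structured fields such as `trace_id` and `kind`. The `message` field is kept for humans and never parsed.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Event:
    """Hypothetical event shape; not Agently's RuntimeEvent schema."""

    layer: str                                          # e.g. "request", "action", "trigger_flow"
    kind: str                                           # e.g. "started", "failed"
    meta: dict[str, Any] = field(default_factory=dict)  # run metadata such as trace_id
    message: str = ""                                   # for humans only, never parsed


def group_by_trace(events: list[Event]) -> dict[str, list[Event]]:
    """Correlate events through meta["trace_id"], not through message text."""
    timeline: dict[str, list[Event]] = defaultdict(list)
    for event in events:
        trace_id = event.meta.get("trace_id")
        if trace_id is not None:
            timeline[trace_id].append(event)
    return dict(timeline)


def first_failing_layer(events: list[Event]) -> Optional[str]:
    """Return the layer of the first failed event in a trace, or None."""
    for event in events:
        if event.kind == "failed":
            return event.layer
    return None


events = [
    Event("request", "started", {"trace_id": "t1", "tenant": "acme"}),
    Event("action", "failed", {"trace_id": "t1", "action": "refund"}, "timeout after 30s"),
    Event("request", "started", {"trace_id": "t2", "tenant": "acme"}),
]
traces = group_by_trace(events)
print({trace_id: first_failing_layer(evs) for trace_id, evs in traces.items()})
# {'t1': 'action', 't2': None}
```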

Do Not Mix Runtime Stream and Observation Events

Need	Use
Frontend shows "reviewing risk item 3"	TriggerFlow runtime stream
Diagnose whether a model request or Action actually ran	Event Center / RuntimeEvent
Local full-run inspection	DevTools ObservationBridge
Release candidate scenario runs	DevTools EvaluationBridge or app tests

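Items on the UI stream should be stable business events, such as report_section_ready, approval_required, or risk_item_ready, rather than raw framework events. The frontend contract then stays stable when runtime internals change.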
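One way to keep the UI stream stable is an explicit projection from internal signals to business events. The internal event names, `UI_EVENT_MAP`, and `to_ui_event` are hypothetical. Anything without a mapping is dropped from the UI stream and left to Event Center for diagnosis.

```python
from typing import Any, Optional

# Hypothetical mapping from internal workflow signals to stable UI event names.
# Internal names may change between releases; the UI contract should not.
UI_EVENT_MAP = {
    "risk_review.item_done": "risk_item_ready",
    "report.section_done": "report_section_ready",
    "action.approval_pending": "approval_required",
}

# Only these business fields ever reach the frontend.
UI_FIELDS = ("item_index", "section", "action")


def to_ui_event(internal: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Project an internal signal onto the UI stream, or drop it (return None)."""
    ui_type = UI_EVENT_MAP.get(internal.get("type", ""))
    if ui_type is None:
        return None  # diagnostics stay in Event Center, not on the UI stream
    payload = {k: internal[k] for k in UI_FIELDS if k in internal}
    return {"type": ui_type, **payload}


internal_stream = [
    {"type": "model.request_started", "trace_id": "t1"},
    {"type": "risk_review.item_done", "item_index": 3, "trace_id": "t1"},
    {"type": "action.approval_pending", "action": "refund", "trace_id": "t1"},
]
ui_stream = [e for e in map(to_ui_event, internal_stream) if e is not None]
print(ui_stream)
# [{'type': 'risk_item_ready', 'item_index': 3}, {'type': 'approval_required', 'action': 'refund'}]
```

The mapping is an explicit table. Renaming an internal step therefore changes one entry rather than every frontend consumer.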

Cost and Reliability Need Owners

Problem	Owner	Design action
High-frequency request cost	Gateway / runtime settings	Model tier, budget, rate limit, fallback
Slow first token or stalled stream	Model requester / result facade	Timeout, stream idle timeout, materialization timeout
Slow or failing Action	Action Runtime / adapter	Timeout, structured errors, retry, fallback
Non-idempotent action runs twice	Business adapter	Idempotency key, external write record, duplicate protection
Long-running workflow interrupted	TriggerFlow / Workspace	Execution save, checkpoint, resume, artifact refs
Unknown quality impact of a prompt or flow change	Eval / DevTools	Fixed cases, baseline, pre-release regression

These controls are not a single global switch. They live in the gateway, the model requester, adapters, workflows, and business systems, and each one needs an owner.
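For the non-idempotent action row, a business adapter can combine an idempotency key, an external write record, and retries. This sketch assumes the external API accepts an idempotency key and deduplicates on it. `IdempotentWriter`, `TransientError`, and `flaky_send` are hypothetical stand-ins. The in-memory record is also a stand-in; a real adapter would persist records durably.

```python
from typing import Any, Callable, Optional


class TransientError(Exception):
    """Hypothetical error type for failures that are safe to retry."""


class IdempotentWriter:
    """Business adapter sketch: retries plus a write record keyed by idempotency key.

    `send` stands in for the real external API call. The record is an in-memory
    dict here; a production adapter would persist it durably.
    """

    def __init__(self, send: Callable[[dict[str, Any]], dict[str, Any]], max_attempts: int = 3) -> None:
        self.send = send
        self.max_attempts = max_attempts
        self.records: dict[str, dict[str, Any]] = {}

    def write(self, payload: dict[str, Any], idempotency_key: str) -> dict[str, Any]:
        # Duplicate protection: a key that already succeeded returns the recorded result.
        if idempotency_key in self.records:
            return self.records[idempotency_key]
        last_error: Optional[Exception] = None
        for _ in range(self.max_attempts):
            try:
                # Retries reuse the same key so the remote side can deduplicate too,
                # e.g. when a write succeeded but its response was lost.
                result = self.send({**payload, "idempotency_key": idempotency_key})
            except TransientError as exc:
                last_error = exc
                continue
            self.records[idempotency_key] = result
            return result
        raise RuntimeError(f"write failed after {self.max_attempts} attempts") from last_error


calls: list[dict[str, Any]] = []


def flaky_send(request: dict[str, Any]) -> dict[str, Any]:
    # Simulated external API: the first call fails with a transient error.
    calls.append(request)
    if len(calls) == 1:
        raise TransientError("connection reset")
    return {"status": "ok", "refund_id": "r-1"}


writer = IdempotentWriter(flaky_send)
key = "refund:o-42"  # derived from business identity, not random per call
first = writer.write({"order": "o-42", "amount": 50}, key)
second = writer.write({"order": "o-42", "amount": 50}, key)  # duplicate submit
print(first == second, len(calls))
# True 2
```

Derive the key from business identity, here the order and the action, rather than generating a fresh random value per call. With a random key, a duplicate submission gets a new key and the protection is lost.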

Safety Starts with Capability Visibility

Layer	Control
Visible capabilities	Which Actions, MCP tools, Skills, and resources the model can see in this run
Execution permission	Whether each action is read-only, write, approval-gated, or fail-closed
Data boundary	Redaction across inputs, tool results, logs, traces, and reports
Audit and recovery	High-risk actions, approvals, resumes, and external writes are all traceable

MCP standardizes how capabilities are exposed; it does not govern them. Enterprise governance is owned by the host, Action Runtime, ExecutionEnvironment, and business adapters.
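The visibility and permission layers can be sketched as a per-run grant plus a deny-by-default check. `Permission`, `Capability`, `REGISTRY`, and `authorize` are hypothetical names that stand in for Actions, MCP tools, or Skills. Here "fail closed" means a call is refused if it is unknown, not granted for this run, or missing a required approval.

```python
from dataclasses import dataclass
from enum import Enum


class Permission(Enum):
    READ_ONLY = "read_only"
    WRITE = "write"
    APPROVAL = "approval"  # write that needs explicit human approval


@dataclass(frozen=True)
class Capability:
    name: str
    permission: Permission


# Hypothetical registry standing in for Actions, MCP tools, or Skills.
REGISTRY = {
    "search_orders": Capability("search_orders", Permission.READ_ONLY),
    "update_address": Capability("update_address", Permission.WRITE),
    "issue_refund": Capability("issue_refund", Permission.APPROVAL),
}


def visible_capabilities(granted: set[str]) -> list[str]:
    """Only capabilities granted for this run are shown to the model."""
    return sorted(name for name in granted if name in REGISTRY)


def authorize(name: str, granted: set[str], approved: bool = False) -> bool:
    """Deny by default: unknown, ungranted, or unapproved high-risk calls fail closed."""
    capability = REGISTRY.get(name)
    if capability is None or name not in granted:
        return False
    if capability.permission is Permission.APPROVAL:
        return approved
    return True


run_grant = {"search_orders", "issue_refund"}
print(visible_capabilities(run_grant))                      # ['issue_refund', 'search_orders']
print(authorize("update_address", run_grant))               # False: not granted for this run
print(authorize("issue_refund", run_grant))                 # False: approval missing
print(authorize("issue_refund", run_grant, approved=True))  # True
```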

Minimal Production Topology

```text
api-gateway
  auth / tenant / route / rate limit / SSE or WebSocket

agent-service
  Agent definition / request contracts / result projection

workflow-worker
  TriggerFlow / Dynamic Task / pause-resume / stream bridge

capability-adapters
  Actions / MCP clients / internal APIs / sandbox environments

state-store
  Session store / Workspace / checkpoint / business DB refs

observer-eval
  RuntimeEvent hooks / DevTools / eval cases / release evidence
```

Early versions can run in a single process, but the code should still be split along these responsibility boundaries so each part can be moved out later.
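A single-process version can still keep these boundaries as separate components with narrow interfaces. The classes below are hypothetical and only mirror the topology names. The agent call is a placeholder rather than a real model request. The in-memory store stands in for a session store or Workspace backend that would replace it later.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Protocol


class StateStore(Protocol):
    """state-store boundary: any object with save/load fits."""

    def save(self, key: str, value: dict[str, Any]) -> None: ...

    def load(self, key: str) -> Optional[dict[str, Any]]: ...


@dataclass
class InMemoryStateStore:
    data: dict[str, dict[str, Any]] = field(default_factory=dict)

    def save(self, key: str, value: dict[str, Any]) -> None:
        self.data[key] = value

    def load(self, key: str) -> Optional[dict[str, Any]]:
        return self.data.get(key)


class AgentService:
    """agent-service boundary: request contract in, projected result out."""

    def handle(self, tenant: str, task: dict[str, Any]) -> dict[str, Any]:
        # Placeholder for a real model request plus output validation.
        return {"tenant": tenant, "summary": f"processed {task['kind']}"}


@dataclass
class WorkflowWorker:
    """workflow-worker boundary: runs the task and records state."""

    agent: AgentService
    store: StateStore

    def run(self, task_id: str, tenant: str, task: dict[str, Any]) -> dict[str, Any]:
        result = self.agent.handle(tenant, task)
        self.store.save(task_id, {"status": "done", "result": result})
        return result


class Gateway:
    """api-gateway boundary: auth and tenant resolution happen only here."""

    def __init__(self, worker: WorkflowWorker, api_keys: dict[str, str]) -> None:
        self.worker = worker
        self.api_keys = api_keys

    def submit(self, api_key: str, task_id: str, task: dict[str, Any]) -> dict[str, Any]:
        tenant = self.api_keys.get(api_key)
        if tenant is None:
            raise PermissionError("unknown API key")
        return self.worker.run(task_id, tenant, task)


store = InMemoryStateStore()
gateway = Gateway(WorkflowWorker(AgentService(), store), api_keys={"k-1": "acme"})
gateway.submit("k-1", "task-1", {"kind": "refund_review"})
print(store.load("task-1"))
# {'status': 'done', 'result': {'tenant': 'acme', 'summary': 'processed refund_review'}}
```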

Pre-Launch Check

Check	Passing standard
Output	Business writes come from structured data or snapshot projection
Streaming	The UI stream carries stable business events; final state is persisted separately
Tools	Actions and MCP tools declare schema, permissions, timeout, and error semantics
Environment	Each sandbox, browser, DB, and MCP server lifecycle has an owner
State	Large artifacts go to Workspace; execution state keeps only compact data and refs
Recovery	Pause/resume, checkpoint, and live-resource reinjection behavior are defined
Observability	Key layers emit RuntimeEvent, a trace id, or business logs
Evaluation	Representative cases run repeatably against defined acceptance criteria
Safety	High-risk actions have approval, audit, and fail-closed paths
Updates	Release notes, examples, website docs, and Agently-Skills guidance are updated together
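For the State check, one common pattern is to move large values into an artifact store and keep only references in execution state. `ArtifactStore`, `compact_state`, and the 1 KB threshold are assumptions for illustration, not Workspace's actual interface. Refs are content hashes, so identical artifacts are stored once.

```python
import hashlib
import json
from typing import Any


class ArtifactStore:
    """Hypothetical stand-in for Workspace artifact storage (in memory here)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        # Content-addressed ref: identical artifacts share one entry.
        ref = "sha256:" + hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(ref, content)
        return ref

    def get(self, ref: str) -> bytes:
        return self._blobs[ref]


MAX_INLINE_BYTES = 1024  # assumed threshold; tune for your state backend


def compact_state(state: dict[str, Any], store: ArtifactStore) -> dict[str, Any]:
    """Replace large string values with artifact refs before saving execution state."""
    compact: dict[str, Any] = {}
    for key, value in state.items():
        if isinstance(value, str) and len(value.encode()) > MAX_INLINE_BYTES:
            compact[key] = {"artifact_ref": store.put(value.encode())}
        else:
            compact[key] = value
    return compact


store = ArtifactStore()
state = {"step": "draft_report", "report": "x" * 50_000}
checkpoint = compact_state(state, store)
print(len(json.dumps(checkpoint)) < 200)  # True: the checkpoint stays small
restored = store.get(checkpoint["report"]["artifact_ref"]).decode()
print(restored == state["report"])        # True: the full report is recoverable from the ref
```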
