Long-Running Agent State and Recovery Playbook

Long-running agent tasks fail when all state is treated as one undifferentiated "context". Conversation history, workflow phase, report artifacts, approval waits, live clients, and eval evidence each have a different owner, and each belongs in a different place.

Five Owners First

Owner	Owns	Does not own
Gateway / API	`user_id`, `tenant_id`, `session_id`, `task_id`, auth, channel entry	Business execution steps
Session	Current chat history, context window, optional memo	Task artifacts, logs, checkpoints
TriggerFlow execution	Workflow phases, branches, waits, compact state, runtime stream	Large files, live objects, long-term evidence
Workspace	Observations, artifacts, decisions, links, checkpoints, ContextPack source	Automatically deciding what to remember
Observability / Eval	RuntimeEvent, trace, eval cases, replay, quality evidence	Business control flow

What Goes Where

Information	Place	Reason
Current conversation	Session	The next request needs recent chat context
Current phase, status, refs	TriggerFlow state	The close snapshot stays compact and recoverable
Reports, downloads, evidence, long logs	Workspace artifact / record	Large content needs cross-turn lookup and review
Business decisions and evidence links	Workspace link	Later readers can see which evidence supports which decision
Human approval wait	`pause_for(...)` interrupt	The waiting point is saveable, visible, and resumable
Database client, browser, socket, callback	runtime resources	Live objects cannot be serialized and must be reinjected
Representative inputs and acceptance rules	Eval case / app test	Release decisions need repeatable evidence
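The runtime-resources row is worth a small illustration. The sketch below is not the TriggerFlow API: `snapshot`, `restore`, and `FakeDbClient` are hypothetical names used only to show the idea that serialized state carries resource keys, while the live objects are supplied again on the restore side.

```python
import json


class FakeDbClient:
    """Hypothetical stand-in for a live object (DB client, browser, socket)."""

    def __init__(self, dsn: str) -> None:
        self.dsn = dsn


def snapshot(state: dict, resource_keys: list) -> str:
    """Serialize compact state plus the *names* of the live resources it needs."""
    return json.dumps({"state": state, "resource_keys": sorted(resource_keys)})


def restore(blob: str, available: dict) -> tuple:
    """Rebuild state and reinject live resources by key; fail loudly if one is missing."""
    data = json.loads(blob)
    missing = [key for key in data["resource_keys"] if key not in available]
    if missing:
        raise KeyError(f"missing runtime resources: {missing}")
    resources = {key: available[key] for key in data["resource_keys"]}
    return data["state"], resources


# The snapshot contains only JSON data; the client itself is never serialized.
blob = snapshot({"phase": "drafting", "report_artifact_ref": "artifact:1"}, ["db"])

# On the restore side, the host process provides a fresh client under the same key.
state, resources = restore(blob, {"db": FakeDbClient("postgres://example")})
```

Failing at restore time when a key is missing is a deliberate choice in this sketch: it surfaces a misconfigured worker before any step tries to use the resource.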

Recommended Shape

```text
Gateway assigns task_id / session_id / trace_id
  -> AgentExecution or TriggerFlow execution starts
  -> compact state stores phase and artifact refs
  -> Workspace stores artifacts, evidence, decisions, checkpoints
  -> runtime stream exposes product-facing progress
  -> RuntimeEvent / DevTools records diagnostic facts
```
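The first step of this shape, assigning identifiers at the gateway, might look like the following. This is a minimal sketch under stated assumptions: `TaskIdentity`, `assign_identity`, and the `task-` / `sess-` prefixes are hypothetical and not part of any library; only the field names come from the tables above.

```python
import uuid
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskIdentity:
    """Identifiers owned by the gateway, fixed for the lifetime of one task."""

    tenant_id: str
    user_id: str
    session_id: str
    task_id: str
    trace_id: str


def assign_identity(
    tenant_id: str, user_id: str, session_id: Optional[str] = None
) -> TaskIdentity:
    """Reuse the caller's session if given; always mint a new task and trace id."""
    return TaskIdentity(
        tenant_id=tenant_id,
        user_id=user_id,
        session_id=session_id or f"sess-{uuid.uuid4().hex}",
        task_id=f"task-{uuid.uuid4().hex}",
        trace_id=uuid.uuid4().hex,
    )


identity = assign_identity("tenant-a", "user-1")
meta = asdict(identity)  # plain dict, suitable for passing along as execution metadata
```

Minting the ids once at the entry point means every later layer (state, Workspace, events) can be correlated without inventing its own keys.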

Minimal Code Patterns

Give the Task an Owner at the Entry

```python
execution = flow.create_execution(
    meta={
        "task_id": task_id,
        "tenant_id": tenant_id,
        "trace_id": trace_id,
    }
)
```

Store Large Artifacts in Workspace

```python
artifact = await workspace.async_save_artifact(
    name="report-draft.md",
    content=report_markdown,
    media_type="text/markdown",
)
await data.async_set_state("report_artifact_ref", artifact.ref)
```

State keeps a stable ref. Workspace keeps the large content.
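To make the split concrete without depending on the real Workspace, here is a self-contained sketch. `InMemoryWorkspace`, `Artifact`, and the `artifact:<hash>` ref format are assumptions made for illustration; the actual Workspace API and ref format may differ.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    ref: str
    name: str
    media_type: str
    size: int


class InMemoryWorkspace:
    """Hypothetical stand-in that stores content and hands back a short ref."""

    def __init__(self) -> None:
        self._blobs = {}

    def save_artifact(self, name: str, content: str, media_type: str) -> Artifact:
        data = content.encode("utf-8")
        # Content-derived ref: saving identical content yields the same ref.
        ref = "artifact:" + hashlib.sha256(data).hexdigest()[:16]
        self._blobs[ref] = data
        return Artifact(ref=ref, name=name, media_type=media_type, size=len(data))

    def load_text(self, ref: str) -> str:
        return self._blobs[ref].decode("utf-8")


workspace = InMemoryWorkspace()
report_markdown = "# Daily report\n" + "lorem ipsum " * 1000
artifact = workspace.save_artifact("report-draft.md", report_markdown, "text/markdown")

# State holds only the phase and the ref, so a snapshot stays small.
state = {"phase": "drafting", "report_artifact_ref": artifact.ref}
```

The report is roughly 12 KB, while the state that points at it is well under 100 characters.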

Write Checkpoints at Stable Boundaries

```python
await workspace.async_record_checkpoint(
    name="source_selection_done",
    summary="Selected 12 source articles for the daily report.",
    refs=[artifact.ref],
)
```
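One way a restore side could use such checkpoints is to derive the next phase from the ones already recorded. The sketch below assumes a hypothetical `<phase>_done` naming convention and a fixed `PHASES` list; neither is prescribed by the Workspace.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical phase order for a daily-report task.
PHASES = ["source_selection", "drafting", "review", "publish"]


@dataclass
class Checkpoint:
    name: str
    summary: str
    refs: List[str] = field(default_factory=list)


def next_phase(checkpoints: List[Checkpoint]) -> Optional[str]:
    """Return the first phase without a '<phase>_done' checkpoint, or None if all are done."""
    done = {checkpoint.name for checkpoint in checkpoints}
    for phase in PHASES:
        if f"{phase}_done" not in done:
            return phase
    return None


checkpoints = [
    Checkpoint(
        name="source_selection_done",
        summary="Selected 12 source articles for the daily report.",
        refs=["artifact:abc"],
    )
]
resume_at = next_phase(checkpoints)  # "drafting"
```

Writing checkpoints only at boundaries where a phase is fully complete is what makes this lookup safe: a half-finished phase never appears as done.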

Recall by Goal

The next run should not dump the whole Workspace into the prompt context. It should ask Recall / ContextPack for only the records that are relevant to the current goal and fit within the context budget.
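The selection idea can be sketched with a toy scorer. This is not how Recall / ContextPack works internally; `recall`, `Record`, the keyword-overlap score, and the character budget are all assumptions chosen to keep the example small.

```python
import re
from dataclasses import dataclass
from typing import List

STOPWORDS = {"a", "the", "for", "from", "to", "of"}


@dataclass(frozen=True)
class Record:
    ref: str
    summary: str


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def recall(records: List[Record], goal: str, budget_chars: int) -> List[Record]:
    """Pick records that share words with the goal, best first, within a size budget."""
    goal_tokens = _tokens(goal)
    scored = []
    for record in records:
        overlap = len(goal_tokens & _tokens(record.summary))
        if overlap:  # records with no overlap are never included
            scored.append((overlap, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    picked, used = [], 0
    for _, record in scored:
        cost = len(record.summary)
        if used + cost <= budget_chars:
            picked.append(record)
            used += cost
    return picked


records = [
    Record("artifact:1", "Selected 12 source articles for the daily report"),
    Record("artifact:2", "Raw crawler log for the nightly job"),
    Record("decision:1", "Decided to drop paywalled source articles"),
]
pack = recall(records, goal="draft the daily report from source articles", budget_chars=60)
```

With a 60-character budget only the best-matching record fits; raising the budget admits the related decision, and the unrelated crawler log is never selected.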

Use Pause/Resume for External Waits

```python
interrupt = await data.async_pause_for(
    type="approval",
    resume_to="after_approval",
    payload={"artifact_ref": artifact.ref},
)
```

The product layer can show the interrupt and later resume with a payload.
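The lifecycle of such an interrupt (created, listed as pending, resumed exactly once) can be illustrated with an in-memory stand-in. `InterruptStore` and its methods are hypothetical and only mirror the argument names of the call above; they are not the TriggerFlow implementation.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interrupt:
    interrupt_id: str
    task_id: str
    kind: str
    resume_to: str
    payload: dict
    status: str = "pending"
    resume_payload: Optional[dict] = None


class InterruptStore:
    """Hypothetical store that makes waiting points visible and resumable."""

    def __init__(self) -> None:
        self._items = {}

    def pause_for(self, task_id: str, kind: str, resume_to: str, payload: dict) -> Interrupt:
        interrupt = Interrupt(uuid.uuid4().hex, task_id, kind, resume_to, payload)
        self._items[interrupt.interrupt_id] = interrupt
        return interrupt

    def pending(self, task_id: str) -> List[Interrupt]:
        return [
            item
            for item in self._items.values()
            if item.task_id == task_id and item.status == "pending"
        ]

    def resume(self, interrupt_id: str, resume_payload: dict) -> str:
        """Mark the interrupt resumed and return the step to continue at."""
        interrupt = self._items[interrupt_id]
        if interrupt.status != "pending":
            raise ValueError("interrupt already resumed")
        interrupt.status = "resumed"
        interrupt.resume_payload = resume_payload
        return interrupt.resume_to


store = InterruptStore()
interrupt = store.pause_for(
    "task-1", "approval", "after_approval", {"artifact_ref": "artifact:1"}
)
waiting = store.pending("task-1")  # what a product UI could list for the approver
next_step = store.resume(interrupt.interrupt_id, {"approved": True})
```

Rejecting a second resume keeps a double-clicked approval button from continuing the workflow twice.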

Checkpoint / Resume Acceptance Questions

Question	Passing standard
Which steps can rerun?	Rerunnable steps have no irreversible side effects or use idempotency keys
Which steps cannot rerun?	External writes, billing, notifications, and approvals are recorded and protected
What live dependencies are needed after restore?	`runtime_resources` keys are explicit and the restore side can reinject them
Where does execution continue?	Pending interrupt, checkpoint state, or business phase can locate the next step
Can artifacts be trusted?	Each artifact has source, summary, scope, and links when needed
How is failure debugged?	RuntimeEvent, execution snapshot, and Workspace evidence can be correlated
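The first two questions usually come down to idempotency keys. Below is a minimal sketch, assuming a hypothetical `SideEffectLog` that records each external write under a key derived from the task id, step name, and arguments. In a real system this log would need to be durable; here it lives in memory.

```python
import hashlib
import json
from typing import Callable


class SideEffectLog:
    """Hypothetical record of completed external writes, so a rerun can skip them."""

    def __init__(self) -> None:
        self._done = {}

    def run_once(self, task_id: str, step: str, args: dict, action: Callable) -> object:
        # The same task, step, and arguments always produce the same key.
        raw = json.dumps([task_id, step, args], sort_keys=True)
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if key in self._done:
            return self._done[key]  # already executed: return the recorded result
        result = action(**args)
        self._done[key] = result
        return result


sent = []


def send_notification(channel: str, text: str) -> str:
    """Stand-in for an irreversible external write."""
    sent.append((channel, text))
    return f"sent:{len(sent)}"


log = SideEffectLog()
args = {"channel": "ops", "text": "Daily report is ready"}
first = log.run_once("task-1", "notify", args, send_notification)
second = log.run_once("task-1", "notify", args, send_notification)  # simulated rerun
```

The rerun returns the recorded result and the notification is sent only once.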

Avoid

Putting large reports or raw logs into TriggerFlow state.
Treating Session as durable task memory.
Keeping only the final answer and discarding evidence.
Pausing a workflow without a task id, an interrupt id, and a defined resume payload.
Storing live clients in serialized state.
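Two of these mistakes (large content in state, live clients in state) can be caught mechanically before a snapshot is written. The guard below is a sketch: `check_state` and the 2,000-byte limit are hypothetical, and a real project would pick its own threshold.

```python
import json
from typing import List

MAX_STATE_VALUE_BYTES = 2_000  # hypothetical limit for a single state value


def check_state(state: dict) -> List[str]:
    """Return a list of problems that would make this state unsafe to snapshot."""
    problems = []
    for key, value in state.items():
        try:
            encoded = json.dumps(value)
        except (TypeError, ValueError):
            problems.append(f"{key}: not JSON-serializable (live object?)")
            continue
        if len(encoded.encode("utf-8")) > MAX_STATE_VALUE_BYTES:
            problems.append(f"{key}: too large for state, store it in Workspace and keep a ref")
    return problems


class FakeSocket:
    """Stand-in for a live client that cannot be serialized."""


good = check_state({"phase": "review", "report_artifact_ref": "artifact:1"})
bad = check_state({"report": "x" * 10_000, "client": FakeSocket()})
```

Here `good` is empty, while `bad` reports one oversized value and one non-serializable object.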
