Long-Running Agent State and Recovery Playbook

Long-running agent tasks fail when all state is treated as "context". Conversation history, workflow phase, report artifacts, approval waits, live clients, and eval evidence have different owners.

Five Owners First

| Owner | Owns | Does not own |
| --- | --- | --- |
| Gateway / API | user_id, tenant_id, session_id, task_id, auth, channel entry | Business execution steps |
| Session | Current chat history, context window, optional memo | Task artifacts, logs, checkpoints |
| TriggerFlow execution | Workflow phases, branches, waits, compact state, runtime stream | Large files, live objects, long-term evidence |
| Workspace | Observations, artifacts, decisions, links, checkpoints, ContextPack source | Automatically deciding what to remember |
| Observability / Eval | RuntimeEvent, trace, eval cases, replay, quality evidence | Business control flow |

What Goes Where

| Information | Place | Reason |
| --- | --- | --- |
| Current conversation | Session | The next request needs recent chat context |
| Current phase, status, refs | TriggerFlow state | Close snapshot stays compact and recoverable |
| Reports, downloads, evidence, long logs | Workspace artifact / record | Large content needs cross-turn lookup and review |
| Business decisions and evidence links | Workspace link | Later readers can see which evidence supports which decision |
| Human approval wait | pause_for(...) interrupt | The waiting point is saveable, visible, and resumable |
| Database client, browser, socket, callback | runtime resources | Live objects cannot be serialized and must be reinjected |
| Representative inputs and acceptance rules | Eval case / app test | Release decisions need repeatable evidence |

```text
Gateway assigns task_id / session_id / trace_id
  -> AgentExecution or TriggerFlow execution starts
  -> compact state stores phase and artifact refs
  -> Workspace stores artifacts, evidence, decisions, checkpoints
  -> runtime stream exposes product-facing progress
  -> RuntimeEvent / DevTools records diagnostic facts
```

Minimal Code Patterns

Give the Task an Owner at the Entry

```python
execution = flow.create_execution(
    meta={
        "task_id": task_id,
        "tenant_id": tenant_id,
        "trace_id": trace_id,
    }
)
```

Store Large Artifacts in Workspace

```python
artifact = await workspace.async_save_artifact(
    name="report-draft.md",
    content=report_markdown,
    media_type="text/markdown",
)
await data.async_set_state("report_artifact_ref", artifact.ref)
```

State keeps a stable ref. Workspace keeps the large content.
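The split between ref and content can be illustrated without the framework. The sketch below uses only the standard library; `InMemoryWorkspace`, its methods, and the `artifact://` ref format are hypothetical stand-ins invented for illustration, not the real Workspace API.

```python
import hashlib
import json


class InMemoryWorkspace:
    """Hypothetical stand-in for a Workspace; not the framework's API."""

    def __init__(self):
        self._blobs = {}

    def save_artifact(self, name: str, content: str) -> str:
        # Derive the ref from the content so saving the same text twice
        # yields the same ref.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
        ref = f"artifact://{name}@{digest}"
        self._blobs[ref] = content
        return ref

    def load_artifact(self, ref: str) -> str:
        return self._blobs[ref]


workspace = InMemoryWorkspace()
report_markdown = "# Daily report\n" + "lorem ipsum\n" * 5000

# Compact state holds only the phase and the ref, never the report body.
state = {
    "phase": "draft_ready",
    "report_artifact_ref": workspace.save_artifact("report-draft.md", report_markdown),
}

# The serialized snapshot stays small no matter how large the report grows.
snapshot = json.dumps(state)
```

Because the snapshot contains only the ref, restoring the execution does not require copying the report through state; the restored step loads it from the workspace when needed.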

Write Checkpoints at Stable Boundaries

```python
await workspace.async_record_checkpoint(
    name="source_selection_done",
    summary="Selected 12 source articles for the daily report.",
    refs=[artifact.ref],
)
```
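As a rough illustration of why stable boundaries matter, the following standard-library sketch keeps an append-only checkpoint log and reads the latest entry to decide where a restored task continues. The function names and the record fields other than `name`, `summary`, and `refs` are assumptions made for this example.

```python
import json
import time


def record_checkpoint(log, name, summary, refs):
    """Append a checkpoint entry; earlier entries are never rewritten."""
    log.append(json.dumps({
        "name": name,
        "summary": summary,
        "refs": refs,
        "recorded_at": time.time(),  # assumed extra field for ordering and debugging
    }))


def latest_checkpoint(log):
    """Return the most recent checkpoint, or None for a fresh task."""
    return json.loads(log[-1]) if log else None


checkpoint_log = []
fresh = latest_checkpoint(checkpoint_log)

record_checkpoint(
    checkpoint_log,
    name="source_selection_done",
    summary="Selected 12 source articles for the daily report.",
    refs=["artifact://report-draft.md"],
)
record_checkpoint(
    checkpoint_log,
    name="draft_written",
    summary="Wrote the first draft from the selected sources.",
    refs=["artifact://report-draft.md"],
)

resume_from = latest_checkpoint(checkpoint_log)
```

Each entry is a self-contained JSON string, so the log can be persisted anywhere that stores text and replayed later without live objects.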

Recall by Goal

The next run should not dump the whole Workspace into prompt context. It should ask Recall / ContextPack for the records relevant to the current goal and budget.
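One way to picture goal-based recall is a scorer plus a budget. The sketch below is a deliberately naive illustration: the `Record` type, the `recall` function, the word-overlap scoring, and the character budget are all assumptions for this example and say nothing about how Recall / ContextPack actually ranks records.

```python
from dataclasses import dataclass


@dataclass
class Record:
    ref: str
    summary: str


def recall(records, goal, budget_chars):
    """Pick the records most relevant to the goal within a size budget."""
    goal_terms = set(goal.lower().split())

    def score(record):
        # Naive relevance: number of goal terms that appear in the summary.
        return len(goal_terms & set(record.summary.lower().split()))

    picked, used = [], 0
    for record in sorted(records, key=score, reverse=True):
        if score(record) == 0:
            break  # remaining records share no terms with the goal
        if used + len(record.summary) > budget_chars:
            continue  # skip records that would exceed the budget
        picked.append(record)
        used += len(record.summary)
    return picked


records = [
    Record("artifact://sources", "selected 12 source articles for the daily report"),
    Record("artifact://billing", "invoice export for march"),
    Record("decision://tone", "daily report tone should stay neutral"),
]

goal = "write the daily report summary"
wide = recall(records, goal, budget_chars=100)
tight = recall(records, goal, budget_chars=50)
```

With the wider budget both report-related records are returned and the billing record is left out; with the tighter budget only the best match fits.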

Use Pause/Resume for External Waits

```python
interrupt = await data.async_pause_for(
    type="approval",
    resume_to="after_approval",
    payload={"artifact_ref": artifact.ref},
)
```

The product layer can show the interrupt and later resume with a payload.
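To make "saveable, visible, and resumable" concrete, here is a minimal standard-library sketch of an interrupt stored as plain JSON. The `pause_for` and `resume` helpers, the dict-backed store, and the `status` field are hypothetical and only mirror the shape of the call above; they are not the framework's implementation.

```python
import json
import uuid


def pause_for(store, task_id, kind, resume_to, payload):
    """Persist a pending interrupt and return its id."""
    interrupt_id = uuid.uuid4().hex
    store[interrupt_id] = json.dumps({
        "task_id": task_id,
        "type": kind,
        "resume_to": resume_to,
        "payload": payload,
        "status": "pending",
    })
    return interrupt_id


def resume(store, interrupt_id, resume_payload):
    """Mark the interrupt resolved and return the next step plus its inputs."""
    record = json.loads(store[interrupt_id])
    if record["status"] != "pending":
        raise ValueError("interrupt already resolved")
    record["status"] = "resolved"
    store[interrupt_id] = json.dumps(record)
    # Inputs for the next step: what was saved at pause time plus the answer.
    return record["resume_to"], {**record["payload"], **resume_payload}


store = {}  # stands in for durable storage
interrupt_id = pause_for(
    store,
    task_id="task-1",
    kind="approval",
    resume_to="after_approval",
    payload={"artifact_ref": "artifact://report-draft.md"},
)

# Later, possibly in another process, the product layer submits the decision.
next_step, inputs = resume(store, interrupt_id, {"approved": True})
```

Because the stored record carries the task id, the resume target, and the original payload, the process that resumes does not need anything from the process that paused.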

Checkpoint / Resume Acceptance Questions

| Question | Passing standard |
| --- | --- |
| Which steps can rerun? | Rerunnable steps have no irreversible side effects or use idempotency keys |
| Which steps cannot rerun? | External writes, billing, notifications, and approvals are recorded and protected |
| What live dependencies are needed after restore? | runtime_resources keys are explicit and the restore side can reinject them |
| Where does execution continue? | Pending interrupt, checkpoint state, or business phase can locate the next step |
| Can artifacts be trusted? | Each artifact has source, summary, scope, and links when needed |
| How is failure debugged? | RuntimeEvent, execution snapshot, and Workspace evidence can be correlated |
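The idempotency-key standard for rerunnable steps can be sketched in a few lines. Everything here is an illustrative assumption: the `run_once` helper, the dict-backed ledger, and the key format are not part of the framework, and a real ledger would need durable storage.

```python
def run_once(ledger, idempotency_key, side_effect):
    """Run side_effect only if this key has not completed before."""
    if idempotency_key in ledger:
        return ledger[idempotency_key]  # replay the recorded result
    result = side_effect()
    ledger[idempotency_key] = result
    return result


sent = []


def send_notification():
    # Stands in for an external write that must not happen twice.
    sent.append("report ready")
    return {"delivered": True}


ledger = {}  # stands in for a durable record of completed side effects
key = "task-1:notify:report_ready"

first = run_once(ledger, key, send_notification)
second = run_once(ledger, key, send_notification)  # simulated rerun after restore
```

The rerun returns the recorded result without sending a second notification, which is the behavior the first two rows of the table ask for.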

Avoid

  • Putting large reports or raw logs into TriggerFlow state.
  • Treating Session as durable task memory.
  • Keeping only the final answer and discarding evidence.
  • Pausing a workflow without a task id, interrupt id, and resume payload.
  • Storing live clients in serialized state.
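Two of these mistakes, oversized values and live objects in state, can be caught mechanically before a snapshot is written. The guard below is a sketch under assumptions: `check_compact_state` and the 4096-byte limit are invented for illustration, and JSON is used as a proxy for whatever serialization the snapshot really uses.

```python
import json

MAX_STATE_BYTES = 4096  # assumed limit for illustration only


def check_compact_state(state):
    """Return a list of problems that make state unsafe to snapshot."""
    problems = []
    for key, value in state.items():
        try:
            encoded = json.dumps(value)
        except TypeError:
            # Live objects such as clients or sockets end up here.
            problems.append(f"{key}: not serializable, move to runtime resources")
            continue
        if len(encoded.encode("utf-8")) > MAX_STATE_BYTES:
            problems.append(f"{key}: too large, store as a Workspace artifact")
    return problems


state = {
    "phase": "drafting",
    "report": "x" * 10_000,   # large content that belongs in an artifact
    "db_client": object(),    # stands in for a live client
}

problems = check_compact_state(state)
```

Running a check like this in tests or at checkpoint time turns the list above into failures that appear before a restore is attempted.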