Long-Running Agent State and Recovery Playbook
Long-running agent tasks fail when all state is treated as "context". Conversation history, workflow phase, report artifacts, approval waits, live clients, and eval evidence have different owners.
Five Owners First
| Owner | Owns | Does not own |
|---|---|---|
| Gateway / API | user_id, tenant_id, session_id, task_id, auth, channel entry | Business execution steps |
| Session | Current chat history, context window, optional memo | Task artifacts, logs, checkpoints |
| TriggerFlow execution | Workflow phases, branches, waits, compact state, runtime stream | Large files, live objects, long-term evidence |
| Workspace | Observations, artifacts, decisions, links, checkpoints, ContextPack source | Automatically deciding what to remember |
| Observability / Eval | RuntimeEvent, trace, eval cases, replay, quality evidence | Business control flow |
What Goes Where
| Information | Place | Reason |
|---|---|---|
| Current conversation | Session | The next request needs recent chat context |
| Current phase, status, refs | TriggerFlow state | Close snapshot stays compact and recoverable |
| Reports, downloads, evidence, long logs | Workspace artifact / record | Large content needs cross-turn lookup and review |
| Business decisions and evidence links | Workspace link | Later readers can see which evidence supports which decision |
| Human approval wait | pause_for(...) interrupt | The waiting point is saveable, visible, and resumable |
| Database client, browser, socket, callback | runtime resources | Live objects cannot be serialized and must be reinjected |
| Representative inputs and acceptance rules | Eval case / app test | Release decisions need repeatable evidence |
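The routing rules in the table can be checked mechanically before a snapshot is written. The sketch below is a standalone illustration, not part of TriggerFlow or Workspace: `check_compact_state`, the `MAX_STATE_BYTES` budget, and the `artifact://` ref format are all hypothetical names chosen for this example. It flags values that cannot be serialized (candidates for runtime resources) and values that are too large (candidates for a Workspace artifact).

```python
import json

MAX_STATE_BYTES = 4096  # hypothetical per-key budget for compact state


def check_compact_state(state: dict, max_bytes: int = MAX_STATE_BYTES) -> list[str]:
    """Return a list of problems; an empty list means the state is safe to snapshot."""
    problems = []
    for key, value in state.items():
        try:
            encoded = json.dumps(value)
        except (TypeError, ValueError):
            # Live objects (clients, sockets, callbacks) end up here.
            problems.append(f"{key}: not serializable, move to runtime resources")
            continue
        if len(encoded.encode("utf-8")) > max_bytes:
            problems.append(f"{key}: too large, store in Workspace and keep a ref")
    return problems


state = {
    "phase": "drafting",
    "report_artifact_ref": "artifact://report-draft.md",  # made-up ref format
    "raw_log": "x" * 10_000,  # long log that belongs in Workspace
    "db_client": object(),    # stand-in for a live client
}
for problem in check_compact_state(state):
    print(problem)
```

A guard like this could run in tests or just before a checkpoint, so that misplaced data is caught early rather than at restore time.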
Recommended Shape
```text
Gateway assigns task_id / session_id / trace_id
  -> AgentExecution or TriggerFlow execution starts
  -> compact state stores phase and artifact refs
  -> Workspace stores artifacts, evidence, decisions, checkpoints
  -> runtime stream exposes product-facing progress
  -> RuntimeEvent / DevTools records diagnostic facts
```
Minimal Code Patterns
Give the Task an Owner at the Entry
```python
execution = flow.create_execution(
    meta={
        "task_id": task_id,
        "tenant_id": tenant_id,
        "trace_id": trace_id,
    }
)
```
Store Large Artifacts in Workspace
```python
artifact = await workspace.async_save_artifact(
    name="report-draft.md",
    content=report_markdown,
    media_type="text/markdown",
)
await data.async_set_state("report_artifact_ref", artifact.ref)
```
State keeps a stable ref. Workspace keeps the large content.
Write Checkpoints at Stable Boundaries
```python
await workspace.async_record_checkpoint(
    name="source_selection_done",
    summary="Selected 12 source articles for the daily report.",
    refs=[artifact.ref],
)
```
Recall by Goal
The next run should not dump the whole Workspace into prompt context. It should ask Recall / ContextPack for the records relevant to the current goal and budget.
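The idea of recalling by goal and budget can be illustrated without the real Recall / ContextPack API, which is not shown here. The sketch below is a standalone approximation under stated assumptions: `Record`, `select_for_goal`, tag-overlap scoring, and a character budget are all hypothetical stand-ins for whatever relevance and budgeting rules the actual component applies.

```python
from dataclasses import dataclass


@dataclass
class Record:
    ref: str
    summary: str
    tags: frozenset


def select_for_goal(records, goal_tags, budget_chars):
    """Pick the most goal-relevant records whose summaries fit the budget."""
    goal_tags = set(goal_tags)
    scored = [(len(r.tags & goal_tags), r) for r in records]
    # Keep only records sharing at least one tag; highest overlap first.
    relevant = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    picked, used = [], 0
    for _, record in relevant:
        cost = len(record.summary)
        if used + cost > budget_chars:
            continue  # does not fit; a smaller record may still fit later
        picked.append(record)
        used += cost
    return picked


records = [
    Record("rec-1", "Selected 12 source articles for the daily report.",
           frozenset({"daily-report", "sources"})),
    Record("rec-2", "Billing export finished.", frozenset({"billing"})),
    Record("rec-3", "Draft report saved as report-draft.md.",
           frozenset({"daily-report"})),
]
pack = select_for_goal(records, {"daily-report", "sources"}, budget_chars=100)
print([r.ref for r in pack])
```

The unrelated billing record never reaches the prompt, and a tighter budget drops the lower-scoring record first.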
Use Pause/Resume for External Waits
```python
interrupt = await data.async_pause_for(
    type="approval",
    resume_to="after_approval",
    payload={"artifact_ref": artifact.ref},
)
```
The product layer can show the interrupt and later resume with a payload.
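What "saveable, visible, and resumable" asks of the product side can be sketched with an in-memory stand-in. Everything below is hypothetical and independent of the real `pause_for` mechanism: `InterruptStore`, its method names, and the record fields are assumptions made for illustration only.

```python
import json
import uuid


class InterruptStore:
    """Hypothetical in-memory stand-in for wherever pending interrupts are persisted."""

    def __init__(self):
        self._pending = {}

    def pause(self, task_id, kind, resume_to, payload):
        interrupt_id = str(uuid.uuid4())
        record = {
            "task_id": task_id,
            "kind": kind,
            "resume_to": resume_to,
            "payload": payload,
        }
        json.dumps(record)  # saveable: fail early if the record cannot be serialized
        self._pending[interrupt_id] = record
        return interrupt_id

    def pending(self, task_id):
        # visible: the product layer can list what a task is waiting on
        return [(iid, r) for iid, r in self._pending.items() if r["task_id"] == task_id]

    def resume(self, interrupt_id, resume_payload):
        # resumable: raises KeyError for an unknown or already-resumed interrupt
        record = self._pending.pop(interrupt_id)
        return record["resume_to"], {**record["payload"], **resume_payload}


store = InterruptStore()
interrupt_id = store.pause("task-1", "approval", "after_approval", {"artifact_ref": "a1"})
target, data = store.resume(interrupt_id, {"approved": True})
print(target, data)
```

Removing the record on resume means a duplicate approval click raises an error instead of running the post-approval step twice.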
Checkpoint / Resume Acceptance Questions
| Question | Passing standard |
|---|---|
| Which steps can rerun? | Rerunnable steps have no irreversible side effects or use idempotency keys |
| Which steps cannot rerun? | External writes, billing, notifications, and approvals are recorded and protected |
| What live dependencies are needed after restore? | runtime_resources keys are explicit and the restore side can reinject them |
| Where does execution continue? | Pending interrupt, checkpoint state, or business phase can locate the next step |
| Can artifacts be trusted? | Each artifact has source, summary, scope, and links when needed |
| How is failure debugged? | RuntimeEvent, execution snapshot, and Workspace evidence can be correlated |
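The first two rows depend on idempotency keys, which can be sketched as follows. This is a minimal illustration, not an API of the framework: `SideEffectLedger`, `run_once`, and `send_notification` are hypothetical names, and the in-memory dictionary stands in for a durable store that a real deployment would need.

```python
import hashlib
import json


class SideEffectLedger:
    """Hypothetical record of completed external writes, keyed by idempotency key."""

    def __init__(self):
        self._done = {}

    def run_once(self, task_id, step, args, action):
        # The key is derived from task, step, and arguments, so a rerun of the
        # same step after a restore maps to the same key.
        raw = json.dumps({"task_id": task_id, "step": step, "args": args}, sort_keys=True)
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if key in self._done:
            return self._done[key]  # already executed: return the recorded result
        result = action(**args)
        self._done[key] = result
        return result


sent = []


def send_notification(recipient):
    sent.append(recipient)
    return f"sent:{recipient}"


ledger = SideEffectLedger()
first = ledger.run_once("task-1", "notify", {"recipient": "ops"}, send_notification)
again = ledger.run_once("task-1", "notify", {"recipient": "ops"}, send_notification)
print(first, again, len(sent))
```

The second call returns the recorded result without sending again, which is the property a resumed execution needs for billing, notifications, and other external writes.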
Avoid
- Putting large reports or raw logs into TriggerFlow state.
- Treating Session as durable task memory.
- Keeping only the final answer and discarding evidence.
- Pausing a workflow without a task id, interrupt id, and resume payload.
- Storing live clients in serialized state.