Multi-Agent Wiki

Observability and Event Model

Trace, event, and metrics design for a multi-agent platform.

Without traces, a multi-agent platform is essentially unmaintainable. Recording the final answer is not enough: you also need every routing decision, message, tool call, handoff, state change, approval, failure, and retry.

Event model

TypeScript
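export type AgentEvent = {
  id: string;            // unique id of this event
  traceId: string;       // shared by every event in one trace
  spanId: string;        // unit of work this event belongs to
  parentSpanId?: string; // absent on the root span
  sessionId: string;
  runId: string;
  taskId?: string;       // set only when the event relates to a task
  actor: string;         // who emitted the event
  type: AgentEventType;  // one of the event types listed below
  payload: unknown;      // type-specific data; narrow on `type` before use
  timestamp: string;
  schemaVersion: string; // version of this event schema
};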

| Type | Meaning |
| --- | --- |
| session.started | Session begins |
| workflow.node.enter | Entered a workflow node |
| agent.message.created | Agent produced a message |
| agent.task.assigned | Task assigned to an agent |
| tool.call.started | Tool call began |
| tool.call.completed | Tool call finished |
| handoff.requested | Handoff initiated |
| handoff.accepted | Handoff accepted |
| blackboard.item.created | Shared state written |
| approval.requested | Approval requested |
| approval.granted | Approval granted |
| verifier.issue.found | Verifier raised an issue |
| loop.round.completed | Refinement loop iteration finished |
| budget.exceeded | Budget exhausted |
| session.completed | Session ended |
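
The `AgentEvent` type refers to `AgentEventType` without defining it. One way to pin it down is a string-literal union over the types listed above, sketched below. The `makeEvent` helper, the `evt-N` id scheme, the ISO 8601 timestamp, and the `"1"` schema version are all illustrative assumptions rather than part of the model.

```typescript
// Union of the event type strings listed in the table above.
type AgentEventType =
  | "session.started"
  | "workflow.node.enter"
  | "agent.message.created"
  | "agent.task.assigned"
  | "tool.call.started"
  | "tool.call.completed"
  | "handoff.requested"
  | "handoff.accepted"
  | "blackboard.item.created"
  | "approval.requested"
  | "approval.granted"
  | "verifier.issue.found"
  | "loop.round.completed"
  | "budget.exceeded"
  | "session.completed";

// Same shape as AgentEvent in this section, repeated so the sketch is self-contained.
type AgentEvent = {
  id: string;
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  sessionId: string;
  runId: string;
  taskId?: string;
  actor: string;
  type: AgentEventType;
  payload: unknown;
  timestamp: string;
  schemaVersion: string;
};

let nextEventNumber = 0;

// Hypothetical helper: the caller supplies the routing fields, the helper
// fills in id, timestamp, and schemaVersion.
function makeEvent(
  fields: Omit<AgentEvent, "id" | "timestamp" | "schemaVersion">,
  now: Date = new Date(),
): AgentEvent {
  nextEventNumber += 1;
  return {
    ...fields,
    id: "evt-" + nextEventNumber,  // assumption: any unique string would do
    timestamp: now.toISOString(),  // assumption: ISO 8601 timestamps
    schemaVersion: "1",            // assumption: placeholder version label
  };
}

const started = makeEvent({
  traceId: "trace-1",
  spanId: "span-1",
  sessionId: "session-1",
  runId: "run-1",
  actor: "planner",
  type: "session.started",
  payload: {},
});
```

With the union in place, a misspelled type such as `"tool.call.complete"` is rejected at compile time instead of producing an event that no dashboard ever matches.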

Metrics

| Metric | Meaning |
| --- | --- |
| Task success rate | Share of tasks that succeed |
| Handoff loop rate | Share of sessions that contain a handoff loop |
| Verifier rejection rate | How often the verifier rejects |
| Average agent depth | Average depth of the agent call tree |
| Tool failure rate | Tool errors per tool call |
| Cost per successful task | Total cost divided by the number of successful tasks |
| Human approval latency | Time an approval request waits in the queue |
| Context compression ratio | How effectively context is compressed |
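
As a rough illustration of two of these metrics, the sketch below computes task success rate and cost per successful task from a list of per-task records. The `TaskOutcome` type and its `succeeded` / `costUsd` fields are hypothetical; the event model above does not define them, and in practice they would be derived from the event stream.

```typescript
// Hypothetical per-task record, not part of the event model.
type TaskOutcome = { succeeded: boolean; costUsd: number };

// Share of tasks that succeed; 0 when there are no tasks.
function taskSuccessRate(tasks: TaskOutcome[]): number {
  if (tasks.length === 0) return 0;
  const wins = tasks.filter((t) => t.succeeded).length;
  return wins / tasks.length;
}

// Total spend, including failed tasks, divided by the number of successes.
function costPerSuccessfulTask(tasks: TaskOutcome[]): number {
  const wins = tasks.filter((t) => t.succeeded).length;
  if (wins === 0) return Infinity; // nothing succeeded, so the cost is unbounded
  const total = tasks.reduce((sum, t) => sum + t.costUsd, 0);
  return total / wins;
}

const sampleTasks: TaskOutcome[] = [
  { succeeded: true, costUsd: 1 },
  { succeeded: false, costUsd: 0.5 },
  { succeeded: true, costUsd: 1 },
  { succeeded: false, costUsd: 0.5 },
];

const successRate = taskSuccessRate(sampleTasks);          // 0.5
const costPerWin = costPerSuccessfulTask(sampleTasks);     // 3 / 2 = 1.5
```

Counting the cost of failed tasks in the numerator is what makes this metric useful: retries and dead ends raise it even when the success rate stays flat.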

Trace UI suggestion

Render each session as a tree:

Text
Session
├─ Planner
│  └─ plan.created
├─ Search Agent
│  ├─ tool.web_search
│  └─ result.summary
├─ Code Agent
│  ├─ tool.read_file
│  ├─ tool.edit_file
│  └─ patch.created
├─ Test Agent
│  └─ test.failed
├─ Code Agent retry
└─ Reviewer
   └─ approved
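
A tree like the one above can be rebuilt from `spanId` / `parentSpanId` alone. The sketch below is one possible approach, assuming each span is represented by a single event (a real trace may attach several events to one span) and using the hypothetical names `SpanEvent`, `SpanNode`, `buildForest`, and `renderTree`.

```typescript
// Minimal subset of AgentEvent needed for tree rendering.
type SpanEvent = { spanId: string; parentSpanId?: string; actor: string; type: string };
type SpanNode = { event: SpanEvent; children: SpanNode[] };

// Link events into a forest. Events whose parent is absent become roots.
function buildForest(events: SpanEvent[]): SpanNode[] {
  const bySpan: Record<string, SpanNode> = {};
  events.forEach((event) => {
    bySpan[event.spanId] = { event, children: [] };
  });
  const roots: SpanNode[] = [];
  events.forEach((event) => {
    const node = bySpan[event.spanId];
    const parentId = event.parentSpanId;
    const parent: SpanNode | undefined = parentId === undefined ? undefined : bySpan[parentId];
    if (parent) parent.children.push(node);
    else roots.push(node); // no parent, or parent missing from this batch
  });
  return roots;
}

// Render one tree as indented lines with box-drawing connectors.
function renderTree(node: SpanNode, prefix: string = "", isLast: boolean = true, isRoot: boolean = true): string[] {
  const label = node.event.actor + ": " + node.event.type;
  const line = isRoot ? label : prefix + (isLast ? "└─ " : "├─ ") + label;
  const childPrefix = isRoot ? "" : prefix + (isLast ? "   " : "│  ");
  let lines: string[] = [line];
  node.children.forEach((child, i) => {
    const last = i === node.children.length - 1;
    lines = lines.concat(renderTree(child, childPrefix, last, false));
  });
  return lines;
}

const sampleEvents: SpanEvent[] = [
  { spanId: "s0", actor: "session", type: "session.started" },
  { spanId: "s1", parentSpanId: "s0", actor: "planner", type: "agent.message.created" },
  { spanId: "s2", parentSpanId: "s0", actor: "code-agent", type: "agent.task.assigned" },
  { spanId: "s3", parentSpanId: "s2", actor: "code-agent", type: "tool.call.started" },
];

const rendered = renderTree(buildForest(sampleEvents)[0]).join("\n");
// session: session.started
// ├─ planner: agent.message.created
// └─ code-agent: agent.task.assigned
//    └─ code-agent: tool.call.started
```

Treating events with an unknown parent as roots keeps partial traces renderable, which matters when a run crashes before its parent span is flushed.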

Workflow run observability

A Dynamic Workflow run produces a richer event tree than a single agent. Events are needed at the workflow, phase, agent, claim, and checkpoint layers so that any node in the tree can be reconstructed after the run.

| Event type | When emitted | Key fields |
| --- | --- | --- |
| workflow.created | Claude wrote the script | workflowId, script, phases, estimatedTokens |
| workflow.approved | User confirmed before run | workflowId, approvedAt, approvedBy |
| workflow.phase.started | Script entered a named phase | workflowId, phase, inputCount |
| workflow.phase.completed | Phase exited successfully | workflowId, phase, outputCount, tokens |
| agent.spawned | Script called agent() / parallel() | workflowId, phase, agentTask, parentSpanId |
| agent.completed | Subagent returned | agentId, tokens, durationMs, resultSchema |
| claim.verified | Verifier accepted or rejected a finding | findingId, verifierAgentId, verdict, evidence |
| workflow.checkpoint.saved | Runtime persisted state | workflowId, phase, stateBytes |
| workflow.checkpoint.resumed | Script restarted from checkpoint | workflowId, phase, resumedAt |
| workflow.finalized | Final synthesis returned to user session | workflowId, totalTokens, totalAgents, durationMs |

The trace tree for a workflow run typically nests as:

Text
workflow.created
└─ workflow.approved
   ├─ workflow.phase.started (discover)
   │  ├─ agent.spawned (map-codebase)
   │  └─ agent.completed
   ├─ workflow.phase.started (audit)
   │  ├─ agent.spawned × N (audit-file, parallel barrier)
   │  └─ agent.completed × N
   ├─ workflow.checkpoint.saved
   ├─ workflow.phase.started (verify)
   │  ├─ agent.spawned × N (adversarial-review)
   │  └─ claim.verified × N
   └─ workflow.finalized

Every workflow event should carry the workflowId, and every nested agent.* event should carry the originating phase so that traces can be sliced by phase or by agent.
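
The rule above can be checked mechanically before events are stored. The sketch below reads it as "every `workflow.*` event needs a `workflowId`, every `agent.*` event needs a `phase`", which is one interpretation. The `WorkflowEvent` shape, with both fields lifted to the top level instead of living in `payload`, is an assumption made for brevity, as are the function names.

```typescript
// Hypothetical flattened shape: workflowId and phase lifted out of the payload.
type WorkflowEvent = { type: string; workflowId?: string; phase?: string };

// Return one human-readable problem per rule violation.
function checkWorkflowFields(events: WorkflowEvent[]): string[] {
  const problems: string[] = [];
  events.forEach((e, i) => {
    if (e.type.indexOf("workflow.") === 0 && !e.workflowId) {
      problems.push("event " + i + " (" + e.type + "): missing workflowId");
    }
    if (e.type.indexOf("agent.") === 0 && !e.phase) {
      problems.push("event " + i + " (" + e.type + "): missing phase");
    }
  });
  return problems;
}

// Group events by phase so a trace can be sliced per phase.
// Events without a phase (for example workflow.created) are skipped.
function sliceByPhase(events: WorkflowEvent[]): Record<string, WorkflowEvent[]> {
  const slices: Record<string, WorkflowEvent[]> = {};
  events.forEach((e) => {
    if (e.phase === undefined) return;
    if (!slices[e.phase]) slices[e.phase] = [];
    slices[e.phase].push(e);
  });
  return slices;
}

const runEvents: WorkflowEvent[] = [
  { type: "workflow.created", workflowId: "w1" },
  { type: "agent.spawned", workflowId: "w1", phase: "audit" },
  { type: "agent.completed" },                         // violates the phase rule
  { type: "workflow.phase.started", phase: "audit" },  // violates the workflowId rule
];

const fieldProblems = checkWorkflowFields(runEvents);
const auditSlice = sliceByPhase(runEvents)["audit"];
```

Running a check like this at ingestion time is cheaper than discovering, mid-incident, that half the agent events cannot be attributed to a phase.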

Minimum viable pipeline

  1. Write every event to append-only JSONL first.
  2. Mirror key fields into Postgres / ClickHouse.
  3. Use traceId / spanId for tree rendering.
  4. Redact sensitive fields from messages and tool calls.
  5. Eventually export the events to OpenTelemetry or your own observability platform.
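
Steps 1 and 4 can be sketched together as below. This version redacts before writing, so that secrets never reach the append-only log; that ordering is a design choice of the sketch, not something the list prescribes. The `SENSITIVE_KEYS` list, the `[REDACTED]` placeholder, and the injected `sink` callback (standing in for a real file append) are all assumptions.

```typescript
// Minimal subset of AgentEvent used by this sketch.
type LoggedEvent = { id: string; traceId: string; spanId: string; type: string; payload: unknown };

// Hypothetical list of payload keys to mask; a real deployment would choose its own.
const SENSITIVE_KEYS = ["apiKey", "password", "authorization"];

// Recursively copy a value, replacing the values of sensitive keys with a placeholder.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const copy: Record<string, unknown> = {};
    Object.keys(source).forEach((key) => {
      copy[key] = SENSITIVE_KEYS.indexOf(key) >= 0 ? "[REDACTED]" : redact(source[key]);
    });
    return copy;
  }
  return value; // primitives pass through unchanged
}

// JSONL: one JSON object per line, terminated by a newline.
function toJsonlLine(event: LoggedEvent): string {
  return JSON.stringify({ ...event, payload: redact(event.payload) }) + "\n";
}

// Append-only write. The sink is injected so the sketch does not depend on a
// particular runtime; in Node it could wrap a file append.
function appendEvent(event: LoggedEvent, sink: (line: string) => void): void {
  sink(toJsonlLine(event));
}

const logLines: string[] = [];
appendEvent(
  {
    id: "evt-1",
    traceId: "trace-1",
    spanId: "span-1",
    type: "tool.call.started",
    payload: { tool: "web_search", args: { query: "otel spans", apiKey: "secret-value" } },
  },
  (line) => { logLines.push(line); },
);
```

Because each line is a complete JSON object, the same file can later be replayed into Postgres, ClickHouse, or an OpenTelemetry exporter without a separate parsing format.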