ADR 0004 — Observability, telemetry & evals

Status: Proposed
Date: 2026-06-05
Relates to: ADR 0001 · improvement-proposal.md §6.3

Context

Alfred's whole thesis is provable reliability, yet nothing is instrumented: the CostTracker (src/cost/tracker.ts) is never consulted (see review), and events are emitted as console.log/chalk strings (src/repl.ts). There is no span model, no trajectory export, and no eval harness, so the reliability claim cannot currently be proven.

The field standard is the OpenTelemetry GenAI semantic conventions: gen_ai.* spans for model calls, agent invocations, and workflows, plus tool spans named execute_tool {gen_ai.tool.name}, all carrying token/cost/session attributes. Backends that understand these conventions (Datadog, Honeycomb, Langfuse, LangSmith) render the spans without bespoke code.
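As a rough illustration of what these conventions imply for Alfred, the sketch below builds span names and attributes as plain data, with no SDK involved. The gen_ai.* keys follow the conventions as we understand them; they are still evolving, so check the current spec before relying on any key. alfred.cost.usd is a hypothetical custom attribute (we do not assume cost is standardized), and the helper names chatSpan and toolSpan are ours.

```typescript
// Sketch only: span names and attributes as plain data, no OpenTelemetry SDK.
type AttrValue = string | number | boolean;

interface SpanDescriptor {
  name: string;
  attributes: Record<string, AttrValue>;
}

// Model call: the span name is "{operation} {model}".
function chatSpan(
  model: string,
  inputTokens: number,
  outputTokens: number,
  sessionId: string,
  costUsd: number,
): SpanDescriptor {
  return {
    name: `chat ${model}`,
    attributes: {
      "gen_ai.operation.name": "chat",
      "gen_ai.request.model": model,
      "gen_ai.usage.input_tokens": inputTokens,
      "gen_ai.usage.output_tokens": outputTokens,
      "gen_ai.conversation.id": sessionId,
      "alfred.cost.usd": costUsd, // hypothetical attribute, not part of the conventions
    },
  };
}

// Tool call: the span name is "execute_tool {gen_ai.tool.name}".
function toolSpan(toolName: string, sessionId: string): SpanDescriptor {
  return {
    name: `execute_tool ${toolName}`,
    attributes: {
      "gen_ai.operation.name": "execute_tool",
      "gen_ai.tool.name": toolName,
      "gen_ai.conversation.id": sessionId,
    },
  };
}
```

Keeping the descriptors as data means the same values can later be handed to a real tracer or written to the ledger unchanged.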

Decision

OTel GenAI spans — src/telemetry/otel.ts: wrap each provider.chat, tool call, and orchestrator agent/workflow in a gen_ai.* span; export via OTLP (opt-in env).
The run ledger IS the span tree — emit the ADR 0001/§5.3 HMAC-signed ledger as OTel spans, so the receipt and the observability trace are one artifact (ties in with trace-vault).
Eval harness — src/eval/: replay recorded sessions and assert tool-call / verify-exit regressions.
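To make the first two decisions concrete, here is a minimal sketch of a wrapper that records each wrapped call as a signed span entry. It is not the ADR 0001 ledger format, which this document does not spell out: the SpanRecord shape, the prevSig chaining, and the SpanLedger name are all assumptions, and a real implementation would hand spans to an OpenTelemetry tracer instead of an in-memory array.

```typescript
import { createHmac, randomUUID } from "node:crypto";

// Hypothetical record shape: one ledger entry per finished span.
interface SpanRecord {
  spanId: string;
  parentId: string | null;
  name: string;
  attributes: Record<string, string | number | boolean>;
  ok: boolean;
  prevSig: string; // signature of the previous entry, chaining the ledger
  sig: string; // HMAC-SHA256 over the other fields
}

class SpanLedger {
  readonly entries: SpanRecord[] = [];
  private readonly key: string;

  constructor(key: string) {
    this.key = key;
  }

  private sign(body: Omit<SpanRecord, "sig">): string {
    return createHmac("sha256", this.key).update(JSON.stringify(body)).digest("hex");
  }

  // Run fn inside a span. The entry is appended when fn settles, so
  // entries appear in completion order (children before their parent).
  async withSpan<T>(
    name: string,
    attributes: SpanRecord["attributes"],
    parentId: string | null,
    fn: (spanId: string) => Promise<T>,
  ): Promise<T> {
    const spanId = randomUUID();
    let ok = true;
    try {
      return await fn(spanId);
    } catch (err) {
      ok = false;
      throw err;
    } finally {
      const last = this.entries.length;
      const prevSig = last > 0 ? this.entries[last - 1].sig : "";
      const body = { spanId, parentId, name, attributes, ok, prevSig };
      this.entries.push({ ...body, sig: this.sign(body) });
    }
  }

  // Recompute every signature and check the chain links.
  verify(): boolean {
    let prevSig = "";
    for (const { sig, ...body } of this.entries) {
      if (body.prevSig !== prevSig || this.sign(body) !== sig) return false;
      prevSig = sig;
    }
    return true;
  }
}
```

A tool call nested under an agent span would then look like `ledger.withSpan("invoke_agent alfred", {}, null, (id) => ledger.withSpan("execute_tool read_file", { "gen_ai.tool.name": "read_file" }, id, runTool))`, where the agent name and runTool are placeholders. Because each entry signs the previous signature, editing any recorded span makes verify() return false.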

Consequences

Positive: makes "provable reliability" literally exportable and standard; one artifact serves both audit (HMAC) and observability (OTel); enables regression evals on the agent itself.
Negative/cost: OTel SDK dependency; spans must avoid leaking secrets (coordinate with ADR 0003 redaction); eval harness needs a corpus of recorded sessions.
Phasing: OTel spans + ledger-as-spans P2 (M); eval harness P3.
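The regression check at the heart of the eval harness could start as a comparison between a recorded session and its replay. The sketch below assumes a hypothetical SessionRun shape (an ordered tool-call list plus the exit code of the verify step); the actual recording format for src/eval/ is not decided here.

```typescript
// Hypothetical shapes for a recorded session and its replay.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface SessionRun {
  toolCalls: ToolCall[];
  verifyExitCode: number;
}

interface Regression {
  kind: "tool-call" | "verify-exit";
  detail: string;
}

// Compare a replay against the recording, step by step.
// Note: args are compared via JSON.stringify, so key order matters.
function diffRuns(recorded: SessionRun, replayed: SessionRun): Regression[] {
  const regressions: Regression[] = [];
  const steps = Math.max(recorded.toolCalls.length, replayed.toolCalls.length);
  for (let i = 0; i < steps; i++) {
    const want = recorded.toolCalls[i];
    const got = replayed.toolCalls[i];
    if (JSON.stringify(want) !== JSON.stringify(got)) {
      regressions.push({
        kind: "tool-call",
        detail: `step ${i}: expected ${want?.name ?? "(none)"}, got ${got?.name ?? "(none)"}`,
      });
    }
  }
  if (recorded.verifyExitCode !== replayed.verifyExitCode) {
    regressions.push({
      kind: "verify-exit",
      detail: `expected exit ${recorded.verifyExitCode}, got ${replayed.verifyExitCode}`,
    });
  }
  return regressions;
}
```

An empty result means the replay matched the recording. Exact matching is deliberately strict for a first pass; a corpus with nondeterministic tool arguments would need a looser comparison.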

Alternatives considered

Bespoke JSON logs only. Rejected: re-invents a worse, non-portable subset of OTel; no free backend support.
A hosted tracing SaaS as the default. Rejected: violates local-first; OTLP export is opt-in and points wherever the user chooses (including a local collector).
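The opt-in behaviour can be sketched as a single gate: no endpoint configured means no exporter is created. This assumes Alfred reuses the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable as its opt-in switch; the ADR only says "opt-in env", so the variable choice and the function name are assumptions.

```typescript
// Returns the OTLP endpoint the user chose, or null when export is off.
// Assumption: the standard OTEL_EXPORTER_OTLP_ENDPOINT variable is the opt-in.
function resolveOtlpEndpoint(env: Record<string, string | undefined>): string | null {
  const endpoint = (env["OTEL_EXPORTER_OTLP_ENDPOINT"] ?? "").trim();
  // Unset or blank means telemetry stays local: nothing is exported.
  return endpoint === "" ? null : endpoint;
}
```

Called as `resolveOtlpEndpoint(process.env)`, this returns null by default, which keeps the local-first guarantee; pointing the variable at a local collector (for example http://localhost:4318, the conventional OTLP/HTTP port) enables export without any hosted service.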

References

See improvement-proposal.md §11 — [O1] OTel GenAI agent spans, [O2] Datadog OTel GenAI support.
