A reliability gate for tool-using agents

Determinism ≠ Faithfulness

Record one run, replay it offline, and fail CI when a tool-using agent stops behaving the same way — or stops actually doing what it says. No API key. A few seconds.

CI passing · Python 3.11+ · coverage 93% · mypy --strict · ruff · MIT


See the gate run

Four committed scenarios pass — offline, no key. Add three regressions and the gate catches them on two different axes.

vault gate — exit 0

$ vault gate -b examples/baseline.json

trace-vault gate · 4 scenarios

scenario                     determ.  faithful.    traj.  verdict
-----------------------------------------------------------------
booking.write_room             1.00       1.00     1.00   PASS
research.cite_source           1.00       1.00     1.00   PASS
refund.transfer_once           1.00       1.00     1.00   PASS
refund.idempotent_retry        1.00       1.00     1.00   PASS

GATE: PASS  (4/4 scenarios within thresholds)   exit 0

vault gate --full — exit 1

$ vault gate -b examples/baseline.json --full

trace-vault gate · 7 scenarios

scenario                     determ.  faithful.    traj.  verdict
-----------------------------------------------------------------
booking.write_room             1.00       1.00     1.00   PASS
research.cite_source           1.00       1.00     1.00   PASS
refund.transfer_once           1.00       1.00     1.00   PASS
refund.idempotent_retry        1.00       1.00     1.00   PASS
report.flaky_plan              0.60*      1.00     1.00   FAIL
booking.unfaithful_write       1.00       0.00*    1.00   FAIL
payment.injection              1.00       0.00*    1.00   FAIL

GATE: FAIL  (3/7 scenario(s) below threshold)   exit 1

  x report.flaky_plan: determinism 0.60 < 1.00
  x booking.unfaithful_write: faithfulness 0.00 < 1.00
  x payment.injection: faithfulness 0.00 < 1.00

flaky_plan breaks on determinism (it can't repeat its path); the other two break on faithfulness (they reliably leave the real table empty). The * marks the axis that fell below threshold.

Run it yourself

Below is an AI agent. Its job: book a hotel room, then write the booking into a database. You decide how it behaves — trace-vault runs it for real in your browser (20 replays) and checks two independent things:

① Determinism: same path every run?
② Faithfulness: did it really write to the database?

Choose how the agent behaves:

👉 New here? Start with 🎲 Flaky — it shows the difference best.


Why two scores, gated apart

They are genuinely different problems with different fixes. A combined number loses information.

Determinism

Does the agent take the same path every time?

Replay N times and compare the tool-call sequence. Reported as pass^k / pass@k with a seeded bootstrap confidence interval, so sampling noise isn't mistaken for a regression.
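A minimal sketch of the idea, not trace-vault's actual code. It assumes the determinism score is the fraction of replays whose tool-call sequence equals the recorded baseline, and it uses a percentile bootstrap. The function names, the resample count, and the 95% level are all illustrative assumptions.

```python
import random

ToolCalls = tuple[str, ...]  # one run's ordered tool names (assumed representation)


def determinism(baseline: ToolCalls, replays: list[ToolCalls]) -> float:
    """Fraction of replays that took exactly the baseline path."""
    if not replays:
        raise ValueError("need at least one replay")
    return sum(r == baseline for r in replays) / len(replays)


def bootstrap_ci(
    baseline: ToolCalls,
    replays: list[ToolCalls],
    *,
    seed: int = 0,
    resamples: int = 2000,
    alpha: float = 0.05,
) -> tuple[float, float]:
    """Percentile bootstrap interval; the fixed seed makes it repeatable."""
    rng = random.Random(seed)
    n = len(replays)
    stats = sorted(
        determinism(baseline, [replays[rng.randrange(n)] for _ in range(n)])
        for _ in range(resamples)
    )
    lower = stats[int(alpha / 2 * resamples)]
    upper = stats[int((1 - alpha / 2) * resamples) - 1]
    return lower, upper


# Hypothetical data: 12 of 20 replays follow the baseline path.
baseline = ("search", "book", "write")
replays = [baseline] * 12 + [("book", "search", "write")] * 8

score = determinism(baseline, replays)
interval = bootstrap_ci(baseline, replays, seed=0)
```

Because the random generator is seeded, the same replays always produce the same interval, so a change in the interval reflects a change in the agent rather than in the resampling.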

Faithfulness

Did it actually change the data it claims to?

Tools run for real against an in-memory SQLite database and a temp folder. The check reads that state directly — never the transcript.
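The sketch below illustrates the principle with the standard-library sqlite3 module. The table layout, tool functions, and `faithful` check are hypothetical names chosen for the example, not trace-vault's API.

```python
import sqlite3


def book_room(db: sqlite3.Connection, guest: str, room: str) -> str:
    """A faithful tool: it writes the row, then reports success."""
    db.execute("INSERT INTO bookings (guest, room) VALUES (?, ?)", (guest, room))
    db.commit()
    return f"Booked room {room} for {guest}."


def book_room_unfaithful(db: sqlite3.Connection, guest: str, room: str) -> str:
    """An unfaithful tool: same message, but nothing is written."""
    return f"Booked room {room} for {guest}."


def faithful(db: sqlite3.Connection, guest: str, room: str) -> bool:
    """Ground-truth check: query the table, ignore what the agent said."""
    row = db.execute(
        "SELECT 1 FROM bookings WHERE guest = ? AND room = ?", (guest, room)
    ).fetchone()
    return row is not None


def run(tool) -> tuple[str, bool]:
    """Run one tool against a fresh in-memory database and check the state."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE bookings (guest TEXT, room TEXT)")
    transcript = tool(db, "Ada", "101")
    verdict = faithful(db, "Ada", "101")
    db.close()
    return transcript, verdict


good_transcript, good_ok = run(book_room)
bad_transcript, bad_ok = run(book_room_unfaithful)
```

Both runs produce an identical transcript, so a transcript-based check could not tell them apart. Only the query against the table does.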

An agent can be perfectly reproducible and reliably wrong, or correct once on a run it can't repeat. A single combined score would hide one of them.
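To make the point concrete, here is a small sketch that gates each axis on its own. The 1.00 thresholds mirror the "< 1.00" lines in the gate output above. The `Scores` type, the averaging function, and the second agent are hypothetical and exist only for the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    determinism: float
    faithfulness: float


# Both axes must reach 1.00, as in the gate output shown earlier.
THRESHOLDS = Scores(determinism=1.00, faithfulness=1.00)


def failed_axes(scores: Scores, thresholds: Scores = THRESHOLDS) -> list[str]:
    """Name every axis that fell below its own threshold."""
    failed = []
    if scores.determinism < thresholds.determinism:
        failed.append("determinism")
    if scores.faithfulness < thresholds.faithfulness:
        failed.append("faithfulness")
    return failed


def combined(scores: Scores) -> float:
    """A naive single score (plain average), shown only for contrast."""
    return (scores.determinism + scores.faithfulness) / 2


flaky = Scores(determinism=0.60, faithfulness=1.00)
mirror = Scores(determinism=1.00, faithfulness=0.60)  # hypothetical agent
```

The two agents get the same average, yet `failed_axes` reports different axes for them, and the fixes differ accordingly: one needs a stable plan, the other needs its writes to land.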

The two scores move independently

One flaky scenario and one unfaithful scenario, over 20 replays — each caught by a different score.

scenario                  determinism  pass^5  pass@5  faithfulness
-------------------------------------------------------------------
report.flaky_plan                0.60   0.051   0.996          1.00
booking.unfaithful_write         1.00   1.000   1.000          0.00

pass@5 is the chance at least one of five runs matches; pass^5 is the chance all five do. A flaky agent scores a healthy pass@5 (0.996) and a near-zero pass^5 (0.051), and the second number is closer to what you feel in production, where every run has to work rather than the best of five.
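The page does not state which estimator produces these figures. They are consistent with the standard combinatorial estimator (drawing k runs without replacement from n replays, c of which match), assuming a determinism of 0.60 means 12 of 20 replays matched. The sketch below is written under that assumption, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k runs drawn from n (c matching) matches."""
    return 1 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k runs drawn from n (c matching) match."""
    return comb(c, k) / comb(n, k)


# Assumed counts: 12 of 20 replays matched for the flaky scenario.
flaky_at = round(pass_at_k(20, 12, 5), 3)    # 1 - 56/15504
flaky_hat = round(pass_hat_k(20, 12, 5), 3)  # 792/15504

# All 20 replays matched for the unfaithful (but deterministic) scenario.
steady_at = pass_at_k(20, 20, 5)
steady_hat = pass_hat_k(20, 20, 5)
```

Under these assumptions the flaky scenario comes out at 0.996 and 0.051, and the fully deterministic one at 1.0 on both, which agrees with the table.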

How it works

Three parts kept separate — which is what makes offline replay possible.

Provider

One thin interface that answers a single question: what is the next step? It is backed by a real LLM when recording, and by a FakeProvider or a recording when replaying. The offline path never imports a network client.

Cassette

One run's model outputs, saved as YAML with volatile fields (ids, timestamps, UUIDs) normalized out so it still matches on replay.
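A sketch of what normalizing volatile fields might involve, using only the standard library. The key names, placeholder strings, and regular expressions are illustrative assumptions, and the YAML serialization step is left out.

```python
import re

# Patterns for values that change on every run (assumed formats).
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I
)
TIMESTAMP_RE = re.compile(
    r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)
# Hypothetical field names whose values are always volatile.
VOLATILE_KEYS = {"id", "request_id", "created_at"}


def normalize(value, key=None):
    """Recursively replace volatile values with stable placeholders."""
    if key in VOLATILE_KEYS:
        return f"<{key}>"
    if isinstance(value, dict):
        return {k: normalize(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if isinstance(value, str):
        value = UUID_RE.sub("<uuid>", value)
        return TIMESTAMP_RE.sub("<timestamp>", value)
    return value


# Two recordings of the same logical run, differing only in volatile data.
first = {
    "id": "msg_001",
    "created_at": "2024-05-01T10:15:30Z",
    "content": "Booking 3f2b8c1e-9d4a-4c6b-8e21-5a7f0c9d1b34 confirmed at 2024-05-01T10:15:31Z",
}
second = {
    "id": "msg_777",
    "created_at": "2024-06-09T08:00:00Z",
    "content": "Booking 0a1b2c3d-4e5f-4a6b-9c7d-8e9f0a1b2c3d confirmed at 2024-06-09T08:00:02Z",
}
```

After normalization the two recordings compare equal, which is what lets a replay match a cassette recorded on a different day.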

World

An in-memory SQLite database and a temp folder the tools really read and write. This is the ground truth that faithfulness checks against.
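A sketch of how a fresh world per replay could be set up with the standard library. The `World` type and the `fresh_world` helper are hypothetical names, and giving each replay its own world is an assumption about how state is kept from leaking between runs.

```python
import sqlite3
import tempfile
from collections.abc import Iterator
from contextlib import contextmanager
from dataclasses import dataclass
from pathlib import Path


@dataclass
class World:
    db: sqlite3.Connection  # in-memory database the tools write to
    root: Path              # temp folder the tools read and write


@contextmanager
def fresh_world() -> Iterator[World]:
    """Yield an empty database and folder, then tear both down."""
    with tempfile.TemporaryDirectory() as tmp:
        db = sqlite3.connect(":memory:")
        try:
            yield World(db=db, root=Path(tmp))
        finally:
            db.close()


# A tool writes a file in one replay; the next replay starts clean.
with fresh_world() as world:
    (world.root / "report.txt").write_text("done")
    wrote_file = (world.root / "report.txt").exists()
    first_root = world.root

with fresh_world() as world:
    leftover = list(world.root.iterdir())
```

Starting from an empty world each time means a passing check can only come from what the agent did in that replay.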