A reliability gate for tool-using agents
Record one run, replay it offline, and fail CI when a tool-using agent stops behaving the same way — or stops actually doing what it says. No API key. A few seconds.
Four committed scenarios pass — offline, no key. Add three regressions and the gate catches them on two different axes.
    $ vault gate -b examples/baseline.json
    trace-vault gate · 4 scenarios

    scenario                   determ.  faithful.  traj.  verdict
    -----------------------------------------------------------------
    booking.write_room         1.00     1.00       1.00   PASS
    research.cite_source       1.00     1.00       1.00   PASS
    refund.transfer_once       1.00     1.00       1.00   PASS
    refund.idempotent_retry    1.00     1.00       1.00   PASS

    GATE: PASS (4/4 scenarios within thresholds)
    exit 0
    $ vault gate -b examples/baseline.json --full
    trace-vault gate · 7 scenarios

    scenario                   determ.  faithful.  traj.  verdict
    -----------------------------------------------------------------
    booking.write_room         1.00     1.00       1.00   PASS
    research.cite_source       1.00     1.00       1.00   PASS
    refund.transfer_once       1.00     1.00       1.00   PASS
    refund.idempotent_retry    1.00     1.00       1.00   PASS
    report.flaky_plan          0.60*    1.00       1.00   FAIL
    booking.unfaithful_write   1.00     0.00*      1.00   FAIL
    payment.injection          1.00     0.00*      1.00   FAIL

    GATE: FAIL (3/7 scenario(s) below threshold)
    exit 1
    x report.flaky_plan: determinism 0.60 < 1.00
    x booking.unfaithful_write: faithfulness 0.00 < 1.00
    x payment.injection: faithfulness 0.00 < 1.00
flaky_plan breaks on determinism (it can't repeat its path); the other two break on faithfulness (they reliably leave the real table empty). The * marks the axis that fell below threshold.
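The gating rule implied by that output can be sketched in a few lines. This is a minimal illustration, not trace-vault's implementation: the dictionary layout, the `gate` function, and the `traj` key are assumptions, while the scores and the 1.00 thresholds are copied from the run above.

```python
# Hypothetical layout: one threshold per axis, one score dict per scenario.
THRESHOLDS = {"determinism": 1.00, "faithfulness": 1.00, "traj": 1.00}

SCORES = {
    "booking.write_room": {"determinism": 1.00, "faithfulness": 1.00, "traj": 1.00},
    "report.flaky_plan": {"determinism": 0.60, "faithfulness": 1.00, "traj": 1.00},
    "booking.unfaithful_write": {"determinism": 1.00, "faithfulness": 0.00, "traj": 1.00},
}


def gate(scores, thresholds):
    """Return one message per (scenario, axis) pair that falls below threshold."""
    failures = []
    for scenario, axes in scores.items():
        for axis, minimum in thresholds.items():
            if axes[axis] < minimum:
                failures.append(f"{scenario}: {axis} {axes[axis]:.2f} < {minimum:.2f}")
    return failures


failures = gate(SCORES, THRESHOLDS)
exit_code = 1 if failures else 0  # a CI job would pass this to sys.exit()

for line in failures:
    print("x", line)
print("exit", exit_code)
```

Checking each axis against its own threshold is what lets the report say which axis failed, not just that something did.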
The interactive demo agent has one job: book a hotel room, then write the booking into a database. trace-vault runs it for real in the browser (20 replays) and checks two independent things: determinism and faithfulness.
Determinism and faithfulness are genuinely different problems with different fixes. A combined number loses information.
Does the agent take the same path every time?
Replay N times and compare the tool-call sequence. Reported as pass^k / pass@k with a seeded bootstrap CI, so sampling noise isn't mistaken for a regression.
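A rough sketch of that measurement, under stated assumptions: each replay is a list of tool names, the score is the fraction of replays that equal a baseline sequence, and the interval is a plain percentile bootstrap. The function names, tool names, and resample count are hypothetical; trace-vault may compute its interval differently.

```python
import random


def determinism(replays, baseline):
    """Fraction of replays whose tool-call sequence equals the baseline's."""
    return sum(r == baseline for r in replays) / len(replays)


def bootstrap_ci(replays, baseline, resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI. The fixed seed makes the interval reproducible."""
    rng = random.Random(seed)
    n = len(replays)
    scores = sorted(
        determinism([replays[rng.randrange(n)] for _ in range(n)], baseline)
        for _ in range(resamples)
    )
    lo = scores[int(resamples * alpha / 2)]
    hi = scores[int(resamples * (1 - alpha / 2)) - 1]
    return lo, hi


# Hypothetical tool names: 12 of 20 replays follow the baseline path.
baseline = ["search_rooms", "book_room", "write_booking"]
detour = ["search_rooms", "search_rooms", "book_room", "write_booking"]
replays = [baseline] * 12 + [detour] * 8

score = determinism(replays, baseline)
ci = bootstrap_ci(replays, baseline)
print(f"determinism {score:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

Because the generator is seeded, two CI runs over the same replays report the same interval, so a change in the interval reflects a change in the agent rather than in the resampling.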
Did it actually change the data it claims to?
Tools run for real against an in-memory SQLite database and a temp folder. The check reads that state directly — never the transcript.
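To make the idea concrete, here is a minimal sketch using Python's standard `sqlite3` module. The table schema, tool functions, and `is_faithful` check are illustrative assumptions, not trace-vault's code. Both tools return the same success message, and only the database state tells them apart.

```python
import sqlite3


def make_sandbox():
    """Fresh in-memory database with a hypothetical bookings table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE bookings (guest TEXT, room TEXT)")
    return db


def faithful_write(db, guest, room):
    db.execute("INSERT INTO bookings VALUES (?, ?)", (guest, room))
    db.commit()
    return "booking saved"


def unfaithful_write(db, guest, room):
    # Claims success in the transcript but never touches the table.
    return "booking saved"


def is_faithful(db, guest, room):
    """Read the state directly; the tool's return message is ignored."""
    row = db.execute(
        "SELECT COUNT(*) FROM bookings WHERE guest = ? AND room = ?",
        (guest, room),
    ).fetchone()
    return row[0] == 1


good_db, bad_db = make_sandbox(), make_sandbox()
good_msg = faithful_write(good_db, "Ada", "101")
bad_msg = unfaithful_write(bad_db, "Ada", "101")

print(good_msg == bad_msg)                 # the transcripts are identical
print(is_faithful(good_db, "Ada", "101"))  # row exists
print(is_faithful(bad_db, "Ada", "101"))   # table is empty
```

A check that read the transcript would pass both runs; the query against the table fails the second one.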
An agent can be perfectly reproducible and reliably wrong, or correct once on a run it can't repeat. A single combined score would hide one of them.
One flaky scenario and one unfaithful scenario, over 20 replays — each caught by a different score.
| scenario | determinism | pass^5 | pass@5 | faithfulness |
|---|---|---|---|---|
| report.flaky_plan | 0.60 | 0.051 | 0.996 | 1.00 |
| booking.unfaithful_write | 1.00 | 1.000 | 1.000 | 0.00 |
pass@5 is the chance at least one of five runs matches; pass^5 is the chance all five do. A flaky agent scores a healthy pass@5 (0.996) and a near-zero pass^5 (0.051) — and the second number is closer to what you feel in production.
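The table values can be reproduced with the standard combinatorial estimators, assuming a determinism of 0.60 over 20 replays means 12 matching runs. The source does not state which formula trace-vault uses, so treat this as a sketch that happens to agree with the published numbers; the function names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Chance that at least one of k runs drawn from n (c of them passing) passes."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Chance that all k runs drawn from n (c of them passing) pass (pass^k)."""
    return comb(c, k) / comb(n, k)


# report.flaky_plan: assumed 12 of 20 replays match
print(round(pass_hat_k(20, 12, 5), 3))  # 0.051
print(round(pass_at_k(20, 12, 5), 3))   # 0.996

# booking.unfaithful_write: all 20 replays take the same path
print(pass_hat_k(20, 20, 5), pass_at_k(20, 20, 5))  # 1.0 1.0
```

Note that the naive estimate 0.6^5 ≈ 0.078 would not match the table; drawing five runs without replacement from the 20 observed ones does.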
Three parts kept separate — which is what makes offline replay possible.
One thin interface that answers a single question: what is the next step? A real LLM when recording; a FakeProvider or a recording when replaying. The offline path never imports a network client.
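One way such an interface could look, as a sketch only: `FakeProvider` is named above, but the `Provider` protocol, the `next_step` method, its signature, and the `run_agent` loop are assumptions made for illustration.

```python
from typing import Protocol


class Provider(Protocol):
    """Hypothetical interface: given the steps so far, return the next one."""

    def next_step(self, history: list[str]) -> str: ...


class FakeProvider:
    """Replays a fixed script of steps. Nothing here imports a network client."""

    def __init__(self, script):
        self._script = list(script)
        self._position = 0

    def next_step(self, history):
        step = self._script[self._position]
        self._position += 1
        return step


def run_agent(provider: Provider, max_steps: int = 10) -> list[str]:
    """Ask the provider for steps until it says 'done' or the budget runs out."""
    history = []
    for _ in range(max_steps):
        step = provider.next_step(history)
        if step == "done":
            break
        history.append(step)
    return history


path = run_agent(FakeProvider(["search_rooms", "book_room", "done"]))
print(path)
```

The agent loop depends only on `next_step`, so swapping a live model for a scripted or recorded provider requires no change to the loop itself.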
One run's model outputs, saved as YAML with volatile fields (ids, timestamps, UUIDs) normalized out so it still matches on replay.
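Normalization of this kind might look like the sketch below. The key names, placeholder strings, and regular expressions are assumptions, and the example works on plain dictionaries rather than YAML files to stay within the standard library.

```python
import re

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.IGNORECASE
)
TIMESTAMP_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)
VOLATILE_KEYS = {"id", "created_at"}  # hypothetical field names


def normalize(value, key=None):
    """Replace run-specific values with stable placeholders, recursively."""
    if isinstance(value, dict):
        return {k: normalize(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if key in VOLATILE_KEYS:
        return "<volatile>"
    if isinstance(value, str):
        value = UUID_RE.sub("<uuid>", value)
        value = TIMESTAMP_RE.sub("<timestamp>", value)
    return value


run_a = {
    "id": "call_81f3",
    "created_at": "2024-05-01T10:00:00Z",
    "output": "request 3f2b8c1e-9a4d-4e6f-8b2a-1c5d7e9f0a3b accepted",
}
run_b = {
    "id": "call_c07a",
    "created_at": "2024-05-02T18:30:12Z",
    "output": "request 7d1e4a90-2b3c-4d5e-9f60-a1b2c3d4e5f6 accepted",
}

print(normalize(run_a) == normalize(run_b))
```

Two runs that differ only in ids, timestamps, and UUIDs compare equal after normalization, which is what lets a stored recording keep matching on later replays.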
An in-memory SQLite database and a temp folder the tools really read and write. This is the ground truth that faithfulness checks against.