Deterministic Results in Agentic Workflows

I had an agent fail an eval at 2:14 PM on a Wednesday.

The same agent had passed the same eval, with the same prompt and the same toolset, twenty-three hours earlier. The diff between those two runs was zero lines of agent code. The model version hadn't changed. My tests hadn't changed.

What changed was the third-party API the agent called. A field that had been a string became a nullable string. The agent saw null where it expected text, did something different downstream, and failed the assertion at the end of the chain.

I spent an hour debugging my agent before I figured out the agent was fine.

The non-determinism stack

Agents have three layers of variance, and only one of them is your code:

↳ The model. Same prompt, different sampling, different output.
↳ The agent loop. Same outputs from the model, different tool-call ordering on retries, different memory state.
↳ The world. The APIs your agent calls return different data today than they did yesterday. Sometimes that's a bug fix. Sometimes it's an idempotency-key edge case. Sometimes it's a rate limit. Sometimes it's nothing you can reproduce.

Most of the debugging energy I've seen spent on agent reliability targets layer one. Make the model output more deterministic. Lower the temperature. Pin the seed. Switch to a smaller model. Switch to a bigger one. Add a self-consistency check.

Some of it targets layer two. Constrain the agent loop with stricter tool-call schemas. Add planner-executor separation. Add a critic step.

Almost nobody attacks layer three head-on. The world keeps moving and your agent silently inherits the movement.

A non-deterministic system cannot reliably monitor another non-deterministic system.
— Phil Venables, ex-Goldman / Google CISO

Venables said this in the context of AI governance, but it applies one layer down too. A non-deterministic upstream cannot reliably support a deterministic test. If you want to assert anything about your agent's behavior over time, you need to freeze something.

You can't freeze the model — that's the whole point of using one. You can't fully freeze the agent loop either; some non-determinism is inherent to the planning step.

So freeze the world.

What freezing the world actually looks like

Record the real upstream once. Capture the request, the response, the timing, the headers. Then on every subsequent test run, serve those bytes back. Byte-equivalent. Same input, same output. Every run.
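Here is a minimal sketch of the record-once, replay-forever idea. It is not Gostly's implementation: the Tape class, the request_key function, and the tuple shapes for requests and responses are names I made up for illustration, and the sketch leaves out the timing capture mentioned above.

```python
import hashlib
from typing import Callable, Dict, Tuple

# Hypothetical shapes for illustration only; not Gostly's actual API.
Request = Tuple[str, str, bytes]               # (method, url, body)
Response = Tuple[int, Dict[str, str], bytes]   # (status, headers, body)


def request_key(req: Request) -> str:
    """Derive a stable key from the method, URL, and body bytes."""
    method, url, body = req
    h = hashlib.sha256()
    h.update(method.upper().encode())
    h.update(b"\n")
    h.update(url.encode())
    h.update(b"\n")
    h.update(body)
    return h.hexdigest()


class Tape:
    """Stores one response per distinct request and serves it back unchanged."""

    def __init__(self) -> None:
        self._store: Dict[str, Response] = {}

    def record(self, req: Request, upstream: Callable[[Request], Response]) -> Response:
        key = request_key(req)
        if key not in self._store:
            # The only place a live call happens.
            self._store[key] = upstream(req)
        return self._store[key]

    def replay(self, req: Request) -> Response:
        # Raises KeyError instead of silently falling through to the network.
        return self._store[request_key(req)]
```

In a test run you would call only replay, so an unrecorded request fails loudly instead of reaching the live API.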

That's what Gostly does. It's an HTTP proxy that records the traffic between your agent and its external dependencies, and replays that traffic back at request time instead of letting the agent call the live API.

The proxy isn't in the model loop. It isn't in the agent loop. It sits between the agent process and the network, in the same place your operating system's TCP stack already sits, and it never asks an LLM what to return.

The cascade

When the agent makes a request, the proxy walks a cascade:

request
  → exact match against recorded traffic                (returns bytes)
  → structural match (same shape, different ids)        (returns bytes)
  → recording-derived fallback                          (returns bytes)
  → (only if no recording exists) LLM gap-fill, cached  (rare)

The first three steps are deterministic. They return bytes that were recorded once and approved. The fourth step is the last-resort fallback for cases where the agent calls a route you never recorded against — and even there, the result is cached on first generation, so the second call to that route returns the same bytes too.
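The cascade can be sketched in a few lines, under some loud assumptions. The Cascade class, the shape function, and the id pattern are hypothetical. I have also guessed that "recording-derived fallback" means reusing a recording from the same top-level resource, and the real matching rules are certainly richer than this.

```python
import re
from typing import Callable, Dict, Tuple

# Assumption: numeric path segments and UUIDs count as ids.
_ID = re.compile(
    r"/(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})(?=/|$)",
    re.IGNORECASE,
)


def shape(path: str) -> str:
    """Replace id-like path segments with a placeholder."""
    return _ID.sub("/{id}", path)


def resource(path: str) -> str:
    """First path segment, e.g. 'orders' for '/orders/7/items'."""
    return path.strip("/").split("/")[0]


class Cascade:
    def __init__(self, recordings: Dict[str, bytes], gap_fill: Callable[[str], bytes]):
        self.exact = dict(recordings)
        self.by_shape: Dict[str, bytes] = {}
        self.by_resource: Dict[str, bytes] = {}
        for path, body in recordings.items():
            self.by_shape.setdefault(shape(path), body)
            self.by_resource.setdefault(resource(path), body)
        self.gap_fill = gap_fill              # stand-in for the LLM step
        self.generated: Dict[str, bytes] = {}

    def resolve(self, path: str) -> Tuple[str, bytes]:
        """Return (step name, response bytes) for a request path."""
        if path in self.exact:
            return "exact", self.exact[path]
        if shape(path) in self.by_shape:
            return "structural", self.by_shape[shape(path)]
        if resource(path) in self.by_resource:
            return "fallback", self.by_resource[resource(path)]
        if path not in self.generated:
            # Generated once, then served from cache on every later call.
            self.generated[path] = self.gap_fill(path)
        return "gap-fill", self.generated[path]
```

The thing to notice is the ordering. The generative function is reachable only after all three recorded lookups miss, and its result is stored before it is returned.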

In the workloads I've measured, cache hits absorb roughly nine out of every ten requests that miss the recorded steps. Most of the time, when your agent runs, no LLM is involved in supplying the responses your agent sees.

That's the property worth selling. Not "faster mocks." Not "cheaper tests." Reproducibility you can put in a model-risk memo.

What this enables

↳ The same agent input produces the same output across CI runs, across weeks, across model versions — until you deliberately re-record.
↳ Test failures point at your agent, not at someone else's sandbox.
↳ CI doesn't pay per-token to OpenAI to find out the same thing it found out yesterday.
↳ Agent eval becomes a math problem (did the output match?) instead of a flake-management problem (which retries do we count?).
↳ Onboarding a new engineer takes minutes, not hours of credential setup.
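The "math problem" version of eval can be as small as a digest comparison. This is a sketch, not a prescribed harness: run_agent is a hypothetical zero-argument callable that runs your agent against replayed traffic and returns its final output as a string.

```python
import hashlib
from typing import Callable


def output_digest(output: str) -> str:
    """Hash the agent's final output so runs can be compared exactly."""
    return hashlib.sha256(output.encode("utf-8")).hexdigest()


def is_reproducible(run_agent: Callable[[], str], runs: int = 3) -> bool:
    """True when every run yields byte-identical output."""
    digests = {output_digest(run_agent()) for _ in range(runs)}
    return len(digests) == 1
```

If this returns False with the upstream frozen, the remaining variance is coming from the model or the agent loop, which is a much smaller search space.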

Three patterns we keep coming back to

Building this surfaced three architectural disciplines we didn't name at first and now use everywhere.

The three-persona pre-mortem. Before shipping anything that touches authenticated upstream traffic, we review it from three independent perspectives — a platform engineer, an IAM person, and a principal developer. One review finds three or four problems. Three independent reviews surface roughly ten or eleven failure modes, and they're mostly different ones. The first time we ran this, our proposed design had eleven issues we hadn't seen on our own.

In-memory credentials still count. A short-lived bearer token held in process memory is still a credential. If your test infrastructure passes it through a log, a stack trace, a debug response body, or a captured HTTP recording, it's leaked. We strip sixteen header classes structurally at capture time, before anything reaches disk. Not configurable, not skippable.
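A sketch of what structural stripping at capture time can look like. The deny-list below is my own illustrative guess and is much shorter than the sixteen header classes mentioned above, which this post does not enumerate. The name strip_credentials is hypothetical too.

```python
from typing import Dict

# Hypothetical deny-list for illustration; not the actual sixteen classes.
SENSITIVE_HEADERS = frozenset({
    "authorization",
    "proxy-authorization",
    "cookie",
    "set-cookie",
    "x-api-key",
})


def strip_credentials(headers: Dict[str, str]) -> Dict[str, str]:
    """Drop credential-bearing headers before a recording is persisted."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in SENSITIVE_HEADERS  # header names are case-insensitive
    }
```

The function takes no flag to turn it off. Calling it on the capture path, before serialization, is what makes the rule structural instead of a policy someone can skip.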

Passive observation dominates active probing for regulated traffic. When your test stack needs fresh data, the cheap move is to re-call the upstream API. The right move is to passively re-record from observed traffic. Active probing doubles your request volume, doubles your rate-limit exposure, and burns credentials twice. Passive observation costs nothing extra because the traffic was going to happen anyway.

What this isn't

Gostly isn't the LLM. It's the deterministic substrate the LLM runs against.

If you need a generative step — because no recording exists and you're testing the agent's ability to handle a shape it hasn't seen — Gostly will use a model for that. The model is gated to Pro and above and only runs when the cascade has nothing else to return. It will never be on your test's critical path. The cache absorbs the second call, and the third, and the hundredth.

A determinism guarantee also isn't a correctness guarantee. The agent could still be wrong. The point is that when the agent is wrong, it's wrong in a way you can reproduce. Reproducibility is the prerequisite to eval, not the substitute for it.

The closing line that lands

Same input. Same output. Until that's true, every other claim about your agent is just a number that happened to come out on Tuesday.

Try it. Free, self-hosted, no license key.
