Agent runtime

Verifiable AI infrastructure for agent execution

Agents need reproducible upstream behavior to be auditable. Live APIs aren’t reproducible. LLM-generated mocks aren’t reproducible. Gostly is.

Mocks that match what your API actually does. Not what an LLM thinks it does.

The anchor argument

An agent that calls tools is a non-deterministic decision system wrapped around a set of effects on the outside world. When the tool responses themselves are non-deterministic — a live API that returns different data hour-to-hour, an LLM-synthesised mock that hallucinates a new shape per call — the agent’s behaviour cannot be replayed, regression-tested, or formally reviewed. Each run is a new universe.

The fix is structural: pin the tool layer to a recording. The agent stays non-deterministic. The tool calls become repeatable. The trajectory you saw in staging is the same one you can rerun in CI tomorrow, in front of an auditor next quarter, and in front of a regulator the quarter after that.

The readiness gap

Industry surveys of enterprise AI buyers in early 2026 found roughly 83% planning to deploy agents in production within the year and roughly 29% confident that they could deploy them securely. That gap — fifty-four points between intent and readiness — is the part most agent infrastructure does not address.

The high-visibility failures so far have a common shape: an agent makes a tool call against a live system, the live system does something the team did not anticipate, and the consequence is real — production data deleted, money moved, a customer record corrupted. The kind of failure that triggers a board-level review.

Deterministic replay is not a complete answer to that failure mode — a sufficiently determined agent can still cause harm against a real upstream — but it is the part of the answer that engineering can build today, without waiting on a research breakthrough. If the agent’s tool calls are recorded, replayable, and reviewable, the failure can be reproduced and the regression caught before the next deploy.

“You cannot reliably, with high assurance for critical actions, use a solely non-deterministic system to effectively monitor another non-deterministic system.”

— Phil Venables, security leader

What the runtime provides

Byte-equivalent replay of your real APIs

When an agent calls a tool, the proxy returns exactly what your upstream returned the last time that request was recorded: same body, same status, same headers (minus the redacted ones). The agent's observations are reproducible by construction; re-run the same trajectory tomorrow and each tool call returns the same response.
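
As a rough illustration of the idea (not Gostly's implementation), a replay layer can be thought of as a lookup from a request fingerprint to the last recorded response. The names `Recorded`, `request_key`, and `ReplayStore`, and the choice to match on method, path, and body, are assumptions made for this sketch.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Recorded:
    status: int
    headers: tuple  # (name, value) pairs, assumed already redacted at capture
    body: bytes


def request_key(method: str, path: str, body: bytes = b"") -> str:
    """Hypothetical matching rule: fingerprint of method, path, and body."""
    digest = hashlib.sha256()
    for part in (method.upper().encode(), path.encode(), body):
        # Length-prefix each part so adjacent fields cannot blur together.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


class ReplayStore:
    def __init__(self):
        self._recordings = {}

    def record(self, method, path, body, response: Recorded):
        # The latest recording for a request wins.
        self._recordings[request_key(method, path, body)] = response

    def replay(self, method, path, body=b""):
        # Returns the stored response unchanged, or None if nothing was recorded.
        return self._recordings.get(request_key(method, path, body))


store = ReplayStore()
store.record(
    "GET", "/charges/ch_1", b"",
    Recorded(200, (("content-type", "application/json"),), b'{"id":"ch_1"}'),
)
first = store.replay("GET", "/charges/ch_1")
second = store.replay("GET", "/charges/ch_1")
```

Because the stored bytes are returned as-is, `first` and `second` are identical no matter how often or when the lookup runs, which is the property the agent's tool layer relies on.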

Verbatim replay of stateful flows

Multi-step flows — log in, then act — need cookies and CSRF tokens to line up. Inside an active session the proxy keeps a byte-exact in-memory copy of what it just saw, headers included (Set-Cookie, CSRF), and replays it verbatim so the flow stays coherent (X-Ghost-Mock: session-verbatim). Body replay is byte-exact across restarts, served from the on-disk recording; header-verbatim replay is in-session only — that buffer lives in RAM and clears on restart or the next LEARN window.
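
The split between in-session and on-disk replay can be pictured with a small two-tier sketch. Everything here is hypothetical except the `X-Ghost-Mock: session-verbatim` marker, which comes from the text above: the class name, the `VOLATILE` header set, and the dictionary standing in for the on-disk recording are assumptions for illustration only.

```python
# Hypothetical: headers that live only in the in-memory session buffer.
VOLATILE = {"set-cookie", "x-csrf-token"}


class SessionReplay:
    def __init__(self, disk: dict):
        self._disk = disk    # stand-in for the on-disk recording
        self._session = {}   # RAM-only buffer, cleared on restart

    def observe(self, key, headers, body):
        # Keep a verbatim copy for the active session.
        self._session[key] = (dict(headers), body)
        # Persist the body plus only the non-volatile headers.
        durable = {k: v for k, v in headers.items() if k.lower() not in VOLATILE}
        self._disk[key] = (durable, body)

    def restart(self):
        # A restart (or a new LEARN window) drops the in-memory buffer.
        self._session.clear()

    def replay(self, key):
        if key in self._session:
            headers, body = self._session[key]
            return {**headers, "X-Ghost-Mock": "session-verbatim"}, body
        headers, body = self._disk[key]
        return dict(headers), body


proxy = SessionReplay(disk={})
proxy.observe("POST /login", {"Set-Cookie": "sid=abc", "Content-Type": "text/html"}, b"ok")
in_session_headers, in_session_body = proxy.replay("POST /login")
proxy.restart()
after_restart_headers, after_restart_body = proxy.replay("POST /login")
```

In this sketch the body is identical before and after the restart, while the cookie header is only available while the session buffer is alive.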

Linked mocks for stateful CRUD

A POST that creates a resource and a later GET that reads it back are linked through a Harel statechart, so POST /charges → GET /charges/{id} returns the created resource instead of a 404. PATCH/POST transitions advance the resource's status (X-Ghost-Transition), keeping Stripe-shaped charge/subscription/order/invoice lifecycles coherent across a multi-step agent run.
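
To make the linking concrete, here is a minimal sketch that uses a flat state machine rather than a full Harel statechart (no hierarchy or parallel states). The transition table, the `ch_` id format, the method names, and the 409 response for an invalid transition are all assumptions for illustration, not the product's behaviour.

```python
import itertools

# Hypothetical lifecycle for a charge-like resource: (status, action) -> next status.
TRANSITIONS = {
    ("pending", "capture"): "succeeded",
    ("pending", "cancel"): "canceled",
    ("succeeded", "refund"): "refunded",
}


class LinkedMocks:
    def __init__(self):
        self._resources = {}
        self._ids = itertools.count(1)

    def post_create(self, payload: dict) -> dict:
        # POST /charges: store the created resource so later reads can find it.
        rid = f"ch_{next(self._ids)}"
        resource = {**payload, "id": rid, "status": "pending"}
        self._resources[rid] = resource
        return {"status": 201, "body": dict(resource)}

    def get(self, rid: str) -> dict:
        # GET /charges/{id}: return the linked resource instead of a 404.
        resource = self._resources.get(rid)
        if resource is None:
            return {"status": 404, "body": None}
        return {"status": 200, "body": dict(resource)}

    def post_action(self, rid: str, action: str) -> dict:
        # POST/PATCH transition: advance the status only along allowed edges.
        resource = self._resources.get(rid)
        if resource is None:
            return {"status": 404, "body": None}
        next_status = TRANSITIONS.get((resource["status"], action))
        if next_status is None:
            return {"status": 409, "body": dict(resource)}
        resource["status"] = next_status
        return {"status": 200, "body": dict(resource)}


mocks = LinkedMocks()
created = mocks.post_create({"amount": 1200})
fetched = mocks.get(created["body"]["id"])
captured = mocks.post_action(created["body"]["id"], "capture")
invalid = mocks.post_action(created["body"]["id"], "cancel")
```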

TLS interception for tools that pin HTTPS

When an agent's tool hard-pins an HTTPS endpoint you can't rewrite, set ENABLE_TLS_INTERCEPTION and the proxy terminates TLS itself — a CONNECT forward proxy on :8443, reached through the standard HTTPS_PROXY env var, minting per-host leaf certs from an embedded CA served at /ca.crt. HTTP, HTTPS, and HTTP/2 are captured and replayed; plain ws:// too. wss:// interception is on the roadmap, not yet shipped.
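
On the client side, routing a tool through the proxy usually comes down to environment variables. The helper below is a sketch under assumptions: the function name, the `http://` scheme for the proxy URL, and the local path where the CA downloaded from `/ca.crt` has been saved are all hypothetical. `SSL_CERT_FILE` and `REQUESTS_CA_BUNDLE` are common conventions (OpenSSL-based clients and the `requests` library respectively); whether a given tool honours them depends on its HTTP client.

```python
import os


def proxied_env(proxy_host: str, ca_path: str, base=None) -> dict:
    """Build an environment for a child process that should use the proxy."""
    env = dict(os.environ if base is None else base)
    # Port 8443 is the CONNECT forward-proxy port named in the text above.
    env["HTTPS_PROXY"] = f"http://{proxy_host}:8443"
    # Trust the proxy's CA so its per-host leaf certificates validate.
    env["SSL_CERT_FILE"] = ca_path
    env["REQUESTS_CA_BUNDLE"] = ca_path
    return env


env = proxied_env("localhost", "/tmp/gostly-ca.crt", base={})
# Example use (not executed here):
#   subprocess.run(["python", "agent.py"], env=proxied_env("localhost", ca_path))
```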

Recorded upstream behavior, not synthesized

The mocks are not generated by an LLM. They are the actual responses your real service emitted, captured at the proxy layer. That distinction matters when an auditor asks why a particular tool call produced a particular result — the answer is "because the upstream said so," with the recorded payload available for review.

MCP server — Team-tier, shipped

A Model Context Protocol endpoint your agents talk to directly. The MCP server lists configured services, recorded mocks, and active traffic so an agent can introspect its own deterministic environment before it acts. API-key authenticated, tenant-scoped, and gated behind the Team-tier feature flag.

Redacted at capture, not at replay

Sensitive headers are stripped before the recording is ever written to disk. The 16-header redaction floor is immutable: an operator cannot accidentally roll it back. A 19-pattern PII scrubber and a 22-element sensitive-key allowlist apply to bodies before anything is persisted.
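
The ordering described above (redact first, persist second) can be sketched as follows. The header names, sensitive keys, and the single e-mail regex are illustrative stand-ins chosen for this example; they are not the actual 16-header floor, 22-element key list, or 19 PII patterns.

```python
import re

# Illustrative subsets only. frozenset mirrors the idea of a floor that
# cannot be mutated at runtime.
REDACTED_HEADERS = frozenset({"authorization", "cookie", "set-cookie", "x-api-key"})
SENSITIVE_KEYS = frozenset({"password", "token", "secret"})
PII_PATTERNS = [re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")]


def redact_headers(headers: dict) -> dict:
    # Drop sensitive headers entirely, matching names case-insensitively.
    return {k: v for k, v in headers.items() if k.lower() not in REDACTED_HEADERS}


def scrub(value):
    # Walk the body recursively: mask sensitive keys, then pattern-scrub strings.
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else scrub(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str):
        for pattern in PII_PATTERNS:
            value = pattern.sub("[REDACTED]", value)
        return value
    return value


def prepare_for_disk(headers: dict, body):
    # Only the output of this function would ever reach storage.
    return {"headers": redact_headers(headers), "body": scrub(body)}


record = prepare_for_disk(
    {"Authorization": "Bearer abc", "Content-Type": "application/json"},
    {"email": "a@example.com", "password": "hunter2", "notes": ["contact bob@example.org"]},
)
```

Scrubbing before the write means a later replay never has access to the original secret, so there is nothing to leak at replay time.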

Self-hosted and single-tenant

Gostly is self-hosted and single-tenant per deployment: your captured traffic, mocks, and library live in your own database, never co-mingled with another customer's. Every query is tenant-scoped in application code and, on top of that, all tenant-scoped tables carry per-tenant Postgres Row-Level Security policies as a defense-in-depth layer. The agent runtime inherits the same boundary.

MCP server — already shipping on Team

The Model Context Protocol endpoint is live in the platform today. Agents authenticate with an API key, and every call is tenant-scoped in application code — the same authoritative boundary that scopes the REST API scopes the MCP surface. Tools currently exposed include list_services, list_mocks, and traffic introspection. The endpoint is gated behind the mcp feature flag on the Team tier.

# Configure your agent (Claude Desktop, Cursor, custom MCP client)
# to point at your Gostly workspace's MCP endpoint.
{
  "mcpServers": {
    "gostly": {
      "url": "https://<your-workspace>/mcp",
      "headers": { "X-API-Key": "$GOSTLY_API_KEY" }
    }
  }
}

# The agent now lists services, inspects recorded mocks, and reasons over
# the deterministic library before it issues its next tool call.

Structural invariants

These properties are enforced by the type system and the database — not by reviewer attention. Each one is a guarantee that survives a careless commit, not a policy that can be forgotten.

Property	What it guarantees
Tenant isolation	Self-hosted, single-tenant; per-tenant RLS policies on all tenant-scoped tables as defense-in-depth
Header redaction	16-header floor, immutable; applied before any payload is written to disk
Body scrubbing	19 PII regex patterns + 22-element sensitive-key allowlist, applied at capture
Auth surface	SAML + OIDC + 4-role RBAC + audit log (Team tier)
License durability	4-hour offline grace; the runtime keeps serving from cache when the license check is unreachable
Webhook origin trust	Capture is origin-authenticated; replay is SSRF-guarded
Wire-level hardening	Bounded request bodies; constant-time secret compare on the API surface
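
The two wire-level items in the last row are standard techniques, and a generic sketch of each looks like this. The function names and the 1 MiB limit are assumptions for illustration; `hmac.compare_digest` is the Python standard-library constant-time comparison.

```python
import hmac
import io

MAX_BODY_BYTES = 1_048_576  # hypothetical limit, not a documented Gostly value


def secrets_match(presented: str, expected: str) -> bool:
    # compare_digest takes time independent of where the inputs differ,
    # which avoids leaking the secret through response timing.
    return hmac.compare_digest(presented.encode(), expected.encode())


def read_bounded(stream, limit: int = MAX_BODY_BYTES) -> bytes:
    # Read at most one byte past the limit so oversize bodies are detected
    # without buffering them in full.
    data = stream.read(limit + 1)
    if len(data) > limit:
        raise ValueError("request body exceeds limit")
    return data


ok = secrets_match("key-123", "key-123")
small = read_bounded(io.BytesIO(b"hi"), limit=2)
```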

What this doesn’t do

Deterministic replay protects pre-deploy testing and post-incident reproduction. It is not a runtime guard rail against an agent operating directly against a live production system — that requires a policy enforcement layer at the tool call, which is a complementary problem (see our comparisons with Microsoft AGT and AWS AgentCore for how those layers fit together).

Gostly’s runtime captures what the upstream returned. The agent’s reasoning over those returns remains non-deterministic. If the agent is asked the same question against the same recorded library, the LLM may still pick a different path — what is guaranteed is that the tool calls along whichever path it picks are repeatable.

Run your agent against a recording, not a prayer

Capture one good trajectory through your upstream. Replay it byte-equivalent every time after that. Audit it next quarter.


MCP server access requires a Team-tier workspace.