Agent runtime

Verifiable AI infrastructure for agent execution

Agents need reproducible upstream behavior to be auditable. Live APIs aren’t reproducible. LLM-generated mocks aren’t reproducible. Gostly is.

Mocks that match what your API actually does. Not what an LLM thinks it does.

The anchor argument

An agent that calls tools is a non-deterministic decision system wrapped around a set of effects on the outside world. When the tool responses themselves are non-deterministic — a live API that returns different data hour-to-hour, an LLM-synthesised mock that hallucinates a new shape per call — the agent’s behaviour cannot be replayed, regression-tested, or formally reviewed. Each run is a new universe.

The fix is structural: pin the tool layer to a recording. The agent stays non-deterministic. The tool responses become repeatable. The trajectory you saw in staging is the same trajectory you can rerun in CI tomorrow, in front of an auditor next quarter, and in front of a regulator the quarter after that.

The readiness gap

Industry surveys of enterprise AI buyers in early 2026 found roughly 83% planning to deploy agents in production within the year and roughly 29% confident that they could deploy them securely. That gap — fifty-four points between intent and readiness — is the part most agent infrastructure does not address.

The high-visibility failures so far have a common shape: an agent makes a tool call against a live system, the live system does something the team did not anticipate, and the consequence is real — production data deleted, money moved, a customer record corrupted. The kind of failure that triggers a board-level review.

Deterministic replay is not a complete answer to that failure mode — a sufficiently determined agent can still cause harm against a real upstream — but it is the part of the answer that engineering can build today, without waiting on a research breakthrough. If the agent’s tool calls are recorded, replayable, and reviewable, the failure can be reproduced and the regression caught before the next deploy.

“You cannot reliably, with high assurance for critical actions, use a solely non-deterministic system to effectively monitor another non-deterministic system.”
Phil Venables, security leader

What the runtime provides

Byte-equivalent replay of your real APIs

When an agent calls a tool, the proxy returns exactly what your upstream returned the last time that request was recorded: same body, same status, same headers (minus the redacted ones). The observations at every node of the agent’s choice tree are reproducible by construction; you can re-run the same trajectory tomorrow and get the same observations.
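A minimal sketch of the record-then-replay idea, not Gostly's actual implementation. The names `Recorded`, `ReplayStore`, and `request_key`, and the choice to key on method, path, and body, are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Recorded:
    """One captured upstream response (hypothetical shape)."""
    status: int
    headers: tuple  # (name, value) pairs, assumed already redacted
    body: bytes


def request_key(method: str, path: str, body: bytes = b"") -> str:
    """Derive a stable lookup key from a request (assumed keying scheme)."""
    digest = hashlib.sha256()
    for part in (method.upper().encode(), path.encode(), body):
        # Length-prefix each part so ("ab", "c") and ("a", "bc") cannot collide.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


class ReplayStore:
    """In-memory stand-in for a recording library."""

    def __init__(self):
        self._by_key = {}

    def record(self, method: str, path: str, body: bytes, response: Recorded) -> None:
        self._by_key[request_key(method, path, body)] = response

    def replay(self, method: str, path: str, body: bytes = b"") -> Recorded:
        # An unrecorded request raises KeyError; nothing is synthesized.
        return self._by_key[request_key(method, path, body)]
```

Failing loudly on an unrecorded request is the design choice that keeps the library "recorded, not synthesized": a miss is a gap to capture, not something to fill in.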

Recorded upstream behavior, not synthesized

The mocks are not generated by an LLM. They are the actual responses your real service emitted, captured at the proxy layer. That distinction matters when an auditor asks why a particular tool call produced a particular result — the answer is "because the upstream said so," with the recorded payload available for review.

MCP server — Team-tier, shipped

A Model Context Protocol endpoint your agents talk to directly. The MCP server lists configured services, recorded mocks, and active traffic so an agent can introspect its own deterministic environment before it acts. API-key authenticated, tenant-scoped, and gated behind the Team-tier feature flag.

Redacted at capture, not at replay

Sensitive headers are stripped before the recording is ever written to disk. The 16-header redaction floor is immutable: an operator cannot accidentally roll it back. A 19-pattern PII scrubber and a 22-element sensitive-key allowlist apply to bodies before anything is persisted.
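Redaction at capture can be sketched as follows. This is an illustration under stated assumptions: the header set, the sensitive keys, and the single email pattern are small stand-ins, since the real 16-header floor, 19 patterns, and 22 keys are not listed here.

```python
import re

# Illustrative subsets only (assumptions), not the real redaction lists.
REDACTED_HEADERS = frozenset({"authorization", "cookie", "set-cookie", "x-api-key"})
SENSITIVE_KEYS = frozenset({"password", "token", "secret"})
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MASK = "[REDACTED]"


def redact_headers(headers: dict) -> dict:
    """Drop sensitive headers entirely, matching names case-insensitively."""
    return {k: v for k, v in headers.items() if k.lower() not in REDACTED_HEADERS}


def scrub_body(value):
    """Recursively mask sensitive keys and PII-looking strings in a JSON-like body."""
    if isinstance(value, dict):
        return {
            k: MASK if k.lower() in SENSITIVE_KEYS else scrub_body(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [scrub_body(v) for v in value]
    if isinstance(value, str):
        return EMAIL.sub(MASK, value)
    return value


def capture(headers: dict, body) -> dict:
    """Build the record that is allowed to reach disk; raw input is never stored."""
    return {"headers": redact_headers(headers), "body": scrub_body(body)}
```

Because `capture` is the only path to persistence in this sketch, there is no unredacted copy to leak at replay time.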

Per-tenant isolation, all the way down

Every tenant-scoped table (22 of them) has Postgres row-level security enabled with a tenant_isolation policy. The API binds the tenant GUC per request; a forgotten WHERE clause cannot leak rows across tenants. The agent runtime inherits the same boundary.
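As a rough illustration of that boundary, the sketch below generates the kind of Postgres statements such a setup implies. The GUC name `app.tenant_id`, the `uuid` column type, and the helper `rls_statements` are assumptions; only the policy name `tenant_isolation` and the `tenant_id` column come from the description above.

```python
def rls_statements(table: str, guc: str = "app.tenant_id") -> list[str]:
    """Return DDL enabling row-level security on one tenant-scoped table."""
    if not table.isidentifier():
        # Refuse anything that is not a plain identifier before formatting SQL.
        raise ValueError(f"unsafe table name: {table!r}")
    predicate = f"tenant_id = current_setting('{guc}')::uuid"
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;",
        # USING filters rows that are read; WITH CHECK filters rows that are written.
        f"CREATE POLICY tenant_isolation ON {table} "
        f"USING ({predicate}) WITH CHECK ({predicate});",
    ]


# Per request, the API would bind the tenant for the current transaction.
# The third argument (true) makes the setting transaction-local.
BIND_TENANT_SQL = "SELECT set_config('app.tenant_id', %s, true);"
```

With the policy in place, a query such as `SELECT * FROM mocks` with no WHERE clause only sees rows whose `tenant_id` matches the bound setting, provided the connecting role is not the table owner or a role that bypasses RLS.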

MCP server — already shipping on Team

The Model Context Protocol endpoint is live in the platform today. Agents authenticate with an API key, and every call is scoped to the caller’s tenant — the same RLS policy that gates the REST API gates the MCP surface. Tools currently exposed include list_services, list_mocks, and traffic introspection. The endpoint is gated behind the mcp feature flag on the Team tier.

Configure your agent (Claude Desktop, Cursor, or a custom MCP client) to point at your Gostly workspace’s MCP endpoint:

{
  "mcpServers": {
    "gostly": {
      "url": "https://<your-workspace>/v1/mcp",
      "headers": { "X-API-Key": "$GOSTLY_API_KEY" }
    }
  }
}

The agent can now list services, inspect recorded mocks, and reason over the deterministic library before it issues its next tool call.
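For a custom client, the messages on the wire are JSON-RPC 2.0, as defined by the Model Context Protocol. The sketch below only builds the payloads and headers; it does not open a connection. The helper names are hypothetical, and the `Accept` header assumes the endpoint uses MCP's Streamable HTTP transport, which is not confirmed here.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids only need to be unique per session


def jsonrpc(method: str, params=None) -> dict:
    """Build one JSON-RPC 2.0 request object."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def list_tools() -> dict:
    """Ask the server which tools it exposes (MCP method tools/list)."""
    return jsonrpc("tools/list")


def call_tool(name: str, arguments=None) -> dict:
    """Invoke a named tool (MCP method tools/call)."""
    return jsonrpc("tools/call", {"name": name, "arguments": arguments or {}})


def request_headers(api_key: str) -> dict:
    """Headers for an HTTP POST to the MCP endpoint (transport is an assumption)."""
    return {
        "X-API-Key": api_key,
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    }


payload = json.dumps(call_tool("list_services"))
```

A real session starts with the protocol's `initialize` handshake before any tool call; an MCP client SDK normally handles that step.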

Structural invariants

These properties are enforced by the type system and the database — not by reviewer attention. Each one is a guarantee that survives a careless commit, not a policy that can be forgotten.

| Property | What it means in source |
| --- | --- |
| Tenant isolation | 22 Postgres tables with RLS enabled (USING + WITH CHECK on tenant_id) |
| Header redaction | 16-header floor, immutable; applied before any payload is written to disk |
| Body scrubbing | 19 PII regex patterns + 22-element sensitive-key allowlist, applied at capture |
| Auth surface | SAML + OIDC + 4-role RBAC + audit log (Team tier) |
| License durability | 4-hour offline grace; the runtime keeps serving from cache when the license check is unreachable |
| Webhook origin trust | Capture is origin-authenticated; replay is SSRF-guarded |
| Wire-level hardening | Bounded request bodies; constant-time secret compare on the API surface |
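The wire-level hardening properties can be illustrated with standard-library calls. This is a sketch, not the platform's code: the 1 MiB limit and the names `secrets_match`, `read_bounded`, and `BodyTooLarge` are assumptions.

```python
import hmac
import io

MAX_BODY_BYTES = 1 << 20  # assumed limit (1 MiB); the real bound is not stated


def secrets_match(presented: str, expected: str) -> bool:
    """Compare two secrets without leaking the mismatch position through timing."""
    return hmac.compare_digest(presented.encode(), expected.encode())


class BodyTooLarge(Exception):
    """Raised when a request body exceeds the configured bound."""


def read_bounded(stream, limit: int = MAX_BODY_BYTES) -> bytes:
    """Read at most `limit` bytes; reject the request if more are available."""
    data = stream.read(limit + 1)  # one extra byte reveals an oversized body
    if len(data) > limit:
        raise BodyTooLarge(f"request body exceeds {limit} bytes")
    return data
```

Reading `limit + 1` bytes caps memory use regardless of what the client claims in `Content-Length`, and `hmac.compare_digest` avoids the early exit of an ordinary `==` comparison.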

What this doesn’t do

Deterministic replay protects pre-deploy testing and post-incident reproduction. It is not a runtime guard rail against an agent operating directly against a live production system — that requires a policy enforcement layer at the tool call, which is a complementary problem (see our comparisons with Microsoft AGT and AWS AgentCore for how those layers fit together).

Gostly’s runtime captures what the upstream returned. The agent’s reasoning over those returns remains non-deterministic. If the agent is asked the same question against the same recorded library, the LLM may still pick a different path. What is guaranteed is that every recorded tool call along whichever path it picks returns the same response each time.
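One way to check that guarantee across runs is sketched below, assuming each run is logged as a list of (request, response) pairs. The function name and the log shape are hypothetical.

```python
def inconsistent_requests(*runs) -> set:
    """Return requests that received different responses anywhere across the runs.

    Each run is an iterable of (request, response) pairs. Requests must be
    hashable, for example a (method, path) tuple; responses are compared with ==.
    """
    first_seen = {}
    mismatched = set()
    for run in runs:
        for request, response in run:
            # setdefault stores the first response seen and returns it afterwards.
            if first_seen.setdefault(request, response) != response:
                mismatched.add(request)
    return mismatched


# Two runs that took different paths but agree wherever they overlap.
run_a = [(("GET", "/users/1"), b"A"), (("GET", "/orders"), b"B")]
run_b = [(("GET", "/orders"), b"B"), (("POST", "/refund"), b"C")]
overlap_ok = inconsistent_requests(run_a, run_b) == set()
```

The two runs differ in which tools they called, which is expected. The check only fails if the same request ever produced two different responses, which would mean the tool layer was not pinned to a recording.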

Run your agent against a recording, not a prayer

Capture one good trajectory through your upstream. Replay it byte-equivalent every time after that. Audit it next quarter.

MCP server access requires a Team-tier workspace.