The Match Cascade

In MOCK mode every inbound request walks a fixed, deterministic cascade of match tiers — cheapest first, most capable last. The architectural invariant that makes this safe for latency-sensitive and regulated workloads: no LLM ever runs in the request hot path. Generation happens on a background worker behind a bounded queue, and responses are served from cache. This page walks the five tiers in order and shows exactly where — and why — the model stays off the wire.

The invariant: no LLM in the hot path

The contract comes first, because everything else follows from it. When a request arrives in MOCK mode, the only work the request thread is allowed to do is deterministic lookup: hash comparisons, an in-memory store probe, a structural scan, and at most one O(1) cache read. It never blocks on an inference call. The model can be loading, warming, rate-limited, or behind a tripped circuit breaker, and the request still returns at lookup speed.

This is enforced structurally, not by convention. The blocking generation call was moved off the request path entirely: a cache miss enqueues a pre-warm job onto a bounded queue and falls straight through to the next deterministic tier. A single background task owns the inference call and populates the cache, so the next request for the same key gets a cache hit. The inference call never contributes to a request's wall-clock latency.

Why this matters

For a latency-sensitive or regulated buyer, “the mock is usually fast” is not a guarantee — “an LLM is structurally incapable of being on the request path” is. The cascade below is built so that the deterministic tiers answer the request, and generation is something that happens around the hot path, never inside it.

Generative tiers are also gated by subscription tier (Pro and above) and are off by default. See the AI pipeline for the inference-server details.

The five tiers, in order

Each request descends the cascade until a tier produces a response. Earlier tiers are cheaper and more faithful; later tiers are more capable but more synthetic. The tier that answered is recorded on the response via an X-Ghost-* header and counted in the ghost_requests_total{match_type} metric.
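The descent described above can be pictured as a first-match loop over an ordered list of tiers. The following is a minimal sketch, not the agent's implementation: `run_cascade`, the tuple-shaped request, and the stubbed tier functions are all hypothetical names chosen for illustration. Only the tier order, header values, and match_type labels come from this page.

```python
from typing import Callable, Optional

# Hypothetical shapes: a request is (method, uri, body); a tier returns a
# response body, or None when it cannot answer.
Request = tuple[str, str, bytes]
Tier = Callable[[Request], Optional[bytes]]


def run_cascade(request: Request, tiers: list[tuple[str, str, Tier]]):
    """Try each tier in order; return (match_type, header, body) for the first hit."""
    for match_type, header, tier in tiers:
        body = tier(request)
        if body is not None:
            return match_type, header, body
    # No tier answered: the request is counted as a miss, with no tier header.
    return "miss", None, None


# Toy library: only the exact tier knows this one request.
library = {("GET", "/users/42", b""): b'{"id": 42}'}
tiers = [
    ("session_verbatim", "X-Ghost-Mock: session-verbatim", lambda r: None),
    ("exact", "X-Ghost-Mock: true", library.get),
    ("resource_store", "X-Ghost-Resource: true", lambda r: None),
    ("smart_swap", "X-Ghost-SwapMatch: true", lambda r: None),
    ("generated_cached", "X-Ghost-Generated: true", lambda r: None),
]

hit = run_cascade(("GET", "/users/42", b""), tiers)
miss = run_cascade(("GET", "/orders/7", b""), tiers)
```

In this toy setup `hit` is answered by the exact tier and `miss` falls off the end of the list, which is the case the miss counter is meant to capture.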

Session-verbatim

All tiers

X-Ghost-Mock: session-verbatim

The in-memory active-session store, keyed by the current recording session, holds un-redacted clones of everything captured during the last LEARN window. A body-exact match on (method, URI, body) replays the interaction byte-for-byte — Set-Cookie, CSRF, OAuth exchanges included — so stateful flows stay coherent. This buffer is RAM-only: it never leaves the box and resets on restart or a new LEARN window.

Exact

All tiers

X-Ghost-Mock: true

An O(1) per-service hash lookup on (method, URI, request body). If a recorded entry matches exactly, it is served from the disk-loaded library. This is the fastest and most faithful path — the bytes are production-accurate, only with credential headers redacted.
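One way to picture the exact tier is a per-service dictionary keyed by a digest of (method, URI, body). This is a sketch under assumptions: the hashing scheme, `exact_key`, `lookup_exact`, and the sample `billing` service are hypothetical, and the agent's real key derivation is not documented here.

```python
import hashlib
from typing import Optional


def exact_key(method: str, uri: str, body: bytes) -> str:
    """Digest (method, URI, body) into one lookup key (hypothetical scheme)."""
    h = hashlib.sha256()
    for part in (method.encode(), uri.encode(), body):
        # Length-prefix each field so ("AB", "C") and ("A", "BC") cannot collide.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


# One dict per service: lookups are O(1) on average.
library: dict[str, dict[str, bytes]] = {"billing": {}}
library["billing"][exact_key("POST", "/charges", b'{"amount": 100}')] = b'{"id": "ch_1"}'


def lookup_exact(service: str, method: str, uri: str, body: bytes) -> Optional[bytes]:
    """Return the recorded response only if all three fields match exactly."""
    return library.get(service, {}).get(exact_key(method, uri, body))
```

Because the body is part of the key, a request that differs by a single byte does not match here and would continue down the cascade.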

Resource store

All tiers

X-Ghost-Resource: true

This tier catches the canonical "POST a resource, then GET it by id, get a 404" failure. If a resource with the requested id was captured during LEARN, a GET by id serves it back as a 200, linking the POST-created resource to the later read. The agent-side statechart engine also fires here on PATCH /collection/{id} (action body) and POST /collection/{id}/{action}, advancing bundled resource lifecycles (charge, customer, invoice, order, subscription).
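The POST-then-GET linkage can be sketched as an index from (collection path, id) to the captured resource. Everything here is an assumption made for illustration: the `ResourceStore` class, the idea that the id is read from an `id` field of a JSON response, and the URI-splitting rule. The statechart transitions are not modelled.

```python
import json
from typing import Optional


class ResourceStore:
    """Hypothetical index of resources captured during LEARN."""

    def __init__(self) -> None:
        self._resources: dict[tuple[str, str], dict] = {}

    def capture_post(self, uri: str, response_body: bytes) -> None:
        """Index a POST-created resource by (collection path, id)."""
        resource = json.loads(response_body)
        if isinstance(resource, dict) and "id" in resource:
            self._resources[(uri.rstrip("/"), str(resource["id"]))] = resource

    def get(self, uri: str) -> Optional[tuple[int, dict]]:
        """Serve GET /collection/{id} as a 200 if that id was captured."""
        collection, _, resource_id = uri.rstrip("/").rpartition("/")
        resource = self._resources.get((collection, resource_id))
        return (200, resource) if resource is not None else None


store = ResourceStore()
store.capture_post("/orders", b'{"id": "ord_9", "status": "open"}')
found = store.get("/orders/ord_9")
unknown = store.get("/orders/ord_missing")
```

In the sketch an unknown id returns `None`, standing in for "this tier did not answer", so the request would move on to the next tier.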

Smart-swap

All tiers¹

X-Ghost-SwapMatch: true

Structural match. URI path parameters are normalised to a shape (/users/42 and /users/99 share the /users/{id} template) so a recording of one id can answer a request for another. This is the last fully deterministic tier — no model is consulted.
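Path-shape normalisation of this kind can be sketched with a per-segment rule. The rule below (all-digit segments and UUIDs count as parameters) is an assumption, as are the names `path_shape` and `smart_swap`. The agent's actual heuristics may differ, and this sketch returns the recorded body unchanged.

```python
import re
from typing import Optional

# Assumption: a path segment is a parameter if it is all digits or a UUID.
_PARAM = re.compile(r"^(\d+|[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})$")


def path_shape(uri: str) -> str:
    """Collapse parameter-like segments so /users/42 becomes /users/{id}."""
    path = uri.split("?", 1)[0]  # ignore the query string for shaping
    return "/".join("{id}" if _PARAM.match(seg) else seg for seg in path.split("/"))


# Recordings are indexed by (method, shape) instead of the literal URI.
recordings = {("GET", path_shape("/users/42")): b'{"id": 42, "name": "Ada"}'}


def smart_swap(method: str, uri: str) -> Optional[bytes]:
    """Answer a request from any recording that shares its method and path shape."""
    return recordings.get((method, path_shape(uri)))
```

With this rule, a request for `/users/99` finds the recording made for `/users/42`, while `/users/me` keeps its literal shape and does not match.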

AI inference (at the edge of the cascade)

Pro+

X-Ghost-Generated: true

Last resort, and the only tier that involves a model — but still without an LLM on the request thread. The hot path does an O(1) cache read keyed by (method, URI, service). A cache hit serves immediately. A cache miss enqueues a background generation job and falls through to a miss; the worker fills the cache so the next request hits. Free tier stops at smart-swap.

¹ Smart-swap is granted on every licensed tier (Free included). The SMART_SWAP_ENABLED=true opt-in only applies to unlicensed boots, where agent features default off; the license capability sets the ceiling.

If no tier produces a response, the request is recorded as a miss (match_type="miss") and the unmatched request is tracked so you can record or seed coverage for it later.

How generation stays off the hot path

The AI tier is split into two halves: a synchronous half that runs on the request thread, and an asynchronous half that runs on a background worker. The split is the whole point.

On the request

The hot path performs an O(1) cache lookup keyed by (method, URI, service). A hit serves the cached response immediately, with zero network I/O. A miss builds a generation payload, enqueues it, and returns — falling through to smart-swap or a miss. The cached body, status, headers, and confidence are byte-identical to what a synchronous call would have produced.
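The synchronous half can be sketched in a few lines to show that the request thread only reads a dictionary and, at worst, makes a non-blocking enqueue. The names `ai_tier_hot_path` and `worker_step` and the `generate` callback are hypothetical. The real payload and worker loop are richer than this.

```python
import queue

cache: dict = {}                               # key -> cached response, in memory only
jobs: queue.Queue = queue.Queue(maxsize=1024)  # bounded pre-warm queue


def ai_tier_hot_path(method: str, uri: str, service: str):
    """Request-thread half: one cache read, at most one non-blocking enqueue."""
    key = (method, uri, service)
    cached = cache.get(key)
    if cached is not None:
        return cached          # hit: served with no network I/O
    try:
        jobs.put_nowait(key)   # never blocks the request
    except queue.Full:
        pass                   # queue full: drop the job rather than wait
    return None                # miss: the caller falls through


def worker_step(generate) -> None:
    """Background half: take one job, run generation, fill the cache."""
    key = jobs.get()
    cache[key] = generate(key)
    jobs.task_done()


first = ai_tier_hot_path("GET", "/users/42", "billing")   # cold cache
worker_step(lambda key: b'{"generated": true}')            # stand-in for inference
second = ai_tier_hot_path("GET", "/users/42", "billing")  # same key again
```

Here `first` is a miss that only enqueues work, and `second` is served from the cache once the stand-in worker has run.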

Bounded queue

Enqueue is non-blocking. The queue has a fixed capacity (INFERENCE_PREWARM_QUEUE_CAPACITY, default 1024); when it is full, new jobs are dropped with a metric increment rather than blocking the request. Duplicate in-flight keys coalesce into a single inference call, so a thundering herd of misses for the same endpoint produces one generation, not N.
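The drop-when-full and coalescing behaviour can be sketched with a standard-library queue plus an in-flight set. `PrewarmQueue` and its methods are hypothetical names, and the `dropped` counter stands in for the metric increment mentioned above.

```python
import queue
import threading


class PrewarmQueue:
    """Bounded, non-blocking job queue that coalesces duplicate keys (a sketch)."""

    def __init__(self, capacity: int = 1024) -> None:
        self._jobs: queue.Queue = queue.Queue(maxsize=capacity)
        self._in_flight: set = set()
        self._lock = threading.Lock()
        self.dropped = 0  # stands in for the drop metric

    def enqueue(self, key) -> str:
        with self._lock:
            if key in self._in_flight:
                return "coalesced"      # a job for this key already exists
            try:
                self._jobs.put_nowait(key)
            except queue.Full:
                self.dropped += 1       # full: drop instead of blocking
                return "dropped"
            self._in_flight.add(key)
            return "queued"

    def take(self):
        """Worker side: block until a job is available."""
        return self._jobs.get()

    def done(self, key) -> None:
        """Worker side: mark the key finished so it can be queued again."""
        with self._lock:
            self._in_flight.discard(key)


q = PrewarmQueue(capacity=2)
results = [q.enqueue(k) for k in ("a", "a", "b", "c")]
# results == ["queued", "coalesced", "queued", "dropped"]
```

The second `"a"` coalesces into the first, and `"c"` is dropped because the two-slot queue is already full. Neither case makes the caller wait.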

Background worker

A single background task owns the inference HTTP call. It performs generation off the hot path, records success/failure against the inference circuit breaker, and populates the in-memory cache (TTL'd, LRU-bounded). The next request for the same key gets an O(1) cache hit.
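A TTL'd, LRU-bounded in-memory cache of the kind the worker populates can be sketched with an `OrderedDict`. The class name, default sizes, and the injectable `clock` parameter are assumptions for illustration, and the agent's eviction details may differ.

```python
import time
from collections import OrderedDict


class InferenceCache:
    """In-memory cache with per-entry TTL and LRU eviction (a sketch)."""

    def __init__(self, max_entries: int = 1024, ttl_seconds: float = 300.0,
                 clock=time.monotonic) -> None:
        self._entries: OrderedDict = OrderedDict()  # key -> (expires_at, value)
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._entries[key]      # expired: treat as a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used


now = [0.0]  # fake clock so the example is deterministic
cache = InferenceCache(max_entries=2, ttl_seconds=10.0, clock=lambda: now[0])
cache.put("a", "A")
cache.put("b", "B")
cache.get("a")        # touch "a" so "b" becomes the eviction candidate
cache.put("c", "C")   # over capacity: "b" is evicted
```

Both operations are constant-time dictionary work, which keeps the hot-path read O(1).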

The cache is ephemeral by design

The inference cache lives in memory only — it does not persist across agent restarts, and it rebuilds organically from request traffic on a cold start. Cache keys include the service id, so two services (or two deployments) never collide. Each Gostly process is single-tenant by virtue of one license key per binary; the service id is the inner isolation boundary.

Where the model itself runs depends on configuration. Generation is off by default (ENABLE_GENERATION=false) and routes through the local ghost-llamacpp sidecar; retrieval-augmented matching (ENABLE_RAG) is on by default. When fine-tuning is used, LoRA adapters train only on PII-scrubbed rows, are served from cache, and stay self-hosted. None of this changes the invariant — the request thread never waits on a model.

What the cascade can serve — and what it can't

The cascade records and replays HTTP and HTTPS — HTTP/1.1 and HTTP/2 over TLS. Plain HTTP is served on :8080; the TLS-MITM listener binds on :8443 only when ENABLE_TLS_INTERCEPTION is set (tri-state: default off; true/lax; or strict). With interception on, GET /ca.crt serves the MITM CA for the per-OS trust-install step (it returns a 503 while interception is off).
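The tri-state flag can be read as a small enum. This sketch assumes that `true` and `lax` are synonyms and that an unset or unrecognised value means off. The function names are hypothetical, and the agent's real parsing rules may be stricter.

```python
import os
from enum import Enum
from typing import Optional


class TlsInterception(Enum):
    OFF = "off"
    LAX = "lax"
    STRICT = "strict"


def parse_tls_interception(raw: Optional[str]) -> TlsInterception:
    """Map the ENABLE_TLS_INTERCEPTION value onto three states."""
    value = (raw or "").strip().lower()
    if value in ("true", "lax"):
        return TlsInterception.LAX
    if value == "strict":
        return TlsInterception.STRICT
    return TlsInterception.OFF  # assumption: unset or unrecognised means off


def ca_cert_status(mode: TlsInterception) -> int:
    """Status for GET /ca.crt: 503 while interception is off, 200 otherwise."""
    return 503 if mode is TlsInterception.OFF else 200


mode = parse_tls_interception(os.environ.get("ENABLE_TLS_INTERCEPTION"))
```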

Scope boundaries (honest about what's not replayed)

WebSocket frames are captured for observability only — they are not replayed through the cascade. Webhook capture is automatic; webhook replay is operator-triggered through the control-plane API, not auto-fired by the agent. There is no gRPC, async-messaging, or database mocking today — those are on the roadmap.

Two product behaviours are worth distinguishing from the cascade itself. Drift watches recorded responses against live traffic and emits drift events plus a 0–100 freshness score with a sparkline trend, so you know when a mock has gone stale. AI mock-repair turns drift into operator proposals you approve or reject; it is off by default (ENABLE_AI_MOCK_REPAIR) and never edits a mock without approval. Neither sits on the request hot path.

The cascade runs in MOCK mode — the other three modes

The five-tier cascade is what the proxy does in MOCK mode. The agent has four modes total:

LEARN

Transparent pass-through to the real upstream, recording every interaction. This is what populates the library the cascade later serves from.

MOCK

The cascade above. Requests are served from the library (and, on Pro+, the cache) with no upstream required.

PASSTHROUGH

Forward all traffic to the upstream with no recording and no mocking. The proxy is effectively a transparent relay.

TRANSITIONING

A brief interstitial during a LEARN→MOCK transition while recorded traffic is scrubbed and written to the store. Returns 503 with a Retry-After header so callers don't see a partial library.
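A caller that may hit the TRANSITIONING window can honour the 503 with a small retry loop. This is a client-side sketch: `call_with_retry` and the `send` callback are hypothetical, and it assumes Retry-After carries a number of seconds rather than an HTTP date.

```python
import time


def call_with_retry(send, max_attempts: int = 5, sleep=time.sleep):
    """Call send() until it stops returning 503, waiting Retry-After seconds between tries.

    send is any zero-argument callable returning (status, headers, body).
    """
    status, body = None, None
    for _ in range(max_attempts):
        status, headers, body = send()
        if status != 503:
            break
        # Assumption: Retry-After is in seconds; default to 1 if absent.
        sleep(float(headers.get("Retry-After", "1")))
    return status, body


# Simulated server: one 503 during the transition, then a normal response.
responses = iter([(503, {"Retry-After": "2"}, b""), (200, {}, b"ok")])
slept = []
result = call_with_retry(lambda: next(responses), sleep=slept.append)
```

Injecting `sleep` keeps the example deterministic. In real use the default `time.sleep` would apply.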

You don't have to record from scratch to fill the library. Cold-start seeding accepts a HAR, Postman collection, or OpenAPI spec via the dashboard's drag-and-drop, which posts to POST /v1/seed/{har,postman,openapi}. Seeded entries feed the same cascade as recorded traffic.

# Switch to MOCK mode (the cascade is now active)
curl -X POST http://localhost:8000/v1/mode \
  -H "X-API-Key: $GHOST_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"mode": "MOCK"}'

Observing which tier answered

Every match-path outcome is metered. The agent exposes Prometheus metrics on /metrics; the cascade's primary signal is the match_type label on ghost_requests_total:

# Per-tier request counts (one increment per match-path outcome)
ghost_requests_total{match_type="session_verbatim"}
ghost_requests_total{match_type="exact"}
ghost_requests_total{match_type="resource_store"}
ghost_requests_total{match_type="smart_swap"}
ghost_requests_total{match_type="generated_cached"}
ghost_requests_total{match_type="miss"}

# Library + I/O health
ghost_mock_library_size
ghost_io_errors_total{operation="..."}

# HTTP rate + latency, and the TLS-MITM subsystem
axum_http_requests_total
axum_http_requests_duration_seconds
gostly_tls_listener_state{state="..."}

Reading the match_type distribution tells you how your coverage is composed — a healthy library answers most traffic at the exact or smart-swap tiers, with generated_cached as the long tail. A rising miss rate is your signal to record or seed more.
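To turn the counters into a coverage breakdown, the /metrics text can be parsed and normalised into shares. This is a sketch: the `match_type_shares` helper is hypothetical, the sample numbers are made up, and a production setup would more likely compute this in a Prometheus query.

```python
import re
from collections import Counter

_LINE = re.compile(r"^ghost_requests_total\{([^}]*)\}\s+([0-9.eE+-]+)")
_MATCH_TYPE = re.compile(r'match_type="([^"]*)"')


def match_type_shares(metrics_text: str) -> dict:
    """Return each match_type's share of ghost_requests_total."""
    counts: Counter = Counter()
    for line in metrics_text.splitlines():
        m = _LINE.match(line)
        if not m:
            continue  # comments and other metrics are skipped
        label = _MATCH_TYPE.search(m.group(1))
        if label:
            counts[label.group(1)] += float(m.group(2))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


# Made-up sample in Prometheus text exposition format.
sample = """\
# TYPE ghost_requests_total counter
ghost_requests_total{match_type="exact"} 60
ghost_requests_total{match_type="smart_swap"} 30
ghost_requests_total{match_type="miss"} 10
ghost_mock_library_size 412
"""
shares = match_type_shares(sample)
```

For this sample the miss share works out to 10 of 100 requests, the figure to watch when deciding whether to record or seed more.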

OSS proxy vs. the licensed product

The deterministic tiers — session-verbatim, exact, resource store, and smart-swap — are the core matching engine and run everywhere. The generative tier is the line between products:

OSS proxy

A separate open-source product, distributed via Homebrew and a container registry, that ships a host CLI. It runs at the Free tier, so the AI tier is unreachable and the cascade stops at smart-swap. The inference cache and pre-warm path are not part of the open-source build at all.

Licensed product

Distributed as a Docker Compose stack with prebuilt images (no host CLI). Pro and Team licenses unlock the AI tier and the statechart editor / per-tenant overrides. Team adds SSO (SAML + OIDC), four-role RBAC (viewer < member < admin < owner), and an append-only audit log in the web container's auth layer.

Single-tenant per deployment

Each licensed deployment is single-tenant — the tenant id defaults to default and one license key gates one binary. Per-tenant Postgres row-level-security policies are defined as defense-in-depth, but the default configuration is not engine-enforced multi-tenancy: isolation comes from the single-tenant deployment model, with RLS as a defined policy layer beneath it. Outbound TLS fingerprint impersonation (genuine Chrome / Firefox / Safari fingerprints) is a Pro+ capability.

Next steps

How It Works →

The LEARN → TRANSITION → MOCK pipeline and the AI inference server end to end.

Configuration Reference →

Every environment variable, feature flag, and the header-redaction floor.