AI Gap-Fill & Adapter Training

Recordings cover the requests your test suite actually made. Real traffic rarely covers everything — a new query parameter, an unrecorded resource id, an endpoint your CI never hit. Gostly fills those gaps with a generative model that runs at the edge of the match cascade, on infrastructure you control. The LLM is never on the request hot path, no recorded traffic leaves the box, and per-service adapters train only on PII-scrubbed rows. This page explains how.

Where AI sits in the cascade

In MOCK mode every inbound request walks a tiered match cascade. Each tier is cheaper and more deterministic than the next; AI generation is the last resort, reached only when nothing cheaper matches:

1. Session verbatim

If the request was seen during the active LEARN session, the in-memory capture replays it byte-for-byte. RAM-only; never leaves the box; resets on restart or a new LEARN window.

2. Exact match

Method + URI + request-body hash matches a recorded entry exactly. O(1) hash lookup.

3. Resource store

Links a POST-created resource to a later GET-by-id, so POST /charges then GET /charges/{id} returns the created resource instead of a 404.

4. Smart swap

Path parameters are normalised to templates (/users/{id}) and matched structurally — a recording of /users/42 serves a request to /users/99.

5. AI inference (at the edge)

Only when every tier above misses. The inference engine generates a response consistent with what nearby captures returned. Pro/Team only.
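The ordering above can be sketched as a first-match-wins loop. This is a minimal illustration, not the agent's actual implementation: the store layouts, the `run_cascade` and `path_template` helpers, and the numeric-only `{id}` rule are all assumptions, and only the exact-match and smart-swap tiers are modelled.

```python
import hashlib
import re
from typing import Callable, Optional

# Hypothetical types; the real agent's data structures are not documented here.
Response = dict
Tier = Callable[[str, str, bytes], Optional[Response]]


def exact_key(method: str, uri: str, body: bytes) -> str:
    """Method + URI + request-body hash, as described for the exact-match tier."""
    return f"{method} {uri} {hashlib.sha256(body).hexdigest()}"


def path_template(uri: str) -> str:
    """Replace purely numeric path segments with {id} (a simplification)."""
    return re.sub(r"/\d+(?=/|$)", "/{id}", uri)


def run_cascade(tiers: list[tuple[str, Tier]], method: str, uri: str, body: bytes = b""):
    """Return (tier_name, response) from the first tier that matches."""
    for name, tier in tiers:
        response = tier(method, uri, body)
        if response is not None:
            return name, response
    return "miss", None


# Toy data: a single recording of GET /users/42.
recorded = {exact_key("GET", "/users/42", b""): {"status": 200, "body": {"id": 42}}}
templates = {("GET", "/users/{id}"): {"status": 200, "body": {"id": 42}}}

tiers = [
    ("exact", lambda m, u, b: recorded.get(exact_key(m, u, b))),
    ("smart_swap", lambda m, u, b: templates.get((m, path_template(u)))),
]
```

With this toy data, `run_cascade(tiers, "GET", "/users/42")` is served by the exact tier, while `/users/99` falls through to the smart-swap tier because both paths normalise to the same template.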

Architectural invariant — no LLM in the hot path

The match cascade is the only thing on the request hot path, and it is fully deterministic. When a request reaches the AI tier, the agent does an O(1) cache lookup only. On a cache miss it enqueues a background pre-warm job and returns immediately, falling through to the existing miss behaviour (typically smart-swap or a structured unmatched status). A single background worker behind a bounded queue owns the actual generation call; once it completes, the next request for the same key gets the cached result. Generation latency never lands in a request's own 99th percentile. For Free tier the cascade stops at smart swap, and for regulated or latency-sensitive deployments the inference engine can be disabled entirely so the stack runs fully deterministically.
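The lookup-then-enqueue pattern described above can be sketched with the standard library. The class name, the `generate` callable, and the queue size are hypothetical, and the sketch is single-threaded for clarity: a real worker thread would call `drain_once` in a loop and would need locking around the shared state.

```python
import queue
from typing import Callable, Optional


class GapFillCache:
    """Cache lookup on the request path; generation only in the background."""

    def __init__(self, generate: Callable[[str], dict], max_pending: int = 64):
        self._generate = generate          # slow call, never run by lookup()
        self._cache = {}                   # key -> generated response
        self._pending = set()              # keys already queued
        self._jobs = queue.Queue(maxsize=max_pending)  # bounded queue

    def lookup(self, key: str) -> Optional[dict]:
        """Return a cached result, or schedule a pre-warm and return None."""
        hit = self._cache.get(key)
        if hit is not None:
            return hit
        if key not in self._pending:
            try:
                self._jobs.put_nowait(key)  # never blocks the request
                self._pending.add(key)
            except queue.Full:
                pass                        # shed load; a later request may retry
        return None                         # caller falls through to miss behaviour

    def drain_once(self) -> bool:
        """One unit of background work; returns False when the queue is empty."""
        try:
            key = self._jobs.get_nowait()
        except queue.Empty:
            return False
        self._cache[key] = self._generate(key)
        self._pending.discard(key)
        return True
```

The property to notice is that `lookup` never calls `generate`: a slow or failing model can only delay when the cache fills, not how long a request takes.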

Self-hosted inference — no third-party LLM

Training and inference run on infrastructure you control. The data involved is recorded request/response traffic — it may contain auth tokens, business-sensitive payloads, and confidential API schemas — so data sovereignty is a hard constraint, not a preference. Gostly does not send recorded traffic, request bodies, or response schemas to a hosted third-party LLM API. The inference container ships inside your Docker stack; no request bodies or response contents leave the box.

The default ship configuration routes generation through a sibling ghost-llamacpp sidecar container. That sidecar is attached to an internal-only Docker network with no outbound internet route, which you can confirm directly:

docker network inspect <project>_llama-isolated
# look for:  "Internal": true
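The same check can be automated, for example in a deployment smoke test. The sketch below parses the JSON array that `docker network inspect` prints and reads each network's `Internal` field; the helper names are illustrative, and the network name you pass in depends on your compose project name.

```python
import json
import subprocess


def network_is_internal(inspect_json: str) -> bool:
    """True only if the output lists at least one network and all are internal."""
    networks = json.loads(inspect_json)
    return bool(networks) and all(n.get("Internal") is True for n in networks)


def inspect_network(name: str) -> str:
    """Run `docker network inspect <name>` and return its JSON output."""
    result = subprocess.run(
        ["docker", "network", "inspect", name],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

A smoke test could then assert `network_is_internal(inspect_network("<project>_llama-isolated"))`, substituting your own project prefix.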

OSS proxy vs. licensed product

AI gap-fill is a capability of the licensed Gostly product, which ships as docker-compose + private registry images. The separate open-source proxy (distributed via Homebrew / a public container registry) is the deterministic replay core and does not include the inference engine. This page describes the licensed product.

ENABLE_GENERATION & the llamacpp sidecar

The inference server (port 5000) has two generation paths. Which one runs is determined by environment variables on the inference container — there is no third path to a hosted API.

LLAMACPP_ENDPOINT

When set, /generate routes through the sibling ghost-llamacpp HTTP server. The default configuration sets it to http://ghost-llamacpp:8080. The sidecar serves the quantized base model (and any per-service LoRA adapters loaded via --lora), so the inference container itself does not load model weights.

ENABLE_GENERATION

Off by default. Flipping it on additionally loads the generative base model in-process (PyTorch) — useful as a fallback when no sidecar endpoint is configured, but it costs the host >8 GB RAM for weights the sidecar already serves. Most deployments leave it off and rely on the sidecar.

ENABLE_RAG

On by default. Loads a small (~80 MB, CPU-friendly) sentence-embedding model and builds a per-service semantic index over your scrubbed mock library. The /generate path degrades gracefully if the embedding model fails to load. This is the recommended grounding layer.
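One way to reason about the three variables is as a small decision function. This is a reading of the descriptions above, not the inference server's code: the accepted truthy spellings and the helper names are assumptions, and the sidecar is given precedence because the in-process model is described as a fallback for when no sidecar endpoint is configured.

```python
from typing import Mapping

TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings, not confirmed by the docs


def flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Read a boolean environment flag, falling back to its documented default."""
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() in TRUTHY


def generation_path(env: Mapping[str, str]) -> str:
    """Which generation path /generate would use under the assumptions above."""
    if env.get("LLAMACPP_ENDPOINT"):
        return "sidecar"
    if flag(env, "ENABLE_GENERATION", default=False):
        return "in_process"
    return "none"


def describe(env: Mapping[str, str]) -> dict:
    """Summarise generation path and retrieval grounding for a given environment."""
    return {
        "generation": generation_path(env),
        "retrieval": flag(env, "ENABLE_RAG", default=True),
    }
```

Passing `os.environ` to `describe` inside the inference container would show which combination is in effect.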

In the default ship config, LLAMACPP_ENDPOINT and ENABLE_RAG work together: the sidecar handles generation, and the retrieval index grounds it. When a recorded example is close enough to the incoming request, retrieval short-circuits the model entirely and the recorded response is replayed; only genuinely novel requests reach the generator.

# Default ship config: route generation through the isolated sidecar
docker compose up ghost-inference ghost-llamacpp

# Optional in-process PyTorch fallback (>8 GB RAM) — usually left off
ENABLE_GENERATION=true   # loads the generative base model in-process
ENABLE_RAG=true          # retrieval grounding (on by default)
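The retrieval short-circuit can be pictured as a nearest-neighbour lookup with a similarity cutoff. The sketch below uses plain cosine similarity over toy vectors; the `0.9` threshold, the function names, and the in-memory index layout are assumptions, and real embeddings would come from the sentence-embedding model rather than hand-written lists.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_or_none(query_vec, index, threshold: float = 0.9) -> Optional[dict]:
    """Return the nearest recorded response if it is similar enough, else None.

    `index` is a list of (embedding, recorded_response) pairs. A None result
    means the request is novel and would be handed to the generator.
    """
    best_score, best_response = 0.0, None
    for vec, response in index:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```

A request whose embedding sits almost on top of a recorded example replays that recording; one that is equally far from every example returns `None` and continues to generation.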

Resource budget

The inference container with retrieval on is comfortable in a few GB of RAM. Setting ENABLE_GENERATION=true loads model weights in-process and adds several GB, so only do this when you intentionally want the in-process fallback. If you don't need any AI gap-fill, set both flags off to run the deterministic stack alone.

Per-service LoRA adapters

The base model produces generic-but-realistic responses. For a service with enough recorded traffic, you can train a small per-service LoRA adapter: a lightweight set of weights that specialise the base model to one API's shapes, field names, and conventions, without retraining the whole model.

The most important property: adapters train only on PII-scrubbed rows. The training stream is the same scrubbed mock library the deterministic tiers serve from — credentials are stripped at the redaction floor before anything touches disk, and body PII is scrubbed on the path into the durable store and any exports. Training never sees a raw secret.

Training runs as a background job on your own infrastructure, then the trained adapter is converted into the format the sidecar can load and served from there:

# Start a background training session for a service
POST /finetune                       # spawns a background training job
GET  /finetune/{session_id}/status   # poll progress

# Once ready, register the adapter as active for the service
POST /activate/{session_id}
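A client-side loop over these three endpoints might look like the sketch below. Only the paths come from this page. The response field names (`session_id`, `status`), the status values (`ready`, `failed`), and the polling interval are assumptions, and the request body for POST /finetune is omitted because it is not documented here.

```python
import json
import time
import urllib.request
from typing import Callable

Call = Callable[[str, str], dict]


def http_call(base_url: str) -> Call:
    """Build a caller that sends a bodiless request and decodes a JSON reply."""
    def call(method: str, path: str) -> dict:
        req = urllib.request.Request(base_url + path, method=method)
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return call


def train_and_activate(call: Call, poll_seconds: float = 5.0, max_polls: int = 120) -> str:
    """Start training, poll until it finishes, then activate the adapter."""
    session_id = call("POST", "/finetune")["session_id"]  # assumed field name
    for _ in range(max_polls):
        status = call("GET", f"/finetune/{session_id}/status")["status"]  # assumed
        if status == "ready":  # assumed status value
            call("POST", f"/activate/{session_id}")
            return session_id
        if status == "failed":  # assumed status value
            raise RuntimeError(f"training session {session_id} failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"training session {session_id} did not finish in time")
```

Injecting `call` keeps the loop testable without a running server; in a deployment it could be `http_call("http://localhost:5000")`, assuming the inference server's port 5000 is reachable at that host.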

On completion the PEFT adapter is converted to the GGUF format the ghost-llamacpp sidecar consumes via its --lora flag. If the sidecar reports that the converted adapter is on disk but not yet loaded, the operator restarts the sidecar so its entrypoint re-scans the adapter directory and picks it up:

docker compose restart ghost-llamacpp

Once loaded, /generate selects the right adapter per-request, and results are populated into the agent's response cache so repeat requests for the same shape never re-invoke the model. Adapters stay self-hosted: they are trained and served entirely within your own deployment.

Retrieval first, then fine-tune

For a handful to a few dozen examples per endpoint, retrieval grounding (ENABLE_RAG) generally outperforms a freshly trained adapter and is "ready" instantly — there is no training step. Reach for a per-service adapter once a service has accumulated enough recorded traffic to actually shift the model's behaviour. Start with retrieval; add an adapter when the volume justifies it.

Knowing when a response was generated

A generated response is not a recorded one, and Gostly never pretends otherwise. The /generate response envelope carries a source field distinguishing a replayed retrieval hit from base-model or adapter generation, a confidence score, and a degraded boolean. When degraded is true, the model's own output failed schema validation and was backfilled or substituted with a schema-derived template, so a CI pipeline or dashboard can branch on a single signal rather than learning the full repair taxonomy.

Treat degraded responses as a prompt to record more

A degraded response usually means the endpoint is under-covered. The envelope includes a human-readable hint, typically to record more traffic for that endpoint or train a service-specific adapter. Don't accept silent template responses as if they were real generation output.
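A CI step that branches on the envelope could be as small as the sketch below. The key names `degraded` and `confidence` follow the fields named above but their exact spelling in the JSON is an assumption, as are the `0.5` confidence floor and the three outcome labels.

```python
def triage(envelope: dict, min_confidence: float = 0.5) -> str:
    """Classify a /generate envelope as 'accept', 'review', or 'record_more'."""
    # Fail closed: an envelope without the flag is treated as degraded.
    if envelope.get("degraded", True):
        return "record_more"
    if envelope.get("confidence", 0.0) < min_confidence:
        return "review"
    return "accept"
```

Checking `degraded` first keeps the pipeline on the single signal the envelope is designed around; the confidence floor is an optional second gate.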

Cold-start seeding

You don't have to record live traffic to give the gap-fill engine something to work with. Existing artifacts can seed the mock library directly — drag-drop a HAR capture, a Postman collection, or an OpenAPI spec into the dashboard, which posts to the control-plane seeding endpoints:

POST /v1/seed/har
POST /v1/seed/postman
POST /v1/seed/openapi

Seeded entries flow into the same scrubbed mock library the deterministic tiers and the retrieval index read from — so seeding bootstraps both exact matching and the grounding the AI tier relies on.
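Seeding can also be scripted instead of using drag-and-drop. The sketch below only builds the request object. The three paths come from this page, but the upload format (raw file bytes as the body, with a caller-supplied content type) and the control-plane base URL are assumptions to verify against your deployment.

```python
import urllib.request

SEED_PATHS = {
    "har": "/v1/seed/har",
    "postman": "/v1/seed/postman",
    "openapi": "/v1/seed/openapi",
}


def build_seed_request(base_url: str, kind: str, payload: bytes,
                       content_type: str = "application/json") -> urllib.request.Request:
    """Build (but do not send) a POST to one of the seeding endpoints."""
    if kind not in SEED_PATHS:
        raise ValueError(f"unknown seed kind: {kind!r}")
    return urllib.request.Request(
        base_url.rstrip("/") + SEED_PATHS[kind],
        data=payload,                              # assumed: raw artifact bytes
        method="POST",
        headers={"Content-Type": content_type},    # assumed header
    )
```

Sending it is then a matter of passing the returned request to `urllib.request.urlopen`, once the body format has been confirmed.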

Observability

The agent exposes Prometheus metrics at /metrics. The one to watch for gap-fill is the per-outcome match counter — it shows how often each tier of the cascade fired, so you can see what fraction of traffic ever reached AI generation versus being served deterministically:

ghost_requests_total{match_type}

Request counter, one increment per match-path outcome — exact, smart_swap, session_verbatim, resource_store, generated_cached, miss, learn, transitioning, passthrough, chaos, and more. The generated_cached share is your AI gap-fill rate.

ghost_mock_library_size

Mock-library size gauge — how much recorded coverage the deterministic tiers have to draw on.

ghost_io_errors_total{operation}

Disk-sink open/write failures by operation.

axum_http_requests_total / _duration

HTTP request rate and latency by endpoint, method, and status.

A healthy deployment's match_type distribution is dominated by the deterministic tiers, with the generated share shrinking over time as the mock library fills in. AI gap-fill is a safety net for the edges, not the main serving path.
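To track that share over time, the counter can be read straight from the /metrics text. The sketch below parses Prometheus text-format sample lines for `ghost_requests_total` and computes the `generated_cached` fraction; it assumes `match_type` appears as a label on each sample and sums across any other labels. The helper names are illustrative.

```python
import re
from collections import Counter

SAMPLE = re.compile(r'^ghost_requests_total\{([^}]*)\}\s+([0-9.eE+-]+)')
MATCH_TYPE = re.compile(r'match_type="([^"]*)"')


def match_type_counts(metrics_text: str) -> Counter:
    """Sum ghost_requests_total samples per match_type label value."""
    counts = Counter()
    for line in metrics_text.splitlines():
        sample = SAMPLE.match(line.strip())
        if not sample:
            continue  # comments, other metrics, blank lines
        label = MATCH_TYPE.search(sample.group(1))
        if label:
            counts[label.group(1)] += float(sample.group(2))
    return counts


def gap_fill_rate(counts: Counter) -> float:
    """Fraction of all counted requests served as generated_cached."""
    total = sum(counts.values())
    return counts.get("generated_cached", 0.0) / total if total else 0.0
```

Because the counter is cumulative, comparing the rate between two scrapes (rather than reading one absolute value) gives a better picture of whether the generated share is actually shrinking.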

Next steps