Clean-room runbook

Run Claude Skills against deterministic mock services

A step-by-step runbook for wiring Claude Skills against Gostly-served mocks instead of live upstream APIs. Reproducible by construction. Sealed by self-hosted design. Audited by default.

Why this runbook exists

Skills give Claude the ability to do real things — call your APIs, modify your data, move money, send messages. That power is the point. It is also the failure mode: a skill that calls a live upstream during development burns API budget, mutates production state, and produces non-deterministic behavior that can’t be regression-tested.

The fix is structural: pin the skill’s tool layer to a recording. The skill stays expressive. The tool calls become repeatable. The same skill invocation yields the same observed response in dev today, in CI tomorrow, and in front of an auditor next quarter. Gostly is the proxy that captures the recording and serves it back, byte-equivalent, on demand.

What you’ll build

By the end of this runbook you’ll have:

A Gostly proxy running locally with a recorded mock library for the GitHub REST API (the running example below — the same pattern works for any HTTP-speaking upstream).
A Claude Skill whose backend HTTP base URL points at the local Gostly proxy, not the live upstream.
A reproducible test harness for the skill: same input → same recorded response → same observed behavior, every run.
A pattern you can extend to every other skill in your workspace.

Prerequisites

Docker (to run the Gostly stack locally — the customer image ships as a Compose file).
Claude Code or the Claude Agent SDK, with the ability to define custom skills in your local config.
An upstream API you want to mock. We’ll use GitHub’s REST API (api.github.com) as the running example. The same pattern applies to any HTTP-speaking upstream: an internal service, Stripe, Twilio, OpenAI, anything.

Step 1

Capture real upstream traffic

Start Gostly in LEARN mode and route a handful of representative API calls through it. Each request/response pair lands in a local JSONL recording, scrubbed of auth-class headers before the bytes ever touch disk.

# Start the Gostly stack in LEARN mode
docker compose up -d
curl -X POST http://localhost:8000/v1/mode -H 'Content-Type: application/json' -d '{"mode":"learn"}'

# Point your code at the proxy instead of api.github.com
export GITHUB_API_BASE=http://localhost:8080

# Exercise a few representative paths
your-app fetch-pr      --owner octocat --repo Hello-World --number 1
your-app list-issues   --owner octocat --repo Hello-World --state open
your-app fetch-checks  --owner octocat --repo Hello-World --ref main

# Recordings now live in ./data/mocks/github.jsonl
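
If your client is written in Python, the base-URL switch above can be sketched as follows. This is a minimal illustration, not part of Gostly: the helper names `api_base` and `pull_request_url` are hypothetical, and the only thing taken from the runbook is the `GITHUB_API_BASE` variable.

```python
import os

DEFAULT_GITHUB_API = "https://api.github.com"


def api_base(env=None):
    """Return the base URL, preferring GITHUB_API_BASE when it is set."""
    env = os.environ if env is None else env
    # Strip a trailing slash so joined paths never contain "//".
    return env.get("GITHUB_API_BASE", DEFAULT_GITHUB_API).rstrip("/")


def pull_request_url(owner, repo, number, env=None):
    """Build the pull-request endpoint URL under the configured base."""
    return f"{api_base(env)}/repos/{owner}/{repo}/pulls/{number}"
```

Because the base is read from the environment at call time, the same client code targets the proxy or the live API without modification.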

Step 2

Switch to MOCK mode

Flip the proxy to MOCK mode. From here on, every matching request gets the recorded response. The live upstream is never touched again until you explicitly switch back.

curl -X POST http://localhost:8000/v1/mode -H 'Content-Type: application/json' -d '{"mode":"mock"}'

# Same request as before — but now served from the recording
your-app fetch-pr --owner octocat --repo Hello-World --number 1
# → byte-equivalent response from data/mocks/github.jsonl
# → no api.github.com call made, no quota consumed, no PAT needed
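
Conceptually, replay is a lookup from a request to the response recorded for it. The sketch below illustrates that idea only. The record layout (`request.method`, `request.path`, `response`) is an assumption made for the example, not Gostly's actual JSONL schema, and real matching is likely to consider more than method and path.

```python
import json


def load_recordings(lines):
    """Index recorded exchanges by (method, path). Later records win."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL input
        rec = json.loads(line)
        key = (rec["request"]["method"].upper(), rec["request"]["path"])
        index[key] = rec["response"]
    return index


def replay(index, method, path):
    """Return the recorded response, or None when nothing matches."""
    return index.get((method.upper(), path))


# One hypothetical record in the assumed layout.
SAMPLE = [
    json.dumps({
        "request": {"method": "GET", "path": "/repos/octocat/Hello-World/pulls/1"},
        "response": {"status": 200, "body": "{\"state\": \"open\"}"},
    }),
]
```

Calling `replay` twice with the same key returns the same stored response, which is the property the rest of this runbook relies on.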

Step 3

Author a Claude Skill that uses the proxy

Define the skill so its backend HTTP base URL is the local Gostly proxy. The skill code is unchanged from how it would target the live upstream — only the base URL differs. The proxy is transparent to the skill itself.

# ~/.claude/skills/pr-status/SKILL.md
---
name: pr-status
description: Look up the status of a GitHub pull request by owner/repo/number.
---

When the user asks about a PR, call
${GITHUB_API_BASE}/repos/{owner}/{repo}/pulls/{number}
and return: state, mergeable status, reviewer list, and a check-run
summary. If the PR is mergeable, suggest the next action.

With GITHUB_API_BASE pointing at the local proxy, the skill calls Gostly. Gostly serves the recorded response. Claude reasons about it. The skill returns its summary. Nothing on GitHub’s side moves — no API quota consumed, no PAT required.

Step 4

Test the skill — repeatedly

Invoke the skill the same way three times. The recorded response comes back identically each time. Diff the responses the proxy served across runs to confirm byte-equivalence. This is the property that makes regression tests possible: when a previously green skill turns red, the cause is something that changed in the skill, not something that changed upstream.

# Run 1
$ claude /pr-status octocat/Hello-World 1
> PR #1: open · mergeable · 2 reviewers · 3/3 checks passing

# Run 2 (five minutes later)
$ claude /pr-status octocat/Hello-World 1
> PR #1: open · mergeable · 2 reviewers · 3/3 checks passing

# Run 3 (in CI tomorrow morning)
$ claude /pr-status octocat/Hello-World 1
> PR #1: open · mergeable · 2 reviewers · 3/3 checks passing
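
One way to automate the comparison is to hash the bytes each run produces and check that only one distinct digest appears. In this sketch `run` is a hypothetical callable that returns the bytes you want to compare (for example, the response body the proxy served). How you obtain those bytes is up to your harness.

```python
import hashlib


def digests(run, times=3):
    """Call run() repeatedly and return the SHA-256 digest of each output."""
    return [hashlib.sha256(run()).hexdigest() for _ in range(times)]


def is_byte_equivalent(run, times=3):
    """True when every run produced exactly the same bytes."""
    return len(set(digests(run, times))) == 1
```

Comparing digests rather than raw bodies keeps failure messages short and works the same way for large responses.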

Step 5

Wire into CI

Commit the recordings to your repo (or to a shared mock library your team draws from). CI brings up the Gostly stack from those recordings — no upstream credentials needed in CI, no live API allowed. Skill regression tests run hermetically.

# .github/workflows/skill-tests.yml (illustrative)
- name: Boot Gostly with committed recordings
  run: |
    docker compose up -d
    curl -X POST http://localhost:8000/v1/mode \
      -H 'Content-Type: application/json' \
      -d '{"mode":"mock"}'

- name: Run skill regression tests
  env:
    GITHUB_API_BASE: http://localhost:8080
    # No GITHUB_TOKEN needed — every call is served from the recording.
  run: pytest tests/skills/

What you’ve gained — the three pillars

Deterministic

Same skill input → same tool response, every run. Recorded responses are replayed byte-equivalent: same body, same status, same headers (minus the redacted auth class). The skill's behavior is reproducible by construction; CI runs are repeatable; failure cases are inspectable.

Sealed

Skills run against a self-hosted proxy on localhost. No live API calls leave the developer's machine during test or CI. No upstream credentials needed in test environments. No accidental writes to production. The proxy strips 16 auth-class header values from every disk recording (by default) and from everything it ships off the box (no opt-out) — the only place those headers stay verbatim is the in-memory active-session buffer, which never leaves the machine.

Audited

Every skill→tool→response triple is captured to a local JSONL log with workload classification. Operators can replay, diff, and reason about every interaction. Suitable as evidence in MRM review, SOC 2 audit, or post-incident analysis.
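
To make the redaction described under "Sealed" concrete, here is a minimal sketch of replacing auth-class header values before a record is written. The header names below are an illustrative subset chosen for the example, not Gostly's actual list of 16, and the placeholder string is likewise an assumption.

```python
REDACTED = "[REDACTED]"

# Illustrative subset only; not Gostly's actual list of auth-class headers.
AUTH_CLASS_HEADERS = {
    "authorization",
    "proxy-authorization",
    "cookie",
    "set-cookie",
    "x-api-key",
}


def redact_headers(headers):
    """Return a copy of headers with auth-class values replaced."""
    return {
        # Header names are case-insensitive, so compare in lowercase.
        name: REDACTED if name.lower() in AUTH_CLASS_HEADERS else value
        for name, value in headers.items()
    }
```

The function returns a new dict and leaves its input untouched, which mirrors the idea that an in-memory copy can stay verbatim while the on-disk copy is scrubbed.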

Failure modes the clean room blocks

Skill burns through your API quota

GitHub's REST API caps authenticated traffic at 5,000 requests per hour, unauthenticated at 60. A chatty skill that polls issues, walks PR comment threads, and lists checks across repos can exhaust that quota before lunch. Mock mode serves every call from the local recording — your live quota is untouched.

Skill mutates production state

A skill that POSTs a comment, merges a PR, or closes an issue will do exactly that against the live API. Replayed mocks return the recorded response without making the real call — your test posts no comments, merges no PRs, closes no issues.

Skill behavior changes between runs

Live APIs return different data hour-to-hour: a PR's mergeable status flips, a check goes from queued to passed, a new comment lands. The same skill input can yield three different summaries on three Tuesdays. Recorded mocks return the exact response captured at recording time — the skill becomes deterministic regardless of upstream drift.

Test environment leaks credentials to a third party

Skills that talk to live GitHub need a Personal Access Token with the right scopes — provisioned, rotated, and exposed in dev or CI envs. Mock mode requires no PAT at all. The proxy is the upstream, and the auth header was already stripped from the recording at capture time by the non-overridable credential-redaction floor.

When you can’t rewrite the base URL — TLS interception

The runbook above points the skill’s base URL at the proxy. That works when your tool lets you set the upstream. Some tools don’t — they hard-pin an HTTPS endpoint you can’t edit. For those, the proxy can terminate TLS itself. Set ENABLE_TLS_INTERCEPTION and the agent runs a CONNECT forward proxy on :8443. Your tool talks HTTPS to its real endpoint; you just route it through the proxy with the standard HTTPS_PROXY env var and trust the embedded CA.

# Trust the agent's CA once (per-OS install step in the docs)
curl http://localhost:8080/ca.crt > gostly-ca.crt

# Route the tool's pinned HTTPS calls through the proxy — no base-URL edit
export HTTPS_PROXY=http://localhost:8443
your-app fetch-pr --owner octocat --repo Hello-World --number 1
# → the proxy mints a per-host leaf cert, terminates TLS, and serves the mock

HTTP, HTTPS, and HTTP/2 are all captured and replayed this way, as is plain ws:// WebSocket traffic. Interception of wss:// (WebSocket-over-TLS) is on the roadmap, not yet shipped.
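
Some Python clients take proxy and trust settings in code instead of from the environment. A rough standard-library equivalent of the shell setup above might look like this. The function name `proxied_opener` is hypothetical, and the proxy address and `gostly-ca.crt` file name are the ones used in this runbook.

```python
import ssl
import urllib.request


def proxied_opener(proxy_url="http://localhost:8443", ca_file=None):
    """Build a urllib opener that sends HTTPS requests through the proxy.

    ca_file is the path to the downloaded CA certificate. When it is given,
    only that CA is trusted; when it is None, the system trust store is used.
    """
    context = ssl.create_default_context(cafile=ca_file)
    return urllib.request.build_opener(
        urllib.request.ProxyHandler({"https": proxy_url}),
        urllib.request.HTTPSHandler(context=context),
    )
```

In use, something like `proxied_opener(ca_file="gostly-ca.crt").open(url)` would issue the request through the proxy's CONNECT tunnel, assuming the stack is running with TLS interception enabled.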

Stateful flows replay coherently

Read-only skills are the easy case. Skills that drive a flow — create a resource, then read it back, then advance it — need the mock library to behave like a state machine, not a flat lookup table. Two mechanisms cover that:

Verbatim session replay

Within a single LEARN-then-MOCK session, the proxy keeps a byte-exact in-memory copy of everything it just saw (bodies and headers, including Set-Cookie and CSRF tokens) and replays it verbatim so a login-then-act flow stays coherent. Body replay stays byte-exact across restarts (it’s served from the on-disk recording); header-verbatim replay is in-session only, since that buffer lives in RAM and clears on restart or the next LEARN window.

Linked mocks (statecharts)

A POST that creates a resource and a later GET that reads it back are linked through a statechart, so POST /charges → GET /charges/{id} returns the created resource instead of a 404. PATCH/POST transitions advance the resource’s status, so CRUD lifecycles stay coherent across a multi-step skill run.
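
The linked-mock idea can be illustrated with a toy state machine. This is a conceptual sketch, not Gostly's implementation: the class name, the `ch_` id format, the status names, and the allowed transitions are all invented for the example.

```python
import itertools


class LinkedMocks:
    """Toy stateful mock: created resources are readable and can change status."""

    # Allowed status transitions; the statuses are made up for illustration.
    TRANSITIONS = {"pending": {"succeeded", "failed"}}

    def __init__(self):
        self._ids = itertools.count(1)
        self._charges = {}

    def post_charge(self, body):
        """POST /charges: create a resource and remember it."""
        charge_id = f"ch_{next(self._ids)}"
        self._charges[charge_id] = {**body, "id": charge_id, "status": "pending"}
        return 201, self._charges[charge_id]

    def get_charge(self, charge_id):
        """GET /charges/{id}: return the created resource, or 404 if unknown."""
        charge = self._charges.get(charge_id)
        return (200, charge) if charge else (404, {"error": "not found"})

    def patch_charge(self, charge_id, status):
        """PATCH /charges/{id}: advance the status if the transition is allowed."""
        charge = self._charges.get(charge_id)
        if charge is None:
            return 404, {"error": "not found"}
        if status not in self.TRANSITIONS.get(charge["status"], set()):
            return 409, {"error": "invalid transition"}
        charge["status"] = status
        return 200, charge
```

A flat lookup table would answer GET /charges/{id} with whatever was recorded for that exact path, or nothing at all. Keeping created resources in state is what lets a create-then-read-then-advance flow stay coherent.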

Spin up your clean room.

The full quickstart takes well under ten minutes — capture, switch, point your skill, run.