Clean-room runbook

Run Claude Skills against deterministic mock services

A step-by-step runbook for wiring Claude Skills against Gostly-served mocks instead of live upstream APIs. Reproducible by construction. Sealed by self-hosted design. Audited by default.

Why this runbook exists

Skills give Claude the ability to do real things — call your APIs, modify your data, move money, send messages. That power is the point. It is also the failure mode: a skill that calls a live upstream during development burns API budget, mutates production state, and produces non-deterministic behavior that can’t be regression-tested.

The fix is structural: pin the skill’s tool layer to a recording. The skill stays expressive. The tool calls become repeatable. The same skill invocation yields the same observed response in dev today, in CI tomorrow, and in front of an auditor next quarter. Gostly is the proxy that captures the recording and serves it back, byte-equivalent, on demand.

What you’ll build

By the end of this runbook you’ll have:

  • A Gostly proxy running locally with a recorded mock library for the GitHub REST API (the running example below — the same pattern works for any HTTP-speaking upstream).
  • A Claude Skill whose backend HTTP base URL points at the local Gostly proxy, not the live upstream.
  • A reproducible test harness for the skill: same input → same recorded response → same observed behavior, every run.
  • A pattern you can extend to every other skill in your workspace.

Prerequisites

  • Docker (to run the Gostly stack locally — the customer image ships as a Compose file).
  • Claude Code or the Claude Agent SDK, with the ability to define custom skills in your local config.
  • An upstream API you want to mock. We’ll use GitHub’s REST API (api.github.com) as the running example. The same pattern applies to any HTTP-speaking upstream: an internal service, Stripe, Twilio, OpenAI, anything.

Step 1

Capture real upstream traffic

Start Gostly in LEARN mode and route a handful of representative API calls through it. Each request/response pair lands in a local JSONL recording, scrubbed of auth-class headers before the bytes ever touch disk.

# Start the Gostly stack in LEARN mode
docker compose up -d
curl -X POST http://localhost:8000/v1/mode -d '{"mode":"learn"}'

# Point your code at the proxy instead of api.github.com
export GITHUB_API_BASE=http://localhost:8080

# Exercise a few representative paths
your-app fetch-pr      --owner octocat --repo Hello-World --number 1
your-app list-issues   --owner octocat --repo Hello-World --state open
your-app fetch-checks  --owner octocat --repo Hello-World --ref main

# Recordings now live in ./data/mocks/github.jsonl

Step 2

Switch to MOCK mode

Flip the proxy to MOCK mode. From here on, every matching request gets the recorded response. The live upstream is never touched again until you explicitly switch back.

curl -X POST http://localhost:8000/v1/mode -d '{"mode":"mock"}'

# Same request as before — but now served from the recording
your-app fetch-pr --owner octocat --repo Hello-World --number 1
# → byte-equivalent response from data/mocks/github.jsonl
# → no api.github.com call made, no quota consumed, no PAT needed

Step 3

Author a Claude Skill that uses the proxy

Define the skill so its backend HTTP base URL is the local Gostly proxy. The skill code is unchanged from how it would target the live upstream — only the base URL differs. The proxy is transparent to the skill itself.

# ~/.claude/skills/pr-status/SKILL.md
---
name: pr-status
description: Look up the status of a GitHub pull request by owner/repo/number.
---

When the user asks about a PR, call
${GITHUB_API_BASE}/repos/{owner}/{repo}/pulls/{number}
and return: state, mergeable status, reviewer list, and a check-run
summary. If the PR is mergeable, suggest the next action.

With GITHUB_API_BASE pointing at the local proxy, the skill calls Gostly. Gostly serves the recorded response. Claude reasons about it. The skill returns its summary. Nothing on GitHub’s side moves — no API quota consumed, no PAT required.
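If the skill delegates its HTTP call to a small helper, the base-URL switch can live in one place. The sketch below is an assumption about how such a helper might look, not part of Gostly or the Claude tooling: `pr_url` and `DEFAULT_BASE` are hypothetical names, and only the `GITHUB_API_BASE` variable and the URL shape come from the runbook.

```python
import os

# Live upstream, used only when no override is set (assumed default).
DEFAULT_BASE = "https://api.github.com"


def pr_url(owner: str, repo: str, number: int, env=None) -> str:
    """Build the pull-request URL the skill's tool layer would call."""
    env = os.environ if env is None else env
    # Trim a trailing slash so the joined path never contains "//".
    base = env.get("GITHUB_API_BASE", DEFAULT_BASE).rstrip("/")
    return f"{base}/repos/{owner}/{repo}/pulls/{number}"


if __name__ == "__main__":
    proxy_env = {"GITHUB_API_BASE": "http://localhost:8080"}
    print(pr_url("octocat", "Hello-World", 1, proxy_env))
    # prints http://localhost:8080/repos/octocat/Hello-World/pulls/1
```

Passing the environment in as a parameter keeps the helper testable without mutating the real process environment.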

Step 4

Test the skill — repeatedly

Invoke the skill the same way three times. The recorded response comes back identically each time. Diff the outputs to confirm byte-equivalence. This is the property that makes regression tests possible: when a previously green skill test turns red, the cause is a change to the skill, not a change in the upstream.
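One way to automate that diff is to hash whatever you capture from each run and compare digests. This is a minimal sketch under the assumption that you have already saved each run's output as bytes; `digest` and `byte_equivalent` are hypothetical helper names.

```python
import hashlib


def digest(payload: bytes) -> str:
    """SHA-256 hex digest of one captured output."""
    return hashlib.sha256(payload).hexdigest()


def byte_equivalent(runs: list[bytes]) -> bool:
    """True when every captured run hashes to the same digest."""
    return len({digest(run) for run in runs}) <= 1


if __name__ == "__main__":
    captured = [b'{"state": "open"}'] * 3  # stand-in for three saved runs
    print(byte_equivalent(captured))
```

Comparing digests rather than raw payloads keeps failure messages short and lets you store the expected hash next to the test.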

# Run 1
$ claude /pr-status octocat/Hello-World 1
> PR #1: open · mergeable · 2 reviewers · 3/3 checks passing

# Run 2 (five minutes later)
$ claude /pr-status octocat/Hello-World 1
> PR #1: open · mergeable · 2 reviewers · 3/3 checks passing

# Run 3 (in CI tomorrow morning)
$ claude /pr-status octocat/Hello-World 1
> PR #1: open · mergeable · 2 reviewers · 3/3 checks passing

Step 5

Wire into CI

Commit the recordings to your repo (or to a shared mock library your team draws from). CI brings up the Gostly stack from those recordings — no upstream credentials needed in CI, no live API allowed. Skill regression tests run hermetically.

# .github/workflows/skill-tests.yml (illustrative)
- name: Boot Gostly with committed recordings
  run: |
    docker compose up -d
    curl -X POST http://localhost:8000/v1/mode \
      -d '{"mode":"mock"}'

- name: Run skill regression tests
  env:
    GITHUB_API_BASE: http://localhost:8080
    # No GITHUB_TOKEN needed — every call is served from the recording.
  run: pytest tests/skills/
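A regression suite can also refuse to start when the environment is not hermetic. The guard below is a sketch, assuming the suite reads `GITHUB_API_BASE` and `GITHUB_TOKEN` from the environment as in the workflow above; `check_hermetic` and `LOCAL_HOSTS` are hypothetical names.

```python
from urllib.parse import urlparse

# Hosts treated as "the local proxy" (an assumption for this sketch).
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def check_hermetic(env: dict) -> None:
    """Raise if the tests could reach the live upstream or carry a token."""
    base = env.get("GITHUB_API_BASE", "")
    host = urlparse(base).hostname  # None when the variable is unset or empty
    if host not in LOCAL_HOSTS:
        raise RuntimeError(
            f"GITHUB_API_BASE must point at the local proxy, got {base!r}"
        )
    if env.get("GITHUB_TOKEN"):
        raise RuntimeError("GITHUB_TOKEN is set; mock-mode tests should not need it")


if __name__ == "__main__":
    check_hermetic({"GITHUB_API_BASE": "http://localhost:8080"})
    print("environment is hermetic")
```

Calling it with `dict(os.environ)` from a session-scoped pytest fixture would fail the whole run early instead of letting a single test hit the live API.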

What you’ve gained — the three pillars

Deterministic

Same skill input → same tool response, every run. Recorded responses are replayed byte-equivalent: same body, same status, same headers (minus the redacted auth class). The skill's behavior is reproducible by construction; CI runs are repeatable; failure cases are inspectable.
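Conceptually, replay is a lookup from request to recorded response. The sketch below illustrates that idea only; it is not Gostly's matching logic, and the record fields (`method`, `path`, `status`, `body`) are an assumed shape for the JSONL lines, not the documented format.

```python
import json


def load_index(lines):
    """Index JSONL records by (method, path). The record shape is assumed."""
    index = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the recording
        record = json.loads(line)
        index[(record["method"], record["path"])] = record
    return index


def replay(index, method, path):
    """Return the recorded (status, body), or fail loudly on a miss."""
    record = index.get((method.upper(), path))
    if record is None:
        raise LookupError(f"no recording for {method} {path}")
    return record["status"], record["body"]


if __name__ == "__main__":
    recording = [
        json.dumps({"method": "GET", "path": "/repos/octocat/Hello-World/pulls/1",
                    "status": 200, "body": '{"state": "open"}'}),
    ]
    index = load_index(recording)
    first = replay(index, "get", "/repos/octocat/Hello-World/pulls/1")
    second = replay(index, "GET", "/repos/octocat/Hello-World/pulls/1")
    print(first == second)
```

Failing on a miss, rather than falling through to the upstream, is what keeps the lookup deterministic. A real matcher would likely also consider query strings and request bodies.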

Sealed

Skills run against a self-hosted proxy on localhost. No live API calls leave the developer's machine during test or CI. No upstream credentials needed in test environments. No accidental writes to production. The proxy strips 16 auth-class headers before any I/O — non-overridable.
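The scrub-before-write idea can be sketched in a few lines. The header set below is an illustrative subset chosen for this example, not the 16 headers Gostly strips, and `scrub_headers` and `to_jsonl_line` are hypothetical names rather than Gostly APIs.

```python
import json

# Illustrative subset only; the runbook does not list the actual 16 headers.
AUTH_CLASS_HEADERS = frozenset(
    {"authorization", "proxy-authorization", "cookie", "set-cookie", "x-api-key"}
)


def scrub_headers(headers: dict) -> dict:
    """Drop auth-class headers, matching names case-insensitively."""
    return {k: v for k, v in headers.items() if k.lower() not in AUTH_CLASS_HEADERS}


def to_jsonl_line(method, path, status, headers, body) -> str:
    """Serialize one record; headers are scrubbed before serialization."""
    record = {
        "method": method,
        "path": path,
        "status": status,
        "headers": scrub_headers(headers),
        "body": body,
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    line = to_jsonl_line(
        "GET", "/repos/octocat/Hello-World/pulls/1", 200,
        {"Authorization": "Bearer example-token", "Accept": "application/json"},
        "{}",
    )
    print("example-token" in line)  # prints False
```

Scrubbing inside the serializer means there is no code path that writes an unscrubbed record.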

Audited

Every skill→tool→response triple is captured to a local JSONL log with workload classification. Operators can replay, diff, and reason about every interaction. Suitable as evidence in model risk management (MRM) review, SOC 2 audit, or post-incident analysis.
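Diffing two logged interactions needs nothing beyond the standard library. This is a sketch under the assumption that each log line parses to a JSON object; the field names are hypothetical and `diff_records` is not a Gostly command.

```python
import difflib
import json


def diff_records(recorded: dict, observed: dict) -> list:
    """Unified diff of two records, pretty-printed with stable key order."""
    a = json.dumps(recorded, indent=2, sort_keys=True).splitlines()
    b = json.dumps(observed, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(a, b, fromfile="recorded", tofile="observed", lineterm="")
    )


if __name__ == "__main__":
    old = {"path": "/repos/octocat/Hello-World/pulls/1", "status": 200}
    new = {"path": "/repos/octocat/Hello-World/pulls/1", "status": 404}
    print("\n".join(diff_records(old, new)))
```

Sorting keys before diffing keeps the output stable even when two records serialize their fields in a different order.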

Failure modes the clean room blocks

Skill burns through your API quota

GitHub's REST API caps authenticated traffic at 5,000 requests per hour, unauthenticated at 60. A chatty skill that polls issues, walks PR comment threads, and lists checks across repos can exhaust that quota before lunch. Mock mode serves every call from the local recording — your live quota is untouched.
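To see how quickly a polling skill reaches the authenticated limit, a back-of-the-envelope calculation helps. The workload numbers below (30 repos, 3 calls per poll, one poll per minute) are hypothetical; only the 5,000 requests per hour figure comes from the text above.

```python
AUTHENTICATED_LIMIT = 5000  # requests per hour, as stated above


def requests_per_hour(repos: int, calls_per_poll: int, poll_interval_s: int) -> float:
    """Request rate of a skill that polls every repo at a fixed interval."""
    return repos * calls_per_poll * (3600 / poll_interval_s)


def minutes_to_exhaust(rate_per_hour: float, limit: int = AUTHENTICATED_LIMIT) -> float:
    """Minutes until the hourly quota is spent at the given rate."""
    return 60 * limit / rate_per_hour


if __name__ == "__main__":
    rate = requests_per_hour(repos=30, calls_per_poll=3, poll_interval_s=60)
    print(rate)                                 # prints 5400.0
    print(round(minutes_to_exhaust(rate), 1))   # prints 55.6
```

At that rate the quota is gone in under an hour, before the window resets.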

Skill mutates production state

A skill that POSTs a comment, merges a PR, or closes an issue will do exactly that against the live API. Replayed mocks return the recorded response without making the real call — your test posts no comments, merges no PRs, closes no issues.

Skill behavior changes between runs

Live APIs return different data hour-to-hour: a PR's mergeable status flips, a check goes from queued to passed, a new comment lands. The same skill input can yield three different summaries on three Tuesdays. Recorded mocks return the exact response captured at recording time — the skill becomes deterministic regardless of upstream drift.

Test environment leaks credentials to a third party

Skills that talk to live GitHub need a Personal Access Token with the right scopes — provisioned, rotated, and exposed in dev or CI envs. Mock mode requires no PAT at all. The proxy is the upstream, and the auth header was already stripped from the recording at capture time per the REDACT_FLOOR invariant.

Spin up your clean room.

The full quickstart takes well under ten minutes — capture, switch, point your skill, run.