Install & Deploy
The licensed Gostly product ships as a Docker Compose stack backed by container images you pull from a registry; there is no host binary to install. This page covers system sizing, the .env you fill in, the ports each service exposes, the memory split between the inference engine and the proxy, and the low-memory path for machines that can't spare the RAM for AI generation.
What you're installing
The licensed product is delivered as a docker-compose.yml plus a set of pre-built images. You authenticate to the image registry with a short-lived token from your dashboard, pull the images, fill in an .env, and run docker compose up. A Pro/Team stack is five services on a shared ./data bind-mount:
| Service | Ports | What it is |
|---|---|---|
| ghost-proxy | 8080 · 8443 | The Rust proxy. The only service your application talks to. Runs the match cascade and the four operating modes. Plain HTTP on 8080; the TLS-MITM listener binds on 8443 when enabled. |
| ghost-web | 3000 · 8000 | Control plane API (8000) plus the operator dashboard (3000), one container. Owns services, mocks, transitions, drift, and the SSO/RBAC auth layer. |
| ghost-postgres | internal | PostgreSQL 16. The persistent store for the mock library, services, and audit log. Not published to the host. |
| ghost-inference | internal | The AI inference engine (Pro/Team). Runs RAG retrieval and, when generation is enabled, serves cached LoRA-adapter responses. This is the memory-hungry container. |
| ghost-llamacpp | isolated | Optional generation sidecar (compose profile llamacpp). Runs on a network with no internet egress, so the model container is structurally air-gapped. |
The Free tier ships ghost-proxy only — no inference, no dashboard. The proxy serves recorded mocks standalone and never needs the rest of the stack to replay.
Compose + registry images, not a CLI
The licensed product is distributed exclusively as Docker Compose plus registry images. There is no gostly command to install on your host. If you came here looking for a host CLI with brew / package-manager install, that's the open-source proxy — a separate product with its own binary and its own command line. This page is about the licensed self-hosted stack.
Prerequisites
- → Docker and Docker Compose installed (Docker Desktop, or Docker Engine + the Compose plugin on Linux)
- → A Gostly license key — get one free. The key gates which tier capabilities the stack unlocks.
- → A registry login token (issued from your dashboard, valid 12 hours) to pull the images
- → Any HTTP service you want to mock — local, staging, or third-party. Production access is not required.
System sizing
The memory footprint is dominated by one container. The proxy, dashboard, and database are light; the AI inference engine is what sets your floor. Plan for two numbers:
| Component | Memory | Notes |
|---|---|---|
| ghost-inference | ~6 GB RAM | The AI inference container. 8 GB free RAM recommended. This is the deciding number: if you don't have it, see the low-memory path below. |
| proxy + dashboard + db | ~1 GB RAM | ghost-proxy, ghost-web, and ghost-postgres combined. Light enough to run on any dev laptop. |
The compose file sets the inference container's memory limit via INFERENCE_MEM_LIMIT and pins its CPU with INFERENCE_CPU_LIMIT. The PyTorch BLAS thread count is INFERENCE_NUM_THREADS — set it relative to your core count (the shipped .env.example carries per-machine recommendations). Keep INFERENCE_CPU_LIMIT ≥ INFERENCE_NUM_THREADS or the thread pool starves.
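The thread-budget rule above is easy to check before starting the stack. The following is a minimal sketch, not part of the shipped tooling: the helper names `env_value` and `check_thread_budget` are hypothetical, and it assumes a plain KEY=value .env with optional inline `#` comments.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of KEY from an env file,
# dropping any inline comment and surrounding whitespace.
env_value() {
  local key="$1" file="$2"
  # If the key appears more than once, the last assignment is used.
  sed -n "s/^${key}=//p" "$file" | tail -n 1 \
    | sed 's/[[:space:]]*#.*$//' | tr -d '[:space:]'
}

# Hypothetical helper: exit 0 when INFERENCE_CPU_LIMIT >= INFERENCE_NUM_THREADS,
# 1 when the thread pool would starve, 2 when either value is missing.
check_thread_budget() {
  local file="${1:-.env}" cpu threads
  cpu="$(env_value INFERENCE_CPU_LIMIT "$file")"
  threads="$(env_value INFERENCE_NUM_THREADS "$file")"
  if [ -z "$cpu" ] || [ -z "$threads" ]; then
    echo "INFERENCE_CPU_LIMIT or INFERENCE_NUM_THREADS missing in $file" >&2
    return 2
  fi
  # awk compares numerically, so fractional CPU limits such as 3.5 work.
  awk -v c="$cpu" -v t="$threads" 'BEGIN { exit !(c + 0 >= t + 0) }'
}
```

Running `check_thread_budget .env && docker compose up -d` would refuse to start the stack when the limit is below the thread count.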
Docker Desktop's default allocation will OOMKill inference
Docker Desktop ships with a 4 GB memory allocation by default. The inference container needs more than that — under the default it will be OOMKilled on startup. Raise Docker Desktop's memory to 8 GB (Settings → Resources) before running docker compose up, or take the low-memory path below.
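To confirm how much memory the Docker daemon actually sees, `docker info` reports it in bytes as `MemTotal`. The sketch below wraps that in two hypothetical helpers. The 7.5 GiB default threshold is an assumption, chosen because the reported total usually reads slightly below the configured allocation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when a byte count is at least N GiB.
has_enough_memory() {
  local bytes="$1" required_gib="${2:-7.5}"
  awk -v b="$bytes" -v g="$required_gib" \
    'BEGIN { exit !(b + 0 >= g * 1024 * 1024 * 1024) }'
}

# Hypothetical helper: check the memory visible to the Docker daemon.
docker_memory_check() {
  local bytes
  bytes="$(docker info --format '{{.MemTotal}}')" || return 2
  if has_enough_memory "$bytes"; then
    echo "Docker memory looks sufficient for ghost-inference"
  else
    echo "Docker sees only $bytes bytes: raise the allocation or set ENABLE_GENERATION=false" >&2
    return 1
  fi
}
```

Call `docker_memory_check` once before the first `docker compose up`.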
Low-memory path — run without AI generation
If you have less than 8 GB free, run the stack without the generative model loaded. The match cascade stays fully functional — exact match, session-verbatim replay, the statechart / resource engine, and smart swap all run with no model in memory. You only lose the last-resort AI-generation fallback for requests that match nothing recorded.
    # In .env: skip loading the generative model
    ENABLE_GENERATION=false
With generation off, the inference container does not load the base model, and the whole stack runs in roughly 1 GB total. Retrieval (ENABLE_RAG) is a separate, lighter knob and is on by default; it loads only the sentence-encoder, not the generative model. On a truly memory-constrained machine you can set ENABLE_RAG=false as well.
Where generation actually runs
By default, generation routes through the optional ghost-llamacpp sidecar (enabled via the llamacpp compose profile), so ENABLE_GENERATION controls whether the inference server also loads the in-process PyTorch base as a fallback. Either way, the model lives inside your stack — and no LLM sits in the request hot path. Generation runs on a background worker behind a bounded queue; served responses come from cache. That's an architectural invariant, not a tuning default.
Ports
Four ports are published to the host on a Pro/Team stack. Postgres, inference, and the generation sidecar stay on internal Docker networks and are never exposed.
| Port | Service | What it is |
|---|---|---|
| 8080 | ghost-proxy | Plain-HTTP proxy. Point your application here instead of its upstream URL. This is the only port your app needs. |
| 8443 | ghost-proxy | TLS-MITM listener. Published unconditionally, but only binds when ENABLE_TLS_INTERCEPTION is set; until then the port is unused. Override the host mapping with TLS_PORT. |
| 3000 | ghost-web | Operator dashboard. Watch traffic, manage services, switch modes. |
| 8000 | ghost-web | Control plane API. Browser calls and curl scripts hit it here (the /v1/* endpoints). |
For HTTPS, fetch and trust the proxy's MITM CA once. The proxy serves the cert at GET /ca.crt on port 8080; the endpoint returns 503 while interception is off, so enable ENABLE_TLS_INTERCEPTION first:
    # -f makes curl exit non-zero on a 503 instead of saving the error body
    curl -f http://localhost:8080/ca.crt -o gostly-ca.crt
    # then add gostly-ca.crt to your client / OS trust store
See Proxy Setup → TLS for the per-OS trust-install steps.
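In a setup script it can help to tell "interception is still off" apart from other failures when fetching the CA. The following is a sketch under that assumption. Both function names are hypothetical, and only the /ca.crt path, port 8080, and the 503 behaviour come from this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the HTTP status of GET /ca.crt to an outcome.
ca_fetch_hint() {
  case "$1" in
    200) echo "saved" ;;
    503) echo "interception-off" ;;
    *)   echo "unexpected" ;;
  esac
}

# Hypothetical helper: download the CA to a temp file and keep it
# only when the proxy answered 200.
fetch_gostly_ca() {
  local url="${1:-http://localhost:8080/ca.crt}" out="${2:-gostly-ca.crt}"
  local tmp status
  tmp="$(mktemp)" || return 2
  status="$(curl -sS -o "$tmp" -w '%{http_code}' "$url")" || { rm -f "$tmp"; return 2; }
  case "$(ca_fetch_hint "$status")" in
    saved)
      mv "$tmp" "$out"
      echo "wrote $out"
      ;;
    interception-off)
      rm -f "$tmp"
      echo "got 503: enable ENABLE_TLS_INTERCEPTION, then retry" >&2
      return 1
      ;;
    *)
      rm -f "$tmp"
      echo "unexpected HTTP $status from $url" >&2
      return 1
      ;;
  esac
}
```

Writing to a temporary file first means a failed fetch never leaves a half-written or error-body gostly-ca.crt behind.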
.env setup
All configuration is environment variables — there are no runtime config files. Copy the example file and fill in the four things that have no safe default:
cp .env.example .env
| Variable | What to set |
|---|---|
| GOSTLY_LICENSE_KEY | The key from your dashboard. Gates tier-specific capabilities. Same key on the proxy and the control plane. |
| BACKEND_URL | The upstream the proxy forwards to in LEARN mode and falls back to in PASSTHROUGH. Any reachable HTTP endpoint. |
| POSTGRES_PASSWORD | Required: compose refuses to start without it. The Postgres superuser password for the bundled database. |
| GHOST_API_KEY | Sets the control plane fail-closed: once set, every call on port 8000 needs the X-API-Key header. Leave unset only for a throwaway local trial. |
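A quick pre-flight check can list which of these four variables are still empty. This is a minimal sketch, assuming a plain KEY=value .env. The function name `missing_required_vars` is hypothetical. Because GHOST_API_KEY may be deliberately left unset for a throwaway trial, treat its appearance in the output as a warning rather than an error.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each required variable that has no
# non-empty assignment in the given env file (default: .env).
missing_required_vars() {
  local file="${1:-.env}" key
  for key in GOSTLY_LICENSE_KEY BACKEND_URL POSTGRES_PASSWORD GHOST_API_KEY; do
    # Commented-out lines and bare "KEY=" lines do not count as set.
    grep -Eq "^${key}=.+" "$file" || echo "$key"
  done
}
```

An empty result from `missing_required_vars .env` means all four are filled in.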
The sizing and feature knobs you'll most often touch at install time:
    # Sizing: see System sizing above
    INFERENCE_MEM_LIMIT=10G      # inference container memory limit
    INFERENCE_CPU_LIMIT=4        # must be >= INFERENCE_NUM_THREADS
    INFERENCE_NUM_THREADS=4      # PyTorch BLAS threads, tune to core count

    # AI behaviour
    ENABLE_GENERATION=false      # set true to load the generative model (needs the RAM)
    ENABLE_RAG=true              # retrieval (light; on by default)

    # Tenant identity (single-tenant deployments typically pin this)
    GOSTLY_TENANT_ID=default

    # Optional: enable TLS interception on :8443 (off by default)
    ENABLE_TLS_INTERCEPTION=false
Single-tenant per deployment
Each self-hosted stack is a single tenant — GOSTLY_TENANT_ID defaults to default. Isolation comes from the deployment boundary itself: one stack, one tenant, one customer's data on that customer's volume. Per-tenant row-level-security policies are defined in the schema as defense-in-depth, but in the shipped single-tenant configuration the isolation is the deployment boundary, not engine-enforced row filtering. See Configuration for the full env-var reference.
On secrets and the data directory
Keep .env out of version control. The shared ./data bind-mount holds recorded traffic, the mock library, and license_cache.json (your license JWT), so review its contents before committing any of it. The Quick Start has safe .gitignore defaults. Credential headers are stripped to a non-overridable 16-header floor before anything is written to disk; PII in bodies is kept verbatim only in the local replay library and scrubbed out of the Postgres store and any export.
Pulling the images
The licensed images live in a private registry. From your dashboard, click Get registry token under Quick Start to generate a login command. The token is valid for 12 hours; refresh it any time.
    # The dashboard generates this with your registry host filled in
    docker login -u AWS -p <your-registry-token> <your-registry-host>

    # Then pull and start the stack
    docker compose pull
    docker compose up -d
The dashboard also generates a docker-compose.yml pre-configured for your plan — download it or copy it directly. The Free tier compose references ghost-proxy only; Pro/Team include the inference engine, dashboard, and database.
Pin a release tag, not :latest
The generated compose references :latest. For anything beyond a quick trial, pin a specific release tag (e.g. :v1.4.2) so a registry push can't change your stack out from under a pinned mock library or a CI run.
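Pinning can be scripted rather than edited by hand. The sketch below assumes the generated compose file declares images on `image:` lines ending in `:latest`, which is an assumption about its layout. The function name `pin_image_tag` is hypothetical, and `v1.4.2` is only the example tag from above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite a trailing :latest on image: lines to a fixed tag.
pin_image_tag() {
  local tag="$1" file="${2:-docker-compose.yml}" tmp
  tmp="$(mktemp)" || return 1
  # Touch only lines that declare an image, and only a trailing :latest
  # (optionally followed by a closing quote).
  sed -E "/^[[:space:]]*image:/ s/:latest([\"']?[[:space:]]*)\$/:${tag}\1/" \
    "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Write back through cat so the original file keeps its permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

After `pin_image_tag v1.4.2`, review the result with `grep image: docker-compose.yml` before pulling.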
First boot
On docker compose up -d the proxy starts in LEARN mode (override with INITIAL_MODE). Point your application at http://localhost:8080 and every request is forwarded to your upstream and recorded — no code changes needed. Open the dashboard at http://localhost:3000 to watch traffic arrive and switch modes.
To skip pointing at a real upstream and start with a seed instead, drag a HAR, Postman collection, or OpenAPI spec onto the dashboard — it posts to the cold-start importer:
POST /v1/seed/har # also: /v1/seed/postman, /v1/seed/openapi
The inference container's first /generate call after startup may briefly return 503 while the model loads (only relevant when generation is enabled). From there, drive the LEARN → MOCK pipeline as described in Quick Start.
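Scripts that start the stack and send traffic straight away may want to wait for readiness rather than handle early failures. The following is a generic retry sketch. The function names are hypothetical, and it assumes the proxy's /health endpoint is reachable on the published port 8080, which this page does not state explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Hypothetical helper: poll the proxy health endpoint for up to about a minute.
# Assumption: /health is served on the same port as the plain-HTTP proxy.
wait_for_proxy() {
  retry 30 2 curl -fsS -o /dev/null "${1:-http://localhost:8080/health}"
}
```

A CI job could then run `docker compose up -d && wait_for_proxy` before starting its test suite.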
What you can record & replay
Gostly records and replays HTTP and HTTPS (HTTP/1.1 and HTTP/2 over TLS). WebSocket frames are captured for observability only — they are not replayed. There is no gRPC, async-messaging, or database mocking today (roadmap). Webhook traffic is captured automatically; replay is operator-triggered through the control-plane API, not auto-fired by the proxy.
Metrics & health
Each service exposes a /health endpoint used by the compose healthchecks. The proxy also exports Prometheus metrics at /metrics for scraping into your own monitoring:
| Metric | What it measures |
|---|---|
| ghost_requests_total{match_type} | Request counter, one increment per match-path outcome (exact, smart_swap, session_verbatim, generated_cached, miss, …). |
| ghost_mock_library_size | Gauge of the current mock-library size. |
| ghost_io_errors_total{operation} | Disk-sink open/write failures, labelled by operation. |
| axum_http_requests_total / _duration | HTTP rate and latency for the proxy's own endpoints. |
| gostly_tls_* | The TLS-MITM subsystem family: ALPN negotiation, cert-cache hit/miss/eviction, and listener state. |
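Without a Prometheus server at hand, the match-path counters can be summarised directly from the text exposition. This is a minimal sketch. The function name is hypothetical, it assumes each sample line ends with its value (no trailing timestamp), and the port used in the usage note is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum ghost_requests_total per match_type
# from Prometheus text format read on stdin.
requests_by_match_type() {
  awk '
    /^ghost_requests_total\{/ {
      if (match($0, /match_type="[^"]*"/)) {
        # Skip the 12 characters of match_type=" and drop the closing quote.
        label = substr($0, RSTART + 12, RLENGTH - 13)
        totals[label] += $NF
      }
    }
    END { for (label in totals) printf "%s %d\n", label, totals[label] }
  ' | sort
}
```

Assuming /metrics is served on the proxy's port 8080, `curl -s http://localhost:8080/metrics | requests_by_match_type` prints one line per match type, which makes a rising `miss` count easy to spot.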
Next steps
Quick Start →
Record traffic and serve mocks end-to-end in a coffee break.
Configuration Reference →
Every environment variable, the redaction floor, scrub rules, and feature flags by tier.
How It Works →
The four modes and the match cascade — why AI is the last resort, never the hot path.
Proxy Setup →
Multiple upstreams, TLS interception, CI integration, and chaos injection.