Chaos & Fault Injection
Once a service is serving from its mock library, chaos injection lets you fault it on purpose — add latency, return error codes, or simulate an intermittent outage — so you can prove your app handles a degraded dependency before production does it for you. Chaos is operator-configured: you set the fault profile (or pick a preset). It is wrapped around the deterministic match cascade, so a faulted request still resolves the same recorded response underneath — chaos only decides whether to delay it or replace it with an injected error.
What chaos injection is — and is not
Chaos is a per-service config attached to a mocked upstream. When enabled, the proxy evaluates it in MOCK and PASSTHROUGH modes only — never in LEARN or TRANSITIONING, so you never corrupt a recording or interfere with the transition interstitial. On every request the proxy decides, per the configured model, whether to delay the response, replace it with a weighted error, or both. An injected response carries the header X-Ghost-Chaos: true so callers and tests can tell a chaos fault apart from a real upstream failure.
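A test can use that header to separate injected faults from genuine upstream failures. The helper below is a minimal sketch: the header name comes from the text above, while the function name and the idea of passing response headers as a plain mapping are illustrative assumptions.

```python
from typing import Mapping

# Header name taken from the description above.
CHAOS_HEADER = "X-Ghost-Chaos"


def is_chaos_fault(headers: Mapping[str, str]) -> bool:
    """Return True when the response headers mark an injected chaos fault.

    Hypothetical helper: it only inspects the header described above.
    """
    # HTTP header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == CHAOS_HEADER.lower():
            return value.strip().lower() == "true"
    return False
```

In a retry test, for example, a 503 for which `is_chaos_fault` returns True can be asserted as an expected fault, while a 503 without the header points at a real upstream problem.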
Configured, not auto-learned
Two injection models
The chaos_model field on a service's chaos config selects how faults are drawn. The default is the per-request model; the Markov model is opt-in.
uniform (default): Independent per-request draw. Each request rolls against your error_rate and a latency draw, so every request is an i.i.d. coin flip with no memory of the last one. Use it for steady-state degradation ("this dependency is 40% flaky and adds 1–3s").
markov (opt-in): A two-state machine (healthy ↔ degraded) with exponentially distributed dwell times. While healthy, it injects zero errors and normal latency; while degraded, errors fire at degraded_error_rate and latency is multiplied by degraded_latency_mult. This produces bursty outages (clean for ~30s, then a ~5s spike of failures) rather than uniformly sprinkled errors, which is closer to how a real upstream fails.
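The two-state behaviour can be pictured with a small simulation. This is a sketch under stated assumptions, not the proxy's implementation: the four knob names and default values are the ones this page documents, while the class names, the `decide` method, and the explicit clock argument are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class MarkovConfig:
    # Knob names and defaults as documented on this page.
    mean_dwell_healthy_ms: float = 30_000.0
    mean_dwell_degraded_ms: float = 5_000.0
    degraded_error_rate: float = 0.6
    degraded_latency_mult: float = 5.0


class MarkovChaos:
    """Hypothetical healthy/degraded generator with exponential dwell times."""

    def __init__(self, cfg: MarkovConfig, rng: random.Random, now_ms: float = 0.0):
        self.cfg = cfg
        self.rng = rng
        self.degraded = False  # always starts healthy
        self.next_flip_ms = now_ms + self._draw_dwell()

    def _draw_dwell(self) -> float:
        # Exponential dwell whose mean depends on the current state.
        if self.degraded:
            mean = self.cfg.mean_dwell_degraded_ms
        else:
            mean = self.cfg.mean_dwell_healthy_ms
        return self.rng.expovariate(1.0 / mean)

    def decide(self, now_ms: float, base_latency_ms: float):
        """Return (inject_error, latency_ms) for a request arriving at now_ms."""
        # Catch up on every state flip that happened since the last request.
        while now_ms >= self.next_flip_ms:
            self.degraded = not self.degraded
            self.next_flip_ms += self._draw_dwell()
        if not self.degraded:
            return False, base_latency_ms
        inject_error = self.rng.random() < self.cfg.degraded_error_rate
        return inject_error, base_latency_ms * self.cfg.degraded_latency_mult
```

Feeding `decide` a steady stream of timestamps yields long clean stretches broken by short clusters of errors and multiplied latency, which is the bursty shape described above.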
The Markov state is synthetic, not fitted from upstream traffic
The markov model is a generator, not a learner. Its state transitions are driven by an RNG and four operator-set tuning knobs: mean_dwell_healthy_ms, mean_dwell_degraded_ms, degraded_error_rate, and degraded_latency_mult. It does not observe your real upstream's failure pattern and replay it. The defaults match the realistic-outage preset (~30s healthy, ~5s degraded, 60% errors, 5× latency). State is held in memory per service and resets to healthy on agent restart by design: chaos is meant to be observed within a session.
The chaos config
Chaos config lives on the service record and is set via the dashboard chaos workbench or by PATCHing the service. A minimal uniform config:
{
  "chaos_config": {
    "enabled": true,
    "chaos_model": "uniform",
    "error_rate": 0.4,
    "error_codes": [
      { "status": 503, "weight": 0.7 },
      { "status": 500, "weight": 0.3 }
    ],
    "latency_jitter": { "min_ms": 200, "max_ms": 1500, "distribution": "uniform" },
    "modes": ["MOCK"]
  }
}
error_rate: Probability (0.0–1.0) that a request is replaced with an error. In the markov model this is ignored while healthy and replaced by degraded_error_rate while degraded.
error_codes: Weighted list of status codes to draw from. Each entry takes a status, a weight, and an optional body and headers (e.g. a Retry-After on a 429). When non-empty, this overrides the legacy single error_status / error_body fields.
latency_jitter: A min/max latency band with a uniform or normal distribution. The injected sleep is drawn from this band; under the markov model the draw is multiplied by degraded_latency_mult while degraded. Overrides the legacy flat latency_ms.
modes: Which proxy modes chaos fires in. Empty = both MOCK and PASSTHROUGH. Set to ["MOCK"] to fault only mocked responses.
Per-endpoint rules: Overrides matched by URI glob (* = one path segment, ** = any depth). The first matching rule wins; otherwise the service-level config applies. Each rule carries its own full chaos config.
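The glob semantics and first-match-wins ordering of per-endpoint rules can be sketched as follows. This is an illustration, not the proxy's matcher: the rule keys `pattern` and `config` and both function names are hypothetical, and it assumes `**` matches any remaining characters including `/`.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a URI glob into a compiled regular expression."""
    pieces = []
    # Split on the wildcards but keep them, trying "**" before "*".
    for part in re.split(r"(\*\*|\*)", pattern):
        if part == "**":
            pieces.append(".*")      # assumption: any depth, slashes included
        elif part == "*":
            pieces.append("[^/]+")   # exactly one non-empty path segment
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


def pick_config(path: str, rules: list, service_config):
    """Return the config of the first rule whose glob matches the whole path."""
    for rule in rules:
        if glob_to_regex(rule["pattern"]).fullmatch(path):
            return rule["config"]
    # No rule matched, so the service-level config applies.
    return service_config
```

Because the first match wins, narrower patterns such as `/users/*/orders` need to be listed before broader ones such as `/users/**`.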
The legacy flat fields (latency_ms, error_status, error_body) still deserialize for backward compatibility — the newer structured fields take precedence whenever they are set.
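Putting the fields together, one per-request draw under the uniform model might look like the sketch below. The field names are the ones documented above; the function names, the dict-based config, and the order of evaluation are assumptions. Only the uniform latency distribution is shown, because how a normal distribution maps onto the min/max band is not specified here.

```python
import random


def effective_error_codes(cfg: dict) -> list:
    """Structured error_codes take precedence over the legacy single status."""
    if cfg.get("error_codes"):
        return cfg["error_codes"]
    if "error_status" in cfg:
        return [{"status": cfg["error_status"], "weight": 1.0,
                 "body": cfg.get("error_body")}]
    return []


def draw_latency_ms(cfg: dict, rng: random.Random) -> float:
    """latency_jitter takes precedence over the legacy flat latency_ms."""
    jitter = cfg.get("latency_jitter")
    if jitter:
        return rng.uniform(jitter["min_ms"], jitter["max_ms"])
    return float(cfg.get("latency_ms", 0))


def decide(cfg: dict, rng: random.Random):
    """Return (injected_status_or_None, latency_ms) for a single request."""
    latency = draw_latency_ms(cfg, rng)
    codes = effective_error_codes(cfg)
    if codes and rng.random() < cfg.get("error_rate", 0.0):
        weights = [entry["weight"] for entry in codes]
        chosen = rng.choices(codes, weights=weights, k=1)[0]
        return chosen["status"], latency
    return None, latency
```

Each call is independent of the previous one, which is what separates this model from the markov generator.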
Built-in presets
The control plane ships a fixed set of named presets so common failure shapes are one click in the workbench. Each preset is a static config — picking one overwrites the latency and error settings. They are served from GET /chaos/presets.
rate-limited: 60% of requests return 429 with a Retry-After: 30 header. No added latency. Models a quota you've blown through.
cascading-failure: 70% errors drawn across 500 / 502 / 503 with a 200–800ms latency band. Models a backend buckling under load.
flaky-network: 40% of requests time out as 504, with a heavy 1–3s latency band. Models an unreliable network path.
degraded: No errors, just a steady 300–800ms latency tax. Models a slow-but-up dependency.
realistic-outage: The markov preset: mostly clean (~30s healthy dwell) punctuated by short ~5s bursts of 503s with 5× latency. Bursty rather than uniform, it is the closest fixed preset to a real intermittent outage.
The preset field on a config is a display-only label recording which preset (if any) populated it — editing any latency or error field in the workbench clears it back to custom.
Calibrating an outage profile (optional, best-effort)
If you want a Markov profile shaped like the outages you've already been injecting, the API can fit one for you. This is the one place where chaos is derived from observed data rather than authored from scratch — but read the fine print, because it is narrower than "auto-calibration" might suggest:
- It fits from recorded chaos events (the proxy's own chaos_fired activity log), not from your real upstream's traffic. It learns the shape of the chaos you already ran, so you can promote a hand-tuned run into a reusable profile.
- It is best-effort: mean_dwell_healthy_ms and mean_dwell_degraded_ms come from the p50 of observed clean-gap and error-run lengths; degraded_error_rate is the empirical fraction of degraded events that fired an error; degraded_latency_mult is a ratio of observed latencies. It uses a single percentile, not a maximum-likelihood fit, so feed it more data for a tighter result.
- It needs at least 30 recorded chaos events for the service, or it returns 400.
- It returns a calibrated MarkovConfig; it does not apply it. You review the numbers and PATCH them onto the service yourself.
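The p50-based estimators in the list above can be sketched roughly as follows. This is an illustration, not the service's code: the function name and the input shapes are hypothetical, and it assumes the clean gaps and error runs have already been segmented out of the event log. degraded_latency_mult is left out because the exact latency ratio is not specified here.

```python
from statistics import median

MIN_EVENTS = 30  # minimum number of recorded chaos events, per the list above


def fit_markov_profile(clean_gaps_ms, error_runs_ms, degraded_fired_error, event_count):
    """Estimate Markov knobs from pre-segmented chaos observations.

    clean_gaps_ms and error_runs_ms are observed durations in milliseconds;
    degraded_fired_error holds one boolean per degraded event.
    """
    if event_count < MIN_EVENTS:
        # Mirrors the documented 400 response when there is too little data.
        raise ValueError("need at least 30 recorded chaos events")
    return {
        # p50 of each set of observed lengths, as described above.
        "mean_dwell_healthy_ms": median(clean_gaps_ms),
        "mean_dwell_degraded_ms": median(error_runs_ms),
        # Empirical fraction of degraded events that fired an error.
        "degraded_error_rate": sum(degraded_fired_error) / len(degraded_fired_error),
    }
```

Since each knob rests on one summary statistic, a handful of observations can swing the result noticeably, which is why more recorded events give a tighter profile.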
# Fit a Markov profile from this service's recorded chaos events
curl -X POST http://localhost:8000/services/{service_id}/chaos/calibrate \
-H "X-API-Key: $GHOST_API_KEY"
# 200 → { "chaos_model": "markov", "markov": { ... } } (review, then apply)
# 400 → fewer than 30 recorded chaos events to fit from
What chaos emits
Every injected fault is observable, and the event trail is built to be safe to surface. The agent emits a minimal, scrubbed chaos event: service_id, method, a normalized path, the chaos type (error / latency / both), the injected status and latency, and the rule/preset name.
The chaos event log is scrubbed at the source
The path is normalized to {id} placeholders before the event leaves the agent. Headers, request and response bodies, upstream URLs, and client IPs are never included in a chaos event. The event ships only on the licensed cloud-observability path; an OSS proxy logs chaos locally and ships nothing.
On the metrics side, an injected error increments the proxy's match-path counter with the chaos label:
ghost_requests_total{match_type="chaos"}
A chaos response wraps, rather than replaces, the deterministic match cascade. The underlying request still resolves through session-verbatim → exact → resource-store → smart-swap (→ AI generation, Pro+) the same way it would without chaos. There is no LLM in the request hot path: chaos is pure config evaluation plus an optional sleep and a synthetic response, so faulting a request adds no inference latency of its own.
Availability
Chaos injection — latency, weighted errors, presets, the Markov model, and per-endpoint rules — is available on every plan, Free included. The licensed product ships as a Docker Compose deployment with prebuilt registry images; chaos is a first-class part of the proxy in that build.
The basic chaos primitives (latency and error injection) are also part of the separate open-source proxy, which runs as a standalone host binary (installed via Homebrew or a container image). The cloud-side chaos event ingestion and dashboard event log are a licensed-product surface; an OSS proxy logs chaos locally only.