Algo incident response playbook
When automation misfires, speed matters — but sequence matters more. This playbook gives you a repeatable order: detect → contain → diagnose → recover.
Severity levels (keep this simple)
- Sev-1: potential account harm now (unexpected live orders, runaway loop)
- Sev-2: strategy malfunction with limited blast radius
- Sev-3: degraded behavior (noise, slippage drift, stale signals)
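The three levels above can be encoded so that alerts get a severity automatically. The following is a minimal sketch, not a definitive rule set: the `Severity` enum, the `classify` function, and its three boolean inputs are hypothetical names chosen for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # potential account harm now
    SEV2 = 2  # strategy malfunction with limited blast radius
    SEV3 = 3  # degraded behavior


def classify(unexpected_live_orders: bool,
             runaway_loop: bool,
             strategy_malfunction: bool) -> Severity:
    """Map observed conditions to a severity level (hypothetical inputs)."""
    # Anything that can harm the account right now is Sev-1.
    if unexpected_live_orders or runaway_loop:
        return Severity.SEV1
    # A malfunction that is contained to one strategy is Sev-2.
    if strategy_malfunction:
        return Severity.SEV2
    # Everything else (noise, slippage drift, stale signals) is Sev-3.
    return Severity.SEV3
```

Checking the Sev-1 conditions first means a malfunction that is also sending unexpected live orders is never downgraded.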
First 2 minutes (Sev-1)
- Trigger kill switch (L3/L4 depending on blast radius).
- Disable affected strategy.
- Confirm no new orders are being sent.
- Preserve evidence immediately:
  - broker responses / rejects
  - strategy settings snapshot
  - timestamps and symbols
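The Sev-1 steps above can be scripted so the order never changes under pressure. This is a sketch under stated assumptions: `contain`, `Evidence`, and the three callables passed in are hypothetical stand-ins for whatever your platform and broker actually expose, and "no new orders" is approximated by reading a sent-order counter twice.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class Evidence:
    captured_at: str              # timestamp supplied by the caller
    symbols: list[str]
    broker_responses: list[str]   # raw responses / rejects
    strategy_settings: dict       # settings snapshot at incident time


def contain(kill_switch: Callable[[], None],
            disable_strategy: Callable[[], None],
            count_orders_sent: Callable[[], int],
            evidence: Evidence,
            settle_seconds: float = 0.0) -> dict:
    """Run the containment steps in a fixed order (hypothetical hooks)."""
    kill_switch()          # step 1: stop everything in scope
    disable_strategy()     # step 2: keep the strategy from restarting
    # Step 3: the sent-order count must not move after containment.
    first = count_orders_sent()
    time.sleep(settle_seconds)
    second = count_orders_sent()
    # Step 4: serialize evidence before anyone changes settings.
    return {
        "orders_stopped": second == first,
        "evidence_json": json.dumps(asdict(evidence), sort_keys=True),
    }
```

In practice `settle_seconds` would be a few seconds, and the returned JSON would be written to storage that the strategy itself cannot modify.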
30-minute diagnosis block
Use this checklist:
- Was this signal logic, risk rule, execution transport, or broker state?
- Did config drift from approved baseline?
- Did this happen before (repeat signature)?
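The config drift question is the easiest one to answer mechanically. Here is a minimal sketch that compares the settings snapshot against an approved baseline; `config_drift` and the example keys are hypothetical, and both configs are assumed to be flat dictionaries with string keys.

```python
_MISSING = object()  # sentinel so a stored None is not confused with "absent"


def config_drift(baseline: dict, current: dict) -> dict:
    """Return {key: (baseline_value, current_value)} for every differing key."""
    drift = {}
    for key in sorted(set(baseline) | set(current)):
        b = baseline.get(key, _MISSING)
        c = current.get(key, _MISSING)
        if b != c:
            drift[key] = (
                "<absent>" if b is _MISSING else b,
                "<absent>" if c is _MISSING else c,
            )
    return drift


approved = {"max_trades_per_day": 5, "risk_per_trade": 0.01}
live = {"max_trades_per_day": 50, "risk_per_trade": 0.01, "debug": True}
print(config_drift(approved, live))
```

An empty result rules out drift, which lets you move on to the other three classes (signal logic, risk rule, transport or broker state).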
Quick triage matrix
| Symptom | Probable class | First check |
|---|---|---|
| Unexpected symbols traded | Universe/symbol list issue | Strategy source + list mapping |
| Too many orders quickly | Missing cap / duplicated triggers | Max trades/day + dedupe |
| Orders rejected repeatedly | Broker/API session issue | API auth + connection state |
| “Late” behavior and poor fills | Execution quality drift | Slippage + spread regime |
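For the "too many orders quickly" row, the two first checks (a max trades/day cap and dedupe) can be combined in a single gate in front of order submission. This is an illustrative sketch, not any platform's built-in guard: `OrderGate`, `signal_id`, and the string `day` key are hypothetical.

```python
class OrderGate:
    """Reject duplicate triggers and enforce a per-day trade cap."""

    def __init__(self, max_trades_per_day: int):
        self.max_trades_per_day = max_trades_per_day
        self._seen: set[tuple[str, str]] = set()
        self._count_by_day: dict[str, int] = {}

    def allow(self, day: str, signal_id: str) -> bool:
        key = (day, signal_id)
        # Dedupe: the same signal on the same day is only acted on once.
        if key in self._seen:
            return False
        # Cap: refuse once the day's allowance is used up.
        if self._count_by_day.get(day, 0) >= self.max_trades_per_day:
            return False
        self._seen.add(key)
        self._count_by_day[day] = self._count_by_day.get(day, 0) + 1
        return True


gate = OrderGate(max_trades_per_day=2)
print(gate.allow("2024-01-02", "sig-a"))  # first trigger
print(gate.allow("2024-01-02", "sig-a"))  # duplicate of the same trigger
```

Blocked signals are not recorded as seen, so a signal rejected by the cap today can still pass on a later day.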
Recovery protocol
- Patch one root cause at a time.
- Run canary in paper first.
- Re-enable with reduced risk caps.
- Monitor first session manually.
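The canary and reduced-cap steps can be tied together so that live trading cannot resume at full size by accident. A minimal sketch, assuming risk caps are stored as a flat mapping of numbers; `reenable_caps`, the cap names, and the 25% default are hypothetical choices, not recommended values.

```python
from typing import Optional


def reenable_caps(approved_caps: dict,
                  canary_passed: bool,
                  factor: float = 0.25) -> Optional[dict]:
    """Return scaled-down caps for re-enabling, or None if the canary failed."""
    if not 0 < factor <= 1:
        raise ValueError("factor must be in (0, 1]")
    # No paper canary pass means no live re-enable at any size.
    if not canary_passed:
        return None
    return {name: value * factor for name, value in approved_caps.items()}


caps = {"max_position_usd": 1000.0, "max_daily_loss_usd": 200.0}
print(reenable_caps(caps, canary_passed=True))
```

Raising `factor` back toward 1 over several clean sessions is one way to restore the approved caps gradually.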
Postmortem template (same day)
- What happened (factual timeline)
- Impact (orders, risk, downtime)
- Root cause (single clearest statement)
- Contributing factors
- Permanent fixes (owner + due date)
- Guardrail added (to prevent recurrence)
Reliability KPIs (lightweight)
Track these weekly:
- Incident count by severity
- Time to containment
- Repeat incident ratio
- % incidents with completed postmortem
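These four numbers can be computed from a simple incident log. The sketch below assumes one record per incident; the `Incident` fields and `weekly_kpis` are hypothetical, time to containment is reported as a mean, and a "repeat" is taken to mean any incident whose signature already appeared earlier in the same list.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    severity: int                  # 1, 2, or 3
    signature: str                 # label for the failure pattern
    minutes_to_containment: float
    postmortem_done: bool


def weekly_kpis(incidents: list[Incident]) -> dict:
    """Summarize one week's incidents into the four KPIs."""
    total = len(incidents)
    if total == 0:
        return {"count_by_severity": {}, "mean_minutes_to_containment": 0.0,
                "repeat_incident_ratio": 0.0, "postmortem_completion_pct": 0.0}
    signature_counts = Counter(i.signature for i in incidents)
    # Every occurrence of a signature beyond its first counts as a repeat.
    repeats = sum(n - 1 for n in signature_counts.values())
    done = sum(1 for i in incidents if i.postmortem_done)
    return {
        "count_by_severity": dict(Counter(i.severity for i in incidents)),
        "mean_minutes_to_containment":
            sum(i.minutes_to_containment for i in incidents) / total,
        "repeat_incident_ratio": repeats / total,
        "postmortem_completion_pct": 100 * done / total,
    }
```

To catch repeats across weeks, pass a longer window of incidents rather than a single week's list.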
FAQ
Do small accounts need incident response?
Yes. A small account has less capital to absorb an operational mistake, so the same error costs proportionally more.
Should every incident get a postmortem?
Sev-1 and Sev-2: yes. Sev-3: at least a short record.