
Algo incident response playbook

When automation misfires, speed matters, but sequence matters more. This playbook gives you a repeatable order: detect → contain → diagnose → recover.

[Figure: Algo incident response flow]

Severity levels (keep this simple)

  • Sev-1: potential account harm now (unexpected live orders, runaway loop)
  • Sev-2: strategy malfunction with limited blast radius
  • Sev-3: degraded behavior (noise, slippage drift, stale signals)
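
To keep severity assignment mechanical under pressure, the three levels can be sketched as a small classifier. This is only an illustration: the `Observation` flags and the `classify` function are hypothetical names, and your own monitoring would decide how each flag gets set.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # potential account harm now
    SEV2 = 2  # strategy malfunction with limited blast radius
    SEV3 = 3  # degraded behavior


@dataclass
class Observation:
    # Hypothetical flags; populate them from your own alerts and logs.
    unexpected_live_orders: bool = False
    runaway_loop: bool = False
    strategy_malfunction: bool = False


def classify(obs: Observation) -> Severity:
    """Return the most severe level that matches the observation."""
    if obs.unexpected_live_orders or obs.runaway_loop:
        return Severity.SEV1
    if obs.strategy_malfunction:
        return Severity.SEV2
    # Anything else (noise, slippage drift, stale signals) is degraded behavior.
    return Severity.SEV3
```

Checking the Sev-1 conditions first means an incident that matches several levels is always handled at the most severe one.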

First 2 minutes (Sev-1)

  1. Trigger kill switch (L3/L4 depending on blast radius).
  2. Disable affected strategy.
  3. Confirm no new orders are being sent.
  4. Preserve evidence immediately:
       • broker responses / rejects
       • strategy settings snapshot
       • timestamps and symbols
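
The Sev-1 sequence above can be sketched as one function that enforces the order. Everything here is an assumption for illustration: the callables stand in for whatever your platform or broker actually exposes, and `orders_sent_total` is a hypothetical counter of orders sent so far.

```python
import json
import time
from datetime import datetime, timezone
from typing import Callable


def contain_sev1(
    strategy_id: str,
    trigger_kill_switch: Callable[[], None],
    disable_strategy: Callable[[str], None],
    orders_sent_total: Callable[[], int],
    evidence: dict,
    settle_seconds: float = 0.0,
) -> dict:
    """Run the containment steps in order and return an evidence record."""
    trigger_kill_switch()          # step 1: stop the bleeding first
    disable_strategy(strategy_id)  # step 2: disable the affected strategy

    # Step 3: the order counter must not move after containment.
    before = orders_sent_total()
    time.sleep(settle_seconds)
    after = orders_sent_total()

    # Step 4: freeze the evidence with a UTC capture time.
    record = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "strategy_id": strategy_id,
        "order_flow_stopped": after == before,
        "evidence": evidence,
    }
    # Round-trip through JSON so unserializable evidence fails now, not later.
    return json.loads(json.dumps(record))
```

If `order_flow_stopped` comes back `False`, containment has not worked yet, and escalating the kill switch comes before any diagnosis.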

30-minute diagnosis block

Use this checklist:

  • Where did the fault sit: signal logic, risk rule, execution transport, or broker state?
  • Did the config drift from the approved baseline?
  • Has this happened before (repeat signature)?
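
Two of the checklist questions lend themselves to a quick script. The sketch below assumes configs are flat dictionaries and that an incident can be summarized by a fault class, a strategy id, and a symptom. Both function names are hypothetical.

```python
import hashlib


def config_drift(baseline: dict, current: dict) -> dict:
    """Return {key: (baseline_value, current_value)} for every differing key.

    A key missing on one side shows up as None on that side.
    """
    keys = baseline.keys() | current.keys()
    return {
        k: (baseline.get(k), current.get(k))
        for k in sorted(keys)
        if baseline.get(k) != current.get(k)
    }


def incident_signature(fault_class: str, strategy_id: str, symptom: str) -> str:
    """Build a short, stable fingerprint for spotting repeat incidents."""
    parts = (fault_class, strategy_id, symptom)
    raw = "|".join(part.strip().lower() for part in parts)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]
```

For example, `config_drift({"max_trades": 5}, {"max_trades": 50})` returns `{"max_trades": (5, 50)}`. An empty result means the config still matches the baseline. Storing the signature with each incident makes the "did this happen before" question a simple lookup.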

Quick triage matrix

Symptom                        | Probable class                    | First check
Unexpected symbols traded      | Universe/symbol list issue        | Strategy source + list mapping
Too many orders quickly        | Missing cap / duplicated triggers | Max trades/day + dedupe
Orders rejected repeatedly     | Broker/API session issue          | API auth + connection state
“Late” behavior and poor fills | Execution quality drift           | Slippage + spread regime
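
For the "too many orders quickly" row, the two first checks (a max trades/day cap and trigger dedupe) can be sketched as one guard. `OrderGuard` is a hypothetical class, not any platform's API, and it assumes each signal carries a stable `signal_id`.

```python
from datetime import date


class OrderGuard:
    """Reject duplicate triggers and enforce a daily trade cap."""

    def __init__(self, max_trades_per_day: int):
        self.max_trades_per_day = max_trades_per_day
        self._day = None
        self._seen = set()

    def allow(self, signal_id: str, today: date) -> bool:
        # Reset state when the trading day changes.
        if today != self._day:
            self._day = today
            self._seen = set()
        if signal_id in self._seen:
            return False  # duplicated trigger
        if len(self._seen) >= self.max_trades_per_day:
            return False  # daily cap reached
        self._seen.add(signal_id)
        return True
```

With `OrderGuard(2)`, the first two distinct signals of a day pass, a repeat of either is rejected, and a third distinct signal is rejected until the next day.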

Recovery protocol

  • Patch one root cause at a time.
  • Run canary in paper first.
  • Re-enable with reduced risk caps.
  • Monitor first session manually.
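
The recovery steps above can be sketched as a gate plus a cap reducer. The 0.25 factor and the cap names in the usage note are illustrative assumptions, not recommendations. Pick a reduction that fits your own risk limits.

```python
def can_reenable(changes_in_patch: int, paper_canary_passed: bool) -> bool:
    """Allow re-enabling only for a single-change patch that passed the paper canary."""
    return changes_in_patch == 1 and paper_canary_passed


def reduced_caps(approved: dict, factor: float = 0.25) -> dict:
    """Scale approved risk caps down for the first session after a fix."""
    if not 0 < factor <= 1:
        raise ValueError("factor must be in (0, 1]")
    reduced = {}
    for name, value in approved.items():
        if isinstance(value, int):
            # Count-style caps stay whole numbers and never drop below 1.
            reduced[name] = max(1, int(value * factor))
        else:
            reduced[name] = value * factor
    return reduced
```

For instance, `reduced_caps({"max_position_usd": 1000.0, "max_trades_per_day": 8})` yields `{"max_position_usd": 250.0, "max_trades_per_day": 2}`. Manual monitoring of the first session still applies on top of this.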

Postmortem template (same day)

  • What happened (factual timeline)
  • Impact (orders, risk, downtime)
  • Root cause (single clearest statement)
  • Contributing factors
  • Permanent fixes (owner + due date)
  • Guardrail added (to prevent recurrence)
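
If you keep postmortems as structured records, a completeness check is easy to automate. This sketch mirrors the template fields above. The `Postmortem` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Postmortem:
    timeline: list = field(default_factory=list)              # factual events in order
    impact: str = ""                                          # orders, risk, downtime
    root_cause: str = ""                                      # single clearest statement
    contributing_factors: list = field(default_factory=list)
    permanent_fixes: list = field(default_factory=list)       # (fix, owner, due date) tuples
    guardrail_added: str = ""

    def missing(self) -> list:
        """Return the names of template fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A postmortem counts as complete when `missing()` returns an empty list. If an incident truly has no contributing factors, recording an explicit "none identified" entry keeps the check honest.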

Reliability KPIs (lightweight)

Track these weekly:

  • Incident count by severity
  • Time to containment
  • Repeat incident ratio
  • % incidents with completed postmortem
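
These four KPIs can be computed from a simple incident log. The sketch below makes a few assumptions: the `Incident` fields are hypothetical, time to containment is summarized as a median in minutes, and an incident counts as a repeat when its signature already appeared earlier in the same list.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass
class Incident:
    severity: int                  # 1, 2, or 3
    minutes_to_containment: float
    signature: str                 # fingerprint used to spot repeats
    postmortem_done: bool


def weekly_kpis(incidents: list) -> dict:
    """Summarize one week of incidents into the four KPIs."""
    if not incidents:
        return {
            "count_by_severity": {},
            "median_minutes_to_containment": None,
            "repeat_ratio": 0.0,
            "postmortem_pct": 0.0,
        }
    total = len(incidents)
    signature_counts = Counter(i.signature for i in incidents)
    # Every occurrence of a signature beyond the first is a repeat.
    repeats = sum(n - 1 for n in signature_counts.values())
    return {
        "count_by_severity": dict(Counter(i.severity for i in incidents)),
        "median_minutes_to_containment": median(
            i.minutes_to_containment for i in incidents
        ),
        "repeat_ratio": repeats / total,
        "postmortem_pct": 100 * sum(i.postmortem_done for i in incidents) / total,
    }
```

To catch repeats across weeks, compare signatures against the full incident history instead of just the current week's list.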

FAQ

Do small accounts need incident response?

Yes. Small accounts are less tolerant of operational mistakes.

Should every incident get a postmortem?

Sev-1 and Sev-2: yes. Sev-3: at least a short record.




Written by David
Updated 2026-02-25
Mentor-style Trade Ideas tutorials focused on workflow, clarity, and repeatable process.