Incident Runbook
Production incident response procedures for FeatureSignals. Severity classification, escalation paths, rollback procedures, communication templates, and the blameless post-mortem process — everything you need when things go sideways at 3 AM.
Severity Levels
Incidents are classified by scope, impact, and urgency. Use these definitions to triage quickly and consistently:
- P0 — Critical
- P1 — High
- P2 — Medium
- P3 — Low
First Response Checklist
When an alert fires or an incident is reported, the first responder runs this checklist:
- Acknowledge the alert — Silence the pager. Acknowledge in the incident Slack channel. This buys you time to think.
- Assess blast radius — How many customers are affected? Which services? Is this a partial or complete outage?
- Declare severity — Use the severity definitions above. Err on the side of over-classifying — you can downgrade later.
- Start the incident timer — Note the start time (UTC). This feeds SLA reporting.
- Open an incident channel — Create a dedicated Slack channel (#incident-{number}) or Zoom war room for P0/P1 (see the scripted sketch after this checklist).
- Send initial communication — Use the template below to notify affected customers and internal stakeholders.
- Begin investigation — Check dashboards (SigNoz), recent deploys, database metrics, and error logs.
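For P0/P1 incidents, the channel-and-notification steps are straightforward to script. Below is a minimal sketch using the Slack Web API, assuming a bot token with `channels:manage` and `chat:write` scopes; the `declare_incident` helper and its arguments are illustrative, not an existing internal tool.

```python
# Minimal sketch: open the incident channel and post the initial notification
# (the "start the timer", "open a channel", and "send initial communication"
# steps above). Assumes slack_sdk is installed and SLACK_BOT_TOKEN is set.
import os
from datetime import datetime, timezone

from slack_sdk import WebClient

def declare_incident(number: int, severity: str, summary: str) -> str:
    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

    # Start the incident timer in UTC -- this timestamp feeds SLA reporting.
    start_time = datetime.now(timezone.utc).isoformat(timespec="seconds")

    # Open the dedicated incident channel, e.g. #incident-1234.
    channel = client.conversations_create(name=f"incident-{number}")
    channel_id = channel["channel"]["id"]

    # Send the initial communication to the new channel.
    client.chat_postMessage(
        channel=channel_id,
        text=(
            f"[INCIDENT] FeatureSignals {severity} declared at {start_time}\n"
            f"Impact: {summary}\n"
            "Status: INVESTIGATING"
        ),
    )
    return channel_id
```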
Rollback Procedures
If a recent deployment caused the incident, rollback is your fastest path to recovery. Don't debug in production — roll back first, investigate later.
- Step 1: Identify the last known-good deployment from the deployment history in your CI/CD pipeline.
- Step 2: Redeploy the previous version. One-click rollback is available in the Ops Portal.
- Step 3: Verify health endpoints return 200 and evaluation latency returns to baseline (a verification sketch follows these steps).
- Step 4: Confirm with affected customers that service is restored.
- Step 5: Preserve all logs, metrics, and traces from the incident window for post-mortem analysis.
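Step 3 can be automated so the on-call engineer isn't eyeballing dashboards alone. The sketch below polls a health endpoint and compares request latency to a baseline; the URL, the 250 ms baseline, and the retry cadence are illustrative assumptions, not documented values.

```python
# Minimal post-rollback verification sketch (Step 3). The health URL and the
# latency baseline below are hypothetical placeholders.
import time

import requests

HEALTH_URL = "https://api.featuresignals.example/health"  # hypothetical endpoint
LATENCY_BASELINE_MS = 250                                  # hypothetical baseline

def verify_rollback(attempts: int = 5) -> bool:
    for _ in range(attempts):
        start = time.monotonic()
        response = requests.get(HEALTH_URL, timeout=5)
        latency_ms = (time.monotonic() - start) * 1000

        # Both conditions from Step 3: a 200 response and latency back at baseline.
        if response.status_code == 200 and latency_ms <= LATENCY_BASELINE_MS:
            return True
        time.sleep(10)  # give the redeployed version time to warm up
    return False
```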
Ops Portal Rollback
Go to https://ops.featuresignals.com/deployments, select the deployment, click “Rollback,” and confirm. The platform handles canary traffic shifting and health verification automatically.
Communication Templates
Consistent communication reduces panic. Use this template for all customer-facing incident updates:
Subject: [INCIDENT] FeatureSignals {SEVERITY} — {BRIEF_TITLE}
Status: {INVESTIGATING | MONITORING | RESOLVED}
Incident ID: {INCIDENT_ID}
Start Time: {START_TIME_UTC}
Impact: {DESCRIPTION_OF_IMPACT}
Current Status:
{WHAT_WE_KNOW_AND_WHAT_WE'RE_DOING}
Next Update: {EXPECTED_UPDATE_TIME}
FeatureSignals Incident Response Team
Status Update Cadence
| Severity | Update Frequency | Channel |
|---|---|---|
| P0 | Every 30 minutes | Status page + Slack + Email |
| P1 | Every 1 hour | Status page + Slack |
| P2 | Every 4 hours | Status page |
| P3 | On resolution | Issue tracker |
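A short sketch tying the template and the cadence table together: it renders the customer-facing update and looks up how often (and where) to send it. The `Incident` dataclass and its field names are illustrative, chosen to mirror the template placeholders above.

```python
# Minimal sketch: render an update from the communication template and map
# severity to the update cadence table. All names here are illustrative.
from dataclasses import dataclass

UPDATE_CADENCE = {
    "P0": ("every 30 minutes", ["status page", "Slack", "email"]),
    "P1": ("every 1 hour", ["status page", "Slack"]),
    "P2": ("every 4 hours", ["status page"]),
    "P3": ("on resolution", ["issue tracker"]),
}

@dataclass
class Incident:
    incident_id: str
    severity: str        # "P0" .. "P3"
    title: str
    status: str          # INVESTIGATING | MONITORING | RESOLVED
    start_time_utc: str
    impact: str
    current_status: str
    next_update: str

def render_update(incident: Incident) -> str:
    # Fields follow the template above, in the same order.
    return "\n".join([
        f"Subject: [INCIDENT] FeatureSignals {incident.severity} — {incident.title}",
        f"Status: {incident.status}",
        f"Incident ID: {incident.incident_id}",
        f"Start Time: {incident.start_time_utc}",
        f"Impact: {incident.impact}",
        "Current Status:",
        incident.current_status,
        f"Next Update: {incident.next_update}",
        "FeatureSignals Incident Response Team",
    ])

frequency, channels = UPDATE_CADENCE["P1"]  # e.g. "every 1 hour", status page + Slack
```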
Post-Mortem Process
Every P0 and P1 incident requires a blameless post-mortem within 48 hours of resolution. The goal is understanding, not blame:
- Timeline — Construct a minute-by-minute timeline from alerts, logs, chat messages, and deployment records.
- Root cause — What specifically caused the incident? Use the 5 Whys technique to trace back to process or systemic gaps.
- Impact assessment — Duration, affected customers, evaluation failures, SLA impact.
- What went well — Acknowledge effective detection, fast response, good communication. Celebrate smart decisions under pressure.
- What could be better — Detection gaps, unclear runbooks, tooling deficiencies, training needs.
- Action items — Specific, assigned, time-boxed remediation tasks. Each action item links to a GitHub issue (see the sketch after this list).
- Review — Post-mortem presented at the next engineering all-hands. Action items tracked to completion.
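To keep action items specific, assigned, and tracked to completion, they can be filed directly as GitHub issues. A minimal sketch follows; the repository name, label, and `ACTION_ITEMS` contents are placeholders, and only the standard GitHub REST issues endpoint is assumed.

```python
# Minimal sketch: file each post-mortem action item as a GitHub issue.
# REPO and ACTION_ITEMS are illustrative; GITHUB_TOKEN must grant issue creation.
import os

import requests

REPO = "featuresignals/platform"  # hypothetical repository
API = f"https://api.github.com/repos/{REPO}/issues"

ACTION_ITEMS = [
    {"title": "Add alerting for the detection gap found in the post-mortem",
     "assignee": "on-call-owner", "due": "YYYY-MM-DD"},
]

def file_action_items(incident_id: str) -> None:
    headers = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    }
    for item in ACTION_ITEMS:
        response = requests.post(
            API,
            headers=headers,
            json={
                "title": f"[{incident_id}] {item['title']}",
                "body": f"Post-mortem action item. Due: {item['due']}",
                "assignees": [item["assignee"]],
                "labels": ["post-mortem"],
            },
            timeout=10,
        )
        response.raise_for_status()
```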