
Incident Runbook

Production incident response procedures for FeatureSignals. Severity classification, escalation paths, rollback procedures, communication templates, and the blameless post-mortem process — everything you need when things go sideways at 3 AM.

Warning

This runbook covers both FeatureSignals-hosted incidents and guidance for self-hosted customers responding to their own deployments. For self-hosted deployments, you own the response process — we provide tools and support.

Severity Levels

Incidents are classified by scope, impact, and urgency. Use these definitions to triage quickly and consistently:

P0 — Critical

Definition: FeatureSignals is completely unavailable. Flag evaluations are failing for all customers. Data loss or security breach in progress.
Response: Immediate. On-call engineer acknowledges within 5 minutes. War room initiated within 15 minutes.
Escalation: CTO notified immediately. CEO notified within 30 minutes if unresolved.

P1 — High

Definition: Major feature degraded. Evaluation latency exceeds 5s p99. SDKs returning stale data. Single customer experiencing complete outage on Dedicated Cloud.
Response: On-call engineer acknowledges within 15 minutes. Investigation begins within 30 minutes.
Escalation: Engineering manager notified within 1 hour. VP Engineering if unresolved after 4 hours.

P2 — Medium

Definition: Partial degradation. Dashboard UI slow but functional. Webhook deliveries delayed. Non-critical API endpoints returning errors.
Response: Acknowledged within 1 hour. Fix deployed within the next business day.
Escalation: Engineering manager notified within 4 hours.

P3 — Low

Definition: Minor issues. Cosmetic UI bugs. Documentation errors. Non-production environment issues. Feature requests misclassified as bugs.
Response: Triaged within 1 business day. Scheduled for next sprint or backlog.
Escalation: No escalation required. Tracked in issue tracker.
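
If you want to enforce these acknowledgement and escalation windows in your own paging automation, a minimal sketch like the following can encode them. The structure, field names, and contacts below are illustrative assumptions, not a FeatureSignals API:

```python
# Illustrative sketch: the severity SLAs above encoded for paging automation.
# Field names and contact labels are assumptions, not FeatureSignals defaults.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SeveritySLA:
    ack_minutes: int                       # time allowed to acknowledge the alert
    escalate_after_minutes: Optional[int]  # minutes before escalating; None = no escalation
    escalation_contact: Optional[str]

SEVERITY_SLAS = {
    "P0": SeveritySLA(5, 0, "CTO"),                      # escalate immediately
    "P1": SeveritySLA(15, 60, "Engineering manager"),
    "P2": SeveritySLA(60, 240, "Engineering manager"),
    "P3": SeveritySLA(24 * 60, None, None),              # triage within 1 business day
}

def sla_for(severity: str) -> SeveritySLA:
    """Look up the SLA for a declared severity, defaulting to the strictest."""
    return SEVERITY_SLAS.get(severity, SEVERITY_SLAS["P0"])
```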

First Response Checklist

When an alert fires or an incident is reported, the first responder runs this checklist:

  1. Acknowledge the alert — Silence the pager. Acknowledge in the incident Slack channel. This buys you time to think.
  2. Assess blast radius — How many customers are affected? Which services? Is this a partial or complete outage?
  3. Declare severity — Use the severity definitions above. Err on the side of over-classifying — you can downgrade later.
  4. Start the incident timer — Note the start time (UTC). This feeds SLA reporting.
  5. Open an incident channel — Create a dedicated Slack channel (#incident-{number}) or Zoom war room for P0/P1.
  6. Send initial communication — Use the template below to notify affected customers and internal stakeholders.
  7. Begin investigation — Check dashboards (SigNoz), recent deploys, database metrics, and error logs.
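
Steps 2, 4, and 7 can be partly scripted. The sketch below records the incident start time and probes a few service endpoints to gauge blast radius; the service names and health URLs are placeholders, not documented FeatureSignals endpoints, so substitute whatever your deployment actually exposes:

```python
# Rough triage sketch: note the incident start time (UTC) and check a handful
# of service health endpoints. URLs below are placeholders -- use your own.
from datetime import datetime, timezone

import requests

SERVICES = {
    "evaluation-api": "https://api.example.com/health",
    "dashboard": "https://app.example.com/health",
    "webhooks": "https://hooks.example.com/health",
}

def triage() -> None:
    start = datetime.now(timezone.utc)
    print(f"Incident timer started: {start.isoformat()} (UTC)")
    for name, url in SERVICES.items():
        try:
            resp = requests.get(url, timeout=5)
            status = f"HTTP {resp.status_code} in {resp.elapsed.total_seconds():.2f}s"
        except requests.RequestException as exc:
            status = f"UNREACHABLE ({exc.__class__.__name__})"
        print(f"{name:<16} {status}")

if __name__ == "__main__":
    triage()
```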

Rollback Procedures

If a recent deployment caused the incident, rollback is your fastest path to recovery. Don't debug in production — roll back first, investigate later.

  1. Identify the last known-good deployment from the deployment history in your CI/CD pipeline.
  2. Redeploy the previous version. One-click rollback is available in the Ops Portal.
  3. Verify health endpoints return 200 and evaluation latency returns to baseline (a verification sketch follows this list).
  4. Confirm with affected customers that service is restored.
  5. Preserve all logs, metrics, and traces from the incident window for post-mortem analysis.
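
The health verification in step 3 can be scripted rather than eyeballed. This sketch polls a health endpoint until it returns 200 under a latency target or a deadline passes; the URL, the 500 ms target, and the 5-minute deadline are assumptions to tune to your own baseline:

```python
# Post-rollback verification sketch: poll a health endpoint until it returns
# 200 and responds under a latency target, or give up after a deadline.
# The URL and thresholds are illustrative, not FeatureSignals defaults.
import time

import requests

HEALTH_URL = "https://api.example.com/health"  # placeholder
LATENCY_TARGET_S = 0.5                         # assumed baseline latency
DEADLINE_S = 300                               # stop waiting after 5 minutes

def verify_rollback() -> bool:
    deadline = time.monotonic() + DEADLINE_S
    while time.monotonic() < deadline:
        try:
            resp = requests.get(HEALTH_URL, timeout=5)
            latency = resp.elapsed.total_seconds()
            if resp.status_code == 200 and latency <= LATENCY_TARGET_S:
                print(f"Healthy: HTTP 200 in {latency:.2f}s")
                return True
            print(f"Not healthy yet: HTTP {resp.status_code} in {latency:.2f}s")
        except requests.RequestException as exc:
            print(f"Health check failed: {exc}")
        time.sleep(10)
    return False
```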

Ops Portal Rollback

For FeatureSignals Cloud and Dedicated Cloud, one-click rollback is available in the Ops Portal at https://ops.featuresignals.com/deployments. Select the deployment, click “Rollback,” and confirm. The platform handles canary traffic shifting and health verification automatically.

Communication Templates

Consistent communication reduces panic. Use this template for all customer-facing incident updates:

Subject: [INCIDENT] FeatureSignals {SEVERITY} — {BRIEF_TITLE}

Status: {INVESTIGATING | MONITORING | RESOLVED}
Incident ID: {INCIDENT_ID}
Start Time: {START_TIME_UTC}
Impact: {DESCRIPTION_OF_IMPACT}

Current Status:
{WHAT_WE_KNOW_AND_WHAT_WE'RE_DOING}

Next Update: {EXPECTED_UPDATE_TIME}

FeatureSignals Incident Response Team
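
To keep updates consistent under pressure, the template can be rendered and posted programmatically. The sketch below fills the fields and sends the result to a Slack incoming webhook that you configure yourself; the webhook variable and the sample field values are placeholders, and status page or email delivery is not covered here:

```python
# Sketch: render the incident update template and post it to a Slack incoming
# webhook. The webhook URL and sample values are placeholders for your setup.
import os

import requests

TEMPLATE = """\
[INCIDENT] FeatureSignals {severity} — {title}

Status: {status}
Incident ID: {incident_id}
Start Time: {start_time_utc}
Impact: {impact}

Current Status:
{current_status}

Next Update: {next_update}

FeatureSignals Incident Response Team"""

def post_update(**fields: str) -> None:
    message = TEMPLATE.format(**fields)
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # your own incoming webhook
    requests.post(webhook_url, json={"text": message}, timeout=10).raise_for_status()

# Example values only; fill in the real incident details.
post_update(
    severity="P1",
    title="Elevated evaluation latency",
    status="INVESTIGATING",
    incident_id="INC-1234",
    start_time_utc="2024-05-01T03:12:00Z",
    impact="p99 evaluation latency above 5s for a subset of customers",
    current_status="Rolling back the 03:05 UTC deploy; monitoring latency.",
    next_update="04:00 UTC",
)
```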

Status Update Cadence

| Severity | Update Frequency | Channel |
| --- | --- | --- |
| P0 | Every 30 minutes | Status page + Slack + Email |
| P1 | Every 1 hour | Status page + Slack |
| P2 | Every 4 hours | Status page |
| P3 | On resolution | Issue tracker |

Post-Mortem Process

Every P0 and P1 incident requires a blameless post-mortem within 48 hours of resolution. The goal is understanding, not blame:

  1. Timeline — Construct a minute-by-minute timeline from alerts, logs, chat messages, and deployment records.
  2. Root cause — What specifically caused the incident? Use the 5 Whys technique to trace back to process or systemic gaps.
  3. Impact assessment — Duration, affected customers, evaluation failures, SLA impact.
  4. What went well — Acknowledge effective detection, fast response, good communication. Celebrate smart decisions under pressure.
  5. What could be better — Detection gaps, unclear runbooks, tooling deficiencies, training needs.
  6. Action items — Specific, assigned, time-boxed remediation tasks. Each action item links to a GitHub issue.
  7. Review — Post-mortem presented at the next engineering all-hands. Action items tracked to completion.
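
Filing the action items can be part of the post-mortem meeting itself. A minimal sketch using the GitHub REST API follows; the repository name, token environment variable, labels, and assignee are assumptions for your own setup:

```python
# Sketch: file one GitHub issue per post-mortem action item via the REST API.
# Repository, token env var, labels, and assignee are placeholders.
import os

import requests

REPO = "your-org/featuresignals-ops"  # placeholder repository
API = f"https://api.github.com/repos/{REPO}/issues"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def file_action_item(title: str, body: str, assignee: str) -> str:
    resp = requests.post(
        API,
        headers=HEADERS,
        json={
            "title": title,
            "body": body,
            "labels": ["post-mortem", "action-item"],
            "assignees": [assignee],
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]

# Example values only.
url = file_action_item(
    "Add alert for stale SDK payloads",
    "From the INC-1234 post-mortem: detection gap, stale data served for 40 minutes before paging.",
    "oncall-engineer",
)
print(url)
```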
