Disaster Recovery
This disaster recovery plan defines the recovery objectives, backup strategy, restore procedures, and failover processes for FeatureSignals. It applies to FeatureSignals Cloud and Dedicated Cloud, and provides guidance for self-hosted deployments.
Recovery Objectives (RPO / RTO)
Recovery objectives define how much data loss is acceptable (RPO) and how quickly service must be restored (RTO):
| Scenario | RPO | RTO | Applies To |
|---|---|---|---|
| Database corruption (single AZ) | < 5 minutes (WAL shipping) | < 30 minutes | FeatureSignals Cloud |
| Full region failure | < 1 hour (cross-region backup) | < 4 hours | FeatureSignals Cloud |
| Dedicated Cloud — instance failure | < 5 minutes (WAL shipping) | < 15 minutes (auto-failover) | Dedicated Cloud |
| Self-hosted — complete rebuild | Customer-defined backup schedule | Customer-driven | Self-Hosted |
Backup Strategy
FeatureSignals employs a layered backup strategy to meet the RPO targets:
PostgreSQL WAL Archiving
Continuous Write-Ahead Log (WAL) archiving to cloud object storage (S3-compatible). Point-in-time recovery with 5-minute granularity. WAL segments are shipped every 60 seconds or when they reach 16 MB.
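The archiving described above maps to a small postgresql.conf fragment. This is a minimal sketch, assuming an S3-compatible bucket named featuresignals-wal-archive (hypothetical) and the aws CLI available on the database host:

```ini
# postgresql.conf: continuous WAL archiving (bucket name is illustrative)
archive_mode = on
# Copy each completed segment to object storage; %p is the segment's path, %f its file name
archive_command = 'aws s3 cp %p s3://featuresignals-wal-archive/%f'
# Force a segment switch after 60 seconds even when write volume is low
archive_timeout = 60
```

The `archive_timeout` setting is what guarantees the 60-second shipping interval; without it, a quiet database would only archive when a 16 MB segment fills.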
Daily Full Backups
Full pg_dump backups taken daily at 03:00 UTC during low-traffic window. Encrypted at rest with AES-256. Retained for 30 days. Stored in a separate region from the primary database.
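One way to implement this schedule is a small script invoked by cron at 03:00 UTC, piping `pg_dump` through AES-256 encryption and straight to object storage. This is a sketch, not runnable as-is: the database name, key file path, and bucket are placeholders, and it assumes the `openssl` and `aws` CLIs are installed.

```shell
#!/bin/sh
# daily_backup.sh: invoked from cron at 03:00 UTC (names and paths are placeholders)
set -eu
pg_dump -Fc featuresignals \
  | openssl enc -aes-256-cbc -pbkdf2 -pass file:/etc/featuresignals/backup.key \
  | aws s3 cp - "s3://featuresignals-backups/daily/$(date +%F).dump.enc"
```

Piping directly to storage avoids staging a multi-gigabyte dump on local disk; `-Fc` produces the custom format that `pg_restore` expects.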
Cross-Region Replication
Backups replicated to a secondary cloud region within 1 hour. For Dedicated Cloud, customers can configure an additional replication target in their own object storage account.
Immutable Backups
Backups stored with object lock (WORM — write once, read many) for 7 days. This protects against ransomware and accidental deletion. Compliance mode prevents even root accounts from deleting locked backups.
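On S3, the 7-day COMPLIANCE window above corresponds to a default retention rule on the backup bucket, applied with `aws s3api put-object-lock-configuration`. A sketch of the configuration JSON (note that Object Lock typically has to be enabled when the bucket is created):

```json
{
  "ObjectLockEnabled": "Enabled",
  "Rule": {
    "DefaultRetention": {
      "Mode": "COMPLIANCE",
      "Days": 7
    }
  }
}
```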
Restore Procedures
1. Database Restore from Backup
- Provision a new PostgreSQL instance (same version as backup).
- Download the latest daily backup from object storage.
- Restore with `pg_restore` to the new instance.
- Apply WAL segments forward to the desired point in time.
- Update DNS or connection strings to point to the new instance.
- Verify flag evaluations return expected results from a test SDK.
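For the WAL replay step above, PostgreSQL 12+ reads recovery settings from postgresql.conf once an empty `recovery.signal` file exists in the data directory. A minimal sketch, with an illustrative bucket name and target timestamp:

```ini
# postgresql.conf on the replacement instance (bucket and timestamp are illustrative)
restore_command = 'aws s3 cp s3://featuresignals-wal-archive/%f %p'
# Stop replay at the chosen point in time, then promote to read/write
recovery_target_time = '2025-01-15 03:00:00+00'
recovery_target_action = 'promote'
```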
2. Full Stack Recovery
- Provision new compute instances in the target region.
- Restore PostgreSQL database (follow procedure above).
- Deploy the latest FeatureSignals release via CI/CD or Helm chart.
- Populate Redis cache by restarting the server (auto-warms from database).
- Verify health endpoints: `GET /health` and `GET /ready`.
- Run the integration test suite against the restored environment.
- Switch DNS or load balancer traffic to the new stack.
Regional Failover
FeatureSignals Cloud uses active-passive regional failover for disaster recovery:
- Primary region: All traffic served from the primary cloud region. Database is the source of truth.
- Standby region: Infrastructure pre-provisioned (compute, database instance, object storage). Database restored from the latest cross-region backup. Not serving traffic in normal operation.
- Failover trigger: Manual decision by the on-call engineer after confirming the primary region is unrecoverable within RTO. Failover is not automatic to prevent split-brain scenarios.
- DNS cutover: Update DNS records to point to the standby region. TTL is set to 60 seconds to allow fast propagation.
Testing Disaster Recovery
DR procedures are only as good as their last test. We run the following DR tests on a regular cadence:
| Test Type | Frequency | Scope |
|---|---|---|
| Backup verification | Weekly (automated) | Verify latest backup is restorable. Checksum validation. |
| Tabletop exercise | Monthly | Walk through DR procedures with the engineering team. No actual failover. |
| Database restore drill | Quarterly | Restore database from backup in an isolated environment. Run integration tests. |
| Full regional failover | Biannually | Complete failover to standby region. Serve production traffic for 24 hours. Fail back. |
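The weekly automated backup verification can be sketched as a checksum comparison against a manifest written at backup time. The manifest format assumed here (a single hex SHA-256 digest per backup file) is an illustration, not the actual FeatureSignals tooling:

```python
import hashlib
from pathlib import Path

def backup_checksum_ok(backup_path: Path, manifest_path: Path) -> bool:
    """Return True if the backup's SHA-256 digest matches the recorded manifest value."""
    h = hashlib.sha256()
    with backup_path.open("rb") as f:
        # Stream in 1 MiB chunks so multi-gigabyte dumps need not fit in memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    expected = manifest_path.read_text().strip()
    return h.hexdigest() == expected
```

A check like this catches truncated or corrupted uploads; the quarterly restore drill then confirms the backup is not just intact but actually restorable.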
Self-Hosted DR Guidance
If you're running FeatureSignals self-hosted, you are responsible for your own DR plan. Here's what we recommend:
- Automate PostgreSQL backups — Use `pg_dump` or your cloud provider's managed backup service. Schedule daily full backups with WAL archiving for PITR.
- Store backups off-site — Replicate backups to a different region, cloud provider, or on-premises location.
- Document your restore procedure — Write down the exact steps. The person restoring at 3 AM may not be the person who set it up.
- Test regularly — Restore from backup into a staging environment quarterly. A backup you haven't tested is not a backup.
- Monitor backup health — Alert if backups fail, if WAL shipping lags, or if backup storage is approaching capacity.