Introducing the AI Janitor: Autonomous Stale Flag Cleanup
Feature flags rot. The AI Janitor scans your flags against configurable staleness thresholds — 14 days for releases, 30 days for experiments, 90 days for ops toggles — then analyzes your source code and opens cleanup PRs automatically. Works with GitHub, GitLab, Bitbucket, and Azure DevOps. Here's how we built it and what it's already cleaning up in production.
The Stale Flag Problem Nobody Talks About
Feature flags are meant to be ephemeral. Ship a feature behind a flag, validate it in production, remove the flag. In practice, flags accumulate like technical debt — each sprint adds new flags while old ones languish in the codebase, their purpose forgotten, their branches dead code, their very existence a source of confusion for the next engineer who encounters them.
We analyzed flag usage across hundreds of engineering teams and found a consistent pattern: the average feature flag lives 4 times longer than its useful lifespan. A release flag created for a 2-week rollout cycle persists for 8 weeks. An experiment flag designed for a 30-day A/B test survives for 4 months. Operations toggles, intended as temporary kill switches, become permanent fixtures. Each stale flag adds dead code paths, multiplies your test matrix, and creates opportunities for bugs to hide.
Warning
Every stale flag is a liability. It's a conditional branch that should never execute but could. It's a configuration combination that inflates your CI matrix. It's cognitive load for every developer who reads the codebase.
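To make the "inflates your CI matrix" point concrete, here is a minimal sketch of how configuration combinations grow with each independent boolean flag (the helper below is illustrative, not part of any product API):

```python
# Each independent boolean flag doubles the number of configuration
# combinations a test matrix could cover: n flags -> 2**n states.
from itertools import product

def flag_combinations(flags: list[str]) -> list[dict[str, bool]]:
    """Enumerate every on/off combination for a set of boolean flags."""
    return [dict(zip(flags, values))
            for values in product([False, True], repeat=len(flags))]

combos = flag_combinations(["show-new-checkout", "use-fast-search", "beta-dashboard"])
print(len(combos))  # 2**3 = 8 combinations
```

Removing even one stale flag halves the number of states a truly exhaustive test matrix would have to cover.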
Category-Aware Staleness Detection
Not all flags age at the same rate. A release toggle for a feature that shipped last week isn't stale — it's still being validated. An experiment flag for a test that concluded 3 months ago is definitely stale. We define three categories with distinct thresholds:
- Release flags: 14 days. Once a feature is fully rolled out and stable, the flag protecting it should be removed within two weeks.
- Experiment flags: 30 days. A/B tests typically run for 2–4 weeks, plus time to analyze results and implement the winning variant.
- Ops toggles: 90 days. Kill switches and operational toggles may need to persist longer, but if they haven't been used in 3 months, it's time to review.
Teams can customize these thresholds per project. Some teams with rapid release cycles set the release threshold to 7 days. Regulated industries with longer change management processes might extend it to 30 days. The AI Janitor adapts to your team's rhythm.
The AI Janitor Pipeline
The AI Janitor runs on a scheduled cadence (daily by default, configurable per project). Each run follows a four-phase pipeline:
- Scan: Query all flags in the project. Cross-reference each flag's last evaluation timestamp, creation date, and category against its staleness threshold. Build a candidate list of flags that may be stale.
- Analyze: For each candidate, clone the connected Git repository and search the source code for all references to the flag key. The LLM analyzes how the flag is used — is it a simple boolean check? Does it guard critical path logic? Are there fallback behaviors? This determines the complexity and risk of removal.
- Generate PR: For flags the LLM determines are safe to remove (high confidence), generate a branch with the flag references removed and the conditional logic simplified. The PR includes a detailed description of what was changed and why.
- Review: Open the PR against the team's default branch. A human reviews and merges — the AI proposes, but a person decides.
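The Analyze phase starts by locating every reference to the flag key in the cloned repository. A minimal sketch of that search step, assuming a plain substring match over common source file extensions (the real scanner's file filters and matching rules are internal):

```python
from pathlib import Path

# Illustrative file filter; the production scanner's rules are assumed.
SOURCE_SUFFIXES = {".ts", ".tsx", ".js", ".py", ".go", ".java"}

def find_flag_references(repo_root: str, flag_key: str) -> list[tuple[str, int]]:
    """Return (relative path, 1-based line number) for every source line
    that mentions the flag key."""
    hits = []
    root = Path(repo_root)
    for path in root.rglob("*"):
        if path.suffix not in SOURCE_SUFFIXES or not path.is_file():
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if flag_key in line:
                hits.append((str(path.relative_to(root)), lineno))
    return hits
```

The resulting reference list is what the LLM reasons over in the Analyze phase: a flag checked once in a simple conditional looks very different from one threaded through a dozen files.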
LLM Integration with Provider Flexibility
The AI Janitor is LLM-agnostic. Teams choose their preferred provider based on their security, cost, and performance requirements:
- DeepSeek: Default provider. Excellent code analysis capabilities, strong reasoning for flag removal safety assessment, cost-effective for daily scans.
- OpenAI (GPT-4o): Available for teams that prefer the OpenAI ecosystem or have existing agreements.
- Azure OpenAI: For Microsoft-centric enterprises with Azure commitments and data residency requirements.
- Self-hosted (Ollama, vLLM): For air-gapped environments or teams with strict data sovereignty requirements. No source code ever leaves your infrastructure.
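Provider flexibility comes down to a narrow interface that every adapter implements. Here is a hedged sketch of that idea: the `LLMClient` protocol, the registry, and the stand-in adapter are all illustrative, not the product's actual configuration schema:

```python
from typing import Protocol

class LLMClient(Protocol):
    """Provider-agnostic analysis interface; concrete adapters for
    DeepSeek, OpenAI, Azure OpenAI, or a self-hosted endpoint would
    each implement `complete`."""
    def complete(self, prompt: str) -> str: ...

class EchoClient:
    """Stand-in adapter so this sketch runs without network access."""
    def complete(self, prompt: str) -> str:
        return f"analysis of: {prompt[:40]}"

# Hypothetical registry; in a real deployment each entry would be a
# distinct adapter class holding its own credentials and endpoint.
REGISTRY = {"deepseek": EchoClient, "openai": EchoClient,
            "azure-openai": EchoClient, "self-hosted": EchoClient}

def make_client(provider: str) -> LLMClient:
    try:
        return REGISTRY[provider]()
    except KeyError:
        raise ValueError(f"unknown LLM provider: {provider}") from None

client = make_client("deepseek")
print(client.complete("Is flag `show-new-checkout` safe to remove?"))
```

Because the pipeline only ever talks to the interface, switching providers is a configuration change rather than a code change, and the self-hosted adapter can point at infrastructure that never sees the public internet.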
Confidence Scoring: When Is a Flag Safe to Remove?
The LLM doesn't just propose removal — it assigns a confidence score (0–100) to each flag based on a structured analysis:
- Usage pattern (30 points): Is the flag checked in a simple if/else? A complex nested conditional? Wrapped in a helper function? Simple patterns score higher.
- Code reachability (25 points): Are all code paths reachable? If the flag is enabled, does the disabled branch contain dead code? Static analysis informs this.
- Test coverage (20 points): Do tests cover both the enabled and disabled paths? Well-tested flags are safer to remove.
- Rollout status (15 points): Has the flag been at 100% for its entire staleness window? Flags still rolling out are lower confidence.
- Dependency analysis (10 points): Does removing this flag affect other flags, feature gating, or configuration dependencies?
A flag scoring 85+ is considered safe to remove, and a PR is opened automatically. Flags scoring 60–84 generate a review suggestion but don't auto-open a PR. Flags scoring below 60 are routed to manual triage in the dashboard.
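The rubric above can be expressed as a weighted sum. A minimal sketch, assuming each sub-score is normalized to a fraction in [0, 1] before weighting (the normalization scheme is an assumption; the weights and cutoffs come from the breakdown above):

```python
# Weights mirror the scoring breakdown: 30/25/20/15/10, summing to 100.
WEIGHTS = {"usage_pattern": 30, "code_reachability": 25,
           "test_coverage": 20, "rollout_status": 15,
           "dependency_analysis": 10}

def confidence_score(subscores: dict[str, float]) -> int:
    """Combine normalized sub-scores (0.0-1.0) into a 0-100 confidence."""
    return round(sum(WEIGHTS[k] * subscores.get(k, 0.0) for k in WEIGHTS))

def disposition(score: int) -> str:
    """Map a confidence score to the action described in the text."""
    if score >= 85:
        return "auto-open PR"
    if score >= 60:
        return "review suggestion"
    return "manual triage"

score = confidence_score({"usage_pattern": 1.0, "code_reachability": 0.9,
                          "test_coverage": 0.8, "rollout_status": 1.0,
                          "dependency_analysis": 0.85})
print(score, disposition(score))  # 92 auto-open PR
```

A simple if/else at 100% rollout with solid test coverage lands comfortably above the auto-PR cutoff, while a flag missing test coverage for one path drops into the review band.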
AI Janitor — Weekly Summary for acme-corp (March 2026)
Flags scanned: 247
Candidates identified: 38
High confidence (85+): 18 → 14 PRs auto-opened
Medium confidence: 12 → dashboard review suggestions
Low confidence: 8 → flagged for manual triage
Results:
✅ 12 PRs merged
✅ 1,200 lines of dead code removed
✅ 47 fewer conditional branches to test
⏳ 2 PRs awaiting review
❌ 1 PR closed (flag still needed per team)

Git Provider Support
The AI Janitor integrates natively with the four major Git platforms. Each integration follows the same pattern — clone, branch, commit, push, open PR — but uses the platform-specific API and conventions:
- GitHub: Uses the GitHub REST API for PR creation. Supports CODEOWNERS-based reviewer assignment, status checks, and branch protection rules.
- GitLab: Uses the GitLab API with merge request creation. Supports approval rules, MR templates, and GitLab CI integration.
- Bitbucket: Uses the Bitbucket Cloud/Server API. Supports pull request creation with default reviewers.
- Azure DevOps: Uses the Azure DevOps Services REST API. Supports PR creation with required reviewers, work item linking, and branch policies.
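The shared clone/branch/commit/push/open-PR flow only diverges at the final API call. A sketch of that dispatch, with each platform's PR-creation endpoint shown in simplified template form (paths abbreviated from the public REST APIs; auth, payloads, and versioning are omitted):

```python
# Simplified endpoint templates for the "open PR" step on each platform.
PR_ENDPOINTS = {
    "github":    "POST /repos/{owner}/{repo}/pulls",
    "gitlab":    "POST /projects/{project_id}/merge_requests",
    "bitbucket": "POST /2.0/repositories/{workspace}/{repo_slug}/pullrequests",
    "azure":     "POST /{org}/{project}/_apis/git/repositories/{repo_id}/pullrequests",
}

def open_cleanup_pr(provider: str, branch: str, title: str) -> dict:
    """Assemble a provider-neutral PR request; a real client would fill
    the path template, sign the request with the platform's auth scheme,
    and send the platform-specific JSON payload."""
    if provider not in PR_ENDPOINTS:
        raise ValueError(f"unsupported Git provider: {provider}")
    return {"endpoint": PR_ENDPOINTS[provider],
            "source_branch": branch,
            "title": title}

req = open_cleanup_pr("github", "ai-janitor/remove-show-new-checkout",
                      "🤖 AI Janitor: Remove stale flag `show-new-checkout`")
print(req["endpoint"])
```

Keeping everything before this call provider-neutral is what makes supporting a fourth or fifth platform an adapter-sized task rather than a rewrite.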
What the Generated PR Looks Like
Each AI Janitor PR follows a consistent template designed for quick human review. Here's a typical example:
## 🤖 AI Janitor: Remove stale flag `show-new-checkout`
### Flag Details
- **Flag key:** `show-new-checkout`
- **Category:** Release
- **Created:** 2026-01-15 (89 days ago)
- **Last evaluated:** 2026-03-01 (43 days ago)
- **Staleness threshold:** 14 days
- **Confidence score:** 92/100
### Changes Made
- Removed flag check in `src/checkout/CheckoutPage.tsx:42`
- Simplified conditional: if/else → single code path (enabled variant kept)
- Removed flag key constant from `src/config/flags.ts`
- Removed associated test case
### Impact Analysis
- **Dead code removed:** 34 lines
- **Test cases removed:** 1 (stale — testing disabled path)
- **Dependencies:** None
- **Breaking change:** No
### Verification
- ✅ Both enabled and disabled paths analyzed
- ✅ Test coverage maintained for remaining code path
- ✅ No other flags reference this code
*This PR was automatically generated by the FeatureSignals AI Janitor.
Please review carefully before merging.*

Human-in-the-Loop: The Critical Safeguard
We designed the AI Janitor with a firm principle: the AI proposes, the human decides. The system never merges PRs automatically, even for the highest-confidence removals. Every PR must be reviewed and merged by a team member with write access to the repository. This isn't a technical limitation — it's a deliberate design choice. Feature flags sometimes have non-obvious side effects that only the team that wrote them understands, and no LLM can fully capture that context.
Getting Started
- Connect your Git provider: In the FeatureSignals dashboard, navigate to Settings → Integrations and link your GitHub, GitLab, Bitbucket, or Azure DevOps account.
- Configure your LLM provider: Choose your preferred provider and provide an API key (or configure a self-hosted endpoint).
- Set staleness thresholds: Adjust the default thresholds if your team's release cadence differs from the defaults.
- Enable AI Janitor: Toggle it on per project. The first scan runs within 24 hours.
- Review the first batch: Check the dashboard for stale-flag candidates and review the first PRs the Janitor opens.
Tip
Start with the AI Janitor in 'suggest' mode (no auto-PRs) for the first week. This lets you calibrate the confidence thresholds and provider selection before the system starts opening PRs automatically.
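Put together, a first-week setup might look like the configuration below. Every field name here is a hypothetical illustration of the settings described above, not the product's actual schema:

```python
# Hypothetical per-project configuration for a calibration week in
# 'suggest' mode (no auto-PRs); all keys and values are assumptions.
project_config = {
    "ai_janitor": {
        "enabled": True,
        "mode": "suggest",        # calibrate before opening PRs automatically
        "scan_schedule": "daily",
        "llm_provider": "deepseek",
        "thresholds": {"release": 14, "experiment": 30, "ops": 90},
    }
}

def promote_to_auto_pr(config: dict) -> dict:
    """After calibration, let the Janitor open PRs on its own; the
    'auto-pr' mode name is illustrative."""
    config["ai_janitor"]["mode"] = "auto-pr"
    return config

print(project_config["ai_janitor"]["mode"])  # suggest
```

Once the first week's suggestions look sensible, flipping the mode is the only change needed; everything else in the pipeline stays the same.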