
How We Built a Sub-Millisecond Feature Flag Evaluation Engine in Go

A deep technical walkthrough of our stateless evaluation engine: 9-step short-circuit evaluation, hand-rolled MurmurHash3 consistent hashing for percentage rollouts, zero heap allocation on the hot path, and in-memory ruleset caching with PostgreSQL LISTEN/NOTIFY invalidation. With benchmarks and code.

FeatureSignals Engineering Team
May 2026 · 14 min read

Why Sub-Millisecond Matters

Feature flag evaluation sits on your application's critical path. Every time your code checks `isFeatureEnabled('new-checkout')`, it blocks the request until the evaluation completes. If evaluation takes 50ms and you check 5 flags per request, you've just added 250ms of latency. For a service handling 10,000 requests per second, that's 2,500 seconds of cumulative delay piling up every second across all in-flight requests — a recipe for queueing, timeouts, and degraded user experience.

When we designed the FeatureSignals evaluation engine, we set an aggressive target: p99 evaluation latency under 1 millisecond, excluding network. Not p50. Not average. p99. The long tail is what kills you in production. Every engineer has been paged at 3 AM because a seemingly innocuous feature flag check started timing out under load. We wanted to make that scenario impossible by design.

ℹ️

Info

Target: <1ms p99 evaluation latency. No database calls on the hot path. Zero heap allocations per evaluation. Stateless design for horizontal scalability.

Architecture Overview

The evaluation engine is a pure function. It takes three inputs — a flag key, an evaluation context (user attributes, environment), and a pre-computed ruleset — and returns a resolution: which variation to serve, the reason for the decision, and any associated metadata. There are no side effects, no I/O, and no mutable state. This purity is what makes the engine fast, testable, and safe to call concurrently from thousands of goroutines.
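
To make those shapes concrete, here is a simplified sketch of the two caller-facing types. The field names are illustrative rather than the exact production definitions, but the shape is faithful: a context identifying the user, and a small, value-typed resolution.

go
// Illustrative sketches of the evaluation types. The production structs
// carry more fields, but the shape is the same.

// EvalContext describes who the flag is being evaluated for.
type EvalContext struct {
    UserID     string            // stable user or device identifier
    Attributes map[string]string // custom attributes consulted by targeting rules
}

// Resolution is the outcome of a single evaluation.
type Resolution struct {
    Variation string            // which variation to serve
    Reason    string            // why, e.g. "TARGETED" or "KILLED"
    Metadata  map[string]string // optional extras; stays nil on the hot path
}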

The ruleset — the complete configuration for every flag in an environment — is computed asynchronously whenever a flag changes and cached in memory. The hot path never touches the database. A PostgreSQL LISTEN/NOTIFY channel broadcasts cache invalidation events to every server instance, ensuring all nodes converge on the same ruleset within milliseconds of a change.

The 9-Step Evaluation Flow

Every flag evaluation follows a deterministic 9-step pipeline. Each step can short-circuit — if a step produces a definitive answer, the remaining steps are skipped. Here's the flow:

  1. Flag lookup: Does the flag exist and is it enabled? If not, return the default variation immediately.
  2. Kill switch: Is the flag globally killed? If so, serve the kill-switch variation.
  3. Individual targeting: Is this specific user targeted to a particular variation?
  4. Segment matching: Does the user belong to any targeting segments?
  5. Percentage rollout: For percentage-based rollouts, hash the user identifier consistently to determine which bucket they fall into.
  6. Rule evaluation: Evaluate custom rules in priority order (attribute matches, date ranges, semantic version comparisons).
  7. Prerequisite flags: If this flag depends on another flag, evaluate the prerequisite first.
  8. Experiment assignment: For A/B experiments, assign the user to a variant and track the impression.
  9. Default: Return the flag's default variation.

Steps 1–4 cover the majority of real-world evaluations and complete in under 100 nanoseconds combined. The hashing at step 5 adds ~200ns. Custom rules at step 6 vary in cost depending on rule complexity, but the average is under 500ns. The entire pipeline typically resolves in under 800 nanoseconds on production hardware.
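
Most of these steps are simple lookups; the most involved one is step 6. Here is a simplified sketch of a rule check, with illustrative type names (CompiledFlag, Rule, Clause) and only equality clauses shown; the real matcher also handles date ranges and semantic version comparisons.

go
// matchRules sketches step 6: walk rules in priority order, first match
// wins. Type names are illustrative, and only equality clauses are shown;
// production clauses also cover date ranges and semver operators.
func (e *Engine) matchRules(flag CompiledFlag, ctx *EvalContext) (Resolution, bool) {
    for _, rule := range flag.Rules { // pre-sorted by priority at compile time
        matched := true
        for _, clause := range rule.Clauses {
            // Each clause compares one context attribute to an expected
            // value, e.g. attribute "plan" equals "enterprise".
            if ctx.Attributes[clause.Attribute] != clause.Value {
                matched = false
                break
            }
        }
        if matched {
            return Resolution{Variation: rule.Variation, Reason: "RULE_MATCH"}, true
        }
    }
    return Resolution{}, false // no rule matched; evaluation falls through
}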

MurmurHash3 Consistent Hashing for Percentage Rollouts

Percentage rollouts require deterministic, uniform distribution. If you roll out to 10% of users, the same user must always land in the same bucket, and the distribution across all users must be statistically uniform. We use MurmurHash3's 128-bit variant for this, combining the flag key with the user identifier to produce a stable hash value.

go
// ConsistentHash computes a stable integer in [0, 100) for a given
// flag key and user identifier. The same (flagKey, userID) pair
// always produces the same bucket, enabling deterministic percentage
// rollouts across evaluation events and server instances.
func ConsistentHash(flagKey string, userID string) uint32 {
    // Combine flag key and user ID with a separator that cannot
    // appear in base64url-encoded identifiers, preventing collisions
    // between (flag="abc", user="def") and (flag="ab", user="cdef").
    input := flagKey + ":" + userID

    // MurmurHash3 128-bit; XOR the two 64-bit halves together.
    h1, h2 := murmur3.Sum128([]byte(input))
    combined := h1 ^ h2

    // Fold 64 bits into 32 with XOR folding for extra mixing.
    folded := uint32(combined) ^ uint32(combined>>32)

    // Modulo bias is negligible for 32-bit values modulo 100.
    // The error is under 0.00001% and has no practical impact on
    // rollout uniformity.
    return folded % 100
}

The `murmur3.Sum128` call is the single most expensive operation on the hot path, accounting for roughly 60% of evaluation time. We evaluated xxHash, HighwayHash, and SipHash as alternatives. MurmurHash3 won on the combination of speed (~120ns per call on modern x86), distribution quality (chi-squared and Kolmogorov-Smirnov tests show no significant deviation from a uniform distribution), and the fact that it's non-cryptographic — we don't need collision resistance against adversarial input, just uniform distribution.
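
The bucket then feeds the rollout decision directly. A minimal sketch, assuming the rollout percentage is stored as an integer in [0, 100]:

go
// inRollout sketches how the bucket is consumed. rolloutPercent is the
// configured rollout size, e.g. 10 for a 10% rollout.
func inRollout(flagKey, userID string, rolloutPercent uint32) bool {
    // Buckets 0..rolloutPercent-1 are in. Because the hash is stable,
    // ramping from 10% to 20% only adds users to the rollout; nobody
    // who already has the flag ever loses it.
    return ConsistentHash(flagKey, userID) < rolloutPercent
}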

Zero Heap Allocation Design

The Go garbage collector is excellent, but it's not free. Every heap allocation adds GC pressure, and on a high-throughput evaluation path processing millions of requests per minute, even modest allocation rates compound into measurable latency spikes during GC cycles. Our design rule: the `Evaluate` function must allocate zero bytes on the heap.

go
// Evaluate resolves a flag for a given context. It is the sole entry
// point to the evaluation engine and is designed to be:
//
//   - Allocation-free: zero heap allocations on the hot path.
//   - Safe for concurrent use: no shared mutable state.
//   - Inlineable: the compiler can inline common code paths.
func (e *Engine) Evaluate(
    flagKey string,
    ctx *EvalContext,
    ruleset *CompiledRuleset,
) (resolution Resolution) {
    // Step 1: Flag lookup. A missing flag and a disabled flag are
    // distinct cases: a missing flag has no default variation to serve.
    flag, ok := ruleset.Flags[flagKey]
    if !ok {
        return Resolution{Reason: "FLAG_NOT_FOUND"}
    }
    if !flag.Enabled {
        return Resolution{
            Variation: flag.DefaultVariation,
            Reason:    "FLAG_DISABLED",
        }
    }

    // Step 2: Kill switch
    if flag.Killed {
        return Resolution{
            Variation: flag.KillVariation,
            Reason:    "KILLED",
        }
    }

    // Step 3: Individual targeting (O(1) map lookup)
    if variation, ok := flag.Targets[ctx.UserID]; ok {
        return Resolution{
            Variation: variation,
            Reason:    "TARGETED",
        }
    }

    // Steps 4-9: Segment matching, rules, prerequisites, default
    return e.evaluateRules(flag, ctx)
}

Several techniques keep the hot path allocation-free. The `Resolution` struct is small enough to be returned by value (three fields: a string header for variation, a string header for reason, and a small metadata map that starts nil). The `EvalContext` is passed by pointer but its lifetime is stack-scoped. The `CompiledRuleset` is a read-only structure shared across goroutines via an atomic pointer swap. All intermediate values — hash results, bucket assignments, comparison booleans — stay on the stack.

💡

Tip

We verify zero-allocation claims with `go test -bench=. -benchmem`. Every evaluation benchmark must report `0 allocs/op`. This is enforced in CI — if a code change introduces an allocation on the hot path, the benchmark fails and the PR is blocked.
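
The benchmark that enforces this is conventional Go tooling. A trimmed-down sketch, with the engine and ruleset construction helpers left as placeholders:

go
// BenchmarkEvaluate is a simplified version of the allocation gate.
// NewEngine and buildTestRuleset stand in for the real test fixtures;
// the CI check fails if allocs/op is ever non-zero.
func BenchmarkEvaluate(b *testing.B) {
    engine := NewEngine()
    ruleset := buildTestRuleset(500) // 500 flags, ~3 rules each
    ctx := &EvalContext{UserID: "user-12345"}

    b.ReportAllocs() // report allocs/op even without -benchmem
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = engine.Evaluate("new-checkout", ctx, ruleset)
    }
}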

In-Memory Ruleset Caching with PG LISTEN/NOTIFY

A feature flag engine that queries the database on every evaluation is a non-starter for sub-millisecond latency. Instead, we maintain a pre-computed, flattened representation of every flag and its rules in memory. This `CompiledRuleset` is a single immutable structure — a map of flag keys to their compiled configurations, plus pre-computed indexes for segments and prerequisite chains.
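
In simplified form, with illustrative field names, the structure looks roughly like this:

go
// CompiledRuleset is the flattened, read-only snapshot the engine
// evaluates against. Field names are simplified for illustration.
type CompiledRuleset struct {
    // Flags maps flag key -> fully flattened configuration: targets,
    // rules, rollout percentages, and variations.
    Flags map[string]CompiledFlag

    // Segments are pre-resolved so membership checks on the hot path
    // are map lookups rather than rule walks.
    Segments map[string]CompiledSegment

    // Prerequisites stores each flag's prerequisite chain, pre-ordered
    // at compile time so cycles are rejected before evaluation.
    Prerequisites map[string][]string
}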

When a flag is created, updated, or deleted in the management API, the server writes the change to PostgreSQL and emits a `NOTIFY` on a dedicated channel. Every server instance listens on this channel via a persistent connection. On receiving a notification, the instance re-reads the relevant environment's flag configuration from the database, recompiles the ruleset, and atomically swaps the pointer:

go
// CacheManager maintains the in-memory ruleset cache and listens for
// PostgreSQL NOTIFY events to trigger cache invalidation.
type CacheManager struct {
    current atomic.Pointer[CompiledRuleset]
    store   domain.EvalStore
    logger  *slog.Logger
}

// Listen starts the LISTEN loop. It blocks until ctx is cancelled.
func (cm *CacheManager) Listen(ctx context.Context, conn *pgx.Conn) error {
    _, err := conn.Exec(ctx, "LISTEN ruleset_invalidation")
    if err != nil {
        return fmt.Errorf("listen: %w", err)
    }

    for {
        notification, err := conn.WaitForNotification(ctx)
        if err != nil {
            if errors.Is(err, context.Canceled) {
                return nil
            }
            return fmt.Errorf("wait for notification: %w", err)
        }

        envID := notification.Payload
        if err := cm.rebuildRuleset(ctx, envID); err != nil {
            cm.logger.Error("failed to rebuild ruleset",
                "env_id", envID,
                "error", err,
            )
        }
    }
}

func (cm *CacheManager) Get() *CompiledRuleset {
    return cm.current.Load()
}

The atomic pointer swap ensures that evaluations are never blocked by a cache rebuild. Readers always see a consistent snapshot. The rebuild itself takes 10–50ms depending on flag count, but since it happens outside the hot path, it doesn't affect evaluation latency. Multiple server instances receive the NOTIFY within milliseconds of each other, so all nodes converge quickly.
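
The `rebuildRuleset` call referenced in `Listen` isn't shown above. A simplified sketch, with the store method and compile step given illustrative names, captures the important part: the database read and recompilation happen off the hot path, and publishing is a single atomic store.

go
// rebuildRuleset sketches the rebuild path. LoadEnvironmentFlags and
// CompileRuleset are illustrative names for the store read and the
// compilation step; the publish itself is one atomic pointer store,
// so readers never observe a partially built ruleset.
func (cm *CacheManager) rebuildRuleset(ctx context.Context, envID string) error {
    flags, err := cm.store.LoadEnvironmentFlags(ctx, envID)
    if err != nil {
        return fmt.Errorf("load flags for env %s: %w", envID, err)
    }

    // Compile into the flattened, read-only snapshot used by Evaluate.
    compiled := CompileRuleset(flags)

    // Publish atomically. In-flight evaluations keep the old snapshot
    // they already loaded; new calls to Get() see the new one.
    cm.current.Store(compiled)
    return nil
}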

Benchmarks

We benchmark the evaluation engine on an AWS c7g.xlarge (4 vCPU, 8 GB RAM, Graviton3) against a ruleset containing 500 flags with an average of 3 targeting rules each. The benchmark evaluates 100,000 random (flag key, user ID) pairs from a pre-generated pool to simulate realistic access patterns.

text
Benchmark Results  500 flags, 3 rules/flag avg, c7g.xlarge (Graviton3)

  p50:   420 ns/op    0 allocs/op
  p99:   780 ns/op    0 allocs/op
  p999:  950 ns/op    0 allocs/op

  Throughput (single core):  2,380,000 evaluations/sec
  Throughput (4 cores):      9,100,000 evaluations/sec

  Memory:  Ruleset size for 500 flags = ~2.1 MB (resident)
           Zero additional memory per evaluation

At p99 = 780 nanoseconds, the evaluation engine adds less than 1 microsecond to each request — well within our 1ms target, with three orders of magnitude of headroom. The zero-allocation guarantee means the engine adds no GC pressure of its own: it generates no garbage, so even sustained evaluation volume never forces the collector to run more often or pause for longer.

Trade-Offs and Future Improvements

No design is without trade-offs. The ruleset compilation step trades write-time latency for read-time speed — flag updates take 10–50ms to propagate, which is acceptable for configuration changes that happen infrequently relative to evaluation volume. The atomic pointer swap means that during a rebuild, the old ruleset stays in memory until the new one replaces it, briefly doubling memory usage for the ruleset (~2 MB → ~4 MB for 500 flags). This is well within budget.

On the roadmap: we're exploring SIMD-accelerated rule matching for environments with thousands of flags, and a WASM compilation target so the evaluation engine can run embedded in edge functions and mobile SDKs. The pure-function design makes this feasible — the engine has no dependencies beyond the Go standard library and the MurmurHash3 package.

ℹ️

Info

The evaluation engine is open source under Apache 2.0. You can read the full source, run the benchmarks yourself, and contribute improvements at github.com/dinesh-g1/featuresignals.