Unlocking Feature Flag Observability
The video commentary section is a note section for videos I enjoyed watching.
Like many posts on this blog, these are my thoughts at the time, potentially even unedited before the video completes.
Invisible Changes
This one resonates with me. Checking the latest deployments only to find no changes. Then, later in the investigation, reaching the inevitable conclusion that a feature flag must be the culprit, and having to track down which flag was toggled on, or which ones are in ramp-up.
“Oh, I toggled that feature flag” 💢
Disparity Between Impact & Cause
Currently, production dashboards focus on showing you the impact, such as latency or error rate increases. That's a stark contrast to showing the cause, such as a feature flag being toggled on.
This is frustrating because the MTTR increases from seconds (toggle it off) to tens of minutes.
Observing
Level 0
Nothing 💩
Level 1 - Broadcast
Broadcast the change somewhere. This can be a Slack channel or an annotation on the timeline in your observability platform.
We don’t know if the flag is in use by the service, only if it has changed.
Thoughts - Basic implementation that doesn't take different architectures into account. For example, the flag might be enabled only for the EU, but you have both a US and an EU deployment. We lack that context, and the observability platform would have to understand it for the broadcast to be useful.
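As a concrete illustration of what a broadcast could look like, here's a minimal sketch assuming a Slack incoming webhook; the URL, flag name, and payload shape are my own placeholders, not from the talk.

```python
# Minimal sketch: broadcast a flag change to a shared Slack channel.
# SLACK_WEBHOOK_URL is a placeholder for a Slack incoming webhook.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."

def broadcast_flag_change(flag_key: str, new_state: str, actor: str) -> None:
    """Post a human-readable note whenever a flag is toggled or ramped."""
    text = f"Feature flag `{flag_key}` changed to *{new_state}* by {actor}"
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=5)

# Typically wired into the flag provider's change webhook:
broadcast_flag_change("new-checkout", "enabled", "alice")
```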
Level 2 - Targeted Broadcast (Unicast? 😅)
We target the change at specific services. Rather than a global broadcast, which might add noise to unrelated services, we attach the meta to the timeline of a specific service.
Thoughts - Still much the same as the above, really. I'd actually merge levels 1 and 2.
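For what that targeting could look like in practice, here's a rough sketch using Grafana's annotation HTTP API with a service tag, so only the affected service's dashboards pick it up; the URL, token, and tag scheme are assumptions on my part.

```python
# Rough sketch: annotate a specific service's timeline rather than broadcasting globally.
# Assumes Grafana's annotation HTTP API; URL, token, and tag names are placeholders.
import time
import requests

GRAFANA_URL = "https://grafana.example.com"
GRAFANA_TOKEN = "replace-me"

def annotate_service(service: str, flag_key: str, new_state: str) -> None:
    """Create an annotation tagged with the service so unrelated dashboards stay quiet."""
    requests.post(
        f"{GRAFANA_URL}/api/annotations",
        headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
        json={
            "time": int(time.time() * 1000),  # epoch milliseconds
            "tags": ["feature-flag", f"service:{service}"],
            "text": f"{flag_key} -> {new_state}",
        },
        timeout=5,
    )

annotate_service("checkout-api", "new-checkout", "enabled")
```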
Level 3 - Automated Targeted Broadcast
Same as above, but with automated detection of which services use the flag.
Thoughts - Kinda sucks if we have to work it out ourselves; it should really be part of the flag design.
Level 4 - Trace Level
Attach the flag meta to the trace. See the standards section for more details.
Thoughts - Ding ding - This is what I've personally rolled out before. It can massively spike the amount of data being sent, depending on how it's implemented.
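For reference, this is roughly how it looks with the OpenTelemetry Python API, setting flag attributes on the active span; the service, flag, and provider names are made up, and the evaluation itself is a stand-in.

```python
# Sketch: attach feature flag evaluation metadata to the current span.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service

def evaluate_flag(flag_key: str, user_id: str) -> bool:
    """Stand-in for a real flag provider call that also records the result on the span."""
    enabled = hash(user_id) % 2 == 0  # pretend evaluation
    span = trace.get_current_span()
    span.set_attribute("feature_flag.key", flag_key)
    span.set_attribute("feature_flag.result.value", enabled)
    span.set_attribute("feature_flag.provider.name", "example-provider")
    return enabled

with tracer.start_as_current_span("handle_checkout"):
    if evaluate_flag("new-checkout", "user-123"):
        ...  # new code path
```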
Standards
The obvious ones:
feature_flag.key - The flag identifier
feature_flag.result.variant - Which variant was served
feature_flag.result.value - The value returned by the flag
feature_flag.result.reason - Why this variant was served
feature_flag.provider.name - The name of the provider
The less obvious or nuanced ones:
feature_flag.set.id - Human-readable logical identifier of where the flag is managed
feature_flag.context.id - Provider's context identifier, used to look up specific evaluations
feature_flag.version - Version of the rule at evaluation time
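Pulling those together, one way to record an evaluation is as a span event carrying the attributes above; the event name and all values here are illustrative placeholders, not something the talk prescribes.

```python
# Sketch: record one flag evaluation as a span event using the attribute names above.
# All values are illustrative placeholders.
from opentelemetry import trace

span = trace.get_current_span()
span.add_event(
    "feature_flag.evaluation",
    attributes={
        "feature_flag.key": "new-checkout",
        "feature_flag.result.variant": "treatment",
        "feature_flag.result.value": True,
        "feature_flag.result.reason": "TARGETING_MATCH",
        "feature_flag.provider.name": "example-provider",
        "feature_flag.set.id": "my-project/production",
        "feature_flag.context.id": "ctx-abc123",
        "feature_flag.version": "7",
    },
)
```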
Final Thoughts
This was a good one. I enjoyed the talk because it's a real problem I keep encountering. When I tried to solve it with traces I went with a very similar approach, but it produced crazy amounts of data. I think switching to trace events should help with this, but it may end up bloating payloads or being less discoverable in certain observability platforms due to lackluster event support. Translating those events into metrics could help too if the cardinality is low, though that would lose useful context beyond the total counts of positive and negative evaluations for a flag (or just positive).
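As a sketch of that metrics idea, a counter keyed only by flag key and variant keeps cardinality low at the cost of per-request context; the meter, metric, and flag names here are my own assumptions.

```python
# Sketch: low-cardinality counter of flag evaluations by flag and variant.
from opentelemetry import metrics

meter = metrics.get_meter("checkout-service")  # hypothetical service
flag_evaluations = meter.create_counter(
    "feature_flag.evaluations",
    description="Feature flag evaluations by flag key and variant",
)

def record_evaluation(flag_key: str, variant: str) -> None:
    # Only the flag key and variant are attached, deliberately dropping per-user
    # context to keep metric cardinality low (the trade-off noted above).
    flag_evaluations.add(
        1,
        {"feature_flag.key": flag_key, "feature_flag.result.variant": variant},
    )

record_evaluation("new-checkout", "treatment")
```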