Unlocking Feature Flag Observability
The video commentary section is a note section for videos I enjoyed watching.
Like many posts on this blog, these are my thoughts at the time, potentially even unedited before the video completes.
Invisible Changes
This one resonates with me. Checking the latest deployments only to find no changes. Then, later in the investigation, reaching the inevitable conclusion that a feature flag must be the culprit, and having to track down which flag was toggled on, or which ones are in ramp-up.
“Oh, I toggled that feature flag” 💢
Disparity Between Impact & Cause
Currently, production dashboards focus on showing you the impact, such as latency or error rate increases. That's a stark contrast to showing the cause, such as a feature flag being toggled on.
This is frustrating because the MTTR increases from seconds (toggle it off) to tens of minutes.
Observing
Level 0
Nothing 💩
Level 1 - Broadcast
Broadcast the change somewhere. This can be a Slack channel or an annotation on the timeline in your observability platform.
We don’t know if the flag is in use by the service, only if it has changed.
Thoughts - Basic implementation that doesn't take different architectures into account. For example, the flag might be enabled only for the EU, but you have both a US and an EU deployment. We lack that context, and the observability platform would have to understand it for the broadcast to be useful.
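As a concrete illustration of what a broadcast could look like, here's a minimal sketch assuming a Slack incoming webhook; the URL, flag name, and payload shape are my own placeholders, not from the talk.

```python
# Minimal sketch: broadcast a flag change to a shared Slack channel.
# SLACK_WEBHOOK_URL is a placeholder for a Slack incoming webhook.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."

def broadcast_flag_change(flag_key: str, new_state: str, actor: str) -> None:
    """Post a human-readable note whenever a flag is toggled or ramped."""
    text = f"Feature flag `{flag_key}` changed to *{new_state}* by {actor}"
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=5)

# Typically wired into the flag provider's change webhook:
broadcast_flag_change("new-checkout", "enabled", "alice")
```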
Level 2 - Targeted Broadcast (Unicast? 😅)
We target the change at specific services. Rather than a global broadcast, which might add noise to unrelated services, we attach the meta to the timeline of a specific service.
Thoughts - Still much the same as the above, really. I'd actually merge levels 1 and 2.
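For what that targeting could look like in practice, here's a rough sketch using Grafana's annotation HTTP API with a service tag, so only the affected service's dashboards pick it up; the URL, token, and tag scheme are assumptions on my part.

```python
# Rough sketch: annotate a specific service's timeline rather than broadcasting globally.
# Assumes Grafana's annotation HTTP API; URL, token, and tag names are placeholders.
import time
import requests

GRAFANA_URL = "https://grafana.example.com"
GRAFANA_TOKEN = "replace-me"

def annotate_service(service: str, flag_key: str, new_state: str) -> None:
    """Create an annotation tagged with the service so unrelated dashboards stay quiet."""
    requests.post(
        f"{GRAFANA_URL}/api/annotations",
        headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
        json={
            "time": int(time.time() * 1000),  # epoch milliseconds
            "tags": ["feature-flag", f"service:{service}"],
            "text": f"{flag_key} -> {new_state}",
        },
        timeout=5,
    )

annotate_service("checkout-api", "new-checkout", "enabled")
```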
Level 3 - Automated Targeted Broadcast
Same as above, but with automated detection of which services use the flag.
Thoughts - Kinda sucks if we have to work it out ourselves; it should really be part of the flag design.
Level 4 - Trace Level
Attach the flag meta to the trace. See the standards section for more details.
Thoughts - Ding ding - This is what I've personally rolled out before. It can massively spike the amount of data being sent, depending on how it's implemented.
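For reference, this is roughly how it looks with the OpenTelemetry Python API, setting flag attributes on the active span; the service, flag, and provider names are made up, and the evaluation itself is a stand-in.

```python
# Sketch: attach feature flag evaluation metadata to the current span.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service

def evaluate_flag(flag_key: str, user_id: str) -> bool:
    """Stand-in for a real flag provider call that also records the result on the span."""
    enabled = hash(user_id) % 2 == 0  # pretend evaluation
    span = trace.get_current_span()
    span.set_attribute("feature_flag.key", flag_key)
    span.set_attribute("feature_flag.result.value", enabled)
    span.set_attribute("feature_flag.provider.name", "example-provider")
    return enabled

with tracer.start_as_current_span("handle_checkout"):
    if evaluate_flag("new-checkout", "user-123"):
        ...  # new code path
```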
Standards
The obvious ones:
feature_flag.key - The flag identifier
feature_flag.result.variant - Which variant was served
feature_flag.result.value - The value returned by the flag
feature_flag.result.reason - Why this variant was served
feature_flag.provider.name - The name of the provider
The less obvious or nuanced ones:
feature_flag.set.id - Human-readable logical identifier of where the flag is managed
feature_flag.context.id - Provider's context identifier, used to look up specific evaluations
feature_flag.version - Version of the rule at evaluation time
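Pulling those together, one way to record an evaluation is as a span event carrying the attributes above; the event name and all values here are illustrative placeholders, not something the talk prescribes.

```python
# Sketch: record one flag evaluation as a span event using the attribute names above.
# All values are illustrative placeholders.
from opentelemetry import trace

span = trace.get_current_span()
span.add_event(
    "feature_flag.evaluation",
    attributes={
        "feature_flag.key": "new-checkout",
        "feature_flag.result.variant": "treatment",
        "feature_flag.result.value": True,
        "feature_flag.result.reason": "TARGETING_MATCH",
        "feature_flag.provider.name": "example-provider",
        "feature_flag.set.id": "my-project/production",
        "feature_flag.context.id": "ctx-abc123",
        "feature_flag.version": "7",
    },
)
```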
Final Thoughts
This was a good one. I enjoyed the talk because it's a real problem I keep encountering. When I tried to solve it with traces I went with a very similar approach, but it produced crazy amounts of data. I think switching to trace events should help with this, but it may end up bloating payloads or being less discoverable in certain observability platforms due to lackluster event support. Translating those events into metrics could help too if the cardinality is low, though that would lose useful context beyond the total counts of positive and negative evaluations for a flag (or just positive).
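As a sketch of that metrics idea, a counter keyed only by flag key and variant keeps cardinality low at the cost of per-request context; the meter, metric, and flag names here are my own assumptions.

```python
# Sketch: low-cardinality counter of flag evaluations by flag and variant.
from opentelemetry import metrics

meter = metrics.get_meter("checkout-service")  # hypothetical service
flag_evaluations = meter.create_counter(
    "feature_flag.evaluations",
    description="Feature flag evaluations by flag key and variant",
)

def record_evaluation(flag_key: str, variant: str) -> None:
    # Only the flag key and variant are attached, deliberately dropping per-user
    # context to keep metric cardinality low (the trade-off noted above).
    flag_evaluations.add(
        1,
        {"feature_flag.key": flag_key, "feature_flag.result.variant": variant},
    )

record_evaluation("new-checkout", "treatment")
```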