Envoy Log Cost Reduction


The video commentary section is where I keep notes on videos I enjoyed watching.

Like many posts on this blog, these are my thoughts at the time - potentially unedited, and possibly written before the video even finishes.


  • Double logging - The same or similar data being logged multiple times. A no-brainer to cut.
  • Filter details - What was the result of a filter? That doesn't sound like data worth logging on every request. Could it be sampled or aggregated instead? Sampling ended up being the solution.
  • Double cost
    • Emitting the logs was itself a significant cost (presumably resource-wise - CPU & egress?)
    • Storing the logs was the second cost factor
  • Expression convenience - CEL-based extraction was convenient, but carried a 17% CPU cost on the real workload, versus 3.8% of a core for a limited set of fields extracted directly (see the sketch after this list).
  • Logging filters - switching to access-log filters gave about a 1:600 improvement in CPU usage.
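To make that extraction trade-off concrete, here is a rough sketch of how both approaches look in an Envoy file access logger - the file path, JSON field names, and the CEL expression are my own illustrative assumptions, not details from the video.

```yaml
# Sketch only - path, field names and the CEL expression are illustrative assumptions.
access_log:
- name: envoy.access_loggers.file
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: /var/log/envoy/access.log
    log_format:
      json_format:
        # Direct command operators: cheap, no per-request expression evaluation.
        status: "%RESPONSE_CODE%"
        duration_ms: "%DURATION%"
        path: "%REQ(:PATH)%"
        # CEL extraction: convenient for arbitrary expressions, but evaluated on
        # every logged request - which is where the extra CPU goes.
        cel_path: "%CEL(request.path)%"
      formatters:
      - name: envoy.formatter.cel
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.formatter.cel.v3.Cel
```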

Recommendations

Tier 1 - Essential (Always on)

  • 100% sampling
  • fields:
    • status
    • duration
    • bytes
    • path
    • method
  • Use cases: Compliance, basic debugging
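A minimal sketch of what Tier 1 might look like as an always-on Envoy file access logger - the path and exact JSON field names are assumptions on my part.

```yaml
# Tier 1 sketch: always on, minimal fields (path and field names are assumptions).
access_log:
- name: envoy.access_loggers.file
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: /var/log/envoy/tier1.log
    log_format:
      json_format:
        status: "%RESPONSE_CODE%"
        duration_ms: "%DURATION%"
        bytes_sent: "%BYTES_SENT%"
        bytes_received: "%BYTES_RECEIVED%"
        path: "%REQ(:PATH)%"
        method: "%REQ(:METHOD)%"
```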

Tier 2 - Debug (Errors only)

  • 1-5% sampling
  • fields:
    • (Tier 1 fields)
    • response_flags
    • upstream_service
    • request_id
  • Use cases: full error context
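For Tier 2, one way to express "errors only, sampled" is an and_filter combining a status_code_filter with a runtime_filter - the 5% rate, runtime keys, and path below are illustrative choices, not from the video.

```yaml
# Tier 2 sketch: error responses only, sampled at 5% (keys, path and rate are illustrative).
access_log:
- name: envoy.access_loggers.file
  filter:
    and_filter:
      filters:
      - status_code_filter:
          comparison:
            op: GE
            value:
              default_value: 500
              runtime_key: access_log.tier2_min_status
      - runtime_filter:
          runtime_key: access_log.tier2_sample
          percent_sampled:
            numerator: 5
            denominator: HUNDRED
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: /var/log/envoy/tier2.log
    log_format:
      json_format:
        # Tier 1 fields, plus:
        status: "%RESPONSE_CODE%"
        duration_ms: "%DURATION%"
        path: "%REQ(:PATH)%"
        method: "%REQ(:METHOD)%"
        response_flags: "%RESPONSE_FLAGS%"
        upstream_service: "%UPSTREAM_CLUSTER%"
        request_id: "%REQ(X-REQUEST-ID)%"
```

This is presumably also the shape of the 1:600 filter win noted above: requests that don't match the filter never get formatted or emitted by this logger at all.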

Tier 3 - Deep diagnostics

  • 10% sampling
  • fields:
    • (Tier 1 fields)
    • (Tier 2 fields)
    • metadata
    • timings
  • Use cases: Deep diagnostics
  • Alternative - Tracing?
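A Tier 3 sketch with a plain 10% runtime_filter and the heavier fields - the runtime key, path, timing fields, and metadata namespace are illustrative assumptions, not from the video.

```yaml
# Tier 3 sketch: 10% sampled deep diagnostics (key, path and metadata namespace are illustrative).
access_log:
- name: envoy.access_loggers.file
  filter:
    runtime_filter:
      runtime_key: access_log.tier3_sample
      percent_sampled:
        numerator: 10
        denominator: HUNDRED
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: /var/log/envoy/tier3.log
    log_format:
      json_format:
        # Tier 1 and Tier 2 fields would be repeated here, plus:
        metadata: "%DYNAMIC_METADATA(envoy.lb)%"
        request_duration_ms: "%REQUEST_DURATION%"
        response_duration_ms: "%RESPONSE_DURATION%"
        response_tx_duration_ms: "%RESPONSE_TX_DURATION%"
```

If this tier exists mainly for occasional deep dives, a tracing backend with its own sampling may be the cheaper home for this data - which is presumably the "alternative" flagged above.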