πΎ Archived View for nox.im βΊ posts βΊ 2022 βΊ 0327 βΊ observability-opentelemetry-context-propagation-iβ¦ captured on 2024-09-29 at 00:37:03. Gemini links have been rewritten to link to archived content
β¬ οΈ Previous capture (2023-09-28)
-=-=-=-=-=-=-
Observability is a superset of monitoring. It is the ability to measure a systems internal state by its outputs such as metrics, events, logs and traces (sometimes referred to as MELT). Its goal is a holistic understanding of one or more applications operation, usually deployed in the service oriented architecture model of distributed computing[1].
1: service oriented architecture model of distributed computing
Heisenberg's uncertainty principle states that more precisely we measure the position of a particle, less precisely we can know its velocity and vice versa.
Something similar is true with observability, the more additinal code we wrap around software interfaces, the more potential faults and performance issues we may have to deal with.
I haven't found a concise, minimal example combining OpenTelemetry with the Uber zap logging library. The example created in this post can be found on GitHub[1].
For instrumentation to work, we need a set of values that are always accessible in our code, from the beginning of a transaction in the web client, all the way through the to backend services, their application, business and data layers.
A "trace" is a record of the lifecycle of a request, including each service it interacts with including http server and databases. A "span" represents operations or segments that are part of a trace.
OpenTelemetry is becoming the one standard for open source instrumentation and telemetry collection and emerged from a merger of OpenCensus and OpenTracing that were preceeded by the likes of Zipkin. OpenZipkin described an http request carrying propagated traces as follows:
Client Tracer Server Tracer βββββββββββββββββββββββββ βββββββββββββββββββββββββ β β β β β TraceContext β Http Request Headers β TraceContext β β βββββββββββββββββββββ β βββββββββββββββββββββ β βββββββββββββββββββββ β β β TraceId β β β X-B3-TraceId β β β TraceId β β β β β β β β β β β β β β ParentSpanId β β Inject β X-B3-ParentSpanId β Extract β β ParentSpanId β β β β βββΌββββββββ>β βββββββββββΌ>β β β β β SpanId β β β X-B3-SpanId β β β SpanId β β β β β β β β β β β β β β Sampling decision β β β X-B3-Sampled β β β Sampling decision β β β βββββββββββββββββββββ β βββββββββββββββββββββ β βββββββββββββββββββββ β β β β β βββββββββββββββββββββββββ βββββββββββββββββββββββββ
For a trace create a `traceID` and every event in the transaction needs to include this id when it is recorded, so that we can find all of the events and reconstruct the trace. The Zipkin `X-B3-TraceId` headers have since been standardized as W3C Trace-Context headers[1] and becomes:
- version
- trace-id
- parent-id
- trace-flags
It is important that as we're merging systems to the OpenTelemetry standard, we have to ensure that client and server are configured to write and read the correct set of headers for a trace to be propagated.
Instrumenting in the microservice environment generally implies adding code to services to log, monitor and track trace data. To keep a sane cost benefit ratio of deploying tracing, it is therefore paramount that we write the least code possible to enable tracing and don't change much in our codebase.
Before we go into the packages, in our main we setup a logger and a tracer from the two packages we're going to create in the following. We also add the logger and the tracer onto our root context that we start passing around in our application. This avoids global variables.
func main() { log, _ := logger.New(zapcore.InfoLevel, true) ctx := logger.NewContext(context.Background(), log) tp := tracer.NewTraceProvider() tc := tp.Tracer("opentelemetry-logger/main") ctx = tracer.NewContext(ctx, tc) logger.FromContext(ctx).Info("start test without traces") // ... run the app or pass ctx into a logging middleware // these two examples explained below run(ctx) run2(ctx) }
The Go standard log[1] package is horribly unstructured, and I tend to use Uberβs zap logger[2] instead.
We're creating a simple log package that creates a structured zap logger:
func New(level zapcore.Level, structured bool, options ...zap.Option) (*zap.Logger, error) { cfg := zap.Config{ Level: zap.NewAtomicLevelAt(level), Encoding: "json", EncoderConfig: zap.NewProductionEncoderConfig(), OutputPaths: []string{"stderr"}, ErrorOutputPaths: []string{"stderr"}, } return cfg.Build(options...) }
and allows us to pass around the zap logger via context propagation. The semi standard is usually `FromContext(ctx) <type>` and `NewContext(ctx, <type>)` for this pattern.
type ctxKey struct{} func FromContext(ctx context.Context) *zap.Logger { c, _ := ctx.Value(ctxKey{}).(*zap.Logger) return c } func NewContext(parent context.Context, z *zap.Logger) context.Context { return context.WithValue(parent, ctxKey{}, z) }
We then modify the `FromContext()` function and inject, if available the OpenTelemetry trace and span IDs on a child logger:
func FromContext(ctx context.Context) *zap.Logger { childLogger, _ := ctx.Value(ctxKey{}).(*zap.Logger) if traceID := trace.SpanFromContext(ctx).SpanContext().TraceID(); traceID.IsValid() { childLogger = childLogger.With(zap.String("trace-id", traceID.String())) } if spanID := trace.SpanFromContext(ctx).SpanContext().SpanID(); spanID.IsValid() { childLogger = childLogger.With(zap.String("span-id", spanID.String())) } return childLogger }
Setup a trace provider and the context propagation helpers:
func NewTraceProvider() *sdktrace.TracerProvider { var err error exp, err := stdouttrace.New(stdouttrace.WithPrettyPrint()) if err != nil { log.Panicf("failed to initialize stdouttrace exporter %v\n", err) return nil } bsp := sdktrace.NewBatchSpanProcessor(exp) tp := sdktrace.NewTracerProvider( sdktrace.WithSampler(sdktrace.AlwaysSample()), sdktrace.WithSpanProcessor(bsp), ) otel.SetTracerProvider(tp) return tp } type ctxKey struct{} func FromContext(ctx context.Context) trace.Tracer { t, _ := ctx.Value(ctxKey{}).(trace.Tracer) return t } func NewContext(parent context.Context, t trace.Tracer) context.Context { return context.WithValue(parent, ctxKey{}, t) }
We're calling `run(ctx)` and `run2(ctx)` from our main and passing our context.
Note that this could also be a middleware to an http router, injecting trace and span on each request before its handler runs.
func run(ctx context.Context) { var span trace.Span ctx, span = tracer.FromContext(ctx).Start(ctx, "run") defer span.End() logger.FromContext(ctx).Info("span test 1 with trace") logger.FromContext(ctx).Info("span test 1 repeats trace and span id") } func run2(ctx context.Context) { var span trace.Span ctx, span = tracer.FromContext(ctx).Start(ctx, "run2") defer span.End() logger.FromContext(ctx).Info("span test 2 with new trace") run3(ctx) } func run3(ctx context.Context) { var span trace.Span ctx, span = tracer.FromContext(ctx).Start(ctx, "run3") defer span.End() logger.FromContext(ctx).Info("span test 3 repeates trace id from 2 with new span id") }
As we can see from the structured output, the trace and span ID on the first run are identical. For `run2` we started a new pair of trace and span and the sub handler `run3` shares the trace ID with its parent while the span ID was rotated.
{"level":"info","ts":1648373086.577307,"caller":"opentelemetry-logger/main.go:20","msg":"start test without traces"} {"level":"info","ts":1648373086.577437,"caller":"opentelemetry-logger/main.go:34","msg":"span test 1 with trace","trace-id":"505fd06ccf32b4904fc60b916623283a","span-id":"d0c7aaed0d60a5ab"} {"level":"info","ts":1648373086.577467,"caller":"opentelemetry-logger/main.go:35","msg":"span test 1 repeats trace and span id","trace-id":"505fd06ccf32b4904fc60b916623283a","span-id":"d0c7aaed0d60a5ab"} {"level":"info","ts":1648373086.577496,"caller":"opentelemetry-logger/main.go:43","msg":"span test 2 with new trace","trace-id":"9c6e95aa4d6254183108ddb4d62dfa8b","span-id":"65c6fb4b4489f025"} {"level":"info","ts":1648373086.577538,"caller":"opentelemetry-logger/main.go:52","msg":"span test 3 repeates trace id from 2 with new span id","trace-id":"9c6e95aa4d6254183108ddb4d62dfa8b","span-id":"2a7190e5e0ca8f16"} {"level":"info","ts":1648373086.577607,"caller":"opentelemetry-logger/main.go:26","msg":"exit test without traces"}
Unstructured, the output looks like this.
2022-03-27T17:25:22.191+0800 INFO opentelemetry-logger/main.go:20 start test without traces 2022-03-27T17:25:22.191+0800 INFO opentelemetry-logger/main.go:34 span test 1 with trace {"trace-id": "3282793f68a156aabf48335d7ba1f468", "span-id": "469843b9d0d1275a"} 2022-03-27T17:25:22.191+0800 INFO opentelemetry-logger/main.go:35 span test 1 repeats trace and span id {"trace-id": "3282793f68a156aabf48335d7ba1f468", "span-id": "469843b9d0d1275a"} 2022-03-27T17:25:22.191+0800 INFO opentelemetry-logger/main.go:43 span test 2 with new trace {"trace-id": "05525adea5a16be4be049bd336d604b3", "span-id": "339d84828277142f"} 2022-03-27T17:25:22.191+0800 INFO opentelemetry-logger/main.go:52 span test 3 repeates trace id from 2 with new span id {"trace-id": "05525adea5a16be4be049bd336d604b3", "span-id": "3cce817c9c297453"} 2022-03-27T17:25:22.191+0800 INFO opentelemetry-logger/main.go:26 exit test without traces
Context propagation is easily and often misused. It helps us keeping the bag of variables around required for observability and allows to narrow interfaces by implicitly passing the logger and other client packages. This helps us avoid globals and thereby removes shared mutable state which is never a good idea.
Don't wrap types, keep `*zap.Logger` in its native form and pass it around, wrapping types increases complexity unnecessarily and requires more code you need to adapt.
Don't wrap `zap.Info()`, `zap.Error()` etc. You're repeating work for no benefit and an increased LOC count.
Don't use the context object to pass function arguments along the way. The function signature should give away determining factors of the operation of the function and convey that to the reader, don't hide normal arguments.