💾 Archived View for nox.im › posts › 2022 › 0327 › observability-opentelemetry-context-propagation-i… captured on 2024-03-21 at 15:06:44. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

Observability: OpenTelemetry Context Propagation in Go

Observability is a superset of monitoring. It is the ability to measure a systems internal state by its outputs such as metrics, events, logs and traces (sometimes referred to as MELT). Its goal is a holistic understanding of one or more applications operation, usually deployed in the service oriented architecture model of distributed computing[1].

1: service oriented architecture model of distributed computing

Heisenberg's uncertainty principle states that more precisely we measure the position of a particle, less precisely we can know its velocity and vice versa.

Something similar is true with observability, the more additinal code we wrap around software interfaces, the more potential faults and performance issues we may have to deal with.

I haven't found a concise, minimal example combining OpenTelemetry with the Uber zap logging library. The example created in this post can be found on GitHub[1].

1: on GitHub

Deploying Instrumentation

For instrumentation to work, we need a set of values that are always accessible in our code, from the beginning of a transaction in the web client, all the way through the to backend services, their application, business and data layers.

A "trace" is a record of the lifecycle of a request, including each service it interacts with including http server and databases. A "span" represents operations or segments that are part of a trace.

OpenTelemetry is becoming the one standard for open source instrumentation and telemetry collection and emerged from a merger of OpenCensus and OpenTracing that were preceeded by the likes of Zipkin. OpenZipkin described an http request carrying propagated traces as follows:

   Client Tracer                                                  Server Tracer     
┌───────────────────────┐                                       ┌───────────────────────┐
│                       │                                       │                       │
│   TraceContext        │          Http Request Headers         │   TraceContext        │
│ ┌───────────────────┐ │         ┌───────────────────┐         │ ┌───────────────────┐ │
│ │ TraceId           │ │         │ X-B3-TraceId      │         │ │ TraceId           │ │
│ │                   │ │         │                   │         │ │                   │ │
│ │ ParentSpanId      │ │ Inject  │ X-B3-ParentSpanId │ Extract │ │ ParentSpanId      │ │
│ │                   ├─┼────────>│                   ├─────────┼>│                   │ │
│ │ SpanId            │ │         │ X-B3-SpanId       │         │ │ SpanId            │ │
│ │                   │ │         │                   │         │ │                   │ │
│ │ Sampling decision │ │         │ X-B3-Sampled      │         │ │ Sampling decision │ │
│ └───────────────────┘ │         └───────────────────┘         │ └───────────────────┘ │
│                       │                                       │                       │
└───────────────────────┘                                       └───────────────────────┘

For a trace create a `traceID` and every event in the transaction needs to include this id when it is recorded, so that we can find all of the events and reconstruct the trace. The Zipkin `X-B3-TraceId` headers have since been standardized as W3C Trace-Context headers[1] and becomes:

1: W3C Trace-Context headers

- version

- trace-id

- parent-id

- trace-flags

It is important that as we're merging systems to the OpenTelemetry standard, we have to ensure that client and server are configured to write and read the correct set of headers for a trace to be propagated.

Instrumenting in the microservice environment generally implies adding code to services to log, monitor and track trace data. To keep a sane cost benefit ratio of deploying tracing, it is therefore paramount that we write the least code possible to enable tracing and don't change much in our codebase.

Observability with OpenTracing and Uber Zap

Before we go into the packages, in our main we setup a logger and a tracer from the two packages we're going to create in the following. We also add the logger and the tracer onto our root context that we start passing around in our application. This avoids global variables.

func main() {
	log, _ := logger.New(zapcore.InfoLevel, true)
	ctx := logger.NewContext(context.Background(), log)

	tp := tracer.NewTraceProvider()
	tc := tp.Tracer("opentelemetry-logger/main")
	ctx = tracer.NewContext(ctx, tc)

	logger.FromContext(ctx).Info("start test without traces")

    // ... run the app or pass ctx into a logging middleware
    // these two examples explained below
    run(ctx)
    run2(ctx)
}

Logging with Uber zap

The Go standard log[1] package is horribly unstructured, and I tend to use Uber’s zap logger[2] instead.

1: log

2: zap logger

We're creating a simple log package that creates a structured zap logger:

func New(level zapcore.Level, structured bool, options ...zap.Option) (*zap.Logger, error) {
    cfg := zap.Config{
        Level:            zap.NewAtomicLevelAt(level),
        Encoding:         "json",
        EncoderConfig:    zap.NewProductionEncoderConfig(),
        OutputPaths:      []string{"stderr"},
        ErrorOutputPaths: []string{"stderr"},
    }

	return cfg.Build(options...)
}

and allows us to pass around the zap logger via context propagation. The semi standard is usually `FromContext(ctx) <type>` and `NewContext(ctx, <type>)` for this pattern.

type ctxKey struct{}

func FromContext(ctx context.Context) *zap.Logger {
	c, _ := ctx.Value(ctxKey{}).(*zap.Logger)
	return c
}

func NewContext(parent context.Context, z *zap.Logger) context.Context {
	return context.WithValue(parent, ctxKey{}, z)
}

We then modify the `FromContext()` function and inject, if available the OpenTelemetry trace and span IDs on a child logger:

func FromContext(ctx context.Context) *zap.Logger {
	childLogger, _ := ctx.Value(ctxKey{}).(*zap.Logger)

	if traceID := trace.SpanFromContext(ctx).SpanContext().TraceID(); traceID.IsValid() {
		childLogger = childLogger.With(zap.String("trace-id", traceID.String()))
	}

	if spanID := trace.SpanFromContext(ctx).SpanContext().SpanID(); spanID.IsValid() {
		childLogger = childLogger.With(zap.String("span-id", spanID.String()))
	}

	return childLogger
}

OpenTelemetry Tracing

Setup a trace provider and the context propagation helpers:

func NewTraceProvider() *sdktrace.TracerProvider {
	var err error
	exp, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Panicf("failed to initialize stdouttrace exporter %v\n", err)
		return nil
	}
	bsp := sdktrace.NewBatchSpanProcessor(exp)
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.AlwaysSample()),
		sdktrace.WithSpanProcessor(bsp),
	)
	otel.SetTracerProvider(tp)

	return tp
}

type ctxKey struct{}

func FromContext(ctx context.Context) trace.Tracer {
	t, _ := ctx.Value(ctxKey{}).(trace.Tracer)
	return t
}

func NewContext(parent context.Context, t trace.Tracer) context.Context {
	return context.WithValue(parent, ctxKey{}, t)
}

Putting it all together

We're calling `run(ctx)` and `run2(ctx)` from our main and passing our context.

Note that this could also be a middleware to an http router, injecting trace and span on each request before its handler runs.

func run(ctx context.Context) {
	var span trace.Span
	ctx, span = tracer.FromContext(ctx).Start(ctx, "run")
	defer span.End()

	logger.FromContext(ctx).Info("span test 1 with trace")
	logger.FromContext(ctx).Info("span test 1 repeats trace and span id")
}

func run2(ctx context.Context) {
	var span trace.Span
	ctx, span = tracer.FromContext(ctx).Start(ctx, "run2")
	defer span.End()

	logger.FromContext(ctx).Info("span test 2 with new trace")
	run3(ctx)
}

func run3(ctx context.Context) {
	var span trace.Span
	ctx, span = tracer.FromContext(ctx).Start(ctx, "run3")
	defer span.End()

	logger.FromContext(ctx).Info("span test 3 repeates trace id from 2 with new span id")
}

As we can see from the structured output, the trace and span ID on the first run are identical. For `run2` we started a new pair of trace and span and the sub handler `run3` shares the trace ID with its parent while the span ID was rotated.

{"level":"info","ts":1648373086.577307,"caller":"opentelemetry-logger/main.go:20","msg":"start test without traces"}
{"level":"info","ts":1648373086.577437,"caller":"opentelemetry-logger/main.go:34","msg":"span test 1 with trace","trace-id":"505fd06ccf32b4904fc60b916623283a","span-id":"d0c7aaed0d60a5ab"}
{"level":"info","ts":1648373086.577467,"caller":"opentelemetry-logger/main.go:35","msg":"span test 1 repeats trace and span id","trace-id":"505fd06ccf32b4904fc60b916623283a","span-id":"d0c7aaed0d60a5ab"}
{"level":"info","ts":1648373086.577496,"caller":"opentelemetry-logger/main.go:43","msg":"span test 2 with new trace","trace-id":"9c6e95aa4d6254183108ddb4d62dfa8b","span-id":"65c6fb4b4489f025"}
{"level":"info","ts":1648373086.577538,"caller":"opentelemetry-logger/main.go:52","msg":"span test 3 repeates trace id from 2 with new span id","trace-id":"9c6e95aa4d6254183108ddb4d62dfa8b","span-id":"2a7190e5e0ca8f16"}
{"level":"info","ts":1648373086.577607,"caller":"opentelemetry-logger/main.go:26","msg":"exit test without traces"}

Unstructured, the output looks like this.

2022-03-27T17:25:22.191+0800    INFO    opentelemetry-logger/main.go:20 start test without traces
2022-03-27T17:25:22.191+0800    INFO    opentelemetry-logger/main.go:34 span test 1 with trace  {"trace-id": "3282793f68a156aabf48335d7ba1f468", "span-id": "469843b9d0d1275a"}
2022-03-27T17:25:22.191+0800    INFO    opentelemetry-logger/main.go:35 span test 1 repeats trace and span id   {"trace-id": "3282793f68a156aabf48335d7ba1f468", "span-id": "469843b9d0d1275a"}
2022-03-27T17:25:22.191+0800    INFO    opentelemetry-logger/main.go:43 span test 2 with new trace      {"trace-id": "05525adea5a16be4be049bd336d604b3", "span-id": "339d84828277142f"}
2022-03-27T17:25:22.191+0800    INFO    opentelemetry-logger/main.go:52 span test 3 repeates trace id from 2 with new span id   {"trace-id": "05525adea5a16be4be049bd336d604b3", "span-id": "3cce817c9c297453"}
2022-03-27T17:25:22.191+0800    INFO    opentelemetry-logger/main.go:26 exit test without traces

Context Propagation Antipatterns

Context propagation is easily and often misused. It helps us keeping the bag of variables around required for observability and allows to narrow interfaces by implicitly passing the logger and other client packages. This helps us avoid globals and thereby removes shared mutable state which is never a good idea.

Wrapping Types

Don't wrap types, keep `*zap.Logger` in its native form and pass it around, wrapping types increases complexity unnecessarily and requires more code you need to adapt.

Wrapping Packages

Don't wrap `zap.Info()`, `zap.Error()` etc. You're repeating work for no benefit and an increased LOC count.

Pass Arguments

Don't use the context object to pass function arguments along the way. The function signature should give away determining factors of the operation of the function and convey that to the reader, don't hide normal arguments.