Production-Grade Telemetry for Domain-Aware Agent Systems in .NET (Aspire + OpenTelemetry)

Modern agent-based systems tend to fail in subtle ways.

Not with crashes — but with drift:

  • the “right” document slowly stops being selected,
  • a heuristic fires twice,
  • a preference rule silently stops applying,
  • a language fallback kicks in without anyone noticing.

When this happens in production, logs alone are not enough.
You need structured, stable telemetry that tells you what decision was made, why, and whether it actually mattered.

In this post, I’ll walk through how we hardened telemetry for a domain-aware agent pipeline in .NET using OpenTelemetry + Aspire, focusing on:

  • Deterministic tracing contracts
  • Low-cardinality metrics
  • Drift detection without log spam
  • Developer-friendly local visibility (F5 / dotnet run)

All examples are domain-agnostic and apply to any policy-driven RAG or agent system.


The Problem: “Invisible” Correctness Bugs

In agent systems, many critical behaviors are intentional and non-fatal:

  • a domain preference boosts one document over another,
  • a language fallback is applied,
  • a keyword search is skipped to avoid cross-language drift,
  • a rule matches but is evidence-gated and does nothing.

From the outside, the answer may still look “reasonable”.

Without telemetry, you can’t tell:

  • whether a rule fired,
  • whether it mutated ranking,
  • whether it was blocked by missing evidence,
  • whether it ran once or twice.

So we defined a rule early on:

Every deterministic decision must be observable, cheaply, and in a stable shape.


Design Principles

Before writing code, we locked in a few constraints.

1. Telemetry must be deterministic

No dynamic keys, no free-form strings, no query text in tags.

If a span/event changes shape, a test must fail.

2. Zero “silent drift”

If the system falls back (language, evidence, index), telemetry must say so.

3. Cheap by default

  • Tags over logs
  • One event per decision
  • Counters instead of histograms where possible

4. Works with Aspire and without it

Local F5 / dotnet run must show something without standing up Prometheus.


Instrumentation Strategy

We instrumented one specific decision point:
a domain preference engine that can boost or gate retrieval results.

The pattern is reusable for any policy engine.

Span Model

We attach telemetry to an existing agent span and add one child span:

LawAgent.Handle
└── domain.pref

This avoids exploding trace depth while keeping the decision isolated.
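Opening that child span looks roughly like this (a minimal sketch; the `AgentTelemetry` class is a name we invented for illustration, and the source name must match whatever you register with `.AddSource(...)`):

```csharp
using System.Diagnostics;

// Shared ActivitySource for the agent system. The class and source name
// are illustrative; the name must match the .AddSource(...) registration.
public static class AgentTelemetry
{
    public static readonly ActivitySource Source = new("My.AgentSystem");
}

public sealed class LawAgent
{
    public void Handle()
    {
        // One child span per preference decision; its parent is the ambient
        // LawAgent.Handle activity, so trace depth stays flat.
        using var activity = AgentTelemetry.Source.StartActivity("domain.pref");

        // StartActivity returns null when no listener subscribes to this
        // source, so all tagging must be null-safe: activity?.SetTag(...)
    }
}
```

Note the null check: `StartActivity` returns `null` when nothing is listening, which is exactly the local-dev case the Aspire section below addresses.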


Stable Span Tags

Every execution emits the same set of tags — even when no rule matches.

activity.SetTag("domain.pref.applied", outcome.Applied);
activity.SetTag("domain.pref.rule_id", outcome.RuleId ?? "<none>");
activity.SetTag("domain.pref.rule_priority", outcome.RulePriority);
activity.SetTag("domain.pref.evidence_gated", outcome.EvidenceGated);
activity.SetTag("domain.pref.hints_appended", outcome.HintsAppended);
activity.SetTag("domain.pref.score_mutations", outcome.ScoreMutations);
activity.SetTag("domain.pref.score_delta_total", outcome.ScoreDeltaTotal);

Key points:

  • <none> is a sentinel, not null
  • Tags are always present
  • No tag value depends on user input
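A closed outcome type is what makes "tags are always present" cheap to enforce. This record is a hypothetical shape matching the tag names above, not our exact type:

```csharp
// Hypothetical outcome record matching the tag names above. Every field
// exists on every execution, matched or not, so the tag set never changes
// shape based on control flow.
public sealed record PreferenceOutcome(
    bool Applied,
    string? RuleId,        // null when no rule matched -> emitted as "<none>"
    int RulePriority,
    bool EvidenceGated,
    int HintsAppended,
    int ScoreMutations,
    double ScoreDeltaTotal);
```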

One Event, One Meaning

We emit exactly one event per decision:

activity.AddEvent(new ActivityEvent(
    "domain.preference",
    tags: new ActivityTagsCollection
    {
        ["rule_id"] = outcome.RuleId ?? "<none>",
        ["match_reason"] = outcome.MatchReason,
        ["candidate_celexes"] = outcome.CandidateCelexesCapped,
        ["top_before"] = outcome.TopBefore,
        ["top_after"] = outcome.TopAfter
    }));

This event answers the operational question:

Did this preference actually change anything?


Payload Guards (Hot-Path Safety)

Telemetry often suffers death by a thousand cuts:
one extra field here, one slightly longer list there.

We enforced hard caps:

  • Lists (e.g. candidate document IDs) are capped at 10 items
  • Overflow is encoded as a,b,c,...,+N
  • Never log raw queries or user text

This is enforced in code and locked by tests.
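The cap itself is a few lines. This is a sketch of the guard (the helper name `CapList` is ours), implementing the `a,b,c,...,+N` overflow encoding described above:

```csharp
using System.Collections.Generic;
using System.Linq;

// Sketch of the payload guard; the helper name CapList is ours.
public static class TelemetryGuards
{
    public const int MaxListItems = 10;

    public static string CapList(IReadOnlyList<string> items, int max = MaxListItems)
    {
        if (items.Count <= max)
            return string.Join(",", items);

        // Overflow encoding: "a,b,...,+N" keeps payload size bounded
        // no matter how many candidates the retriever returns.
        return string.Join(",", items.Take(max)) + $",...,+{items.Count - max}";
    }
}
```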


Counters: Cheap, Actionable Metrics

We added two low-cardinality counters:

domain_pref_applied_total{rule_id}
domain_pref_evidence_gated_total{rule_id}

Why counters?

  • They’re cheap
  • They show trends
  • They answer questions like:
    • Is this rule still being used?
    • Are we frequently evidence-gated?
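With `System.Diagnostics.Metrics`, the two counters are a handful of lines. A sketch, assuming our naming (`RecordApplied`/`RecordGated` are our helper names; the meter name must match the `.AddMeter(...)` registration shown later):

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Sketch of the two low-cardinality counters. The meter name must match
// the .AddMeter(...) registration, or the measurements are dropped.
public static class PreferenceMetrics
{
    private static readonly Meter Meter = new("My.Domain.Preferences.Metrics");

    public static readonly Counter<long> Applied =
        Meter.CreateCounter<long>("domain_pref_applied_total");

    public static readonly Counter<long> EvidenceGated =
        Meter.CreateCounter<long>("domain_pref_evidence_gated_total");

    // rule_id is the only label, keeping cardinality = number of rules.
    public static void RecordApplied(string ruleId) =>
        Applied.Add(1, new KeyValuePair<string, object?>("rule_id", ruleId));

    public static void RecordGated(string ruleId) =>
        EvidenceGated.Add(1, new KeyValuePair<string, object?>("rule_id", ruleId));
}
```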

Cardinality Guard

To avoid metric explosions:

  • rule_id must match ^[a-z0-9._-]+$
  • Only whitelisted rule IDs are allowed

Again: enforced by unit tests.
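The guard itself can be a small static check. A sketch (the whitelist entries here are invented examples, not our real rule IDs):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sketch of the cardinality guard; the whitelist entries are invented
// examples. Anything failing this check never reaches a metric label.
public static class RuleIdGuard
{
    private static readonly Regex Shape =
        new("^[a-z0-9._-]+$", RegexOptions.Compiled);

    private static readonly HashSet<string> Allowed = new()
    {
        "lang.fallback",
        "domain.boost.primary",
    };

    // Reject anything that could mint an unbounded new metric series.
    public static bool IsSafeLabel(string ruleId) =>
        Shape.IsMatch(ruleId) && Allowed.Contains(ruleId);
}
```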


Aspire Visibility (The Common Gotchas)

1. Custom ActivitySource must be registered

If you create spans via a custom ActivitySource, Aspire won’t show them unless you do this:

.AddSource("My.AgentSystem")

This is the #1 reason spans “don’t show up”.


2. Custom meters must be registered explicitly

.AddMeter("My.Domain.Preferences.Metrics")

Without this, counters are silently dropped.
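Both registrations sit on the same OpenTelemetry builder. A sketch of where they go, assuming the standard OpenTelemetry .NET builder APIs in your ServiceDefaults or Program.cs (`builder` comes from the surrounding host setup):

```csharp
// Registration sketch: both the custom ActivitySource and the custom Meter
// must be named explicitly, or their data is silently dropped.
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddSource("My.AgentSystem")                  // custom ActivitySource
        .AddOtlpExporter())
    .WithMetrics(metrics => metrics
        .AddMeter("My.Domain.Preferences.Metrics")    // custom Meter
        .AddOtlpExporter());
```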


3. OTLP endpoint must exist

Aspire sets this automatically.

For dotnet run / F5, we added a dev-only fallback:

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc

No Prometheus required — OTLP only.
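One way to wire the fallback in code rather than a launch profile (a sketch; the guard matters so it never overrides Aspire, which injects the endpoint itself):

```csharp
// Dev-only fallback, set before OpenTelemetry is configured. Mirrors the
// env vars above; guarded so it never runs in production or under Aspire.
if (builder.Environment.IsDevelopment() &&
    string.IsNullOrEmpty(Environment.GetEnvironmentVariable("OTEL_EXPORTER_OTLP_ENDPOINT")))
{
    Environment.SetEnvironmentVariable("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317");
    Environment.SetEnvironmentVariable("OTEL_EXPORTER_OTLP_PROTOCOL", "grpc");
}
```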


Making It Regression-Proof

Every telemetry assumption is locked with tests:

  • Span exists
  • Tags always present
  • Event exists exactly once
  • Payload caps respected
  • Metric labels validated
  • No legacy fields leak back in

This turns telemetry into an API contract, not an afterthought.
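Such tests need no exporter: an in-process `ActivityListener` is enough to capture the span and assert its shape. A self-contained sketch (in a real suite this lives in xUnit; plain boolean checks keep the sketch dependency-free, and the emitting code is inlined here in place of the real decision path):

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Contract-test sketch: subscribe an in-process listener, run a decision,
// assert the telemetry shape.
var captured = new List<Activity>();
using var listener = new ActivityListener
{
    ShouldListenTo = src => src.Name == "My.AgentSystem",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
        ActivitySamplingResult.AllDataAndRecorded,
    ActivityStopped = captured.Add
};
ActivitySource.AddActivityListener(listener);

// Stand-in for the code under test: emits the span, tag, and event.
using var source = new ActivitySource("My.AgentSystem");
using (var act = source.StartActivity("domain.pref"))
{
    act?.SetTag("domain.pref.rule_id", "<none>");
    act?.AddEvent(new ActivityEvent("domain.preference"));
}

var span = captured.Single(a => a.OperationName == "domain.pref");
bool tagPresent = span.GetTagItem("domain.pref.rule_id") is "<none>";
bool oneEvent = span.Events.Count(e => e.Name == "domain.preference") == 1;
// Both must hold, or the telemetry contract has drifted.
```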


Why This Matters

Telemetry like this doesn’t just help debugging.

It lets you:

  • detect semantic drift early,
  • prove determinism,
  • reason about policy behavior under real traffic,
  • refactor safely months later.

And most importantly:

It makes invisible correctness visible.


Final Thoughts

You don’t need to instrument everything.

Instrument:

  • decisions,
  • gates,
  • fallbacks,
  • preference applications.

If a future bug can’t explain itself through telemetry, it will cost you hours.

This setup cost us a bit of upfront discipline —
but it already paid for itself the first time something didn’t drift silently.

That’s all folks!

Cheers!
Gašper Rupnik

{End.}
