Modern agent-based systems tend to fail in subtle ways.
Not with crashes, but with drift:
- the “right” document slowly stops being selected,
- a heuristic fires twice,
- a preference rule silently stops applying,
- a language fallback kicks in without anyone noticing.
When this happens in production, logs alone are not enough.
You need structured, stable telemetry that tells you what decision was made, why, and whether it actually mattered.
In this post, I’ll walk through how we hardened telemetry for a domain-aware agent pipeline in .NET using OpenTelemetry + Aspire, focusing on:
- Deterministic tracing contracts
- Low-cardinality metrics
- Drift detection without log spam
- Developer-friendly local visibility (F5 / dotnet run)
All examples are domain-agnostic and apply to any policy-driven RAG or agent system.
The Problem: “Invisible” Correctness Bugs
In agent systems, many critical behaviors are intentional and non-fatal:
- a domain preference boosts one document over another,
- a language fallback is applied,
- a keyword search is skipped to avoid cross-language drift,
- a rule matches but is evidence-gated and does nothing.
From the outside, the answer may still look “reasonable”.
Without telemetry, you can’t tell:
- whether a rule fired,
- whether it mutated ranking,
- whether it was blocked by missing evidence,
- whether it ran once or twice.
So we defined a rule early on:
Every deterministic decision must be observable, cheaply, and in a stable shape.
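As a rough illustration of what that rule can look like in code, here is a minimal sketch using only the built-in `System.Diagnostics` APIs that OpenTelemetry for .NET listens to. It assumes .NET 8 with nullable reference types enabled. The source and meter name `AgentPipeline.Decisions`, the instrument name, the tag keys, and the `RuleOutcome` values are all hypothetical names chosen for this example, not the contract used in the actual pipeline.

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Example call: a preference rule matched but was gated by missing evidence.
DecisionTelemetry.Record("prefer-domain-docs", RuleOutcome.BlockedNoEvidence, rankingChanged: false);

// Hypothetical closed set of outcomes; a fixed enum keeps tag values bounded.
public enum RuleOutcome { Applied, NoEffect, BlockedNoEvidence }

public static class DecisionTelemetry
{
    // Names are assumptions for this sketch, not a real contract.
    private static readonly ActivitySource Source = new("AgentPipeline.Decisions");
    private static readonly Meter DecisionMeter = new("AgentPipeline.Decisions");
    private static readonly Counter<long> Decisions =
        DecisionMeter.CreateCounter<long>("agent.rule.decisions");

    // Maps the enum to a stable string so dashboards do not break on renames.
    public static string OutcomeTag(RuleOutcome outcome) => outcome switch
    {
        RuleOutcome.Applied => "applied",
        RuleOutcome.NoEffect => "no_effect",
        RuleOutcome.BlockedNoEvidence => "blocked_no_evidence",
        _ => "unknown",
    };

    public static void Record(string ruleId, RuleOutcome outcome, bool rankingChanged)
    {
        // StartActivity returns null when no listener is attached,
        // so every call on the activity is null-conditional.
        using Activity? activity = Source.StartActivity("rule.evaluate");
        activity?.SetTag("rule.id", ruleId);
        activity?.SetTag("rule.outcome", OutcomeTag(outcome));
        activity?.SetTag("rule.ranking_changed", rankingChanged);

        // Metric tags are limited to values from small, fixed sets.
        Decisions.Add(1,
            new KeyValuePair<string, object?>("rule.id", ruleId),
            new KeyValuePair<string, object?>("rule.outcome", OutcomeTag(outcome)));
    }
}
```

The sketch answers the four questions above with one span and one counter increment per evaluation: the outcome tag separates "fired" from "blocked by missing evidence", `rule.ranking_changed` says whether the decision mattered, and a rule that runs twice shows up as two `rule.evaluate` spans in the same trace once a listener is attached. It only stays cheap if rule IDs come from a small, fixed set; free-form values such as query text or document IDs would not belong on the metric.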
