From Blind Spots to Clear Insights: Monitoring AI Agent Conversations


A Deep Dive into the Observability Stack that Tamed our Multi-Cloud AI Agents

Dhanesh Kumar A V
Technical Lead-DevOps
7 mins

Building an AI demo is easy. Building a production-grade AI platform that manages costs, latency, and accuracy across thousands of conversations is a different beast entirely.

In our journey to build a robust AI agent platform, we quickly discovered that standard logging approaches, the kind that served us well for traditional web apps, simply weren't sufficient. When a user asks a question, it might trigger a chain reaction: a retrieval step (RAG), a tool execution, and a summarization call, all routed through different providers like AWS Bedrock, Azure AI, or GCP Vertex.

In this blog, you will find a transparent look at the specific engineering challenges we faced, the open-source stack we built (using Grafana, Loki, and Prometheus), and the structured data strategy that turned our blind spots into clear insights.

The Challenge: Why AI Agents Break Traditional Monitoring

In a standard microservice architecture, you monitor HTTP 500s and average latency. If the server is up and responding fast, you’re green. In the world of AI Agents, "green" dashboards can hide massive problems.

We identified four distinct challenges that required a new approach:

  • Cost Unpredictability: A standard API endpoint costs roughly the same per hit. An LLM call can range from a fraction of a cent to a dollar depending on the model (GPT-4o vs. Mini) and the number of tokens consumed.
  • The "Black Box" Workflow: A single user request isn't just one database query; it is a multi-step conversation involving tools and reasoning loops.
  • Provider Chaos: We use a multi-cloud strategy (AWS, Azure, GCP). Debugging errors across three different provider schemas is a nightmare without unification.
  • Performance Nuance: "Total Duration" is a bad metric for streaming. A 10-second response is fine if the first token arrives in 200ms (TTFT). If the user waits 10 seconds for a blank screen, that's a churn event.

We realized we needed to move from asking "Is the service healthy?" to asking "What is the model thinking, and how much did that thought cost?"

The Foundation: Structured JSON Logging

Before we touched a single monitoring tool, we had to fix our data. Text-based logs (logger.info("Agent called LLM")) are useless for high-cardinality analysis.

We standardized structured JSON logging across every microservice. The secret sauce wasn't just using JSON; it was the strict enforcement of Correlation IDs.

Every log entry shares a schema that ties the technical operation to the business context:

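The exact event fields vary from service to service, and the values below are illustrative, but the correlation fields are mandatory on every line. A representative entry looks something like this:

```json
{
  "timestamp": "2025-03-14T10:32:05.118Z",
  "level": "INFO",
  "service": "agent-builder",
  "event": "llm_call_completed",
  "request_id": "req-7f3a9c21",
  "session_id": "sess-98b4d1e0",
  "company_id": "cmp-001",
  "details": {
    "model": "gpt-4o-mini",
    "duration_ms": 1840
  }
}
```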

Why this matters: The request_id allows us to trace a single transaction across services. The session_id lets us replay a full user conversation. The company_id allows us to aggregate costs per customer.

The Observability Stack

We needed a stack that could ingest these heavy JSON logs and allow us to query them fast. We settled on a Prometheus/Loki/Grafana architecture.

  • Loki: Aggregates structured logs from all microservices, enriched with Kubernetes metadata (pod name, namespace, cluster)
  • Prometheus: Collects metrics from services and Kubernetes infrastructure for resource monitoring
  • Grafana: Provides unified dashboards that transform logs and metrics into actionable insights

Deep Dive 1: The Gateway (LiteLLM Proxy)

Managing separate connections to OpenAI, Azure, and Bedrock was technical debt that scaled with every new provider. We implemented LiteLLM Proxy as a unified gateway. Every single LLM request flows through this proxy.

This acts as our "toll booth," logging comprehensive event data before the request ever reaches the provider.

The LiteLLM Log Schema:

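The gateway emits one record per LLM call. The field names below are representative rather than exhaustive; what matters is that token usage, cost, latency, and TTFT all land in the same record as the correlation IDs:

```json
{
  "timestamp": "2025-03-14T10:32:07.342Z",
  "request_id": "req-7f3a9c21",
  "session_id": "sess-98b4d1e0",
  "company_id": "cmp-001",
  "provider": "azure",
  "model": "gpt-4o",
  "prompt_tokens": 1250,
  "completion_tokens": 310,
  "cost_usd": 0.011,
  "time_to_first_token_ms": 210,
  "total_duration_ms": 2940,
  "status": "success"
}
```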

The "Retry Loop" Discovery

Using this gateway, we built a Time-Series graph tracking cost trends. We immediately spotted an anomaly: a massive spike in costs from a single agent. Because we had the logs, we saw the agent had entered a retry loop. It was hitting a rate limit error, catching it, and immediately retrying without a backoff, burning tokens on failed attempts. Without this gateway visibility, that bug would have persisted until the monthly invoice arrived.
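In the gateway logs, that pattern is unmistakable: the same request fingerprint failing repeatedly, milliseconds apart, with no backoff in between. A sketch of what those entries look like (field names and values illustrative):

```json
{ "timestamp": "2025-03-14T11:04:08.102Z", "request_id": "req-42d1", "status": "error", "error_type": "rate_limit_exceeded", "retry_attempt": 1 }
{ "timestamp": "2025-03-14T11:04:08.145Z", "request_id": "req-42d1", "status": "error", "error_type": "rate_limit_exceeded", "retry_attempt": 2 }
{ "timestamp": "2025-03-14T11:04:08.190Z", "request_id": "req-42d1", "status": "error", "error_type": "rate_limit_exceeded", "retry_attempt": 3 }
```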

Deep Dive 2: The Brain (Agent Builder)

While LiteLLM watches the traffic, our Agent Builder service watches the logic.

AI Agents don't just "chat." They use tools. They might search a vector database (RAG), run a calculation, or scrape a website. We needed to trace these "thoughts" and "actions."

We implemented Conversation Tracing logs. This allows us to see the "steps" the agent took to arrive at an answer.

The Tool Execution Log:

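The exact fields differ per tool, but a sketch of a RAG retrieval step might look like this (values illustrative):

```json
{
  "timestamp": "2025-03-14T10:32:06.510Z",
  "conversation_id": "conv-5521",
  "request_id": "req-7f3a9c21",
  "company_id": "cmp-001",
  "agent_id": "data-analysis-agent",
  "step": 2,
  "tool_name": "rag_search",
  "tool_status": "success",
  "documents_retrieved": 4,
  "duration_ms": 640
}
```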

End-to-End Debugging

Previously, if a user said "The agent gave me the wrong document," we had to dig through the code manually. Now, we filter Grafana by the conversation_id. We see the timeline:

  1. User Query Received.
  2. RAG Tool invoked (Success, 4 docs found).
  3. LLM Summarization invoked.
  4. Final Answer.

We can pinpoint exactly where the logic failed—whether the RAG tool fetched bad data or the LLM hallucinated the summary.

The Payoff: What We Can See Now

Moving to this structured approach transformed our operations.

  1. Granular Cost Attribution
    We don't just know our total bill. We know that Company A is driving 40% of our costs, specifically using the Data Analysis Agent. This allows us to optimize efficiently or adjust pricing models based on real usage.
  2. Performance Reality
    We stopped looking at average latency. We now track p90, p95, and p99 latency and Time To First Token (TTFT). We identified that while our average was fine, our p99 users were waiting 60+ seconds due to cold starts in a specific Azure region.
  3. Unified Multi-Cloud View
    We no longer tab-switch between AWS CloudWatch and Azure Monitor. Our Grafana dashboard displays utilization for Azure deployments next to GCP Vertex AI endpoints. We know when to request quota increases before we hit the limit.

Q & A

Q: Why use a Proxy (LiteLLM) instead of direct SDKs?
A: Decoupling. It allows us to switch providers without changing application code. If Azure goes down, we can route traffic to AWS. It also gives us a centralized place to count tokens and track costs, rather than scattering that logic across every microservice.

Q: How do you handle High Cardinality in logs?
A: This is why we use Loki. Loki indexes the metadata (labels like Service and Cluster) but leaves the log line (the heavy JSON) unindexed until query time. This allows us to store massive amounts of log data without the storage costs of a full-text index like Elasticsearch, while still being able to grep for a request_id instantly.

Q: What is the most critical metric for AI Agents?
A: Cost per Successful Task. Speed is irrelevant if the answer is wrong. Low cost is irrelevant if the user churns. We try to balance the "Quality/Cost" ratio by monitoring which models (e.g., GPT-4o vs Mini) are being used for which tasks.


Key Takeaway

Observability in AI isn't a "nice to have"; it is the difference between a prototype and a product.

You cannot optimize what you cannot measure. By enforcing structured logging and centralizing our view through a proxy, we turned a black box into a transparent system where we can confidently scale.

If you are building AI agents, stop logging strings. Start structuring your data today.