A Deep Dive into the Observability Stack that Tamed our Multi-Cloud AI Agents

Building an AI demo is easy. Building a production-grade AI platform that manages costs, latency, and accuracy across thousands of conversations is a different beast entirely.
In our journey to build a robust AI agent platform, we quickly discovered that standard logging approaches, the kind that served us well for traditional web apps, simply weren't sufficient. When a user asks a question, it might trigger a chain reaction: a retrieval step (RAG), a tool execution, and a summarization call, all routed through different providers like AWS Bedrock, Azure AI, or GCP Vertex.
In this post, you'll find a transparent look at the specific engineering challenges we faced, the open-source stack we built (using Grafana, Loki, and Prometheus), and the structured data strategy that turned our blind spots into clear insights.
In a standard microservice architecture, you monitor HTTP 500s and average latency. If the server is up and responding fast, you’re green. In the world of AI Agents, "green" dashboards can hide massive problems.
We identified four distinct challenges that required a new approach:

- Cost opacity: a request can return HTTP 200 while silently burning thousands of tokens on retries or oversized prompts.
- Compound latency: a single question fans out into retrieval, tool execution, and summarization calls, so no single service's latency tells the whole story.
- Accuracy: a fast, cheap, "successful" response can still be a wrong or hallucinated answer.
- Multi-cloud fragmentation: traffic is split across AWS Bedrock, Azure AI, and GCP Vertex, with no single pane of glass across providers.
We realized we needed to move from asking "Is the service healthy?" to asking "What is the model thinking, and how much did that thought cost?"
Before we touched a single monitoring tool, we had to fix our data. Text-based logs (logger.info("Agent called LLM")) are useless for high-cardinality analysis.
We standardized structured JSON logging across every microservice. The secret sauce wasn't just using JSON; it was the strict enforcement of Correlation IDs.
Every log entry shares a schema that ties the technical operation to the business context:
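Here is a minimal, illustrative sketch (request_id, session_id, and company_id are the fields discussed below; the other fields are typical extras rather than our exact schema):

```json
{
  "timestamp": "2025-01-15T10:32:07Z",
  "level": "INFO",
  "service": "agent-builder",
  "message": "Agent called LLM",
  "request_id": "req-8f3a21",
  "session_id": "sess-19bd77",
  "company_id": "cust-042"
}
```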
Why this matters: The request_id allows us to trace a single transaction across services. The session_id lets us replay a full user conversation. The company_id allows us to aggregate costs per customer.
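The IDs are only useful if they attach to every log line automatically. A minimal Python sketch of that propagation, using contextvars and a logging filter (illustrative, not our production code):

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Correlation IDs live in context variables so every log line emitted
# while handling a request picks them up without manual plumbing.
request_id: ContextVar[str] = ContextVar("request_id", default="-")
session_id: ContextVar[str] = ContextVar("session_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Stamp both IDs onto every record before it is formatted.
        record.request_id = request_id.get()
        record.session_id = session_id.get()
        return True

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": record.request_id,
            "session_id": record.session_id,
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
handler.addFilter(CorrelationFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

# At the edge of each request (middleware, queue consumer, etc.):
request_id.set(str(uuid.uuid4()))
logging.info("Agent called LLM")  # emitted as JSON with both IDs attached
```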
We needed a stack that could ingest these heavy JSON logs and allow us to query them fast. We settled on a Prometheus/Loki/Grafana architecture.
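On the metrics side, each service exposes counters for Prometheus to scrape and Grafana to graph. A minimal sketch using the prometheus_client library (metric and label names are illustrative):

```python
from prometheus_client import Counter, start_http_server

# Counters scraped by Prometheus; metric and label names are illustrative.
LLM_TOKENS = Counter(
    "llm_tokens_total", "Tokens consumed", ["company_id", "model"]
)
LLM_COST = Counter(
    "llm_cost_usd_total", "Cumulative LLM spend in USD", ["company_id", "model"]
)

def record_llm_call(company_id: str, model: str, tokens: int, cost_usd: float) -> None:
    """Called after every LLM response so Grafana can graph cost per customer."""
    LLM_TOKENS.labels(company_id=company_id, model=model).inc(tokens)
    LLM_COST.labels(company_id=company_id, model=model).inc(cost_usd)

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    record_llm_call("cust-042", "gpt-4o", tokens=2055, cost_usd=0.0121)
```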
Managing separate connections to OpenAI, Azure, and Bedrock was technical debt that scaled with every new provider. We implemented LiteLLM Proxy as a unified gateway: every single LLM request flows through this proxy.
This acts as our "toll booth," logging comprehensive event data before the request ever reaches the provider.
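A sketch of the kind of event the gateway records per request (field names are illustrative; the token, cost, and status data are what made the debugging below possible):

```json
{
  "request_id": "req-8f3a21",
  "company_id": "cust-042",
  "provider": "azure",
  "model": "gpt-4o",
  "prompt_tokens": 1843,
  "completion_tokens": 212,
  "cost_usd": 0.0121,
  "latency_ms": 1840,
  "status": "rate_limited"
}
```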
Using this gateway, we built a Time-Series graph tracking cost trends. We immediately spotted an anomaly: a massive spike in costs from a single agent. Because we had the logs, we saw the agent had entered a retry loop. It was hitting a rate limit error, catching it, and immediately retrying without a backoff, burning tokens on failed attempts. Without this gateway visibility, that bug would have persisted until the monthly invoice arrived.
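The standard remedy for that failure mode is exponential backoff with jitter. A minimal sketch (the error class is a stand-in for whichever throttling exception your provider raises):

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider's throttling exception."""

def call_with_backoff(call, max_retries=5):
    """Retry an LLM call, waiting 1s, 2s, 4s... (plus jitter) between attempts."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up instead of burning tokens forever
            time.sleep(2 ** attempt + random.random())
```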
While LiteLLM watches the traffic, our Agent Builder service watches the logic.
AI Agents don't just "chat." They use tools. They might search a vector database (RAG), run a calculation, or scrape a website. We needed to trace these "thoughts" and "actions."
We implemented Conversation Tracing logs. This allows us to see the "steps" the agent took to arrive at an answer.
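A sketch of a single step in such a trace (field names are illustrative):

```json
{
  "conversation_id": "conv-77a1",
  "request_id": "req-8f3a21",
  "step": 2,
  "step_type": "tool_call",
  "tool_name": "vector_search",
  "input": "refund policy 2024",
  "output_summary": "3 documents retrieved",
  "duration_ms": 412
}
```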
Previously, if a user said "The agent gave me the wrong document," we had to manually read code. Now, we filter Grafana by the conversation_id and see the full timeline: the user's question, the RAG retrieval, each tool execution, and the final summarization call.
We can pinpoint exactly where the logic failed—whether the RAG tool fetched bad data or the LLM hallucinated the summary.
Moving to this structured approach transformed our operations. A few questions come up whenever we describe the design:
Q: Why use a Proxy (LiteLLM) instead of direct SDKs?
A: Decoupling. It allows us to switch providers without changing application code. If Azure goes down, we can route traffic to AWS. It also gives us a centralized place to count tokens and track costs, rather than scattering that logic across every microservice.
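To make the decoupling concrete: LiteLLM Proxy speaks the OpenAI API, so application code only ever knows the gateway's address. A sketch (host, port, key, and model alias are placeholders):

```python
from openai import OpenAI

# The app talks to the gateway, never to a provider directly.
client = OpenAI(base_url="http://litellm-proxy:4000", api_key="sk-placeholder")

response = client.chat.completions.create(
    model="gpt-4o",  # a logical alias; the proxy maps it to Azure, Bedrock, etc.
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
print(response.choices[0].message.content)
```

Rerouting traffic when a provider goes down is then a proxy-side config change, not an application deploy.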
Q: How do you handle High Cardinality in logs?
A: This is why we use Loki. Loki indexes the metadata (labels like Service and Cluster) but leaves the log line (the heavy JSON) unindexed until query time. This allows us to store massive amounts of log data without the storage costs of a full-text index like Elasticsearch, while still being able to grep for a request_id instantly.
Q: What is the most critical metric for AI Agents?
A: Cost per Successful Task. Speed is irrelevant if the answer is wrong. Low cost is irrelevant if the user churns. We try to balance the "Quality/Cost" ratio by monitoring which models (e.g., GPT-4o vs Mini) are being used for which tasks.
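As a toy illustration of the metric (the numbers are made up):

```python
# "Cost per Successful Task": total spend divided by tasks that actually succeeded.
events = [
    {"conversation_id": "c1", "cost_usd": 0.04, "success": True},
    {"conversation_id": "c2", "cost_usd": 0.11, "success": False},  # wasted spend
    {"conversation_id": "c3", "cost_usd": 0.03, "success": True},
]

total_cost = sum(e["cost_usd"] for e in events)
successes = sum(1 for e in events if e["success"])
print(f"Cost per successful task: ${total_cost / successes:.3f}")  # $0.090
```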
Observability in AI isn't a "nice to have"; it is the difference between a prototype and a product.
You cannot optimize what you cannot measure. By enforcing structured logging and centralizing our view through a proxy, we turned a black box into a transparent system where we can confidently scale.
If you are building AI agents, stop logging strings. Start structuring your data today.
