What to monitor in AI agents
AI agents are not traditional web services. Standard APM tools were not designed for their failure modes. Here is what matters:
- Cost per run. A single agent run might make 5 or 50 LLM calls. You need total cost per run and the trend over time.
- Latency. End-to-end duration and per-step latency. Agents that take 3 minutes in staging can take 20 minutes in production with real data.
- Success rate. Completion vs. timeout vs. error rates, at minimum.
- Tool call patterns. Which tools are called, in what order, how often. Calling the same search tool 15 times signals a loop.
- Loop detection. Repeated tool calls with identical parameters. The most common production failure and the hardest to catch with generic monitoring.
- Reasoning traces. Full chain of LLM inputs, outputs, tool calls, and decisions. Essential for debugging and auditing.
Any monitoring tool should cover cost, latency, and traces. Loop detection and budget enforcement separate agent-aware tools from generic observability.
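The loop-detection idea above reduces to a small amount of bookkeeping: fingerprint each tool call by its name and parameters, and flag when the same fingerprint repeats past a threshold. Here is a minimal sketch in plain Python (a hypothetical helper for illustration, not any tool's actual implementation):

```python
import json
from collections import Counter

def make_loop_detector(max_repeats: int = 3):
    """Return a callable that flags repeated identical tool calls."""
    seen = Counter()

    def record(tool_name: str, params: dict) -> bool:
        # Fingerprint the call: tool name plus canonicalized parameters,
        # so {"a": 1, "b": 2} and {"b": 2, "a": 1} count as the same call.
        key = (tool_name, json.dumps(params, sort_keys=True))
        seen[key] += 1
        return seen[key] > max_repeats  # True once the call is looping

    return record

looping = make_loop_detector(max_repeats=3)
for _ in range(4):
    hit_limit = looping("web_search", {"query": "quarterly revenue"})
print(hit_limit)  # True: the 4th identical call exceeds max_repeats=3
```

Real implementations add nuance (sliding windows, near-duplicate parameters), but the core signal is exactly this: identical calls repeating beyond a budget.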
The landscape — open source options
The AI agent monitoring space is young. Most tools launched in 2024 or 2025, and the feature sets are evolving quickly. Here are the four main approaches teams use today:
AgentGuard SDK is an MIT-licensed Python library with zero dependencies. It provides tracing, cost tracking, budget enforcement, loop detection, and a remote kill switch. Traces can go to a local JSONL file or the hosted dashboard. Designed for production safety, not just observability.
Langfuse is an open source LLM observability platform with tracing, prompt management, and a self-hostable web UI. It has the most mature tracing interface of any open source tool, with polished nested span visualization. It does not provide budget enforcement, loop detection, or kill switches.
Arize Phoenix is an open source observability tool focused on tracing, evaluation, and retrieval analysis. Strong on ML-specific metrics like embedding drift. Integrates well with LlamaIndex. Focused on analysis, not runtime enforcement.
Custom logging with stdlib. Python's logging module plus a log aggregator like Elasticsearch or Loki. Maximum flexibility, zero lock-in, but you build everything from scratch.
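To make the stdlib option concrete: the usual pattern is to emit one JSON object per agent step through `logging`, which Elasticsearch or Loki can then ingest. A minimal sketch, with illustrative field names:

```python
import json
import logging

logger = logging.getLogger("agent")
handler = logging.StreamHandler()
# Emit bare JSON lines so a log shipper can parse them directly.
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def log_step(run_id: str, tool: str, cost_usd: float, duration_ms: int) -> str:
    """Log one agent step as a single structured JSON line."""
    line = json.dumps({
        "run_id": run_id,
        "tool": tool,
        "cost_usd": cost_usd,
        "duration_ms": duration_ms,
    })
    logger.info(line)
    return line

log_step("run-42", "web_search", 0.0031, 850)
```

This is the "build everything from scratch" path: the logging itself is ten lines, but cost aggregation, dashboards, loop detection, and alerting all remain your problem.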
Feature comparison
This table compares the four approaches across the dimensions that matter most for production agent monitoring. We have tried to be honest about where each tool excels and where it falls short.
| Feature | AgentGuard SDK | Langfuse | Arize Phoenix | Custom stdlib |
|---|---|---|---|---|
| Tracing | Spans + JSONL | Nested spans (best UI) | OpenTelemetry spans | DIY |
| Cost tracking | Per-run, per-step | Per-trace | Basic | DIY |
| Budget enforcement | BudgetGuard | No | No | DIY |
| Loop detection | LoopGuard | No | No | DIY |
| Kill switch | Remote, real-time | No | No | DIY |
| Setup complexity | 2 lines of code | Self-host or cloud | pip install + local | High |
| Dependencies | Zero | Postgres, Redis, etc. | Several Python deps | Your choice |
| License | MIT | MIT (core) | Apache 2.0 | N/A |
The honest takeaway: Langfuse has the best tracing UI. AgentGuard is the only open source SDK with runtime safety (budgets, loops, kill). Phoenix excels at ML-specific analysis. Custom stdlib gives total control at 10x the build effort.
AgentGuard SDK — free, zero dependencies
Install with `pip install agentguard47`. No database, no containers, no config files. The SDK supports two sinks:
- JsonlSink writes traces to a local file. No server, no API key. Process with Unix tools, pandas, or any log aggregator. Ideal for development, CI, and air-gapped environments.
- HttpSink sends traces to the dashboard over HTTPS. Gives you the web UI, alerts, kill switch, and team features. Free tier: 1,000 traces/month.
Local-only mode works indefinitely. No time limit, no feature gate, no telemetry. The SDK works identically with or without a network connection.
Getting started with local-only monitoring
Here is a complete example of using AgentGuard for local monitoring. No server, no API key, no sign-up. Just pip install and start tracing.
```python
from agentguard import Tracer, JsonlSink, BudgetGuard, LoopGuard

# All traces written to a local file; no network needed
tracer = Tracer(sink=JsonlSink("traces.jsonl"))

# Optional: add safety guards even in local mode
tracer.add_guard(BudgetGuard(max_dollars=2.0))
tracer.add_guard(LoopGuard(max_repeats=3))

with tracer.trace("local-dev-run") as run:
    # Your agent code here
    result = agent.invoke("Summarize the quarterly report")

print(f"Cost: ${run.total_cost:.4f}")
print(f"Steps: {run.step_count}")

# Read traces back with standard tools:
#   cat traces.jsonl | python -m json.tool
# or load into pandas:
import pandas as pd

df = pd.read_json("traces.jsonl", lines=True)
print(df[["run_id", "total_cost", "duration_ms", "status"]])
```
Each line in traces.jsonl is a self-contained JSON object with the run ID, timestamps, step details, token counts, cost breakdown, and guard events. You get full observability without any external service.
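Because each line is self-contained JSON, a per-run cost report needs nothing beyond the standard library. A sketch, assuming the `total_cost` and `status` field names shown above:

```python
import json

def summarize_runs(path: str) -> dict:
    """Aggregate run count, total cost, and failures from a JSONL trace file."""
    runs = 0
    total_cost = 0.0
    failures = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            runs += 1
            total_cost += record.get("total_cost", 0.0)
            if record.get("status") != "completed":
                failures += 1
    return {"runs": runs, "total_cost": round(total_cost, 4), "failures": failures}
```

The same file also loads cleanly into pandas, DuckDB, or any log aggregator that speaks JSON lines.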
This is valuable in three scenarios: local development (catch loops before production), CI pipelines (budget guards on integration tests), and air-gapped environments (full tracing with zero data exfiltration risk).
When to upgrade to hosted observability
Local-only monitoring works well for individual developers and small teams. But there are clear inflection points where the hosted dashboard becomes worth the upgrade:
- Team size. When more than one person needs to see traces, sharing JSONL files becomes impractical. The dashboard provides role-based access, shared views, and a URL you can paste into Slack when debugging an incident.
- Compliance and audit. Regulated industries need tamper-evident trace storage, retention policies, and export capabilities. The dashboard stores traces in append-only storage with configurable retention. You can export any time window as JSONL for external archiving.
- Remote kill switch. The kill switch requires a server-side signal. Local-only mode cannot support it because there is no server to send the signal. If you need the ability to stop a runaway agent without redeploying, you need the hosted dashboard or a self-hosted equivalent.
- Automated alerts. Webhooks and email alerts evaluate server-side against streaming telemetry. You can set up rules like "if any run exceeds $10, page the on-call" without writing custom monitoring code.
- Visualization. The dashboard provides cost trends, latency percentiles, success rates, and trace timelines out of the box. Building equivalent dashboards on top of JSONL files is possible but requires significant effort with Grafana or a similar tool.
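The alert rule mentioned above ("if any run exceeds $10, page the on-call") boils down to a predicate evaluated against each incoming trace record. A hypothetical server-side check, with field names assumed for illustration:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AlertRule:
    name: str
    predicate: Callable[[dict], bool]  # fires when True for a trace record

RULES = [
    AlertRule("run-over-budget", lambda r: r.get("total_cost", 0.0) > 10.0),
    AlertRule("run-timed-out", lambda r: r.get("status") == "timeout"),
]

def check_alerts(record: dict) -> list[str]:
    """Return the names of all rules triggered by one trace record."""
    return [rule.name for rule in RULES if rule.predicate(record)]

print(check_alerts({"total_cost": 12.4, "status": "completed"}))
# ['run-over-budget']
```

A hosted dashboard wraps this loop with rule storage, deduplication, and delivery (webhook, email, pager); the evaluation logic itself stays this simple.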
The transition is seamless. You change one line of code, replacing `JsonlSink("traces.jsonl")` with `HttpSink("ag1_your_key")`. Everything else, including guards, trace structure, and your agent code, stays identical. There is no migration, no schema change, and no data loss. Your local JSONL files remain on disk as a backup.
Start local. Move to hosted when the team or the stakes grow. That is the design principle behind the two-sink architecture.