The $2,000 overnight problem
You deploy an AI agent on Friday afternoon. Staging looks fine. By Monday, your OpenAI bill reads $2,147. The agent ran unsupervised for 60 hours with no hard dollar limit.
This is not hypothetical. Teams running LLM agents in production hit this regularly. The failure modes are predictable:
- Retry loops. The agent gets a 429 or malformed response and retries with the full conversation context. Each retry costs more because the context window grows. A single stuck task burns hundreds of dollars in minutes.
- Tool call repetition. The agent calls the same tool repeatedly with slightly different parameters, hoping for a different result. Common with search tools where the LLM cannot articulate why results are unsatisfying.
- Context window growth. Multi-step agents accumulate history. By step 20, each LLM call includes everything from steps 1-19. Step 50 costs 10x what step 5 cost, even for identical work.
- Cascading agent calls. One agent delegates to another, which delegates to a third. A planning agent spawning five workers at ten iterations each produces 50 billed LLM calls from one user request.
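To see how fast context growth compounds, here is a back-of-the-envelope calculation. The price and tokens-per-step numbers are illustrative assumptions, not any provider's actual rates:

```python
# Back-of-the-envelope cost of a linearly growing context.
# Assumes a hypothetical $0.01 per 1K input tokens and that each
# agent step appends ~2K tokens of history to every later call.
PRICE_PER_1K_INPUT = 0.01
TOKENS_PER_STEP = 2_000

def run_cost(num_steps: int) -> float:
    """Total input-token cost when step n re-sends steps 1..n-1."""
    total_tokens = sum(step * TOKENS_PER_STEP for step in range(1, num_steps + 1))
    return total_tokens * PRICE_PER_1K_INPUT / 1_000

print(run_cost(5))   # a short run stays cheap
print(run_cost(50))  # total cost grows quadratically with step count
```

Under these assumptions a 5-step run costs about $0.30 while a 50-step run costs about $25.50, even though each step does the same amount of useful work.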
The common thread: nobody set a hard dollar limit. Iteration limits, token limits, and timeouts do not map to cost, and cost is what appears on the invoice.
Why max_iterations is not enough
Most agent frameworks ship with a max_iterations setting; LangChain, CrewAI, and AutoGen all have one. It feels like a safety net, but it only caps loops, not dollars. A single GPT-4 iteration can cost $0.50 or $10+ depending on context length, so max_iterations=20 could mean $10 or $200.
Worse, it does not detect loops. An agent calling the same tool with identical parameters five times in a row will burn through all 20 iterations doing useless work. And it is a deploy-time constant: stopping a runaway requires redeployment, which is not realistic at 3 AM.
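The gap between the two kinds of cap can be sketched in a few lines. The `cost_of_step` functions below are hypothetical stand-ins for real per-call prices, which grow with accumulated context:

```python
# Sketch: why an iteration cap does not bound spend.
def run_with_caps(max_iterations: int, max_dollars: float,
                  cost_of_step) -> tuple[int, float, str]:
    spent = 0.0
    for i in range(1, max_iterations + 1):
        step_cost = cost_of_step(i)
        if spent + step_cost > max_dollars:
            return i - 1, spent, "budget"  # dollar cap fired first
        spent += step_cost
    return max_iterations, spent, "iterations"

# Same 20-iteration cap, very different bills depending on context size:
cheap = run_with_caps(20, 50.0, lambda i: 0.50)      # flat cost per step
heavy = run_with_caps(20, 50.0, lambda i: 0.50 * i)  # cost grows each step
print(cheap)  # (20, 10.0, 'iterations')
print(heavy)  # (13, 45.5, 'budget')
```

With a flat per-step cost the iteration cap is the binding limit; with a growing context only the dollar cap stops the run before it exceeds the budget.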
You need defenses that operate on dollars, detect behavioral anomalies, and can be triggered remotely without a code change.
The four-layer defense model
After working with dozens of teams running agents in production, we have found that reliable cost control requires four independent layers. Any single layer can fail or be misconfigured. Four layers means you need four simultaneous failures for a runaway to go undetected.
Layer 1 — Budget caps
The most direct defense: set a hard dollar limit per run. When cumulative cost exceeds the budget, the agent stops immediately. No exceptions, no grace period.
```python
from agentguard import Tracer, HttpSink, BudgetGuard

tracer = Tracer(sink=HttpSink("ag1_your_key"))

# Hard stop at $5.00 per run
budget = BudgetGuard(max_dollars=5.0)
tracer.add_guard(budget)

with tracer.trace("research-task") as run:
    result = agent.invoke(user_query)
    # If cost exceeds $5, BudgetGuard raises BudgetExceeded
    # and the trace records exactly where the money went
```
BudgetGuard sums token usage across every LLM call using your provider's pricing tables. When the limit hits, it raises BudgetExceeded and cleanly unwinds the agent loop.
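The accounting a budget guard has to do is simple enough to sketch. The class, pricing table, and exception below are illustrative stand-ins, not AgentGuard's actual internals, and the prices are example values:

```python
# Sketch of per-call cost accumulation against a pricing table.
PRICING = {  # example dollars per 1K tokens: (input, output)
    "gpt-4": (0.03, 0.06),
    "gpt-4o-mini": (0.00015, 0.0006),
}

class BudgetExceeded(Exception):
    pass

class SimpleBudgetGuard:
    def __init__(self, max_dollars: float):
        self.max_dollars = max_dollars
        self.spent = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int):
        in_price, out_price = PRICING[model]
        self.spent += (input_tokens * in_price + output_tokens * out_price) / 1_000
        if self.spent > self.max_dollars:
            raise BudgetExceeded(f"${self.spent:.2f} exceeds ${self.max_dollars:.2f}")

guard = SimpleBudgetGuard(max_dollars=5.0)
guard.record("gpt-4", input_tokens=10_000, output_tokens=1_000)  # ~$0.36 so far
```

The check runs after every recorded call, so the worst-case overshoot is a single call, not an unbounded tail of retries.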
Layer 2 — Loop detection
Budget caps stop the bleeding, but loop detection stops the waste before it becomes bleeding. LoopGuard watches for repeated patterns in tool calls and halts the agent when it detects the same action being performed with no meaningful progress.
```python
from agentguard import LoopGuard

# Halt if the same tool is called 3+ times with identical args
loop = LoopGuard(max_repeats=3)
tracer.add_guard(loop)

# Works across any nesting depth: detects loops in
# sub-agents and delegated tasks too
```
LoopGuard maintains a sliding window across the entire execution graph, including sub-agents. If agent A calls agent B, which calls tool X three times with identical parameters, LoopGuard catches it regardless of nesting depth. This addresses the most common production failure: agents stuck in retrieve-evaluate-retry cycles.
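The detection itself amounts to counting identical calls inside a sliding window over the flattened call stream. This is an illustrative sketch, not AgentGuard's internals; the class, exception, and window size are assumptions:

```python
# Sketch of repeat detection over a flattened tool-call stream.
from collections import deque

class LoopDetected(Exception):
    pass

class SimpleLoopGuard:
    def __init__(self, max_repeats: int = 3, window: int = 10):
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=window)  # sliding window of recent calls

    def observe(self, tool: str, args: dict):
        key = (tool, tuple(sorted(args.items())))
        self.recent.append(key)
        if self.recent.count(key) >= self.max_repeats:
            raise LoopDetected(f"{tool} repeated {self.max_repeats}x with identical args")

guard = SimpleLoopGuard(max_repeats=3)
guard.observe("search", {"q": "agent pricing"})
guard.observe("search", {"q": "agent pricing"})
# A third identical call would raise LoopDetected, regardless of
# which (sub-)agent issued it, because the stream is flattened.
```

Because every agent and sub-agent feeds the same stream, nesting depth is irrelevant to the count.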
Layer 3 — Remote kill switch
Sometimes you need to stop an agent now. The AgentGuard dashboard provides a remote kill switch that sends a termination signal to any running trace. The SDK polls for kill signals at every guard checkpoint (before each LLM call and tool invocation), stopping the agent within seconds.
An on-call engineer can stop a runaway agent from their phone. No redeployment, no SSH, no infrastructure access needed.
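The checkpoint-polling pattern is worth seeing in miniature. In this sketch the "remote" signal is a shared in-process flag; the real SDK presumably polls its backend instead, and the names here are illustrative:

```python
# Sketch of checkpoint-based kill polling.
import threading

class KillSignal(Exception):
    pass

class KillSwitch:
    def __init__(self):
        self._killed = threading.Event()

    def kill(self):
        # Set remotely, e.g. by a dashboard request handler
        self._killed.set()

    def checkpoint(self):
        # Called before every LLM call and tool invocation
        if self._killed.is_set():
            raise KillSignal("run terminated by operator")

switch = KillSwitch()
switch.checkpoint()  # fine: no kill requested yet
switch.kill()
# The next checkpoint raises KillSignal, so the agent stops at the
# next call boundary instead of finishing its loop.
```

Checking only at call boundaries keeps the overhead negligible while still bounding the reaction time to one LLM or tool call.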
Layer 4 — Automated alerts
The first three layers are reactive. Automated alerts add a proactive defense. Define intervention rules in the dashboard: if a run exceeds $2, fire a Slack webhook; if it exceeds $5, email the on-call engineer; if any agent runs longer than 10 minutes, page the team.
Rules evaluate server-side against streaming telemetry and fire within seconds. Every alert is logged with the run ID and threshold, creating an audit trail when finance asks what happened to the API budget.
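Conceptually, each rule is a predicate over the latest run snapshot plus an action to fire. The rule shapes and field names below are illustrative assumptions, not the dashboard's actual schema:

```python
# Sketch of server-side rule evaluation over streaming run telemetry.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # evaluated on each telemetry update
    action: str                       # e.g. "slack", "email", "page"

RULES = [
    Rule("spend-warning", lambda run: run["cost"] > 2.0, "slack"),
    Rule("spend-critical", lambda run: run["cost"] > 5.0, "email"),
    Rule("long-runner", lambda run: run["duration_s"] > 600, "page"),
]

def evaluate(run: dict) -> list[str]:
    """Return the actions to fire for this run snapshot."""
    return [r.action for r in RULES if r.condition(run)]

print(evaluate({"cost": 2.5, "duration_s": 120}))  # ['slack']
print(evaluate({"cost": 7.0, "duration_s": 900}))  # ['slack', 'email', 'page']
```

Evaluating on every telemetry update is what keeps alert latency in seconds rather than waiting for the run to end.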
Putting it all together
Here is a complete production setup combining all four layers. This is the configuration we recommend for any team running agents in production with real API spend.
```python
from agentguard import (
    Tracer, HttpSink, BudgetGuard, LoopGuard,
    BudgetExceeded, LoopDetected, KillSignal,
)

# Initialize with your API key
tracer = Tracer(
    sink=HttpSink("ag1_your_key"),
    enable_kill_switch=True,  # Layer 3: remote kill
)

# Layer 1: hard budget cap
tracer.add_guard(BudgetGuard(max_dollars=5.0))

# Layer 2: loop detection
tracer.add_guard(LoopGuard(max_repeats=3))

# Layer 4: alerts are configured in the dashboard UI,
# no code changes needed; rules evaluate server-side

def run_agent(user_query: str):
    with tracer.trace("production-run") as run:
        try:
            return agent.invoke(user_query)
        except BudgetExceeded:
            # Graceful degradation: return partial result
            return run.partial_result()
        except LoopDetected:
            # Agent was stuck; return what we have
            return run.partial_result()
        except KillSignal:
            # Operator killed the run manually
            return None
```
Each layer is independent. If you disable loop detection, the budget cap still works. If the budget guard has a bug, the kill switch still works. If nobody is watching the dashboard, the automated alerts still fire. Defense in depth means no single point of failure.
Cost attribution — knowing which agent caused the bill
Prevention is only half the problem. When costs do occur, you need to know exactly where the money went. AgentGuard breaks down cost at four levels:
- Per-agent. Cumulative cost per agent over any time window. See that your research agent costs $12/day while summarization costs $0.80/day.
- Per-run. Total cost per trace. Sort by cost to find outliers. Often, 5% of runs account for 80% of spend.
- Per-step. Each span (LLM call, tool invocation, sub-agent) has its own cost. Drill into an expensive run to find that one GPT-4 call with 50K context tokens cost $3.20.
- Per-tool. Aggregate across runs to see which tools are most expensive. Identify when web search triggers costly follow-up LLM calls.
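All four views are roll-ups of the same span-level data, keyed on a different field. This is an illustrative sketch of that aggregation; the span schema is an assumption, not AgentGuard's actual data model:

```python
# Sketch: rolling span-level costs up to the attribution levels.
from collections import defaultdict

spans = [  # (agent, run_id, tool_or_model, cost_dollars)
    ("research", "run-1", "gpt-4", 3.20),
    ("research", "run-1", "web_search", 0.05),
    ("research", "run-2", "gpt-4", 0.40),
    ("summarize", "run-3", "gpt-4o-mini", 0.02),
]

def rollup(index: int) -> dict:
    """Sum span costs grouped by the field at `index`."""
    totals = defaultdict(float)
    for span in spans:
        totals[span[index]] += span[3]
    return dict(totals)

per_agent = rollup(0)  # research ~= 3.65, summarize ~= 0.02
per_run = rollup(1)    # run-1 ~= 3.25 is the outlier
per_tool = rollup(2)   # gpt-4 dominates spend
```

Because every level is derived from the same spans, the per-agent, per-run, and per-tool totals always reconcile with each other.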
Attribution data feeds into alert rules. Set a rule that fires when a specific agent exceeds $50/day, or when any single run exceeds $10. Prevention plus attribution closes the loop: stop runaway costs and understand where normal spend goes so you can optimize it.