How Efficient Are AI Tool Calls?
LLM Tool Loops Are Slow - Here's What to Actually Do
The standard LLM tool-calling pattern is an anti-pattern for production. Every tool call costs 200-500ms of LLM latency plus tokens for the entire conversation history. Let me show you what actually works.
The Problem With Standard Tool Calling
Here's what happens in the naive implementation:
# What the tutorials show you
while not complete:
    # The entire conversation history plus every tool schema goes out on each call
    response = llm.complete(conversation_history + tool_schemas)
    if response.has_tool_call:
        result = execute_tool(response.tool_call)
        conversation_history.append(result)  # History grows every call
Why this sucks:
- Each round-trip: 200-500ms LLM latency
- Token cost grows linearly with conversation length
- Tool schemas sent every single time (often 1000+ tokens)
- Sequential blocking - can't parallelize
Five tool calls = 1 to 2.5 seconds of pure LLM latency. That's before any actual execution time.
Pattern 1: Single-Shot Execution Planning
Don't loop. Make the LLM output all tool calls upfront:
import json

def get_execution_plan(task):
    prompt = f"""
    Task: {task}
    Output a complete execution plan as JSON.
    Include all API calls needed, with dependencies marked.
    """
    plan = llm.complete(prompt, response_format={"type": "json"})
    return json.loads(plan)
# Example output:
{
    "parallel_groups": [
        {
            "group": 1,
            "calls": [
                {"tool": "get_weather", "args": {"city": "Boston", "date": "2024-01-20"}},
                {"tool": "get_weather", "args": {"city": "NYC", "date": "2024-01-20"}}
            ]
        },
        {
            "group": 2,  # Depends on group 1
            "calls": [
                {"tool": "compare_temps", "args": {"results": "$group1.results"}}
            ]
        }
    ]
}
Now execute the entire plan locally. One LLM call instead of five.
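Executing that plan locally is simple. A minimal sketch, assuming tools are passed in as a dict of async callables and that $groupN.results placeholders refer to the outputs of earlier groups (resolve_refs and the tools dict are illustrative helpers, not a specific framework's API):

import asyncio

def resolve_refs(args, prior_results):
    # Swap "$group1.results"-style placeholders for earlier group outputs
    return {k: (prior_results.get(v, v) if isinstance(v, str) and v.startswith("$") else v)
            for k, v in args.items()}

async def execute_plan(plan, tools):
    # tools: dict mapping tool names to async callables, e.g. {"get_weather": get_weather}
    prior_results = {}
    for group in plan["parallel_groups"]:
        # Calls inside a group have no dependencies on each other, so run them concurrently
        coros = [tools[c["tool"]](**resolve_refs(c["args"], prior_results))
                 for c in group["calls"]]
        group_results = await asyncio.gather(*coros)
        prior_results[f"$group{group['group']}.results"] = group_results
    return prior_results

Each group runs concurrently; a later group only starts once everything it references has finished.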
Pattern 2: Tool Chain Compilation
Common sequences should never hit the LLM:
COMPILED_CHAINS = {
    "user_context": [
        ("get_user", lambda: {"id": "$current_user"}),
        ("get_preferences", lambda prev: {"user_id": prev["id"]}),
        ("get_recent_orders", lambda prev: {"user_id": prev["user_id"]}),
        ("aggregate_context", lambda prev: prev)
    ]
}

def execute_request(query):
    # Try to match against compiled patterns first
    if pattern := detect_pattern(query):
        return execute_compiled_chain(COMPILED_CHAINS[pattern])
    # Only use LLM for novel requests
    return llm_tool_loop(query)
80% of your tool calls are repetitive. Compile them.
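Running a compiled chain needs nothing fancy. A sketch, assuming each entry pairs a tool name with a function that builds its arguments from the previous step's result, and assuming a hypothetical TOOLS registry of name-to-callable mappings (detect_pattern can be as simple as keyword or regex matching on the query):

TOOLS = {}  # hypothetical registry: tool name -> callable, filled in at startup

def execute_compiled_chain(chain):
    # Each step builds its arguments from the previous step's result
    prev = None
    for tool_name, build_args in chain:
        args = build_args(prev) if prev is not None else build_args()
        prev = TOOLS[tool_name](**args)
    return prev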
Pattern 3: Streaming Partial Execution
Start executing before the LLM finishes responding:
import asyncio

async def stream_execute(prompt):
    tasks = {}
    async for chunk in llm.stream(prompt):
        # Try to parse partial JSON for tool calls
        if tool_call := try_parse_streaming_json(chunk):
            if tool_call.id not in tasks:
                # Execute immediately, don't wait for the full response
                tasks[tool_call.id] = asyncio.create_task(execute_tool(tool_call))
    # Gather all results
    return await asyncio.gather(*tasks.values())
Saves 100-200ms per request by overlapping LLM generation with execution.
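The load-bearing piece is try_parse_streaming_json, which has to hold state across chunks. A rough sketch that buffers text and returns each tool-call object as soon as its braces balance; it assumes tool calls arrive as top-level JSON objects in the stream and ignores the edge case of braces inside string values (real providers emit structured tool-call deltas you would adapt this to):

import json

class StreamingToolCallParser:
    def __init__(self):
        self.buffer = ""
        self.emitted = 0  # complete objects already handed back

    def feed(self, chunk):
        # Return any newly completed tool-call dicts found so far
        self.buffer += chunk
        complete, depth, start = [], 0, None
        for i, ch in enumerate(self.buffer):
            if ch == "{":
                if depth == 0:
                    start = i
                depth += 1
            elif ch == "}" and depth > 0:
                depth -= 1
                if depth == 0:
                    complete.append(self.buffer[start:i + 1])
        new_calls = []
        for raw in complete[self.emitted:]:
            try:
                new_calls.append(json.loads(raw))
                self.emitted += 1
            except json.JSONDecodeError:
                break  # looked balanced but isn't valid JSON yet; wait for more chunks
        return new_calls

In stream_execute you would hold one parser per request, call feed on every chunk, and spawn a task for each call it returns.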
Pattern 4: Context Compression
Never send full conversation history. Send deltas:
class CompressedContext:
    def __init__(self):
        self.task_summary = None
        self.last_result = None
        self.completed_tools = set()

    def get_prompt(self):
        # Instead of full history, send only:
        return {
            "task": self.task_summary,                       # 50 tokens vs 500
            "last_result": self.compress(self.last_result),  # Key fields only
            "completed": list(self.completed_tools)          # Tool names, not results
        }

    def compress(self, result):
        # Extract only fields needed for reasoning
        if result["type"] == "weather":
            return {"temp": result["temp"], "summary": result["condition"]}
        # Full result stored locally, LLM never sees it
        return {"id": result["id"], "success": True}
Reduces token usage by 85% after 5+ tool calls.
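In whatever loop remains, you update the compressed context after each tool result instead of appending raw results to a transcript. A usage sketch (llm and execute_tool are the same placeholders used throughout this post):

ctx = CompressedContext()
ctx.task_summary = "Compare today's weather in Boston and NYC"

for _ in range(10):                                   # hard cap on iterations
    response = llm.complete(ctx.get_prompt())         # compact prompt, never full history
    if not response.has_tool_call:
        break
    result = execute_tool(response.tool_call)
    ctx.last_result = result                          # full result stays local
    ctx.completed_tools.add(response.tool_call.name)  # LLM only ever sees the name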
Pattern 5: Tool Batching
Design your tools to accept multiple operations:
# Instead of:
get_weather(city="Boston", date="2024-01-20")
get_weather(city="NYC", date="2024-01-20")

# Design tools that batch:
get_weather_batch(requests=[
    {"city": "Boston", "date": "2024-01-20"},
    {"city": "NYC", "date": "2024-01-20"}
])
One tool call, parallel execution internally.
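Inside the tool, the batch endpoint fans out on its own. A sketch, assuming an async get_weather(city, date) already exists:

import asyncio

async def get_weather_batch(requests):
    # One tool call from the LLM's perspective; N concurrent lookups internally
    results = await asyncio.gather(*[
        get_weather(city=r["city"], date=r["date"]) for r in requests
    ])
    return [{"request": r, "result": res} for r, res in zip(requests, results)]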
Pattern 6: Predictive Execution
Execute likely tools before the LLM asks:
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor()

def predictive_execute(query):
    # Start executing probable tools immediately
    futures = {}
    if "weather" in query.lower():
        cities = extract_cities(query)  # Simple NER, not LLM
        for city in cities:
            futures[city] = executor.submit(get_weather, city)
    # LLM runs in parallel with the predictions
    llm_response = llm.complete(query)
    # If the LLM wanted weather, we already have it
    if llm_response.tool == "get_weather":
        city = llm_response.args["city"]
        if city in futures:
            return futures[city].result()  # Already done!
    # Prediction missed: execute whatever the LLM actually asked for
    return execute_tool(llm_response)
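extract_cities doesn't need a model at all. A sketch, assuming you keep a small gazetteer of the cities your users actually ask about (a real system might swap in a lightweight NER model):

KNOWN_CITIES = {"boston", "nyc", "new york", "chicago", "seattle"}

def extract_cities(query):
    # Plain substring lookup against a fixed list: microseconds, zero model calls
    q = query.lower()
    return [city for city in KNOWN_CITIES if city in q]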
The Full Optimized Architecture
class OptimizedToolExecutor:
    def __init__(self):
        self.compiled_chains = load_common_patterns()
        self.predictor = ToolPredictor()
        self.context = CompressedContext()

    async def execute(self, query):
        # Fast path: Compiled chains (0 LLM calls)
        if chain := self.match_compiled(query):
            return await self.execute_chain(chain)
        # Start predictive execution
        predictions = self.predictor.start_predictions(query)
        # Get execution plan (1 LLM call)
        plan = await self.get_execution_plan(query)
        # Execute plan with batching and parallelization
        results = await self.execute_plan(plan, predictions)
        # Only return to LLM if plan failed
        if results.needs_reasoning:
            # Send compressed context, not full history
            return await self.llm_complete(self.context.compress(results))
        return results
Benchmarks From Production
Standard tool loop (5 sequential weather checks):
Optimized approach:
Implementation Checklist
The Key Insight
LLM tool calling is an interpreter pattern when you need a compiler pattern.
Stop using the LLM as a for-loop controller. Use it as a query planner.
Quick Wins You Can Ship Today
import asyncio

# 1. Parallel execution (easiest win)
async def execute_parallel(tool_calls):
    return await asyncio.gather(*[
        execute_tool(call) for call in tool_calls
    ])

# 2. Context caching (huge token savings)
def get_context(full_history):
    if len(full_history) > 5:
        return summarize(full_history[:-2]) + full_history[-2:]
    return full_history

# 3. Tool result compression
def compress_for_llm(tool_result):
    # Only keep the fields that affect reasoning
    return {k: v for k, v in tool_result.items()
            if k in REASONING_FIELDS[tool_result["type"]]}
The standard tool loop is a teaching example, not a production pattern. Ship something faster.