LLM Tool Loops Are Slow - Here's What to Actually Do

The standard LLM tool-calling pattern is an anti-pattern for production. Every tool call costs 200-500ms of LLM latency plus tokens for the entire conversation history. Let me show you what actually works.

The Problem With Standard Tool Calling

Here's what happens in the naive implementation:

# What the tutorials show you
while True:
    response = llm.complete(conversation_history + tool_schemas)
    if not response.has_tool_call:
        break
    result = execute_tool(response.tool_call)
    conversation_history.append(response)  # Assistant turn with the tool call
    conversation_history.append(result)    # History grows every call

Why this sucks:

  • Each round-trip adds 200-500ms of LLM latency
  • Token cost grows linearly with conversation length
  • Tool schemas are sent on every single call (often 1000+ tokens)
  • Sequential blocking - calls can't be parallelized

Five tool calls = 2.5 seconds minimum, before any actual execution time.
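That floor is simple arithmetic. A back-of-the-envelope cost model, using the figures above (the per-call constants are illustrative, not measured):

```python
def naive_loop_cost(n_calls, llm_ms=500, schema_tokens=1000, tokens_per_result=200):
    """Latency floor and total prompt tokens for n sequential tool calls."""
    latency_ms = n_calls * llm_ms
    # Every call resends the schemas plus the history accumulated so far,
    # so total prompt tokens grow quadratically even though each
    # individual call only grows linearly.
    prompt_tokens = sum(schema_tokens + i * tokens_per_result for i in range(n_calls))
    return latency_ms, prompt_tokens
```

With five calls this gives 2,500ms of pure LLM latency and 7,000 prompt tokens before any tool even runs.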

    Pattern 1: Single-Shot Execution Planning

    Don't loop. Make the LLM output all tool calls upfront:

    def get_execution_plan(task):
        prompt = f"""
        Task: {task}
        Output a complete execution plan as JSON.
        Include all API calls needed, with dependencies marked.
        """
        
        plan = llm.complete(prompt, response_format={"type": "json"})
        return json.loads(plan)
    
    # Example output:
    {
        "parallel_groups": [
            {
                "group": 1,
                "calls": [
                    {"tool": "get_weather", "args": {"city": "Boston", "date": "2024-01-20"}},
                    {"tool": "get_weather", "args": {"city": "NYC", "date": "2024-01-20"}}
                ]
            },
            {
                "group": 2,  # Depends on group 1
                "calls": [
                    {"tool": "compare_temps", "args": {"results": "$group1.results"}}
                ]
            }
        ]
    }

    Now execute the entire plan locally. One LLM call instead of five.
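Executing the plan locally is a small scheduler. A minimal sketch, assuming the plan shape from the example above, an async `execute_tool(name, args)` dispatcher supplied by the caller, and the `"$groupN.results"` placeholder convention for cross-group references:

```python
import asyncio

async def execute_plan(plan, execute_tool):
    """Run each parallel group concurrently; later groups may reference
    earlier outputs via "$groupN.results" placeholders."""
    group_results = {}
    for group in plan["parallel_groups"]:
        calls = []
        for call in group["calls"]:
            args = {}
            for key, value in call["args"].items():
                if isinstance(value, str) and value.startswith("$group"):
                    # "$group1.results" -> the list of results from group 1
                    args[key] = group_results[value[1:].split(".")[0]]
                else:
                    args[key] = value
            calls.append(execute_tool(call["tool"], args))
        # Everything within a group runs in parallel
        group_results[f"group{group['group']}"] = list(await asyncio.gather(*calls))
    return group_results
```

Groups still run in dependency order, but the LLM is only consulted once, up front.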

    Pattern 2: Tool Chain Compilation

    Common sequences should never hit the LLM:

    COMPILED_CHAINS = {
        "user_context": [
            ("get_user", lambda prev: {"id": "$current_user"}),
            ("get_preferences", lambda prev: {"user_id": prev["id"]}),
            ("get_recent_orders", lambda prev: {"user_id": prev["user_id"]}),
            ("aggregate_context", lambda prev: prev)
        ]
    }
    
    def execute_request(query):
        # Try to match against compiled patterns first
        if pattern := detect_pattern(query):
            return execute_compiled_chain(COMPILED_CHAINS[pattern])
        
        # Only use LLM for novel requests
        return llm_tool_loop(query)

    80% of your tool calls are repetitive. Compile them.
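`execute_compiled_chain` is left undefined above; a minimal sketch, assuming each step is a `(tool_name, args_fn)` pair where `args_fn` maps the previous step's result to the next call's keyword arguments (the first step just ignores it), and `tools` is a hypothetical name-to-function registry:

```python
def execute_compiled_chain(chain, tools):
    # Run a precompiled tool sequence with zero LLM calls. Each args_fn
    # turns the previous step's result into the next call's kwargs.
    prev = None
    for tool_name, args_fn in chain:
        prev = tools[tool_name](**args_fn(prev))
    return prev
```

The chain is just data, so it can be versioned, tested, and cached like any other code path.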

    Pattern 3: Streaming Partial Execution

    Start executing before the LLM finishes responding:

    async def stream_execute(prompt):
        buffer = ""
        tasks = {}
        
        async for chunk in llm.stream(prompt):
            buffer += chunk
            # Parse any tool calls that are already complete in the partial JSON
            for tool_call in try_parse_streaming_json(buffer):
                if tool_call.id not in tasks:
                    # Execute immediately, don't wait for the full response
                    tasks[tool_call.id] = asyncio.create_task(execute_tool(tool_call))
        
        # Gather all results
        return await asyncio.gather(*tasks.values())

    Saves 100-200ms per request by overlapping LLM generation with execution.
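The streaming-JSON helper is doing the real work there. One way to sketch it, assuming tool calls arrive as a stream of top-level JSON objects (`extract_complete_objects` is a hypothetical name): accumulate chunks into a buffer and peel off every object that is already complete.

```python
import json

def extract_complete_objects(buffer):
    """Return (objects, remainder): every complete top-level JSON object
    at the front of the stream buffer, plus the unconsumed tail."""
    decoder = json.JSONDecoder()
    objects, idx = [], 0
    while True:
        # Skip separators between objects (whitespace, commas)
        while idx < len(buffer) and buffer[idx] not in "{[":
            idx += 1
        try:
            obj, idx = decoder.raw_decode(buffer, idx)
        except json.JSONDecodeError:
            break  # Last object still incomplete - wait for more chunks
        objects.append(obj)
    return objects, buffer[idx:]
```

`raw_decode` handles braces inside string values correctly, which naive brace-counting does not.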

    Pattern 4: Context Compression

    Never send full conversation history. Send deltas:

    class CompressedContext:
        def __init__(self):
            self.task_summary = None
            self.last_result = None
            self.completed_tools = set()
        
        def get_prompt(self):
            # Instead of the full history, send only:
            return {
                "task": self.task_summary,  # 50 tokens vs 500
                "last_result": self.compress(self.last_result),  # Key fields only
                "completed": list(self.completed_tools)  # Tool names, not results
            }
        
        def compress(self, result):
            # Extract only the fields needed for reasoning
            if result["type"] == "weather":
                return {"temp": result["temp"], "summary": result["condition"]}
            # The full result stays local; the LLM never sees it
            return {"id": result["id"], "success": True}

    Reduces token usage by 85% after 5+ tool calls.

    Pattern 5: Tool Batching

    Design your tools to accept multiple operations:

    # Instead of:
    get_weather(city="Boston", date="2024-01-20")
    get_weather(city="NYC", date="2024-01-20")
    
    # Design tools that batch:
    get_weather_batch(requests=[
        {"city": "Boston", "date": "2024-01-20"},
        {"city": "NYC", "date": "2024-01-20"}
    ])

    One tool call, parallel execution internally.
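Internally, the batched tool can be a straightforward fan-out. A sketch, with `get_weather` passed in as the hypothetical single-city implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def get_weather_batch(requests, get_weather):
    # One tool call from the LLM's perspective; internally each lookup
    # fans out across a thread pool, and results come back in request order.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(get_weather, **req) for req in requests]
        return [f.result() for f in futures]
```

The LLM pays for one tool-call round trip no matter how many cities are in the list.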

    Pattern 6: Predictive Execution

    Execute likely tools before the LLM asks:

    def predictive_execute(query):
        # Start executing probable tools immediately
        futures = {}
        
        if "weather" in query.lower():
            cities = extract_cities(query)  # Simple NER, not LLM
            for city in cities:
                futures[city] = executor.submit(get_weather, city)
        
        # The LLM runs in parallel with the predictions
        llm_response = llm.complete(query)
        
        # If the LLM wanted weather, we already have it
        if llm_response.tool == "get_weather":
            city = llm_response.args["city"]
            if city in futures:
                return futures[city].result()  # Already done!
            return get_weather(city)  # Prediction missed - execute normally
        return llm_response  # No tool needed
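The "simple NER" can be as dumb as a gazetteer lookup. A hypothetical stand-in for `extract_cities` (a real system might use a proper gazetteer or a small NER model):

```python
# Hypothetical city list; in practice, load from your own data.
KNOWN_CITIES = {"boston", "nyc", "chicago", "seattle", "miami"}

def extract_cities(query):
    # Strip basic punctuation, then match tokens against the known set
    words = query.lower().replace(",", " ").replace("?", " ").split()
    return [w for w in words if w in KNOWN_CITIES]
```

Mispredictions only waste a cheap API call; correct predictions erase the tool's entire latency.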

    The Full Optimized Architecture

    class OptimizedToolExecutor:
        def __init__(self):
            self.compiled_chains = load_common_patterns()
            self.predictor = ToolPredictor()
            self.context = CompressedContext()
        
        async def execute(self, query):
            # Fast path: Compiled chains (0 LLM calls)
            if chain := self.match_compiled(query):
                return await self.execute_chain(chain)
            
            # Start predictive execution
            predictions = self.predictor.start_predictions(query)
            
            # Get execution plan (1 LLM call)
            plan = await self.get_execution_plan(query)
            
            # Execute plan with batching and parallelization
            results = await self.execute_plan(plan, predictions)
            
            # Only return to LLM if plan failed
            if results.needs_reasoning:
                # Send compressed context, not full history
                return await self.llm_complete(self.context.compress(results))
            
            return results

    Benchmarks From Production

    Standard tool loop (5 sequential weather checks):

  • Latency: 2,847ms
  • Tokens: 4,832
  • Cost: $0.07

    Optimized approach:

  • Latency: 312ms (single LLM call + parallel execution)
  • Tokens: 234 (just the execution plan)
  • Cost: $0.003

    Implementation Checklist

  • Profile your tool patterns - Log every tool sequence for a week
  • Compile the top 80% - Turn repeated sequences into templates
  • Batch similar operations - Redesign tools to accept arrays
  • Compress context aggressively - The LLM only needs deltas
  • Parallelize everything - No sequential tool calls, ever
  • Cache tool schemas - Send once per session, not per call

    The Key Insight

    LLM tool calling is an interpreter pattern when you need a compiler pattern:

  • Interpreter (slow): Each step returns to the LLM for the next instruction
  • Compiler (fast): The LLM generates a program; the runtime executes it

    Stop using the LLM as a for-loop controller. Use it as a query planner.

Quick Wins You Can Ship Today

# 1. Parallel execution (easiest win)
async def execute_parallel(tool_calls):
    return await asyncio.gather(*[
        execute_tool(call) for call in tool_calls
    ])

# 2. Context caching (huge token savings)
def get_context(full_history):
    if len(full_history) > 5:
        # Summarize the older turns into one message; keep the last two verbatim
        return [summarize(full_history[:-2])] + full_history[-2:]
    return full_history

# 3. Tool result compression
def compress_for_llm(tool_result):
    # Only the fields that affect reasoning
    return {k: v for k, v in tool_result.items()
            if k in REASONING_FIELDS[tool_result["type"]]}

The standard tool loop is a teaching example, not a production pattern. Ship something faster.