Agents Don't Need to Be Smart
The default instinct when building an AI agent is to reach for the most powerful model available. If Claude Opus exists, why would you use anything less? This instinct is understandable, but it is also the single most expensive mistake in agent design.
The truth is that the vast majority of agent tasks do not require frontier intelligence. They require reliable execution of well-defined operations. And for that, small models are not just adequate. They are better.
The "Bigger Is Better" Fallacy
Large models are extraordinary at complex reasoning, nuanced writing, and multi-step problem solving. But most agent tasks are none of those things. Most agent tasks look like this:
- Extract the date and amount from this invoice PDF
- Classify this email as urgent, normal, or spam
- Reformat this JSON into a markdown table
- Summarize this article in three bullet points
- Check if this text contains any of these keywords
These are pattern matching and formatting tasks. A small, fast model handles them with the same accuracy as a large model, at a fraction of the cost and latency. Using Opus for email classification is like hiring an architect to hang a picture frame.
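To make that concrete, here is a minimal sketch of the email-classification task using the Anthropic Python SDK. The model id, categories, and prompt wording are illustrative placeholders, not a prescribed setup.

# Email classification with a small model (illustrative sketch)
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

def classify_email(email_text: str) -> str:
    """Sort an email into one of three predefined categories."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model id; any Haiku-class model works
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": "Classify this email as urgent, normal, or spam. "
                       "Reply with exactly one word.\n\n" + email_text,
        }],
    )
    return response.content[0].text.strip().lower()

A task this constrained leaves a small model almost nothing to get wrong, which is why it performs on par with the frontier model here.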
What Small Models Excel At
Small and mid-tier models (Haiku-class, GPT-4o-mini, Gemini Flash) are genuinely good at a wide range of practical tasks:
- Text formatting and conversion: Markdown to HTML, JSON restructuring, CSV generation
- Data extraction: Pulling structured fields from unstructured text
- Classification: Sorting items into predefined categories
- Simple generation: Short descriptions, metadata, tags, summaries
- Template filling: Populating templates with provided data
- Translation between formats: Converting between data schemas
For these tasks, small models run 5-10x faster and cost 10-30x less per token than frontier models. When your agent runs these operations hundreds of times per day, the savings compound significantly.
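The data-extraction item above, for instance, reduces to a single small-model call. Here is a hedged sketch along those lines; the field names, prompt, and model id are assumptions to adapt to your own documents.

# Structured field extraction with a small model (illustrative sketch)
import json
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

def extract_invoice_fields(invoice_text: str) -> dict:
    """Pull the date and total amount out of raw invoice text as JSON."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model id
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": "Extract the invoice date and total amount from the text below. "
                       'Respond with JSON only, e.g. {"date": "2024-01-31", "amount": "1234.56"}.'
                       "\n\n" + invoice_text,
        }],
    )
    # In practice you may need to strip code fences from the reply before parsing
    return json.loads(response.content[0].text)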
What Actually Requires Large Models
Reserve your most capable model for tasks that genuinely demand it:
- Complex multi-step reasoning: Tasks where the model must hold multiple constraints in mind and work through them logically
- Long-form strategic writing: Blog posts, reports, or analyses that require coherent structure across many paragraphs
- Nuanced judgment calls: Deciding whether a piece of content is appropriate, evaluating the quality of a proposal, or assessing risk
- Code generation with complex logic: Writing functions that involve algorithmic thinking, not just boilerplate
- Synthesis across many sources: Combining information from multiple documents into a coherent whole
In a typical agent workflow, these tasks represent about 10% of total operations.
The 90% Rule
90% of agent tasks can run on the cheapest model available. The remaining 10% justify the cost of a frontier model. The mistake is running everything on the expensive one.
This is not a theoretical estimate. Look at any agent workflow and categorize each step by the intelligence it actually requires. A research agent, for example, might have this breakdown:
1. Fetch and parse web pages - no LLM needed, just HTTP and HTML parsing
2. Extract key facts from each page - small model, extraction task
3. Classify relevance to the research topic - small model, classification task
4. Summarize each relevant source - small model, summarization task
5. Synthesize findings into a coherent briefing - large model, complex synthesis
Only step 5 needs the frontier model. Steps 2-4 run perfectly well on a model that costs a fraction as much.
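One way to encode that breakdown is a per-step model assignment, so the frontier model is only invoked where synthesis actually happens. The step names and model ids below are illustrative, not a prescribed pipeline.

# Per-step model assignment for the research agent above (sketch)
STEP_MODELS = {
    "fetch_pages": None,                              # step 1: no LLM, plain HTTP + parsing
    "extract_facts": "claude-3-5-haiku-latest",       # step 2: small model
    "classify_relevance": "claude-3-5-haiku-latest",  # step 3: small model
    "summarize_source": "claude-3-5-haiku-latest",    # step 4: small model
    "synthesize_briefing": "claude-3-opus-latest",    # step 5: large model
}

def model_for(step: str) -> str | None:
    """Return the model a pipeline step should call, or None when no LLM is needed."""
    return STEP_MODELS[step]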
How to Test If a Smaller Model Works
Before committing to a model for an agent task, run a simple comparison:
- Prepare 20 representative inputs for the task your agent will perform
- Run them through both the large and small model with identical prompts
- Compare outputs side by side. For most structured tasks, the outputs will be functionally identical
- Check edge cases. If the small model fails on unusual inputs, you can route just those cases to the larger model
# Quick comparison script
# Assumes the llm CLI with "haiku" and "opus" configured as model aliases.
PROMPT="<the exact prompt your agent uses for this task>"

for input in test_inputs/*.txt; do
  echo "=== $(basename "$input") ==="
  echo "--- Haiku ---"
  cat "$input" | llm -m haiku "$PROMPT"
  echo "--- Opus ---"
  cat "$input" | llm -m opus "$PROMPT"
  echo ""
done
If the small model produces acceptable output on 18 of 20 test cases, it is the right choice. Handle the edge cases with a fallback to the larger model, or accept the occasional imperfection for the massive cost savings.
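A simple way to implement that fallback is to validate the small model's output and retry on the larger model only when validation fails. The sketch below assumes the Anthropic Python SDK and placeholder model ids; the validate function stands in for whatever check fits your task.

# Run on the small model first; escalate only if the output fails validation (sketch)
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set
SMALL_MODEL = "claude-3-5-haiku-latest"   # placeholder model ids
LARGE_MODEL = "claude-3-opus-latest"

def run_task(prompt: str, validate) -> str:
    """Try the small model, falling back to the large model only when needed."""
    output = ""
    for model in (SMALL_MODEL, LARGE_MODEL):
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        output = response.content[0].text
        if validate(output):
            return output
    return output  # both models failed validation; return the last attempt

The validator can be as simple as checking that json.loads succeeds or that a required field is present in the output.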
Practical Model-Task Matching
Here is a reference list of common agent tasks and the model tier they actually need:
- Email triage and classification: Small model (Haiku, Flash)
- Data extraction from documents: Small model
- Social media post drafting: Mid-tier model (Sonnet, GPT-4o)
- Code review comments: Mid-tier model
- Research synthesis and analysis: Large model (Opus)
- Strategic planning documents: Large model
- JSON/CSV reformatting: Small model
- Notification summaries: Small model
- Meeting notes structuring: Small model
- Content quality assessment: Mid-tier to large model
The pattern is clear: the more a task resembles pattern matching or format conversion, the less intelligence it needs. The more it resembles judgment, reasoning, or creative synthesis, the more it benefits from a larger model.
Build your agents with model routing from the start. Default to the smallest viable model, and escalate only when the task demands it. Your cost per agent run will drop dramatically, your latency will improve, and the quality of output on the tasks that matter will stay exactly where it is.
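A minimal version of that routing is a lookup from task type to model tier that defaults to the smallest. The task names mirror the reference list above; the tier-to-model mapping is an assumption to swap for your own provider and models.

# Task-type router that defaults to the smallest viable model (sketch)
TIER_MODELS = {
    "small": "claude-3-5-haiku-latest",   # placeholder ids; swap in your provider's models
    "mid": "claude-3-5-sonnet-latest",
    "large": "claude-3-opus-latest",
}

TASK_TIERS = {
    "email_triage": "small",
    "data_extraction": "small",
    "json_reformatting": "small",
    "notification_summary": "small",
    "social_post_draft": "mid",
    "code_review_comment": "mid",
    "content_quality_check": "mid",
    "research_synthesis": "large",
    "strategic_planning": "large",
}

def route(task_type: str) -> str:
    """Pick a model for a task; unknown tasks default to the smallest tier."""
    return TIER_MODELS[TASK_TIERS.get(task_type, "small")]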
Learn exactly which model to use for every agent task.
Get the Model Selection Guide →