Agents Don't Need to Be Smart
The default instinct when building an AI agent is to reach for the most powerful model available. If Claude Opus exists, why would you use anything less? This instinct is understandable, but it is also the single most expensive mistake in agent design.
The truth is that the vast majority of agent tasks do not require frontier intelligence. They require reliable execution of well-defined operations. And for that, small models are not just adequate. They are better.
The "Bigger Is Better" Fallacy
Large models are extraordinary at complex reasoning, nuanced writing, and multi-step problem solving. But most agent tasks are none of those things. Most agent tasks look like this:
- Extract the date and amount from this invoice PDF
- Classify this email as urgent, normal, or spam
- Reformat this JSON into a markdown table
- Summarize this article in three bullet points
- Check if this text contains any of these keywords
These are pattern matching and formatting tasks. A small, fast model handles them with the same accuracy as a large model, at a fraction of the cost and latency. Using Opus for email classification is like hiring an architect to hang a picture frame.
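To make that concrete, here is a minimal sketch of the email-classification task using the Anthropic Python SDK. The model id, categories, and prompt wording are illustrative placeholders, not a prescribed setup.

# Email classification with a small model (illustrative sketch)
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

def classify_email(email_text: str) -> str:
    """Sort an email into one of three predefined categories."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model id; any Haiku-class model works
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": "Classify this email as urgent, normal, or spam. "
                       "Reply with exactly one word.\n\n" + email_text,
        }],
    )
    return response.content[0].text.strip().lower()

A task this constrained leaves a small model almost nothing to get wrong, which is why it performs on par with the frontier model here.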
What Small Models Excel At
Small and mid-tier models (Haiku-class, GPT-4o-mini, Gemini Flash) are genuinely good at a wide range of practical tasks:
- Text formatting and conversion: Markdown to HTML, JSON restructuring, CSV generation
- Data extraction: Pulling structured fields from unstructured text
- Classification: Sorting items into predefined categories
- Simple generation: Short descriptions, metadata, tags, summaries
- Template filling: Populating templates with provided data
- Translation between formats: Converting between data schemas
For these tasks, small models run 5-10x faster and cost 10-30x less per token than frontier models. When your agent runs these operations hundreds of times per day, the savings compound significantly.
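The data-extraction item above, for instance, reduces to a single small-model call. Here is a hedged sketch along those lines; the field names, prompt, and model id are assumptions to adapt to your own documents.

# Structured field extraction with a small model (illustrative sketch)
import json
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

def extract_invoice_fields(invoice_text: str) -> dict:
    """Pull the date and total amount out of raw invoice text as JSON."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model id
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": "Extract the invoice date and total amount from the text below. "
                       'Respond with JSON only, e.g. {"date": "2024-01-31", "amount": "1234.56"}.'
                       "\n\n" + invoice_text,
        }],
    )
    # In practice you may need to strip code fences from the reply before parsing
    return json.loads(response.content[0].text)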
What Actually Requires Large Models
Reserve your most capable model for tasks that genuinely demand it:
- Complex multi-step reasoning: Tasks where the model must hold multiple constraints in mind and work through them logically
- Long-form strategic writing: Blog posts, reports, or analyses that require coherent structure across many paragraphs
- Nuanced judgment calls: Deciding whether a piece of content is appropriate, evaluating the quality of a proposal, or assessing risk
- Code generation with complex logic: Writing functions that involve algorithmic thinking, not just boilerplate
- Synthesis across many sources: Combining information from multiple documents into a coherent whole
In a typical agent workflow, these tasks represent about 10% of total operations.
The 90% Rule
90% of agent tasks can run on the cheapest model available. The remaining 10% justify the cost of a frontier model. The mistake is running everything on the expensive one.
This is not a theoretical estimate. Look at any agent workflow and categorize each step by the intelligence it actually requires. A research agent, for example, might have this breakdown:
1. Fetch and parse web pages - no LLM needed, just HTTP and HTML parsing
2. Extract key facts from each page - small model, extraction task
3. Classify relevance to the research topic - small model, classification task
4. Summarize each relevant source - small model, summarization task
5. Synthesize findings into a coherent briefing - large model, complex synthesis
Only step 5 needs the frontier model. Steps 2-4 run perfectly well on a model that costs a fraction as much.
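One way to encode that breakdown is a per-step model assignment, so the frontier model is only invoked where synthesis actually happens. The step names and model ids below are illustrative, not a prescribed pipeline.

# Per-step model assignment for the research agent above (sketch)
STEP_MODELS = {
    "fetch_pages": None,                              # step 1: no LLM, plain HTTP + parsing
    "extract_facts": "claude-3-5-haiku-latest",       # step 2: small model
    "classify_relevance": "claude-3-5-haiku-latest",  # step 3: small model
    "summarize_source": "claude-3-5-haiku-latest",    # step 4: small model
    "synthesize_briefing": "claude-3-opus-latest",    # step 5: large model
}

def model_for(step: str) -> str | None:
    """Return the model a pipeline step should call, or None when no LLM is needed."""
    return STEP_MODELS[step]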
How to Test If a Smaller Model Works
Before committing to a model for an agent task, run a simple comparison:
- Prepare 20 representative inputs for the task your agent will perform
- Run them through both the large and small model with identical prompts
- Compare outputs side by side. For most structured tasks, the outputs will be functionally identical
- Check edge cases. If the small model fails on unusual inputs, you can route just those cases to the larger model
# Quick comparison script
# Assumes the llm CLI with "haiku" and "opus" configured as model aliases.
PROMPT="<the exact prompt your agent uses for this task>"

for input in test_inputs/*.txt; do
  echo "=== $(basename "$input") ==="
  echo "--- Haiku ---"
  cat "$input" | llm -m haiku "$PROMPT"
  echo "--- Opus ---"
  cat "$input" | llm -m opus "$PROMPT"
  echo ""
done
If the small model produces acceptable output on 18 of 20 test cases, it is the right choice. Handle the edge cases with a fallback to the larger model, or accept the occasional imperfection for the massive cost savings.
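A simple way to implement that fallback is to validate the small model's output and retry on the larger model only when validation fails. The sketch below assumes the Anthropic Python SDK and placeholder model ids; the validate function stands in for whatever check fits your task.

# Run on the small model first; escalate only if the output fails validation (sketch)
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set
SMALL_MODEL = "claude-3-5-haiku-latest"   # placeholder model ids
LARGE_MODEL = "claude-3-opus-latest"

def run_task(prompt: str, validate) -> str:
    """Try the small model, falling back to the large model only when needed."""
    output = ""
    for model in (SMALL_MODEL, LARGE_MODEL):
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        output = response.content[0].text
        if validate(output):
            return output
    return output  # both models failed validation; return the last attempt

The validator can be as simple as checking that json.loads succeeds or that a required field is present in the output.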
Practical Model-Task Matching
Here is a reference list of common agent tasks and the model tier they actually need:
- Email triage and classification: Small model (Haiku, Flash)
- Data extraction from documents: Small model
- Social media post drafting: Mid-tier model (Sonnet, GPT-4o)
- Code review comments: Mid-tier model
- Research synthesis and analysis: Large model (Opus)
- Strategic planning documents: Large model
- JSON/CSV reformatting: Small model
- Notification summaries: Small model
- Meeting notes structuring: Small model
- Content quality assessment: Mid-tier to large model
The pattern is clear: the more a task resembles pattern matching or format conversion, the less intelligence it needs. The more it resembles judgment, reasoning, or creative synthesis, the more it benefits from a larger model.
Build your agents with model routing from the start. Default to the smallest viable model, and escalate only when the task demands it. Your cost per agent run will drop dramatically, your latency will improve, and the quality of output on the tasks that matter will stay exactly where it is.
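A minimal version of that routing is a lookup from task type to model tier that defaults to the smallest. The task names mirror the reference list above; the tier-to-model mapping is an assumption to swap for your own provider and models.

# Task-type router that defaults to the smallest viable model (sketch)
TIER_MODELS = {
    "small": "claude-3-5-haiku-latest",   # placeholder ids; swap in your provider's models
    "mid": "claude-3-5-sonnet-latest",
    "large": "claude-3-opus-latest",
}

TASK_TIERS = {
    "email_triage": "small",
    "data_extraction": "small",
    "json_reformatting": "small",
    "notification_summary": "small",
    "social_post_draft": "mid",
    "code_review_comment": "mid",
    "content_quality_check": "mid",
    "research_synthesis": "large",
    "strategic_planning": "large",
}

def route(task_type: str) -> str:
    """Pick a model for a task; unknown tasks default to the smallest tier."""
    return TIER_MODELS[TASK_TIERS.get(task_type, "small")]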
Learn exactly which model to use for every agent task.
Get the Model Selection Guide →