Stop Using One Model for Everything
There is a default behavior in the AI space that costs people significant money and delivers worse results: using the same model for every task. Someone discovers Claude Opus or GPT-4o, it works well, and they route everything through it. Research tasks, formatting jobs, code review, data extraction, simple Q&A. All of it goes to the most expensive, most capable model available.
This is like hiring a senior engineer to update README files. It works, but it is a waste of capacity and budget. The better approach is model routing: matching each task to the cheapest model that can handle it reliably.
Why People Default to the Biggest Model
The reasons are understandable:
- It is simpler. One API key, one model name, one set of quirks to learn. Adding routing logic means more code and more decisions.
- Quality anxiety. People worry that a smaller model will produce bad output, so they avoid the risk by always using the best.
- Lack of benchmarks. Without testing models against your specific tasks, you cannot know where the quality threshold actually is. So you default to overkill.
These are reasonable instincts, but they do not hold up under examination. Most agent workloads are not uniformly hard. They are a mix of trivial tasks and complex ones, and the ratio is heavily skewed toward trivial.
The 75/20/5 Rule
Analysis of hundreds of agent task logs shows a consistent pattern. For a typical production agent workload:
- 75% of tasks can be handled by a fast, cheap model (Haiku-class)
- 20% of tasks need a mid-tier model (Sonnet-class)
- 5% of tasks require a frontier model (Opus-class)
If you are sending 100% of tasks to an Opus-class model, you are paying roughly 60x Haiku-class prices for the three-quarters of your workload that does not need frontier capability. As the worked example below shows, that adds up to roughly a 10x overcharge on your total bill.
What Each Tier Is Good At
Haiku-Class (Fast, Cheap)
These models are optimized for speed and cost. They handle structured, well-defined tasks with high reliability:
- Extracting specific fields from text (names, dates, amounts)
- Classifying inputs into predefined categories
- Reformatting data (JSON to CSV, markdown to HTML)
- Simple text generation from templates
- Yes/no decisions with clear criteria
- Summarizing short documents (under 2,000 words)
Cost: roughly $0.25 per million input tokens, $1.25 per million output tokens.
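To make this concrete, here is a minimal sketch of a field-extraction call using the Anthropic Python SDK. The function name is illustrative, and "claude-haiku" is a placeholder matching the tier labels used in the routing example later in this piece; substitute the actual Haiku-class model ID you run.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_invoice_fields(text: str) -> str:
    # A structured, well-defined extraction task: typical Haiku-class work.
    response = client.messages.create(
        model="claude-haiku",  # placeholder; use your actual Haiku-class model ID
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": "Extract the vendor name, invoice date, and total amount "
                       "from this invoice as JSON:\n\n" + text,
        }],
    )
    return response.content[0].text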
Sonnet-Class (Balanced)
The middle tier handles tasks that require more reasoning but are not pushing the frontier of model capability:
- Code generation for well-defined functions
- Multi-step analysis with 2-3 reasoning steps
- Writing that needs to match a specific style or tone
- Comparing multiple options and making recommendations
- Debugging straightforward issues
- Summarizing long or complex documents
Cost: roughly $3 per million input tokens, $15 per million output tokens.
Opus-Class (Maximum Capability)
Reserve the frontier model for tasks where quality genuinely matters and where smaller models demonstrably fall short:
- Complex architectural decisions involving multiple tradeoffs
- Novel code design where there is no obvious pattern to follow
- Multi-file refactoring that requires understanding system-wide implications
- Nuanced writing that requires deep understanding of context
- Tasks where a wrong answer has significant consequences
Cost: roughly $15 per million input tokens, $75 per million output tokens.
Concrete Cost Savings
Let us work through a concrete example. Suppose your agent processes 1,000 tasks per day, averaging 2,000 input tokens and 500 output tokens per task.
All-Opus approach:
    Input:  1000 * 2000 * $15/1M = $30.00/day
    Output: 1000 * 500 * $75/1M = $37.50/day
    Total:  $67.50/day = $2,025/month
Routed approach (75/20/5 split):
    Haiku:  750 tasks → $0.38 + $0.47 = $0.85/day
    Sonnet: 200 tasks → $1.20 + $1.50 = $2.70/day
    Opus:   50 tasks → $1.50 + $1.88 = $3.38/day
    Total:  $6.93/day = $207.90/month
That is a 90% cost reduction with the same output quality, because 95% of your tasks never needed Opus in the first place.
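If you want to check the arithmetic or plug in your own volumes, a short script like the following reproduces the numbers above, using the per-token prices and the 75/20/5 split stated earlier and assuming 30-day months. Small differences from the figures above are rounding.

# Per-million-token prices (input, output) for each tier, as quoted above.
PRICES = {
    "haiku": (0.25, 1.25),
    "sonnet": (3.00, 15.00),
    "opus": (15.00, 75.00),
}

TASKS_PER_DAY = 1000
INPUT_TOKENS, OUTPUT_TOKENS = 2000, 500   # average per task
SPLIT = {"haiku": 0.75, "sonnet": 0.20, "opus": 0.05}

def daily_cost(tier: str, tasks: float) -> float:
    # Cost in dollars for `tasks` requests routed to the given tier.
    in_price, out_price = PRICES[tier]
    return tasks * (INPUT_TOKENS * in_price + OUTPUT_TOKENS * out_price) / 1_000_000

all_opus = daily_cost("opus", TASKS_PER_DAY)
routed = sum(daily_cost(tier, TASKS_PER_DAY * share) for tier, share in SPLIT.items())

print(f"All-Opus: ${all_opus:.2f}/day, ${all_opus * 30:.2f}/month")
print(f"Routed:   ${routed:.2f}/day, ${routed * 30:.2f}/month")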
How to Implement Routing Logic
Routing does not need to be complicated. Start with a simple rule-based system:
from dataclasses import dataclass

@dataclass
class Task:
    # Minimal task representation for illustration; only the field routing needs.
    type: str

def select_model(task: Task) -> str:
    # Tier 1: simple, structured tasks
    if task.type in ["extract", "classify", "format", "template"]:
        return "claude-haiku"
    # Tier 2: moderate reasoning tasks
    if task.type in ["code_gen", "analyze", "compare", "debug"]:
        return "claude-sonnet"
    # Tier 3: complex reasoning tasks
    if task.type in ["architect", "refactor", "novel_design"]:
        return "claude-opus"
    # Default to the middle tier for unknown task types
    return "claude-sonnet"
Start coarse and refine over time. Log the model used, the task type, and the output quality for each request. After a week of data, you will see clearly where you can downgrade models without affecting quality, and where you need to upgrade.
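One lightweight way to collect that data is to append a record per request, as in the sketch below. The CSV file name is an arbitrary choice, and the quality value is whatever your heuristic or human rating produces; adapt both to your setup.

import csv
import time
from pathlib import Path

LOG_PATH = Path("routing_log.csv")  # illustrative location

def log_request(task_type: str, model: str, quality: float) -> None:
    # Append one row per request; after a week, group by (task_type, model)
    # to see where you can downgrade models and where you need to upgrade.
    new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "task_type", "model", "quality"])
        writer.writerow([time.time(), task_type, model, quality])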
The Fallback Pattern
A more robust approach is to start with the cheapest model and escalate on failure:
- Send the task to Haiku
- Check the output against a quality heuristic (length, format, confidence score)
- If it passes, use it. If not, retry with Sonnet
- If Sonnet fails the check, escalate to Opus
This guarantees you always use the cheapest model that produces acceptable output. The overhead of occasional retries is far less than the cost of always using the most expensive model.
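A minimal sketch of that escalation loop is below, assuming the Anthropic Python SDK and the placeholder model names from the routing example. The quality check shown is a trivial stand-in; replace it with whatever length, format, or confidence heuristic fits your tasks.

import anthropic

client = anthropic.Anthropic()
ESCALATION_ORDER = ["claude-haiku", "claude-sonnet", "claude-opus"]  # placeholder model IDs

def passes_quality_check(output: str) -> bool:
    # Stand-in heuristic: non-empty and not suspiciously short.
    return len(output.strip()) > 50

def run_with_fallback(prompt: str) -> tuple[str, str]:
    # Try the cheapest tier first; escalate only when the output fails the check.
    output = ""
    for model in ESCALATION_ORDER:
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        output = response.content[0].text
        if passes_quality_check(output):
            return model, output
    # Even the top tier failed the heuristic; return its output and flag for review.
    return ESCALATION_ORDER[-1], output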
The question is not "which model is best?" It is "what is the cheapest model that is good enough for this specific task?" That reframing changes everything about how you build agent systems.
Master model selection with real benchmarks and routing rules.
Get the Model Selection Guide →