March 30, 2026 · 10 min read

What Agents Can't Do Yet

The demos are impressive. An agent that builds a full application in minutes. Another that researches a topic and writes a comprehensive report. A third that manages your calendar, drafts emails, and coordinates meetings. Watching these demonstrations, you'd be forgiven for thinking agents can do almost anything.

Then you try to use one in production, and reality intervenes. The gap between a polished demo and a reliable daily workflow is significant. Understanding where agents actually struggle — not where they might struggle someday, but where they consistently fail right now — is essential for building systems that work.

Browser Automation

Agents that control web browsers are among the most impressive demos and the most fragile production tools. An agent navigating a website looks like magic the first time. By the tenth time, after it's clicked the wrong button, failed to find a form field that moved, or gotten stuck on a CAPTCHA, the magic wears off.

The core problem is that websites are designed for humans, not machines. Layouts change without notice. Elements load asynchronously. Pop-ups, cookie banners, and A/B tests alter the page structure between visits. An agent that worked perfectly yesterday can break today because a site moved a button 20 pixels to the left.

Workaround: Use APIs whenever they exist. If you must automate a browser, target sites with stable structures and build in explicit wait conditions and retry logic. Accept that browser automation will require ongoing maintenance. Budget 20-30% of your initial build time for monthly upkeep.
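The wait-and-retry idea can be sketched as a generic wrapper. This is a minimal illustration, not tied to any particular automation library; the `action` callable stands in for whatever click or form-fill step your tooling provides:

```python
import time

def with_retries(action, max_attempts=3, base_delay=1.0):
    """Run a flaky browser action, retrying with exponential backoff.

    `action` is any zero-argument callable that raises on failure,
    e.g. a click or form-fill step from your automation library.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception as err:  # in practice, catch your library's specific error types
            last_error = err
            time.sleep(base_delay * (2 ** attempt))  # explicit wait before retrying
    raise RuntimeError(f"action failed after {max_attempts} attempts") from last_error
```

The explicit backoff gives asynchronously loading elements time to appear, and the final exception preserves the underlying cause for debugging.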

Complex Multi-Step Reasoning

Agents can follow instructions. They can execute multi-step plans. But when a workflow requires genuine reasoning across many steps — where the output of step 3 changes what step 7 should be, and the combination of steps 4 and 6 creates a constraint on step 9 — accuracy degrades. The longer the chain of reasoning, the more likely the agent is to lose track of a constraint or make an error that compounds through subsequent steps.

Workaround: Break complex workflows into shorter segments with human or programmatic checkpoints. Instead of one agent executing a 15-step process, use three agents each handling 5 steps, with validation between them. Keep individual reasoning chains under 5-7 steps for reliability.
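The checkpoint pattern can be sketched in a few lines. Here each segment is a callable that transforms a shared state dict (in practice, an agent call), and each validator is the programmatic check that runs between segments; the names are illustrative:

```python
def run_segmented(segments, validators, state):
    """Run a workflow as short segments with a validation checkpoint after each.

    `segments` and `validators` are parallel lists: each segment transforms
    the state dict; each validator returns True if the state is still sane.
    Failing a checkpoint stops the run instead of letting the error compound.
    """
    for i, (segment, validate) in enumerate(zip(segments, validators)):
        state = segment(state)
        if not validate(state):
            raise ValueError(f"checkpoint failed after segment {i}")
    return state
```

The key design choice is failing fast: an error caught at a checkpoint costs one segment's work, while the same error in a single 15-step run can silently corrupt everything downstream.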

Maintaining Context Over Long Sessions

Every model has a context window, and every context window has practical limits that are well below its theoretical maximum. An agent working on a task for an extended session gradually loses coherence. It forgets instructions from earlier in the conversation. It contradicts decisions it made 30 minutes ago. It starts repeating itself or drifting from the original objective.

This isn't just about token limits. Even within the context window, models pay less attention to information in the middle of a long conversation than to information at the beginning or end. Your carefully crafted system prompt can be effectively forgotten during a long session, even though it's technically still in context.

Workaround: Design workflows with explicit state management. At the end of each major step, have the agent summarize its progress and decisions into a structured format. Feed that summary back at the start of the next step. Think of it as giving the agent a notebook rather than relying on its memory.

# State management pattern: after each major step, have the agent
# emit a structured summary instead of carrying raw history forward.
summary = agent.summarize(
    decisions_made=[...],
    current_state={...},
    remaining_tasks=[...],
    constraints=[...],
)
# Feed this summary as context for the next step
# instead of relying on full conversation history

Reliable Tool Use

Agents with tool access are powerful, but tool calls are probabilistic. An agent might call the right function with the wrong parameters. It might hallucinate a tool that doesn't exist. It might call tools in the wrong order, or skip a necessary tool call entirely. The more tools you give an agent, the more likely it is to get confused about which one to use.

Workaround: Keep tool sets small and focused. An agent with 5 well-defined tools outperforms one with 30. Use clear, descriptive tool names and parameter descriptions. Validate tool call parameters before execution. When a tool call fails, provide clear error messages that help the agent self-correct rather than generic exceptions.
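Parameter validation before execution can be as simple as checking the call against a registry of known tools. A minimal sketch, assuming tool calls arrive as dicts with a `name` and `params` (your framework's shape may differ):

```python
def validate_tool_call(call, tool_schemas):
    """Check an agent's tool call against a registry before executing it.

    `tool_schemas` maps tool name -> set of required parameter names.
    Returns an error string the agent can self-correct from, or None if valid.
    """
    name, params = call["name"], call.get("params", {})
    if name not in tool_schemas:
        known = ", ".join(sorted(tool_schemas))
        return f"unknown tool '{name}'; available tools: {known}"
    missing = tool_schemas[name] - params.keys()
    if missing:
        return f"tool '{name}' is missing parameters: {', '.join(sorted(missing))}"
    return None
```

Note that the error messages list the available tools and the missing parameters by name: that is exactly the information a model needs to self-correct on the next attempt, as opposed to a generic exception.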

Understanding Images and Screenshots

Vision capabilities have improved dramatically, but agents still struggle with consistent screenshot interpretation. They can describe what's in an image, but reliably extracting specific data — reading a table from a screenshot, identifying the exact location of a UI element, or comparing two visually similar screens — remains inconsistent. The same screenshot analyzed twice might produce different readings of the same number.

Workaround: When possible, get data from the source rather than from screenshots. If you need to extract data from a visual, use OCR tools first and feed the text to the agent. For UI testing, use accessibility APIs rather than visual comparison. When vision is the only option, run the analysis multiple times and compare results.
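The run-multiple-times-and-compare approach amounts to majority voting. A sketch, where `analyze` stands in for a wrapper around whatever vision model you use:

```python
from collections import Counter

def consensus_reading(analyze, image, runs=3):
    """Run a vision analysis several times and keep the majority answer.

    `analyze` is any callable that returns a string reading of the image.
    Returns the most common reading plus the fraction of runs that agreed,
    so disagreement is surfaced instead of silently trusting one pass.
    """
    readings = [analyze(image) for _ in range(runs)]
    value, count = Counter(readings).most_common(1)[0]
    return value, count / runs
```

A low agreement fraction is a signal to fall back to OCR, the underlying data source, or human review rather than accepting the majority value.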

Real-Time Collaboration

Agents work asynchronously. They process a request, generate a response, and wait for the next input. They don't participate in real-time back-and-forth the way a human collaborator does. If you need an agent to monitor a Slack channel and respond contextually to a rapidly evolving conversation, the latency and lack of real-time awareness create awkward interactions.

Workaround: Use agents for preparation and follow-up rather than real-time participation. Have an agent prepare briefing materials before a meeting rather than participating in it. Have it process action items after a conversation rather than during. For chat monitoring, batch processing with short intervals (every 2-5 minutes) works better than attempting real-time responses.
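Batch processing means grouping messages by arrival time rather than reacting to each one. A small sketch of that grouping step, assuming messages are (timestamp, text) pairs:

```python
def batch_messages(messages, interval_seconds=120):
    """Group a message stream into processing batches by arrival time.

    Each message is a (timestamp_seconds, text) pair. Instead of reacting
    in real time, the agent processes one batch per interval, which lets it
    respond to the conversation's overall direction rather than every turn.
    """
    batches, current, window_end = [], [], None
    for ts, text in sorted(messages):
        if window_end is None:
            window_end = ts + interval_seconds
        if ts >= window_end:
            batches.append(current)
            current = []
            window_end = ts + interval_seconds
        current.append((ts, text))
    if current:
        batches.append(current)
    return batches
```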

Knowing When They're Wrong

This is perhaps the most fundamental limitation. Agents don't have reliable self-awareness about the quality of their own output. They can produce a completely wrong answer with the same confident tone as a correct one. They don't experience uncertainty the way humans do — there's no internal signal that says "I'm not sure about this."

Confidence scores and hedging language ("I think," "It's possible that") are generated by the same process that generates the content. They correlate roughly with accuracy but are not reliable indicators. An agent saying "I'm 90% confident" is not actually measuring its probability of being correct.

Workaround: Build external validation into every workflow that matters. Use a second agent to check the first agent's work. Compare outputs against known-good examples. For factual claims, require citations. For code, require tests. Never use agent confidence as a substitute for verification.
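The second-agent check can be structured as a produce-review loop. A minimal sketch in which `producer` and `reviewer` are placeholders for two separate model calls:

```python
def cross_check(producer, reviewer, task, max_rounds=2):
    """Have a second agent review the first agent's output before accepting it.

    `producer(task, feedback)` generates output, optionally revising based on
    prior feedback; `reviewer(task, output)` returns (ok, feedback). Failed
    reviews trigger a revision rather than trusting the producer's confidence.
    """
    feedback = None
    for _ in range(max_rounds):
        output = producer(task, feedback)
        ok, feedback = reviewer(task, output)
        if ok:
            return output
    raise ValueError(f"output failed review after {max_rounds} rounds: {feedback}")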

The Demo-to-Production Gap

Demos showcase best-case scenarios with carefully chosen inputs. Production handles worst-case scenarios with messy, unexpected inputs. This gap is wider with agents than with traditional software because agents fail in non-deterministic ways. A function that works 95% of the time in a demo will fail on the exact 5% of cases your users encounter most.

The honest assessment: agents are extremely capable within well-defined boundaries and increasingly unreliable as those boundaries expand. The skill isn't knowing what agents can do — it's knowing exactly where to draw the line.

What's Likely to Improve Soon

Some limitations are engineering problems that are actively being solved: context management, tool-call reliability, and browser automation robustness all improve with better scaffolding and with each model generation.

Other limitations are more fundamental: agents still lack reliable self-awareness about when they're wrong, and their failures remain non-deterministic, which no amount of scaffolding fully eliminates.

Build your agent systems around what works today. Design them so that as limitations improve, you can gradually extend automation boundaries. But don't build on the assumption that next quarter's model will solve your current problems. If your workflow only works with a model that doesn't exist yet, it doesn't work.
