Agents that actually do the work, not just plan it.

What we learned running tool-calling agents in production for nine months, and why most teardowns miss the hard part.

Digital Adventures · 2026.03.28 · 9 min read

Most "agent" content online is about getting a system to plan a task. We have spent nine months running agents that have to actually finish the task. They are not the same problem.

The tutorials make this look easy. You wire Claude up to a handful of tools, point it at a goal, and watch it work. In a demo it is mesmerising. In production, for a customer who is paying you to deliver a specific outcome, it turns into a reliability exercise that nobody warns you about.

This is what we learned.

Planners feel good. Doers ship.

There are two honest modes for a language model agent.

Planner mode. Given a goal, produce a plan of steps. Someone or something else will execute them.

Doer mode. Given a goal and a real set of tools, go and finish the job. Call tools, read results, write outputs, stop when done.

Most public "agent" demos are planner demos dressed up as doer demos. The model produces a beautiful sequence of steps and you, the developer in the video, nod along. When the same system has to actually close 100 support tickets without a human in the loop, you find out which of those steps it can and cannot execute.

Doer agents in production have three failure modes that planner demos hide.

Failure one: tool hallucination

Claude and GPT both invent tool calls. They confidently use tools you did not define, pass parameters in shapes you did not declare, and call tools with the right name but the wrong signature. In a demo you can forgive this because you have a human who catches it. In production the hallucinated call either fails silently or, worse, takes a real action with wrong arguments.

The fix is not better prompts. The fix is a hard schema at the tool boundary. We validate every tool call through Zod before we route it to the actual implementation. A validation failure is caught, fed back to the model as a tool result explaining what went wrong, and retried.
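A minimal sketch of that boundary check, under stated assumptions. To stay dependency-free it uses a small hand-written validator as a stand-in for Zod; with Zod, each tool's schema would be a Zod object and the check would be a call such as `safeParse`. The tool name `close_ticket`, its fields, and the function `validateToolCall` are hypothetical and exist only for illustration.

```typescript
// Hypothetical tool registry. The tool name and fields are illustrative only.
type FieldType = "string" | "number" | "boolean";
type ToolSchema = Record<string, { type: FieldType; required: boolean }>;

const toolSchemas: Record<string, ToolSchema> = {
  close_ticket: {
    ticketId: { type: "string", required: true },
    resolution: { type: "string", required: true },
    notifyCustomer: { type: "boolean", required: false },
  },
};

type ValidationResult =
  | { ok: true; args: Record<string, unknown> }
  | { ok: false; feedback: string };

const hasOwn = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

// Validate a model-proposed call before it reaches the real implementation.
// On failure, `feedback` is written to be sent back to the model as a tool result.
function validateToolCall(name: string, rawArgs: unknown): ValidationResult {
  if (!hasOwn(toolSchemas, name)) {
    const known = Object.keys(toolSchemas).join(", ");
    return { ok: false, feedback: `Unknown tool "${name}". Available tools: ${known}.` };
  }
  if (typeof rawArgs !== "object" || rawArgs === null || Array.isArray(rawArgs)) {
    return { ok: false, feedback: `Arguments for "${name}" must be a JSON object.` };
  }
  const schema = toolSchemas[name];
  const args = rawArgs as Record<string, unknown>;
  const problems: string[] = [];

  // Check declared fields for presence and primitive type.
  for (const field of Object.keys(schema)) {
    const rule = schema[field];
    const value = args[field];
    if (value === undefined) {
      if (rule.required) problems.push(`missing required field "${field}"`);
    } else if (typeof value !== rule.type) {
      problems.push(`field "${field}" must be ${rule.type}, got ${typeof value}`);
    }
  }
  // Reject parameters the schema never declared.
  for (const field of Object.keys(args)) {
    if (!hasOwn(schema, field)) problems.push(`unexpected field "${field}"`);
  }

  if (problems.length > 0) {
    return { ok: false, feedback: `Invalid call to "${name}": ${problems.join("; ")}.` };
  }
  return { ok: true, args };
}
```

The design choice worth noting is that a rejected call produces a sentence the model can act on, not an exception. The agent loop can append `feedback` as the tool result and let the model try again.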

Equally important: keep the tool surface small. An agent with six well-named tools beats the same agent with sixty. Every additional tool is another chance for the model to guess.

Failure two: loop compulsion

Left alone, agents will call the same tool twice with almost the same arguments. Then three times. They are fishing. They are hoping the output was wrong the first time and will be different this time. It will not be.

Our stop rules:

  1. No retries with identical arguments. If the last tool call had the same name and the same arguments as this one, we block it and surface an error to the model.
  2. Hard cap on tool calls. After twenty tool calls, the agent must produce a final answer or abort. Twenty is plenty for anything that is not a research task.
  3. No more than three consecutive calls to the same tool. The model must use a different tool in between.

These rules catch about 90 per cent of the loop compulsion we see. The remaining 10 per cent is usually a real signal that the task is under-specified and the agent cannot finish it, which is useful to know before the customer finds out.
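The three stop rules can be sketched as a single guard that runs before each tool call is dispatched. This is an illustration, not the production code: the names `checkStopRules`, `ToolCall`, and `canonical` are hypothetical, and comparing arguments by a key-sorted serialisation is an assumption about what "the same arguments" means.

```typescript
type ToolCall = { name: string; args: Record<string, unknown> };
type GuardDecision = { allowed: true } | { allowed: false; reason: string };

const MAX_TOOL_CALLS = 20;
const MAX_CONSECUTIVE_SAME_TOOL = 3;

// Serialise with sorted keys so {a, b} and {b, a} compare as equal.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonical(obj[key])}`);
    return `{${parts.join(",")}}`;
  }
  return String(JSON.stringify(value));
}

// Decide whether `next` may run, given the calls already made in this task.
function checkStopRules(history: ToolCall[], next: ToolCall): GuardDecision {
  // Rule 2: hard cap on total tool calls.
  if (history.length >= MAX_TOOL_CALLS) {
    return {
      allowed: false,
      reason: `Tool call limit of ${MAX_TOOL_CALLS} reached. Produce a final answer or abort.`,
    };
  }

  // Rule 1: no retry with the same name and the same arguments as the last call.
  const last = history[history.length - 1];
  if (
    last !== undefined &&
    last.name === next.name &&
    canonical(last.args) === canonical(next.args)
  ) {
    return {
      allowed: false,
      reason: `Identical call to "${next.name}" was just made. The result will not change.`,
    };
  }

  // Rule 3: count how many of the most recent calls used this same tool.
  let run = 0;
  for (let i = history.length - 1; i >= 0 && history[i].name === next.name; i--) run++;
  if (run >= MAX_CONSECUTIVE_SAME_TOOL) {
    return {
      allowed: false,
      reason: `"${next.name}" has been called ${run} times in a row. Use a different tool first.`,
    };
  }

  return { allowed: true };
}
```

When a call is blocked, `reason` is what gets surfaced to the model in place of a tool result. Only calls that were actually executed should be appended to `history`.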

Failure three: context bloat

A long-running agent accumulates context. Every tool result gets appended. Every reasoning step gets appended. By turn fifteen the conversation is twelve thousand tokens of stale breadcrumbs, and the model starts making worse choices because the important context is buried.

The fix we use is aggressive. After every five turns, we summarise the progress so far into a short brief, drop the raw history, and restart the agent with the brief plus the original goal. We lose some fidelity, but the model picks up cleanly and does not drown. Cost per task falls by thirty to fifty per cent compared to naive context growth.
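A rough sketch of that compaction step, with several assumptions made for the sake of a runnable example. A "turn" here is one recorded entry, the summariser is passed in as a plain synchronous function (in practice it would be a model call), and the names `recordTurn`, `buildContext`, and `AgentState` are hypothetical.

```typescript
type Turn = { role: "assistant" | "tool"; content: string };
type AgentState = { goal: string; brief: string; turns: Turn[] };

const COMPACT_EVERY = 5;

// Hypothetical summariser signature. It receives the previous brief so that
// progress from earlier compactions is carried forward rather than lost.
type Summarise = (goal: string, previousBrief: string, turns: Turn[]) => string;

// Append a turn. Once COMPACT_EVERY raw turns have accumulated, fold them
// into the brief and drop the raw history.
function recordTurn(state: AgentState, turn: Turn, summarise: Summarise): AgentState {
  const turns = state.turns.concat([turn]);
  if (turns.length < COMPACT_EVERY) {
    return { goal: state.goal, brief: state.brief, turns: turns };
  }
  return {
    goal: state.goal,
    brief: summarise(state.goal, state.brief, turns),
    turns: [],
  };
}

// What the model sees on the next call: the original goal, the brief if one
// exists, and only the raw turns recorded since the last compaction.
function buildContext(state: AgentState): string {
  const lines = [`Goal: ${state.goal}`];
  if (state.brief !== "") lines.push(`Progress so far: ${state.brief}`);
  for (const turn of state.turns) lines.push(`${turn.role}: ${turn.content}`);
  return lines.join("\n");
}
```

The original goal is stored separately and never passes through the summariser, so repeated compaction cannot erode it.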

We also use prompt caching on the system prompt and tool definitions. Anthropic's cache cuts our per-turn latency meaningfully once the agent is into its work. If you are not using caching, you are paying for context you already paid for.
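For illustration, this is roughly how a request body for Anthropic's Messages API can mark the stable prefix as cacheable, using `cache_control` with type `ephemeral` on the system block and on the last tool definition. The function `buildRequest`, the `max_tokens` value, and the way the model ID is passed in are assumptions of this sketch; the exact fields and limits should be checked against Anthropic's current documentation.

```typescript
type CacheControl = { type: "ephemeral" };

type ToolDef = {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  cache_control?: CacheControl;
};

type Message = { role: "user" | "assistant"; content: string };

type RequestBody = {
  model: string;
  max_tokens: number;
  system: { type: "text"; text: string; cache_control: CacheControl }[];
  tools: ToolDef[];
  messages: Message[];
};

// Build a Messages API request body in which the tool definitions and the
// system prompt are marked as a cacheable prefix. Only `messages` changes
// from turn to turn, so the prefix can be reused across the agent's calls.
function buildRequest(
  model: string,
  systemPrompt: string,
  tools: ToolDef[],
  messages: Message[]
): RequestBody {
  // A breakpoint on the last tool covers every tool definition before it.
  const cachedTools = tools.map((tool, i): ToolDef =>
    i === tools.length - 1 ? { ...tool, cache_control: { type: "ephemeral" } } : tool
  );
  return {
    model: model,
    max_tokens: 1024, // arbitrary value for this sketch
    system: [{ type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } }],
    tools: cachedTools,
    messages: messages,
  };
}
```

Caching only pays off when the prefix is byte-for-byte stable, which is one more reason to keep tool definitions fixed for the life of a task.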

The verifier pattern

Before an agent commits an action that has side effects, we ask a second, smaller model to check the work. "Given this goal and this action about to happen, is the action appropriate? Answer yes or no with one sentence."

On a typical day the verifier blocks about five per cent of proposed actions, almost all of them clearly wrong in hindsight. It costs us very little to run because the verifier is Haiku, a small and cheap model. It is the cheapest reliability win we have added.

The thing the verifier taught us: the agent does not need more intelligence. It needs a second opinion at the action boundary. Human review eventually becomes a bottleneck. Machine review, done right, is a system.
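A sketch of the verifier gate, assuming the smaller model is reached through a function the caller supplies. `askSmallModel`, `verifyAction`, and the other names here are hypothetical. Two behaviours are choices of this sketch and not something stated above: any reply that does not clearly begin with "yes" counts as a rejection, and a verifier outage blocks the action.

```typescript
type PendingAction = { tool: string; args: Record<string, unknown> };
type Verdict = { approved: boolean; reason: string };

// Build the question for the verifier, using the wording quoted above.
function buildVerifierPrompt(goal: string, action: PendingAction): string {
  return [
    `Goal: ${goal}`,
    `Action about to happen: ${action.tool}(${JSON.stringify(action.args)})`,
    "Given this goal and this action about to happen, is the action appropriate?",
    "Answer yes or no with one sentence.",
  ].join("\n");
}

// Fail closed: only a reply that starts with the word "yes" approves the action.
function parseVerdict(reply: string): Verdict {
  const text = reply.trim();
  return { approved: /^yes\b/i.test(text), reason: text };
}

// Gate a side-effecting action behind the second model.
// `askSmallModel` is a hypothetical wrapper around the cheaper model.
async function verifyAction(
  goal: string,
  action: PendingAction,
  askSmallModel: (prompt: string) => Promise<string>
): Promise<Verdict> {
  try {
    const reply = await askSmallModel(buildVerifierPrompt(goal, action));
    return parseVerdict(reply);
  } catch (err) {
    // Assumption of this sketch: if the verifier is unreachable, do not act.
    return { approved: false, reason: "Verifier unavailable; action blocked." };
  }
}
```

Read-only tool calls can skip this gate entirely. The check belongs only in front of actions that change something.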

What we would tell ourselves earlier

  • Start with a small tool set. Grow it only when the agent asks for something and you have evidence it will use it well.
  • Evaluate on real tasks from day one, not synthetic ones. A dataset of twenty real customer cases beats a thousand synthetic ones.
  • Do not let the model pick the tool schema. You pick the schema. Your schema is the product.
  • Log every tool call with its arguments, its result, and the turn number. You will rebuild the system around this log.
  • Accept that some tasks are not agent tasks. A deterministic pipeline with one language-model call at the right step is often better than any agent.
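The logging point in the list above is easy to make concrete. One possible shape, offered as an assumption and not as the format actually in use, is one JSON object per tool call, written one per line so the log can be filtered and replayed later. The field names, `formatLogLine`, and `toolUsage` are hypothetical.

```typescript
type ToolCallLogEntry = {
  taskId: string; // hypothetical identifier for the task being worked on
  turn: number;
  tool: string;
  args: Record<string, unknown>;
  result: string;
  at: string; // ISO 8601 timestamp supplied by the caller
};

// One entry per line, so the log can be appended to and streamed.
function formatLogLine(entry: ToolCallLogEntry): string {
  return JSON.stringify(entry);
}

// Count calls per tool across a log. This gives evidence for which tools
// the agent actually uses before the tool set is grown or trimmed.
function toolUsage(entries: ToolCallLogEntry[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const entry of entries) {
    const seen = Object.prototype.hasOwnProperty.call(counts, entry.tool);
    counts[entry.tool] = (seen ? counts[entry.tool] : 0) + 1;
  }
  return counts;
}
```

Because each line carries the turn number, the same log also shows where in a task the stop rules and the verifier fired.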

An agent that plans for thirty seconds and then quietly gives up is not an agent. It is a demo that got into production by accident. If you are building systems that users pay for, the bar is the same as any other software: does it finish the job, and do you know when it does not?

That is the boring answer, and it is the honest one.
