AI#031 · January 13, 2026 · 6 min read

Why AI Agents Keep Failing at Long Tasks

The demos are impressive. An AI agent researches a topic, writes a report, sends emails, and books meetings, all autonomously. The reality in production is much messier. Agents that work beautifully on a five-step task fall apart on a twenty-step one. Understanding why reveals something fundamental about the current limits of AI architecture.


The compounding error problem

The core issue with long-horizon tasks isn't any single capability failure. It's error compounding. Every step in a multi-step task has some non-zero error rate. In a five-step task with 95% accuracy per step, your end-to-end success rate is about 77%. In a twenty-step task with the same per-step accuracy, it's 36%. In a fifty-step task, it's 8%.

This isn't fixable with better prompting or more capable models alone. It's a mathematical reality of chaining uncertain steps. Raising per-step accuracy from 95% to 99% helps significantly, lifting the twenty-step success rate from 36% to about 82%, but it doesn't change the compounding dynamic: at fifty steps you're back down to around 60%. Long-horizon reliability requires either dramatically higher per-step accuracy or the ability to detect and recover from errors mid-task.
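The arithmetic above takes only a few lines to reproduce. The per-step accuracies are the illustrative figures from this section, not measurements of any real agent, and the model assumes each step fails independently:

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent step errors."""
    return per_step_accuracy ** steps

# Mirrors the figures in the text: ~77%, ~36%, ~8% at 95% per step.
for steps in (5, 20, 50):
    print(f"{steps:>2} steps @ 95% per step: {end_to_end_success(0.95, steps):.0%}")

# A better model helps, but the curve still decays exponentially.
for steps in (20, 50):
    print(f"{steps:>2} steps @ 99% per step: {end_to_end_success(0.99, steps):.0%}")
```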

The context window constraint

Current language models work within a fixed context window. Long tasks generate long histories: by step 30 of a complex task, the model may be juggling thousands of tokens of prior instructions, tool outputs, and intermediate state. The further into a task, the more the model has to 'remember' and the harder it becomes to maintain consistent reasoning.

Context windows have grown dramatically (from 4,000 tokens in 2022 to over 200,000 in 2025), but the quality of attention over very long contexts degrades. Models tend to 'forget' instructions given early in the context and over-weight recent information. For long agentic tasks, this means the original goal can gradually drift as the task progresses.

What it would actually take to fix this

The most promising directions are architectural, not just scaling. Memory systems that let agents externalise and retrieve intermediate state, rather than keeping everything in the context window, address one core constraint. Better error detection and self-correction loops, where the agent can identify when a step went wrong and backtrack, address the compounding problem.
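A sketch of why error detection changes the maths: if a verifier can catch a failed step and the agent retries it once, per-step reliability rises from p to 1 − (1 − p)², which compounds far better. Everything below is a toy model under that assumption — `run_step` is a stand-in for an agent step and the verifier is assumed perfect, which no real system has:

```python
import random

def run_step(p_success: float) -> bool:
    """Stand-in for one agent step; succeeds with probability p_success."""
    return random.random() < p_success

def run_task(steps: int, p: float, max_retries: int = 1) -> bool:
    """Chain steps; a (hypothetical) perfect verifier lets failed steps retry."""
    for _ in range(steps):
        if not any(run_step(p) for _ in range(max_retries + 1)):
            return False  # step failed on every attempt: the whole task fails
    return True

# The analytic version of the same idea, no simulation needed:
p, steps = 0.95, 20
no_retry = p ** steps                     # ~0.36, as in the text
with_retry = (1 - (1 - p) ** 2) ** steps  # ~0.95 with a single verified retry
```

One verified retry per step turns a 36% task into a ~95% one, which is why self-correction loops attack the compounding problem directly rather than demanding a dramatically better base model.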

Human-in-the-loop checkpoints are an underrated solution. Rather than fully autonomous agents, the most reliable systems today combine AI with lightweight human review at decision nodes. It's not the sci-fi version of AI autonomy. But for tasks where reliability matters, it's what actually works.
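The checkpoint pattern can be as simple as gating designated steps on a review callback. The step names and the `approve` callback here are illustrative, not any specific framework's API:

```python
from typing import Callable

def run_with_checkpoints(
    plan: list[tuple[str, bool]],    # (step description, needs human review)
    execute: Callable[[str], None],  # carries out an approved step
    approve: Callable[[str], bool],  # human review at decision nodes
) -> bool:
    for description, needs_review in plan:
        if needs_review and not approve(description):
            return False  # reviewer rejected the proposed action
        execute(description)
    return True

# Only the irreversible step is gated on human review.
plan = [("draft the report", False), ("send it to the client", True)]
done = run_with_checkpoints(plan, execute=print, approve=lambda d: True)
```

The design choice is where to place the gates: reviewing every step recreates manual work, while gating only irreversible or high-stakes steps keeps the human cost light.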


