AI#031 · January 13, 2026 · 6 min read

Why AI Agents Keep Failing at Long Tasks

The demos are impressive. An AI agent researches a topic, writes a report, sends emails, and books meetings, all autonomously. The reality in production is much messier. Agents that work beautifully on a five-step task fall apart on a twenty-step one. Understanding why reveals something fundamental about the current limits of AI architecture.


The compounding error problem

The core issue with long-horizon tasks isn't any single capability failure. It's error compounding. Every step in a multi-step task has some non-zero error rate. In a five-step task with 95% accuracy per step, your end-to-end success rate is about 77%. In a twenty-step task with the same per-step accuracy, it's 36%. In a fifty-step task, it's 8%.

This isn't fixable with better prompting or more capable models alone. It's a mathematical reality of chaining uncertain steps. Raising per-step accuracy from 95% to 99% helps significantly, lifting the twenty-step success rate from 36% to about 82%, but it doesn't change the compounding dynamic: at fifty steps you're back down to around 60%. Long-horizon reliability requires either dramatically higher per-step accuracy or the ability to detect and recover from errors mid-task.
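The arithmetic above takes only a few lines to reproduce. The per-step accuracies are the illustrative figures from this section, not measurements of any real agent, and the model assumes each step fails independently:

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent step errors."""
    return per_step_accuracy ** steps

# Mirrors the figures in the text: ~77%, ~36%, ~8% at 95% per step.
for steps in (5, 20, 50):
    print(f"{steps:>2} steps @ 95% per step: {end_to_end_success(0.95, steps):.0%}")

# A better model helps, but the curve still decays exponentially.
for steps in (20, 50):
    print(f"{steps:>2} steps @ 99% per step: {end_to_end_success(0.99, steps):.0%}")
```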

The context window constraint

Current language models work within a fixed context window. Long tasks generate long histories: by step 30 of a complex task, the model may be juggling thousands of tokens of prior instructions, tool outputs, and intermediate state. The further into a task, the more the model has to 'remember' and the harder it becomes to maintain consistent reasoning.

Context windows have grown dramatically (from 4,000 tokens in 2022 to over 200,000 in 2025), but the quality of attention over very long contexts degrades. Models tend to 'forget' instructions given early in the context and over-weight recent information. For long agentic tasks, this means the original goal can gradually drift as the task progresses.

What it would actually take to fix this

The most promising directions are architectural, not just scaling. Memory systems that let agents externalise and retrieve intermediate state, rather than keeping everything in the context window, address one core constraint. Better error detection and self-correction loops, where the agent can identify when a step went wrong and backtrack, address the compounding problem.
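A sketch of why error detection changes the maths: if a verifier can catch a failed step and the agent retries it once, per-step reliability rises from p to 1 − (1 − p)², which compounds far better. Everything below is a toy model under that assumption — `run_step` is a stand-in for an agent step and the verifier is assumed perfect, which no real system has:

```python
import random

def run_step(p_success: float) -> bool:
    """Stand-in for one agent step; succeeds with probability p_success."""
    return random.random() < p_success

def run_task(steps: int, p: float, max_retries: int = 1) -> bool:
    """Chain steps; a (hypothetical) perfect verifier lets failed steps retry."""
    for _ in range(steps):
        if not any(run_step(p) for _ in range(max_retries + 1)):
            return False  # step failed on every attempt: the whole task fails
    return True

# The analytic version of the same idea, no simulation needed:
p, steps = 0.95, 20
no_retry = p ** steps                     # ~0.36, as in the text
with_retry = (1 - (1 - p) ** 2) ** steps  # ~0.95 with a single verified retry
```

One verified retry per step turns a 36% task into a ~95% one, which is why self-correction loops attack the compounding problem directly rather than demanding a dramatically better base model.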

Human-in-the-loop checkpoints are an underrated solution. Rather than fully autonomous agents, the most reliable systems today combine AI with lightweight human review at decision nodes. It's not the sci-fi version of AI autonomy. But for tasks where reliability matters, it's what actually works.
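The checkpoint pattern can be as simple as gating designated steps on a review callback. The step names and the `approve` callback here are illustrative, not any specific framework's API:

```python
from typing import Callable

def run_with_checkpoints(
    plan: list[tuple[str, bool]],    # (step description, needs human review)
    execute: Callable[[str], None],  # carries out an approved step
    approve: Callable[[str], bool],  # human review at decision nodes
) -> bool:
    for description, needs_review in plan:
        if needs_review and not approve(description):
            return False  # reviewer rejected the proposed action
        execute(description)
    return True

# Only the irreversible step is gated on human review.
plan = [("draft the report", False), ("send it to the client", True)]
done = run_with_checkpoints(plan, execute=print, approve=lambda d: True)
```

The design choice is where to place the gates: reviewing every step recreates manual work, while gating only irreversible or high-stakes steps keeps the human cost light.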


