I think long-context agents are failing in a very boring way

I think people overestimate what a large context window actually buys you. A 200K-token window is not memory. It just gives the agent more room to bury the thing that mattered. The failures are usually boring too: the agent rereads the same file, forgets an earlier constraint, picks a tool that is technically valid but wrong for the task, then outputs something that looks fine until you compare it with the original task. A lot of “agent reliability” work is really context architecture work: what to load, what to drop, what to compress, and what to repeat before the next step.
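To make the load / drop / compress / repeat idea concrete, here is a rough Python sketch of a per-step context builder. Everything in it is an assumption for illustration, not anyone's actual agent framework: `Item`, `rough_tokens`, `compress`, and `build_context` are hypothetical names, the token count is a crude word count rather than a real tokenizer, and "compression" is plain truncation standing in for a real summarizer.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    pinned: bool = False  # pinned = a constraint to repeat before every step


def rough_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: one token per whitespace-separated word.
    return len(text.split())


def compress(text: str, max_words: int = 8) -> str:
    # Placeholder "compression": truncate. A real agent might summarize instead.
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + " ..."


def build_context(items: list[Item], budget: int, keep_recent: int = 2) -> list[str]:
    pinned = [i.text for i in items if i.pinned]
    rest = [i.text for i in items if not i.pinned]
    recent = rest[-keep_recent:] if keep_recent else []
    older = rest[:-keep_recent] if keep_recent else rest

    # Repeat: pinned constraints are always included, whatever the budget.
    used = sum(rough_tokens(t) for t in pinned)

    # Load: recent history verbatim, newest first so it wins when space is tight.
    kept_recent = []
    for text in reversed(recent):
        cost = rough_tokens(text)
        if used + cost <= budget:
            kept_recent.append(text)
            used += cost
    kept_recent.reverse()

    # Compress: older history is shortened. Drop: whatever still does not fit.
    kept_older = []
    for text in reversed(older):
        short = compress(text)
        cost = rough_tokens(short)
        if used + cost <= budget:
            kept_older.append(short)
            used += cost
    kept_older.reverse()

    # Constraints go last so they sit right before the next step.
    return kept_older + kept_recent + pinned


history = [
    Item("Constraint: never modify files under vendor", pinned=True),
    Item("step 1 read config file and found twelve settings that look "
         "unrelated to the task at hand"),
    Item("step 2 ran tests"),
    Item("step 3 edited main"),
]

for line in build_context(history, budget=30):
    print(line)
```

The point of the sketch is the ordering of decisions, not the specific heuristics: the constraint survives every budget, recent steps stay verbatim, and older steps get shortened before anything is dropped outright.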

submitted by /u/Old_Cap4710