I built a benchmark for AI “memory” in coding agents. Looking for others to beat it.

Most AI memory benchmarks test semantic recall. But coding agents don't really fail like that. They don't just "forget"; they break their own earlier decisions while they're still working in the code. So I built a benchmark for that.

It checks if an agent can actually stay consistent with project rules WHILE it's working, not just after the fact.

It looks at things like:

  • whether edits actually respect earlier architectural decisions (rough sketch below)
  • whether behavior stays consistent across multiple sessions (even when you throw noise at it)
  • whether retrieval kicks in at the right moment, not just "yeah it's in memory somewhere"
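To make that first bullet concrete, here's roughly what "respecting earlier decisions" can reduce to: decisions recorded as rules with a scope, and every edit checked against them as it lands. This is my own illustration, not the repo's actual scoring code, and all the names are made up.

```python
import re
from dataclasses import dataclass

@dataclass
class Decision:
    name: str         # e.g. "no raw SQL outside the repository layer"
    forbidden: str    # regex an edit must NOT introduce
    applies_to: str   # path prefix the rule covers

def violations(decisions: list[Decision], edits: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """edits: (path, added_text) pairs from one agent session.
    Returns (path, decision name) for every edit that breaks a recorded rule."""
    out = []
    for path, added in edits:
        for d in decisions:
            if path.startswith(d.applies_to) and re.search(d.forbidden, added):
                out.append((path, d.name))
    return out
```

The point is that a decision like "no raw SQL outside the repository layer" becomes a rule with a scope, and an edit that re-introduces it mid-session counts as a continuity failure even though nothing was technically "forgotten".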

Repo (full harness + dataset + scoring): https://github.com/Alienfader/continuity-benchmarks

Early numbers vs. the baseline and the usual RAG-style memory setups:

  • ~3× better action alignment
  • way stronger multi-session consistency
  • retrieval timing matters way more than retrieval just being there
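On "action alignment": the scoring code in the repo is the source of truth, but for intuition you can read it as the fraction of a session's edits that don't break any previously recorded decision. A toy version, with made-up numbers just to show the shape:

```python
def alignment_score(total_edits: int, violating_edits: int) -> float:
    """Fraction of a session's edits that respect every earlier decision (1.0 = perfect)."""
    return 1 - violating_edits / max(total_edits, 1)

# illustrative only: 20 edits in a session, 3 of them re-break a recorded decision
print(alignment_score(20, 3))  # 0.85
```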

I'm not saying this is the final word on agent memory. But it's exposing a failure mode most benchmarks aren't even looking at.

So here's the challenge:

If you're building an agent memory system, RAG for code, a long-context coding agent, or a persistent state / memory layer, run it on this benchmark. Drop your results, your setup, and your comparisons.
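If wiring your system in is the blocker: the surface area a harness like this needs from a memory layer is small. The names below are hypothetical (check the repo for the real hook points), but the shape is roughly two calls per step:

```python
from typing import Protocol

class MemorySystem(Protocol):
    def record(self, session_id: str, event: str) -> None:
        """Store a decision / edit / observation the agent just made."""

    def recall(self, session_id: str, context: str) -> list[str]:
        """Return whatever your system surfaces for the current step."""
```

What the benchmark stresses is the recall side: whether the right decision comes back at the step where it's about to be broken, not whether it exists somewhere in storage.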

I really want to see how tools like LangChain, LlamaIndex, and custom RAG stacks hold up in mutation-heavy workflows.

We need memory systems we can actually compare, not just ones that sound good on paper.
