Llion Jones said “2026 is the continual learning year” in the recent Post-Transformer debate. Sutton and Silver call the next phase the “era of experience”.
What’s continual learning? Simply put, it’s a model’s ability to keep improving as it gains experience, without catastrophic forgetting. In other words, it is the stability-plasticity tradeoff applied to a reasoning model. In practice it comes down to one question: where does the memory live?
- Outside the model. Memory files, vector dbs, graphs. Text is retrieved and pasted back into context. The model stays frozen.
- In the model's running state. Hidden states or fast weights that change while the model processes input.
- In the model's weights. What the model actually knows. Experience is encoded into the weights themselves, improving its decision-making patterns without forgetting what it learned earlier.
Dev docs today point to the first option, memory outside the model. But the “2026 is the continual learning year” claim does not come from there. Why?
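To make the first option concrete, here is a minimal sketch of "memory outside the model": notes are ranked against the query and the best ones are pasted into the prompt, while the model itself never changes. The word-overlap scoring and the names `overlap_score` and `build_prompt` are illustrative assumptions, not any product's API; real memory layers typically use embeddings and a vector store instead.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split into alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(query, note):
    # Toy relevance score: number of word occurrences shared by query and note.
    q, n = Counter(tokenize(query)), Counter(tokenize(note))
    return sum((q & n).values())

def build_prompt(query, memory, k=2):
    # Retrieve the k most relevant notes and paste them ahead of the question.
    # The model that receives this prompt stays frozen.
    ranked = sorted(memory, key=lambda note: overlap_score(query, note), reverse=True)
    context = "\n".join(f"- {note}" for note in ranked[:k])
    return f"Relevant notes:\n{context}\n\nQuestion: {query}"

# Hypothetical memory file contents.
memory = [
    "User prefers answers in Python.",
    "Project deadline is Friday.",
    "User's favourite editor is Vim.",
]
prompt = build_prompt("Which editor does the user like?", memory, k=1)
print(prompt)
```

Everything the "memory" does here happens in text before the model is called, which is why this approach ships easily but never changes what the model knows.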
Part 1: The Memento stack (today’s stack)
There are engineering fixes for the LLM’s memory problem. Julian Togelius and a16z compared the situation to Memento. In the movie, Leonard functions with his Polaroids and notes, but every day he is the same man as on day 0. Progress on this front includes:
- Anthropic's Dreaming: an async job to manage “memories”, explicitly modeled on sleep consolidation.
- Long context as memory: Visibly good, but with 3 problems. a) Position bias and the "lost in the middle" problem. b) Longer context windows come with bigger costs, and we’re already discussing “token economics”. c) The KV cache becomes a bottleneck, and everything evaporates when the request ends.
- Mem0, Letta, Zep: the popular memory-layer products from startups.
- AGENTS.md and git-style memory files: An ETH Zurich paper (arXiv 2602.11988) found that LLM-generated context files actually reduce task success by about 3% while raising cost by over 20%. Human-written ones barely helped either.
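On the KV cache point in the long-context bullet, a rough back-of-the-envelope helps. For a standard attention stack, each token stores one key and one value vector per layer per KV head, so the cache grows linearly with context length. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a made-up configuration for illustration, not any specific model's.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    # Factor 2: one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch

# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072, 1_048_576):
    gib = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**30
    print(f"{seq_len:>9} tokens -> {gib:.1f} GiB of KV cache per request")
```

Under these assumptions the cache costs 128 KiB per token, so a million-token window needs on the order of a hundred GiB for a single request, and all of it is discarded when the request ends.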
Part 2: Continual learning, memory within the model (the big bet)
Weight updates in large networks trigger catastrophic forgetting. A January 2026 paper tried continual fine-tuning on LRMs (arXiv 2601.18699), and catastrophic forgetting did not fade; it increased. Promising directions that could solve this:
- TTT layers (arXiv 2407.04620, ICML 2025): the hidden state of the sequence layer is itself a small model, updated by gradient descent on tokens as they stream in. Matches or beats Transformer / Mamba baselines up to 1.3B params.
- Titans & Atlas: Titans adds a neural long-term memory that decides what to store using a surprise signal. Atlas upgrades the memory's learning rule.
- Nested Learning + HOPE: The architecture updates different blocks at different frequencies. RNNs are also coming closer to Transformers via the viral Memory Caching papers.
- Dragon Hatchling (BDH): From AI lab Pathway (arXiv 2509.26507). Working memory lives in Hebbian synapses rather than in a KV cache, allowing for an "infinite context window" without quadratic cost.
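To show what "the hidden state is a small model updated by gradient descent" means mechanically, here is a toy sketch in the spirit of the TTT bullet above. It is a simplification under stated assumptions: the inner model is a single linear map `W`, the self-supervised loss is plain reconstruction, 0.5 * ||W x - x||^2, and updates happen one token at a time. The actual paper uses learned projections and mini-batched updates, so treat this as an illustration of the idea, not the published method. The function names are hypothetical.

```python
def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ttt_step(W, x, lr=0.1):
    # One inner-loop step: gradient descent on 0.5 * ||W x - x||^2 w.r.t. W.
    err = [p - xi for p, xi in zip(matvec(W, x), x)]
    loss = 0.5 * sum(e * e for e in err)
    # The gradient is the outer product err * x^T.
    W_new = [[w - lr * e * xi for w, xi in zip(row, x)] for row, e in zip(W, err)]
    return W_new, loss

def run_layer(tokens, dim, lr=0.1):
    # W plays the role of the hidden state: it starts empty and is
    # rewritten by every token that streams through.
    W = [[0.0] * dim for _ in range(dim)]
    losses, outputs = [], []
    for x in tokens:
        W, loss = ttt_step(W, x, lr)   # learn from the token
        losses.append(loss)            # loss measured before the update
        outputs.append(matvec(W, x))   # emit output with the updated state
    return W, losses, outputs

# Two toy "token embeddings" repeated 20 times each.
tokens = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]] * 20
W, losses, outputs = run_layer(tokens, dim=3)
print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.4f}")
```

The reconstruction loss shrinks as the same patterns recur, which is the sense in which the layer "learns at test time": memory lives in `W`, not in a growing cache of past tokens.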
AMI Labs, LFMs, etc. also mention continual learning, but I didn’t find much specific info from them on this front.
Current State and Future Outlook
Where is continual learning in mid-2026?
- Solved with public access: nothing.
- Shipping in production: only the Memento stack, all frozen models.
- Demonstrated at research scale (< 2B params): TTT, Titans, Memory Caching, HOPE, and BDH.
What would move the needle imo: shipping memory within the model, with forgetting measurably controlled.
Two questions though:
- What is OpenAI brewing in all of this?
- What’s the blocker to adoption for continual learning models: the missing breakthrough itself, or evals, serving economics, etc.?