Your explanation is largely correct. The reason “memory” has become the dominant systems problem for LLMs is that LLM inference, and token-by-token decoding in particular, is increasingly memory-bandwidth bound rather than compute bound.
The key shift is this:
Training large models was mostly about FLOPs.
Serving large models at scale is increasingly about moving KV cache data around fast enough.
A single token-generation step performs relatively little math compared with the amount of data (model weights plus KV cache) that must be fetched from memory on every step.
Why this happens
During inference, every new token attends to all prior tokens.
So for token t, the model needs access to all prior K/V tensors:
$$\text{KV cache size} \propto 2 \times L \times S \times H \times d$$
Where:
L = layers
S = sequence length
H = attention heads whose K/V are cached (the KV heads, which are fewer than the query heads under GQA or MQA)
d = head dimension
The killer is the S term.
As context grows:
8K → manageable
128K → huge
1M → infrastructure problem
A 70B model with long context can require hundreds of GBs of KV cache across concurrent users.
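To put rough numbers on the proportionality above, here is a minimal sizing sketch. The shape (80 layers, 8 KV heads, head dimension 128) is an assumed 70B-class configuration using grouped-query attention, not something stated in the post, and `kv_cache_bytes` is a hypothetical helper name. FP16 storage (2 bytes per element) is also an assumption.

```python
def kv_cache_bytes(layers, seq_len, kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """Bytes of KV cache: K and V (the factor 2) per layer, per token, per KV head."""
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_per_elem * batch

# Assumed 70B-class shape with grouped-query attention (illustrative only).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

GIB = 1024 ** 3
for label, seq_len in [("8K", 8 * 1024), ("128K", 128 * 1024), ("1M", 1024 * 1024)]:
    size = kv_cache_bytes(LAYERS, seq_len, KV_HEADS, HEAD_DIM)
    print(f"{label:>4}: {size / GIB:.1f} GiB per sequence")
```

Under these assumptions a single sequence needs 2.5 GiB at 8K, 40 GiB at 128K, and 320 GiB at 1M tokens, before multiplying by the number of concurrent users.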
Why bandwidth matters more than raw compute
Modern GPUs like the NVIDIA H100 or NVIDIA Blackwell can perform enormous amounts of compute.
But every generated token requires:
Loading KV cache from memory
Running attention
Appending the new token's K/V to the cache (earlier entries are not rewritten)
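The per-token steps listed above can be sketched for a single attention head in plain Python. This is a toy illustration under simplifying assumptions (one head, one layer, lists instead of GPU tensors), and `decode_step` is a hypothetical name. Note that only the new token's K/V is written, while every cached entry is read.

```python
import math

def decode_step(q, k_new, v_new, k_cache, v_cache):
    """One single-head attention step for a newly generated token.

    q, k_new, v_new are lists of floats of length d; the caches are lists of
    such vectors, one entry per earlier token.
    """
    # Append this token's K/V; earlier cache entries are never rewritten.
    k_cache.append(k_new)
    v_cache.append(v_new)

    d = len(q)
    # Read every cached key: this is the memory traffic that grows with S.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in k_cache]

    # Numerically stable softmax over all positions seen so far.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Read every cached value and form the weighted sum.
    return [sum(w * v[i] for w, v in zip(weights, v_cache)) for i in range(d)]

k_cache, v_cache = [], []
first = decode_step([1.0, 0.0], [1.0, 0.0], [0.5, 0.5], k_cache, v_cache)
second = decode_step([0.0, 0.0], [0.0, 0.0], [1.5, 1.5], k_cache, v_cache)
print(first, second, len(k_cache))
```

The arithmetic per cached token is a handful of multiply-adds, but each step touches the whole cache, which is why the read traffic rather than the math tends to set the pace.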
That means inference speed often depends more on:
HBM bandwidth
memory locality
cache management
than tensor core throughput.
This is why:
HBM3E
NVLink
unified memory
memory compression
have become strategic priorities.
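The claim that bandwidth matters more than tensor-core throughput can be checked with a back-of-envelope estimate. The sketch below compares the time to do the math for one batch-1 decode step against the time to stream the weights and KV cache from memory. The hardware figures are round, assumed values for an H100-class device rather than datasheet numbers, and the estimate ignores attention FLOPs, multi-GPU sharding, and the fact that 140 GB of weights does not fit on a single 80 GB device.

```python
def decode_step_times(params, kv_bytes, bytes_per_param, peak_flops, mem_bandwidth):
    """Return (compute_seconds, memory_seconds) for one batch-1 decode step."""
    flops = 2 * params                                  # ~one multiply-add per weight
    bytes_moved = params * bytes_per_param + kv_bytes   # weights + KV cache read once
    return flops / peak_flops, bytes_moved / mem_bandwidth

# Assumed round numbers for illustration; check vendor datasheets for real values.
compute_s, memory_s = decode_step_times(
    params=70e9,
    kv_bytes=2.5 * 1024**3,   # e.g. one 8K-token sequence in a GQA 70B-class model
    bytes_per_param=2,        # FP16 weights
    peak_flops=1e15,          # ~1 PFLOP/s dense FP16 (assumed)
    mem_bandwidth=3e12,       # ~3 TB/s HBM (assumed)
)
print(f"compute: {compute_s * 1e3:.2f} ms, memory: {memory_s * 1e3:.2f} ms")
```

With these inputs the memory time is a few hundred times the compute time. Batching amortizes the weight reads across requests, but each request still brings its own KV cache, so the KV term grows with both batch size and context length.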
Why the KV cache can exceed model weights
Model weights are static.
KV cache is dynamic and scales with:
users
context length
output length
batch size
Example intuition:
70B model weights occupy ~140 GB in FP16 (70B parameters × 2 bytes each)
But serving thousands of users with long contexts can require multiple TBs of KV cache
So operators increasingly optimize:
cache reuse
eviction
paging
quantization
instead of just model size.
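As a rough illustration of why KV quantization is one of those levers, the sketch below counts how many full-length sequences fit in the memory left over after the weights. The node size (8 devices of 80 GB), the model shape, and the 32K context are assumptions for the example, and `max_concurrent_sequences` is a hypothetical helper. It ignores activations, allocator overhead, and any accuracy cost of quantizing the cache.

```python
def max_concurrent_sequences(total_mem_bytes, weight_bytes, layers, kv_heads,
                             head_dim, context_len, kv_bytes_per_elem):
    """How many full-length sequences fit in memory left over after the weights."""
    per_seq = 2 * layers * kv_heads * head_dim * context_len * kv_bytes_per_elem
    free = total_mem_bytes - weight_bytes
    return max(0, free // per_seq)

common = dict(
    total_mem_bytes=8 * 80 * 10**9,        # assumed node: 8 devices x 80 GB
    weight_bytes=140 * 10**9,              # ~140 GB of FP16 weights, as in the text
    layers=80, kv_heads=8, head_dim=128,   # assumed 70B-class shape
    context_len=32 * 1024,
)
fp16 = max_concurrent_sequences(**common, kv_bytes_per_elem=2)
int8 = max_concurrent_sequences(**common, kv_bytes_per_elem=1)
print(fp16, int8)
```

Under these assumptions the node holds 46 concurrent 32K-token sequences with an FP16 cache and 93 with an 8-bit cache, without touching the model weights.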
Why vLLM and PagedAttention mattered so much
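Before systems like vLLM, serving stacks typically reserved one contiguous slab of KV memory per request, sized for the maximum possible length, so a large share of GPU memory was lost to fragmentation and over-reservation.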
PagedAttention essentially borrowed ideas from operating systems:
divide KV into pages
allocate dynamically
avoid contiguous memory assumptions
That dramatically improved:
utilization
batching
throughput
This was one of the biggest inference infrastructure breakthroughs of the last few years because it improved economics without changing the model itself.
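The paging idea can be illustrated with a toy block-table allocator. This is a sketch of the general technique only, not vLLM's actual implementation or API: the class name, method names, and block size are all hypothetical, and block IDs stand in for fixed-size regions of GPU memory.

```python
class PagedKVAllocator:
    """Toy block-table allocator; block IDs stand in for fixed-size KV pages."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size            # tokens per block
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                  # seq_id -> list of block IDs
        self.seq_lens = {}                      # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; returns (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:            # last block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller must evict or preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("a")      # 17 tokens -> 2 blocks, not a max-length slab
alloc.append_token("b")          # a second sequence draws from the same pool
print(alloc.block_tables, len(alloc.free_blocks))
```

Because a sequence only holds the blocks it has actually filled, waste is bounded by one partially filled block per sequence, and blocks freed by a finished request are immediately reusable by any other request regardless of where they sit in memory.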
The deeper issue: transformers scale poorly with context
Standard attention fundamentally has a retrieval problem:
Each token potentially references every prior token.
Even though compute optimizations exist, the architecture still requires huge memory movement.
That’s why researchers and model builders are adopting or exploring:
Grouped Query Attention (GQA)
Multi-Query Attention (MQA)
sliding window attention
recurrent memory
state-space models
hybrid retrieval systems
The industry increasingly believes:
infinite-context transformers using naive KV scaling are economically unsustainable.
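The first few techniques in that list attack the KV footprint directly: GQA and MQA shrink the number of cached heads, and a sliding window caps the number of cached tokens. A small comparison, assuming a hypothetical shape of 80 layers, 64 query heads of dimension 128, an FP16 cache, and a 128K-token context, shows the scale of the effect. It says nothing about the quality trade-offs of each variant.

```python
def kv_bytes_per_sequence(layers, kv_heads, head_dim, seq_len,
                          window=None, bytes_per_elem=2):
    """KV bytes for one sequence; a sliding window caps the tokens kept."""
    kept = seq_len if window is None else min(seq_len, window)
    return 2 * layers * kv_heads * head_dim * kept * bytes_per_elem

# Assumed shape: 80 layers, 64 query heads of dimension 128, 128K-token context.
LAYERS, HEAD_DIM, SEQ = 80, 128, 128 * 1024
GIB = 1024 ** 3

variants = {
    "MHA (64 KV heads)": dict(kv_heads=64),
    "GQA (8 KV heads)": dict(kv_heads=8),
    "MQA (1 KV head)": dict(kv_heads=1),
    "GQA + 4K sliding window": dict(kv_heads=8, window=4096),
}
for name, kw in variants.items():
    size = kv_bytes_per_sequence(LAYERS, head_dim=HEAD_DIM, seq_len=SEQ, **kw)
    print(f"{name:<26} {size / GIB:7.2f} GiB")
```

Under these assumptions the per-sequence cache drops from 320 GiB with full multi-head attention to 40 GiB with GQA, 5 GiB with MQA, and 1.25 GiB with GQA plus a 4K window.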
Why inference economics are now the focus
Training frontier models is expensive.
But operating them continuously at global scale can cost even more in total.
For many providers:
inference cost dominates
memory dominates inference cost
That’s why companies across the stack are racing on memory:
NVIDIA → HBM + NVLink + Grace
AMD → MI300 unified memory
Cerebras → wafer-scale SRAM
Groq → deterministic low-latency SRAM-heavy architecture
Marvell Technology → custom memory fabrics
The bottleneck has shifted from:
“Can we train bigger models?”
to:
“Can we serve them cheaply and fast enough?”