flash models. quantized variants. distilled twins.
not breakthroughs, patches. because the real problem isn’t model capability, it’s infra stupidity. everyone’s racing to scale training runs, but inference is where things break:
– token bottlenecks kill latency
– cloud bills scale faster than use cases
– throughput ≠ performance if your routing sucks
Moore’s Law doesn’t apply here anymore: compute gets bigger, but deployment doesn’t get cheaper. So we’re hacking around it:
– same weights, slimmer runtime
– speculative decoding to fake speed
– routing layers to dodge expensive calls
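the speculative decoding trick above, as a toy sketch in python. both "models" here are made-up deterministic stand-ins (not real LLMs), and a real implementation verifies all drafted positions in a single forward pass of the big model:

```python
# toy sketch of greedy speculative decoding. idea: a cheap draft model
# guesses k tokens ahead, the big model checks them and keeps the longest
# agreeing prefix, so you pay ~one big-model step per k drafted tokens.
# both "models" below are fake deterministic token rules for illustration.

def draft_next(tokens):
    # stand-in for the small fast model
    return (tokens[-1] + 1) % 10

def target_next(tokens):
    # stand-in for the big model: agrees with the draft except every 4th step
    nxt = (tokens[-1] + 1) % 10
    return nxt if len(tokens) % 4 else (nxt + 5) % 10

def speculative_decode(tokens, k=4, steps=3):
    for _ in range(steps):
        # 1) draft k tokens with the cheap model
        drafted, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            drafted.append(t)
            ctx.append(t)
        # 2) verify with the target model: keep the agreeing prefix,
        #    and at the first mismatch keep the target's own token
        ctx = list(tokens)
        for d in drafted:
            t = target_next(ctx)
            ctx.append(t)
            if t != d:
                break
        tokens = ctx
    return tokens
```

when draft and target mostly agree, you get ~k tokens per expensive step; when they disagree, you silently fall back to one-token-per-step, which is exactly the "approximation that fails on the request that mattered" failure mode.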
most prod LLM apps don’t run full models. they run approximations. and that’s fine until it silently fails on the one request that mattered. what we’re seeing is the shift from “best model” to “best pipeline.” and in that world, infra design > parameter count.
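a "best pipeline" routing layer can start as dumb as this sketch (model names, prices, and heuristics are all made up; real routers usually train a small classifier instead):

```python
# toy sketch of a cost-dodging router: score the request, send easy ones
# to a cheap model, only pay for the big one when it looks hard.
# everything here (prices, names, heuristics) is hypothetical.

PRICES = {"small": 0.2, "large": 3.0}  # fake $ per 1M tokens

def looks_hard(prompt: str) -> bool:
    # crude stand-in heuristics, not a real difficulty classifier
    hard_words = ("prove", "derive", "refactor")
    return len(prompt) > 200 or any(w in prompt.lower() for w in hard_words)

def route(prompt: str) -> str:
    # returns which (hypothetical) model tier should serve this request
    return "large" if looks_hard(prompt) else "small"
```

the failure mode is the same as above: the router is itself an approximation, and the one request it misclassifies is the one that mattered.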
so who’s actually optimizing for cost per correct token, not just bragging about eval scores?
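for reference, "cost per correct token" is just spend divided by tokens that were actually right, which is why a cheaper-but-sloppier model doesn't automatically win (all numbers below are hypothetical):

```python
# sketch of cost-per-correct-token as a metric. all inputs are made up.
# idea: divide eval spend by the number of emitted tokens that were correct,
# so accuracy and price trade off in one number.

def cost_per_correct_token(total_cost: float, tokens_emitted: int,
                           accuracy: float) -> float:
    correct = tokens_emitted * accuracy  # expected number of correct tokens
    return total_cost / correct

# a model that's 4x cheaper but 0.40 vs 0.95 accurate:
cheap = cost_per_correct_token(1.0, 1_000_000, 0.40)  # cheap + sloppy
big = cost_per_correct_token(4.0, 1_000_000, 0.95)    # pricey + accurate
```

with these made-up numbers the cheap model still wins, but only ~1.7x instead of the 4x its price tag suggests; shift accuracy a bit further and the ranking flips, which eval-score bragging never shows.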