New architecture scaling

The new Alibaba QwQ 32B is exceptional for its size and is pretty much SOTA in terms of benchmarks, we had deepseek r1 lite a few days ago which should be 15B parameters if it's like the last DeepSeek Lite. It got me thinking what would happen if we had this architecture with the next generation of scaled up base models (GPT-5), after all the efficiency gains we've had since GPT-4's release(Yi-lightning was around GPT-4 level and the training only costed 3 million USD), it makes me wonder what would happen in the next few months along with the new inference scaling laws and test time training. What are your thoughts?

submitted by /u/user0069420
[link] [comments]