![]() | A tiny LoRA adapter and a simple JSON prompt turn a 7B LLM into a powerful reward model that beats much larger ones - saving massive compute. It even helps a 7B model outperform top 70B baselines on GSM-8K using online RLHF [link] [comments] |
![]() | A tiny LoRA adapter and a simple JSON prompt turn a 7B LLM into a powerful reward model that beats much larger ones - saving massive compute. It even helps a 7B model outperform top 70B baselines on GSM-8K using online RLHF [link] [comments] |