Hi! I've developed ModelClash, an open-source framework for LLM evaluation that could offer some advantages over static benchmarks.
The project is in early stages, but initial tests with GPT and Claude models show promising results. I'm eager to hear your thoughts about this!