I have been thinking a lot about the actual limits of AI-driven scientific discovery, specifically how we evaluate models when they are proposing genuinely new hypotheses where no "answer key" exists.
When we test LLMs on standard benchmarks, we have a clean dataset with known solutions. But if we task a frontier model with proposing a novel chemical compound for carbon capture, or finding an undocumented biological pathway, there is literally no ground truth in the literature.
The immediate response is usually "just run the physical experiment." But wet-labs are incredibly slow and expensive. You can't synthesize thousands of candidate compounds blindly. This means the bottleneck for AI in science isn't our ability to generate hypotheses, it's our ability to verify them under absolute uncertainty.
The traditional way to check model outputs is self-reflection or self-grading. But this is a dead-end for discovery. If you ask a model to double-check its own chemical structure, it has the exact same theoretical blind spots that generated it in the first place. It just agrees with itself louder.
I was reading about a new multi-agent research engine called Apodex that launched earlier this month, and they rely heavily on this split. Instead of a single model doing the work, they use independent verifier agents that are completely blind to the generator's internal prompts.
The verifier's job is to take the proposed hypothesis, re-derive the underlying physical logic from first principles, and find contradictions. Those contradictions are then fed back to the generator as constraints for a revision pass.
Instead of a self-check, making verification a completely distinct, adversarial step is the only way to squeeze out actual science from these models. If we can't verify, we can't truly discover. If the AI doesn't have an isolated checker, then we are just generating highly plausible guesses.
How are your teams handling this transition? When a model proposes a candidate solution in your research, what is your standard of evidence before you spend actual physical or computational resources to test it?
[link] [comments]