AIs can lie, even in their chain of thought. How is that possible?

AI is rapidly becoming more capable – the time horizon for coding tasks is doubling every 4-7 months. But we don’t actually know what these increasingly capable models are thinking. And that’s a problem. If we can’t tell what a model is thinking, then we can’t tell when it is downplaying its capabilities, cheating on tests, or straight up working against us.

Luckily we do have a lead: the chain of thought (CoT). This CoT is used in all top-performing language models. It's a scratch pad where the model can pass notes to itself and, coincidentally, a place where we might find out what it is thinking. Except, the CoT isn’t always faithful. That means that the stated reasoning of the model is not always its true reasoning. And we are not sure yet how to improve that.

However, some researchers now argue that we don’t need complete faithfulness. They argue monitorability is sufficient. While faithfulness means you can read the model’s mind and know what it is thinking. Monitorability means you can observe the model’s stated reasoning and predict what it will do (Baker et al., 2025).

We may now have a lead on good monitorability, but this quality is fragile. In this explainer, we’ll walk you through the details of how all this works and what you need to know, starting with…

submitted by /u/ExplorAI
[link] [comments]