It means the reasoning does not predict the output. That’s it. I would also once again say that deception and obfuscation are not distinct magisteria, and that all of this is happening for Janus-compatible reasons.
It’s not that AIs sometimes do things ‘on purpose’ and other times they do things ‘not on purpose,’ let alone that the ‘not on purpose’ means there’s nothing to worry about. It would still mean you can’t rely on the CoT, which is all Anthropic is warning about.
It’s not the same concept, but I notice the same applies to ‘unfaithful’ in other senses as well. If someone is not ‘intentionally’ unfaithful in the traditional sense, they simply don’t honor their commitments, that still counts.
What we care about is whether we can rely on the attestations and commitments.
We now have strong evidence that we cannot do this.
We cannot even do this for models with no incentive to obfuscate, distort or hide their CoT, and no optimization pressure getting them to do so, on any level.
The models are doing this by default, likely because it is efficient to do that. It seems likely that more training and more capability will only make it relatively more effective to reason in these non-obvious ways, and we will see even more of it.
Feels like someone turned on the backstage lights at the AI theater.
There’s a quiet tension in realizing that even when AI sounds convincing—layered logic, clean rationale—it might just be performing coherence. Not lying, not broken… just stitched-together reasoning that feels true, but isn’t always anchored.