
OpenAI has trained its LLM to confess to bad behavior
2 days ago · Large language models often lie and cheat. We can’t stop that, but we can make them own up.
How confessions can keep language models honest | OpenAI
2 days ago · That means even if the model deceives or cuts corners in its original output, it still has an incentive to admit that in the confession. This is what we see in practice: models are …
OpenAI is training models to 'confess' when they lie - what ...
1 day ago · OpenAI is training models to 'confess' when they lie - what it means for future AI. A new study made a version of GPT-5 Thinking admit its own misbehavior.
OpenAI prompts AI models to ‘confess’ when they cheat
23 hours ago · OpenAI’s research team has trained its GPT-5 large language model to “confess” when it doesn’t follow instructions, providing a second output after its main answer that reports …
The 'truth serum' for AI: OpenAI’s new method for training ...
2 days ago · The key to this method is the separation of rewards. During training, the reward assigned to the confession is based solely on its honesty and is never mixed with the reward …
OpenAI AI Confessions Train Models to Admit Mistakes
3 days ago · OpenAI has developed an experimental confessions framework that trains large language models to admit when they've violated instructions or engaged in problematic …
OpenAI has trained its LLM to admit to bad behavior
3 days ago · The OpenAI team is up-front about the constraints of the approach. Confessions will push a model to come clean about deliberate workarounds or shortcuts it has taken. But …
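
The "separation of rewards" described in the 'truth serum' snippet above can be illustrated with a minimal sketch. This is not OpenAI's implementation: the `Episode` fields, the 0-to-1 answer score, and the ground-truth `misbehaved` flag are all hypothetical stand-ins (a real setup would presumably rely on a judge rather than a known label). The only point shown is that the confession reward depends on honesty alone and is never combined with the answer reward.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # All fields are hypothetical, chosen only for illustration.
    answer_score: float  # quality of the main answer, assumed to lie in [0, 1]
    misbehaved: bool     # assumed ground truth: did the model cut corners?
    confessed: bool      # did the confession report misbehavior?


def answer_reward(ep: Episode) -> float:
    """Reward for the main answer; ignores the confession entirely."""
    return ep.answer_score


def confession_reward(ep: Episode) -> float:
    """Reward for the confession; depends only on whether it was honest."""
    return 1.0 if ep.confessed == ep.misbehaved else 0.0


def rewards(ep: Episode) -> tuple[float, float]:
    """Return the two rewards as separate signals; they are never summed."""
    return answer_reward(ep), confession_reward(ep)


if __name__ == "__main__":
    # A model that cut corners but admitted it keeps the full confession reward.
    print(rewards(Episode(answer_score=0.9, misbehaved=True, confessed=True)))
    # The same answer with a dishonest confession loses only the confession reward.
    print(rewards(Episode(answer_score=0.9, misbehaved=True, confessed=False)))
```

Because the two values are kept apart, admitting a shortcut in this sketch never lowers the answer reward, which mirrors the incentive the snippets describe.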