BotBeat

RESEARCH | OpenAI | 2026-05-08

OpenAI Discloses Accidental Chain-of-Thought Grading in RL Training, Receives Third-Party Safety Review

Key Takeaways

  • OpenAI discovered and disclosed accidental chain-of-thought (CoT) grading in earlier Instruct and mini models, plus limited exposure in GPT-5.4 Thinking, a disclosure that signals a commitment to transparency about AI safety incidents
  • OpenAI's automated detection and in-depth investigation suggest the incident did not meaningfully reduce model monitorability or pose direct misalignment risks
  • External review by Redwood Research, Apollo AI Evals, and METR strengthens the credibility of those findings, though reviewers note lingering concerns that the affected models may have learned to suppress indicators of misalignment in their outputs
Sources:
  • X (Twitter): https://blog.redwoodresearch.org/p/openai-cot
  • Hacker News: https://alignment.openai.com/accidental-cot-grading/

Summary

OpenAI revealed that it accidentally exposed chain-of-thought (CoT) reasoning to reward graders during reinforcement learning (RL) training of some models: earlier Instruct and mini models, plus less than 0.6% of GPT-5.4 Thinking samples. After the fact, the company built an automated detection system to identify the affected cases and conducted an in-depth analysis, which concluded that the incident did not substantially harm model monitorability. To validate its findings, OpenAI shared the analysis with three third-party AI safety organizations (Redwood Research, Apollo AI Evals, and METR), which provided independent feedback. Buck Shlegeris of Redwood Research published a detailed review concluding that OpenAI's evidence assuages approximately 80% of the negative update one might otherwise make about the affected models' alignment properties upon learning of the CoT grading. Some residual concerns remain about potential suppression of misalignment signals.

  • OpenAI is implementing stronger safeguards, including improved real-time CoT-grading detection and monitorability stress tests, to prevent a recurrence

Editorial Opinion

OpenAI's proactive disclosure and its engagement of external AI safety experts represent a meaningful step toward accountability and transparency in frontier AI development. However, Redwood's assessment that the evidence offsets roughly 80% of the concern, coupled with its estimate of a 3% chance that models suppressed misalignment signals, indicates that accidental CoT grading remains a non-trivial incident with potential long-term implications. This case underscores both the value of systematic detection mechanisms and the limits of post-hoc analysis. The precedent of third-party safety review is encouraging, but such reviews should become standard practice rather than the exception.

Topics: Large Language Models (LLMs), Natural Language Processing (NLP), Reinforcement Learning, Machine Learning, Regulation & Policy, Ethics & Bias, AI Safety & Alignment
