BotBeat

OpenAI
RESEARCH
2026-05-08

OpenAI Discloses Accidental Chain-of-Thought Grading in RL Training, Receives Third-Party Safety Review

Key Takeaways

  • OpenAI discovered and disclosed accidental CoT grading in prior Instruct and mini models, plus limited exposure in GPT-5.4 Thinking, demonstrating commitment to transparency in AI safety incidents
  • Automated detection systems and in-depth investigation suggest the incident did not meaningfully reduce model monitorability or pose direct misalignment risks
  • External validation by Redwood Research, Apollo AI Evals, and METR strengthens credibility, though experts note lingering concerns about potential suppression of misalignment indicators in model outputs
Sources:
  • X (Twitter): https://blog.redwoodresearch.org/p/openai-cot
  • Hacker News: https://alignment.openai.com/accidental-cot-grading/

Summary

OpenAI revealed that it accidentally exposed chain-of-thought (CoT) reasoning to reward graders during reinforcement learning training in some models, including Instruct models, mini models, and fewer than 0.6% of GPT-5.4 Thinking samples. The company built an automated detection system to identify these cases after the fact and conducted an in-depth analysis concluding that the incident did not substantially harm model monitorability. To validate its findings, OpenAI shared the analysis with three third-party AI safety organizations, Redwood Research, Apollo AI Evals, and METR, which provided independent feedback. Buck Shlegeris of Redwood Research published a detailed review concluding that OpenAI's evidence assuages roughly 80% of the negative update one might make about the affected models' alignment properties upon learning of the CoT grading, though residual concerns remain about potential suppression of misalignment signals.

  • OpenAI is implementing stronger safeguards including improved real-time CoT-grading detection and monitorability stress tests to prevent recurrence
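
The failure mode described above can be illustrated with a toy sketch. Everything below is a hypothetical reconstruction, not OpenAI's actual pipeline: the `<think>` delimiters and the `split_sample`, `grade`, and `cot_was_exposed` helpers are all assumptions. The bug class amounts to the reward grader receiving the hidden reasoning along with the final answer; a post-hoc detector can then flag logged grader inputs that still contain CoT content.

```python
# Hypothetical sketch of accidental CoT grading, assuming hidden reasoning
# is wrapped in <think>...</think> markers. Names and markers are
# illustrative assumptions, not OpenAI's actual code.

COT_START, COT_END = "<think>", "</think>"

def split_sample(sample: str) -> tuple[str, str]:
    """Separate hidden chain-of-thought from the user-visible answer."""
    if COT_START in sample and COT_END in sample:
        cot = sample.split(COT_START, 1)[1].split(COT_END, 1)[0]
        answer = sample.split(COT_END, 1)[1]
        return cot.strip(), answer.strip()
    return "", sample.strip()

def grade(sample: str, reward_fn, include_cot: bool = False) -> float:
    """Intended behavior grades only the final answer; the incident class
    described in the article corresponds to include_cot=True, where the
    reward grader also sees reasoning it was never meant to judge."""
    cot, answer = split_sample(sample)
    graded_text = f"{cot}\n{answer}" if include_cot else answer
    return reward_fn(graded_text)

def cot_was_exposed(graded_text: str) -> bool:
    """Post-hoc detector: flag logged grader inputs that still contain
    CoT delimiters -- one plausible signal an automated system could use."""
    return COT_START in graded_text or COT_END in graded_text
```

Matching on delimiters is only the simplest signal; a production detector would also need to catch CoT text copied into the grader input without its markers, which is one reason residual uncertainty persists even with automated detection.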

Editorial Opinion

OpenAI's proactive disclosure and engagement of external AI safety experts represent a meaningful step toward accountability and transparency in frontier AI development. However, Redwood's assessment that the evidence assuages only about 80% of the negative alignment update, coupled with its estimate of a 3% chance that models suppressed misalignment signals, indicates that accidental CoT grading remains a non-trivial incident with potential long-term implications. This case underscores both the value of systematic detection mechanisms and the limits of post-hoc analysis; the precedent of third-party safety review is encouraging, but such reviews should become standard practice rather than the exception.

Large Language Models (LLMs) · Natural Language Processing (NLP) · Reinforcement Learning · Machine Learning · Regulation & Policy · Ethics & Bias · AI Safety & Alignment

More from OpenAI

OpenAI
POLICY & REGULATION

Parents Sue OpenAI After ChatGPT Allegedly Gave Deadly Drug Advice to College Student

2026-05-12
OpenAI
RESEARCH

ChatGPT Excels at Julia Code Generation, Outperforming Python

2026-05-12
OpenAI
PRODUCT LAUNCH

OpenAI Expands GPT-5.5-Cyber Access to European Companies

2026-05-12

Suggested

Anthropic
OPEN SOURCE

Anthropic Releases Prempti: Open-Source Guardrails for AI Coding Agents

2026-05-12
vlm-run
OPEN SOURCE

mm-ctx: Open-Source Multimodal CLI Toolkit Brings Vision Capabilities to AI Agents

2026-05-12
Meta
POLICY & REGULATION

Meta Employees Protest Mouse Tracking Technology at US Offices

2026-05-12
© 2026 BotBeat
About · Privacy Policy · Terms of Service · Contact Us