OpenAI Discloses Accidental Chain-of-Thought Grading in RL Training, Receives Third-Party Safety Review
Key Takeaways
- OpenAI discovered and disclosed accidental CoT grading in prior Instruct and mini models, plus limited exposure in GPT-5.4 Thinking, demonstrating a commitment to transparency around AI safety incidents
- Automated detection and an in-depth investigation suggest the incident did not meaningfully reduce model monitorability or pose direct misalignment risks
- External validation by Redwood Research, Apollo AI Evals, and METR strengthens credibility, though experts note lingering concerns about potential suppression of misalignment indicators in model outputs
Summary
OpenAI revealed that it accidentally exposed chain-of-thought (CoT) reasoning to reward graders during reinforcement learning training in some models, including Instruct models, mini models, and fewer than 0.6% of GPT-5.4 Thinking samples. The company built an automated detection system to identify these cases after the fact and conducted an in-depth analysis, concluding that the incident did not substantially harm model monitorability. To validate its findings, OpenAI shared the analysis with three third-party AI safety organizations, Redwood Research, Apollo AI Evals, and METR, which provided independent feedback. Buck Shlegeris of Redwood Research published a detailed review concluding that OpenAI's evidence resolves roughly 80% of the negative update one might otherwise make about the affected models' alignment properties upon learning of the CoT grading, though residual concerns remain about possible suppression of misalignment signals.
OpenAI is implementing stronger safeguards to prevent recurrence, including improved real-time detection of CoT grading and monitorability stress tests.
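OpenAI has not published how its detection system works. As a purely illustrative sketch, one simple heuristic for catching this class of bug after the fact would be to flag training samples where the text shown to the reward grader overlaps heavily with the model's hidden reasoning trace. The function name, n-gram approach, and thresholds below are all hypothetical, not OpenAI's actual method:

```python
def cot_leaked_to_grader(grader_input: str, cot_trace: str,
                         ngram: int = 8, threshold: float = 0.05) -> bool:
    """Flag a sample if the grader's input shares long word n-grams with the
    model's hidden chain-of-thought (hypothetical detection heuristic)."""
    def ngrams(text: str) -> set:
        toks = text.split()
        return {tuple(toks[i:i + ngram]) for i in range(len(toks) - ngram + 1)}

    cot = ngrams(cot_trace)
    if not cot:  # trace too short to form any n-gram; nothing to match
        return False
    # Fraction of the CoT's n-grams that reappear verbatim in the grader input
    overlap = len(cot & ngrams(grader_input)) / len(cot)
    return overlap >= threshold
```

A real pipeline would likely operate on tokenizer IDs rather than whitespace words and compare against the exact string serialized into the grader's context, but the core idea, detecting verbatim reuse of hidden reasoning in grader inputs, is the same.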
Editorial Opinion
OpenAI's proactive disclosure and engagement of external AI safety experts represent a meaningful step toward accountability and transparency in frontier AI development. However, Redwood's conclusion that the evidence resolves only about 80% of the negative update, together with its estimate of a 3% chance that models learned to suppress misalignment signals, indicates that accidental CoT grading remains a non-trivial incident with potential long-term implications. This case underscores both the value of systematic detection mechanisms and the limits of post-hoc analysis; the precedent of third-party safety review is encouraging, but such reviews should become standard practice rather than the exception.