Critical Analysis: The Buried Finding in OpenAI's o1 Clinical Study
Key Takeaways
- OpenAI's o1-preview meets or beats both GPT-4 and physician baselines on most clinical tasks, with the largest gaps in data-sparse scenarios such as initial ER triage
- Comparing o1 to unaided physicians is outdated; the relevant 2026 baseline is physicians actively using AI tools, not physicians working alone
- Physicians provided with GPT-4 sometimes underperformed the model working independently, suggesting human-AI collaboration may paradoxically degrade clinical decision-making
Summary
A rigorous paper by Brodeur et al. (2026) shows OpenAI's o1-preview outperforming human physicians on clinical diagnosis tasks across multiple benchmarks, including NEJM clinicopathologic conferences and real emergency department cases. However, critical analysis reveals that the headline finding—comparing o1 to unaided physicians—reflects a 2024 baseline that no longer applies in 2026, when 81% of US physicians now routinely use AI in clinical practice.
The truly interesting finding, which the paper underplays, is that physicians given AI tools often underperform the AI system alone. On landmark diagnostic cases, for example, physicians working with GPT-4 achieved 76% accuracy versus 92% for GPT-4 alone. This suggests that human-AI collaboration in clinical settings may degrade performance relative to the AI working alone rather than enhance it, a phenomenon the paper acknowledges but fails to explore.
The author, writing from a physician-researcher perspective, argues that the study's rigorous methodology cannot overcome a fundamental problem with its framing: the comparator that matters in 2026 is not the unaided physician but the physician-with-tool system. Until that collaboration dynamic is understood and tested, the real story remains untold.
The paper's reliance on clinical vignettes rather than real-world workflow integration also raises questions about whether laboratory performance translates to meaningful clinical impact. Above all, the paper fails to engage with the central question raised by its own data: why physician-AI collaborative configurations underperform solo AI on complex diagnostic tasks.
Editorial Opinion
The paper's headline result feels like yesterday's news framed as tomorrow's breakthrough. What makes this work genuinely valuable isn't that o1 beats unaided physicians, which was already established in 2024, but rather its accidental exposure of a collaboration failure that the paper itself doesn't adequately investigate. In an era when most physicians use AI, asking whether AI beats solo physicians is asking the wrong question; the real question is why giving physicians AI tools sometimes makes them worse at their jobs, not better.