Study Finds GPT-5.5 Exhibits Authorship and Order Biases in Plan Evaluation
Key Takeaways
- GPT-5.5 demonstrates a strong authorship bias, ranking its own plans last in 5 out of 6 cases, indicating it cannot fairly evaluate its own work (a probe sketch follows this list)
- Presentation order dramatically influences the model's rankings in approximately 56% of cases, revealing susceptibility to arbitrary input sequencing
- Increasing the reasoning level (high and xhigh) failed to eliminate these biases, suggesting they may be fundamental limitations rather than solvable through more computation
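To make the authorship finding concrete, here is a minimal sketch of how such a probe could be run. The `rank_plans` callable, the trial count, and the shuffling step are illustrative assumptions, not the study's actual harness.

```python
import random
from typing import Callable, Sequence

def authorship_bias_rate(
    rank_plans: Callable[[Sequence[str]], list[str]],  # assumed: returns plans best-to-worst
    own_plan: str,           # the plan the model itself authored
    rival_plans: list[str],  # plans authored by other models or humans
    trials: int = 20,
) -> float:
    """Fraction of trials in which the model ranks its own plan last."""
    ranked_last = 0
    for _ in range(trials):
        candidates = rival_plans + [own_plan]
        random.shuffle(candidates)  # randomize position so order bias doesn't confound the result
        ranking = rank_plans(candidates)
        ranked_last += ranking[-1] == own_plan
    return ranked_last / trials
```

Absent any authorship effect, the model's own plan would land last roughly 1/n of the time for n candidates, so a rate near 1.0 would indicate systematic self-deprecation.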
Summary
A technical study reveals that OpenAI's recently released GPT-5.5 model exhibits significant biases when evaluating and ranking alternative plans. The research demonstrates two major problems: an authorship effect where the model ranks its own plans last in 5 out of 6 test cases, and an order effect where rankings match presentation sequence approximately 56% of the time. Testing across multiple reasoning levels (medium, high, and xhigh) showed that increasing reasoning complexity does not mitigate these biases, suggesting they may be fundamental rather than solvable through additional computation.
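The order effect lends itself to a similar probe: present the identical set of plans in every permutation and count how often the returned ranking simply echoes the input order. This is a hedged sketch under the same assumed `rank_plans` interface as above, not the paper's protocol.

```python
from itertools import permutations
from typing import Callable, Sequence

def order_echo_rate(
    rank_plans: Callable[[Sequence[str]], list[str]],  # assumed: returns plans best-to-worst
    plans: list[str],
) -> float:
    """Fraction of presentation orders whose ranking exactly matches the input order."""
    orders = [list(p) for p in permutations(plans)]
    echoes = sum(rank_plans(order) == order for order in orders)
    return echoes / len(orders)
```

An evaluator with a stable quality judgment would echo the input order in exactly one of the n! permutations (about 17% for three plans), so a rate near the reported 56% would signal strong positional anchoring.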
The findings directly challenge OpenAI's marketing claims that GPT-5.5 can reliably 'plan, use tools, check its work, navigate through ambiguity, and keep going' autonomously. The researchers concluded that LLM-based plan evaluation remains unreliable in practical scenarios and that human or external validation is still necessary, a conclusion that undermines the case for fully autonomous AI planning systems.
- Agreement on rankings across repeated runs was low, with different instances of GPT-5.5 producing inconsistent orderings of the same plans (a concordance sketch follows this list)
- Human oversight remains essential despite OpenAI's autonomous planning claims
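Low inter-run agreement of this kind is commonly quantified with Kendall's coefficient of concordance (W). The sketch below shows that computation for rankings collected from repeated runs; the example data is invented for illustration and is not from the study.

```python
import numpy as np

def kendalls_w(rankings: np.ndarray) -> float:
    """Kendall's W for m rankings of n items (no tied ranks).

    rankings[i, j] is the rank (1..n) that run i assigns to plan j.
    W = 1 means perfect agreement across runs; W near 0 means none.
    """
    m, n = rankings.shape
    rank_sums = rankings.sum(axis=0)                 # total rank each plan receives
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()  # squared deviation of rank sums
    return 12 * s / (m**2 * (n**3 - n))

# Invented example: three runs ranking the same four plans.
runs = np.array([
    [1, 2, 3, 4],
    [2, 1, 3, 4],
    [4, 3, 1, 2],
])
print(f"Kendall's W: {kendalls_w(runs):.2f}")  # prints 0.20, i.e. weak agreement
```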
Editorial Opinion
This research exposes a significant credibility gap between OpenAI's marketing claims about GPT-5.5's autonomous reasoning and its actual performance in objective evaluation tasks. The authorship and order biases aren't minor quirks; they are large enough to invalidate the model's ability to fairly compare alternatives, which is fundamental to genuine autonomous planning. While the study was limited to a specific use case, the findings suggest deeper issues with LLM objectivity that could affect many real-world applications where fairness and consistency are critical.