Study Finds GPT-5.5 Exhibits Authorship and Order Biases in Plan Evaluation
Key Takeaways
- GPT-5.5 demonstrates a strong authorship bias, ranking its own plans last in 5 out of 6 cases, indicating it cannot fairly evaluate its own work (a probe sketch follows this list)
- Presentation order dramatically influences the model's rankings in approximately 56% of cases, revealing susceptibility to arbitrary input sequencing
- Increasing the reasoning level (high and xhigh) failed to eliminate these biases, suggesting they may be fundamental limitations rather than solvable through more computation
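To make the authorship finding concrete, here is a minimal sketch of how such a probe could be run. The `rank_plans` callable, the trial count, and the shuffling step are illustrative assumptions, not the study's actual harness.

```python
import random
from typing import Callable, Sequence

def authorship_bias_rate(
    rank_plans: Callable[[Sequence[str]], list[str]],  # assumed: returns plans best-to-worst
    own_plan: str,           # the plan the model itself authored
    rival_plans: list[str],  # plans authored by other models or humans
    trials: int = 20,
) -> float:
    """Fraction of trials in which the model ranks its own plan last."""
    ranked_last = 0
    for _ in range(trials):
        candidates = rival_plans + [own_plan]
        random.shuffle(candidates)  # randomize position so order bias doesn't confound the result
        ranking = rank_plans(candidates)
        ranked_last += ranking[-1] == own_plan
    return ranked_last / trials
```

Absent any authorship effect, the model's own plan would land last roughly 1/n of the time for n candidates, so a rate near 1.0 would indicate systematic self-deprecation.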
Summary
A technical study reveals that OpenAI's recently released GPT-5.5 model exhibits significant biases when evaluating and ranking alternative plans. The research demonstrates two major problems: an authorship effect where the model ranks its own plans last in 5 out of 6 test cases, and an order effect where rankings match presentation sequence approximately 56% of the time. Testing across multiple reasoning levels (medium, high, and xhigh) showed that increasing reasoning complexity does not mitigate these biases, suggesting they may be fundamental rather than solvable through additional computation.
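The order effect lends itself to a similar probe: present the identical set of plans in every permutation and count how often the returned ranking simply echoes the input order. This is a hedged sketch under the same assumed `rank_plans` interface as above, not the paper's protocol.

```python
from itertools import permutations
from typing import Callable, Sequence

def order_echo_rate(
    rank_plans: Callable[[Sequence[str]], list[str]],  # assumed: returns plans best-to-worst
    plans: list[str],
) -> float:
    """Fraction of presentation orders whose ranking exactly matches the input order."""
    orders = [list(p) for p in permutations(plans)]
    echoes = sum(rank_plans(order) == order for order in orders)
    return echoes / len(orders)
```

An evaluator with a stable quality judgment would echo the input order in exactly one of the n! permutations (about 17% for three plans), so a rate near the reported 56% would signal strong positional anchoring.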
The findings directly challenge OpenAI's marketing claims that GPT-5.5 can reliably 'plan, use tools, check its work, navigate through ambiguity, and keep going' autonomously. The researchers concluded that LLM-based plan evaluation remains unreliable in practical scenarios and that human or external validation is still necessary, a conclusion that undermines the case for fully autonomous AI planning systems.
- Agreement on rankings across repeated runs was low, with different instances of GPT-5.5 producing inconsistent orderings of the same plans (a concordance sketch follows this list)
- Human oversight remains essential despite OpenAI's autonomous planning claims
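Low inter-run agreement of this kind is commonly quantified with Kendall's coefficient of concordance (W). The sketch below shows that computation for rankings collected from repeated runs; the example data is invented for illustration and is not from the study.

```python
import numpy as np

def kendalls_w(rankings: np.ndarray) -> float:
    """Kendall's W for m rankings of n items (no tied ranks).

    rankings[i, j] is the rank (1..n) that run i assigns to plan j.
    W = 1 means perfect agreement across runs; W near 0 means none.
    """
    m, n = rankings.shape
    rank_sums = rankings.sum(axis=0)                 # total rank each plan receives
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()  # squared deviation of rank sums
    return 12 * s / (m**2 * (n**3 - n))

# Invented example: three runs ranking the same four plans.
runs = np.array([
    [1, 2, 3, 4],
    [2, 1, 3, 4],
    [4, 3, 1, 2],
])
print(f"Kendall's W: {kendalls_w(runs):.2f}")  # prints 0.20, i.e. weak agreement
```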
Editorial Opinion
This research exposes a significant credibility gap between OpenAI's marketing claims about GPT-5.5's autonomous reasoning and its actual performance in objective evaluation tasks. The authorship and order biases aren't minor quirks; they are large enough to invalidate the model's ability to fairly compare alternatives, which is fundamental to genuine autonomous planning. While the study was limited to a specific use case, the findings suggest deeper issues with LLM objectivity that could affect many real-world applications where fairness and consistency are critical.