Research Shows Reasoning LLMs Can Accurately Answer Multiple-Choice Questions Using Only Answer Choices
Key Takeaways
- Reasoning-enhanced LLMs can accurately answer multiple-choice questions using only answer choices without seeing the original question
- Success on choices-only inputs stems from legitimate reasoning strategies like question inference, not shallow shortcuts or data artifacts
- Reasoning traces pass faithfulness tests, validating that models engage in genuine problem-solving rather than post-hoc rationalization
Summary
A new research paper reports that large language models equipped with test-time reasoning capabilities can accurately answer multiple-choice questions using only the answer options, without access to the actual question text. This finding challenges the widespread assumption that such partial-input success indicates data contamination or reliance on trivial shortcuts.
The researchers analyzed how reasoning-enhanced LLMs approach multiple-choice question answering under two conditions: full-input, where the model sees the question and the choices, and choices-only, where it sees the choices alone. Models equipped with reasoning performed better in both conditions. Critically, examination of the reasoning traces revealed that success on choices-only inputs was driven by sophisticated reasoning strategies, particularly question inference, in which the model infers the missing question from the answer choices, rather than by shallow pattern matching or memorized responses.
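As a rough illustration of the two input conditions described above (not the paper's actual code), a choices-only evaluation might be sketched in Python as follows. The item fields (`question`, `choices`, `answer`), the prompt layout, and the `answer_fn` callable standing in for a model are all assumptions made for this sketch.

```python
import string
from typing import Callable, Optional, Sequence


def format_prompt(choices: Sequence[str], question: Optional[str] = None) -> str:
    """Build a prompt; leaving out the question gives the choices-only condition."""
    lines = []
    if question is not None:
        lines.append(f"Question: {question}")
    # Label the options A, B, C, ... in their given order.
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(items: Sequence[dict], answer_fn: Callable[[str], str],
             choices_only: bool) -> float:
    """Fraction of items where answer_fn (a hypothetical model wrapper)
    returns the correct option letter."""
    if not items:
        raise ValueError("items must not be empty")
    correct = 0
    for item in items:
        question = None if choices_only else item["question"]
        prompt = format_prompt(item["choices"], question)
        if answer_fn(prompt).strip().upper() == item["answer"]:
            correct += 1
    return correct / len(items)


def chance_accuracy(items: Sequence[dict]) -> float:
    """Expected accuracy of uniform random guessing, for comparison."""
    if not items:
        raise ValueError("items must not be empty")
    return sum(1 / len(item["choices"]) for item in items) / len(items)
```

Comparing choices-only accuracy against `chance_accuracy` shows how much a model gains from the options alone. That gap by itself does not say whether the gain comes from shortcuts or from reasoning, which is why the paper goes on to inspect the reasoning traces.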
Using faithfulness tests to validate their findings, the researchers demonstrated that the reasoning traces reflect genuine problem-solving rather than post-hoc rationalization. This directly contradicts the assumption that partial-input success automatically signals problematic data artifacts. The work proposes a more nuanced framework for evaluating LLM performance, distinguishing between truly problematic shortcuts and less problematic reasoning-based strategies.
The implications extend across LLM evaluation methodologies and our understanding of how these models solve standardized test questions, with potential applications to improving benchmark design and interpretation.
- Challenges the assumption that partial-input success in multiple-choice question answering (MCQA) always indicates data contamination or evaluation flaws
- Proposes a more nuanced framework for LLM evaluation that separates problematic shortcuts from sophisticated reasoning-based performance
Editorial Opinion
This research fundamentally reshapes how we interpret LLM performance on multiple-choice exams. Rather than dismissing partial-input success as evidence of data leakage or cheap pattern-matching, the paper demonstrates that sophisticated reasoning underlies these capabilities: models actively infer missing context and deploy legitimate problem-solving strategies. These findings are essential for properly evaluating LLM capabilities and ensuring our benchmarks actually measure what we intend to measure.



