Researchers Expose 'Benchmark Illusion' in Compressed LLMs: Multiple-Choice Scores Don't Reflect Real Usability
Key Takeaways
- Pruned LLMs pass multiple-choice benchmarks at high rates but fail on the same questions in open-ended generation, a phenomenon the researchers term 'recognition-only errors'
- Pruning does not erase the correct answer; it demotes it in the ranking, where it is often recoverable through beam search or sampling
- Standard benchmarks significantly overstate the real-world usability of compressed LLMs, creating an evaluation blind spot for model selection and deployment
- The researchers recommend evaluating compressed models on what they can actually generate, not just on what they can recognize
Summary
A new research paper published on arXiv reveals a critical evaluation gap in compressed language models: pruned LLMs can score highly on multiple-choice benchmarks while failing to produce correct answers in open-ended generation. The researchers found that under high-sparsity pruning with methods such as Wanda, models often cannot generate the correct answer as their top output, even though they can recognize it when it is presented among multiple choices. The study, which used multilingual question-answering tasks, shows that pruning does not necessarily erase the correct answer. Instead, the answer is demoted in the model's output distribution and can sometimes be recovered through beam search, sampling, or minimal in-context examples. The researchers call the resulting gap a 'benchmark illusion': compressed models appear more capable on benchmarks than they are in practical applications.
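The difference between recognizing an answer and generating it can be illustrated with a toy sketch. This is not the paper's method or data: the question, the token scores, and the helper names (`multiple_choice_pick`, `greedy_pick`, `rank_of`, `top_k`) are all invented for illustration, and real multiple-choice evaluation scores options inside a prompt rather than reading a single table of logits.

```python
# Toy illustration of a "recognition-only error".
# All scores below are invented; they are NOT results from the paper.

# Hypothetical next-token scores for the question "What is the capital of Australia?"
dense_logits = {"Canberra": 5.0, "Sydney": 3.5, "Melbourne": 2.0, "Perth": 1.0, "Vienna": 0.5}
# Assumed effect of pruning: the correct answer is demoted, not erased.
pruned_logits = {"Canberra": 3.6, "Sydney": 4.0, "Melbourne": 2.0, "Perth": 1.0, "Vienna": 0.5}

# The multiple-choice options happen to exclude the distractor "Sydney".
options = ["Canberra", "Melbourne", "Perth", "Vienna"]
answer = "Canberra"


def multiple_choice_pick(logits, choices):
    """Recognition: compare scores only among the listed choices."""
    return max(choices, key=lambda choice: logits[choice])


def greedy_pick(logits):
    """Generation: take the top-scoring token over the whole vocabulary."""
    return max(logits, key=logits.get)


def rank_of(logits, token):
    """1-based rank of a token in the full output distribution."""
    ordered = sorted(logits, key=logits.get, reverse=True)
    return ordered.index(token) + 1


def top_k(logits, k):
    """The k highest-scoring tokens, a stand-in for beam search or sampling."""
    return sorted(logits, key=logits.get, reverse=True)[:k]


for name, logits in [("dense", dense_logits), ("pruned", pruned_logits)]:
    print(
        name,
        "| multiple choice:", multiple_choice_pick(logits, options),
        "| greedy:", greedy_pick(logits),
        "| rank of answer:", rank_of(logits, answer),
        "| in top-2:", answer in top_k(logits, 2),
    )
```

In this sketch the pruned model still selects the right option on the multiple-choice version, because the competing token is not among the choices, yet its greedy output is wrong. The answer sits at rank 2, so widening the search to the top two candidates recovers it, which mirrors the article's point that the answer is demoted rather than erased.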
Editorial Opinion
This research challenges the widespread industry practice of relying on benchmark scores to validate model compression. As companies increasingly deploy pruned and quantized models to reduce latency and costs, this work serves as a wake-up call: standard evaluation methods can mask real degradation in model capabilities. The distinction between recognition and generation is fundamental, and practitioners need evaluation protocols that test generation directly before deploying compressed models in production systems.



