Researcher Discovers 'Few-Shot Collapse': Adding Examples Can Degrade LLM Performance by 50%
Key Takeaways
- Testing revealed three failure patterns: peak regression (64% → 33% accuracy), ranking reversal (zero-shot leaders falling behind), and selection method sensitivity (50%+ → 35% collapse)
- The phenomenon affects multiple LLMs systematically, not just edge cases, and aligns with recent academic research on 'over-prompting' and 'context rot'
- AdaptGauge provides automatic detection and classification of few-shot collapse patterns across different shot counts (0, 1, 2, 4, 8 examples)
Summary
Independent researcher Shuntaro Okuma has released AdaptGauge, an open-source tool that reveals a counterintuitive phenomenon in large language models: adding more in-context examples can actively harm performance. Testing 8 LLMs across 4 tasks with varying numbers of few-shot examples (0, 1, 2, 4, and 8 shots), Okuma documented three distinct failure patterns. The most dramatic was 'peak regression,' where Google's Gemini Flash achieved 64% accuracy with 4 examples, then plummeted to 33% when given 8 examples—a near 50% performance collapse.
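The shot-count sweep described above can be sketched as follows. The sentiment task, example pool, and prompt template here are hypothetical placeholders for illustration, not AdaptGauge's actual harness:

```python
# Sketch of a shot-count sweep over 0, 1, 2, 4, and 8 few-shot examples.
# Task, examples, and prompt format are hypothetical, not from AdaptGauge.

EXAMPLES = [
    ("The movie was wonderful.", "positive"),
    ("Terrible service, never again.", "negative"),
    ("An instant classic.", "positive"),
    ("Waste of money.", "negative"),
    ("Best purchase I made this year.", "positive"),
    ("The plot made no sense.", "negative"),
    ("Absolutely delightful.", "positive"),
    ("Dull and overlong.", "negative"),
]

def build_prompt(query: str, n_shots: int) -> str:
    """Prepend n_shots labeled examples to the query."""
    lines = [f"Review: {text}\nLabel: {label}" for text, label in EXAMPLES[:n_shots]]
    lines.append(f"Review: {query}\nLabel:")
    return "\n\n".join(lines)

# One prompt per shot count; each would be sent to the model under test,
# and accuracy recorded per shot count to form a learning curve.
prompts = {n: build_prompt("I loved it.", n) for n in (0, 1, 2, 4, 8)}
```

The interesting measurement is not any single prompt's accuracy but the shape of the curve across all five shot counts.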
The research identified two additional concerning patterns: 'ranking reversal,' where models that performed best with zero examples fell to third place once examples were added, and severe sensitivity to example selection methods. When switching from hand-picked examples to TF-IDF-based selection, one model's performance crashed from over 50% to 35%. These findings align with emerging research on 'over-prompting' (Tang et al. 2025) and 'context rot' documented by Chroma Research, suggesting this is a systemic issue rather than isolated edge cases.
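TF-IDF-based selection of the kind mentioned above can be illustrated with a minimal pure-Python sketch. The tokenization, weighting scheme, and example pool below are assumptions for illustration, not the selection code AdaptGauge ships:

```python
# Illustrative TF-IDF example selection: pick the pool examples most
# similar to the query. Tokenization and weighting are simplified.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    return [
        {t: (c / len(toks)) * math.log(n / df[t]) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_examples(query, pool, k):
    """Pick the k pool examples most similar to the query by TF-IDF cosine."""
    vecs = tfidf_vectors([query] + pool)
    qv, pool_vecs = vecs[0], vecs[1:]
    ranked = sorted(range(len(pool)), key=lambda i: cosine(qv, pool_vecs[i]), reverse=True)
    return [pool[i] for i in ranked[:k]]

# Hypothetical pool: similarity-based selection favors the lexically closest example.
pool = ["the cat sat on the mat", "stock prices fell sharply", "a dog barked loudly"]
```

The finding in the article is that swapping hand-picked examples for a similarity-ranked selection like this one changed results dramatically, which is exactly the sensitivity a single-point benchmark would never surface.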
AdaptGauge addresses a critical blind spot in LLM evaluation: standard benchmarks measure accuracy at a single point, missing how models behave as more context is added. The tool automatically tracks learning curves across different shot counts and flags collapse patterns, classifying them as immediate, gradual, or peak regression. Released under the MIT license with pre-computed demo results, AdaptGauge enables developers to test models without API keys. The research challenges the common assumption that more examples always improve performance and raises questions about production deployment strategies that rely heavily on few-shot prompting.
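The three-way classification described in the summary (immediate, gradual, peak regression) could be detected from a learning curve roughly as follows. The `drop` threshold and decision rules are assumptions for illustration; AdaptGauge's exact heuristics are not quoted in the article:

```python
def classify_collapse(curve, drop=0.10):
    """Classify a learning curve {shot_count: accuracy} as 'immediate',
    'gradual', or 'peak_regression', or return None when accuracy never
    falls meaningfully below its peak. The drop threshold is an assumed
    minimum accuracy loss, not AdaptGauge's published value."""
    shots = sorted(curve)
    accs = [curve[s] for s in shots]
    best_i = max(range(len(accs)), key=lambda i: accs[i])
    if accs[best_i] - accs[-1] < drop:
        return None  # no meaningful fall from the peak by the final shot count
    if best_i == 0:
        # Zero-shot was best and examples only hurt: distinguish by how
        # sharply accuracy fell once the first example was added.
        return "immediate" if accs[0] - accs[1] >= drop else "gradual"
    return "peak_regression"  # improved with some shots, then fell past the peak

# Hypothetical curve with the same shape as the Gemini Flash result
# reported above: a peak at 4 shots, then a collapse at 8.
pattern = classify_collapse({0: 0.40, 1: 0.50, 2: 0.58, 4: 0.64, 8: 0.33})
```

A classifier like this only needs the per-shot-count accuracies that a sweep already produces, which is why single-point benchmarks cannot flag any of these patterns.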
Editorial Opinion
This research exposes a troubling gap between how we evaluate LLMs and how we use them in production. The industry has largely assumed that providing more examples improves performance—a reasonable assumption that turns out to be dangerously wrong for certain model-task combinations. The 50% performance drops documented here aren't minor degradations; they could mean the difference between a functional production system and a broken one. Perhaps most concerning is that leaderboard rankings reverse based on shot count, meaning teams may be selecting models based on benchmarks that don't reflect their actual use case.



